Having worked with 30+ clients using the IBM PureData System for Analytics (formerly Netezza) data warehouse appliance, we’ve learned a lot about the typical mistakes that lead to unimpressive performance. Here are a few of our thoughts. Contact us if you’d like to skip all this and get some advice right away.
Optimization isn’t 100% necessary, but it will provide an impressive pay-off!
Anybody who has gone through the Proof of Concept/Tool Selection cycle with a data warehouse vendor knows the pitch…
No configuration necessary!
No maintenance required!
Out-of-the-box performance!
The claim: No configuration required
The reality: This is a tricky one! What exactly do we mean by configuration? Certainly Netezza doesn’t need you to configure redo logs, size tablespaces, or do any RAID setup – the flipside of this statement is that you can’t customize these things. Not to worry, the system defaults are quite good as-is!
The claim: No maintenance required
The reality: While you could skip the maintenance tasks, you’ll see the system quickly running out of steam, especially on systems that are UPDATE/DELETE intensive.
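For the curious, the maintenance we have in mind mostly means reclaiming logically deleted rows and keeping optimiser statistics fresh. A minimal sketch, assuming a hypothetical SALES table:

    -- UPDATE and DELETE leave logically deleted rows behind; GROOM reclaims them
    GROOM TABLE sales RECORDS ALL;

    -- Keep the optimiser’s statistics current after large data changes
    GENERATE STATISTICS ON sales;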
The claim: Out-of-the-box performance
The reality: It’s certainly true that the Netezza system should outperform your previous system with little or no setup. After all, your new Massively Parallel Processing (MPP) system probably has many times the number of CPU cores, RAM, and disks as your previous Symmetric Multi-Processing (SMP) system! That being said, you could achieve enormous gains by sticking to a few simple rules…
Rule #1 of Netezza
You do not talk about Netezza (Sorry, couldn’t help myself!)
Let’s try that again. Rule #1 is:
Move Less
That is, move less data between the different servers that compose the Netezza system. Data movement is a slow and expensive operation, so let’s keep it to a minimum! The Netezza system has to perform data movement whenever data isn’t in the “right place” to accommodate the running query. Examples of queries requiring data movement include those with JOIN and GROUP BY operations.
We reduce data movement by getting the Table Distribution correct (the DISTRIBUTE ON clause in the table DDL). Here are some guidelines, ordered (approximately) from most to least important; a DDL sketch follows the list. Strictly speaking, some of these examples are more about avoiding processing skew than data movement, but we’re trying to keep the explanations simple.
Pick the most common join or group-by column for each table. For your dimension table this is probably the primary key (as-was model with as-was reporting) or the natural key (as-was model with as-is reporting). By “common”, we might mean the column joined or grouped on most often, or the one involved in the longest-running queries. Personally, I mine the pg.log file and the query history database to put together metrics on the best column(s) to use and when they bring the most value (remember, you might even want to change the distribution throughout the day/month/quarter depending on query patterns).
Pick a column with high cardinality and low skew (a comparable number of records for each value of the column). What does “high” mean in this context? Let’s say at least a few hundred distinct values (lower for small Netezza systems, higher for large ones).
Avoid narrow restrictions. In other words, don’t use a distribution key that commonly appears in the restriction clause of a query (e.g. WHERE COLUMN_1=’X’) with only one or a few values appearing in place of “X”. What does “narrow” mean here? The same thing that “high” means in guideline 2 above!
Use efficient data-types. Anyone who has taken one of my courses has had the “theory” behind which data-type to choose and why explained at length – you’ll never forget it! I’ll spare the rest of you the details; let’s just say we prefer INTEGER and BIGINT values, we avoid VARCHAR where possible, and we never use FLOAT, REAL, or DOUBLE PRECISION for anything (no, not even currency exchange rates).
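Putting the four guidelines together, here is a minimal DDL sketch; the table and column names are hypothetical, invented for illustration:

    CREATE TABLE sales_fact (
        sale_date    DATE          NOT NULL,
        customer_id  INTEGER       NOT NULL,  -- high cardinality, low skew
        product_id   INTEGER       NOT NULL,  -- efficient integer key
        region_code  INTEGER       NOT NULL,  -- integer code instead of VARCHAR
        amount       NUMERIC(18,4) NOT NULL   -- never FLOAT/REAL/DOUBLE PRECISION
    )
    -- customer_id is our most common join/group-by column, has many distinct
    -- values, and rarely appears as a narrow WHERE restriction
    DISTRIBUTE ON (customer_id);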
Rule #2 of Netezza
Scan Less
Data scanning is a disk-, CPU-, and network-intensive operation and is best avoided where possible. The tips for query writers and analysts are simple (a quick sketch follows them):
Only SELECT the columns you need in your query.
RESTRICT data in your SQL rather than pulling it all back and filtering in a downstream application – in other words, don’t pull back billions of rows of data if you’ll only use thousands.
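To make those two tips concrete, here is a before/after sketch using the hypothetical SALES_FACT table from earlier:

    -- Avoid: scans every column of every row and ships it all downstream
    SELECT * FROM sales_fact;

    -- Prefer: name only the columns you need and restrict in the database
    SELECT customer_id, amount
    FROM sales_fact
    WHERE sale_date = '2016-01-31';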
Of course you already knew those tips above. There’s nothing to be done if your query is SELECT * FROM TABLE! Let’s talk about what the technical architect and DBA can do to optimise system scans when queries contain column subsets and restrictions.
Scanning comes in two dimensions: depth (rows) and width (columns). Anyone who has taken an advanced course in applied computing, or finds themselves wandering the ACM publications regularly, is probably aware that the depth/width distinction is a human one; nevertheless, it is a useful construct for us to use here.
Reducing depth scans using Zone Maps and Cluster-Based-Tables (CBTs)
The Netezza appliance is optimised for selecting large datasets (as opposed to individual records). Zone maps are a built-in feature of Netezza that help minimize depth scanning in the presence of restrictions. Here’s how we can help make sure that the zone maps are optimised.
Zone maps
Alter your ETL routines to load data ordered by your most common restriction column (the column in the queries you most want to optimise).
Tip #1: You probably don’t need to alter your ETL to order on date. In a typical EDW environment, daily loads are already restricted to a single date, so pick the next most common column (a sketch follows these tips)!
Tip #2: To caveat tip #1, you probably will still need to have DATE first in your data load order during the initial migration to a new system.
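As an illustration, an ordered daily load might look like the following; the staging and target table names are hypothetical:

    -- Ordering the load on STORE_ID keeps its zone maps highly selective,
    -- so restrictions on STORE_ID can skip most disk extents
    INSERT INTO sales_fact
    SELECT *
    FROM sales_stage
    ORDER BY store_id;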
What do we do if we sometimes restrict (filter) on column_1, and sometimes on column_2?
Cluster-Based-Tables (CBT)
Because the vast majority of the scan-reduction benefits accrue to the first restriction column in your ETL data load order, Netezza built CBTs to accommodate situations with multiple, varying restriction column sets. Just include the ORGANIZE ON clause in your table DDL and schedule your GROOM TABLE operations regularly – both steps are sketched after the list below.
Pick up to four of the most common restriction columns in your queries (ordering doesn’t matter) and put them in the ORGANIZE ON clause of your table DDL. Common choices might include DATE, REGION, MARKET_SEGMENT, and PRODUCT_TYPE.
Schedule a regular GROOM TABLE operation on this table (ORGANIZE ON gives the instruction on what to optimise; GROOM actually does the optimisation).
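A minimal sketch of both steps, again with hypothetical names (note that the ORGANIZE ON columns must use zone-mappable data types):

    -- Step 1: declare up to four organizing columns in the DDL
    CREATE TABLE sales_fact (
        sale_date       DATE          NOT NULL,
        customer_id     INTEGER       NOT NULL,
        region          INTEGER       NOT NULL,
        market_segment  INTEGER       NOT NULL,
        product_type    INTEGER       NOT NULL,
        amount          NUMERIC(18,4)
    )
    DISTRIBUTE ON (customer_id)
    ORGANIZE ON (sale_date, region, market_segment, product_type);

    -- Step 2: GROOM physically reorganizes the data to match ORGANIZE ON
    GROOM TABLE sales_fact;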
But watch out!
If you only have one common restriction, you’re better off optimising your Zone Maps.
If you use CBTs, you cannot use the Materialized Views described next, so you give up the main tool for optimising width scanning.
Reducing width scans using Materialized Views
Let me start by saying that the Netezza concept of Materialized Views has nothing to do with the Oracle concept. I suspect the name confusion comes about because the back-end implementations are likely similar (pointers to columns). Here are some tips for setting up your MViews.
The point is to include the columns that appear commonly in your queries (or a subset of your queries). Imagine your Table A has columns 1 – 100. Perhaps the two most common queries use columns 1 – 5 and 1 – 6; in that case, we would probably create a Materialized View containing columns 1 – 6 in order to optimise both queries.
For the system optimiser to use an MView, it must contain all of the columns that a query references from the underlying table – that includes columns in the SELECT, WHERE, JOIN, ORDER BY, GROUP BY, HAVING, or anywhere else.
MViews contain only one table and are limited to the keywords SELECT, FROM, and optionally ORDER BY. In other words, no functions, joins, restrictions, aggregates, analytics… well you get the point.
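For example, an MView covering the Table A scenario above might look like this (names hypothetical):

    -- One MView containing columns 1 – 6 covers both common queries
    CREATE MATERIALIZED VIEW table_a_mv AS
    SELECT column_1, column_2, column_3, column_4, column_5, column_6
    FROM table_a
    ORDER BY column_1;  -- most common restriction column (see the ORDER BY tip below)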
You can have multiple MViews per underlying table.
You need not reference the MView in your query; the system will automatically use it for you if it provides a performance gain. In fact, you should not reference it directly – that way you maintain some flexibility to change things around as needed, take it offline, and so on. The query optimiser will simply use the underlying table if the Materialized View is unavailable!
Remember how we said that if we use a CBT, we cannot use an MView? That’s why we have the capability of including an ORDER BY clause in an MView. Your ORDER BY clause should include the most common restriction column; if you don’t include the ORDER BY clause, the system uses the order of the underlying data.
Remember to SUSPEND your MViews during ETL loads, and REFRESH them afterwards.
Set a system DEFAULT REFRESH THRESHOLD for Materialized Views of 5-10%.
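The maintenance side of those last two tips might look like the following; the table and view names are hypothetical, and the SET SYSTEM DEFAULT syntax is written from memory, so verify it against the documentation for your release:

    -- Before the ETL load: stop maintaining the MViews on the base table
    ALTER VIEWS ON sales_fact MATERIALIZE SUSPEND;

    -- ... run the ETL load ...

    -- After the load: rebuild the suspended MViews
    ALTER VIEWS ON sales_fact MATERIALIZE REFRESH;

    -- System-wide threshold (percentage of unsorted data that triggers an
    -- automatic refresh) – assumed syntax, check your release’s docs
    SET SYSTEM DEFAULT MATERIALIZE REFRESH THRESHOLD TO 5;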