Tuning Parallel Data Warehouse Queries

March 10, 2015

Performance tuning in regular SQL Server can be approached in a number of ways: looking at what’s going on with the disk configuration, the memory configuration, the wait stats, the parallelism settings, indexing, and so much more. But if you have a Parallel Data Warehouse (PDW) environment, then a lot of those things are taken out of your hands.

When you buy the Analytics Platform System (APS) appliance, which hosts PDW, you’re buying a complete system. It’s an appliance, in the same way that you buy appliances for the kitchen. When you go camping and you want to cook some food, you have to build a fire, and find some way of suspending the food over the fire. Building a fire isn’t exactly easy in itself – you need to consider what you’re going to do with kindling, you need larger pieces of wood for building heat, you need a spark – and then working out how to suspend the food at an appropriate height is another matter again! There are plenty of opportunities to tune this environment, and I’m sure many people have this down to a fine art. For most of us, though, we would rather just have an appliance (an oven) and just put the food in and turn it on.

The PDW environment is like the oven rather than the campfire. Disk comes preconfigured. Network comes preconfigured. Placement of files and filegroups is preconfigured, as is the configuration of tables and databases across these filegroups. High availability decisions are made for us, and we are left to design the database.

But the design of the tables within the database can dramatically affect the performance of the system – in ways that regular SQL Server environments never even have to consider. And this is because the data, and the RAM that holds it, is spread across multiple servers.

To be able to do anything with a row of data – whether that’s returning it to the user who ran the query, manipulating it in some way, or joining it to a row from another table – that row must exist in RAM. DBAs understand this, and make sure that their SQL environments have as much RAM as possible, to avoid having to pull that row into RAM from disk mid-query. But when two rows that need to be joined exist on different servers, the memory spaces in RAM that can hold those rows exist on different servers too. In PDW, we have just that situation – lots of servers, with the data spread out across them. It looks like a single server, but is actually made up of lots of servers in a scaled-out architecture.

Therefore, to join data in PDW, we have to be confident that any two rows that need to be joined exist on the same server. This can be done in one of three ways:

1. At least one of the rows can be stored on every server.

This is done by configuring one of the tables as a Replicated table rather than a Distributed table – a very common approach, particularly for dimension tables that don’t change particularly often (there’s a sketch of the table definitions just after this list). If the entire list of Stores is stored (ha!) on every server, then our Sales data can easily be joined to it without fear that the Store in question for a particular Sales row might not be on the same server as said Sales row.

2. The two tables can be distributed using the same key, with that key being part of the join conditions.

Another common method. Suppose we have a large number of Products, which are frequently being changed and versioned, and we have a large amount of Sales. Both are good candidates for being Distributed rather than Replicated – we probably wouldn’t want a full copy of our large and quickly-growing Products table on every server. But if both tables are distributed on ProductKey and this is part of the join condition (p.ProductKey = s.ProductKey), then the system will know that any two rows that are a successful match according to the join conditions must also be on the same server (again, see the table definitions after this list).

3. Data can be moved mid-query.

In regular SQL Server, if a row isn’t in RAM already, we need to pull a copy of it into RAM from disk. Similarly, in PDW, if the system can’t be sure that the row exists on the right server, it can pull a copy of it onto the right server. This is known as Data Movement, and just as regular DBAs try to minimise the amount of disk activity when performance tuning, PDW DBAs try to minimise Data Movement.

Data Movement comes in a variety of flavours, but I want to mention two of them here. One is broadcasting: the smaller of the two tables (or at least, the data identified as being of interest once other row and column filters have been applied) is turned into a temporary replicated table, with a copy of that data placed on every server. This turns our scenario into situation 1, and the query can continue. Another is shuffling: some of the data is redistributed on a different key, one that is part of the join conditions – putting us into situation 2.
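
To make options 1 and 2 a little more concrete, here’s a rough sketch of the kind of table definitions involved. The tables and columns (DimStore, DimProduct, FactSales and so on) are hypothetical, but the WITH clause is where the decision gets made – REPLICATE for a small dimension, HASH on ProductKey for the two large tables that need to join on that key:

    -- Hypothetical tables, to show where the distribution choice is declared.
    CREATE TABLE dbo.DimStore
    (
        StoreKey int NOT NULL,
        StoreName nvarchar(50) NOT NULL
    )
    WITH (DISTRIBUTION = REPLICATE);          -- a full copy on every server (option 1)

    CREATE TABLE dbo.DimProduct
    (
        ProductKey int NOT NULL,
        ProductName nvarchar(50) NOT NULL
    )
    WITH (DISTRIBUTION = HASH(ProductKey));   -- too big and volatile to replicate

    CREATE TABLE dbo.FactSales
    (
        SaleKey bigint NOT NULL,
        ProductKey int NOT NULL,
        StoreKey int NOT NULL,
        CustomerKey int NOT NULL,
        Amount money NOT NULL
    )
    WITH (DISTRIBUTION = HASH(ProductKey));   -- same key as DimProduct (option 2)

A join between FactSales and DimProduct on p.ProductKey = s.ProductKey is then distribution-compatible, and a join from FactSales to DimStore is always satisfied locally, because every server has the whole of DimStore.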

To satisfy a single query, there might be a lot of data movement required.

If you consider a typical data warehouse query, there are a lot of joins, across a lot of tables, and the join conditions are different for every join. Unless you start to think about PDW queries in a very different way from how you think about non-PDW queries, you can very easily suffer from poor performance simply because of unnecessary data movement.
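
One way to see whether a given query will need data movement – without running it – is to prefix it with EXPLAIN, which returns the distributed (DSQL) plan as XML; movement shows up as steps such as SHUFFLE_MOVE or BROADCAST_MOVE. Here’s a hedged sketch, reusing the hypothetical tables from above plus a hypothetical DimCustomer that is too large to replicate and is distributed on CustomerKey:

    -- FactSales is distributed on ProductKey, DimCustomer (hypothetical) on CustomerKey,
    -- so this join isn't distribution-compatible - expect a data movement step
    -- (such as SHUFFLE_MOVE or BROADCAST_MOVE) in the XML plan that EXPLAIN returns.
    EXPLAIN
    SELECT c.CustomerRegion, SUM(s.Amount) AS TotalSales
    FROM dbo.FactSales AS s
    JOIN dbo.DimCustomer AS c ON c.CustomerKey = s.CustomerKey
    GROUP BY c.CustomerRegion;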

Thinking outside the box can provide some useful mechanisms for tuning PDW, which can see the performance improve massively.

Denormalising is a method that many data warehouse designers are very familiar with, but when you start to consider the pain of joins in a PDW environment, it becomes a lot more significant. It’s often not appropriate to denormalise too much, but if you have been keen on normalising heavily in the past, this useful method is worth reconsidering.
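
To make that concrete with the hypothetical tables from the sketches above: the CustomerRegion join that needed data movement disappears entirely if the region is carried on the fact row itself.

    -- CustomerRegion is copied from DimCustomer onto the fact row at load time,
    -- so this aggregation needs no join at all - and therefore no data movement for it.
    CREATE TABLE dbo.FactSales_Denorm
    (
        SaleKey bigint NOT NULL,
        ProductKey int NOT NULL,
        CustomerKey int NOT NULL,
        CustomerRegion nvarchar(30) NOT NULL,   -- denormalised from DimCustomer
        Amount money NOT NULL
    )
    WITH (DISTRIBUTION = HASH(ProductKey));

    SELECT CustomerRegion, SUM(Amount) AS TotalSales
    FROM dbo.FactSales_Denorm
    GROUP BY CustomerRegion;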

Redundant predicates are something which many developers don’t seem to appreciate. I’ve heard developers say that any piece of redundant code should be removed, because it creates extra work. But in the PDW world, these predicates can explain to the PDW Query Optimizer that data movement isn’t actually necessary, and can save the appliance a large amount of work.
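
Here’s a sketch of the idea, with one more hypothetical table: suppose FactReturns is also distributed on ProductKey, and (as a business rule) a return always carries the same ProductKey as the sale it relates to. Joined only on SaleKey, the optimizer can’t know that rule and has to plan data movement; repeating the distribution key in the join condition makes the join distribution-compatible without changing the results.

    -- dbo.FactReturns is hypothetical: distributed on ProductKey, like FactSales.
    SELECT s.SaleKey, s.Amount, r.ReturnedAmount
    FROM dbo.FactSales AS s
    JOIN dbo.FactReturns AS r
        ON  r.SaleKey = s.SaleKey
        AND r.ProductKey = s.ProductKey;   -- logically redundant, but avoids a shuffle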

Views are often maligned in the regular SQL world, as people write views that query views that query views. I’m not suggesting that you fall into that trap, but you should certainly consider views to encapsulate join logic, encouraging the use of those redundant predicates that I just mentioned.
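
For example, a view over the hypothetical tables above can bake that redundant predicate in, so nobody has to remember to type it:

    CREATE VIEW dbo.vSalesWithReturns AS
    SELECT s.SaleKey, s.ProductKey, s.Amount, r.ReturnedAmount
    FROM dbo.FactSales AS s
    JOIN dbo.FactReturns AS r
        ON  r.SaleKey = s.SaleKey
        AND r.ProductKey = s.ProductKey;   -- the redundant predicate lives in one place

    -- Anyone querying the view gets the movement-friendly join for free.
    SELECT ProductKey, SUM(ReturnedAmount) AS TotalReturned
    FROM dbo.vSalesWithReturns
    GROUP BY ProductKey;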

Table splitting seems counter-intuitive when the goal is to avoid joins, but splitting a table horizontally (having some rows in one table and others in another) gives extra options around how to distribute the data. Some data can be distributed according to one column, and other data on another. This has the added benefit of avoiding skew, and it’s another place where views help when accessing these tables, as PDW does not support check constraints. Splitting tables vertically is rarer, but can allow dimensions to get wider, leaving room for additional denormalisation before hitting the maximum row width of 32kB (in-row data is limited to 8kB, but data movement buffers are only 32kB, which makes for a much harder limit).
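
Here’s a sketch of the horizontal case, again with hypothetical tables: the online half of the sales data joins most usefully on ProductKey, the in-store half on StoreKey, and a view stitches them back together. The literal Channel column in each branch stands in for the check constraint that PDW can’t have, giving the optimizer something to use to ignore the branch that can’t match.

    CREATE TABLE dbo.FactSalesOnline
    (
        SaleKey bigint NOT NULL,
        ProductKey int NOT NULL,
        StoreKey int NOT NULL,
        Amount money NOT NULL
    )
    WITH (DISTRIBUTION = HASH(ProductKey));   -- online sales join on ProductKey

    CREATE TABLE dbo.FactSalesInStore
    (
        SaleKey bigint NOT NULL,
        ProductKey int NOT NULL,
        StoreKey int NOT NULL,
        Amount money NOT NULL
    )
    WITH (DISTRIBUTION = HASH(StoreKey));     -- in-store sales join on StoreKey

    CREATE VIEW dbo.vFactSalesAll AS
    SELECT 'Online' AS Channel, SaleKey, ProductKey, StoreKey, Amount FROM dbo.FactSalesOnline
    UNION ALL
    SELECT 'InStore' AS Channel, SaleKey, ProductKey, StoreKey, Amount FROM dbo.FactSalesInStore;

    -- A filter on Channel lets only the relevant branch (and its distribution) be used.
    SELECT ProductKey, SUM(Amount) AS TotalAmount
    FROM dbo.vFactSalesAll
    WHERE Channel = 'Online'
    GROUP BY ProductKey;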

These methods don’t have the same kind of impact in your ordinary SQL Server environments, but can have a massive effect in PDW. You can see examples of how these things can come into play by looking at a previous post of mine http://sqlblog.com/blogs/rob_farley/archive/2014/12/05/minimising-data-movement-in-pdw-using-query-optimisation-techniques.aspx, where you should be able to spot my use of all of these methods.

Oh, and this piece about performance tuning fits into this month’s T-SQL Tuesday, hosted by Russ Thomas (@sqljudo) on the topic of Performance.


@rob_farley
