Function Invertability for SARGability

My good friend Simon Sabin used the term ‘invertability’ on a Connect item he logged today.

Essentially, Simon’s noticed that there are lots of people that use year(someDate), but that the system doesn’t understand that this function doesn’t affect the order of the items in the index. month(someDate) on its own does affect the order, but if you’re already using year(someDate), then the combination of the two doesn’t change the order either.

This is one of the keys to SARGability, which I’ve written about before, like at http://bit.ly/sargability. I’ve also raised a Connect item myself about it.

However, the term ‘invertability’ is interesting, and ties into the Inverse Predicates concept that I’ve also used before, like at http://bit.ly/inversepredicates. The idea is that you might have applied a function to a column, creating a predicate that isn’t sargable, but if you (or the system) can tell how to invert it, then you can make a predicate that can be easily handled by the Query Optimizer. Currently, the system doesn’t understand the invertability of all the functions (even easy ones like the YEAR function), but a lot of queries would run faster if it did.
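To make the idea concrete, here’s a minimal sketch of doing the inversion by hand, against a table like AdventureWorks’ Sales.SalesOrderHeader (just an illustration, not one of the tables discussed here). The two queries ask the same question, but only the second gives the Query Optimizer a range it can seek on:

-- Non-SARGable: the function wrapped around the column hides the index order
SELECT SalesOrderID, OrderDate
FROM Sales.SalesOrderHeader
WHERE YEAR(OrderDate) = 2011;

-- The inverted equivalent: the same rows, expressed as a range,
-- so an index on OrderDate can be used for a Seek
SELECT SalesOrderID, OrderDate
FROM Sales.SalesOrderHeader
WHERE OrderDate >= '20110101'
AND OrderDate < '20120101';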

I’m going to let you read those various posts yourself, and encourage you to vote for Simon’s connect item. But as well as that, I’m going to encourage you to consider the SARGability of the predicates in your query.

Edit: Simon’s written a post on this now.

Probe Residual when you have a Hash Match – a hidden cost in execution plans

No, this post has nothing to do with airport security, and nothing to do with marijuana.

Being honest, this post is only half about Hash Matches and Probe Residuals. It’s more about the types of predicates you can see mentioned in a SQL Server Execution Plan (or Query Plan, but I prefer to call them Execution Plans) – but you may well see some described as a Probe Residual when you look at the properties of a Hash Match operator.

The main point of this post is: Some of these predicates can be really bad, even if they’re part of things which seem really good (like Seeks or Merge Joins).

Let’s consider a join. Two streams of data, from which matching rows must be found. These matching rows will be the ones that satisfy the join conditions, expressed through predicates listed in the ON clause and/or the WHERE clause. In fact, these predicates might only involve one side of the join, such as OrderDate <= ‘2011-03-22T00:00:00.000’. There are plenty of times when a join condition will incorporate a one-sided predicate like this – imagine a scenario in which matching rows can be easily located in an index, but only those that match an additional condition are allowed to be included.
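For example, a join along these lines (just an illustrative sketch, not a query from this post) carries a one-sided predicate in its ON clause:

-- The second ON condition only involves one side of the join
SELECT c.CustomerID, o.SalesOrderID, o.OrderDate
FROM Sales.Customer AS c
JOIN Sales.SalesOrderHeader AS o
ON o.CustomerID = c.CustomerID
AND o.OrderDate <= '2011-03-22T00:00:00.000';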

In a join there are these two streams of data, joined by one of a few different operators. This operator could be a Hash Match; it could be a Merge Join. It could even be a Nested Loop, although then the predicates are (generally) handled in the second (lower) data source. In fact, let’s start with that scenario.

When a Seek is performed, there are two main kinds of predicates that can be included. One is the Seek Predicate, and one is simply listed as Predicate. I prefer to call this second one the Residual Predicate. It’s the leftover one, after the Seek has been performed. This often happens when SARGability isn’t possible. SARGability is about being able to use an index effectively (with a seek), so if you have a predicate which doesn’t allow SARGability (for example, one which considers the last letter of a character string), it has to be applied as this leftover Predicate once the Seek has located its rows.

[execution plan screenshots]

You’ll see here that for each ProductSubcategory, we look up the Products for it. We have [s].[ProductSubcategoryID] before the Seek is done (the Nested Loop calls the Seek operator using each one), but although it can quickly seek to the right row(s) involved, it performs an additional check, making sure that the Product Name ends in a ‘y’.

To get this plan, I used the query:
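-- A sketch of the kind of query being described; it assumes rf_ix_Product_Subcat_inc
-- is an index on Product(ProductSubcategoryID) with Name as an included column.
SELECT p.Name, s.Name AS SubcategoryName
FROM Production.ProductSubcategory AS s
JOIN Production.Product AS p
ON p.ProductSubcategoryID = s.ProductSubcategoryID
WHERE p.Name LIKE '%y';  -- checking the last letter becomes the residual Predicate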

The index lets the system immediately search for the rows needed. It can seek to the rows in rf_ix_Product_Subcat_inc that match the SubcategoryID, but this is only half the story. Having applied this Seek Predicate, there’s still the matter of the last letter of the Product Name, which isn’t something that can be checked easily with the index. The values are there, but each row that the Seek finds must be checked individually, with this leftover, or Residual, Predicate.

Hopefully this helps show why I want to call the Predicate here a Residual Predicate.

But this Residual Predicate must be tested on every row that is fetched by the Seek. That could well be a lot of rows, if the Seek isn’t particularly selective.

The Seek might feel really nice, and might be returning very few rows. But the effort could be a lot larger than you expect if most of the work is being done in the Residual Predicate.

The Nested Loop operator pulls rows from the first stream of data, and passes the required values down to the operators in the lower branch, pulling a stream of data which is then simply joined. All the rows that come in from the second stream of data are known to be matches with the row that provided the values, so there is relatively little work in doing the actual join.

In a Merge Join or Hash Match, things are slightly different. The predicate checking happens in the actual join operator. However, there are still two types of predicates – the main ones and the residuals.

In a Merge Join, the two data streams are ordered by columns appropriate for the join, but there could still be a leftover predicate.

Consider this query (but I’ve dropped the extra index at this point). It’s a very similar query to before, but I’m forcing a Merge Join with a Join Hint, and I’ve thrown in a predicate which is non-SARGable from the perspective of either table. This is part of the clue to the problem – it’s a non-SARGable predicate. Mind you, it could seem perfectly SARGable. It might simply be that the stream of data being used for the Merge Join isn’t ordered by all the columns involved in the join.
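-- A sketch of the query being described: a forced Merge Join plus a predicate that
-- can't be answered from the join columns alone (the exact expression is a stand-in).
SELECT p.Name, s.Name AS SubcategoryName
FROM Production.ProductSubcategory AS s
INNER MERGE JOIN Production.Product AS p
ON p.ProductSubcategoryID = s.ProductSubcategoryID
WHERE p.Name + s.Name LIKE '%blah%';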

When hovering over the Merge Join operator in this query’s plan, we see this tooltip. You’ll notice that it has a section called “Where (join columns)” which shows that the join is being done on the ProductSubcategoryID.

[tooltip screenshot: the Merge Join operator, with its Where (join columns) section]

However, the other predicate isn’t mentioned. It’s nowhere in the tooltip. Hitting F4 shows the Properties window, and this is where we find it, in a property called Residual.

Interestingly, it re-checks that the ProductSubcategoryID columns match, but the important thing is that it’s here (and only here) that the Residual Predicate is tested.

This Residual Predicate must be tested on every combination of rows that matches the ‘Where (join columns)’ predicate. That could well be a lot of rows, if the join columns aren’t particularly selective.

With a Hash Match, the join is done by first applying a Hash function to columns involved in the join, using the resultant Hash Key to populate the data (including all the required columns) into a Hash Table. Once that has been done for one stream of data, the second stream is pulled in, and the Hash function applied to the columns from the second stream. The result of each row from the second stream is used in a Probe of the Hash Table. However, predicates which don’t work nicely with the Hash Key concept (such as the ‘blah’ predicate I used earlier) are considered residual.

[properties screenshot: the Hash Match operator, showing the Probe Residual]

So when candidate rows are identified via the Probe, the Probe Residual still needs to be tested, just like the Predicate that is applied after the Seek Predicate has done its work, and the Residual that is applied after the Where (join columns) have been handled.

This Probe Residual must be tested on every combination of rows that satisfies the Hash Key Probe. That could well be a lot of rows, if that probe isn’t particularly selective.

When you’re trying to tune your query, you need to consider how many rows are being matched by each section of the join.

Imagine with me that you have a Merge Join (or a Hash Match), in which you have a predicate such as p1.ListPrice - p2.ListPrice = 0

(Incidentally, this query could use any of the three joins, depending on indexes and other filters. Put an index on ProductSubcategoryID including ListPrice, and then run the query with either no WHERE clause (Hash Match), a WHERE clause for ProductSubcategoryID < 2 (Nested Loop) or < 5 (Merge Join).)
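A sketch of the kind of self-join being described (the aliases come from the predicate above; the commented line notes the variations just mentioned):

SELECT p1.ProductID, p2.ProductID
FROM Production.Product AS p1
JOIN Production.Product AS p2
ON p1.ProductSubcategoryID = p2.ProductSubcategoryID
AND p1.ListPrice - p2.ListPrice = 0;
-- add WHERE p1.ProductSubcategoryID < 2 for a Nested Loop, or < 5 for a Merge Join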

The predicate featuring the ListPrice column is always going to be treated as residual. It’s something that can only be tested once both values are known, and is considered a non-SARGable predicate.  Regardless of what type of join is done, the ListPrice predicate is handled as a Residual.

For this query, the answer is hopefully obvious. Rewriting the predicate as p1.ListPrice = p2.ListPrice will resolve it nicely, but an example you have might not be so straightforward.

Residual predicates can be expensive, and the bottleneck of your query might not be obvious from looking at the plan. The mere fact that a Residual in a Merge Join is not shown in the tooltip could easily mean you miss it. Don’t worry – you’ll be in good company. Plenty of proper experts miss this.

Luckily, the answer is simple. Look at your Seek Predicates, your Where (join columns) and your Hash Keys Probe, and compare these to the Residuals. If the Residuals need to be checked for a lot more rows than you’d like, then you have a tuning opportunity you can leverage. Ideally, Residuals only need to be applied to a tiny number of rows.

Remember, a Residual Predicate will feel like a Scan, because it’s not using the Index nicely. Scanning a tiny table might be fine, but scanning a large one could be horrible.

@rob_farley

How many people will be with you during 24HOP?

In less than a week, SQLPASS hosts another 24 Hours of PASS event, this time with an array of 24 female speakers (in honour of this month being Women’s History Month).

Interestingly, the committee has had a few people ask if there are rules about how the event can be viewed, such as “How many people from any one organisation can watch it?” or “Does it matter if a few people are crowded around the same screen?”

From a licensing and marketing perspective, there is value in knowing how many people are watching the event, but there are no restrictions about how the thing is viewed.

In fact – if you’re planning to watch any of these events, I want to suggest an idea:

Book a meeting room in your office with a projector, and watch 24HOP in there.

If you’re planning to have it streaming in the background while you work, obviously this makes life a bit trickier. But if you’re planning to treat it as a training event (a 2-day conference if you like) and block out a bit of time for it (as well you should – there’s going to be some great stuff in there), then why not do it in a way that lets other people see that you’re watching it, and potentially join you.

Lecture Hall Seats

When an event like this runs, we can see how many different ‘people’ are attending each LiveMeeting session. What we can’t tell is how many actual people there are represented. Jessica Moss spoke to the Adelaide SQL Server User Group a few weeks ago via LiveMeeting, and LiveMeeting told us there were less than a dozen people attending. Really there were at least three times that number, because all the people in the room with me weren’t included.

I’d love to imagine that every LiveMeeting attendee represented a crowd in a room, watching a shared screen.

So there’s my challenge – don’t let your LiveMeeting session represent just you. Find a way of involving other people. At the very least, you’ll be able to discuss it with them afterwards. Now stick a comment on this post to let me know how many people are going to be joining you. 🙂

If you’re not registered for the event yet, get yourself over to the SQLPASS site and make it happen.

The blocking nature of aggregates

I wrote a post recently about how query tuning isn’t just about how quickly the query runs – that if you have something (such as SSIS) that is consuming your data (and probably introducing a bottleneck), then it might be more important to have a query which focuses on getting the first bit of data out. You can read that post here. 

In particular, we looked at two operators that could be used to ensure that a query returns only Distinct rows.

[operator screenshots: the Sort and Hash Match operators]

The Sort operator pulls in all the data, sorts it (discarding duplicates), and then pushes out the remaining rows. The Hash Match operator performs a Hashing function on each row as it comes in, and then looks to see if it’s created a Hash it’s seen before. If not, it pushes the row out. The Sort method is quicker, but has to wait until it’s gathered all the data before it can do the sort, and therefore blocks the data flow.

But that was my last post. This one’s a bit different.

[T-SQL Tuesday logo]

This post is going to look at how Aggregate functions work, which ties nicely into this month’s T-SQL Tuesday.

I’ve frequently explained that DISTINCT and GROUP BY are essentially the same function, although DISTINCT is the poorer cousin because you have less control over it, and you can’t apply aggregate functions.
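As a quick illustration, these two queries return the same distinct list, but only the GROUP BY version can also carry an aggregate:

SELECT DISTINCT Color
FROM Production.Product;

SELECT Color, COUNT(*) AS NumProducts
FROM Production.Product
GROUP BY Color;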

Just like the operators used for Distinct, there are different flavours of Aggregate operators – coming in blocking and non-blocking varieties. The example I like to use to explain this is a pile of playing cards.

If I’m handed a pile of cards and asked to count how many cards there are in each suit, it’s going to help if the cards are already ordered. If I’m playing a game of Bridge, I can easily glance at my hand and count how many there are in each suit, because I keep the pile of cards in order. Moving from left to right, I could tell you I have four Hearts in my hand, even before I’ve got to the end. By telling you that I have four Hearts as soon as I know, I demonstrate the principle of a non-blocking operation.

[photo: a hand of cards sorted by suit]

This is known as a Stream Aggregate operation. It requires input which is sorted by whichever columns the grouping is on, and it will release a row as soon as the group changes – when I encounter a Spade, I know I don’t have any more Hearts in my hand.

[execution plan screenshot: Stream Aggregate]

Alternatively, if the pile of cards is not sorted, I won’t know how many Hearts I have until I’ve looked through all the cards. In fact, to count them, I basically need to put them into little piles, and when I’ve finished making all those piles, I can count how many there are in each. Because I don’t know any of the final numbers until I’ve seen all the cards, this is blocking. This performs the aggregate function using a Hash Match. Observant readers will remember this from my Distinct example.

[photo: cards gathered into piles by suit]

[execution plan screenshot: Hash Match Aggregate]

You might remember that my earlier Hash Match operation – used for Flow Distinct – wasn’t blocking. But this one is. They’re essentially doing a similar operation, applying a Hash function to some data and seeing if the set of values has been seen before, but this time it needs more information than the mere existence of a new set of values: it needs to consider how many of them there are.

A lot is dependent here on whether the data coming out of the source is sorted or not, and this is largely determined by the indexes that are being used. If you look in the Properties of an Index Scan, you’ll be able to see whether the order of the data is required by the plan. A property called Ordered will demonstrate this.

[execution plan screenshots: the two aggregate plans being compared]

In this particular example, the second plan is significantly faster, but is dependent on having ordered data.

In fact, if I force a Stream Aggregate on unordered data (which I’m doing by telling it to use a different index), a Sort operation is needed, which makes my plan a lot slower.

[execution plan screenshot: a Sort feeding the Stream Aggregate]
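If you want to play with the two flavours yourself without changing indexes, the ORDER GROUP and HASH GROUP query hints force a Stream Aggregate or a Hash Match Aggregate respectively (the table and grouping column here are only illustrative):

-- Forces a Stream Aggregate; a Sort appears in the plan if the data is not already ordered
SELECT ProductSubcategoryID, COUNT(*) AS NumProducts
FROM Production.Product
GROUP BY ProductSubcategoryID
OPTION (ORDER GROUP);

-- Forces a Hash Match Aggregate, which blocks until it has consumed all its input
SELECT ProductSubcategoryID, COUNT(*) AS NumProducts
FROM Production.Product
GROUP BY ProductSubcategoryID
OPTION (HASH GROUP);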

This is all very straight-forward stuff, and information that most people are fully aware of. I’m sure you’ve all read my good friend Paul White (@sql_kiwi)’s post on how the Query Optimizer chooses which type of aggregate function to apply.

But let’s take a look at SQL Server Integration Services.

SSIS gives us an Aggregate transformation for use in Data Flow Tasks, but it’s described as Blocking. The definitive article on Performance Tuning SSIS uses Sort and Aggregate as examples of Blocking Transformations.

[screenshot: the Aggregate transformation listed among the blocking transformations]

I’ve just shown you that Aggregate operations used by the Query Optimizer are not always blocking, but that the SSIS Aggregate component is an example of a blocking transformation. But is that always the case? After all, there are plenty of SSIS Performance Tuning talks out there that describe the value of sorted data in Data Flow Tasks, describing the IsSorted property that can be set through the Advanced Editor of your Source component.

And so I set about testing the Aggregate transformation in SSIS, to prove for sure whether providing Sorted data would let the Aggregate transform behave like a Stream Aggregate. (Of course, I knew the answer already, but it helps to be able to demonstrate these things).

A query that will produce a million rows in order was in order. Let me rephrase. I used a query which produced the numbers from 1 to 1000000, in a single field, ordered. The IsSorted flag was set on the source output, with the only column as SortKey 1. Performing an Aggregate function over this (counting the number of rows per distinct number) should produce an additional column with 1 in it.
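A query along these lines will produce that stream (just one way of doing it; any ordered million-row source would do):

WITH Nums AS
(
    SELECT TOP (1000000)
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_columns AS a
    CROSS JOIN sys.all_columns AS b
)
SELECT n
FROM Nums
ORDER BY n;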

[screenshot: the SSIS data flow with the source and Aggregate transformation]

If this were being done in T-SQL, the ordered data would allow a Stream Aggregate to be used. In fact, if the Query Optimizer saw that the field had a Unique Index on it, it would be able to skip the Aggregate function completely, and just insert the value 1. This is a shortcut I wouldn’t be expecting from SSIS, but certainly the Stream behaviour would be nice.

Unfortunately, it’s not the case.

[screenshots: the data flow while running, with rows accumulating in the Aggregate]

As you can see from the screenshots above, the data is pouring into the Aggregate function, and not being released until all million rows have been seen. It’s not doing a Stream Aggregate at all.

This is expected behaviour.

(I put that in bold, because I want you to realise this.)

An SSIS transformation is a piece of code that runs. It’s a physical operation. When you write T-SQL and ask for an aggregation to be done, it’s a logical operation. The physical operation is either a Stream Aggregate or a Hash Match. In SSIS, you’re telling the system that you want a generic Aggregation, that will have to work with whatever data is passed in.

I’m not saying that it wouldn’t be possible to make a sometimes-blocking aggregation component in SSIS. A Custom Component could be created which could detect whether the SortKeys columns of the input matched the Grouping columns of the Aggregation, and either call the blocking code or the non-blocking code as appropriate. One day I’ll make one of those, and publish it on my blog. I’ve done it before with a Script Component, but as Script components are single-use, I was able to handle the data knowing everything about my data flow already.

As per my previous post – there are a lot of aspects in which tuning SSIS and tuning execution plans use similar concepts. In both situations, it really helps to have a feel for what’s going on behind the scenes. Whether an operation is blocking or not is extremely relevant to performance, and it’s not always obvious from the surface.

In a future post, I’ll show the impact of blocking v non-blocking and synchronous v asynchronous components in SSIS, using some of LobsterPot’s Script Components and Custom Components as examples. When I get that sorted, I’ll make a Stream Aggregate component available for download.