The SSIS tuning tip that everyone misses

I know that everyone misses this, because I’m yet to find someone who doesn’t have a bit of an epiphany when I describe this.

When tuning Data Flows in SQL Server Integration Services, people see the Data Flow as moving from the Source to the Destination, passing through a number of transformations. What they don’t consider is what happens before the Source component – actually getting the data out of the database.

Remember, the source of data for your Data Flow is not your Source Component. It’s wherever the data is, within your database, probably on a disk somewhere. You need to tune your query to optimise it for SSIS, and this is what most people fail to do.

I’m not suggesting that people don’t tune their queries – there’s plenty of information out there about making sure that your queries run as fast as possible. But for SSIS, it’s not about how fast your query runs. Let me say that again, but in bolder text:

The speed of an SSIS Source is not about how fast your query runs.

If your query is used in a Source component for SSIS, the thing that matters is how fast it starts returning data. In particular, those first 10,000 rows that populate the first buffer, ready to be passed down through the rest of the transformations on the way to the Destination.

Let’s look at a very simple query as an example, using the AdventureWorks database:
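In case you want to play along, the query is a simple one – a minimal sketch of what’s being described, using the Production.Product table:

SELECT DISTINCT Weight
FROM Production.Product;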

[Image: the execution plan – a scan of the Product table feeding a Distinct Sort]

We’re picking the different Weight values out of the Product table, and it’s doing this by scanning the table and doing a Sort. It’s a Distinct Sort, which means that the duplicates are discarded.

It’ll be no surprise to see that the data produced is sorted. Obvious, I know, but I’m making a comparison to what I’ll do later.

[Image: the results – the distinct Weight values, in sorted order]

Before I explain the problem here, let me jump back into the SSIS world…

If you’ve investigated how to tune an SSIS flow, then you’ll know that some SSIS Data Flow Transformations are known to be Blocking, some are Partially Blocking, and some are simply Row transformations.

Take the SSIS Sort transformation, for example. I’m using a larger data set for this, because my small list of Weights won’t demonstrate it well enough.

[Image: an SSIS Data Flow with a Sort transformation holding back the buffers from the source]

Seven buffers of data came out of the source, but none of them could be pushed past the Sort operator, just in case the last buffer contained the data that would be sorted into the first buffer. This is a blocking operation.

Back in the land of T-SQL, we consider our Distinct Sort operator. It’s also blocking. It won’t let data through until it’s seen all of it.

If you weren’t okay with blocking operations in SSIS, why would you be happy with them in an execution plan?

The source of your data is not your OLE DB Source. Remember this. The source of your data is the nonclustered index, clustered index, or heap (NCIX/CIX/Heap) from which it’s being pulled.

Picture it like this… the data flowing from the Clustered Index, through the Distinct Sort operator, into the SELECT operator, where a series of SSIS Buffers are populated, flowing (as they get full) down through the SSIS transformations.

[Image: the execution plan flowing into the SSIS Data Flow – Clustered Index, through the Distinct Sort, into the SELECT operator, then on into SSIS buffers]

Alright, I know that I’m taking some liberties here, because the two queries aren’t the same, but consider the visual.

The data is flowing from your disk and through your execution plan before it reaches SSIS, so you could easily find that a blocking operation in your plan is just as painful as a blocking operation in your SSIS Data Flow.

Luckily, T-SQL gives us a brilliant query hint to help avoid this.

OPTION (FAST 10000)

This hint tells the Query Optimizer to choose a plan that is optimised for returning the first 10,000 rows – the default number of rows in an SSIS buffer. And the effect can be quite significant.

First let’s consider a simple example, then we’ll look at a larger one.

Consider our weights. We don’t have 10,000, so I’m going to use OPTION (FAST 1) instead.
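In other words, something like this sketch:

SELECT DISTINCT Weight
FROM Production.Product
OPTION (FAST 1);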

[Image: the execution plan, now using a Flow Distinct operator at 84% of the estimated cost]

You’ll notice that the query is more expensive, using a Flow Distinct operator instead of the Distinct Sort. This operator consumes 84% of the query’s estimated cost, compared to the 59% we saw from the Distinct Sort. But the first row can be returned more quickly – a Flow Distinct operator is non-blocking.

The data here isn’t sorted, of course. It’s in the same order that it came out of the index, just with duplicates removed.

[Image: the results – the distinct Weight values, no longer sorted]

As soon as a Flow Distinct sees a value that it hasn’t come across before, it pushes it out to the operator on its left (remember that execution plans read from right to left). It still has to maintain the list of values it has seen so far, but by handling the data one row at a time, it can push rows through sooner. Overall, it’s a lot more work than the Distinct Sort, but if the priority is the first few rows, then perhaps that’s exactly what we want.

The Query Optimizer seems to do this by optimising the query as if there were only one row coming through:

[Image: the execution plan, showing 1-row estimates on the operators]

This 1 row estimation is caused by the Query Optimizer imagining the SELECT operation saying “Give me one row” first, and this message being passed all the way along. The request might not make it all the way back to the source, but in my simple example, it does.
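Incidentally, you can see a similar effect with TOP – this isn’t from the original demo, just a rough comparison, since TOP gives the Query Optimizer a row goal in much the same way:

-- The optimizer plans as if only one row needs to be produced
SELECT DISTINCT TOP (1) Weight
FROM Production.Product;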

I hope this simple example has helped you understand the significance of the blocking operator. Now I’m going to show you an example on a much larger data set.

This query was fetching about 780,000 rows, and these are the Estimated Plans. The data needed to be sorted, to support downstream SSIS operations that relied on that order.
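I won’t reproduce the actual query, but picture something of this shape (the tables and columns here are purely illustrative stand-ins, not the real ones):

-- A hypothetical stand-in: a join whose output must be sorted
SELECT s.SalesOrderID, s.OrderDate, c.AccountNumber
FROM Sales.SalesOrderHeader AS s
JOIN Sales.Customer AS c ON c.CustomerID = s.CustomerID
ORDER BY c.AccountNumber, s.OrderDate;
-- ...run once as-is, and once with OPTION (FAST 10000) appended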

First, without the hint.

[Image: the estimated execution plan without the hint]

…and now with OPTION (FAST 10000):

[Image: the estimated execution plan with OPTION (FAST 10000), using Nested Loops]

A very different plan, I’m sure you’ll agree. In case you’re curious, those arrows in the top one are 780,000 rows in size. In the second, they’re estimated to be 10,000, although the Actual figures end up being 780,000.

The top one definitely runs faster – several times faster, in fact, and with the amount of data involved, the difference was measured in minutes. Look at the second one – it’s doing Nested Loops across 780,000 rows! That’s not generally recommended at all. That’s “Go and make yourself a coffee” time. In this case, it was about six or seven minutes. The faster one finished in about a minute.

But in SSIS-land, things are different.

The particular data flow that was consuming this data was significant. It was being pumped into a Script Component to process each row based on previous rows, creating about a dozen different flows. The data flow would take roughly ten minutes to run – ten minutes from when the data first appeared.

The query that completes faster – chosen by the Query Optimizer with no hints, based on accurate statistics (rather than pretending the numbers are smaller) – would take a minute to start getting the data into SSIS, at which point the ten-minute flow would start, taking eleven minutes to complete.

The query that took longer – chosen by the Query Optimizer pretending it only wanted the first 10,000 rows – would take only ten seconds to fill the first buffer. Despite the fact that it might have taken the database another six or seven minutes to get the data out, SSIS didn’t care. Every time it wanted the next buffer of data, it was already available, and the whole process finished in about ten minutes and ten seconds.

When debugging SSIS, you run the package and sit there waiting for the Debug information to start appearing. You look for the row counts on the data flow, and watch the operators turn Yellow and Green. Without the hint, I’d sit there for a minute. With the hint, just ten seconds. You can imagine which one I preferred.

By adding this hint, it felt like a magic wand had been waved across the query, to make it run several times faster. It wasn’t the case at all – but it felt like it to SSIS.

Slide-decks from recent Adelaide SQL Server UG meetings

The UK has been well represented this summer at the Adelaide SQL Server User Group, with presentations from Chris Testa-O’Neill (isn’t that the right link? Maybe try this one) and Martin Cairney. The slides are available here and here.

I thought I’d particularly mention Martin’s, and how it’s relevant to this month’s T-SQL Tuesday.

Martin spoke about Policy-Based Management and the Enterprise Policy Management Framework – something which is remarkably under-used, and yet which can really impact your ability to look after environments. If you have policies set up, then you can easily test each of your SQL instances to see if they are still satisfying a set of policies as defined.
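For example, you can check recent evaluation results straight from the policy views in msdb – a sketch, assuming the standard syspolicy views (verify the column names on your version):

-- Recent Policy-Based Management evaluation results
SELECT p.name AS policy_name,
       h.end_date,
       h.result   -- 1 = passed, 0 = failed (assumption; check your version)
FROM msdb.dbo.syspolicy_policies AS p
JOIN msdb.dbo.syspolicy_policy_execution_history AS h
    ON h.policy_id = p.policy_id
ORDER BY h.end_date DESC;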

Automation (the topic of this month’s T-SQL Tuesday) should mean that your life is made easier, thereby enabling you to do more. It shouldn’t remove the human element, but it should remove (most of) the human errors. People still need to manage the situation, work out what needs to be done, and so on. We haven’t reached a point where computers can replace people, but they are very good at replacing the mundane and monotonous parts of our jobs. They’ve made our lives more interesting (although many would rightly argue that they have also made our lives more complex) by letting us focus on the stuff that changes.

Martin named his talk Put Your Feet Up, which nicely expresses the fact that managing systems shouldn’t be about running around checking things all the time. It must be about having systems in place which tell you when things aren’t going well. It’s never quite as simple as being able to actually put your feet up, but certainly no system should require constant attention.

It’s definitely a policy we at LobsterPot adhere to, whether it’s an alert to let us know that an ETL package has run successfully, or a script that generates some code for a report.

If things can be automated, it reduces the chance of error, reduces the repetitive nature of work, and in general, keeps both consultants and clients much happier.