Medians pre-SQL 2012

SQL 2012 was a big release for working out the median in SQL Server, with the advent of the function PERCENTILE_CONT(). It’s a very elegant way of working out the median (hint, that’s the 0.5 point), even though it’s not actually an aggregate function, as I’ve written before.
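For reference, it looks something like this (a sketch, assuming a table dbo.obj with a numeric val column – the same shape as the setup used later in this post):

-- Median via PERCENTILE_CONT. Note the OVER() clause: it’s an analytic function,
-- not an aggregate, so every row gets the same value and DISTINCT collapses them.
SELECT DISTINCT
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val) OVER () AS median
FROM dbo.obj;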

Plus – it doesn’t even perform well. About a year ago, Aaron Bertrand (@aaronbertrand) wrote a fantastic post about different methods for getting medians, and showed that PERCENTILE_CONT() is actually one of the slowest methods, and that the best method is to use an idea from Peter Larsson (@SwePeso) that uses an OFFSET-FETCH clause to grab the rows of interest before doing an average of them.

Except that the OFFSET-FETCH clause was also new in 2012. So if you’re stuck on SQL 2008 R2 and earlier, you’re a bit more stuck.

All the pre-SQL 2012 methods that Aaron showed used ROW_NUMBER() except one – which used a combination of MIN/MAX over each half of the data. But one method that Aaron didn’t explore in his post was to simulate OFFSET-FETCH in earlier versions. Let me show you…

Here’s the OFFSET-FETCH method. Notice that it fetches either 1 or 2 rows (depending on whether the overall count is odd or even), but offsets by just under half of the set.
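It goes something like this (a sketch of the idea, assuming the same dbo.obj table with a val column):

DECLARE @c BIGINT = (SELECT COUNT(*) FROM dbo.obj);

SELECT AVG(1.0 * val) AS median
FROM
(
    SELECT val
    FROM dbo.obj
    ORDER BY val
    OFFSET (@c - 1) / 2 ROWS               -- skip just under half the set
    FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY  -- 1 row if the count is odd, 2 if it's even
) AS middleRows;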

What my pre-2012-compatible version does is to fetch slightly MORE than half the set first, and then get the top 1 or 2 rows but in DESC order.
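In 2008R2-friendly syntax, that’s something along these lines (again a sketch):

DECLARE @c BIGINT = (SELECT COUNT(*) FROM dbo.obj);

SELECT AVG(1.0 * val) AS median
FROM
(
    SELECT TOP (1 + (1 - @c % 2)) val                      -- the 1 or 2 middle rows...
    FROM
    (
        SELECT TOP ((@c - 1) / 2 + 1 + (1 - @c % 2)) val   -- ...taken from slightly more than half the set
        FROM dbo.obj
        ORDER BY val
    ) AS justOverHalf
    ORDER BY val DESC                                      -- a 'Bottom' done as a descending TOP
) AS middleRows;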

With OFFSET-FETCH, we’re grabbing the rows we want by skipping over the rows we’re not interested in until we find the ones that we are interested in. With TOP/TOPDESC, we’re identifying the rows we want by the fact that they sit at the bottom of the top slightly-more-than-half of the set.

Other than that, the idea is exactly the same. The results are identical, but what about the performance?

First, let’s give you the code to set up your environment (as found in Aaron’s post) – I used a clustered index.
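I won’t reproduce Aaron’s script verbatim here – the shape of it is roughly this (names and value ranges are illustrative):

CREATE TABLE dbo.obj
(
    id  INT IDENTITY(1,1) NOT NULL,
    val INT NOT NULL
);

-- a million rows of arbitrary integer values
INSERT dbo.obj (val)
SELECT TOP (1000000) ABS(CHECKSUM(NEWID())) % 100000
FROM sys.all_columns AS c1
CROSS JOIN sys.all_columns AS c2;

-- the clustered index used for the plans below
CREATE CLUSTERED INDEX cix_obj_val ON dbo.obj(val);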

What I want to do to evaluate this is to look at the query plans. Once I’ve done that, I’ll make a comment about the performance and where it fits into the mix from Aaron’s post.

So those plans… OFFSET-FETCH method first, followed by the TOP/TOPDESC method. I’m using a clustered index on the data – a nonclustered index gives the same shape but with nonclustered index operations instead of clustered index operations. Heaps are a different story that I’m not exploring here.

[Image: execution plans – the OFFSET-FETCH method, then the TOP/TOPDESC method]

As you’d expect, there’s a lot of similarity. Both use Nested Loops, grabbing the Counts from a Scan on the outer branch, with a Seek on the inner branch. And both inner branches have a Top Operator pulling the data out of a Seek. But the TOP/TOPDESC method has TWO Top operators, with a Sort in between. This is because of the ‘TOPDESC’ bit. If we had a ‘Bottom’ operator, then that would avoid the need for a Sort, but no such animal exists, and it does ‘Bottom’ by doing a Top of re-Sorted data. It’s very disappointing. The Top operator in the OFFSET-FETCH method has a new property called OffsetExpression, which it uses to skip over as many rows as it needs – it’s simply not supported pre-2012.

[Image: execution plan detail]

(Quick side note: the arrow between the Compute Scalar and the right-most Top operator in both plans is quite thin – much thinner than you might expect. This is only a quirk of the plan, because the Actuals haven’t been reported here. MSDN (https://technet.microsoft.com/en-us/library/ms178082.aspx) says: “Compute Scalar operators that appear in Showplans generated by SET STATISTICS XML might not contain the RunTimeInformation element. In graphical Showplans, Actual Rows, Actual Rebinds, and Actual Rewinds might be absent from the Properties window when the Include Actual Execution Plan option is selected in SQL Server Management Studio. When this occurs, it means that although these operators were used in the compiled query plan, their work was performed by other operators in the run-time query plan.” So the arrow coming out of the Compute Scalar operator is drawn at the width of the estimated number of rows, because the actual number of rows isn’t available. But it’s a Compute Scalar – it’s not going to change the number of rows – so treat the width of that arrow as being the same as the arrow going into it.)

Of course, this TOP/TOPDESC method is slower than OFFSET-FETCH. If we had a ‘Bottom’ operator, I think it wouldn’t be much slower at all, but here we have a Sort operator – and those things are bad. The plans estimated that the cost of the Sort would be 27% of the total query, and that the ratio between the two queries would be 58:42, which is about 1.38:1. But remember that those cost percentages are based on estimated values, and we know those estimates can be quite a long way off.

So instead, we use a more empirical method, which is to run them against each other.

On my machine (a Surface Pro 2), with a warm cache, the OFFSET-FETCH method took around 380ms, compared to around 570ms for the TOP/TOPDESC. It’s definitely slower – no surprises there. It’s a good 50% slower, if not more. But this still makes it faster than any of the pre-SQL 2012 versions that Aaron used.

I’m sure you’re going to point out that I’m clearly running this on a version of SQL Server that is at least 2012… so I ran it on a SQL 2008 R2 box as well, and found that the plan was identical to the one shown here, and that it was about 30% faster than the “2005_3” version from Aaron’s post with an index applied.

So if you’re still using SQL 2008 R2 (or earlier), then don’t dismiss the best-performing median method from Aaron’s post (thanks again, Peso!), but instead consider coming up with a 2008R2-compatible version of it, as I’ve done here.

Update: Another method to consider is simply filtering on ROW_NUMBER(), which isn’t one of the options in Aaron’s post either. It’s still good, but doesn’t quite perform as quickly as the TOP/TOPDESC method on the million-row set, because it has to figure out the ROW_NUMBER() for a lot of rows. The OffsetExpression property in the Top operator of SQL 2012+ is your friend.
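That variation looks roughly like this (a sketch, same assumed table):

DECLARE @c BIGINT = (SELECT COUNT(*) FROM dbo.obj);

SELECT AVG(1.0 * val) AS median
FROM
(
    SELECT val, ROW_NUMBER() OVER (ORDER BY val) AS rn
    FROM dbo.obj
) AS numbered
WHERE rn IN ((@c + 1) / 2, (@c + 2) / 2);   -- the middle row, or the two middle rows if the count is even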

@rob_farley

APS / PDW Monitoring

When you get an Analytics Platform System (APS) – the appliance that hosts a copy of SQL Server Parallel Data Warehouse Edition (PDW) and potentially a Hadoop cluster as well – one of the things you get is an Admin Console. It’s quite handy to be able to look at it and see if your appliance is healthy or not.

The trouble with the Admin Console, though, is what I’ve just said – you look at it to assess the health of your appliance. And this isn’t something that you really want to be doing. What are you going to do – look at it every morning and then get on with your day? If that’s how you’re monitoring your environment, the chances are you won’t be watching at the moment a problem actually starts, and really you’ll be getting the alert when your phone rings, or at least after you’ve picked it up and got an earful of ‘angry customer’.

You need a better solution – something which is going to actually let you know if a problem is detected, and which is going to store some historical information so that you can do some troubleshooting. You know – just like you have on your other SQL boxes. The thing is that PDW doesn’t come with something which can send emails. And it doesn’t come with a scheduler like SQL Agent which can be used to kick off jobs to collect data, or to check perfmon counters in case there’s an issue.

So how do you monitor PDW?

The answer is: from outside it.

Ideally, your organisation has centralised monitoring anyway. Some server(s) that run around checking that everything is healthy. They check the health of the Windows boxes, they see how much disk space is free on all the volumes, they make sure that backups have been taken (not just SQL backups, but backups of whole servers), and run tests like DBCC, restores, and pings. If there’s a problem with the monitoring servers, this is recognised and understood.

Now, you can roll your own one of these, and have it query DMVs like PDW’s sys.dm_pdw_component_health_alerts, sys.dm_pdw_errors, and sys.dm_pdw_os_performance_counters. Collecting this information (and the contents of many other DMVs) can provide excellent coverage when troubleshooting, and can also highlight potential issues before they arise. Running DBCC PDW_SHOWSPACEUSED from time to time is definitely important too, to stay conscious of how skew is looking – the list of things to keep an eye on goes on and on.
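To give a flavour of what that looks like (a sketch only – check the DMV columns on your appliance’s version, and the table name in the DBCC call is just an example):

-- Recent errors and any current component health alerts
SELECT TOP (50) *
FROM sys.dm_pdw_errors
ORDER BY create_time DESC;

SELECT *
FROM sys.dm_pdw_component_health_alerts;

-- A sample of the appliance-wide performance counters
SELECT TOP (50) *
FROM sys.dm_pdw_os_performance_counters;

-- How the distribution skew is looking for a particular table (illustrative name)
DBCC PDW_SHOWSPACEUSED ("dbo.FactSales");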

Something that shouldn’t be overlooked is the usefulness of System Center Operations Manager (even if I keep wanting to type ‘Centre’ instead of ‘Center’). There are SCOM Management Packs available to cater for PDW, HDInsight (another component within APS) and APS itself. If SCOM is part of your organisation, then configuring it to monitor your APS appliance is definitely worth doing. I’ve lifted the image here from the APS help file – if you’re at all familiar with SCOM, you’ll recognise it and see that you have good coverage of your APS environment with it. It should never fully replace using queries to look at the metadata within (for assessing skew, etc.), but you should definitely be using SCOM with APS if you can.

[Image: SCOM monitoring an APS appliance, from the APS help file]

I mentioned that this image is from the APS help file – it goes into quite some detail about setting up SCOM to work with APS, so if you’re considering APS, you should be able to reassure your IT staff that they’ll still be able to use their existing SCOM environment to monitor the appliance.

Don’t neglect monitoring your APS box. When we get an appliance, it’s easy to let it just sit there and do its stuff, assuming that everything is going to be okay because it’s an appliance. We don’t monitor our kettles at home, but our businesses don’t depend on the health of the kettles (maybe the coffee machine, but that’s a different story). Monitoring doesn’t have to be hard work, but it does have to happen. Luckily, we get a bunch of tools to help us make that happen.

And this is for yet another T-SQL Tuesday. The first for 2015, and the 62nd in all – hosted this time by Robert Pearl (@pearlknows).

@rob_farley

Four SQL MVPs at LobsterPot – including three in Australia

Today LobsterPot Solutions sets a new first. We are the only company ever to employ three current Australian SQL MVPs, giving us four awardees in total. Congratulations to Martin Cairney, who joins Julie Koesmarno (AUS), Ted Krueger (USA) and me (AUS) as recipients of this prestigious award. This demonstrates LobsterPot’s ongoing commitment to the SQL Server community, and shows that our consultants are truly influential in the SQL world.

From Microsoft’s website about MVPs:
Microsoft Most Valuable Professionals, or MVPs are exceptional community leaders who actively share their high-quality, real-world deep technical expertise with the community and with Microsoft. They are committed to helping others get the most out of their experience with Microsoft products and technologies.
Technical communities play a vital role in the adoption and advancement of technology—and in helping our customers do great things with our products. The MVP Award provides us with an opportunity to say thank you and to bring the voice of community into our technology roadmap.

This fits very closely with LobsterPot’s desire to help people with their data story. We help with the adoption and advancement of SQL Server, and help customers do great things with data. It’s no surprise that a high proportion of LobsterPot consultants are MVP awardees.