SQL 2012 was a big release for working out the median in SQL Server, with the advent of the function PERCENTILE_CONT(). It's a very elegant way of working out the median (hint, that's the 0.5 point), even though it's not actually an aggregate function, as I've written before.
Plus – it doesn't even perform well. About a year ago, Aaron Bertrand (@aaronbertrand) wrote a fantastic post about different methods for getting medians, and showed that PERCENTILE_CONT() is actually one of the slowest methods, and that the best method is to use an idea from Peter Larsson (@SwePeso) that uses an OFFSET-FETCH clause to grab the rows of interest before doing an average of them.
Except that the OFFSET-FETCH clause was also new in 2012. So if you're stuck on SQL 2008 R2 and earlier, you're a bit more stuck.
All the pre-SQL 2012 methods that Aaron showed used ROW_NUMBER() except one – which used a combination of MIN/MAX over each half of the data. But one method that Aaron didn't explore in his post was to simulate OFFSET-FETCH in earlier versions. Let me show you…
Here's the OFFSET-FETCH method. Notice that it fetches either 1 or 2 rows (depending on whether the overall count is 1 or 2), but offsets by just under half of the set.
SELECT d.SalesPerson, w.Median FROM ( SELECT SalesPerson, COUNT(*) AS y FROM dbo.Sales GROUP BY SalesPerson ) AS d CROSS APPLY ( SELECT AVG(0E + Amount) FROM ( SELECT z.Amount FROM dbo.Sales AS z WHERE z.SalesPerson = d.SalesPerson ORDER BY z.Amount OFFSET (d.y - 1) / 2 ROWS FETCH NEXT 2 - d.y % 2 ROWS ONLY ) AS f ) AS w(Median);
What my pre-2012-compatible version does is to fetch slightly MORE than the set first, and then get the top 1 or 2 but in DESC order.
SELECT d.SalesPerson, w.Median FROM ( SELECT SalesPerson, COUNT(*) AS y FROM dbo.Sales GROUP BY SalesPerson ) AS d CROSS APPLY ( SELECT AVG(0E + Amount) FROM ( SELECT TOP (2 - d.y % 2) Amount FROM ( SELECT TOP (d.y / 2 + 1) z.Amount FROM dbo.Sales AS z WHERE z.SalesPerson = d.SalesPerson ORDER BY z.Amount ) AS t ORDER BY Amount DESC ) AS f ) AS w(Median);
With OFFSET-FETCH, we're grabbing the rows we want by skipping over the rows we're not interested in until we find the ones that we are interested in. In the TOP/TOPDESC, we're identifying the rows we want by the fact that they're the bottom of the top slightly-more-than-half set.
Other than that, the idea is exactly the same. The results are identical, but what about the performance?
First, let's give you the code to set up your environment (as found in Aaron's post) – I used a clustered index.
CREATE TABLE dbo.Sales(SalesPerson INT, Amount INT); GO --CREATE CLUSTERED INDEX x ON dbo.Sales(SalesPerson, Amount); --CREATE NONCLUSTERED INDEX x ON dbo.Sales(SalesPerson, Amount); --DROP INDEX x ON dbo.sales; ;WITH x AS ( SELECT TOP (100) number FROM master.dbo.spt_values GROUP BY number ) INSERT dbo.Sales WITH (TABLOCKX) (SalesPerson, Amount) SELECT x.number, ABS(CHECKSUM(NEWID())) % 99 FROM x CROSS JOIN x AS x2 CROSS JOIN x AS x3;
What I want to do to evaluate this is to look at the query plans. Once I've done that, I'll make a comment about the performance and where it fits into the mix from Aaron's post.
So those plans… OFFSET-FETCH method first, followed by the TOP/TOPDESC method. I'm using a clustered index on the data – a nonclustered index gives the same shape but with nonclustered index operations instead of clustered index operations. Heaps are a different story that I'm not exploring here.
As you'd expect, there's a lot of similarity. Both use Nested Loops, grabbing the Counts from a Scan on the outer branch, with a Seek on the inner branch. And both inner branches have a Top Operator pulling the data out of a Seek. But the TOP/TOPDESC method has TWO Top operators, with a Sort in between. This is because of the 'TOPDESC' bit. If we had a 'Bottom' operator, then that would avoid the need for a Sort, but no such animal exists, and it does 'Bottom' by doing a Top of re-Sorted data. It's very disappointing. The Top operator in the OFFSET-FETCH method has a new property called OffsetExpression, which it uses to skip over as many rows as it needs – it's simply not supported pre-2012.
(Quick side note: the arrow between the Compute Scalar and the right-most Top operator in both plans is quite thin – much thinner that you might expect. This is only a quirk of the plan because the Actuals haven't been reported here. MSDN (https://technet.microsoft.com/en-us/library/ms178082.aspx) says: "Compute Scalar operators that appear in Showplans generated by SET STATISTICS XML might not contain the RunTimeInformation element. In graphical Showplans, Actual Rows, Actual Rebinds, and Actual Rewinds might be absent from the Properties window when the Include Actual Execution Plan option is selected in SQL Server Management Studio. When this occurs, it means that although these operators were used in the compiled query plan, their work was performed by other operators in the run-time query plan." Therefore, the arrow coming out of the Compute Scalar operator is the width of the estimated number of rows, because it doesn't have the actual number of rows. But it's a Compute Scalar – it's not going to change the number of rows, and you should consider the width of the arrow as being the width of the arrow going into it.)
Of course, this TOP/TOPDESC method is slower than OFFSET-FETCH. If we had a 'Bottom' operator, I think it wouldn't be very much slower, but here we have a Sort operator! And those things are bad. The plans estimated that the cost of the Sort would be 27% of the total query, and that the ratio between the two queries would be 58:42, which is 1.38:1. But remember that those Cost percentages are based on estimated values, and we know those estimates are quite a long way out.
So instead, we use a more empirical method, which is to run them against each other.
On my machine (a Surface Pro 2), with a warm cache, the OFFSET-FETCH method took around 380ms, compared to around 570ms for the TOP/TOPDESC. It's definitely slower – no surprises there. It's a good 50% slower, if not more. But this still makes it faster than any of the pre-SQL 2012 versions that Aaron used.
I'm sure you're going to point out that I'm clearly running this on a version of SQL Server that is at least 2012… so I ran it on a SQL 2008 R2 box as well, and found that the plan was identical as shown here, and that it was about 30% faster than the "2005_3" version from Aaron's post with an index applied.
So if you're using SQL 2008 R2 (or earlier) still, then don't dismiss the best-performing median function from Aaron's post (thanks again, Peso!), but instead, consider coming up with a 2008R2-compatible version, as I've done here.
Update: Another method is to consider simply filtering on ROW_NUMBER(), which isn't included in Aaron's post either. It's still good, but doesn't quite perform as quickly as the TOP/TOPDESC method on the million-row set, because it has to figure out the ROW_NUMBER() for a lot of rows. The OffsetExpression property in the Top operator of SQL 2012+ is your friend.
SELECT d.SalesPerson, w.Median FROM ( SELECT SalesPerson, COUNT(*) AS y FROM dbo.Sales GROUP BY SalesPerson ) AS d CROSS APPLY ( SELECT AVG(0E + Amount) FROM ( SELECT z.Amount, rn = ROW_NUMBER() OVER (ORDER BY z.Amount) FROM dbo.Sales AS z WHERE z.SalesPerson = d.SalesPerson ) AS f WHERE f.rn BETWEEN (d.y + 1) / 2 AND (d.y + 2) / 2 ) AS w(Median);