Revisiting GROUP BY v DISTINCT

February 13, 2018

It’s been nearly eleven years since I compared GROUP BY and DISTINCT in a blog post. So when Aaron Bertrand (@aaronbertrand) challenged people to review his list of bad SQL habits, and I saw a post from last year at the top, I thought I’d have to have a look.

And I loved his post! It used almost exactly the same example that I had, and came up with the same results, ten years later (I don’t mean that his query took ten years to return results, just that his post is from 2017 and mine is 2007). It’s great to see the ‘next generation’ of blog posts coming up with the same conclusions. (I get to pick on Aaron like this because he was an MVP almost ten years before me (despite being born in the same year as me!), and I therefore consider him a different generation of SQL expert. An ‘elder statesman’ if you like.)

What I want to explore in this post is the particular example that we both used… to bring an important point that could be missed because of the similarity of our examples.

You see, we both happened to use a FOR XML concatenation query, looking back at the same table. We did this to simulate a practical GROUP BY – somewhere that you might feel like GROUP BY is useful, but you know that you’re not using an aggregate function like SUM or MAX, but there isn’t one available. Ok, for Aaron he could’ve used the really new STRING_AGG, but for the old-timer like me, having to use SQL Server 2005, that wasn’t available.

The upshot was that we both used a sub-query, one example using GROUP BY…

select
id,
stuff((
select ',' + t.[name]
from t1 t
where t.id = t1.id
order by t.[name]
for xml path('')
),1,1,'') as name_csv
from t1
GROUP BY id
;

select

id,

stuff((

select ',' + t.[name]

from t1 t

where t.id = t1.id

order by t.[name]

for xml path('')

),1,1,'') as name_csv

from t1

GROUP BY id

;

…and one using DISTINCT instead.

select DISTINCT
id,
stuff((
select ',' + t.[name]
from t1 t
where t.id = t1.id
order by t.[name]
for xml path('')
),1,1,'') as name_csv
from t1
;

select DISTINCT

id,

stuff((

select ',' + t.[name]

from t1 t

where t.id = t1.id

order by t.[name]

for xml path('')

),1,1,'') as name_csv

from t1

;

And, we both demonstrated that GROUP BY was quicker, because the QO didn’t need to do all the string concatenation for every row, only for every unique WhateverWeWereGroupingBy.

Go read those posts to see the results. I’m not going to repeat them here.

Back again? Ok, cool.

Now, this isn’t particularly different to any other sub-query, like this one in the AdventureWorks database,

select od.ProductID,
(select Name
from Production.Product p
where p.ProductID = od.ProductID) as ProductName
from Sales.SalesOrderDetail od
group by od.ProductID
;
select distinct od.ProductID,
(select Name
from Production.Product p
where p.ProductID = od.ProductID) as ProductName
from Sales.SalesOrderDetail od
;

select od.ProductID,

(select Name

from Production.Product p

where p.ProductID = od.ProductID) as ProductName

from Sales.SalesOrderDetail od

group by od.ProductID

;

select distinct od.ProductID,

(select Name

from Production.Product p

where p.ProductID = od.ProductID) as ProductName

from Sales.SalesOrderDetail od

;

, with the corresponding plans shown here.

Both posts demonstrate that using DISTINCT, the Query Optimizer comes up with a worse plan, because it doesn’t seem to know that the sub-query will return the same values on subsequent calls, figuring that it should see what comes back each time before checking for distinct values. It’s like it worries that DISTINCT option could produce different results. Using GROUP BY it only calls the sub-query for the distinct values in the GROUP BY clause, which could be massive savings.

Now, I’m sure you’re thinking “Hang on – I could’ve written that query another way…”, and you’re right. You probably would’ve written it like this:

select distinct od.ProductID, p.Name as ProductName
from Sales.SalesOrderDetail od
join Production.Product p on p.ProductID = od.ProductID
;

select distinct od.ProductID, p.Name as ProductName

from Sales.SalesOrderDetail od

join Production.Product p on p.ProductID = od.ProductID

;

…and the query plan for this one looks like this, which is the shape as the GROUP BY query – despite using DISTINCT.

But if I use a LEFT JOIN instead of an INNER JOIN, then the behaviour changes significantly, and it goes back to joining the results of the aggregate, rather than aggregating the results of the join.

select distinct od.ProductID, p.Name as ProductName
from Sales.SalesOrderDetail od
left join Production.Product p on p.ProductID = od.ProductID
;

select distinct od.ProductID, p.Name as ProductName

from Sales.SalesOrderDetail od

left join Production.Product p on p.ProductID = od.ProductID

;

I’m a big fan of LEFT JOIN – it makes it very clear that I’m not filtering my base table when I do the join. So I don’t want to hear that the performance is worse.

Luckily, this sounds like a thing about GROUP BY. After all, I only want to group by od.ProductID – there can only ever be one Product Name per ProductID. And yet,

select od.ProductID, max(p.Name) as ProductName
from Sales.SalesOrderDetail od
left join Production.Product p on p.ProductID = od.ProductID
group by od.ProductID
;

select od.ProductID, max(p.Name) as ProductName

from Sales.SalesOrderDetail od

left join Production.Product p on p.ProductID = od.ProductID

group by od.ProductID

;

it doesn’t help at all.

Hmm.

There is definitely an advantage to using GROUP BY over DISTINCT when using sub-queries. But the story isn’t actually as clear as all that, and is affected by a bunch of other scenarios.

The trick is to look at where the aggregation is going on, and to remember that an aggregate operator reduces the number of rows in play. Anything which reduces the number of rows being pulled through the plan should be pushed as far to the right as possible. If your aggregate will do that the most, do what you can (query rewrites, indexes, constraints, preferably not hints) to get that operator as far right as possible. But you could do the same with a join operator, or a Filter operator, whatever is reducing the rows.

Understand that GROUP BY is generally better than DISTINCT, because it’s a technique for pushing the operator to the right hand side of a join. But remember it’s only part of the story.

@rob_farley

PS… I feel like I want to expand on this a bunch more, but I need to get this published tonight for T-SQL Tuesday. It’s the 99th one, and I don’t want to miss one now, just because I could type about this for hours.