I'm now doing two sessions at the SQL Saturday event in Portland. I had been scheduled to do a single session (on indexes), but got an email yesterday asking if I could do another one as well. So now I'm going to do a session earlier in the day about Joins.
Yes, JOINs. Nice co-incidence to find that this month's T-SQL Tuesday, hosted by Stuart Ainsworth and a week early because of the PASS Summit, is on that same topic. Plus the fact that I gave a presentation on it to the Adelaide SQL User Group last week.
So let's jump in…
There are three types of joins that we write with the JOIN keyword – INNER, OUTER and CROSS. I'm imagine if you've read this far through this post, you're very much aware of that. But there are other types of joins as well, that don't use the JOIN keyword (and I'm not counting the comma short-hand for CROSS JOIN that we all used back in the 90s).
A join describes the way that two sets are related to each other within a query (and for the purposes of this post, I'm not going to count joins that concatenate two sets, such as using the UNION keyword). If you're using the JOIN keyword, the relationship between the tables is shown in the ON clause – although if you're using CROSS JOIN, the relationship is that every row in one set is related to every row in the other set.
If we don't use the JOIN keyword, then where do we see joins? Well, in correlated sub-queries. I showed this when demonstrating that the APPLY keyword applies a join between one set an another in another recent blog post.
But every correlated sub-query, whether using APPLY or not must perform a join.
Consider the following:
SELECT * , (SELECT COUNT(*) FROM Production.Product AS p WHERE p.ProductSubcategoryID = s.ProductSubcategoryID) AS ProdCount FROM Production.ProductSubcategory AS s;
There's a join here, between Production.ProductSubcategory and Production.Product. The ON clause equivalent is the WHERE clause of the correlated sub-query. But what kind of join is it?
Let's think… There's clearly a relationship between the two tables, so it's not a CROSS JOIN. Also, there's nothing that would stop a row from the ProductSubcategory table to be returned, so it's not an INNER JOIN (which only returns rows that match). It's an OUTER JOIN. And we can prove this by looking at the execution plan.
The plan says "Right Outer Join". Don't be upset by the fact that it's a Right Join rather than a Left Join – it's just that Product table is coming from the top row and ProductSubcategory is from the bottom row. The 'side' of an Outer Join is just a matter of perspective. A Right Join is no different to a Left Join, it's simply a question of which side of the desk you're on.
Another type of join with JOIN uses EXISTS or IN. These two act very similarly, and I'm going to use EXISTS in my examples.
Consider the following query:
SELECT * FROM Production.ProductSubcategory AS s WHERE EXISTS (SELECT * FROM Production.Product AS p WHERE p.ProductSubcategoryID = s.ProductSubcategoryID);
and it's converse:
SELECT * FROM Production.ProductSubcategory AS s WHERE NOT EXISTS (SELECT * FROM Production.Product AS p WHERE p.ProductSubcategoryID = s.ProductSubcategoryID);
Just as before, this performs a join between the two tables. But it's not actually an Outer Join we see going on here. After all, we see that rows can be filtered out of the ProductSubcategory table. Furthermore, we don't have access to any information in the Product table, not even the count of rows. But it does Filter. This is neither an Inner Join or an Outer Join. The EXISTS form is a Semi Join, and the NOT EXISTS form is an Anti Semi Join. These operations simply filter one set based on whether a match is present or not. We see this operation in the two plans involved.
The fact that the Query Optimizer chose to implement one with a Hash Match and one with a Nested Loop is irrelevant. We're looking at the Logical aspects of the plan, not the Physical.
And so we see there are plenty of times that a Join can appear without the JOIN keyword.
But let me show you something interesting about the Anti Semi Join for a moment…
Many people don't like NOT EXISTS, and would rather write my earlier query like this:
SELECT * FROM Production.ProductSubcategory AS s LEFT OUTER JOIN Production.Product AS p ON p.ProductSubcategoryID = s.ProductSubcategoryID WHERE p.ProductID IS NULL;
It's relatively easy to show that these two queries should be identical. Both of them find rows where a match doesn't exist. Both perform a logical Anti Semi Join. And yet many people will tell you that doing an Outer Join plus a WHERE clause is better – from a performance perspective. They will generally tell you that NOT EXISTS is logically equivalent.
We've seen plenty of times when the Query Optimizer will realise that two queries are identical and produce the same plan, but this doesn't seem to be the case here.
This plan shows very much what we asked for. An Outer Join plus a Filter, as opposed to the Anti Semi Join.
And this has very little reason to be any faster than an Anti Semi Join operation. In fact, the interesting thing here is that the Outer Join completes. It tracks every successful match and passes that up to the Filter. Our Anti Semi Join operation doesn't do that.
I see no reason to use an Outer Join plus Filter over NOT EXISTS. It may have been quicker at some point if the Anti Semi Join operation was done poorly in earlier versions, but I'd really like to hear if anyone can show me the Outer Join plus Filter method being quicker.
PS: This Saturday (Oct 8th) I'll be presenting this kind of stuff in my "Deeper JOINs" session at SQL Saturday #92 in Portland, and in my pre-conference seminar a couple of days later (Monday 10th) in Seattle at the PASS Summit. You can still register for these events, so if you find this kind of thing interesting, get yourself over to the Pacific North West and I'll see you there!