Join Effects With UPDATE

A lot of people don’t like UPDATE with a FROM clause. I’m going to explore what’s going on, both logically and within the query plan. And I’m doing this for T-SQL Tuesday #74, hosted by Robert Davis (@sqlsoldier).

I’m going to use a fresh copy of AdventureWorks2012. I’m using SQL 2012 SP3, but the things I’m looking at should apply to most versions. I’m going to use Production.Product and Production.ProductSubcategory. The Product table has a ProductSubcategoryID column with a foreign key in place, although this column allows nulls, as not every product must be within a subcategory.

Our standard UPDATE query doesn’t have a FROM clause. It’s just “UPDATE … SET … WHERE …”, and if we need to hook into other tables, we use sub-queries. Let’s look at why this is significant.

The WHERE clause filters rows. A sub-query in the WHERE clause still only filters rows, either by being a scalar expression used in one of the predicates, or being a single-column table expression used in an IN predicate, or a table expression used in an EXISTS clause. Any other tables used in sub-queries in the WHERE clause can only be used to help filter the table being updated – they can’t affect the SET clause at all, or cause a row to be updated multiple times.

Some examples are like this:
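The original examples are not reproduced here, so the following is a minimal sketch of the three sub-query shapes just described. The column being set (DiscontinuedDate) and the filter values ('Tights', category 1) are illustrative assumptions, and the whole thing is rolled back so that the fresh copy of AdventureWorks2012 stays fresh.

```sql
BEGIN TRANSACTION;

-- Scalar sub-query used in a predicate
UPDATE Production.Product
SET DiscontinuedDate = GETDATE()
WHERE ProductSubcategoryID =
    (SELECT s.ProductSubcategoryID
     FROM Production.ProductSubcategory AS s
     WHERE s.Name = 'Tights');

-- Single-column table expression used in an IN predicate
UPDATE Production.Product
SET DiscontinuedDate = GETDATE()
WHERE ProductSubcategoryID IN
    (SELECT s.ProductSubcategoryID
     FROM Production.ProductSubcategory AS s
     WHERE s.ProductCategoryID = 1);

-- Table expression used in an EXISTS predicate
UPDATE Production.Product
SET DiscontinuedDate = GETDATE()
WHERE EXISTS
    (SELECT *
     FROM Production.ProductSubcategory AS s
     WHERE s.ProductSubcategoryID = Production.Product.ProductSubcategoryID
       AND s.ProductCategoryID = 1);

-- Undo the changes; these statements are only here to show the shapes
ROLLBACK TRANSACTION;
```

In every case the sub-query can only decide whether a Product row qualifies. It has no way to change what gets written, or how many times.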

Using other tables in the SET clause generally means something that returns a scalar value, although this could become more complex using CASE. Still though, the logical impact on the overall query is notable. Something in the SET clause cannot be used to filter the values being updated, or to update a value multiple times. The SET clause is a list of “column = <scalar expression>” clauses, in which each column must come from the table (or table expression) being updated (which means I don’t think it should ever have a table alias), and cannot be listed multiple times. As the expression is scalar, it can’t produce multiple rows or columns.
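As a sketch of a SET clause that reaches into another table, here is a correlated scalar sub-query. The choice of ModifiedDate as the column is an assumption made for illustration. The WHERE clause is there because Product.ModifiedDate does not allow NULLs, and a product without a subcategory would otherwise be handed a NULL.

```sql
BEGIN TRANSACTION;

UPDATE Production.Product
SET ModifiedDate =
    (SELECT s.ModifiedDate
     FROM Production.ProductSubcategory AS s
     WHERE s.ProductSubcategoryID = Production.Product.ProductSubcategoryID)
-- Skip products with no subcategory, which would produce a NULL ModifiedDate
WHERE ProductSubcategoryID IS NOT NULL;

-- Undo the change; the statement is only here to show the shape
ROLLBACK TRANSACTION;
```

If a scalar sub-query like this ever found more than one row for a product, the statement would fail with an error rather than quietly picking one.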

Now, all UPDATE statements could be written like this. As an UPDATE statement cannot change the number of rows in a table, the net effect on any row is a single change (I know triggers could be used to have a larger effect, but that’s a separate topic). I’ve met plenty of people over the years who will argue for never using a FROM clause in an UPDATE statement.

You see, a FROM clause can have more of an effect than these sub-queries.

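Let’s think about what introducing extra tables via a FROM clause can do. For now, let’s start with what’s going on in a SELECT query, when you turn a FROM clause into a FROM … JOIN clause. A join can have up to four effects on the base table:

1. It can give you access to the columns of the joined table, to use in expressions or predicates.
2. It can filter the rows of the base table, keeping only those that have a match.
3. It can cause rows of the base table to appear multiple times, when several joined rows match.
4. It can introduce NULL rows, when an outer join returns rows from the joined table that have no match in the base table.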

The particular combination of these will affect the type of join performed by your SELECT query – such as a Semi Join which does number 2, but none of the others. And if it does none, then the join is redundant and won’t appear in the query plan at all.
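Two hedged SELECT sketches of the extremes, with the filter value (category 1) chosen purely for illustration. The first uses the other table only as a filter, which is the Semi Join situation. The second joins in a way that cannot filter, duplicate, or add NULL rows, and uses no columns from the joined table, so I would expect the join to vanish from the plan.

```sql
-- Filter only: EXISTS cannot add columns, duplicate rows, or introduce NULL rows
SELECT p.ProductID, p.Name
FROM Production.Product AS p
WHERE EXISTS
    (SELECT *
     FROM Production.ProductSubcategory AS s
     WHERE s.ProductSubcategoryID = p.ProductSubcategoryID
       AND s.ProductCategoryID = 1);

-- No effect at all: no columns from s are used, a LEFT JOIN with the base
-- table first cannot filter or add NULL rows, and the primary key on
-- s.ProductSubcategoryID rules out duplicates
SELECT p.ProductID, p.Name
FROM Production.Product AS p
LEFT JOIN Production.ProductSubcategory AS s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID;
```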

So how does this work in an UPDATE statement?

There are two possible ways that a FROM clause can work – one is to include the table being updated in the FROM clause, and the other is to have it NOT included in the FROM clause. If it doesn’t appear in the FROM clause, then predicates to define the matching criteria must be included in the WHERE clause to avoid updating every row. If it does appear in the FROM clause, then I would recommend using the table alias in the UPDATE clause rather than the name of the table itself.

(Interestingly PDW does not support “UPDATE … FROM … JOIN”, although “UPDATE … FROM … WHERE” is fine.)

So this is fine:
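A sketch of that recommended form, with the alias in the UPDATE clause and the table being updated included in the FROM clause. The column choice follows the later discussion of s.ModifiedDate; the exact statement is a reconstruction, not necessarily the original example.

```sql
BEGIN TRANSACTION;

-- The alias p in the UPDATE clause says exactly which row source is updated
UPDATE p
SET ModifiedDate = s.ModifiedDate
FROM Production.Product AS p
JOIN Production.ProductSubcategory AS s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID;

-- Undo the change; remove the transaction wrapper to keep it
ROLLBACK TRANSACTION;
```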

As is this:
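One other form that works (again a sketch, and an assumption about which variation was meant) names the table itself in the UPDATE clause while it still appears in the FROM clause:

```sql
BEGIN TRANSACTION;

-- The table name in the UPDATE clause is matched to its single appearance in FROM
UPDATE Production.Product
SET ModifiedDate = s.ModifiedDate
FROM Production.Product AS p
JOIN Production.ProductSubcategory AS s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID;

ROLLBACK TRANSACTION;
```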

But please be careful about:
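A sketch of the pattern to be wary of, where the table being updated does not appear in the FROM clause at all and the WHERE clause is the only thing tying the two tables together. The statement is illustrative rather than the original example.

```sql
BEGIN TRANSACTION;

-- Production.Product is not in the FROM clause; the match lives in the WHERE clause
UPDATE Production.Product
SET ModifiedDate = s.ModifiedDate
FROM Production.ProductSubcategory AS s
WHERE s.ProductSubcategoryID = Production.Product.ProductSubcategoryID;

ROLLBACK TRANSACTION;
```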

It works, but I don’t consider it safe. Because you have the potential to update a table which isn’t mentioned in the FROM clause, you could find yourself inadvertently updating every row in Production.Product. There are safeguards to prevent it happening – this next example gives an error:

This one, however, doesn’t give an error, and it updates every row in the table. After all, we have a CROSS JOIN going on, because I’ve listed the wrong table.
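A hedged sketch of how that kind of mistake could look. Using Production.ProductCategory as the "wrong table" is an assumption made for illustration. Because Production.Product never appears in the FROM clause and nothing correlates it in a WHERE clause, every product row pairs up with every row the FROM clause returns.

```sql
BEGIN TRANSACTION;

-- Oops: the FROM clause lists ProductCategory where Product was intended
UPDATE Production.Product
SET ModifiedDate = s.ModifiedDate
FROM Production.ProductCategory AS p
JOIN Production.ProductSubcategory AS s
    ON s.ProductCategoryID = p.ProductCategoryID;

-- Definitely undo this one
ROLLBACK TRANSACTION;
```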

If I’m writing queries, it’s generally fine. But if there’s a system which produces dynamic SQL, I start to worry. I’d rather update the alias, and be completely clear about what’s going on.

So let’s go with the idea of using the table alias in the UPDATE clause when using the FROM clause, and choosing to always include the table being updated in the FROM clause. Unless we’re using PDW, of course.

But the impact of those join effects… let’s look at them.

Earlier, we saw this query. An inner join between Product and ProductSubcategory.

Now, because s.ProductSubcategoryID is known to be unique (it’s the primary key on s), there is no way that this can cause ‘multiple updates’ to Product. Things are okay here, but filtering could certainly apply. A join is done to get the values from ProductSubcategory, and the rows are fed into the Clustered Index Update operator.

[Image: query plan for the inner join UPDATE]

Filters are okay here. UPDATE is happy with filters, whether they’re implemented using the WHERE clause or via an ON clause.

But what if the unique index weren’t there? Then we might see duplicate rows – the next join effect.
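A sketch of taking the uniqueness guarantee away. The constraint names are the ones I believe AdventureWorks2012 uses, so treat them as assumptions and check sys.key_constraints and sys.foreign_keys first. The foreign key has to go before the primary key it depends on.

```sql
-- Drop the foreign key that references the primary key (name assumed)
ALTER TABLE Production.Product
    DROP CONSTRAINT FK_Product_ProductSubcategory_ProductSubcategoryID;

-- Drop the primary key, which leaves ProductSubcategory as a heap (name assumed)
ALTER TABLE Production.ProductSubcategory
    DROP CONSTRAINT PK_ProductSubcategory_ProductSubcategoryID;

-- The same UPDATE as before, now with no guarantee of a single match per product
UPDATE p
SET ModifiedDate = s.ModifiedDate
FROM Production.Product AS p
JOIN Production.ProductSubcategory AS s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID;
```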

Now what does the plan look like – we should see a Table Scan instead of a Clustered Index Scan because we just dropped the PK, but what other differences?

[Image: query plan with the primary key dropped, showing a Distinct Sort]

It looks very similar, but now throws a Distinct Sort in there. You see, an Update isn’t going to do multiple updates. It won’t allow it. So it does a Distinct Sort on the PK of the Product table, and uses whichever value it cares to for the update.

Another option it could’ve used would’ve been to use an Aggregate operator (because GROUP BY and DISTINCT are essentially the same thing), in which case it would’ve needed to apply an aggregate function to s.ModifiedDate while grouping by the Product PK. Which aggregate? The ANY() aggregate, of course – because it doesn’t care which value to use, it just has to be a valid one. I can get this plan by using an OPTION (FAST 1) query hint, because that will avoid doing the Sort, as a Sort is blocking. It also turns the Hash Match into a Nested Loop, because it really wants to get that first row through as quickly as possible. It’s a slower query, but lets us see the ANY() aggregate.
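A sketch of the hinted statement, assuming the same UPDATE as in the rest of this post and that the primary key on ProductSubcategory is still dropped:

```sql
UPDATE p
SET ModifiedDate = s.ModifiedDate
FROM Production.Product AS p
JOIN Production.ProductSubcategory AS s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID
-- Optimize for getting the first row out quickly, which discourages blocking operators
OPTION (FAST 1);
```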

[Image: query plan with OPTION (FAST 1), showing the ANY() aggregate]

So we can see that if multiple rows are going to be returned by the FROM clause, this will get shrunk down to a single one. This is how that third ‘join effect’ is handled.

Be really careful about this. It’s a bad thing, and the reason why purists don’t like to see a FROM clause in an UPDATE statement.

The next (and final – yay!) join effect is to have NULLs introduced.

Let’s start by putting those constraints back in:
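A sketch of restoring them. The constraint names are assumptions matching what I believe AdventureWorks2012 ships with, and the existence checks make the script harmless if the constraints were never dropped.

```sql
-- Recreate the clustered primary key on ProductSubcategory if it is missing
IF NOT EXISTS (SELECT * FROM sys.key_constraints
               WHERE name = 'PK_ProductSubcategory_ProductSubcategoryID')
    ALTER TABLE Production.ProductSubcategory
        ADD CONSTRAINT PK_ProductSubcategory_ProductSubcategoryID
        PRIMARY KEY CLUSTERED (ProductSubcategoryID);

-- Recreate the foreign key from Product to ProductSubcategory if it is missing
IF NOT EXISTS (SELECT * FROM sys.foreign_keys
               WHERE name = 'FK_Product_ProductSubcategory_ProductSubcategoryID')
    ALTER TABLE Production.Product
        ADD CONSTRAINT FK_Product_ProductSubcategory_ProductSubcategoryID
        FOREIGN KEY (ProductSubcategoryID)
        REFERENCES Production.ProductSubcategory (ProductSubcategoryID);
```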

…and let’s put a RIGHT JOIN in there (for purely academic reasons – I know you’d never do this in real life, although you might put a LEFT JOIN in with the base table second). This means that our FROM clause will return an extra row for each ProductSubcategory that has no Products. There aren’t any of them in AdventureWorks2012, but the Query Optimizer doesn’t know that.
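A sketch of that RIGHT JOIN version, assuming the same UPDATE as in the rest of this post with only the join type changed:

```sql
BEGIN TRANSACTION;

-- Every ProductSubcategory row is preserved, so a subcategory with no
-- products would come out of the join with NULLs in the columns from p
UPDATE p
SET ModifiedDate = s.ModifiedDate
FROM Production.Product AS p
RIGHT JOIN Production.ProductSubcategory AS s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID;

ROLLBACK TRANSACTION;
```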

Before I go any further, let’s quickly make something clear. With a right outer join, the result set of the join contains rows that don’t exist in the base table. Obviously we can’t update those – there’s nothing in the base table for those rows. But we’re going to look at how the query plan handles this situation.

There are two things of note here in the query plan (apart from the fact that it has put the base table second and used a left join):

[Image: query plan for the RIGHT JOIN UPDATE, showing a Filter and a Distinct Sort]

For a start, we still see a Distinct Sort! I can assure you that the unique constraint is in there. If I remove the keyword ‘RIGHT’ I go back to my original version without any distinctifying operator. But the thing that’s new here is that Filter. Let’s look at what the Filter is doing:

[Image: properties of the Filter operator]

It’s filtering on “[IsBaseRow1001] IS NOT NULL”. That value is coming from the Product table, and is simply checking to see whether the row coming out of the Join operator is a match or not. It’s testing to see if we actually have a row in the base table to update. It could’ve tested ProductID for NULL for this, like we would in a query, but I guess it’s quicker to test IsBaseRow for being NULL than to test ProductID. I don’t know much about IsBaseRow1001, but I can tell that it’s not more than a single byte. The Estimated Row Size on the Scan of Product was 15B in the original query, and is 16B in this query. But I’m just guessing here. Theoretically it’s not needed at all, of course, and for testing, could have been a single bit.

Or the Query Optimizer could have turned the join into an inner join. After all, we’re not interested in updating a non-row. As much as it’s interesting to see IsBaseRow1001 coming through, I can’t help but think that turning that join operator into an inner join would’ve done the trick. But as we don’t see LEFT JOIN + NULL being turned into an Anti Semi Join either, I’m not too surprised that this translation isn’t happening either.

Because there could be multiple Subcategories without Products, there is a possibility of non-unique ProductIDs – the NULLs – coming out of the Join operator. But these are the only ones that could be duplicates, because each Product has at most a single matching ProductSubcategoryID in s. Therefore, once the NULLs have been removed by the Filter, the QO should be able to know that the data coming out of the Filter is unique on ProductID, but it doesn’t use this information, and needs a distinctifying operator to be sure.

The Distinct Sort is still on ProductID, but a FAST 1 hint turns it into a Hash Aggregate this time instead of a Stream Aggregate. The reason for this is that a Nested Loop over the Product table isn’t going to find the ProductSubcategories that don’t have Products (although it could if it understood the translation to Inner Join). Therefore, it still performs a Hash Match join, does the filter for IsBaseRow1001, and then does a Hash Match (Flow Distinct) on ProductID.

It’s interesting to see that we have a Build Residual here on ProductID, despite ProductID being an integer.

[Image: Hash Match properties showing the Build Residual on ProductID]

You see, normally in a Hash Match on an integer we wouldn’t see a residual because the hash function produces an integer. It’s because ProductID could have been NULL. The nullability of the column coming through obviously wasn’t changed by the Filter (and let’s face it – it didn’t test ProductID for NULL, it tested IsBaseRow1001).

Quick interlude for showing that the hash function produces a 32-bit value, and doesn’t need a residual check when hashing on a non-nullable integer (while a nullable integer needs more than 32 bits):
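The original pair of queries is not reproduced here, so this is a sketch of two joins that should show the contrast. The HASH join hints are my assumption, there to make sure a Hash Match appears on tables this small. The first joins on s.ProductCategoryID, which does not allow NULLs; the second joins on p.ProductSubcategoryID, which does.

```sql
-- Join key s.ProductCategoryID is a non-nullable int
SELECT *
FROM Production.ProductSubcategory AS s
INNER HASH JOIN Production.ProductCategory AS c
    ON c.ProductCategoryID = s.ProductCategoryID;

-- Join key p.ProductSubcategoryID is a nullable int
SELECT *
FROM Production.Product AS p
INNER HASH JOIN Production.ProductSubcategory AS s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID;
```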

Compare the plans of these two queries. The Hash Match operator in the first one doesn’t have a Probe Residual, because s.ProductCategoryID doesn’t allow NULLs. The Hash Match operator in the second does have a Probe Residual, because p.ProductSubcategoryID does allow NULLs, and a nullable integer can cause clashes in the hash table.

Also consider the Hash Match operator in the following query:

In that plan, we see a Probe Residual on a non-nullable bigint (a lot more than 32 bits). This tells me that bigints can have clashes in the hash table, despite non-nullable integers not showing this.

Oh yes, we were looking at the Build Residual.

If we went back to an Inner Join with FAST 1, where we got a Stream Aggregate, and turn that into a Hash Match on the non-nullable ProductID, we can see that our Build Residual disappears.

[Image: Hash Match properties on the non-nullable ProductID, with no Build Residual]

Let’s do a bit of a summary…

If you’re doing an UPDATE, you can only update each row in that table one time, no matter what kind of impact your FROM clause might want to have on the base table. You may feel like it’s going to update some rows multiple times, but that’s not how it works.

Each of the join effects is either applied (in the case of a filter) or mitigated (in the case of duplicates or NULLs), so that you can access the data in other tables without fear of having a bad effect on your UPDATE, but don’t do it! Because you can’t tell which value gets picked up by the ANY() aggregate (or first row by a distinctifying operator), you should avoid duplicates completely, so that your UPDATE finds a single value to update each row with.
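One way to take that choice away from the engine, sketched under the assumption that "the latest ModifiedDate" is the value you actually want, is to collapse the other table to one row per key before joining. The derived table and the MaxModifiedDate column name are made up for the example.

```sql
BEGIN TRANSACTION;

UPDATE p
SET ModifiedDate = s.MaxModifiedDate
FROM Production.Product AS p
JOIN
    -- One row per ProductSubcategoryID, so each product can match at most once
    (SELECT ProductSubcategoryID,
            MAX(ModifiedDate) AS MaxModifiedDate
     FROM Production.ProductSubcategory
     GROUP BY ProductSubcategoryID) AS s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID;

ROLLBACK TRANSACTION;
```

The GROUP BY makes the uniqueness explicit, so the value written no longer depends on which row a Distinct Sort or ANY() aggregate happens to keep.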

And I would encourage you to use table aliases in your UPDATE clause if you use a FROM clause with a JOIN – but if you don’t use a JOIN, then make sure you include the match in your WHERE clause.

@rob_farley