The Query Optimizer’s handling of Relationships for T-SQL Tuesday #003

February 9, 2010

I’m feeling the pressure for this month’s T-SQL Tuesday, probably because I’m also the host. I’ll be posting a roll-up for it soon too, which I’m sure will be great fun researching.

Given that the topic is relationships, and the SQL Server database engine is at heart a relational engine, relationships could hardly be more relevant to databases. The idea behind RDBMSs is that keys are used to refer to entities. This leads to foreign keys, whereby a column (or set of columns) is constrained to values that appear in another table, thereby enforcing referential integrity. And yet there are so many database designs out there that do not have relationships defined between tables. I shudder when I find them, but that doesn’t make them any less commonplace.
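
For illustration, here’s a minimal sketch of a foreign key declaration, using a hypothetical Customers/Orders pair rather than any particular database:

    CREATE TABLE dbo.Customers (
        CustomerID int NOT NULL PRIMARY KEY,
        CustomerName nvarchar(100) NOT NULL
    );

    CREATE TABLE dbo.Orders (
        OrderID int NOT NULL PRIMARY KEY,
        CustomerID int NOT NULL
            CONSTRAINT FK_Orders_Customers
            FOREIGN KEY REFERENCES dbo.Customers (CustomerID),
        OrderDate datetime NOT NULL
    );
    -- Every Orders.CustomerID must now match a row in Customers;
    -- the engine enforces the referential integrity for us.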

In data warehouses, designed primarily for reporting systems rather than transactional systems, tables are often designed in a more denormalized manner, to avoid needing to perform so many joins to access the data required for reports. This involves having tables with many extra columns, containing data that would otherwise be stored in other tables. Fewer tables are used, and therefore the system has fewer relationships.

I sometimes wonder how this should affect relational database design. I have written before about how the Query Optimizer can leverage foreign key relationships to simplify queries, recognising that some joins are redundant and can be removed from plans entirely. To summarise:

A foreign key relationship maps a column (or set of columns) in one table to a unique key (typically the primary key) in another table (which could even be the same one). Because of this uniqueness, the relationship can map to at most one record on the other side. And because the foreign key relationship enforces referential integrity, it must map to exactly one record on the other side (with the caveat that the foreign key column(s) could be configured to allow NULLs, which won’t map). Therefore, a join that doesn’t actually select data from any of the columns in the second table can sometimes be removed from the query completely, as if the query didn’t have the join at all. But read my other post for more on that.
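
To sketch the idea (assuming the AdventureWorks sample database, where Sales.SalesOrderDetail has a trusted, non-nullable foreign key to Sales.SalesOrderHeader):

    SELECT sod.ProductID, sod.OrderQty
    FROM Sales.SalesOrderDetail AS sod
    JOIN Sales.SalesOrderHeader AS soh
        ON soh.SalesOrderID = sod.SalesOrderID;
    -- Nothing from SalesOrderHeader appears in the SELECT list, and the FK
    -- guarantees each detail row matches exactly one header row, so the
    -- optimizer can produce a plan that never touches SalesOrderHeader at all.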

Thinking about many of the queries that I’ve written over the years, I know that I often only want one field from the table I’m joining. For example, I might want to get a ProductName when I have a ProductID, or the Login from a UserID. It’s a standard lookup situation, one which a data warehouse would often handle by storing the Name in a dimension table rather than the ID.
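
For example (again sketched against AdventureWorks), the join exists purely to fetch a name:

    SELECT sod.SalesOrderID, sod.OrderQty, p.Name AS ProductName
    FROM Sales.SalesOrderDetail AS sod
    JOIN Production.Product AS p
        ON p.ProductID = sod.ProductID;
    -- This join can't be simplified away, because p.Name is needed in the output.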

So my surmising leads me to this question:

If there is a unique index on the field that I typically want to lookup from a table, does this make it a better candidate for foreign key relationships than the primary key, so that the system can avoid needing as many joins?

This screams against everything I ever learned about relational databases. It would obviously make for a larger row, but if that were offset by performance gains when querying, could it be worthwhile? Maintaining referential integrity on a string field may be more costly than on an integer field, but I wonder whether reducing the number of joins that SELECT queries require might make it a worthy consideration.
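
To make the surmise concrete, here’s a minimal sketch using hypothetical tables: the foreign key references a unique constraint on the name rather than the primary key, so the name travels with the referencing row and the lookup join disappears.

    CREATE TABLE dbo.Products (
        ProductID int NOT NULL PRIMARY KEY,
        ProductName nvarchar(50) NOT NULL
            CONSTRAINT UQ_Products_ProductName UNIQUE
    );

    CREATE TABLE dbo.OrderLines (
        OrderLineID int NOT NULL PRIMARY KEY,
        ProductName nvarchar(50) NOT NULL
            CONSTRAINT FK_OrderLines_Products_ProductName
            FOREIGN KEY REFERENCES dbo.Products (ProductName),
        OrderQty smallint NOT NULL
    );

    -- The name is now available without any join at all:
    SELECT OrderLineID, ProductName, OrderQty
    FROM dbo.OrderLines;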

Please note: I’m not saying it’s necessarily a good idea – I’m surmising here, and would love to hear comments about whether or not other people have tried it, what circumstances they were trying to handle, and whether or not the idea worked.

But back to the Query Optimizer…

The QO needs information to work out how to run your query. It needs statistical information about the columns being filtered; it needs to figure out which parts of the query can utilise indexes effectively (see my recent posts about SARGability); and, very significantly, it needs information about the relationships between tables.

Any time you write a query that involves a join (which I imagine most of your queries do), the system can use information about the relationship between the two tables. If it is defined as a foreign key relationship on a column that doesn’t allow NULL values, then the system knows that each record in the ‘child’ table must match exactly one record in the other table. The Query Optimizer can use this information, so make sure it’s available. Any time you’re thinking of not putting enforced relationships in your system, consider that you’re blinding the Query Optimizer and ultimately making it do more work.
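
A caveat worth noting here: the Query Optimizer only uses a foreign key it can trust. A constraint enabled WITH NOCHECK (for example, after a bulk load that skipped constraint checking) is marked as not trusted, and won’t help with join simplification. A quick way to look for untrusted foreign keys:

    -- Foreign keys the optimizer cannot trust in the current database:
    SELECT name AS constraint_name,
           OBJECT_NAME(parent_object_id) AS table_name
    FROM sys.foreign_keys
    WHERE is_not_trusted = 1;

    -- Re-validating a constraint makes it trusted again (hypothetical name):
    -- ALTER TABLE dbo.OrderLines WITH CHECK
    --     CHECK CONSTRAINT FK_OrderLines_Products_ProductName;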

Put your relationships into your database design. Put constraints where they can apply, mark items as unique. The Query Optimizer will thank you for it, expressing its thanks as better performance.

This Post Has 4 Comments

  1. Jeremiah Peschka

    I’m guessing you’re referring to the difference between using a surrogate key and a superkey. I make that guess because if your PK is on a natural candidate key, it wouldn’t be a question.

    As long as the unique index has all included columns for queries (effectively making it a second clustered index), I would say that it’s probably a better candidate for the FK than the artificial PK.

    While we’re frequently told we should use surrogate PKs, isn’t it more important to balance logic and reason (referencing the natural candidate key) with convenience (referencing the surrogate key)? A pure relational theorist, of course, would say to use the natural key. I’ve used both and haven’t seen appreciable differences in application performance. These days the choice comes down to what my predecessor has done.

  2. Joe Celko

    FYI

    Sybase and other SQLs implement the PK-FK relationships as pointer chains that connect the FK to the single PK value in the entire schema. They think of the schema as a whole. You can see the advantages of this for CASCADE, JOIN and other operations.

    SQL Server and other lesser SQLs implement each table separately as if they were files, so the PK value is repeated over and over in the FK columns. To get good join performance, you usually index FKs in such products.

    SQLs that use hashing actually want longer keys because they hash with fewer collisions.

  3. Rob Lobbe

    Foreign Keys are a DESIGN tool, and should be treated as such. As far as the Query Optimizer goes, it should NEVER be an issue in my production database. FKs should be removed, as they only cause overhead to day-to-day processing.
    If the feature of an application is to ‘display an error for every FK breach’, the vendor’s going to get an earful for a shoddy product. Likewise, if the product is relying on a cascade delete, what else are they ‘guessing’ at, especially if I want audits and history retention?
    In a development or testing environment, sure, put them in. They make sure the application doesn’t break the rules, but from an optimizer standpoint I’m not concerned with this overhead; the final product will be that much better without them.

  4. Rob Farley

    @Joe – Thanks for that. I didn’t realise that, and it makes for some interesting thoughts. I do wish that SQL Server did it that way – perhaps at some point it might, or maybe that’s just a pipe dream.

    @RobLobbe – Sorry, we’ll have to disagree on that one. I’ve seen too many situations where the Query Optimizer has been able to run a query so much better because a Foreign Key is in place. I’ve also seen too many situations where a database without FKs has referential integrity issues, and I would always want my Test and Production systems to be identical. Thanks for commenting though – it’s always good to hear other perspectives.
