Tuning slow spatial queries in SQL Server

Even with the right indexes in place, spatial queries in SQL Server are often too slow – but they needn’t be.

Two of the most common query patterns in the spatial world are looking for the nearest thing to where you are (which I’ve written about before), and looking for the points that are within a particular boundary. The latter is where I want to spend most of my time in this post.

A spatial index splits the world (for geography, or the area of interest for geometry) into a series of grid sections. For a point, the index essentially stores which grid sections it falls in; for more complex types, it stores whether the shape completely covers each grid section, doesn’t touch it at all, or only partially overlaps it. If we know that a particular grid section is completely covered by a polygon / multipolygon (I’m going to refer to polygons from here on, but it all applies to multipolygons too), then every point within that grid section must also be within that polygon. The inverse applies to a grid section which doesn’t overlap with the polygon at all – none of the points in that grid section can be in that polygon.
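How finely the world gets carved up is something you control when you create the index. Just as a sketch – the table and index names here are hypothetical, not from any particular dataset:

CREATE SPATIAL INDEX ix_Points_Point
ON dbo.Points (Point)
USING GEOGRAPHY_GRID
WITH (GRIDS = (LEVEL_1 = MEDIUM, LEVEL_2 = MEDIUM, LEVEL_3 = HIGH, LEVEL_4 = HIGH),
      CELLS_PER_OBJECT = 16);   -- grid density at each of the four levels, plus a cap on
                                -- how many cells a single object can occupy in the index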

More work is required if the overlap of the polygon and grid section is only partial.

Let’s assume we’ve got to the finest level of the grid system, and the polygon that we’re considering is partially overlapping the grid section. Now, we need to do some calculations to see if each point is within the polygon (or at least, the intersection of the grid section and the polygon). These calculations involve examining all the sections of the polygon’s boundary to see if each point is inside or outside.

Imagine trying to see whether a particular point is within the coastline of Australia, relying only on our eyes to judge whether the answer is ‘easy’ or ‘hard’.

[Image: map of Australia, zoomed right out]

If we’re zoomed out this much, we can easily tell that a point near the border of South Australia and Northern Territory (near where it says ‘Australia’ in red) is within the coastline, or that a point half way between Tasmania and Victoria is not. To know whether some arbitrary point near Adelaide is within the coastline or not – I can’t tell right now. I’m going to need to zoom in more.

[Image: map zoomed in on South Australia, around Adelaide]

Now, if my point of interest is near Murray Bridge, or in the sea between Kangaroo Island and the mainland, then I can tell my answer. But for a point slightly northwest of Adelaide, we need to zoom again. Notice, though, that we can already see a lot more detail about the coastline than we could before. Parts of the coastline that seemed relatively simple now show a complexity we hadn’t started to consider.

If we zoom again, we see that the coastline to the northwest of Adelaide is frustratingly complex.

[Image: the coastline to the northwest of Adelaide]

And this doesn’t stop as we keep zooming.

[Images: successive zooms into the coastline northwest of Adelaide, ending at the Port River]

The area around the Port River here is full of man-made structures, which can be defined relatively easily at some zoom level. A natural coastline is trickier (go back a few images to see the zoomed out version of this next one).

[Image: the natural coastline, full of small bays]

Here we see a complexity that could just keep going. Each of these bays has its own level of complexity, and as we zoom, more of it keeps appearing. The crinkliness of a coastline behaves like a fractal, letting you keep zooming further and further, like in the animated GIF here (which is on a loop, but demonstrates the point well enough). It’s presented as an illusion, but it’s really just a basic property of fractals.

[Animated GIF: infinite-zoom coastline illusion. Image from http://www.moillusions.com/infinite-zoom-coast-illusion/]

Benoit B. Mandelbrot’s work on fractals gives us what is probably a more familiar ‘infinite zoom’ concept.

[Animated GIF: Mandelbrot set colour zoom. Image from http://commons.wikimedia.org/wiki/File:Mandelbrot_color_zoom.gif]

Coastlines don’t actually let us zoom in infinitely, because we can’t go far part the detail in atoms, but we do need to consider how much detail we choose to define in our polygons.

If I consider a dataset I have of South Australian spatial data, and I query the UnionAggregate of the different postcode areas, I get a geography value of the whole state.
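Something like this – dbo.Postcodes and its Boundary column are my own stand-in names for that dataset:

DECLARE @SApolygon geography =
    (SELECT geography::UnionAggregate(Boundary)   -- aggregate the postcode shapes into one value
     FROM dbo.Postcodes);

I’ll keep using this @SApolygon variable in the snippets that follow.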

[Image: the South Australia geography value, formed from the UnionAggregate of the postcode areas]

There are over 26,000 points defining this area (I didn’t count them, I used the STNumPoints method). If I convert it into a string (using .ToString()), I get a result which is over a million characters long. This is going to take a lot of effort to use in real queries.
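The checks themselves are simple enough – a sketch, using the @SApolygon variable from above:

SELECT @SApolygon.STNumPoints()   AS NumPoints,      -- over 26,000
       LEN(@SApolygon.ToString()) AS StringLength;   -- over a million characters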

But how much detail do I actually need?

One of the spatial methods is called Reduce(). It simplifies a shape, according to a tolerance that you pass in.
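Larger tolerances produce fewer points, which is easy to compare – again just a sketch against @SApolygon:

SELECT @SApolygon.Reduce(50).STNumPoints()    AS Points50,
       @SApolygon.Reduce(250).STNumPoints()   AS Points250,
       @SApolygon.Reduce(10000).STNumPoints() AS Points10000;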

If I use Reduce(50), then it reduces the number of points by about half. But at the zoom level we can see at this point, we might not care at all.

[Image: the outline after Reduce(50)]

If I Reduce(250), the number of points comes down to about 4000.

[Image: the outline after Reduce(250)]

This isn’t necessarily good though. Reducing even further can start to distort the borders so much that we end up making mistakes. A point that is within the Reduced polygon might not be in our original one, and vice versa.

At Reduce(10000) I have under 400 points.

[Image: the outline after Reduce(10000)]

At Reduce(50000) you can clearly see changes. Kangaroo Island has been simplified too much. The peninsulas have become remarkably triangular, and we would be making mistakes.

[Image: the outline after Reduce(50000), with Kangaroo Island and the peninsulas clearly distorted]

Mistakes will come as soon as we Reduce the number of points at all. Despite the fact that at the level shown above we could see no discernible difference between the original set and Reduce(50), if we zoom in we can see that even Reduce(50) is a problem.

[Image: zoomed-in comparison of the original outline and Reduce(50)]

You can clearly see that the crinkliness has been reduced, but that this could make the results of any query change. It would perform better, but it would be wrong.

However – it would only be wrong near the borders. We can use this…

Let me tell you about three other methods – STBuffer(@d), BufferWithCurves(@d), and BufferWithTolerance(@d, @t, @r). These put a buffer around our polygon. If we buffer by enough, and then Reduce, we can make sure that we get no false negatives from our check: anything that is outside this buffered area is definitely outside the original.

BufferWithCurves produces a buffer shape using curves rather than points, which can offer some nice optimisations, but buffering coastlines using BufferWithCurves is very memory-intensive, and you may even find you are unable to produce this buffer.

BufferWithTolerance is like having a Reduce built into the buffer – which sounds ideal, but I found that even with a very high tolerance I was still getting too many points for my liking. With a buffer of 1000, and tolerance of 250, I still had nearly 20,000 points. The things that look like curves in the image below are actually made up of many points.
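The point counts are easy enough to check – a sketch, continuing with @SApolygon:

SELECT @SApolygon.STBuffer(500).STNumPoints()                     AS BufferPoints;
SELECT @SApolygon.BufferWithTolerance(1000, 250, 0).STNumPoints() AS BufferWithTolerancePoints;
-- BufferWithCurves(500) could be checked the same way, but as mentioned above
-- it may be too memory-hungry to complete against a complex coastline.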

So I’m choosing a Reduced STBuffer instead.

With a large enough value for STBuffer, and a sufficiently small value for Reduce, I can make a simple border, and can use STDifference to confirm that it completely surrounds my original polygon. Without much effort, I can find some values that work well enough. Notice that there are only seven points in the image below, whereas the BufferWithTolerance version has hundreds.

select @SApolygon union all select @SApolygon.STBuffer(500).Reduce(500)

[Image: the original outline with the STBuffer(500).Reduce(500) border around it]

The query below confirms that no part of the original land is outside my buffered and reduced area, showing zero points in the STDifference, and is what I used to find appropriate values for buffering and reducing.

select @SApolygon.STDifference(@SApolygon.STBuffer(@reducefactor).Reduce(@reducefactor)).STNumPoints();

[Image: the STDifference query returning zero points]

Having defined an area that gives no false negatives, we can flip it around to eliminate false positives. We do this by buffering and reducing the inverse (which we find using ReorientObject()). I actually found that buffering by 1000 was needed to make sure the reduced shape stayed entirely within the original (the second query below confirms it – the first query is just to overlay the two images).

select @SApolygon union all select @SApolygon.ReorientObject().STBuffer(1000).Reduce(500).ReorientObject();

select @SApolygon.ReorientObject().STBuffer(1000).Reduce(500).ReorientObject().STDifference(@SApolygon).STNumPoints(); -- must be zero

[Image: the original outline with the buffered-and-reduced inner border]

With a zero for the second query above, we have defined an area for which there are no false positives.

Now I know that points within the ‘no false positives’ area are definitely within my original polygon, and that points outside the ‘no false negatives’ area are definitely not. The area between the two is of more interest, and needs checking against the complex coastline.

[Image: the band between the ‘no false positives’ and ‘no false negatives’ areas]

Now let’s play with some better queries to show how to do this.

Let’s create a table of areas. I’m going to use computed columns (which must be persisted to be useful) for the ‘no false negatives’ and ‘no false positives’ areas, although I could’ve done this manually if I wanted to experiment with buffers and tolerances on a per-area basis. Let’s also create a table of points, which we’ll index (although I’ll also let you know how long the queries took without the index).

And we’ll put some data in. Our SA outline in the areas, and the points from AdventureWorks2012’s list of addresses – 19614 points.

Putting the data in took some time. The 19,614 points took about 5 seconds to insert, but the single area took 25 seconds. STBuffer() takes a while, and this is why having these computed columns persisted can help so much.

The query that we want to solve is:
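It’s the obvious check of every point against the full outline – something like this, with the assumed names from above:

SELECT p.PointID
FROM dbo.Points AS p
JOIN dbo.Areas AS a
  ON a.Outline.STIntersects(p.Point) = 1
WHERE a.AreaName = 'SA';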

This returns 46 rows, and it takes 4 seconds (or 3.5 minutes without the index).

…whereas this query takes less than a second (or 20 seconds without the index):
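A sketch of it, with the same assumed names:

SELECT p.PointID
FROM dbo.Points AS p
JOIN dbo.Areas AS a
  ON a.NoFalsePos.STIntersects(p.Point) = 1
WHERE a.AreaName = 'SA';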

But it only returns 44 rows, because we’re only including the points that are in the “no false positives” area.
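Counting against the ‘no false negatives’ column instead might look like this:

SELECT p.PointID
FROM dbo.Points AS p
JOIN dbo.Areas AS a
  ON a.NoFalseNeg.STIntersects(p.Point) = 1
WHERE a.AreaName = 'SA';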

This query returns 47 rows. We’re counting the points that are in the “no false negatives” area, which is a larger area than we want.

It also runs in under a second (or 20 seconds without the index).

By now, hopefully you can see where I’m going with this…

What I need is the list of points that are inside ‘nofalsepos’, plus the ones that are inside ‘nofalseneg’ but not inside ‘nofalsepos’ and that have also been checked against the proper area.
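A sketch of that combined query, with the same assumed names – the second branch only does the expensive check against the full outline for points that fall in the band between the two areas:

SELECT p.PointID
FROM dbo.Points AS p
JOIN dbo.Areas AS a
  ON a.NoFalsePos.STIntersects(p.Point) = 1
WHERE a.AreaName = 'SA'
UNION ALL
SELECT p.PointID
FROM dbo.Points AS p
JOIN dbo.Areas AS a
  ON a.NoFalseNeg.STIntersects(p.Point) = 1
 AND a.NoFalsePos.STIntersects(p.Point) = 0
 AND a.Outline.STIntersects(p.Point) = 1
WHERE a.AreaName = 'SA';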

This query takes approximately 1 second (or 40 seconds without the index).

But it’s complex – and that’s unfortunate, but we can figure this out using a view.

We can then query it using much simpler code such as this (again with my assumed names):
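SELECT PointID
FROM dbo.PointsInAreas        -- the view sketched above
WHERE AreaName = 'SA';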

Coastlines, or any complex spatial values, can make for poor performance, even with indexes in place. To improve things, you can consider Reduce(), although you’ll need to take some additional steps, like the ones above, to guarantee correctness.

And in case you were wondering what the “B.” in “Benoit B. Mandelbrot” stands for… it stands for “Benoit B. Mandelbrot”.

@rob_farley

SHOWPLAN permission denied even if the database isn’t actually used

To view a query plan, you need SHOWPLAN permission at the database level at least. You have this if you have CONTROL DATABASE, or CONTROL SERVER, or if you have ALTER TRACE at the instance level. I know this last one because it’s mentioned in Books Online on the ‘Database Permissions’ page, not because it’s particularly intuitive.

As a consultant, I sometimes deal with customers who are reluctant to grant me the level of access I’d like to have for working on queries. SHOWPLAN is something I will almost always request, though, and generally it’s considered harmless enough – I want to be able to look at plans for whatever queries I’m writing. Actually, I often find myself requesting ALTER TRACE, because it covers SHOWPLAN across all databases. Without it, you can find yourself in a situation where you sometimes get this error:

[Image: SHOWPLAN permission denied error message]

This happens because a view, function, or stored procedure accesses a database that you haven’t been granted access to. Maybe it contains sensitive information – maybe you don’t have security clearance, for example – but there is a table in that database that is referenced as part of a process you need to look into. I’m not going to get into the why, or the reasons why you could request better access, or anything like that – that’s not the point of this post. The point of this post is to talk about something I learned about SHOWPLAN across databases that aren’t actually used in the query. And it’s part of this month’s T-SQL Tuesday, hosted by Mike Donnelly (@SQLMD).

I was thinking about this situation though – having cross-database queries and not having SHOWPLAN on all of the referenced databases – and about the fact that views often contain more information than you necessarily need. This got me back to my Redundant Joins material (which I should reblog about, as I haven’t written about it properly on sqlblog.com), and to the fact that the Query Optimizer can simplify out joins which aren’t actually needed at all.

A question occurred to me that I didn’t know the answer to, so I did a bit of research and found the answer – making it something worth writing about for this T-SQL Tuesday about new things learned.

Imagine a view, a table-valued function, a sub-query – just some table expression – which references (joins to) a lookup table but doesn’t need to. If we’re not interested in any data from the lookup table, the join only matters if it can match multiple rows, or if it acts as a filter (which can’t happen if it’s a left outer join), or if it’s a right outer or full outer join (and therefore returns all the rows in the lookup table, even those with no match in the left-hand set). If none of those apply, it’s redundant, and won’t be part of the query plan.
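Here’s a minimal sketch of the kind of join the optimizer can remove – entirely hypothetical tables, with a left outer join to a lookup table on its unique key, where no lookup columns end up being used:

CREATE TABLE dbo.Colours (ColourID int PRIMARY KEY, ColourName varchar(50));
CREATE TABLE dbo.Products (ProductID int PRIMARY KEY, ProductName varchar(50), ColourID int);
GO
CREATE VIEW dbo.ProductsWithColours AS
SELECT p.ProductID, p.ProductName, c.ColourName
FROM dbo.Products AS p
LEFT JOIN dbo.Colours AS c ON c.ColourID = p.ColourID;
GO
-- No Colours column is referenced, and ColourID is unique in Colours,
-- so the join to Colours is redundant and disappears from the plan:
SELECT ProductID, ProductName
FROM dbo.ProductsWithColours;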

Annoyingly, the simplification phase, when redundant joins are removed, happens AFTER permissions are checked. This is easy to demonstrate. Let’s consider a user who has VIEW DEFINITION rights on a table, but not SELECT rights. This user can run sp_help and see all the metadata associated with the table. This user can query sys.columns and see the rows there, one for each column in the table. But running the query SELECT TOP (0) * FROM dbo.someTable;, which only needs metadata, is denied.

The reason I know it only needs metadata is that, running as a more-privileged user, the query plan shows me this (as shown here, using AdventureWorks2012.Production.Product instead of dbo.someTable).

[Image: the query plan, with no Seek or Scan on the table]

This query does not select data from the table. If it did, we’d see a Seek or a Scan here. This query never needs to access the table. It is explicitly told to fetch no rows from it. The only thing we use here is metadata, and we do have permission to get that.

And yet the less-privileged user can’t run this query. Metadata isn’t a problem, but the permissions are tested first, and the query is rejected.

[Image: SELECT permission denied error message]

Permissions are checked once the query has been parsed. If an object is used in the query, then SELECT permission is required. If an object is updated, then UPDATE permission is needed, even if it’s logically impossible to update any actual rows (try WHERE 1=2 if you need to check).
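For example, a statement like this can never change a row, but still requires UPDATE permission on the table (dbo.someTable and SomeColumn here are just placeholder names):

UPDATE dbo.someTable
SET SomeColumn = SomeColumn   -- SomeColumn is a hypothetical column
WHERE 1 = 2;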

Now once a plan is in cache, VIEW SERVER STATE is needed to be able to view it. And if you have VIEW SERVER STATE, then you can view the plans that are in cache, even if you don’t have permissions to run the queries.
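Viewing what’s in the cache goes through DMVs like these, which is what VIEW SERVER STATE guards:

SELECT cp.usecounts, st.text, qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp;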

…which brings me to SHOWPLAN.

SHOWPLAN is different to VIEW SERVER STATE – it doesn’t apply to the plan cache. The plan cache is an instance-level thing, and a database permission like SHOWPLAN isn’t going to cut it.

To view the plan of a query that’s not in cache, you need SHOWPLAN permission. And you need to be able to run the query – even if the query isn’t actually going to touch the tables. I wouldn’t mind being able to look at plans to offer tuning advice without having to have permission to run the query, but this is just one of those things.

Sadly, it extends to databases. If a database is referenced by a query, even if it’s not used, then you need to have SHOWPLAN permission on that database (or ALTER TRACE at the instance level, as I mentioned earlier).

So if a view references another database for a lookup, and you haven’t been granted SHOWPLAN there, you won’t be able to see the plan of any query that uses that view. You can have SHOWPLAN permission in the database where your data is, and, using a more privileged account, you could verify that your plan doesn’t even touch the other database. But if the query mentions it at all, you need SHOWPLAN on that database.

The script below will let you reproduce this if you want.
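Something like this – all of the database, login, table, and view names here are my own stand-ins, and it should be run as a sysadmin on a test instance:

CREATE DATABASE MainDB;
CREATE DATABASE LookupDB;
GO
USE LookupDB;
GO
CREATE TABLE dbo.Lookup (LookupID int PRIMARY KEY, Description varchar(50));
GO
USE MainDB;
GO
CREATE TABLE dbo.Facts (FactID int PRIMARY KEY, LookupID int);
GO
CREATE VIEW dbo.FactsWithLookup AS
SELECT f.FactID, f.LookupID, l.Description
FROM dbo.Facts AS f
LEFT JOIN LookupDB.dbo.Lookup AS l ON l.LookupID = f.LookupID;
GO
CREATE LOGIN PlanViewer WITH PASSWORD = 'ChangeMe!12345', CHECK_POLICY = OFF;
CREATE USER PlanViewer FOR LOGIN PlanViewer;
GRANT SELECT ON dbo.FactsWithLookup TO PlanViewer;
GRANT SHOWPLAN TO PlanViewer;   -- SHOWPLAN in MainDB only, nothing in LookupDB
GO
EXECUTE AS LOGIN = 'PlanViewer';
GO
SET SHOWPLAN_XML ON;
GO
-- The redundant join to LookupDB.dbo.Lookup won't appear in the plan,
-- but asking for the plan still fails because we have no SHOWPLAN in LookupDB:
SELECT FactID FROM dbo.FactsWithLookup;
GO
SET SHOWPLAN_XML OFF;
GO
REVERT;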

@rob_farley

 

 

T-SQL v MDX: Pick the hierarchy and populate the report parameter values

When dealing with report parameters, you want the experience to be as seamless as possible for the end user. In SQL Server Reporting Services, we have the tremendously useful feature of dynamic parameters, where you can set a particular report parameter based on the choice of a previous one. For example, if you have parameters such as Country and City, you can set it up so that if you pick Australia as the Country, the City choices are filtered down to Adelaide, Melbourne, and Canberra, but if you choose the UK as your Country, your City choices become London, Brighton, and Dover.

In T-SQL, we do this by making the City parameter use (for its Available Values set) a dataset which is dependent on the Country parameter. Something like “select City from Cities where Country = @country” – it’s quite straightforward.

But when you want to be able to pick something entirely different, the story becomes more complex.

Suppose you want to be able to filter a report either by Size or by Type. Typically you’d do this with two parameters, with an Any or All option near the top of each list. This makes a lot of sense, and even caters for the scenario of filtering the report by both Size and Type. But sometimes this doesn’t quite suit the situation.

Imagine you’re filtering a report by date. You can either filter by the Financial Year, or by the Calendar Year. Or in fact, by the Week, Month, or Quarter. One option is to just pick the Start Date and End Date, and let the user work out when the Financial Year should be – another option is to let the user choose the type of period, and then populate the second parameter with a bunch of choices accordingly.

In T-SQL, we could have a query which leverages a table of numbers to produce a bunch of reporting periods based on the type. Something like this:
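A sketch of that idea – dbo.nums is an assumed numbers table with a column called num, and the date ranges are arbitrary:

DECLARE @ReportPeriodType varchar(10) = 'month';   -- e.g. 'week', 'month', 'quarter', 'year'

SELECT PeriodLabel = DATENAME(month, p.PeriodStart) + ' ' + DATENAME(year, p.PeriodStart),
       PeriodStart = p.PeriodStart,
       PeriodEnd   = p.PeriodEnd
FROM (SELECT PeriodStart = DATEADD(month, -n.num, DATEADD(month, DATEDIFF(month, 0, GETDATE()), 0)),
             PeriodEnd   = DATEADD(month, 1 - n.num, DATEADD(month, DATEDIFF(month, 0, GETDATE()), 0))
      FROM dbo.nums AS n
      WHERE n.num BETWEEN 0 AND 23) AS p
WHERE @ReportPeriodType = 'month'
UNION ALL
SELECT PeriodLabel = DATENAME(year, p.PeriodStart),
       PeriodStart = p.PeriodStart,
       PeriodEnd   = p.PeriodEnd
FROM (SELECT PeriodStart = DATEADD(year, -n.num, DATEADD(year, DATEDIFF(year, 0, GETDATE()), 0)),
             PeriodEnd   = DATEADD(year, 1 - n.num, DATEADD(year, DATEDIFF(year, 0, GETDATE()), 0))
      FROM dbo.nums AS n
      WHERE n.num BETWEEN 0 AND 4) AS p
WHERE @ReportPeriodType = 'year';
-- ...with similar branches for 'week', 'quarter', and the financial year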

In this scenario, we basically choose which of the queries we’re going to run using a variable – a query parameter in SSRS. It works nicely, and thanks to Startup Expression Predicates in the Filter operators, only one of the queries actually runs.

[Image: the query plan, showing Filter operators with Startup Expression Predicates]

The results for ‘month’ look something like this:

[Image: the list of month periods returned by the query]

You may be thinking about the fact that SSRS will only let us have a single value for a parameter entry – not two. This can be overcome by concatenating the two values, and then splitting them in the query, or by having these values stored in a separate table. To concatenate them, just use:
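For example, using style 112 (yyyymmdd) keeps the value easy to pull apart again:

DECLARE @PeriodStart date = '20150101', @PeriodEnd date = '20150201';
SELECT PeriodValue = CONVERT(char(8), @PeriodStart, 112) + CONVERT(char(8), @PeriodEnd, 112);
-- gives '2015010120150201'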

And then in the actual report query, doing something like:
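Perhaps along these lines – the AdventureWorks2012 table is just for illustration:

DECLARE @period char(16) = '2015010120150201';   -- the concatenated parameter value from SSRS
SELECT SalesOrderID, OrderDate
FROM AdventureWorks2012.Sales.SalesOrderHeader
WHERE OrderDate >= CONVERT(date, LEFT(@period, 8), 112)
  AND OrderDate <  CONVERT(date, RIGHT(@period, 8), 112);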

But in Analysis Services, things are done a little differently. We have hierarchies instead, and need to refer to them somehow. We need to be able to pull out items such as [Transaction Date].[Fiscal Calendar].[Fiscal Month].&[201501], or [Transaction Date].[By Week].[Week].&[20141122].

We’re not supposed to create MDX tuples using T-SQL. You could, but it’s not the right approach. It’s almost never the right approach.

I’m going to start by letting the user select a @ReportPeriodType, which is going to be one of the following values:

[Transaction Date].[Fiscal Calendar].[Fiscal Month]
[Transaction Date].[Fiscal Calendar].[Fiscal Quarter]
[Transaction Date].[Fiscal Calendar].[Fiscal Year]
[Transaction Date].[Calendar].[Year]
[Transaction Date].[By Week].[Week]

And then use a query like this:
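Based on the explanation below, the query takes a shape something like this – the cube name [Sales] is my assumption, and @ReportPeriodType is the SSRS query parameter holding the level string:

WITH
  MEMBER [Measures].[MemberName] AS
    StrToValue(@ReportPeriodType + ".Hierarchy.CurrentMember.Name")
  MEMBER [Measures].[MemberUniqueName] AS
    StrToValue(@ReportPeriodType + ".Hierarchy.CurrentMember.UniqueName")
SELECT
  { [Measures].[MemberName], [Measures].[MemberUniqueName] } ON COLUMNS,
  STRTOSET(@ReportPeriodType + ".Members") ON ROWS
FROM [Sales]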

And if I want to filter this to only things from the last 2 years, I could use:
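A slicer along these lines would do it – the [Transaction Date].[Date] attribute hierarchy and the member keys are assumptions, just to show the shape:

// appended after "FROM [Sales]" in the query above
WHERE ( { [Transaction Date].[Date].&[20130601] : [Transaction Date].[Date].&[20150531] } )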

Let’s look at what I’m doing here. I’m passing in a parameter which is a string, which is the level of a hierachy. I’m going to use the members of this level on the Rows axis. To construct that, STRTOSET(@ReportPeriodType + ".Members") does the trick nicely. I could just leave it at that, and use the MEMBER_CAPTION and MEMBER_UNIQUE_NAME properties of each dimension member, but the field name will change according to the level, and I don’t want to have to work out the impact of this in SSRS. It’s much easier to have a couple of calculated members to grab the Level’s Hierarchy’s CurrentMember’s Name and Unique_Name, which I throw onto the Columns axis, and I’m sorted. My WHERE clause works just fine, so long as I’m not grabbing data out of the [Date] hierarchy, for filtering the list of parameters down to a manageable size. If I had a hidden hierarchy, I could easily pass that in with confidence that my report writers aren’t going to try to ask for hierarchy members from it.

Notice that the main part of my query has no mention of the dimension name, or hierarchy name, or anything. This can be used for the members of ANY hierarchy, in ANY dimension. I simply need to pass in the Dimension.Hierarchy.Level string that I want. Here are some of the results if I pass in [Transaction Date].[Fiscal Calendar].[Fiscal Month].

[Image: the results for [Transaction Date].[Fiscal Calendar].[Fiscal Month]]

The dynamic nature of cubes is remarkable, and can be leveraged to make reusable data sets very easily.

@rob_farley