Supermarket Seeks and Scans

At the PASS Summit 2015, I was giving a presentation about Query Plan Operators, and Kalen Delaney (@sqlqueen) was in the audience. She's kind of a big deal in the SQL world – I still remember the first time I met her. It was 2007 and she came up to me and said "I read your blog". I was a little star-struck, but we've been good friends ever since.

In that presentation, I was explaining Seeks and Scans, as I often seem to, and was reminded about the times I wander round the supermarket holding a list of the things I need to get. Because what I'm doing is essentially a Join between my list and the stock in the supermarket. And the way that I implement that join highlights some important ideas in the database world.

Kalen seemed to like my analogy. So much so that over a year later she casually mentioned it on Twitter.

I figured that it was about time that I explained more about this.

Plus, as the topic for this month's T-SQL Tuesday is analogies, hosted by Rob Volk (@sql_r), it's definitely a good time to write about it.

When I'm sent to the supermarket to pick up a lettuce, I know where I'm going. It's in the fruit and vegetables section. I'm good with that. I'll go straight there, pick up the lettuce, and I'm out. I'm not going to wander around – I'm not going to wander down the confectionery aisle – I'm just grabbing the lettuce and leaving. This is somewhat like a Seek.

In fact, it's more like a Seek with TOP 1, because there are probably lots of lettuces, and I'm only going to get a single one. That's taking the analogy a little further, but it still works. It's one of the nice things about good analogies, and I totally think this is one of those. If I want to get a lettuce that is a particular quality of lettuce, then I might have to check a few of them before grabbing one (because the supermarket doesn't sort the lettuces into the good ones v the ones that look like they've been there a while), and that's like having to deal with a Residual Predicate. The more fussy I am, the more I might have to look through, and I risk getting no lettuce even if they have some fairly ordinary ones. If I want to specifically get the best lettuce they have (even if it's awful), then I need to do a Top N Sort on all the lettuces. That might be an expensive operation if there are a lot of lettuces.
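In query terms, those two scenarios look something like this – a minimal sketch against a hypothetical Produce table, assuming an index on ProductName:

    -- "Any decent lettuce will do" – a Seek with a Residual Predicate, stopping at the first match
    SELECT TOP (1) ProductID
    FROM dbo.Produce
    WHERE ProductName = N'Lettuce'   -- Seek Predicate: jump straight to the lettuces
    AND Quality >= 4;                -- Residual Predicate: checked against each lettuce the seek finds

    -- "The best lettuce they have, even if it's awful" – a Top N Sort over all the lettuces
    SELECT TOP (1) ProductID
    FROM dbo.Produce
    WHERE ProductName = N'Lettuce'
    ORDER BY Quality DESC;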

I mentioned a minute ago that I wasn't going to go down the confectionery aisle. Good thing too, if there's a problem there. I'm sure we can all imagine the times when there's a problem down a particular aisle… analogous to a page corruption in a database, but if I didn't have to go there, then I can still do what I need to without being affected.

What if there's some sort of a crisis going on and I need to buy all I can get of something (I'm not meaning like all the toilet paper – in a crisis, other people might need some too). Like all the Ham & Pineapple Pizzas, because we've been asked to cater for a classroom of kids, and those kids don't understand the world yet. But the supermarket understands the world and only ever stocks like, three of them. I'm totally fine with grabbing all three pizzas and putting them in my shopping basket.

But what if that day they have over-ordered and they have fifty? Suddenly I'm needing more memory – I mean, a bigger basket – and I might need to do something differently. I kinda hope that never happens.

Back to when I have a shopping list, rather than a single item. At this point, I'm wanting to join everything on my list with the things that match in the supermarket. If it's a short list, it might be best to find one thing, then the next, then the next, and so on – treating each item as its own little errand, even when that means I grab a lettuce and then grab a cabbage that's right next to the lettuces! If my list is short enough, then that's fine.

When my list is quite long, I'm going to use a different strategy. There comes a time when it's going to be quicker to just walk through the aisles looking for things that are on my list. At first glance that sounds like the "tipping point" with a Seek+Lookup turning into a Scan, but I want to point out that this means we're anticipating having a bunch of rows being pulled into a Nested Loop operator and then doing a Lookup for each one, and that's a Join. Sure, we might decide not to do the join, but I'm looking at the join part for my supermarket analogy.

So if I have a long list I might not want to grab each item individually. Let's think about other options.

One option is to sort the list in my hand into aisle order, which is essentially "section". I know the sort order of the supermarket, so this is fine. I can start with aisle 1, and walk through, keeping my eye out for the things in my list in order. Brilliant. This is a Merge Join. It really is.

And this works pretty well, except that I need to order my shopping list first. That's one of the drawbacks of a Merge Join.

Plus, there are times when I might have picked something up, gone to move to the next section of the supermarket, but then I need to grab something else from that section. So if my sort wasn't down to the point where it's unique in the list, I might need to backtrack, which is really annoying and takes time. Now I'm basically doing a many-to-many join, and a whole ton of efficiency is lost.

Another option is to make sure I can see my whole shopping list, and walk up and down going "Do I need this? Is this on my list?" for every item I come across. At this point I'm doing a Hash Match. It can work, but I need to have that shopping list spread out, and I'm asking myself that question (creating the hash value and doing the probe) about everything.
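For what it's worth, you can force each of these strategies with join hints, which is a handy way to compare the plans side by side – ShoppingList and Stock here are hypothetical tables:

    SELECT l.Item, s.Aisle
    FROM dbo.ShoppingList AS l
    INNER LOOP JOIN dbo.Stock AS s ON s.Item = l.Item;   -- Nested Loops: fetch each item on the list individually

    SELECT l.Item, s.Aisle
    FROM dbo.ShoppingList AS l
    INNER MERGE JOIN dbo.Stock AS s ON s.Item = l.Item;  -- Merge Join: both sides in aisle order, one pass through

    SELECT l.Item, s.Aisle
    FROM dbo.ShoppingList AS l
    INNER HASH JOIN dbo.Stock AS s ON s.Item = l.Item;   -- Hash Match: spread the list out, test every item against it

Hints like these are mostly useful for exploring the plans – the optimizer normally makes this choice for you, just like I do at the supermarket.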

One nice thing though, is that scenario where I don't know how long the list is because I'm getting text messages as I'm walking in. So I can start spreading out the list, thinking that a Hash Match might work out well, bracing myself for a long walk up and down all the aisles, and then when it turns out the list is short, I can decide to go to each item individually. That's Adaptive, and it's really handy when you don't know how much data you're going to be dealing with.

Shopping in a supermarket is obviously very different to querying a database. But the underlying concepts behind how we pull the right goods from the shelves definitely have some strong similarities, as I hope I've shown here. Analogies can help you learn principles by hanging them on concepts you already know. Maybe next time you go to the supermarket, you'll get a little better at understanding how your queries run.

@rob_farley

Why the PASS Virtual Summit 2020 is ridiculously good value for money

As a user group leader, I've probably mentioned to the people in my user group over a hundred times that the PASS Summit is excellent value, even if you have to pay to fly to America from Australia, stay in a hotel, and lose a week of billable time. The benefits you can get from spending time with the biggest names in the Microsoft data community are huge.

Obviously it's harder to spend time with people from the community when you're just interacting with them through a computer screen (although why not get used to that – if you can get the hang of chatting to these people through your screen, that can carry on all year!), but this is only part of the story I give as to why the PASS Summit is such good value.

The main reason why it's excellent value is the SQL Clinic (known these days as the Azure Data Clinic).

The clinic was always a great reason to have the PASS Summit in Seattle – it was simply easier for Microsoft to have a bunch of the people that already live in Seattle don white coats and hang out at the Summit around whiteboards, just so that attendees could wander up and get free consulting time (okay, they still need to pay to be at the Summit, but with no extra cost). I remember seeing former clients of mine there, who flew to Seattle from Sydney to attend the Summit and didn't sit in a single session because they (two of them) spent the whole three days in front of a whiteboard working through problems with two or three people from the CAT team and product group.

For two people to fly to Seattle from Sydney, stay in hotels, and pay for the Summit entrance fee, the cost would've been several thousand dollars. But the value of the consulting they got for that would've been significantly more.

Fast forward to 2020, and the Summit is virtual. So there are no flights to buy. No hotels to use. And the entrance fee is much lower.

But the clinic is still happening. It's mentioned right there on the "Microsoft at Summit" page.

The biggest pain might be the time zone, because I'm guessing those Microsoft people might not be available around the clock. But if I want that free consulting, I'm going to sacrifice the wee small hours of the morning (sorry, there's an instrumental cover version of that song playing while I write this) for it. These opportunities don't happen every week, and it's going to be worth sacrificing some sleep if I have some stuff to solve.

I've heard people complaining that the cost of the PASS Virtual Summit is really high, considering that it's an online event. But I don't think those people have noticed what you can get for the US$599.

I think the conversation goes like this: "Hey boss – for US$599 I can get access to Microsoft people for three days, as well as being able to attend all the conference sessions."

I suspect your boss will have stopped listening before you reach the "as well as…" bit of that.

So… I'll see you at the Summit?

@rob_farley

Tell me who to give PASS Pro memberships to

A couple of weeks ago I said I'd bought some PASS Pro memberships to give away. If you haven't read about it, take a moment to follow that link, and then email me the details of someone who could use one. If you're worried I haven't received it, you could send me a DM on Twitter if you like, or via LinkedIn. Just let me know.

I don't care where in the world you are – tell me who you think deserves to get one of these regardless.

Now, after some negotiating with PASS HQ, I've arranged for the year on these memberships to start counting from the end of September (rather than from when I paid for them in August). But don't think about the timeframe, just let me know who.

@rob_farley

That Script button in SSMS / ADS

Elizabeth Noble (@sqlzelda) asks us to write about automation this month. I'm a big fan of automation – I know that if I just rely on muscle memory to do things the same way, I'll make mistakes. We all make mistakes from time to time, and I'm certainly not immune.

So automation is a really big deal, but even more so, using scripts to do things instead of using a user interface.

User interfaces are great, but I simply don't want to have to remember to do everything the same way each time.

To that end, I want to wax lyrical for a moment about the Script button on most dialog boxes in SQL Server Management Studio (SSMS), and make a quick mention of what things (currently) look like in Azure Data Studio (ADS). (I say "currently" because ADS is still under very active development and could have changed even by the time I publish this.)

First SSMS…

By default, if you want to drop a database in SSMS, you go to the Object Explorer, find the database in question, right-click on it, choose Delete (seriously – that should say "Drop", as Delete is a data manipulation term, not a data definition term), pick the odd option in the dialog, and hit OK. You could try it now, but preferably not in a production environment.

And then if you're teaching someone how to do this, you explain to them what to choose for the various options, and hope they do it correctly. Restoring a database (which you'll want to do if you just dropped a database you shouldn't have – sorry about that) is similar. You get a dialog box, you pick the options you want, and you hit OK.

The hassle is that if you use a different set of options you get a very different behaviour. If you're dropping a database and you don't use the "Close existing connections" option, you might have to wait for a while for the drop to happen. Or if you use it when you didn't mean to, some users might not appreciate having their transactions roll back.

Drop database dialog

Conversely, if you're doing a restore database and you forget to choose between RECOVERY / NORECOVERY / STANDBY, then you might inadvertently interrupt a recovery process and lose precious time as you start it all over again.
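That choice comes down to a single keyword in the script (the names and paths here are just examples):

    RESTORE DATABASE MyDB
    FROM DISK = N'C:\Backups\MyDB.bak'
    WITH NORECOVERY;   -- or RECOVERY to bring it online, or STANDBY = N'C:\Backups\MyDB_undo.dat' for read-only access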

Instead, you should have used the Script button. I'm not a fan of the OK button on these dialogs at all. People should use Script and then Cancel, and then run the script that was generated.

Now, I get a useful script which I can then run, and I can keep it for next time too.

Drop Database Script
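For a database called, say, MyDB, with "Close existing connections" ticked, the generated script looks something like this:

    USE [master]
    GO
    ALTER DATABASE [MyDB] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
    GO
    DROP DATABASE [MyDB]
    GO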

But ADS does it differently again. If I right-click on a database there, I don't get a Delete option at all (although there are extensions available to provide that). But if I go to an individual table, there's an option to immediately produce a Script for Drop.

Drop Table script

I'm sure in time ADS will provide all the functionality of SSMS, including something to generate a Drop Database script complete with all the options that SSMS provides. In the meantime, I will continue to use dialogs to create scripts (enhancing them as much as I feel I need), and then hold onto them for 'next time', or for when I get someone else to run it.

…or for when I automate it.

@rob_farley

PASS Pro Memberships for deserving people

I just bought seven PASS Pro Memberships, and will give most of them away.

I will be inviting my community (starting with the Adelaide Data & Analytics User Group that I run, but open to everyone) to nominate anyone they feel should benefit by my giving them a year of PASS Pro Membership. You can't nominate yourself – I want you to think of people who have been made redundant recently, or who generally have a harder time in life than others. Lots of people have lots of struggles, and if you know someone, then I invite you to nominate them. Do that by emailing me at "rob at lobsterpot.com.au" and making a case for them. I have to get you to make a case for them, because I don't know how many nominations I will get, and I can't afford to buy one for everyone. I will choose who gets them, and it will be completely subjective – I have no idea what criteria I will use to select the people yet.

There will be no strings attached in either direction on these giveaways. If someone is unable to receive one because they are a government employee or whatever, then it's up to them to turn it down. If there are taxes or something owed because of it, then that's also on the recipient. Also, PASS have their own terms and conditions on having a PASS Pro Membership, which the recipient would need to agree to, and they will need to become a PASS member (the free level) before they can receive it.

The idea for this came from a conversation I was in with some people who were looking into giving away a ticket to this year's PASS Virtual Summit to someone in their community. I mentioned that my preferred way of finding a deserving person is via a nomination process. And I figured that as I would normally be sponsoring my local group by buying pizza for the attendees, I could spend the money in a different way, and direct it to people who need it, rather than just people who want pizza. And I think the PASS Pro Membership is at a level where I can give to more people.

So have a think, and email me about people, telling me why they should get one. Maybe put "PASS Pro Membership nomination" and the person's name in the subject line, so that I can spot them more easily.

@rob_farley

SQL Community time capsule

Tamera Clark (@tameraclark) is hosting T-SQL Tuesday this month (thanks!), and wants us to nominate something to put into a SQLCommunity time capsule. She says "What do you want others to know about the #SQLCommunity and #SQLFamily? What do you want the finders of this time capsule to know about our current goings-on? What would you like the finder to know about you, how would you like to be remembered? Even if it's a piece of rockin' code that you think should live on forever, include it."

I had to think about this one. As much as I like the double OUTER APPLY (SELECT TOP (1)…) thing that I do, especially for protecting my base tables in warehouse loads, I'm not about to suggest that it would be worthy of a time capsule. However, when I think about the general SQLCommunity side of things, my mind goes to conferences and the things that go on there.

I've been speaking at conferences overseas for over ten years, and at conferences within Australia even longer. The learning that goes on there is fun, but people also make an effort to let their individuality shine through. I think of the costume parties at SQLBits (officially starting with the Robin Hood theme at Nottingham, although there were Viking hats at York – the photo below was taken by Martin Bell, and managed to capture the fact that I was keeping a doughnut for later), and the community area at PASS Summit, where people have been known to dress up in various things.

I haven't been to too many costume parties at SQL events. The only SQLBits event I've been to since the themes really kicked in was the Magic one, and I learned a few tricks, rather than coming in a costume. That was fun. I suspect there should be something relating to costumes in a SQLCommunity time capsule.

But one thing that I often think about regarding PASS Summit is the ribbons. I had never seen conference ribbons before 2010 and my first PASS Summit, when I was given ribbons to indicate that I was an MVP, a Chapter Leader, a Speaker, and a FirstTimer. I don't remember if there were others, but my mind immediately wondered why people didn't just bring their own.

So I did.

The next year, I got a whole bunch of custom ones of my own. Not to give away, just to wear. They were mostly red to go with my company branding, and I wrote anything I could think of on them. PASS gave me seven to wear that year, because I was on the PASS Board, delivering a precon, and was an author on the MVP Deep Dives book, but I added a bunch of new ones each day of the conference. By the Friday the chain of ribbons was down to my knees, and it was becoming hard to keep it from breaking. I only have a picture from early on though.

The idea seemed to catch on and by the following year lots of people brought their own, including plenty to give away. Nowadays it's quite common for people to have a whole range of different ribbons attached to their conference badges, but I only wear the official ones. I figure it was my thing in 2011, and now my thing is to not have custom ones.

People (and sponsors) do a great job with what they put on the custom ribbons they make, and I think they've really become part of the culture of the PASS Summit. I see people collecting as many as they possibly can, and the creativity around them is terrific. I hear people saying "Ooh, where did you get that one from?", which is a great conversation ice-breaker in itself.

So I think my nomination for inclusion in the SQLCommunity time capsule is a collection of creative conference ribbons. I wouldn't be able to provide it, but I'm sure there are people who have kept them over the years.

@rob_farley

Not speaking, but still supporting PASS Summit

First up – I'm not speaking at the PASS Summit this year. The virtual one in 2020. But I've registered and will attend, because the PASS Summit is still worthwhile.

I had made a big deal about submitting – I wrote a long thread about it on Twitter. I put three sessions in and hoped to be selected. But I wasn't and that's fine. That's not what this post is about.

Those of you who know me well might know that I'm not even someone who typically enjoys attending sessions at a conference. It's not my preferred way of learning, and even just sitting in a conference theatre chair for a long time gets to me and makes the learning experience hard. If I'm attending your session, I'm doing it because I'm interested in you more than in your content. For me, attending a conference is about what happens outside the sessions, not inside.

So if I'm not interested in the content of the event, and I'm not speaking, and I don't get to see people in person, what's the appeal of the PASS Summit this year? Why should I register and attend? Why should you?

And the answer is the same as if it was a real one – for the community and the networking.

PASS has always been about the community and providing ways for people to Connect, not just to Share or to Learn. "Connect, Share, Learn" has been a PASS motto for ages, but I think sometimes people see PASS as being all about the content, when it's really about the connection.

Many conferences are all about the content, and that's fair enough when you consider that it's usually employers that pay for people to attend these events. Attendees are often told to bring back knowledge, and maybe even give presentations to colleagues about what they've learned.

And yet for the PASS Summit, I'm not sure that content is the main draw.

I don't think content is what it was. When I first attended a large technical conference, which would've been TechEd Australia 1999, I was getting content that I couldn't really get anywhere else. A colleague and I flew from Melbourne to Brisbane, and we split the sessions we wanted to attend so that we got good coverage of the content. The parties were fun, and there was plenty of swag to be had from sponsors, but I didn't try to build a network at all.

By the time I got back to TechEd Australia it was 2005 and I had been actively involved in my user groups for a while. I spent time meeting people, especially presenters and other influencers, and got a sense of where things were moving. Instead of trying to catch a glimpse of what was going past, I tried to get caught up in the stream. By the time TechEd 2006 had come around, I had made a few trips to Redmond, I was involved in the hands-on labs, and my career was different.

The content from TechEd Australia 2006 was mostly already available through blog posts, articles, and videos from other events. But the networking was something I couldn't get in the same way.

PASS makes no bones about the networking side. They promote the "SQLFamily" concept enthusiastically. They provide activities to help first-time attendees get to know people. The content is still important, but the focus is on community. It is a community-focused organisation, after all.

This is what makes this year's PASS Summit tricky. Content-wise, people can get a lot of similar content from elsewhere. The sessions themselves are unique, but I'm sure that many sessions will have significant overlap with other sessions that have been given elsewhere. It's simply how presentations work. But without people gathering in person, that networking side will be hard. What will make the PASS Summit different this year, and different to most other online events, is that they are pushing to find ways to let people interact with each other despite the online-only format. You might not be able to physically walk up to a presenter at the end of their session, but they will still be available for questions, conversations, all that. With a wider reach possible because of the online approach, allowing networking between people in different parts of the world, it could be very significant.

Besides the presenters, many people who aren't presenting this year (myself included) will be hanging around. Time zones might make this trickier for some, considering that daytime in America is night-time where I am, but I'm typically jet-lagged at these things anyway. I'm hoping to be able to hang out in virtual rooms to see some of the old faces, meet new people, and be part of discussions. Most of the learning we do in life is through conversations not conventions, and I don't want that to stop just because we can't meet in person.

So I've registered as an attendee for the Virtual PASS Summit 2020. Hopefully I'll get to see you there.

@rob_farley

Getting started with Power BI Push Datasets via REST APIs

Real-time Power BI datasets are a really useful feature, and there's a good description of them at https://docs.microsoft.com/en-us/power-bi/connect-data/service-real-time-streaming. I thought I'd do a quick walkthrough specifically around the Push side, and show you how it works – including the odd gotcha that you might not have noticed.

To create a dataset that you want to push data into, you need to go to the Power BI service, go to your Workspace, and create a Streaming dataset. Even if you're not wanting to use it with a streaming service, this is the one you need.

Having said I want to create a dataset, I need to tell it that I want to use this via the API, rather than using Azure Stream or PubNub (for more on those, check the documentation link above). I also need to give the dataset a name and define the columns. I only get to put in numbers, text, and datetime values, so for now I'll use one of each. I'm also going to tell it that I want to enable historical data analysis. You should be careful about whether you pick this or not, as we'll see in a moment.

Once we hit the Create button, we get to a screen that gives us the Push URL and some sample code. This is really useful because you can immediately put some code together to start pushing data in. I can click on that window to grab the code and paste it straight into a script that runs regularly.

(Don't bother trying that particular endpoint, by the way. I created it for this post and it was long gone before this post was finished.)

This PowerShell could be kicked off by a number of things. I could poll a database, or have it run when certain conditions are met, or as part of a wider process. And if PowerShell isn't my thing, then I can use some other method – I just need to post data into a URL.
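For reference, the generated sample is only a few lines of PowerShell along these lines (the endpoint and the column names here are placeholders – use the Push URL and fields from your own dataset):

    $endpoint = "https://api.powerbi.com/beta/<tenant-id>/datasets/<dataset-id>/rows?key=<key>"
    $payload = @{
        "Name"      = "Example"                                        # the text column
        "Value"     = 98.6                                              # the number column
        "Timestamp" = (Get-Date).ToUniversalTime().ToString("s") + "Z"  # the datetime column
    }
    Invoke-RestMethod -Method Post -Uri $endpoint -Body (ConvertTo-Json @($payload))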

Let's talk a bit more about that Historical Data Analysis option.

If I don't choose that option, then the data hangs around for an hour and then disappears. That might be fine, but it might not. Also, I only get to use this data in a dashboard tile with a handful of visuals – not in reports at all.

If I do choose the Historical Data Analysis option, then 200k rows of data will hang around for me (that's not a configurable value), and I can even use the data in reports.

Let's explore this.

In a dashboard there's an Add tile option under the ellipsis.

Amongst the tiles that I can add is a Real Time tile for Custom Streaming Data.

The fact that it's Real Time means that it will refresh when the dataset underneath refreshes – which is absolutely what we want here if we're going to be showing this on a screen somewhere. Then we can choose the dataset we want, but the tiles we can use are very limited. Only Card, Gauge, and a few types of chart. And things are limited even more, because a Card can only show the number data, not text or datetime.

Suppose I tell it to display a Card, and then I run the PowerShell code repeatedly to push newer and newer data in – the tile updates with the new data almost immediately and this is great. Someone viewing this dashboard can get near-real-time updates from an application or script.

This happens regardless of whether or not we chose to do historical data analysis. It will show the latest row pushed in, regardless of whether the datetime column is older or newer. We push a new row, it replaces what's in that tile. When this is all I want to do, I don't even care too much about sending a timestamp column. Any data I send will disappear an hour later, and that's fine for me.

But if we did choose that option, we can also create a report based on the dataset.

However, you can't do this in Power BI Desktop. You have to do it from the service, where your dataset was defined.

You can either choose New Report and then pick the dataset from the list, or (especially if you can't see it in the list), you can pick "Create Report" from the dataset's ellipsis.

Now I can create all my usual visualisations based on this data, showing multiple rows, and all that. In my example, because I didn't change the timestamp value every time, I need to tell it explicitly "do not summarize" the number value, because I can't edit the dataset except to change the field names and types. I certainly can't do anything like setting default aggregation types. Also notice that the dataset contains a table called RealTimeData. I didn't choose that – it was set for me.

At this point I find myself wondering about that 200k rows that it's holding onto, and filtering to "Top N" can be handy, or filtering on the datetime column for relative time (being careful about UTC settings). Otherwise, you might find yourself inadvertently showing more data than you'd like.

But if I need to clear out my dataset at the start of each day, then that's an option too. I just need to do it deliberately. Luckily, Power BI's REST API is good for that too, and I can easily delete all the rows in a dataset with a few lines of PowerShell.
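Something like this does the job – a sketch using the documented DeleteRows endpoint; the dataset ID is a placeholder, the table name is the RealTimeData one we saw earlier, and you'll need an access token from however you normally authenticate to Power BI:

    $uri = "https://api.powerbi.com/v1.0/myorg/datasets/<dataset-id>/tables/RealTimeData/rows"
    Invoke-RestMethod -Method Delete -Uri $uri -Headers @{ Authorization = "Bearer $accessToken" }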

Clearing this will cause a dashboard tile to empty immediately, just like when we pushed the new data in.

But what about the report visual? That doesn't refresh automatically. To get the report to show the latest data you need to do a refresh.

However, you can pin the visual to a dashboard, and see it refresh automatically.

I made a simple loop to send a row, wait a second, and repeat, and the response that I got was pretty useful. I had the visual filtered to the Top 3 based on the latest timestamp.
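The loop was nothing clever – something along these lines would do it, reusing the $endpoint from earlier:

    while ($true) {
        $payload = @{
            "Name"      = "Example"
            "Value"     = (Get-Random -Minimum 1 -Maximum 100)
            "Timestamp" = (Get-Date).ToUniversalTime().ToString("s") + "Z"
        }
        Invoke-RestMethod -Method Post -Uri $endpoint -Body (ConvertTo-Json @($payload))
        Start-Sleep -Seconds 1
    }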

Now, it would be nice if reports could be set to auto-refresh when new data is pushed in, but having it in a dashboard is still very useful.

You see, streaming datasets don't have to involve Azure Stream, but they can involve all your usual visuals if you choose the option for historical data analysis – it's just that you won't see your visuals refresh automatically unless you use a dashboard. Now that you understand these aspects of what's going on and the couple of extra hoops you need to jump through, hopefully you can see a good use for these types of datasets. You can decide whether you need to clear rows occasionally, figure out how you want to visualise your data, and really start to make push datasets your own.

@rob_farley

That time the warehouse figures didn't match

I don't write about situations at my customers. It's just a rule I have. Regular readers may have heard me say it before. So if I'm writing about a problem, it's much more likely that it's something I've come across elsewhere, or a general thing. This month's topic for T-SQL Tuesday (hosted by Kerry Tyler (@airbornegeek)) is about "crash reports" – the stuff that we've learned from disasters that can teach us (and others) about how to avoid disaster in the future.

So it's a bit of a tricky topic without writing about something that's happened at a customer. I'm still not going to write about what happened at a customer, but I am going to write about a problem that I've seen plenty of times, and what I do to avoid it.

It's about not sufficiently protecting the data that is being populated into a warehouse. I've been guilty of this before, and I continue to try to do better.

The scenario is that there's a source system, and there's a table in a warehouse that is being used to report on it. Maybe it's being populated by Integration Services or Data Factory. Maybe it's being populated by T-SQL. I don't really care. What I care about is whether the data in the warehouse is a true representation of what's in the source system.

If it's not a true representation, then we have all kinds of problems.

Mostly, that our warehouse is rubbish.

Seriously – if the data in the warehouse can't be shown to be correct, then what's the point of it? People won't want to use it, because they won't want their reports to be wrong. If they use it because it's the best they've got, knowing the data is sometimes wrong, then anything that comes from it will need further analysis and the inaccuracies will be a constant frustration. A warehouse needs to be correct.

It should be easy to validate too. It should be easy to pull totals from the source system (which could be a staging environment – it doesn't have to be production, it just has to be trusted to be correct), to see if they match what's in the warehouse. It should be possible to test this after each refresh of the warehouse data, and alerts should be raised if there's a problem.
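The check itself can be as simple as comparing daily totals from both sides – a sketch, with made-up table and column names:

    SELECT COALESCE(src.SaleDate, wh.SaleDate) AS SaleDate,
           src.Total AS SourceTotal,
           wh.Total  AS WarehouseTotal
    FROM      (SELECT SaleDate, SUM(Amount) AS Total FROM Staging.Sales GROUP BY SaleDate) AS src
    FULL JOIN (SELECT SaleDate, SUM(Amount) AS Total FROM dbo.FactSales GROUP BY SaleDate) AS wh
           ON wh.SaleDate = src.SaleDate
    WHERE EXISTS (SELECT src.Total EXCEPT SELECT wh.Total);  -- any rows returned mean the totals don't line up

If that query returns anything after a load, raise an alert.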

So there's a problem, and you probably have a good idea about how to detect it. But to be a "crash report", we need to explain what caused it in the first place. There could be several factors involved, but in the situation(s) I'm remembering, it came down to not protecting the base table.

An ETL process gets data from source system(s) into a warehouse, doing some amount of transformation to get it into the right shape, because the things that we're counting might need information from other tables in the source to work out how they should be represented in the warehouse tables, etc., etc., but the whole way along, we need to be protecting that data. And particularly from JOINs.

A join isn't a bad thing. It lets us get to the relevant data from other tables. It's very useful and you learn about it very early on in a T-SQL class. If you're using ADF or SSIS to do your ETL, you might have T-SQL in a source query. Or you might be pushing data through a Lookup or a Merge. You're almost certainly going to be needing information from another table. And you need to take care with that join.

You see, joins don't just pull data from other tables. They can also filter your data, if you're doing an inner join. They might make extra copies of the rows in your base table, if you hit multiple rows on the other side. You might even introduce rows full of nulls if you're doing a right outer join or full outer join.

This is a very real problem that I've had to deal with.

In an ideal world, every row in my base table goes through to my warehouse as exactly one row. But these three things that joins do… well, they can trip us up.

It's probably quite easy to stop pulling in null rows, because you're probably not using many right outer joins or full outer joins.

As for the filtering – well, it's easy to use left joins everywhere… except that you will need to be careful that you don't accidentally eliminate any nulls you've picked up from those left joins by using them in another join or a filter somewhere. You might figure that you have foreign keys in place to make sure that you can't filter any rows with your join, but you'll need to make sure that the column in your base table doesn't allow nulls, even if it's currently always populated.

Hitting multiple rows should be easy to handle, because you can make sure the join predicates you're using can only hit one row. But this isn't always straightforward either. If you're referring to a primary (or other enforced-to-be-unique) key, then great. But will you be told if that changes? Or what if it's more complicated than that? Suppose you're joining to a row that describes a date range (like a dimension table with startdate and enddate), but somehow you've got overlapping ranges?

I've seen too many crash reports that cover this. Some have been of my own making. It's not always in warehouses, sometimes it's in basic queries used for other things, and I've failed to protect the base table.

What I try to do is to make sure that it's impossible for my joins to impact the number of rows in my base table. This means knowing what I consider the base table to be in the first place even, but then being cautious about the way I'm joining. I often find myself replacing joins with "outer apply (select top (1) …)", even when I'm pretty sure that it should be okay. And this might impact the performance of my query – but correctness is critical. Plus, depending on how it's being used, I can often find ways to ensure that performance is even better this way (because the Query Optimizer will know that the number of rows isn't being affected by the join).
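To make that concrete, rather than joining to a date-range table and hoping the ranges never overlap, I'll often write something like this (the tables here are hypothetical, and the ORDER BY decides which row wins if they ever do overlap):

    SELECT s.SaleID, s.Amount, c.Region
    FROM Staging.Sales AS s
    OUTER APPLY (
        SELECT TOP (1) ch.Region
        FROM dbo.CustomerHistory AS ch
        WHERE ch.CustomerID = s.CustomerID
        AND s.SaleDate >= ch.StartDate
        AND s.SaleDate <  ch.EndDate
        ORDER BY ch.StartDate DESC
    ) AS c;

The OUTER APPLY can never remove rows from the base table, and the TOP (1) means it can never multiply them either – every row comes through exactly once.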

Disasters are no fun. I'm all about making sure that my customers can trust their data. If my query gives incorrect results, they lose confidence in future queries that I write, and possibly in their data environment in general. It's critical that I get it right.

My "crash report" today is stopping before we reach the point of the customer losing confidence in things, but I can assure you that I've seen this happen. Today, I'm choosing those times when we've noticed what's going on and mitigated things before the disaster takes hold. But I implore you – protect your base tables. It really is very important.

@rob_farley

A tip about shortcuts

This month's T-SQL Tuesday brings another "Tips & Tricks" topic, this time from Kenneth Fisher (@sqlstudent144). Kenneth's twist, to make it different from previous ones, is that he insists that anything we suggest must not be SQL-related.

It reminds me of the Shortcuts piece that Jess Pomfret (@jpomfret) prompted four months ago, where I wrote about a language shortcut I remembered from high school. Well, since then I've picked up Duolingo, and have now spent time each day for about eight weeks learning Polish (I've been a little curious ever since I went there last year), and completed the 'intro level' for all 34 of the other languages they have on the site. I find languages fascinating, and often wonder why I didn't find phone apps for learning them years ago. I have to admit that I still need to learn the endings of Polish words much better than I do now, figuring out which cases end in -ego and which end in -ami. I'll keep improving though, and one day hopefully I'll be able to spend more time in other languages too.

One thing I discovered recently is how you can seem much more natural in a language by being across the shortcuts that its speakers use. My shortcut post from February talked about how I used the pronoun 'on' in French to avoid having to remember so many verb conjugations, but in Polish there are some things that are even better. The most obvious of which is how you say Happy Birthday.

If I look up the Polish translation for Happy Birthday, it tells me "Wszystkiego najlepszego z okazji urodzin" (literally "All the best on the occasion of your birthday"). Don't be surprised at how scary that seems to pronounce, as long combinations of consonants are all the rage in Polish. I really can't tell how to say "Wszystkiego" properly – I know "w" sounds like a 'v', and that 'sz' sounds like 'sh', but 'vsh' doesn't seem any easier to say either.

But what Polish people say (as I spotted on Facebook by seeing how Polish people say Happy Birthday to Polish people) is "sto lat". Just six letters! And it's way more Polish than saying "Wszystkiego najlepszego z okazji urodzin". "Sto lat" literally means "a hundred years" (which means you can even get away with writing "100 lat") because the cultural thing to do is to tell someone you hope they live to be a hundred years old. Polish people realise this doesn't translate properly into English, so they don't say "100 years" to English-speaking people when it's their birthday.

So to me it's a great shortcut that has let me impress some of my Polish friends.

One thing I've found in life is that shortcuts betray a deeper knowledge of something, whether or not you actually have that deeper knowledge. When I used to catch taxis to the airport, I could tell if the driver was a local or not by the roads they'd take. Satnavs always recommend a slower route between my place and the airport, thinking that a wider road will be faster than the parallel narrower one despite having more traffic lights. The traditional "It's okay I know a shortcut" when driving suggests you're more familiar with the area than the passengers, and is the kind of thing tested in London taxi drivers' "The Knowledge" (sorely missing in my town).

Fluency in any environment, whether navigating streets, a language, or even a computer system, is most typically demonstrated by the tips, tricks, and shortcuts that you develop. It implies that you've learned things that work better for you, and a style that works for you. That you're a local. When I say "sto lat" to someone in Poland, they can tell that I've made the effort to learn what they do (even though "Wszystkiego najlepszego z okazji urodzin" would be much harder).

So take the time to learn shortcuts, particularly in the tools that you are starting to use more regularly. Explore your neighbourhood. Know what's around you, and then do things the best way.

Be a local.

@rob_farley