Not speaking, but still supporting PASS Summit

First up – I'm not speaking at the PASS Summit this year. The virtual one in 2020. But I've registered and will attend, because the PASS Summit is still worthwhile.

I had made a big deal about submitting – I wrote a long thread about it on Twitter. I put three sessions in and hoped to be selected. But I wasn't and that's fine. That's not what this post is about.

Those of you who know me well might know that I'm not even someone who typically enjoys attending sessions at a conference. It's not my preferred way of learning, and even just sitting in a conference theatre chair for a long time gets to me and makes the learning experience hard. If I'm attending your session, I'm doing it because I'm interested in you more than in your content. For me, attending a conference is about what happens outside the sessions, not inside.

So if I'm not interested in the content of the event, and I'm not speaking, and I don't get to see people in person, what's the appeal of the PASS Summit this year? Why should I register and attend? Why should you?

And the answer is the same as if it were an in-person event – for the community and the networking.

PASS has always been about the community and providing ways for people to Connect, not just to Share or to Learn. "Connect, Share, Learn" has been a PASS motto for ages, but I think sometimes people see PASS as being all about the content, when it's really about the connection.

Many conferences are all about the content, and that's fair enough when you consider that it's usually employers that pay for people to attend these events. Attendees are often told to bring back knowledge, and maybe even give presentations to colleagues about what they've learned.

And yet for the PASS Summit, I'm not sure that content is the main draw.

I don't think content is what it was. When I first attended a large technical conference, which would've been TechEd Australia 1999, I was getting content that I couldn't really get anywhere else. A colleague and I flew from Melbourne to Brisbane, and we split the sessions we wanted to attend so that we got good coverage of the content. The parties were fun, and there was plenty of swag to be had from sponsors, but I didn't try to build a network at all.

By the time I got back to TechEd Australia it was 2005 and I had been actively involved in my user groups for a while. I spent time meeting people, especially presenters and other influencers, and got a sense of where things were moving. Instead of trying to catch a glimpse of what was going past, I tried to get caught up in the stream. By the time TechEd 2006 had come around, I had made a few trips to Redmond, I was involved in the hands-on labs, and my career was different.

The content from TechEd Australia 2006 was mostly already available through blog posts, articles, and videos from other events. But the networking was something I couldn't get in the same way.

PASS makes no bones about the networking side. They promote the "SQLFamily" concept enthusiastically. They provide activities to help first-time attendees get to know people. The content is still important, but the focus is on community. It is a community-focused organisation, after all.

This is what makes this year's PASS Summit tricky. Content-wise, people can get a lot of similar content from elsewhere. The sessions themselves are unique, but I'm sure that many sessions will have significant overlap with other sessions that have been given elsewhere. It's simply how presentations work. But without people gathering in person, that networking side will be hard. What will make the PASS Summit different this year, and different to most other online events, is that they are pushing to find ways to let people interact with each other despite the online-only format. You might not be able to physically walk up to a presenter at the end of their session, but they will still be available for questions, conversations, all that. With a wider reach possible because of the online approach, allowing networking between people in different parts of the world, it could be very significant.

Besides the presenters, many people who aren't presenting this year (myself included) will be hanging around. Time zones might make this trickier for some, considering that daytime in America is night-time where I am, but I'm typically jet-lagged at these things anyway. I'm hoping to be able to hang out in virtual rooms to see some of the old faces, meet new people, and be part of discussions. Most of the learning we do in life is through conversations not conventions, and I don't want that to stop just because we can't meet in person.

So I've registered as an attendee for the Virtual PASS Summit 2020. Hopefully I'll get to see you there.

@rob_farley

Getting started with Power BI Push Datasets via REST APIs

Real-time Power BI datasets are a really useful feature, and there's a good description of them at https://docs.microsoft.com/en-us/power-bi/connect-data/service-real-time-streaming. I thought I'd do a quick walkthrough specifically around the Push side and show you how it works, including the odd gotcha that you might not have noticed.

To create a dataset that you want to push data into, you need to go to the Power BI service, go to your Workspace, and create a Streaming dataset. Even if you're not wanting to use it with a streaming service, this is the one you need.

Having said I want to create a dataset, I need to tell it that I want to use this via the API, rather than using Azure Stream or PubNub (for more on those, check the documentation link above). I also need to give the dataset a name and define the columns. I only get to put in numbers, text, and datetime values, so for now I'll use one of each. I'm also going to tell it that I want to enable historical data analysis. You should think carefully about whether to pick this, as we'll see in a moment.

Once we hit the Create button, we get to a screen that gives us the Push URL and some sample code. This is really useful because you can immediately put some code together to start pushing data in. I can click on that window to grab the code and paste it straight into a script that runs regularly.

(Don't bother trying that particular endpoint, by the way. I created it for this post and it was long gone before this post was finished.)
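The sample code it generates is essentially an Invoke-RestMethod call that posts a JSON array of rows to that Push URL. A rough sketch of the shape of it is below – the endpoint is a placeholder, and the column names (MyText, MyNumber, MyDateTime) are just stand-ins for whatever you defined on your own dataset:

```powershell
# Placeholder push URL - yours will contain a real dataset ID and key
$endpoint = "https://api.powerbi.com/beta/<tenant-id>/datasets/<dataset-id>/rows?key=<key>"

# One row, matching the columns defined on the dataset (one text, one number, one datetime)
$payload = @{
    "MyText"     = "Some value"
    "MyNumber"   = 98.6
    "MyDateTime" = (Get-Date).ToUniversalTime().ToString("o")
}

# The endpoint expects a JSON array of rows, hence the @( ) wrapper
Invoke-RestMethod -Method Post -Uri $endpoint -Body (ConvertTo-Json @($payload))
```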

This PowerShell could be kicked off by a number of things. I could poll a database, or have it run when certain conditions are met, or as part of a wider process. And if PowerShell isn't my thing, then I can use some other method – I just need to post data into a URL.

Let's talk a bit more about that Historical Data Analysis option.

If I don't choose that option, then the data hangs around for an hour and then disappears. That might be fine, but it might not. Also, I only get to use this data in a dashboard tile with a handful of visuals – not in reports at all.

If I do choose the Historical Data Analysis option, then 200k rows of data will hang around for me (that's not a configurable value), and I can even use the data in reports.

Let's explore this.

In a dashboard there's an Add tile option under the ellipsis.

Amongst the tiles that I can add is a Real Time tile for Custom Streaming Data.

The fact that it's Real Time means that it will refresh when the dataset underneath refreshes – which is absolutely what we want here if we're going to be showing this on a screen somewhere. Then we can choose the dataset we want, but the tiles we can use are very limited. Only Card, Gauge, and a few types of chart. And things are limited even more, because a Card can only show the number data, not text or datetime.

Suppose I tell it to display a Card, and then I run the PowerShell code repeatedly to push newer and newer data in – the tile updates with the new data almost immediately and this is great. Someone viewing this dashboard can get near-real-time updates from an application or script.

This happens regardless of whether or not we chose to do historical data analysis. It will show the latest row pushed in, regardless of whether the datetime column is older or newer. We push a new row, it replaces what's in that tile. When this is all I want to do, I don't even care too much about sending a timestamp column. Any data I send will disappear an hour later, and that's fine for me.

But if we did choose that option, we can also create a report based on the dataset.

However, you can't do this in Power BI Desktop. You have to do it from the service, where your dataset was defined.

You can either choose New Report and then pick the dataset from the list, or (especially if you can't see it in the list), you can pick "Create Report" from the dataset's ellipsis.

Now I can create all my usual visualisations based on this data, showing multiple rows, and all that. In my example, because I didn't change the timestamp value every time, I need to tell it explicitly "do not summarize" the number value, because I can't edit the dataset except to change the field names and types. I certainly can't do anything like setting default aggregation types. Also notice that the dataset contains a table called RealTimeData. I didn't choose that – it was set for me.

At this point I find myself wondering about that 200k rows that it's holding onto, and filtering to "Top N" can be handy, or filtering on the datetime column for relative time (being careful about UTC settings). Otherwise, you might find yourself inadvertently showing more data than you'd like.

But if I need to clear out my dataset at the start of each day, then that's an option too. I just need to do it deliberately. Luckily, Power BI's REST API is good for that too, and I can easily delete all the rows in a dataset with a few lines of PowerShell.
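A sketch of what I mean, reusing the placeholder endpoint from earlier – sending an HTTP DELETE to the same rows URL clears out whatever has been pushed in (the authenticated Power BI REST API also has an equivalent 'delete rows' operation if you're not using the key-based URL):

```powershell
# Same placeholder push URL as before
$endpoint = "https://api.powerbi.com/beta/<tenant-id>/datasets/<dataset-id>/rows?key=<key>"

# A DELETE against the rows endpoint empties the dataset
Invoke-RestMethod -Method Delete -Uri $endpoint
```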

Clearing this will cause a dashboard tile to empty immediately, just like when we pushed the new data in.

But what about the report visual? That doesn't refresh automatically. To get the report to show the latest data you need to do a refresh.

However, you can pin the visual to a dashboard, and see it refresh automatically.

I made a simple loop to send a row, wait a second, and repeat, and the response that I got was pretty useful. I had the visual filtered to the Top 3 based on the latest timestamp.
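That loop was nothing fancy – something along these lines, reusing the placeholder endpoint and column names from earlier:

```powershell
# Push a new row every second - stop it with Ctrl+C
while ($true) {
    $payload = @{
        "MyText"     = "Loop test"
        "MyNumber"   = (Get-Random -Minimum 0 -Maximum 100)
        "MyDateTime" = (Get-Date).ToUniversalTime().ToString("o")
    }
    Invoke-RestMethod -Method Post -Uri $endpoint -Body (ConvertTo-Json @($payload))
    Start-Sleep -Seconds 1
}
```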

Now, it would be nice if reports could be set to auto-refresh when new data is pushed in, but having it in a dashboard is still very useful.

You see, streaming datasets don't have to involve Azure Stream, but they can involve all your usual visuals if you choose the option for historical data analysis – it's just that you won't see your visuals refresh automatically unless you use a dashboard. Now that you understand these aspects of what's going on and the couple of extra hoops you need to jump through, hopefully you can see a good use for these types of datasets. You can decide whether you need to clear rows occasionally, figure out how you want to visualise your data, and really start to make push datasets your own.

@rob_farley

That time the warehouse figures didn't match

I don't write about situations at my customers. It's just a rule I have. Regular readers may have heard me say it before. So if I'm writing about a problem, it's much more likely that it's something I've come across elsewhere, or a general thing. This month's topic for T-SQL Tuesday (hosted by Kerry Tyler, @airbornegeek) is about "crash reports" – the stuff that we've learned from disasters that can teach us (and others) about how to avoid disaster in the future.

So it's a bit of a tricky topic without writing about something that's happened at a customer. I'm still not going to write about what happened at a customer, but I am going to write about a problem that I've seen plenty of times, and what I do to avoid it.

It's about not sufficiently protecting the data that is being populated into a warehouse. I've been guilty of this before, and I continue to try to do better.

The scenario is that there's a source system, and there's a table in a warehouse that is being used to report on it. Maybe it's being populated by Integration Services or Data Factory. Maybe it's being populated by T-SQL. I don't really care. What I care about is whether the data in the warehouse is a true representation of what's in the source system.

If it's not a true representation, then we have all kinds of problems.

Mostly, that our warehouse is rubbish.

Seriously – if the data in the warehouse can't be shown to be correct, then what's the point of it? People won't want to use it, because they won't want their reports to be wrong. If they use it because it's the best they've got, knowing the data is sometimes wrong, then anything that comes from it will need further analysis and the inaccuracies will be a constant frustration. A warehouse needs to be correct.

It should be easy to validate too. It should be easy to pull totals from the source system (which could be a staging environment – it doesn't have to be production, it just has to be trusted to be correct) and see if they match what's in the warehouse. This check should run after each refresh of the warehouse data, and alerts should be raised if there's a problem.
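To give a sketch of the kind of check I mean (every table and column name here is invented), a reconciliation query can compare totals from the trusted source against the warehouse and return only the rows that disagree – if it returns anything at all, raise an alert:

```sql
-- Invented names: compare daily totals from the trusted source/staging tables
-- against the warehouse, returning only the dates where they disagree.
SELECT COALESCE(src.SaleDate, wh.SaleDate) AS SaleDate,
       src.TotalAmount AS SourceTotal,
       wh.TotalAmount  AS WarehouseTotal
FROM
(
    SELECT CAST(SaleTime AS date) AS SaleDate, SUM(Amount) AS TotalAmount
    FROM staging.Sales
    GROUP BY CAST(SaleTime AS date)
) AS src
FULL OUTER JOIN
(
    SELECT SaleDate, SUM(Amount) AS TotalAmount
    FROM dw.FactSales
    GROUP BY SaleDate
) AS wh
ON wh.SaleDate = src.SaleDate
-- INTERSECT treats nulls as equal, so a date missing from either side counts as a mismatch
WHERE NOT EXISTS (SELECT src.TotalAmount INTERSECT SELECT wh.TotalAmount);
```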

So there's a problem, and you probably have a good idea about how to detect it. But to be a "crash report", we need to explain what caused it in the first place. There could be several factors involved, but in the situation(s) I'm remembering, it came down to not protecting the base table.

An ETL process gets data from source system(s) into a warehouse, doing some amount of transformation to get it into the right shape, because the things that we're counting might need information from other tables in the source to work out how they should be represented in the warehouse tables. But the whole way along, we need to be protecting that data. And particularly from joins.

A join isn't a bad thing. It lets us get to the relevant data from other tables. It's very useful and you learn about it very early on in a T-SQL class. If you're using ADF or SSIS to do your ETL, you might have T-SQL in a source query. Or you might be pushing data through a Lookup or a Merge. You're almost certainly going to be needing information from another table. And you need to take care with that join.

You see, joins don't just pull data from other tables. They can also filter your data, if you're doing an inner join. They might make extra copies of the rows in your base table, if you hit multiple rows on the other side. You might even introduce rows full of nulls if you're doing a right outer join or full outer join.

This is a very real problem that I've had to deal with.

In an ideal world, every row in my base table goes through to my warehouse as exactly one row. But these three things that joins do… well, they can trip us up.

It's probably quite easy to stop pulling in null rows, because you're probably not using many right outer joins or full outer joins.

As for the filtering – well, it's easy to use left joins everywhere… except that you'll need to be careful not to accidentally eliminate the nulls a left join produces, by using those columns in another join or a WHERE clause. You might figure that you have foreign keys in place to make sure that you can't filter any rows with your join, but you'll need to make sure that the column in your base table doesn't allow nulls, even if it's currently always populated.

Hitting multiple rows should be easy to handle, because you can make sure the join predicates you're using can only hit one row. But this isn't always straightforward either. If you're referring to a primary (or other enforced-to-be-unique) key, then great. But will you be told if that changes? Or what if it's more complicated than that? Suppose you're joining to a row that describes a date range (like a dimension table with startdate and enddate), but somehow you've got overlapping ranges?

I've seen too many crash reports that cover this. Some have been of my own making. It's not always in warehouses, sometimes it's in basic queries used for other things, and I've failed to protect the base table.

What I try to do is to make sure that it's impossible for my joins to impact the number of rows in my base table. This means knowing what I consider the base table to be in the first place, and then being cautious about the way I'm joining. I often find myself replacing joins with "outer apply (select top (1) …)", even when I'm pretty sure that it should be okay. And this might impact the performance of my query – but correctness is critical. Plus, depending on how it's being used, I can often find ways to ensure that performance is even better this way (because the Query Optimizer will know that the number of rows isn't being affected by the join).
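As a sketch of the pattern (with invented table names), instead of joining straight to a date-ranged dimension and hoping the ranges never overlap, I'd write something like this:

```sql
-- staging.Sales is the base table; the OUTER APPLY can never filter it or
-- multiply its rows, because it returns at most one row per sale.
SELECT s.SaleID, s.Amount, s.SaleTime, c.CustomerName
FROM staging.Sales AS s
OUTER APPLY
(
    SELECT TOP (1) d.CustomerName
    FROM dw.DimCustomer AS d
    WHERE d.CustomerKey = s.CustomerKey
      AND d.StartDate <= s.SaleTime
      AND d.EndDate   >  s.SaleTime
    ORDER BY d.StartDate DESC   -- if ranges ever overlap, pick a deterministic winner
) AS c;
```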

Disasters are no fun. I'm all about making sure that my customers can trust their data. If my query gives incorrect results, they lose confidence in future queries that I write, and possibly in their data environment in general. It's critical that I get it right.

My "crash report" today is stopping before we reach the point of the customer losing confidence in things, but I can assure you that I've seen this happen. Today, I'm choosing those times when we've noticed what's going on and mitigated things before the disaster takes hold. But I implore you – protect your base tables. It really is very important.

@rob_farley

A tip about shortcuts

This month's T-SQL Tuesday brings another "Tips & Tricks" topic, this time from Kenneth Fisher (@sqlstudent144). Kenneth's twist, to make it different from previous ones, is that he insists on anything we suggest not being SQL-related.

It reminds me of the Shortcuts piece that Jess Pomfret (@jpomfret) prompted four months ago, where I wrote about a language shortcut I remembered from high school. Well, since then I've picked up Duolingo, and have now spent time each day for about eight weeks learning Polish (I've been a little curious ever since I went there last year), and completed the 'intro level' for the other 34 languages they have on the site. I find languages fascinating, and often wonder why I didn't find phone apps for learning them years ago. I have to admit that I still need to learn the endings of Polish words much better than I do now, figuring out which cases end in -ego and which end in -ami. I'll keep improving though, and one day hopefully I'll be able to spend more time in other languages too.

One thing I discovered recently is how you can seem much more natural in a language by being across the shortcuts that native speakers use. My shortcut post from February talked about how I used the pronoun 'on' in French to avoid having to remember so many verb conjugations, but in Polish there are some things that are even better. The most obvious of which is how you say Happy Birthday.

If I look up the Polish translation for Happy Birthday, it tells me "Wszystkiego najlepszego z okazji urodzin" (literally "All the best on the occasion of your birthday"). Don't be surprised at how scary that seems to pronounce, as long combinations of consonants are all the rage in Polish. I really can't tell how to say "Wszystkiego" properly – I know "w" sounds like a 'v', and that 'sz' sounds like 'sh', but 'vsh' doesn't seem any easier to say either.

But what Polish people say (as I spotted on Facebook by seeing how Polish people say Happy Birthday to Polish people) is "sto lat". Just six letters! And it's way more Polish than saying "Wszystkiego najlepszego z okazji urodzin". "Sto lat" literally means "a hundred years" (which means you can even get away with writing "100 lat") because the cultural thing to do is to tell someone you hope they live to be a hundred years old. Polish people realise this doesn't translate properly into English, so they don't say "100 years" to English-speaking people when it's their birthday.

So to me it's a great shortcut that has let me impress some of my Polish friends.

One thing I've found in life is that shortcuts suggest a deeper knowledge of something, whether or not you actually have it. When I used to catch taxis to the airport, I could tell if the driver was a local or not by the roads they'd take. Satnavs always recommend a slower route between my place and the airport, thinking that a wider road will be faster than the parallel narrower one despite having more traffic lights. The traditional "It's okay, I know a shortcut" when driving suggests you're more familiar with the area than the passengers, and is the kind of thing tested in London taxi drivers' "The Knowledge" (sorely missing in my town).

Fluency in any environment, whether navigating streets, a language, or even a computer system, is most typically demonstrated by the tips, tricks, and shortcuts that you develop. It implies that you've learned things that work better for you, and a style that works for you. That you're a local. When I say "sto lat" to someone in Poland, they can tell that I've made the effort to learn what they actually say (even though "Wszystkiego najlepszego z okazji urodzin" would have been much harder to pull off).

So take the time to learn shortcuts, particularly in the tools that you are starting to use more regularly. Explore your neighbourhood. Know what's around you, and then do things the best way.

Be a local.

@rob_farley

Avoiding the 'WHERE clause wasn't selected' problem

…you know the one. In SSMS it looks like this:
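Something like this – the table is made up, but the scenario will be familiar – with only the first two lines highlighted when Execute gets hit:

```sql
UPDATE dbo.Customers
SET StatusCode = 'Inactive'
WHERE LastOrderDate < '20150101';   -- the line that didn't make it into the selection
```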

You meant to only update a few rows, but you weren't paying attention to what you'd selected before hitting the Execute button. The WHERE clause got missed from what you executed, and you've updated every row. Oops.

Now, I totally hear you that you could've put this in a transaction, checked your results and then only done a COMMIT when you were sure. You could even have put that COMMIT inside an IF block that checked @@ROWCOUNT… yes, yes, yes. You could have used an SSMS add-in to let you execute the whole query that you're currently on. You could've used a different tool entirely, such as Azure Data Studio. There are plenty of ways to avoid this. But I'm going to show you something that I find works for me, especially when I'm giving the script to someone else to run.

The trick is to use a sub-query, and to run my UPDATE across that. I have to use an alias, sure, but I generally do that anyway.

It looks like this:
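Well, something like this anyway – the table and values here are made up, but the shape is the point:

```sql
UPDATE c
SET StatusCode = NewStatusCode
FROM
(
    SELECT *, -- the new value, so it can be previewed alongside the current one
        NewStatusCode = 'Inactive'
    FROM dbo.Customers
    WHERE LastOrderDate < '20150101'
) AS c;
```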

And the trick is that this code will not run unless the whole sub-query is selected. It's updating the alias of the sub-query, so unless you explicitly remove the WHERE clause (which I know you might do for some reason best known to yourself) and forget to put it back, you can't select just part of this by careless mouse-and-keyboard-shortcut work.

The "SELECT *, –" section makes sure that whoever is going to run this can tell what's about to happen, including the new values that they're looking for, confirm the number of rows, and then highlight from the UPDATE on, knowing that it's impossible for them to miss highlighting some of it. Because it's just way too easy to highlight 'not quite everything'.

I get that this idea doesn't work quite so well when CTEs are involved, and that I might need to be careful about column names if the FROM clause has multiple tables (such as by only returning the columns from the table actually being updated and anything else I need for the sake of the update). But still, it helps.

It's a small thing. Sometimes life is made up of a bunch of small things that together make a difference.

@rob_farley

Helping the research into COVID-19

Interesting times at the moment. You may have read my piece a few weeks ago about how I think it's irresponsible to do data analysis on the coronavirus situation unless you have sufficient domain knowledge and enough empathy to understand the impact of what you find. But this doesn't mean that we make like we're at the Winchester and wait for it all to blow over. There are some things we can do.

One such thing is something that Glenn Berry (@GlennAlanBerry) put me onto, and that's Folding@Home.

One major area of research into diseases is to see how proteins 'fold' into various shapes so that they can function as enzymes, antibodies, and so on. This is all chemical biology that is way outside my area of expertise, but I understand that when proteins fold incorrectly somehow, some diseases can occur. Horrible diseases like Alzheimer's or cystic fibrosis. Understanding protein folding can also lead to knowing more about how viruses use proteins against our immune systems… you can read more about all this at https://foldingathome.org/covid19/

The thing is that to understand protein folding better, tonnes of simulations need to be run. And this takes processing power that research labs simply don't have enough of.

And this is where you and I come in.

I remember when Jim Gray went missing. I was in the US, working on a project for Microsoft, when news came in that his boat had gone missing. Within a few days, satellite images became available, and people (including me) would look through them for anything that might have been worth investigating. I trawled through hundreds of images hoping that someone somewhere might find some sign of him before it was too late. We never did, and he was declared dead five years later. But seeing people all over the world come together to assist in a project like this was remarkable. And practical too. The SETI project, in which people listen to radio signals hoping to find extra-terrestrial intelligence, doesn't seem as practical to me as trying to find a person who has gone missing.

Or to help find a cure for diseases which are ravaging our population.

And so I downloaded the Folding@Home software from foldingathome.org, ran the software, and joined a team. I even spun up an old machine I wasn't using, so that it could help too. Now, processing power that I'm not using is running these simulations and it might bring researchers closer to finding cures for some of these things. And because of other processing power I have available, Kelly Broekstra has also put me onto DreamLab (that's iOS, try here for Android), which is a phone-based system that's similar. Now when I'm not using my iPhone but near a power supply (like when I'm sleeping), I run this app and let it crunch numbers. The screen dims about ten seconds after I tell it to start, which reassures me that I'm not burning the screen out.

This is a small thing. There are plenty of bigger things that we could all do, but it's also very simple to do this small thing, and it could become a very big thing if a lot of us do it.

Thank you Glenn and Kelly for putting me onto this. I hope everyone who gets involved stays involved long after COVID-19 has disappeared from society, because it seems this can really help research into some of the most horrible diseases out there.

@rob_farley

Patching SQL in Docker

If you read my blog regularly, you may have noticed that I installed SQL in Docker on my new laptop a couple of months ago (incidentally, I think it must still be new, because I still make mistakes when doing various shortcuts around the special keys like the arrow keys, End, PgUp, etc). Anyway, that was a while back now, and I recently wanted to apply the latest Cumulative Update.

Now, the thing with running SQL in containers is that the concept of downloading a patch file doesn't work in the same way. If it were regular Linux, the commands would be very simple, such as 'sudo yum update mssql-server' in RHEL. But Docker doesn't quite work the same way, as reflected by the Microsoft documentation, which mentions Docker in the installation instructions but not in the update section.

The trick to this is to remember that the Docker container is just a container running an image. So blowing it away and starting again is easy, as is spinning up different containers based on separate images. The database files are (hopefully) in separate volumes – accessed by the container, but not part of the image that's running.

So we just grab the latest image, stop and remove the old container, and run it up again using the new image. I run this PowerShell code to do that. It's worth noting that the pull won't actually pull anything if the 2019-latest image hasn't changed. I could be more specific and pull the image based on the tag (which is unique), but the code I'm using will just grab whatever is marked as 2019-latest. Specific tags are what I want for testing against a variety of versions, so that I can be really sure which one is which, but if I'm just patching my local machine, 2019-latest is good.
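A sketch of that code is below – the container name sql2019 matches what I already had, while the SA password and the host folders the data lives in are placeholders you'd swap for your own:

```powershell
# Grab whatever is currently tagged as 2019-latest (does nothing if it hasn't changed)
docker pull mcr.microsoft.com/mssql/server:2019-latest

# Blow away the old container - the image and the database files survive this
docker stop sql2019
docker rm sql2019

# Run a new container from the new image, pointing at the same folders as before
docker run -d --name sql2019 `
    -e "ACCEPT_EULA=Y" -e "MSSQL_SA_PASSWORD=<YourStrongPassword>" `
    -p 1433:1433 `
    -v "C:/docker/sql2019/data:/var/opt/mssql/data" `
    -v "C:/docker/sql2019/log:/var/opt/mssql/log" `
    mcr.microsoft.com/mssql/server:2019-latest
```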

A container on this image looks for the master database files in /var/opt/mssql/data and /var/opt/mssql/log, so by having those folders reference folders outside the container, nothing changes when I blow away the sql2019 container. When the new one starts, it sees the master database, which tells it where to find the user databases. As is the case with any upgrade, it'll apply any required changes when the SQL process spins up and loads the databases.

So this ends up being a very simple process. I can just have that block of code run whenever I want to make sure I've got the latest version. And if I forget, I'll come back here and grab the code again!

@rob_farley

Responsibility with data

It's easy to be a superhero when you have skills with data.

Being able to analyse data gives you super-vision, letting you see things you couldn't see before. It can give insight into what might happen in the future – prescient knowledge. It gives super-speed because you can react and adapt to situations much more quickly.

But to quote Spider-Man: with great power comes great responsibility.

It's all well and good to do amazing things with data. To see that a particular value is going up while another value isn't. To improve data quality. To learn about how data relates to businesses. To develop a culture around data. To use data effectively.

It's all well and good, except that if we focus too much on the numbers we can disconnect the data from what it means. In some ways, this can provide an objectivity that can become a strength. Letting the numbers speak for themselves without emotional ties. No strings.

Except that data is not just numbers. Data refers to things. To people. To loan applications. To medical emergencies. To salaries. Things that should never be considered just numbers. Reduced through objectivity to mere objects. These things are the subjects of life, and we need to always retain some amount of subjectivity.

It's easy for an accountant to look at the cost of the people in an organisation, and decide that that cost needs to be reduced. That people need to be let go. It's easy to look at a credit rating score and use that to turn down a loan application, but does this give the full story?


We find that with any of our clients, the data may well tell a story. We can create models that predict whether someone will behave a particular way, much like meteorologists predict whether it's going to rain. But the emotional investment in a raindrop is not the same as the emotional investment in someone's life.

Recently my friend Meagan Longoria, an expert on data visualisation, wrote at https://datasavvy.me/2020/03/26/stress-cases-and-data-visualization/ about how the people presenting data need to understand the stress level that consumers of that data are under. The current situation with COVID-19 is a strong example she gives of how everyone has been impacted in some way by this virus, whether they've been in hospital with it, lost loved ones, or just been inconvenienced by the closure of their local gym. Some people might happily look at charts about the number of cases all day, while others might be petrified with fear.

I'm not about to start producing charts in the public domain about a subject area I know relatively little about. Particularly when those data points are people who are hurting, and all the data points that are not on there are people who are potentially fearing for their own lives. If a customer of ours wants me to do some analysis on their COVID-19 data, then that's up to them, and might help them understand various aspects. But the human element of it is not lost there, because it's an organisation trying to understand the data in their world.

Pay attention to what Meagan and others are saying, and don't be tempted to report on data just because it's there. Consider the human impact of it. Seek to understand how people are impacted by the data. Seek to understand what aspects of people are described by data points, by trends in the data, by your reporting of the data.

People are always more important than data, and we would do well not to forget that.

In summary, we need to maintain an empathetic perspective on the data we're analysing, just as we would on the people it describes.

@rob_farley

Testing Within the Database

Specifically, unit testing. As per Hamish Watson (@thehybriddba)'s topic for this month's T-SQL Tuesday.

Hamish wants to develop a conversation about unit testing within databases, because he recognises that the lack of unit testing is a significant problem. It's quite commonplace in the world of iterative code, of C#, Java, and those kinds of languages, but a lot less commonplace in the world of data. I'm going to look at two of the reasons why I think this is.

Modularisation

When I was first taught to write code, I was taught about Conditionals and Looping. Conditionals are the 'IF' statements, and Looping is the FOR or WHILE statements. You can write just about anything using these. But I was also taught that Modularisation was just as important, so that blocks of code could be called when appropriate, rather than having to repeat them each time. Everything else, including the building blocks of any paradigm like object-oriented programming or logic programming, was secondary to these core concepts.

And then years later I entered the database world, where coding was different. Queries used joins and predicates, not looping and conditionals, and although these things were possible within stored procedures, it was better to include the logic within the query itself. As for modularisation, scalar functions were frowned upon, although procedures that called other procedures were okay.

Without modules, the units to be tested are a little less obvious, so fewer people think of unit testing in the database development world as they do in iterative languages.

Statelessness

'State' is about how the world looks when a particular set of code is called. A set of parameters that is passed into a module is not typically referred to as the 'state', but that's pretty much all that isn't. The concept of 'statelessness' is that anything that might be needed to allow a module to behave in a particular way should be passed in as a parameter, and not pulled in from anywhere else. After all, that 'anywhere else' could be different tomorrow, and the module might behave differently.

Statelessness is what makes a module deterministic. When a function is called with a particular set of parameters, the result should be the same as any other time it was called with those same parameters. This is one of the fundamental premises of unit testing. We know that if we call an 'Add' function with the values 3 and -2, the result should always be 1. If it's anything else, then something's not right.

But the world of databases is all about state. The database describes the world of that system as it is in that moment, and it's unlikely to ever be in that same exact state again.

So do we skip unit testing for databases?

No! It's just that things might need to be tested in different ways.

A test harness for a database environment needs to include a database, but this should be an opportunity, not a hindrance. The database used for testing can include scenarios that you would hope never to find in a production database. You wouldn't want anyone's name to include strange characters or potentially malicious code, but this could be included in a database used for testing. Products with negative prices, or prices that are so high that they would cause overflow problems, are excellent for test harnesses. It's how you make sure that errors are handled smoothly, and that alerts about these conditions are being flagged.
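To give a flavour of what that can look like (every name here is invented), a test can seed one of those nasty rows, exercise the thing being tested, and complain loudly if the outcome isn't what was expected:

```sql
-- Invented names throughout - the shape of the test is the point.
BEGIN TRAN;

-- Seed an edge case that production should never contain
INSERT dbo.Products (ProductID, ProductName, ListPrice)
VALUES (-1, N'Product with a negative price', -999.99);

-- Exercise the module under test
DECLARE @flagged int;
EXEC dbo.ValidatePricing @BadRowCount = @flagged OUTPUT;

-- Assert: the negative price should have been flagged exactly once
IF @flagged <> 1
    THROW 50000, N'Expected the negative price to be flagged exactly once.', 1;

ROLLBACK TRAN;   -- leave the test database as we found it
```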

Databases have a lot of things that need testing, because it's not just about ensuring that modules of code have the right impact. It's important to know and test the impact of SSIS processes, to be confident that conditions that break constraints fail gracefully, and to be reassured that reports will still be produced even if the data is unexpected or missing.

So for unit testing within the database, yes it's very much a thing. I don't particularly care how you define the units you're testing, whether they be queries, statements, procedures, functions, packages, reports, execution plans, or anything else. You can test them for correctness, for performance, for grace in failure, for whatever you like. Think of the edge cases you want to put into the testing database. Think of the conditions you want the system to be in when testing.

Just test.

@rob_farley

Developing a data culture when talking is harder

Developing a data culture is no easy task. Having a culture where people look to the data to understand what's going on is incredibly useful, whether it's to address big-picture questions or the minutiae of individual situations. And typically organisations have a lot more data than they expect, so it's quite realistic to assume that there is probably data that could answer a lot of questions, if only they had a way to get at the answers.

I see a number of steps involved in establishing a data culture. Some involve technical hurdles and many are more directly cultural. It's one thing to be able to demonstrate the quality of some data, but it can be quite another to have people understand and appreciate said quality. Being able to act on data first means knowing you have it.

As consultants (here at LobsterPot Solutions) who help our customers on the journey of developing a data culture, we like to work from their sites, and make ourselves available for answering questions about the data. Ultimately, we want our customers to be able to get the answers themselves. Tools like Power BI can help this become a reality. But before they reach that point, it's good to have them ask those questions of us. Not only does it help clue us up as to the things that we need to include in data models, but it helps demonstrate that they're thinking about data. That their data culture is coming along.

This willingness to ask questions has to filter through a large portion of any organisation trying to establish a data culture. It's great to have senior management understand the significance of data, but if the people under them aren't also making that same shift, then there's trouble coming. What happens if a middle manager reports up to senior management using an incorrect version of the data – one that doesn't match what the senior manager can see? Having a 'single point of truth' for data is very important.

While they're still on the journey to being able to get the answers themselves, the single point of truth for data might often be the team of people who are developing the analytics system – often the team which my staff or I are helping. We're very used to having our days consist of the technical and design work that we're doing towards a self-service reporting environment, as well as workshops and training we're doing to develop the culture, and also a long list of practical questions about the data. The same questions that will get answered via the models we're creating, but which are more time-critical and get answered in other ways – queries against a model or cube or warehouse, or maybe even against the operational system (taking care to avoid any kind of impact). These questions are important. Turning these people away prevents the development of a data culture, because their curiosity is extinguished. Their appetite for the answers is dampened, not whetted. I try to encourage everyone to have access to data, even if it's through me.

Except that I'm not always there. And in a time when an increasing number of people are needing to work from home, maybe there's nobody physically in the office who is working on the data.

Many organisations have a culture of asking each other questions via email, Teams, Slack, or whatever. But one of the biggest impacts I have is by being physically present, so that people ask who I am, find out that I'm not scary, and start to talk about the data. A lot of the culture around asking data-related questions is knowing how to ask those questions, so being physically present helps.

Suddenly this month, it feels like everyone is working from home.

Australia has reacted quite late compared to many countries, but is now banning large non-essential gatherings, and it wouldn't surprise me if the restrictions only become more severe as time goes on until a solution is found. Organisations across the world are putting out their own advice, and telling people to work from home if they can.

Having teams spread out makes all kinds of communication harder, including those questions about data that I consider so important. If I'm not physically there, are people still wondering the same questions but not getting the answers? At what point do they simply stop wondering?

For me, I think this comes down to prompting the conversation about data. Find ways to tell stories about their data, and start communicating it – emails, SSRS subscriptions, whatever works – to get information out there and pique interest. With a standing invite to ask more questions. This could be framed as educating people about the opportunity with data, or as validating assumptions (this one is particularly useful, because if you're asking someone for help with it, then they're more likely to look deeper and start to think about the potential), or as pushing the data through a "Quick Insights" tool to come up with 'Did you know…?'-style trivia. The more the data can be discussed and be interesting, the more a culture around understanding data will thrive.

Pretty soon, conversation should be able to flow more easily even if you're not there, but communication channels need to be established. Self-service is great, but ideally you can make some sort of a forum for people to share their nuggets of insight. It's the kind of thing that someone can share with a colleague in a "Hey – check this out" context. It demonstrates excitement about what they're discovering. It demonstrates that they understand the business that they're in, what the organisation does, and hopefully where there's an opportunity for improvement. When people can't walk up to each other in the same way, an effort might be needed to make this happen – Slack conversations, Team channels, whatever works to develop and maintain social interaction. Then find ways to incorporate data nuggets into the mix.

A data culture doesn't have to be hard to instigate. It might require some work on the data itself, including data cleansing, modelling, and tools to let people get at the insight. But the key is to let interest in the data thrive, with enthusiasm firing, and to make sure there are ways to let that continue. Find ways to encourage conversation for the sake of your team, and for the sake of the data culture, find ways to encourage conversation about the data as well.

@rob_farley