Responsibility with data

It's easy to be a superhero when you have skills with data.

Being able to analyse data gives you super-vision, letting you see things you couldn't see before. It can give insight into what might happen in the future – prescient knowledge. It gives super-speed because you can react and adapt to situations much more quickly.

But to quote Spider-Man: with great power comes great responsibility.

It's all well and good to do amazing things with data. To see that a particular value is going up while another value isn't. To improve data quality. To learn about how data relates to businesses. To develop a culture around data. To use data effectively.

It's all well and good, but if we focus too much on the numbers, we can disconnect the data from what it means. In some ways, this can provide an objectivity that can become a strength. Letting the numbers speak for themselves without emotional ties. No strings.

Except that data is not just numbers. Data refers to things. To people. To loan applications. To medical emergencies. To salaries. Things that should never be considered just numbers. Reduced through objectivity to mere objects. These things are the subjects of life, and we need to always retain some amount of subjectivity.

It's easy for an accountant to look at the cost of the people in an organisation, and decide that that cost needs to be reduced. That people need to be let go. It's easy to look at a credit rating score and use that to turn down a loan application, but does this give the full story?


We find that within any of our clients, the data may well tell a story. We can create models that predict whether someone will behave in a particular way, much like meteorologists predict whether it's going to rain. But the emotional investment in a raindrop is not the same as the emotional investment in someone's life.

Recently my friend Meagan Longoria, an expert on data visualisation, wrote at https://datasavvy.me/2020/03/26/stress-cases-and-data-visualization/ about how the people presenting data need to understand the stress that the consumers of that data are under. The current situation with COVID-19 is a strong example she gives: everyone has been impacted in some way by this virus, whether they've been in hospital with it, lost loved ones, or simply been inconvenienced by the closure of their local gym. Some people might happily look at charts about the number of cases all day, while others might be petrified with fear.

I'm not about to start producing charts in the public domain about a subject area I know relatively little about. Particularly when those data points are people who are hurting, and all the data points that are not on there are people who are potentially fearing for their own lives. If a customer of ours wants me to do some analysis on their COVID-19 data, then that's up to them, and might help them understand various aspects. But the human element of it is not lost there, because it's an organisation trying to understand the data in their world.

Pay attention to what Meagan and others are saying, and don't be tempted to report on data just because it's there. Consider the human impact of it. Seek to understand how people are impacted by the data. Seek to understand what aspects of people are described by data points, by trends in the data, by your reporting of the data.

People are always more important than data, and we would do well not to forget that.

In summary, we need to maintain an empathetic perspective on the data we're analysing, just like we need to keep an empathetic perspective on people.

@rob_farley

Testing Within the Database

Specifically, unit testing. As per Hamish Watson (@thehybriddba)'s topic for this month's T-SQL Tuesday.

Hamish wants to develop a conversation about unit testing within the database because he recognises that the lack of unit testing is a significant problem. It's quite commonplace in the world of iterative code, of C#, Java, and those kinds of languages, but a lot less commonplace in the world of data. I'm going to look at two of the reasons why I think this is.

Modularisation

When I was first taught to write code, I was taught about Conditionals and Looping. Conditionals are the 'IF' statements, and Looping is the FOR or WHILE statements. You can write just about anything using these. But I was also taught that Modularisation was just as important, so that blocks of code could be called when appropriate, rather than having to repeat them each time. Everything else, including the building blocks of any paradigm like object-oriented programming or logic programming, was secondary to these core concepts.

And then years later I entered the database world, where coding was different. Queries used joins and predicates, not looping and conditionals, and although these things were possible within stored procedures, it was better to include the logic within the query itself. As for modularisation, scalar functions were frowned upon, although procedures that called other procedures were okay.
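To make that concrete (the table and column names here are hypothetical), here is the same change written first the way an iterative programmer might reach for it, and then as a single set-based statement where the condition is simply a predicate in the query:

```sql
-- Hypothetical scenario: apply a 10% discount to every discontinued product.

-- Iterative habit carried over from C#/Java: process one row at a time.
DECLARE @ProductID int;
DECLARE product_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT ProductID FROM dbo.Products WHERE Discontinued = 1;
OPEN product_cursor;
FETCH NEXT FROM product_cursor INTO @ProductID;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.Products SET Price = Price * 0.9 WHERE ProductID = @ProductID;
    FETCH NEXT FROM product_cursor INTO @ProductID;
END;
CLOSE product_cursor;
DEALLOCATE product_cursor;

-- Database thinking: the logic lives in the query itself, no loop required.
UPDATE dbo.Products
SET Price = Price * 0.9
WHERE Discontinued = 1;
```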

Without modules, the units to be tested are a little less obvious, so fewer people think of unit testing in the database development world as they do in iterative languages.

Statelessness

'State' is about how the world looks when a particular set of code is called. A set of parameters that is passed into a module is not typically referred to as the 'state', but that's pretty much all that isn't. The concept of 'statelessness' is that anything that might be needed to allow a module to behave in a particular way should be passed in as a parameter, and not pulled in from anywhere else. After all, that 'anywhere else' could be different tomorrow, and the module might behave differently.
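As a rough sketch of the difference (the function names and the dbo.Settings table are invented for illustration), compare a function that quietly pulls its tax rate from a table with one that is handed everything it needs as parameters:

```sql
-- Stateful: the result depends on whatever happens to be in dbo.Settings today.
CREATE FUNCTION dbo.PriceWithTax_Stateful (@Price decimal(10,2))
RETURNS decimal(10,2)
AS
BEGIN
    DECLARE @TaxRate decimal(5,4) = (SELECT TaxRate FROM dbo.Settings);
    RETURN @Price * (1 + @TaxRate);
END;
GO

-- Stateless: everything the function needs is passed in, so the same inputs
-- always give the same output.
CREATE FUNCTION dbo.PriceWithTax (@Price decimal(10,2), @TaxRate decimal(5,4))
RETURNS decimal(10,2)
AS
BEGIN
    RETURN @Price * (1 + @TaxRate);
END;
GO
```

The first might give a different answer tomorrow; the second can't.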

Statelessness is what makes a module deterministic. When a function is called with a particular set of parameters, the result should be the same as any other time it was called with those same parameters. This is one of the fundamental premises of unit testing. We know that if we call an 'Add' function with the values 3 and -2, the result should always be 1. If it's anything else, then something's not right.
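A minimal, framework-free sketch of that idea in T-SQL might look like the following, assuming a hypothetical dbo.Add scalar function; frameworks such as tSQLt wrap the same pattern in reusable assertion procedures:

```sql
-- A framework-free unit test sketch; dbo.Add is a hypothetical scalar function.
DECLARE @Expected int = 1,
        @Actual   int = dbo.Add(3, -2);

IF @Actual = @Expected
    PRINT 'PASS: dbo.Add(3, -2) = 1';
ELSE
    RAISERROR('FAIL: dbo.Add(3, -2) returned %d, expected %d', 16, 1, @Actual, @Expected);
```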

But the world of databases is all about state. The database describes the world of that system as it is in that moment, and it's unlikely to ever be in that same exact state again.

So do we skip unit testing for databases?

No! It's just that things might need to be tested in different ways.

A test harness for a database environment needs to include a database, but this should be an opportunity, not a hindrance. The database used for testing can include scenarios that you would hope never to find in a production database. You wouldn't want anyone's name to include strange characters or potentially malicious code, but this could be included in a database used for testing. Products with negative prices, or prices that are so high that they would cause overflow problems, are excellent for test harnesses. This is how you make sure that errors are handled smoothly, and that alerts about these conditions are being flagged.
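As a sketch of what that seeding might look like (the tables, columns, and values here are invented for illustration, and the columns are assumed to be wide enough to hold the rows):

```sql
-- Edge-case rows you'd hope never to see in production, but want in the test database.
INSERT INTO dbo.Customers (CustomerName)
VALUES (N'O''Brien; DROP TABLE dbo.Customers; --'),  -- quotes and injection-style text
       (N'Zoë Ångström');                            -- non-ASCII characters

INSERT INTO dbo.Products (ProductName, Price)
VALUES (N'Goodwill refund', -50.00),        -- negative price
       (N'Platinum bundle', 999999999.99);  -- big enough to stress downstream arithmetic
```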

Databases have a lot of things that need testing, because it's not just about ensuring that modules of code have the right impact. It's important to know and test the impact of SSIS processes, to be confident that conditions that break constraints fail gracefully, and to be reassured that reports will still be produced even if the data is unexpected or missing.
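For example, a check that a constraint violation fails gracefully could be sketched like this, without assuming any particular test framework, using a hypothetical dbo.Orders table with a CHECK constraint that forbids negative quantities:

```sql
-- Sketch: confirm that a bad row is rejected by the (assumed) CHECK constraint.
DECLARE @Rejected bit = 0;

BEGIN TRY
    INSERT INTO dbo.Orders (ProductID, Quantity) VALUES (1, -5);
END TRY
BEGIN CATCH
    IF ERROR_NUMBER() = 547   -- constraint violation
        SET @Rejected = 1;
END CATCH;

IF @Rejected = 1
    PRINT 'PASS: the negative quantity was rejected by the constraint';
ELSE
    RAISERROR('FAIL: the negative quantity was not rejected as expected', 16, 1);
```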

So for unit testing within the database, yes, it's very much a thing. I don't particularly care how you define the units you're testing, whether they be queries, statements, procedures, functions, packages, reports, execution plans, or anything else. You can test them for correctness, for performance, for grace in failure, for whatever you like. Think of the edge cases you want to put into the testing database. Think of the conditions you want the system to be in when testing.

Just test.

@rob_farley