Randomising data

December 7, 2009

I recently needed to randomise some data to keep some information secret. The idea was that it looked roughly similar to the real data, but was sufficiently different to avoid any identifying features.

After discussing it with the client, it was agreed that…

1/ ID numbers would be mixed around the people in the list. Therefore, they were all real numbers (that therefore matched the rules governing what made up a legitimate number), but they would be reordered at random amongst the people.

2/ Dates would be set randomly between the minimum and maximum dates available.

3/ Strings would become a series of random letters, but the same length as the original.

4/ Genders would be assigned a random value of M, F or N (Not Specified).

5/ Numeric fields (such as salaries) would be multiplied by somewhere between 0.1 and 10, with 1 being the median value used.

Here’s how I did it.

1/ I used row_number of this, twice. I used one ordered by the original ID field, and one ordered by newid() (which is a good-enough random order). I could then perform a self-join, and do the update.

with twonums as
(
select row_number() over (order by id) as orig_rownum,
      row_number() over (order by newid()) as new_rownum,
      *
from dbo.People
)
update t1 set id = t2.id
from twonums t1
      join
      twonums t2
      on t1.orig_rownum = t2.new_rownum
;

This mechanism takes advantage of the fact that you can update a CTE. The fact that row_number() assigns each number exactly once means that I update every row, and no row gets updated twice.

2/ To generate a random positive value less than some number N, I use abs(checksum(newid())) % N. Apparently this gives a good distribution of values. If N is the number of days between two dates (plus one, in case the two dates are identical), then the result can be added back onto the first date to get a random date between the two.

update dbo.theTable
set theDate = (
   select
   dateadd(day,
         abs(checksum(newid())) %
            datediff(day, min(theDate), max(theDate)) + 1,
         min(theDate)
         )
   from dbo.theTable
);

If you prefer, you could populate variables @startDateRange and @endDateRange and then use them instead of having dbo.theTable in the sub-query like this. The Query Optimizer should be able to kick in and make work out those values for you once though (which is did when I checked the Execution Plan).

3/ Without stepping through each character in a string, it doesn’t seem particularly trivial to change each one to something different. For this, I took advantage of SQL 2005’s ability to use expressions with the TOP clause, and the string concatenation feature available from FOR XML PATH(”).

Using any table with sufficient rows in it in my FROM clause, I generated random letters by converting a number from 0 to 25 to a letter. Adding the number to ascii(‘A’) and converting back to a character did the trick. Restricting the number of rows returned to the number of characters in the name gave me a set of characters, which I could easily concatenate using FOR XML PATH(”)

select top (len(isnull(GivenNames,”))) char(abs(checksum(newid())) % 26 + ascii(‘A’))
from sys.all_objects
for xml path(”)

4/ Assigning a random gender to a row was very easy. I simply took a random value between 0 and 2 and used it with CASE.

case abs(checksum(newid())) % 3 when 0 then ‘M’ when 1 then ‘F’ else ‘N’ end

5/ Finally, multiplying by a value between 0.1 and 10. It’s easy to generate a value between 0 and 99, add one to it and divide by 10.0 to get values in this range, but this isn’t really what’s desired, as it would give a distribution centred around five. The distribution that I want is actually logarithmic, giving roughly as many values less than 1 as there are greater.

Really what I wanted was to get a number between —1 and 1, and use 10^N, as 10^(-1) is 0.1, 10^0 is 0, and 10^1 is 10. This seemed quite easy, except that the POWER() function in SQL only uses integers. I could easily generate a value in the range —1 to 1, I simply used checksum(newid()) % 1001 (ignoring the ABS() function), and divided by 1000.0. But then to find 10 to the power of this value, I remember the logarithm function from school, which said that x^y was the same as the exponent of log(x) * y. Therefore, I used:

exp(log(10) * (checksum(newid()) % 1000 / 1000.))

…which did the trick nicely.

The client verified that the data was sufficiently random (as well as expressing some surprised over there being a practical use for log() and exp()), and I had an environment to which I could grant developers access.