Friday, 27 September 2013

Data Masking - overcoming the challenges

Everyone agrees that when provisioning test databases as full or partial copies of production data, any sensitive data, that is, personally identifiable or company-confidential data, must be anonymised. There are lots of terms for this process, including scramble, obfuscate, de-sensitize, blind, redact and mask. Encryption does not usually qualify, as it is designed to be reversible, provided you can remember the key. I will use the term masking in the remainder of this article.

Masking data is a relatively straightforward, although generally I/O-intensive, process. That is because when you select a column to mask, you have to update that column's value in every row of the relevant table.
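
As a rough illustration, here is a minimal Python sketch of what masking a single column amounts to: rewriting its value in every row. The customer table and the mask function are hypothetical placeholders, not any particular tool's API.

    import sqlite3

    # A toy stand-in for a production table.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, email TEXT)")
    con.executemany("INSERT INTO customer VALUES (?, ?)",
                    [(1, "alice@example.com"), (2, "bob@example.com")])

    def mask(value):
        # Placeholder masking function; better choices are discussed below.
        return "x" * len(value)

    # Masking the email column means rewriting it in every row,
    # which is why the process is I/O-intensive on large tables.
    for row_id, email in con.execute("SELECT id, email FROM customer").fetchall():
        con.execute("UPDATE customer SET email = ? WHERE id = ?",
                    (mask(email), row_id))

    print(con.execute("SELECT * FROM customer").fetchall())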

The primary challenge is not the data processing involved in masking; it is ensuring that all sensitive columns are masked. There is always a niggling concern that you may have missed a sensitive column and gone ahead and published on the assumption that all data was masked. There is no automated method to allay your fears.

For relational data, if you choose to mask a column that is used in a join condition, you need to mask both columns, replacing each corresponding entry with the same masked value (that is, deterministically). This is necessary to ensure that your SQL will operate correctly when joining two masked columns. It is arguable that you should not join on columns which contain sensitive data, but I imagine it does happen.
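
Here is a small sketch of the idea, with hypothetical customers and orders data joined on an email column: mask both sides with the same deterministic function and the join keys still line up.

    import hashlib

    def deterministic_mask(value):
        # Same input always produces the same output, so masked
        # values on both sides of a join still match.
        return hashlib.sha256(value.encode()).hexdigest()[:12]

    customers = [("alice@example.com", "Alice")]
    orders    = [("alice@example.com", "order-42")]

    masked_customers = [(deterministic_mask(e), name) for e, name in customers]
    masked_orders    = [(deterministic_mask(e), ref)  for e, ref  in orders]

    # The join condition still holds after masking both columns.
    assert masked_customers[0][0] == masked_orders[0][0]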

The join requirement means that, for any sensitive column, we need to understand the relationships between that column and all other columns with which it might be joined. In general this requires a data model, which is usually unavailable or difficult to build, particularly if the table relationships are defined in the application logic rather than declaratively in the database.

So we have to:
  1. Identify all sensitive columns
  2. Identify all join relationships between sensitive columns so we can mask them using the same algorithm

Masking identified sensitive columns is known as a blacklist approach to masking, where only data you have identified as sensitive is masked. An alternative is a whitelist approach, where by default all data is masked except for the columns listed in the whitelist.

The benefits of a whitelist approach are:
  1. No need to identify all sensitive columns; in the worst case you will mask data that is not sensitive.
  2. If new sensitive columns are added they will be masked by default.
  3. If used in conjunction with deterministic masking functions there is no need to model relationships.  The data model for a whitelist just has to contain a list of all the tables to be masked.
Sounds too good to be true? Well, it isn't all plain sailing. You would still need to:

  1. Assign a masking function to each and every column, although a default function could be applied depending on data type.
  2. Identify columns that do not need to be masked. This is not strictly necessary; the benefits would be slightly better masking performance and perhaps easier support of test cases that examine or report non-sensitive data.
  3. Use deterministic masking functions to avoid the need to identify relationships. A deterministic masking function produces the same output for the same input. This is also necessary when masking data consistently across multiple databases. The other benefit is that deterministic masking functions preserve cardinality, which is essential when running performance tests, as data cardinality will affect the SQL execution plans (a sketch follows this list).
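
A minimal sketch of such a function, using a keyed hash; the shared secret here is a hypothetical example. The same input gives the same output on every run and every database that uses the same key, and distinct inputs keep distinct outputs, so cardinality is preserved.

    import hmac, hashlib

    SECRET = b"shared-masking-key"  # hypothetical; share it across databases

    def deterministic_mask(value):
        # Same input -> same output on any run or database with the same
        # key; distinct inputs keep distinct outputs (barring collisions),
        # so data cardinality is preserved.
        return hmac.new(SECRET, str(value).encode(), hashlib.sha256).hexdigest()[:16]

    emails = ["alice@example.com", "bob@example.com", "alice@example.com"]
    masked = [deterministic_mask(e) for e in emails]
    assert masked[0] == masked[2]                  # deterministic
    assert len(set(masked)) == len(set(emails))    # cardinality preserved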

Using a whitelist approach in combination with deterministic masking preserves security by default and avoids the need to determine table relationships. The downside is that columns missing from the whitelist will be masked unnecessarily, which may extend the masking execution time, although not by much, as a large portion of the masking process is spent rebuilding tables and dependent objects such as indexes.
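
Combining the two, a whitelist-driven masking pass might look something like the following sketch. Table and column names are made up, and deterministic_mask is the keyed-hash function from above.

    import hmac, hashlib

    SECRET = b"shared-masking-key"  # hypothetical shared key

    def deterministic_mask(value):
        return hmac.new(SECRET, str(value).encode(), hashlib.sha256).hexdigest()[:16]

    # The only model required: the tables to mask, plus the columns
    # known to be safe to leave in the clear.
    WHITELIST = {
        "customer": {"id", "created_at"},
        "orders":   {"id", "order_date"},
    }

    def mask_row(table, row):
        # Mask by default; skip only the whitelisted columns.
        safe = WHITELIST.get(table, set())
        return {col: (val if col in safe else deterministic_mask(val))
                for col, val in row.items()}

    print(mask_row("customer", {"id": 1, "email": "alice@example.com",
                                "created_at": "2013-09-27"}))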

Using production data as the basis for test data is by far the best approach, as long as you can guarantee the security of the copy. For realistic performance tests it is essential to use test data based on production. The additional benefit is that you do not have to generate artificial data, which is itself non-trivial and leads to artificial results.

With the addition of a few user-defined masking functions, Oracle Data Masking supports this approach. Why not give it a try?
