Everyone agrees that when provisioning test databases as full or partial copies of production data, any sensitive data, whether personally identifiable or company confidential, must be anonymised. There are many terms for this process, including scramble, obfuscate, de-sensitize, blind, redact and mask. Encryption does not usually apply because it is designed to be reversible, provided you have the key. I will use the term masking in the remainder of this article.
Masking data is a relatively straightforward, although generally I/O-intensive, process: when you select a column to mask, you have to update that column's value in every row of the relevant table.
The primary drawback is not the data processing involved in masking; it is ensuring that all sensitive columns are masked. There is always a niggling concern that you may have missed a sensitive column and then published on the assumption that all data was masked. There is no automated method to allay your fears.
For relational data, if you choose to mask a column that is used in a join condition, you need to mask both columns, replacing each corresponding entry with the same masked value (deterministic masking). This ensures that your SQL will operate correctly when joining two masked columns. It is arguable that you should not join on columns that contain sensitive data, but I imagine it does happen.
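The join behaviour can be sketched in a few lines of Python. This is an illustration, not Oracle Data Masking itself: the `mask` function, the HMAC key, and the sample tables are all assumptions chosen for the example, but any deterministic function would behave the same way.

```python
import hashlib
import hmac

# Hypothetical deterministic masking function: the same input always
# produces the same masked value. The key would be held by the masking
# process and never published alongside the test data.
SECRET_KEY = b"masking-key"

def mask(value: str) -> str:
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:12]

# Two tables that join on customer_id (illustrative data only).
customers = [{"customer_id": "C1001", "city": "Leeds"}]
orders = [{"order_id": 1, "customer_id": "C1001"}]

# Mask the join column in BOTH tables with the same function.
for row in customers:
    row["customer_id"] = mask(row["customer_id"])
for row in orders:
    row["customer_id"] = mask(row["customer_id"])

# The join still works because corresponding entries were replaced
# with the same masked value.
joined = [(o["order_id"], c["city"])
          for o in orders for c in customers
          if o["customer_id"] == c["customer_id"]]
print(joined)  # → [(1, 'Leeds')]
```

Had the two tables been masked with different functions (or different keys), the join would have returned no rows, silently breaking any SQL that relies on it.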
The join requirement means that for any sensitive column we need to understand the relationships between that column and all other columns with which it might be joined. In general this requires a data model, which is usually not available or is difficult to build, particularly if the table relationships are defined in the application logic rather than declaratively in the database.
So we have to:
- Identify all sensitive columns
- Identify all join relationships between sensitive columns so we can mask them using the same algorithm
Masking identified sensitive columns is known as a blacklist approach to masking: only data you have identified as sensitive is masked. An alternative is a whitelist approach, where by default all data is masked except for the columns listed in the whitelist.
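The whitelist idea reduces to a simple rule: mask everything unless a column is explicitly listed as safe. A minimal sketch in Python, where the `mask` function, `WHITELIST` contents, and sample row are illustrative assumptions:

```python
import hashlib
import hmac

SECRET_KEY = b"masking-key"  # assumption: key managed by the masking job

def mask(value) -> str:
    # Deterministic masking function (same input -> same output).
    return hmac.new(SECRET_KEY, str(value).encode(), hashlib.sha256).hexdigest()[:12]

# Whitelist: the only columns allowed through in the clear.
WHITELIST = {"order_date", "status"}

def mask_row(row: dict) -> dict:
    # Everything NOT on the whitelist is masked by default.
    return {col: (val if col in WHITELIST else mask(val))
            for col, val in row.items()}

row = {"order_date": "2024-01-31", "status": "SHIPPED",
       "customer_name": "Alice Smith", "email": "alice@example.com"}
masked = mask_row(row)
# customer_name and email are masked; whitelisted columns pass through.
```

Note how a newly added column (say, `phone_number`) would be masked automatically, because the rule only names the safe columns, not the sensitive ones.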
The benefits
of a whitelist approach are:
- No need to identify all sensitive columns; in the worst case you will mask data that is not sensitive.
- If new sensitive columns are added they will be masked by default.
- If used in conjunction with deterministic masking functions there is no need to model relationships. The data model for a whitelist just has to contain a list of all the tables to be masked.
Implementing a whitelist approach requires you to:
- Assign a masking function for each and every column, although a default function could be applied depending on data type.
- Identify columns that do not need to be masked. This is not strictly necessary; the benefits are slightly better masking performance and perhaps easier support of test cases that examine or report non-sensitive data.
- Use deterministic masking functions to avoid the need to identify relationships. A deterministic masking function produces the same output for the same input. This is also necessary when masking data consistently across multiple databases. A further benefit is that deterministic masking functions preserve cardinality, which is essential when running performance tests, as data cardinality impacts SQL execution plans.
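The cardinality-preservation point is easy to demonstrate. In this sketch (the HMAC-based `mask` function and sample values are assumptions, not part of any particular product), equal inputs map to equal outputs and distinct inputs to distinct outputs, so the number of distinct values in a column is unchanged:

```python
import hashlib
import hmac

SECRET_KEY = b"masking-key"  # illustrative key

def mask(value) -> str:
    # Deterministic: the same input always yields the same masked value.
    return hmac.new(SECRET_KEY, str(value).encode(), hashlib.sha256).hexdigest()[:12]

values = ["alice", "bob", "alice", "carol", "bob"]
masked = [mask(v) for v in values]

# Duplicates in the input remain duplicates in the output...
assert masked[0] == masked[2] and masked[1] == masked[4]
# ...so the distinct-value count (cardinality) is preserved, which keeps
# optimizer statistics and SQL execution plans realistic under test.
assert len(set(masked)) == len(set(values))
```

A non-deterministic function (random replacement per row) would inflate the distinct-value count, skewing the statistics the optimizer uses and making performance test results unrepresentative.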
Using a whitelist approach in combination with deterministic masking preserves security by default and avoids the need to determine table relationships. The downside is that columns missing from the whitelist will be masked unnecessarily, which may extend the masking process execution time, although not by much, as a large portion of the masking process is rebuilding tables and dependent objects such as indexes.
Using production data as the basis for test data is by far the best approach, as long as you can guarantee the security of the copy. For realistic performance tests it is essential to use test data based on production. The additional benefit is that you do not have to generate artificial data, which is itself non-trivial and leads to artificial results.
With the addition of a few user-defined masking functions, Oracle Data Masking supports this approach. Why not give it a try?
