Introduction
The best source of data for test environments is data from production. Since truth is stranger than fiction, testing against actual data is far more effective than testing against synthetically generated data. It is also much easier to copy existing data than to generate new data with the correct format, cardinality and relationships. Unfortunately there are several challenges when provisioning test data based on production data.
1. De-sensitizing data is slow.
De-sensitizing data is inherently a long-running process, as all tables containing sensitive columns must be rebuilt with masked replacements. We also need to rebuild all relevant indexes.
2. Data subsets are difficult to produce.
A subset of the data is often all that is required, but sub-setting data can be a non-trivial task unless the application has built-in sub-setting capabilities.
3. Refreshing test data is resource intensive and slow.
Traditionally the de-sensitizing process operates as a batch job on the entire dataset. Each time you want an up-to-date de-sensitized copy of production you have to run this job again, and as data volumes grow this process takes increasingly longer to complete. The previous de-sensitized copy is of little use and is discarded.
Several customers have asked if it is possible to produce a constantly updated de-sensitized copy of production, to avoid the time and resources spent building from scratch each time a new secure version is required.
Replication and Masking
The solution we propose is to use GoldenGate to replicate DML to a replica database which is masked using DML (INSERT and UPDATE) table triggers. The triggers mask all sensitive column data before it is inserted or updated.
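As an illustration, a minimal sketch of such a trigger and masking function is shown below. The table, column and function names (customers, ssn, mask_ssn) are hypothetical, and the masking logic is only a placeholder; in practice the trigger would call the same stored procedures used by the existing batch masking jobs.

-- Hypothetical masking function: a deterministic, format-preserving placeholder.
CREATE OR REPLACE FUNCTION mask_ssn (p_ssn IN VARCHAR2)
  RETURN VARCHAR2 DETERMINISTIC
IS
BEGIN
  -- Replace the leading digits, keep the last four so the format is preserved.
  RETURN 'XXX-XX-' || SUBSTR(p_ssn, -4);
END mask_ssn;
/

-- Trigger on the replica table: masks the sensitive column on every
-- replicated INSERT or UPDATE before the row is written.
CREATE OR REPLACE TRIGGER trg_mask_customers
  BEFORE INSERT OR UPDATE OF ssn ON customers
  FOR EACH ROW
BEGIN
  :NEW.ssn := mask_ssn(:NEW.ssn);
END trg_mask_customers;
/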
There are two challenges that have to be overcome when masking a database which is maintained via logical replication, specifically around UPDATEs.
Logical replication updates the replica by using a WHERE clause which uniquely identifies the correct row in the replica database. This WHERE clause uses either a primary or a unique key. Note that logical replication reverse engineers row deltas from the redo log stream; the actual rows being updated or deleted have already been identified when the original transaction was interpreted in the source database.
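For illustration, the statement the apply process issues on the replica has roughly this shape (table and column names are hypothetical):

UPDATE customers
   SET last_name = :new_last_name
 WHERE customer_id = :key_value;  -- key columns taken from the primary or unique key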
Problem 1 – the key contains a sensitive column
If one of the columns in the primary or unique key is sensitive, then without interception the WHERE clause used to replicate the change will reference a value which does not match the corresponding row in the replica database.
To handle these UPDATEs, we simply generate the masked version of the sensitive key column before the WHERE clause is applied. This is configured in the GoldenGate replication process, which calls the appropriate stored procedure to mask the column.
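A sketch of what this might look like in a Replicat parameter file follows. The schema, table, column, procedure and parameter names are hypothetical, and the exact SQLEXEC options needed to ensure the masked value is also used for the key lookup in the WHERE clause depend on the GoldenGate version and configuration.

MAP prod.customers, TARGET secure.customers,
  SQLEXEC (SPNAME mask_ssn_proc, ID m1, PARAMS (p_in = ssn)),
  COLMAP (USEDEFAULTS, ssn = @GETVAL (m1.p_out));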
Problem 2 – the key containing a sensitive column is updated directly in the source database
Slightly more challenging is the case where the key containing the sensitive column is updated on the source database. On the replica database we need to ensure the UPDATE statement has access to both the old and new values of the sensitive column and calls the masking procedure on each, e.g.
UPDATE <table>
   SET sensitive_key_column = mask(new.sensitive_key_column)
 WHERE sensitive_key_column = mask(old.sensitive_key_column);
This solution requires a duplicate of the sensitive key column, so that we can issue a WHERE clause using the old value and a SET clause using the new value (a sketch follows the list below). However, for this solution to be necessary, two unusual pre-conditions must both hold:
1. Primary keys are updated on production.
2. A primary or unique key contains a sensitive column.
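As a rough sketch of how the duplicate column might be arranged on the replica (table and column names are hypothetical, and the exact mapping of before and after images is configuration dependent):

-- Additional column on the replica table holding the masked key value.
ALTER TABLE customers ADD (ssn_masked VARCHAR2(11));

-- Shape of the applied statement when the sensitive key itself changes:
-- the masked old value locates the row, the masked new value replaces it.
UPDATE customers
   SET ssn_masked = mask_ssn(:new_ssn)
 WHERE ssn_masked = mask_ssn(:old_ssn);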
Minimal Storage
With GoldenGate replication and database trigger-based masking as described above, we have a constantly up-to-date secure copy of the production database. We usually want to support multiple test and development environments, so we also want to reduce the storage incurred for these copies.
We can provision point-in-time copies of the masked database simply by using Snap Clone technology. For any desired point in time we create a snapshot of the secure copy. This incurs no storage overhead except for database blocks that are changed by the replication of DML after the snapshot is taken. The snapshot takes a copy of the original version of a block just before it is overwritten. Typically this is a small percentage of database blocks, likely less than 3%. Note that new blocks do not incur a copy in our snapshot, as new blocks do not overwrite an existing block that has to be preserved.
For each test or development environment we then create a private clone from our snapshot, which provides full access to a virtual copy of the database with read-only shared access to the set of blocks in the snapshot. Any time we want to modify a block referenced via the snapshot, a private copy of that block is created. If we add data and create new blocks, those blocks become part of our private copy.
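Enterprise Manager Snap Clone automates this kind of provisioning. Purely as an illustration of the same copy-on-write idea, and assuming the masked replica is a pluggable database on storage that supports snapshot copies, a thin clone for one test environment could be created directly with SQL (the PDB and path names are hypothetical):

-- Thin, copy-on-write clone of the masked replica for one test environment.
CREATE PLUGGABLE DATABASE test_env1 FROM secure_master
  SNAPSHOT COPY
  FILE_NAME_CONVERT = ('/oradata/secure_master/', '/oradata/test_env1/');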
As a side benefit of our replication, the redo generated on the replica database references only masked versions of the sensitive columns. If we archive the redo from the replica database, we can provision a secure database to any point in time spanned by the retained archived redo.
Summary
Using the solution described above we can produce and maintain a constantly up-to-date, secure master copy of a production database, with minimal storage overhead when combined with Snap Clone or storage snapshot technology.
Caveats
As always, there are some housekeeping tasks:
1. The initial masked database would be generated using normal batch data masking processes.
2. Database triggers to mask sensitive columns would have to be written manually, as these are not generated from current Oracle Data Masking definitions. We can, however, reuse the same stored procedures that perform the masking tasks.
3. If updates to primary keys containing sensitive columns must be supported, an additional column is required on the corresponding replica table.