Friday, 29 November 2013

Instant, Secure, Up-to-date Test Data with Minimal Storage

Introduction

The best source of data for test environments is data from production.  Since truth is stranger than fiction, testing against actual data is far more effective than testing against synthetically generated data.  It is also much easier to copy existing data than to generate new data with the correct format, cardinality and relationships.

Unfortunately there are several challenges when provisioning test data based on production data.

1.   De-sensitizing data is slow.

De-sensitizing data is inherently a long-running process, as all tables containing sensitive columns must be rebuilt with masked replacements. We also need to rebuild all relevant indexes.

2.   Data subsets are difficult to produce.

A subset of data is often all that is required, but sub-setting data can be a non-trivial task unless the application has built-in sub-setting capabilities.

3.   Refreshing test data is resource-intensive and slow.

Traditionally the de-sensitizing process operates as a batch job on the entire dataset.  Each time you want an up-to-date de-sensitized copy of production you have to run this job again, and as data volumes grow this process takes increasingly longer to complete.  The previous de-sensitized copy is of little use and is discarded.

Several customers have asked if it is possible to produce a constantly updated de-sensitized copy of production, to avoid the time and resources spent building from scratch each time a new secure version is required.

Replication and Masking

The solution we propose is to use GoldenGate to replicate DML to a replica database, where the data is masked by DML (INSERT and UPDATE) table triggers. The triggers mask all sensitive column data before it is inserted or updated.
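For illustration, here is a minimal sketch of such a trigger on a hypothetical CUSTOMERS table, assuming the masking routines already used by the batch masking job are packaged as deterministic functions in a package called MASK_PKG:

 CREATE OR REPLACE TRIGGER trg_mask_customers
 BEFORE INSERT OR UPDATE ON customers
 FOR EACH ROW
 BEGIN
   -- Mask each sensitive column before the row is written,
   -- reusing the same routines as the batch masking process.
   :NEW.ssn   := mask_pkg.mask_ssn(:NEW.ssn);
   :NEW.email := mask_pkg.mask_email(:NEW.email);
 END;
 /

Because the trigger fires on every replicated INSERT and UPDATE, sensitive values never land in the replica in clear form.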

There are two challenges that have to be overcome when masking a database maintained via logical replication, both involving UPDATEs.

Logical replication updates the replica by using a WHERE clause which can uniquely identify the correct row in the replica database.  This WHERE clause uses either a primary or unique key.  Note that logical replication reverse-engineers row deltas from the redo log stream, where the actual rows being updated or deleted have already been identified by the original transaction in the source database.
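To make this concrete, suppose (hypothetically) that SSN is the primary key of the CUSTOMERS table. A change captured on the source replays on the replica as something like:

 UPDATE customers
 SET    email = 'new@example.com'
 WHERE  ssn   = '123-45-6789';   -- key value as stored in the source

Since the replica stores mask('123-45-6789') rather than the clear value, this WHERE clause would match no rows, which leads directly to the first problem.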

Problem 1 – the key contains a sensitive column

If the primary or unique key includes a sensitive column then, without interception, the WHERE clause used to replicate a change will reference a value which does not match the corresponding row in the replica database.

To handle UPDATEs where the key used in the WHERE clause references a sensitive column, we simply generate the masked version of that key column before the statement is applied.  This is configured in the GoldenGate replication process, which calls the appropriate stored procedure to mask the column.
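As a sketch (not a tested configuration), the Replicat mapping for the hypothetical CUSTOMERS table might look as follows, assuming MASK_PKG exposes a procedure wrapper MASK_SSN_SP, with an IN parameter p_in and an OUT parameter p_out, around the masking function:

 -- Replicat mapping: mask the key value before the statement is built
 MAP prod.customers, TARGET secure.customers,
 SQLEXEC (SPNAME mask_pkg.mask_ssn_sp, ID m1, PARAMS (p_in = ssn)),
 COLMAP (USEDEFAULTS, ssn = @GETVAL (m1.p_out));

SQLEXEC invokes the stored procedure for each operation and COLMAP substitutes its masked output for the key column. In a real deployment the division of labour between this mapping and the table trigger for the same column would need care, so the value is not masked twice.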


Problem 2 – the key containing a sensitive column is updated directly in the source database

Slightly more challenging is the case where the key containing the sensitive column is updated on the source database.  On the replica database we need to ensure the UPDATE statement has access to both the old and new values of the sensitive column and calls the masking procedure on each, e.g.

 UPDATE <table>
 SET    sensitive_key_column = mask(new.sensitive_key_column)
 WHERE  sensitive_key_column = mask(old.sensitive_key_column);

This solution requires a duplicate of the sensitive key column so we can issue a WHERE clause using the old value and a SET clause using the new value (see the sketch after the list below).  However, for this solution to be necessary, two unusual pre-conditions must both hold:

1.   Primary keys are updated on production.
2.   The primary or unique key contains a sensitive column.
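One possible way to wire this up in GoldenGate, again only a sketch, assumes the hypothetical names used earlier plus a duplicate column SSN_OLD on the replica table:

 -- Extract parameter file: capture before images of updated rows
 GETUPDATEBEFORES

 -- Replicat mapping: mask both the old and new key values
 MAP prod.customers, TARGET secure.customers,
 SQLEXEC (SPNAME mask_pkg.mask_ssn_sp, ID m_new, PARAMS (p_in = ssn)),
 SQLEXEC (SPNAME mask_pkg.mask_ssn_sp, ID m_old, PARAMS (p_in = @BEFORE (ssn))),
 COLMAP (USEDEFAULTS,
         ssn     = @GETVAL (m_new.p_out),
         ssn_old = @GETVAL (m_old.p_out)),
 KEYCOLS (ssn_old);

The WHERE clause is then built from SSN_OLD, the masked old key value, while the SET clause receives the masked new value, mirroring the pseudocode shown above.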


Minimal Storage

With GoldenGate replication and the database trigger-based masking described above, we have a constantly up-to-date secure copy of the production database.

We usually want to support multiple test and development environments, so we also want to reduce the storage those copies consume.

We can provision point-in-time copies of the masked database simply by using Snap Clone technology.  For any desired point in time we create a snapshot of the secure copy.  This incurs no storage overhead except for database blocks that are changed by replicated DML after the snapshot is taken: the snapshot copies the original version of a block just before it is overwritten.  Typically this is a small percentage of database blocks, likely less than 3%.  Note that new blocks do not incur a copy in our snapshot, as a new block does not overwrite an existing block that has to be preserved.
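As a rough illustration using hypothetical sizes: with a 2 TB secure copy and 3% of blocks changing between refreshes, each snapshot accumulates on the order of 60 GB of copied blocks, so keeping ten point-in-time snapshots costs roughly 600 GB rather than the 20 TB ten full copies would require.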

For each test or development environment we then create a private clone from our snapshot, which gives us full access to a virtual copy of the database with read-only shared access to the set of blocks in the snapshot.  Any time we modify a block referenced via the snapshot, a private copy of that block is created.  If we add data and create new blocks, those blocks become part of our private copy.

As a side benefit of our replication, the redo generated on the replica database references only masked versions of sensitive columns.  If we archive the redo from the replica, we can provision a secure database as of any point in time spanned by the retained archived redo.
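One hypothetical way to exploit this, sketched with RMAN and assuming backups of the replica plus its archived redo are retained:

 RMAN> DUPLICATE TARGET DATABASE TO securedb
   UNTIL TIME "TO_DATE('2013-11-28 17:00','YYYY-MM-DD HH24:MI')";

Because recovery applies only the replica's redo, the resulting database contains only masked values in the sensitive columns.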


Summary

Using the solution described above we can produce and maintain a constantly up-to-date, secure master copy of a production database, with minimal storage overhead when combined with Snap Clone or storage snapshot technology.


Caveats

As always, there are some housekeeping tasks:

1.   The initial masked database would be generated using normal batch data masking processes.

2.   Database triggers to mask sensitive columns have to be written manually, as they are not generated from current Oracle Data Masking definitions.  We can, however, reuse the stored procedures that perform the masking itself.

3.   If updates to primary keys containing sensitive columns must be supported, an additional column is required on the corresponding replica table.



