Saturday, 7 December 2013

Continuous and Incremental Data Masking

Many customers are interested in minimizing the impact of the masking process whilst maintaining up-to-date masked copies of production.  Data masking is typically run as a batch process which operates on an entire dataset.  This becomes increasingly inconvenient, time consuming and resource intensive as the database grows and the need for up-to-date secure copies of production data increases.

There are two approaches to address this problem.  One is to implement Continuous Masking using logical replication. I explain how to achieve this with GoldenGate and database triggers in my blog article "Instant, Secure, Up-to-date Test Data with Minimal Storage".

The alternative is to implement Incremental Masking using Change Data Capture (CDC) methods to identify the set of deltas since the current masked replica was last updated.

To generate the deltas you can apply CDC techniques, such as those provided by Oracle Data Integrator, between two point-in-time versions of the source database. You can do this using two database snapshots, or flashback query if the retention period is adequate.  The first snapshot needs to align to the last update on the masked target.  You then apply the masking to the deltas and merge the results into the masked target.

Alternatively you can perform comparisons between the source table and the masked target table, as long as you are careful to avoid the obvious differences due to the masked columns in the target.

It turns out that the DBMS_COMPARISON package, which compares two tables or single-table views, can be used to identify and apply changes between two copies of a table where the second copy has masked columns.  The tables can of course be in different databases.

The key is to compare the source and target tables using a view on the source table that calls the masking functions, so that the values returned by the comparison match the masked values actually stored in the target; otherwise we would get false differences.  A useful side effect is that the deltas generated by the comparison are based on the values returned from the view, so the desired masked values are applied directly when converging with the dbms_comparison.converge() procedure.  If necessary you can instead apply the changes by issuing your own DML based on the deltas identified by the dbms_comparison.compare() procedure.

The process is:

  1. Create a set of masking functions; you could reuse the functions referenced by Oracle Data Masking.
  2. Create a view on the source table calling the relevant masking functions. This view can also be used to populate the masked target initially (a sketch follows below).
  3. Create a comparison using dbms_comparison.create_comparison().
  4. Execute the comparison using dbms_comparison.compare().
  5. Review differences using the user_comparison_row_dif view.
  6. Apply the changes found using dbms_comparison.converge().
This works for all sensitive columns even if the primary or unique key contains a sensitive column.
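
Below is a hedged sketch of steps 2 to 6. The schema (APP), table (CUSTOMERS with a sensitive EMAIL column), index (CUSTOMERS_PK), database link (MASKED_DB) and masking function (mask_email) are all illustrative assumptions, not a prescribed setup.

-- Step 2: a view on the source table that returns the masked values.
-- mask_email is an assumed deterministic masking function.
CREATE OR REPLACE VIEW customers_masked_v AS
SELECT cust_id,
       mask_email(email) AS email,
       cust_city
FROM   customers;

-- Step 3: create the comparison between the masking view and the masked
-- target reached over the MASKED_DB database link.  When the local object is
-- a view, an index on the underlying table must be supplied.
BEGIN
  DBMS_COMPARISON.CREATE_COMPARISON(
    comparison_name    => 'CMP_CUSTOMERS',
    schema_name        => 'APP',
    object_name        => 'CUSTOMERS_MASKED_V',
    dblink_name        => 'MASKED_DB',
    remote_schema_name => 'APP',
    remote_object_name => 'CUSTOMERS',
    index_schema_name  => 'APP',
    index_name         => 'CUSTOMERS_PK');
END;
/

-- Steps 4 and 6: compare, then converge so the remote (masked) copy is
-- brought in line with the masked view of the source.
DECLARE
  l_scan_info  DBMS_COMPARISON.COMPARISON_TYPE;
  l_consistent BOOLEAN;
BEGIN
  l_consistent := DBMS_COMPARISON.COMPARE(
                    comparison_name => 'CMP_CUSTOMERS',
                    scan_info       => l_scan_info,
                    perform_row_dif => TRUE);
  IF NOT l_consistent THEN
    DBMS_COMPARISON.CONVERGE(
      comparison_name  => 'CMP_CUSTOMERS',
      scan_id          => l_scan_info.scan_id,
      scan_info        => l_scan_info,
      converge_options => DBMS_COMPARISON.CMP_CONVERGE_LOCAL_WINS);
  END IF;
END;
/

-- Step 5: review the row differences recorded by COMPARE.
SELECT *
FROM   user_comparison_row_dif
WHERE  comparison_name = 'CMP_CUSTOMERS';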

Caveats

  1. This is primarily intended for masked targets where no independent DML occurs on the target (although this could be accommodated).
  2. There may be issues if we need to coordinate changes to multiple tables which have integrity constraints that could be violated unless all related DMLs are performed in the same transaction.  In this circumstance we can issue DMLs based on the identified deltas.
  3. This can also work for the scenario where a sensitive column is part of a primary key and that primary key is updated in the source table.  This is an unusual scenario but can be catered for with some extra coding.
  4. There are limitations on which type of masking functions can be used.
  5. Some coding is required to setup comparisons and the application of deltas.
  6. I am not sure how performant this is for very large tables.
Let me know if you want some example SQL which illustrates the process.

Friday, 29 November 2013

Instant, Secure, Up-to-date Test Data with Minimal Storage

Introduction

The best source of data for test environments is data from production.  Since truth is stranger than fiction, testing against actual data is far more effective than testing against synthetically generated data.  It is also much easier to copy existing data than to generate new data with the correct format, cardinality and relationships.

Unfortunately there are several challenges when provisioning test data based on production data.

1.   De-sensitizing data is slow.

De-sensitizing data is inherently a long running process as all tables containing sensitive columns must be rebuilt with masked replacements. We also need to rebuild all relevant indexes.

2.   Data subsets are difficult to produce.

A subset of data is often all that is required but sub-setting data can be a non-trivial task unless the application has built in sub-setting capabilities.

3.   Refreshing test data is resource intensive and slow

Traditionally the de-sensitizing process operates as a batch job on the entire dataset.  Each time you want an up to date de-sensitized copy of production you have to run this job again, and as data volumes grow this process takes increasingly longer to complete.  The previous de-sensitized copy is of little use and is discarded.

Several customers have asked if it is possible to produce a constantly updated de-sensitized copy of production, to avoid the time and resources spent building from scratch each time a new secure version is required.

Replication and Masking

The solution we propose is to use GoldenGate to replicate DML to a replica database which is masked using DML (INSERT and UPDATE) table triggers. The triggers take care of masking all the sensitive column data before it is inserted or updated.
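
As a minimal sketch, a replica-side trigger might look like the following. The CUSTOMERS table, its sensitive EMAIL column and the mask_email function are assumptions for illustration; depending on how the Replicat is configured, trigger firing must remain enabled on the replica for this to take effect.

-- Mask the replicated clear-text value before it reaches the replica table.
CREATE OR REPLACE TRIGGER customers_mask_trg
BEFORE INSERT OR UPDATE OF email ON customers
FOR EACH ROW
BEGIN
  :NEW.email := mask_email(:NEW.email);
END;
/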

There are two challenges that have to be overcome when masking a database which is maintained via logical replication, specifically around UPDATEs.

Logical replication updates the replica by using a WHERE clause which can uniquely identify the correct row in the replica database.   This WHERE clause uses either a primary or unique key.  Note logical replication reverse engineers row deltas from redo log streams where the actual rows being updated or deleted have already been identified through interpreting the original transaction in the source database.

Problem 1 – the key contains a sensitive column

If the primary or unique key contains a sensitive column, then without interception the WHERE clause used to replicate the change will reference a value which does not match the corresponding row in the replica database.

To handle these UPDATEs, we simply generate the masked version of the sensitive key column before the WHERE clause is applied.  This is configured in the GoldenGate replication process, which calls the appropriate stored procedure to mask the column.


Problem 2 – the key containing a sensitive column is updated directly in the source database

Slightly more challenging is the case where the key containing the sensitive column is updated on the source database.   On the replica database we need to ensure the UPDATE statement has access to the old and new values of the sensitive column and calls the masking procedure on both, e.g.:

 UPDATE <table> set sensitive_key_column = mask(new.sensitive_key_column)
 WHERE sensitive_key_column = mask(old.sensitive_key_column);

This solution requires a duplicate of the sensitive key column so we can issue a WHERE clause using the old value and a SET clause using the new value.  However, this solution is only necessary when two unusual pre-conditions hold:

1.   Updates to primary keys on production
2.   Primary or Unique key contains a sensitive column


Minimal Storage

With GoldenGate replication and database trigger based masking as described above we have a constantly up-to-date secure copy of the production database.

We usually want to support multiple test and development environments so we also want to reduce the storage incurred for these copies.

We can provision point in time copies of the masked database simply by using Snap Clone technology.   For any desired point in time we create a snapshot of the secure copy.  This incurs no storage overhead except for database blocks that are changed as a result of replication of DML following the time the snapshot is taken.  The snapshot takes a copy of the original version of a block just before it is overwritten.  Typically this is a small percentage of database blocks, likely less than 3%.  Note any new blocks do not incur a copy in our snapshot as new blocks do not overwrite an existing block that has to be preserved.

For each test or development environment we then create a private clone from our snapshot which provides us full access to a virtual copy of the database with read only shared access to the set of blocks in the snapshot.  Anytime we want to modify a block referenced via the snapshot a private copy of that block is created.  If we add data and create new blocks then those blocks become part of our private copy.

As a side benefit of our replication, the redo generated on our replica database will reference only masked versions of sensitive columns.  If we archive the redo from the replica database we could enable provisioning a secure database to any point in time spanned by the retained archived redo.


Summary

Using the solution described above we can produce and maintain a constantly up to date and secure master copy of a production database with minimal storage overhead if combined with Snap Clone or storage snapshot technology.


Caveats

As always there are some housekeeping tasks

1.   The initial masked database would be generated using normal batch data masking processes.

2.   Database triggers to mask sensitive columns would have to be written manually as these are not generated from current Oracle Data Masking definitions.  We can though reuse the same stored procedures that perform the same masking tasks.

3.   If supporting updates to primary keys containing sensitive columns then an additional column is required on the corresponding replica table.




Friday, 27 September 2013

Data Masking - overcoming the challenges

Everyone agrees that when provisioning test databases as full or partial copies of production data, any sensitive data (that is, personally identifiable or company confidential data) must be anonymised.  There are lots of terms for this process including scramble, obfuscate, de-sensitize, blind, redact and mask.  Encryption does not usually apply as it is designed to be reversible, if you can remember the key.  I will use the term masking in the remainder of this article.

Masking data is a relatively straightforward although generally i/o intensive process.  That is because when you select a column to mask you have to update that column value in all the rows in the relevant table.

The primary drawback is not the data processing involved in masking; it is ensuring that all sensitive columns are masked.  There is always a niggling concern that you may have missed a sensitive column and gone ahead and published on the assumption that all data is masked.  There is no automated method to allay your fears.

For relational data, if you choose to mask a column which is used in a join condition you need to mask both columns, replacing each corresponding entry with the same masked value (deterministic masking).  This is necessary to ensure that your SQL will operate correctly when joining two masked columns.   It is arguable that you should not join columns which contain sensitive data but I imagine it does happen.

The join requirement means that for any sensitive column we need to understand the relationships between that column and all other columns with which it might be joined.  In general this requires a data model, which is usually not available or is difficult to build, in particular if the table relationships are defined in the application logic and not declaratively in the database.

So we have to:
  1. Identify all sensitive columns
  2. Identify all join relationships between sensitive columns so we can mask them using the same algorithm

Masking identified sensitive columns is known as a blacklist approach to masking, where only data you have identified as sensitive is masked.  An alternative is to use a whitelist approach where by default all data is masked except for columns which are listed in the whitelist.

The benefits of a whitelist approach are:
  1. No need to identify all sensitive columns; in the worst case you will mask data that is not sensitive.
  2. If new sensitive columns are added they will be masked by default.
  3. If used in conjunction with deterministic masking functions there is no need to model relationships.  The data model for a whitelist just has to contain a list of all the tables to be masked.
Sounds too good to be true?  Well, it isn’t all plain sailing.  You would still need to:

  1. Assign a masking function for each and every column, although a default function could be applied depending on data type.
  2. Identify columns that do not need to be masked.  This is not strictly necessary; the benefits would be slightly better masking performance and perhaps easier support of test cases that examine or report non-sensitive data.
  3. Use deterministic masking functions to avoid the need to identify relationships (a sketch of such a function follows below).  A deterministic masking function will produce the same output for the same input.  This is also necessary when masking data consistently across multiple databases.  The other benefit is that deterministic masking functions preserve cardinality, which is essential when running performance tests as the data cardinality will impact the SQL execution plans.
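
As referenced in point 3 above, here is a minimal sketch of a deterministic masking function based on a cryptographic hash. The function name, hash algorithm and output format are illustrative assumptions, and DBMS_CRYPTO requires an explicit EXECUTE grant.

-- The same input always yields the same output, so joins and cardinality are
-- preserved across tables and databases.
CREATE OR REPLACE FUNCTION mask_email (p_email IN VARCHAR2)
  RETURN VARCHAR2 DETERMINISTIC
IS
BEGIN
  IF p_email IS NULL THEN
    RETURN NULL;
  END IF;
  RETURN 'user_'
         || SUBSTR(RAWTOHEX(DBMS_CRYPTO.HASH(
                     UTL_I18N.STRING_TO_RAW(p_email, 'AL32UTF8'),
                     DBMS_CRYPTO.HASH_SH1)), 1, 16)
         || '@example.com';
END;
/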

Using a whitelist approach in combination with deterministic masking will preserve security by default and avoid the need to determine table relationships.  The downside is columns missing from the whitelist will be masked unnecessarily and that may extend the masking process execution time although not by much as a large portion of the masking process is rebuilding tables and dependent objects such as indexes.

Using production data as the basis for test data is by far the best approach, as long as you can guarantee security of the copy.   For realistic performance tests it is essential to use test data based on production.  The additional benefit is that you do not have to generate artificial data which itself is non-trivial and leads to artificial results.

With the addition of a few user defined masking functions Oracle Data Masking supports this approach.  Why not give it a try?

Small Data

What is Small Data?  Databases have a tendency to keep growing, database performance needs to be constantly maintained or improved, and storage, certainly high performance storage, is not as cheap as we would like to believe.  The aim of Small Data is to manage the storage costs and performance of production databases and their copies.

When it comes to Big Data people talk in terms of volume, velocity and variety.  When it comes to Small Data we need to talk in terms of shrinking, archiving, purging, compressing and thin-provisioning.

Below are the methods to achieve Small Data summarized in this article.


  1. Drop unused Indexes
  2. Drop useless Indexes
  3. Remove Orphan Data
  4. Trim over Provisioned Data files
  5. Implement Retention Periods through Archive and Purge
  6. Tiered Compression
  7. Tiered Storage
  8. Snap Clone, Thin Provisioning


1.   Drop unused Indexes

Dropping indexes that are not used will not only save space but also reduce overhead during DML operations.

You can monitor index usage over a period of time and drop, or temporarily hide, any indexes which are not referenced.   Indexes need to be enabled for usage monitoring with ALTER INDEX index_name MONITORING USAGE.

If you don’t want to do this in production you can do it in test and run an SPA workload through it.
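
A hedged example follows; the index name is hypothetical. In 11g the results are exposed through V$OBJECT_USAGE (visible to the owning schema), while 12c adds the USER_OBJECT_USAGE and DBA_OBJECT_USAGE views.

-- Enable monitoring, run a representative workload, then check usage.
ALTER INDEX sales_cust_idx MONITORING USAGE;

SELECT index_name, table_name, monitoring, used
FROM   v$object_usage;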


2.    Drop useless Indexes

Sometimes indexes get in the way of performance.  The optimizer may choose the index by mistake or the index may be explicitly referenced via a SQL hint.  One method to test the validity of an index is to use the ALTER INDEX index_name INVISIBLE command and run a workload using SPA to test the impact.  The index will still be maintained but will not be referenced.
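
For example, assuming a suspect index called SALES_PROMO_IDX:

-- The index is still maintained while invisible, so it can be re-enabled
-- instantly if the SPA results show a regression.
ALTER INDEX sales_promo_idx INVISIBLE;
-- ... run the SPA workload ...
ALTER INDEX sales_promo_idx VISIBLE;   -- or DROP INDEX sales_promo_idx;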


3.   Remove Orphan Data

Orphan data consists of rows, usually in child tables with a foreign key relationship, where there is no corresponding row in the parent table.  This can happen for all sorts of reasons but usually indicates data inconsistency.  Such orphaned rows will not be retrieved as part of a query through the foreign key join condition.

You can enable a foreign key to check for consistency for all DML subsequent to the foreign key being enabled, or you can validate a foreign key to retrospectively check the existing data for orphan rows.


It is also worth noting that declaring constraints can improve query performance, as constraints are used by the optimizer when determining the optimal execution plan.
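
As a hedged illustration, using hypothetical ORDERS and ORDER_ITEMS tables:

-- Find child rows with no matching parent row.
SELECT i.*
FROM   order_items i
WHERE  NOT EXISTS (SELECT 1 FROM orders o WHERE o.order_id = i.order_id);

-- After cleaning up, enforce the relationship for new DML without checking
-- existing rows, then retrospectively validate the existing data.
ALTER TABLE order_items ADD CONSTRAINT order_items_fk
  FOREIGN KEY (order_id) REFERENCES orders (order_id)
  ENABLE NOVALIDATE;

ALTER TABLE order_items MODIFY CONSTRAINT order_items_fk VALIDATE;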


4.    Trim over Provisioned Data files

Free space in a tablespace can be “reclaimed” by resizing the data files, provided the free space sits at the end of the files.
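
A minimal example, with a hypothetical file path and size:

-- Shrink a data file back towards its high water mark.  ORA-03297 is raised
-- if used extents still sit beyond the requested size.
ALTER DATABASE DATAFILE '/u01/oradata/PROD/users01.dbf' RESIZE 500M;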



5.   Implement Retention Periods through Archive and Purge

Enforcing data retention periods means you can remove, or at least relocate, data which is no longer or only very infrequently required.  This can result in storage savings and performance improvements.

Archiving typically involves relocating data from the table referenced by the application to another table from where it is still accessible.

Purging data involves removing it from the database but it may be backed up offline.

It is generally non-trivial to ensure that you purge all related data for a set of rows.  You need a data model to ensure all ‘records’ are archived or purged together with all necessary related data, and that the process does not remove any data which is still relevant to the retained data.

See this blog for more details on reducing EBS database size:


See this PPT for more details on Archiving and Purging PeopleSoft (or any OLTP) data:



6.   Tiered Compression

The primary compression options for relational data are row and column.  The usual trade off for compression is the write performance overhead of maintaining modified compressed data versus the storage savings and query performance gain.  Different levels of compression, which yield increasingly higher compression ratios, can be aligned to the modification frequency of the data using tiered compression.  Hence as data ages from a maintenance perspective it can be compressed more aggressively.   Oracle Database 12c has Automatic Data Optimization which takes care of this behind the scenes, or you can be more hands on using Partitioning and Advanced Compression to define how and when data moves between compression (and storage) tiers.
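
A hedged sketch of 12c Automatic Data Optimization policies follows; the SALES table, the time windows and the compression levels are illustrative, heat map tracking must be enabled, and the columnar (HCC) level requires supported storage.

-- Heat map tracking feeds the ADO policies.
ALTER SYSTEM SET heat_map = ON;

-- Row-level policy: advanced row compression once rows stop being modified.
ALTER TABLE sales ILM ADD POLICY
  ROW STORE COMPRESS ADVANCED ROW
  AFTER 30 DAYS OF NO MODIFICATION;

-- Segment-level policy: heavier columnar compression for colder segments.
ALTER TABLE sales ILM ADD POLICY
  COLUMN STORE COMPRESS FOR QUERY HIGH SEGMENT
  AFTER 90 DAYS OF NO MODIFICATION;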


7.    Tiered Storage

Most customers I come across use the same premium storage for all their data regardless of how old it is or whether it is production or test.  This is a side effect of centralized storage provisioning via SAN and NAS devices.  The average cost per TB for storage can be dramatically reduced if you align the age and value of the data to the appropriate storage tier, including flash, disk (performance, capacity) and tape.  This may even allow you to increase performance by investing in more flash capacity to address the active data set.



8.    Snap Clone, Thin Provisioning

This is possibly the easiest method to dramatically save on storage costs.  Snap clone is typically used to provision non-production or semi-production full database copies using thin provisioning methods to give full access to a copy of the source data, a test master.  Using copy-on-write technology each clone only consumes additional storage when it modifies data.  This means the storage overhead per clone is tiny, particularly if it is used for functional testing, training or reporting.

Snap clone also has the benefit of near instant provisioning.  Note that Oracle Database 12c can provision a cloned Pluggable Database near instantaneously if using a Snap Clone enabled file system.
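
As a hedged 12c illustration (the PDB names and paths are hypothetical, and the source files must reside on storage that supports snapshot copies):

-- Thin, near-instant clone of an existing PDB using storage snapshots.
CREATE PLUGGABLE DATABASE testpdb FROM masterpdb
  FILE_NAME_CONVERT = ('/u01/oradata/masterpdb/', '/u01/oradata/testpdb/')
  SNAPSHOT COPY;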





Wednesday, 18 September 2013

Relational Data Management - Partition, Compress, Storage Tier and Snap Clone your way to success

We are finding a lot of customers are answering yes to most of the following questions:
  1. Is your growing data volume inhibiting application performance?
  2. How long will it be before your database doubles in size?
  3. Are you required to retain data for extended periods and still make it accessible to end users?
  4. Do you have multiple full size copies of production databases for test, development, training, etc?
  5. Is storage consuming an increasing portion of your IT budget?
To quantify this problem we can run some scripts which generate an easy to assimilate graphical report.




What can be done?

  • Data Partitioning can reduce the active database footprint by 40% for most transaction based applications including E-Business Suite, Siebel and PeopleSoft
  • Data Compression can reduce transaction data storage consumption by 30%
  • Tiered Storage can reduce storage costs by 73% over premium storage whilst improving performance
  • Snap Clone can dramatically reduce the storage costs for non-production database copies and reduce provisioning time

This is one of those rare problems where the solution can save money and improve performance.   Let me know if you would like a similar report for one of your growing Oracle OLTP databases.


Monday, 25 March 2013

Ensuring database performance stability


Many DBA teams manage databases which support applications that generate great dollops of non-shareable SQL, i.e. SQL made unique by literal values, syntax differences or references to temporary tables.   In OLTP environments non-shareable SQL is a performance inhibitor and can cause throughput bottlenecks.

Oracle database performance stability relies largely on:
  1. Good schema design
  2. Comprehensive and up-to-date optimizer statistics
  3. Shareable SQL (particularly for OLTP; for data warehouses it does not really matter)
  4. Plan Management or SQL Profiles for regression exceptions that cannot be handled by optimizer statistics, table structuring (e.g. partitions, index organized tables) or derived schema objects (e.g. indexes, materialized views, cubes)

The goal is to manage performance using the 90-9-1 rule which is 90% optimizer statistics, 9% enhanced optimizer statistics and 1% exceptions.

SQL Plan Management allows you to fix a plan to a given sql_id which means it is good for matching shareable SQL, literal SQL which executes very frequently or SQL containing some literal values that do not change frequently.

SQL Profiles can use a force_match signature to match all statements which are essentially the same but differ only in their literal values.   In other words, statements that could use bind variables.   Profiles can match statements with all literals, or all binds (where force_match is not needed).  However, for statements containing a mixture of literals and binds the literals are not normalized, so statements differing only in their literal values do not share the same force_match signature.  This is a documented limitation.  One solution is to use cursor_sharing but this may have other implications.
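
As a hedged example, a profile recommended by the SQL Tuning Advisor can be accepted with force matching enabled; the task and profile names here are hypothetical.

-- Accept the recommended profile so that it matches every statement sharing
-- the same force_matching_signature (i.e. differing only in literal values).
DECLARE
  l_profile_name VARCHAR2(30);
BEGIN
  l_profile_name := DBMS_SQLTUNE.ACCEPT_SQL_PROFILE(
                      task_name   => 'my_tuning_task',
                      name        => 'profile_orders_by_cust',
                      force_match => TRUE);
END;
/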

When performing a database upgrade you can use the upgrade project as an opportunity to stabilise performance going forward.  Even if the upgrade also involves a hardware refresh, where overall performance is likely to improve anyway, the opportunity presents itself to reduce your tuning headaches for long after the upgrade.

SQL Performance Analyzer (SPA) is an ideal method to test different target scenarios using actual production SQL.  Although SPA can capture any workload via a SQL Tuning Set (STS), with literal SQL the STS is filled with single execution statements, so needs to be regenerated to aggregate the execution counts and eliminate duplicates (i.e. statements having the same force_match signature) in order to produce an accurate weighted performance report.

So for all of you upgrading a database version underpinning an application which pumps out challenging SQL:
  1. Capture a STS and eliminate duplicates and aggregate execution counts (to preserve integrity of the SPA weighted performance report)
  2. Execute STS using SPA in target environment ignoring all hints (break free)
  3. Use the optimizer settings for the target release (don’t hold it back)
  4. Ensure regressed SQL is the tiny minority of SQL (use optimizer statistics and derived schema objects until this is the case)
  5. Tune any remaining SQL using SQL profiles (if you must add hints add hints to the SQL profiles)
  6.  Consider raising performance bugs with the application vendor if the above is not possible
So, do not just upgrade the database, leverage the database to improve performance stability and do it now.

Thursday, 7 February 2013

Hide your valuables

It's always tempting to use copies of production data for test purposes rather than generating data because functional and load tests, in particular, are then more realistic.   The challenge is always ensuring that no sensitive data, i.e. personal or company confidential data, is visible.

Data masking is the most secure method to irreversibly hide such data but often there are challenges around:
  1. Ensuring that all sensitive data is masked and nothing is missed
  2. The process of masking is heavy duty as all rows in any table holding sensitive data must be processed
  3. If sensitive columns are used in join conditions then the columns from the tables which are joined must be masked using the same values to ensure that queries continue to work correctly post masking.
The solution to issue number 3 is often to build a data model to understand the relationships between columns in tables.  In general this sits in the "too hard" bucket; whether the application is 3rd party or custom, the data model often proves elusive because it is difficult to build (or reverse engineer).

One solution to this problem is to only model relationships which are relevant to sensitive columns.  First identify all sensitive columns (iteratively) and build a list or table of these columns uniquely identified by schema.table.column.  Alternatively you can add or prepend a 'SENSITIVE' label to each relevant column using the table comment facility, e.g.:

COMMENT ON COLUMN customers.email_id IS 'SENSITIVE';

Then you need to confirm which if any of these columns are used in a join condition for a given workload.  The best method to do this is to identify columns used in join operations using a SQL Tuning Set in your test environment:

-- display contents of SQL Tuning Set                               
select sql_text from TABLE(DBMS_SQLTUNE.select_sqlset ( 'MY_STS' ));


SQL_TEXT
-------------------------------------------------------
SELECT c.cust_id, c.cust_city, s.amount_sold                                                                                                                 
  FROM sales s, customers c, products p , promotions po 
 WHERE s.cust_id=c.cust_id
   AND s.prod_id=p.prod_id
   AND s.promo_id =po.promo_id
   AND c.country_id='US'
   AND c.cust_state_province='CA'
   AND p.prod_name='O/S Documentation Set - English' 
   AND po.promo_name='internet promotion #29-350'



-- reset column usage information             
exec dbms_stats.reset_col_usage( USER, NULL );

--  gather information on columns used in filters, joins, etc, from the SQL
--  contained in the tuning set 'MY_STS'                                   
exec dbms_stats.seed_col_usage( 'MY_STS', USER );                          

-- display the results with ...                          
select dbms_stats.report_col_usage( USER, NULL ) from dual;


-- or with ...                                             

select 
  o.name               "TABLE",
  c.name               "COLUMN", 
  u.equijoin_preds     EQ_JOIN,
  u.nonequijoin_preds  NONEQ_JOIN
from 
        sys.col_usage$ u
  join  sys.obj$       o on u.obj# = o.obj#
  join  sys.col$       c on u.obj# = c.obj# and u.intcol# = c.col#
where
  o.obj# in (select object_id from user_objects)
and
  (u.equijoin_preds = 1 or u.nonequijoin_preds = 1);



TABLE      COLUMN       EQ_JOIN  NONEQ_JOIN
---------- ---------- ---------- ----------
PRODUCTS   PROD_ID             1          0
CUSTOMERS  CUST_ID             1          0
SALES      PROD_ID             1          0
SALES      CUST_ID             1          0
SALES      PROMO_ID            1          0
PROMOTIONS PROMO_ID            1          0




We can correlate this result with our list of sensitive columns to see if any are involved in a join, which would then require the same masked values to be applied to corresponding entries in each table.   This can be done by adding a relationship to the data model used when performing data masking.  Note we do not need to build a complete data model, just one with the relationships that are relevant for masking.

The above does not tell us which table/column combinations are joined and we cannot rely on column names which may not match.  To find the actual join conditions you can query the SQL text in the SQL Tuning Set for the sensitive column names identified above.   The final step would be to run a functional test on the masked data to ensure that the results are as anticipated.
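
For example, reusing the EMAIL_ID column flagged earlier with a comment, the captured SQL text can be searched directly:

-- List captured statements that reference the sensitive column.
SELECT sql_id, sql_text
FROM   TABLE(DBMS_SQLTUNE.select_sqlset('MY_STS'))
WHERE  UPPER(sql_text) LIKE '%EMAIL_ID%';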

Masking sensitive data not only allows you to use production data for test purposes but also allows you to share that data with a third party for data enrichment, off-site benchmarking or proof of concept purposes.

Tuesday, 5 February 2013

Start dropping hints

The most common reason for database performance degradation over time is data growth.   Databases have a habit of getting larger rather than smaller and the database server has to work smarter to sift out and operate on the minimal set of data to satisfy queries.   Ask yourself, "if this database were smaller would it run faster?"

The Oracle database calculates the best execution plan - the plan that attracts the least work - through the use of optimizer statistics.   These statistics tell the optimizer how large tables are, how many rows they contain, minimum and maximum values, cardinality (selectivity), etc.  Without these statistics the optimizer is blind and has to rely on some very basic assumptions.

I often work with customers who are upgrading or migrating their databases and discover that many do not maintain optimizer statistics, sometimes on purpose.  The primary concern is usually around performance regressions, to the degree that they also try and peg the database back by setting compatibility and optimizer version to a previous release.

Of course even with good statistics the optimizer may not choose the optimal or even a good plan but that is the exception.  The accepted goal is to have less than 1% of your SQL using execution plan influencing features, such as Plan Management or SQL Profiles.   The most common method to influence the execution plan is to use hints in the SQL statement.  In practice hints are used not because the optimizer cannot produce a good plan, but because it is compromised by inadequate statistics gathering and maintenance.

Adding correct hints to a SQL statement is a skilled task and should be treated as a short term measure and only used if a good plan cannot be produced using correct statistics.  The need for a hint in these circumstances may even be reason to raise a bug.  A hint is a countermeasure for an optimizer limitation or bug.

It should be possible to preserve performance in spite of growing data volumes and shifting selectivity purely by leveraging the optimizer and feeding it good statistics.  If you have a lot of hinted SQL and out of date, missing or inadequate statistics (note that later database releases introduce more effective statistics gathering and recording) then you can easily test the impact of dropping those hints and letting the optimizer do its best with new statistics, by enabling pending statistics and setting _optimizer_ignore_hints at session level. This is most easily performed using a SQL Tuning Set and SQL Performance Analyzer.
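
As a hedged sketch (the SALES table is illustrative, and _optimizer_ignore_hints is an underscore parameter so test it in a controlled environment):

-- Gather statistics as pending (not published) for the table under test.
BEGIN
  DBMS_STATS.SET_TABLE_PREFS(USER, 'SALES', 'PUBLISH', 'FALSE');
  DBMS_STATS.GATHER_TABLE_STATS(USER, 'SALES');
END;
/

-- Use the pending statistics and ignore embedded hints in this session only,
-- then replay the SQL Tuning Set with SQL Performance Analyzer.
ALTER SESSION SET optimizer_use_pending_statistics = TRUE;
ALTER SESSION SET "_optimizer_ignore_hints" = TRUE;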

Give it a try, you may be surprised.  I have seen customers getting orders of magnitude improvements.  Not only can they simplify their SQL but database performance becomes managed by the optimizer and the DBAs.