Friday, 27 September 2013

Data Masking - overcoming the challenges

Everyone agrees that when provisioning test databases as full or partial copies of production data, it is essential that any sensitive data, that is, personally identifiable or company-confidential data, is anonymised.  There are lots of terms for this process, including scramble, obfuscate, de-sensitize, blind, redact and mask.  Encryption does not usually apply, as it is designed to be reversible, provided you can remember the key.  I will use the term masking in the remainder of this article.

Masking data is a relatively straightforward, although generally I/O-intensive, process.  That is because when you select a column to mask you have to update that column's value in every row of the relevant table.

The primary drawback is not the data processing involved in masking; it is ensuring that all sensitive columns are masked.  There is always a niggling concern that you may have missed a sensitive column and then gone ahead and published on the assumption that all data is masked.  There is no automated method to allay your fears.

For relational data, if you choose to mask a column which is used in a join condition, you need to mask both columns, replacing each corresponding entry with the same masked value (deterministic masking).  This is necessary to ensure that your SQL will still operate correctly when joining the two masked columns.  It is arguable that you should not join on columns which contain sensitive data, but I imagine it does happen.

The join requirement means that for any sensitive column we need to understand the relationships between that column and all other columns with which it might be joined.  In general this requires a data model, which is usually either unavailable or difficult to build, particularly if the table relationships are defined in the application logic rather than declaratively in the database.

So we have to:
  1. Identify all sensitive columns
  2. Identify all join relationships between sensitive columns so we can mask them using the same algorithm

Masking identified sensitive columns is known as a blacklist approach to masking, where only data you have identified as sensitive is masked.  An alternative is to use a whitelist approach where by default all data is masked except for columns which are listed in the whitelist.

The benefits of a whitelist approach are:
  1. No need to identify all sensitive columns; in the worst case you will mask data that is not sensitive.
  2. If new sensitive columns are added they will be masked by default.
  3. If used in conjunction with deterministic masking functions there is no need to model relationships.  The data model for a whitelist just has to contain a list of all the tables to be masked.
It sounds too good to be true, and indeed it isn’t all plain sailing.  You would still need to:

  1. Assign a masking function to each and every column, although a default function could be applied depending on data type.
  2. Identify columns that do not need to be masked.  This is not strictly necessary; the benefits would be slightly better masking performance and perhaps easier support of test cases that examine or report non-sensitive data.
  3. Use deterministic masking functions to avoid the need to identify relationships.  A deterministic masking function will produce the same output for the same input.  This is also necessary when masking data consistently across multiple databases.  The other benefit is that deterministic masking functions preserve cardinality, which is essential when running performance tests, as data cardinality will impact the SQL execution plans (see the sketch after this list).
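
The following is a rough sketch of a deterministic masking function in PL/SQL.  The salt value, table names and column names are hypothetical, and a real implementation would protect the salt and respect each column's length and format (it also requires EXECUTE privilege on DBMS_CRYPTO).

  -- Deterministic: the same input always yields the same masked output,
  -- so joins and cardinality are preserved across tables and databases
  CREATE OR REPLACE FUNCTION mask_text (p_value IN VARCHAR2)
    RETURN VARCHAR2 DETERMINISTIC
  IS
  BEGIN
    -- Preserve NULLs so outer joins and NOT NULL semantics are unchanged
    IF p_value IS NULL THEN
      RETURN NULL;
    END IF;
    -- SHA-1 of a secret salt plus the input, rendered as hex and trimmed
    RETURN SUBSTR(
             RAWTOHEX(
               DBMS_CRYPTO.HASH(
                 UTL_RAW.CAST_TO_RAW('my_secret_salt' || p_value),
                 DBMS_CRYPTO.HASH_SH1)),
             1, 20);
  END mask_text;
  /

  -- Applying the same function to corresponding columns in different
  -- tables keeps masked join columns consistent
  UPDATE customers SET surname = mask_text(surname);
  UPDATE orders    SET customer_surname = mask_text(customer_surname);
  COMMIT;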

Using a whitelist approach in combination with deterministic masking will preserve security by default and avoid the need to determine table relationships.  The downside is that columns missing from the whitelist will be masked unnecessarily, which may extend the masking execution time, although not by much, as a large portion of the masking process is rebuilding tables and dependent objects such as indexes.

Using production data as the basis for test data is by far the best approach, as long as you can guarantee the security of the copy.   For realistic performance tests it is essential to use test data based on production.  The additional benefit is that you do not have to generate artificial data, which is itself non-trivial and leads to artificial results.

With the addition of a few user-defined masking functions, Oracle Data Masking supports this approach.  Why not give it a try?

Small Data

What is Small Data?  Databases have a tendency to keep growing, database performance needs to be constantly maintained or improved, and storage, certainly high-performance storage, is not as cheap as we would like to believe.  The aim of Small Data is to manage the storage costs and performance of production databases and their copies.

When it comes to Big Data people talk in terms of volume, velocity and variety.  When it comes to Small Data we need to talk in terms of shrinking, archiving, purging, compressing and thin-provisioning.

Below is a summary of the methods for achieving Small Data that are covered in this article.


  1. Drop unused Indexes
  2. Drop useless Indexes
  3. Remove Orphan Data
  4. Trim over Provisioned Data files
  5. Implement Retention Periods through Archive and Purge
  6. Tiered Compression
  7. Tiered Storage
  8. Snap Clone, Thin Provisioning


1.   Drop unused Indexes

Dropping indexes that are not used will not only save space but also reduce overhead during DML operations.

You can monitor index usage over a period of time and drop, or temporarily hide, any indexes which are not referenced.   Indexes need to be enabled for usage monitoring with “ALTER INDEX ‘index’ MONITORING USAGE”.
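
For example, using a hypothetical index name; the V$OBJECT_USAGE view then reports whether the index was referenced while monitoring was enabled.

  ALTER INDEX sales_cust_ix MONITORING USAGE;

  -- ... run a representative workload ...

  -- USED = 'YES' means the optimizer referenced the index at least once
  SELECT index_name, used, monitoring, start_monitoring
  FROM   v$object_usage
  WHERE  index_name = 'SALES_CUST_IX';

  -- Switch monitoring off once you have your answer
  ALTER INDEX sales_cust_ix NOMONITORING USAGE;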

If you don’t want to do this in production you can do it in a test copy and replay a workload through it using SQL Performance Analyzer (SPA).


2.    Drop useless Indexes

Sometimes indexes get in the way of performance.  The optimizer may choose the index by mistake or the index may be explicitly referenced via a SQL hint.  One method to test the validity of indexes is to use the “ALTER INDEX ‘index’ INVISIBLE” command and run a workload using SPA to test the impact.  The index will still be maintained but will not be referenced.
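
For example, with a hypothetical index name:

  -- Hide the index from the optimizer; it is still maintained by DML
  ALTER INDEX sales_status_ix INVISIBLE;

  -- ... replay the workload with SPA and compare plans and timings ...

  -- Revert instantly if the index turns out to be useful after all
  ALTER INDEX sales_status_ix VISIBLE;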


3.   Remove Orphan Data

Orphan data consists of rows, usually in child tables with a foreign key relationship, for which there is no corresponding row in the parent table.  This can happen for all sorts of reasons but usually indicates data inconsistency.  Such orphaned rows will not be retrieved by a query that joins through the foreign key condition.

You can enable a foreign key to check consistency for all DML subsequent to the constraint being enabled, or you can validate the foreign key to retrospectively check the existing data for orphan rows.
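
A minimal sketch, assuming hypothetical orders and order_lines tables related by a foreign key constraint named fk_ol_order:

  -- Count child rows that have no corresponding parent row
  SELECT COUNT(*)
  FROM   order_lines ol
  WHERE  NOT EXISTS (SELECT 1 FROM orders o WHERE o.order_id = ol.order_id);

  -- Enforce the constraint for new DML without checking existing rows...
  ALTER TABLE order_lines ENABLE NOVALIDATE CONSTRAINT fk_ol_order;

  -- ...then, once the orphans have been removed, validate the existing data
  ALTER TABLE order_lines MODIFY CONSTRAINT fk_ol_order VALIDATE;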


It is also worth noting that declaring constraints can improve query performance, since constraints are used by the optimizer when determining the optimal execution plan.


4.    Trim over Provisioned Data files

If a tablespace has free space at the end of its data files, it can be “reclaimed” by resizing the files down.
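
For example, with a hypothetical data file path; the resize will fail if any extents still sit beyond the target size.

  -- Check how large the files currently are before shrinking
  SELECT file_name, bytes/1024/1024 AS size_mb
  FROM   dba_data_files
  WHERE  tablespace_name = 'USERS';

  -- Shrink the file back towards the high water mark of its used space
  ALTER DATABASE DATAFILE '/u01/oradata/PROD/users01.dbf' RESIZE 2G;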



5.   Implement Retention Periods through Archive and Purge

Enforcing data retention periods means you can remove, or at least relocate, data which is either no longer required or only very infrequently required.  This can result in storage savings and performance improvements.

Archiving typically involves relocating data from the table referenced by the application to another table from where it is still accessible.

Purging data involves removing it from the database but it may be backed up offline.

It is generally non-trivial to ensure that you purge all related data for a set of rows.  This is because you need a data model to ensure that all ‘records’ are archived or purged together with all necessary related data, and that the process does not remove any data which is still relevant to the retained data.
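
As a simple sketch, assuming a hypothetical orders table, a matching orders_archive table and a seven-year retention period; in practice the related child rows would need the same treatment within the same transaction.

  -- Relocate rows older than the retention period to the archive table
  INSERT INTO orders_archive
  SELECT *
  FROM   orders
  WHERE  order_date < ADD_MONTHS(SYSDATE, -84);

  -- Purge them from the table the application references
  DELETE FROM orders
  WHERE  order_date < ADD_MONTHS(SYSDATE, -84);

  COMMIT;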

See this blog for more details on reducing EBS database size:


See this PPT for more details on Archiving and Purging PeopleSoft (or any OLTP) data:



6.   Tiered Compression

The primary compression options for relational data are row and column compression.  The usual trade-off is the write-performance overhead of maintaining modified compressed data versus the storage savings and query performance gains.  With tiered compression, different levels of compression, which yield increasingly higher compression ratios, can be aligned to the modification frequency of the data.  Hence as data ages from a maintenance perspective it can be compressed more aggressively.   Oracle 12c has Automatic Data Optimization, which takes care of this behind the scenes, or you can be more hands-on, using Partitioning and Advanced Compression to define how and when data moves between compression (and storage) tiers.
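
A minimal sketch of 12c Automatic Data Optimization policies on a hypothetical orders table; it assumes heat map tracking is enabled and, for the columnar policy, storage that supports Hybrid Columnar Compression.

  -- Track segment and row level access so the policies below can fire
  ALTER SYSTEM SET HEAT_MAP = ON;

  -- Lightly compress data that has not been modified for 90 days
  ALTER TABLE orders ILM ADD POLICY
    ROW STORE COMPRESS ADVANCED
    SEGMENT AFTER 90 DAYS OF NO MODIFICATION;

  -- Compress far more aggressively after a year with no access at all
  ALTER TABLE orders ILM ADD POLICY
    COLUMN STORE COMPRESS FOR ARCHIVE HIGH
    SEGMENT AFTER 12 MONTHS OF NO ACCESS;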


7.    Tiered Storage

Most customers I come across use the same premium storage for all their data regardless of how old it is or whether it is production or test.  This is a side effect of centralized storage provisioning via SAN and NAS devices.  The average cost per TB for storage can be dramatically reduced if you align the age and value of the data to the appropriate storage tier, including flash, disk (performance, capacity) and tape.  This may even allow you to increase performance by investing in more flash capacity to address the active data set.
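
For example, a hypothetical older partition could be relocated to a tablespace created on a lower-cost storage tier:

  -- Move a hypothetical 2010 partition to a tablespace on the capacity tier
  ALTER TABLE orders MOVE PARTITION orders_2010 TABLESPACE tbs_capacity_tier;

  -- Local index partitions are left unusable by the move and need rebuilding
  ALTER INDEX orders_date_ix REBUILD PARTITION orders_2010 TABLESPACE tbs_capacity_tier;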



8.    Snap Clone, Thin Provisioning

This is possibly the easiest method to dramatically save on storage costs.  Snap clone is typically used to provision non-production or semi-production full database copies, using thin-provisioning methods to give full access to a copy of the source data, known as a test master.  Using copy-on-write technology, each clone only consumes additional storage when it modifies data.  This means the storage overhead per clone is tiny, particularly if it is used for functional testing, training or reporting.

Snap clone also has the benefit of near-instant provisioning.  Note that Database 12c can provision a clone Pluggable Database near instantaneously if using a Snap Clone enabled file system.
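
A minimal sketch, assuming a 12c source PDB named testmaster, opened read-only on a snapshot-capable file system, and hypothetical file paths:

  -- Thin clone: the data files are snapshots, so the clone appears in seconds
  -- and consumes almost no extra storage until it modifies data
  CREATE PLUGGABLE DATABASE test_clone1 FROM testmaster
    SNAPSHOT COPY
    FILE_NAME_CONVERT = ('/u01/oradata/testmaster/', '/u01/oradata/test_clone1/');

  ALTER PLUGGABLE DATABASE test_clone1 OPEN;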





Wednesday, 18 September 2013

Relational Data Management - Partition, Compress, Storage Tier and Snap Clone your way to success

We are finding a lot of customers are answering yes to most of the following questions:
  1. Is your growing data volume inhibiting application performance?
  2. How long will it be before your database doubles in size?
  3. Are you required to retain data for extended periods and still make it accessible to end users?
  4. Do you have multiple full size copies of production databases for test, development, training, etc?
  5. Is storage consuming an increasing portion of your IT budget?
To quantify this problem we can run some scripts which generate an easy-to-assimilate graphical report.




What can be done?

  • Data Partitioning can reduce the active database footprint by 40% for most transaction based applications including E-Business Suite, Siebel and PeopleSoft
  • Data Compression can reduce transaction data storage consumption by 30%
  • Tiered Storage can reduce storage costs by 73% over premium storage whilst improving performance
  • Snap Clone can decimate the storage costs for non-production database copies and reduce provisioning time

This is one of those rare problems where the solution can save money and improve performance.   Let me know if you would like a similar report for one of your growing Oracle OLTP databases.