Thursday, 7 February 2013

Hide your valuables

It's always tempting to use copies of production data for test purposes rather than generating data because functional and load tests, in particular, are then more realistic.   The challenge is always ensuring that no sensitive data, i.e. personal or company confidential data is visible.

Data masking is the most secure method to irreversibly hide such data but often there are challenges around:
  1. Ensuring that all sensitive data is masked and nothing is missed
  2. The process of masking is heavy duty as all rows in any table holding sensitive data must be processed
  3. If sensitive columns are used in join conditions then the columns from the tables which are joined must be masked using the same values to ensure that queries continue to work correctly post masking.
The solution to issue number 3 is often to build a data model to understand the relationships between columns in tables.  In general this sits in the "too hard" bucket, often a data model proves elusive because whether the application is 3rd party or custom the data model is difficult to build (or reverse engineer).

One solution to this problem is to only model relationships which are relevant to sensitive columns.  First identify all sensitive columns (iteratively) and build a list or table of these columns uniquely identified by schema.table.column.  Alternatively you can add or prepend a 'SENSITIVE' label to each relevant column using the table comment facility, eg:

COMMENT ON COLUMN customers.email_id IS 'SENSITIVE';

Then you need to confirm which if any of these columns are used in a join condition for a given workload.  The best method to do this is to identify columns used in join operations using a SQL Tuning Set in your test environment:

-- display contents of SQL Tuning Set                               
select sql_text from TABLE(DBMS_SQLTUNE.select_sqlset ( 'MY_STS' ));


SQL_TEXT
-------------------------------------------------------
SELECT c.cust_id, c.cust_city, s.amount_sold                                                                                                                 
  FROM sales s, customers c, products p , promotions po 
 WHERE s.cust_id=c.cust_id
   AND s.prod_id=p.prod_id
   AND s.promo_id =po.promo_id
   AND c.country_id='US'
   AND c.cust_state_province='CA'
   AND p.prod_name='O/S Documentation Set - English' 
   AND po.promo_name='internet promotion #29-350



-- reset column usage information             
exec dbms_stats.reset_col_usage( USER, NULL );

--  gather information on columns used in filters, joins, etc, from the SQL
--  contained in the tuning set 'MY_STS'                                   
exec dbms_stats.seed_col_usage( 'MY_STS', USER );                          

-- display the results with ...                          
select dbms_stats.report_col_usage( USER, NULL ) from dual;


-- or with ...                                             

select 
  o.name               "TABLE",
  c.name               "COLUMN", 
  u.equijoin_preds     EQ_JOIN,
  u.nonequijoin_preds  NONEQ_JOIN
from 
        sys.col_usage$ u
  join  sys.obj$       o on u.obj# = o.obj#
  join  sys.col$       c on u.obj# = c.obj# and u.intcol# = c.col#
where
  o.obj# in (select object_id from user_objects)
and
  (u.equijoin_preds = 1 or u.nonequijoin_preds = 1);



TABLE      COLUMN       EQ_JOIN  NONEQ_JOIN
---------- ---------- ---------- ----------
PRODUCTS   PROD_ID             1          0
CUSTOMERS  CUST_ID             1          0
SALES      PROD_ID             1          0
SALES      CUST_ID             1          0
SALES      PROMO_ID            1          0
PROMOTIONS PROMO_ID            1          0




We can correlate this result with our list of sensitive columns to see if any are involved in a join which would then require the same masked values to be applied to corresponding entries in each table.   This can be done by adding a relationship to the data model used when performing data masking.  Note we do not need to build a complete data model just a data model with relationships that are relevant for masking.

The above does not tell us which table/column combinations are joined and we cannot rely on column names which may not match.  To find the actual join conditions you can query the SQL text in the SQL Tuning Set for the sensitive column names identified above.   The final step would be to run a functional test on the masked data to ensure that the results are as anticipated.

Masking sensitive data not only allows you to use production data for test purposes but also allows you to share that data with a third party for data enrichment, off-site benchmarking or proof of concept purposes.

2 comments: