Data masking is the most secure method to irreversibly hide such data but often there are challenges around:
- Ensuring that all sensitive data is masked and nothing is missed
- The process of masking is heavy duty as all rows in any table holding sensitive data must be processed
- If sensitive columns are used in join conditions then the columns from the tables which are joined must be masked using the same values to ensure that queries continue to work correctly post masking.
The solution to issue number 3 is often to build a data model to understand the relationships between columns in tables. In general this sits in the "too hard" bucket, often a data model proves elusive because whether the application is 3rd party or custom the data model is difficult to build (or reverse engineer).
One solution to this problem is to only model relationships which are relevant to sensitive columns. First identify all sensitive columns (iteratively) and build a list or table of these columns uniquely identified by schema.table.column. Alternatively you can add or prepend a 'SENSITIVE' label to each relevant column using the table comment facility, eg:
COMMENT ON COLUMN customers.email_id IS 'SENSITIVE';
Then you need to confirm which if any of these columns are used in a join condition for a given workload. The best method to do this is to identify columns used in join operations using a SQL Tuning Set in your test environment:
-- display contents of SQL Tuning Set
select sql_text from TABLE(DBMS_SQLTUNE.select_sqlset ( 'MY_STS' ));
SQL_TEXT
-------------------------------------------------------
SELECT c.cust_id, c.cust_city, s.amount_sold
FROM sales s, customers c, products p , promotions po
WHERE s.cust_id=c.cust_id
AND s.prod_id=p.prod_id
AND s.promo_id =po.promo_id
AND c.country_id='US'
AND c.cust_state_province='CA'
AND p.prod_name='O/S Documentation Set - English'
AND po.promo_name='internet promotion #29-350
-- reset column usage information
exec dbms_stats.reset_col_usage( USER, NULL );
COMMENT ON COLUMN customers.email_id IS 'SENSITIVE';
Then you need to confirm which if any of these columns are used in a join condition for a given workload. The best method to do this is to identify columns used in join operations using a SQL Tuning Set in your test environment:
-- display contents of SQL Tuning Set
select sql_text from TABLE(DBMS_SQLTUNE.select_sqlset ( 'MY_STS' ));
SQL_TEXT
-------------------------------------------------------
SELECT c.cust_id, c.cust_city, s.amount_sold
FROM sales s, customers c, products p , promotions po
WHERE s.cust_id=c.cust_id
AND s.prod_id=p.prod_id
AND s.promo_id =po.promo_id
AND c.country_id='US'
AND c.cust_state_province='CA'
AND p.prod_name='O/S Documentation Set - English'
AND po.promo_name='internet promotion #29-350
-- reset column usage information
exec dbms_stats.reset_col_usage( USER, NULL );
-- gather information on columns used in filters, joins, etc, from the SQL
-- contained in the tuning set 'MY_STS'
exec dbms_stats.seed_col_usage( 'MY_STS', USER );
-- display the results with ...
select dbms_stats.report_col_usage( USER, NULL ) from dual;
-- or with ...
select
o.name "TABLE",
c.name "COLUMN",
u.equijoin_preds EQ_JOIN,
u.nonequijoin_preds NONEQ_JOIN
from
sys.col_usage$ u
join sys.obj$ o on u.obj# = o.obj#
join sys.col$ c on u.obj# = c.obj# and u.intcol# = c.col#
where
o.obj# in (select object_id from user_objects)
and
(u.equijoin_preds = 1 or u.nonequijoin_preds = 1);
TABLE COLUMN EQ_JOIN NONEQ_JOIN
---------- ---------- ---------- ----------
PRODUCTS PROD_ID 1 0
CUSTOMERS CUST_ID 1 0
SALES PROD_ID 1 0
SALES CUST_ID 1 0
SALES PROMO_ID 1 0
PROMOTIONS PROMO_ID 1 0
We can correlate this result with our list of sensitive columns to see if any are involved in a join which would then require the same masked values to be applied to corresponding entries in each table. This can be done by adding a relationship to the data model used when performing data masking. Note we do not need to build a complete data model just a data model with relationships that are relevant for masking.
The above does not tell us which table/column combinations are joined and we cannot rely on column names which may not match. To find the actual join conditions you can query the SQL text in the SQL Tuning Set for the sensitive column names identified above. The final step would be to run a functional test on the masked data to ensure that the results are as anticipated.
Masking sensitive data not only allows you to use production data for test purposes but also allows you to share that data with a third party for data enrichment, off-site benchmarking or proof of concept purposes.