Thursday, 7 February 2013

Hide your valuables

It's always tempting to use copies of production data for test purposes rather than generating data, because functional and load tests, in particular, are then more realistic.  The challenge is always ensuring that no sensitive data, i.e. personal or company-confidential data, is visible.

Data masking is the most secure method of irreversibly hiding such data, but there are often challenges around:
  1. Ensuring that all sensitive data is masked and nothing is missed
  2. The masking process is heavyweight, as all rows in any table holding sensitive data must be processed
  3. If sensitive columns are used in join conditions, then the columns from the joined tables must be masked using the same values to ensure that queries continue to work correctly post masking.
The solution to issue 3 is often to build a data model to understand the relationships between columns in tables.  In general this sits in the "too hard" bucket: whether the application is third-party or custom, a data model often proves elusive and is difficult to build (or reverse engineer).

One solution to this problem is to model only the relationships that are relevant to sensitive columns.  First, identify all sensitive columns (iteratively) and build a list or table of these columns, each uniquely identified by schema.table.column.  Alternatively you can add or prepend a 'SENSITIVE' label to each relevant column using the column comment facility, e.g.:

COMMENT ON COLUMN customers.email_id IS 'SENSITIVE';
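
Once labelled, the full list of sensitive columns can be retrieved directly from the data dictionary.  A simple query, assuming the label is placed at the start of the comment as above:

-- list every column flagged as sensitive via its comment
select table_name, column_name
from user_col_comments
where comments like 'SENSITIVE%';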

Then you need to confirm which, if any, of these columns are used in a join condition for a given workload.  The best method is to identify the columns used in join operations using a SQL Tuning Set in your test environment.
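
If you do not already have a suitable tuning set, one can be created and populated from the cursor cache while the workload runs.  A minimal sketch, where MY_STS matches the examples below and the parsing schema filter ('SH') is a placeholder:

-- create an empty SQL Tuning Set
exec dbms_sqltune.create_sqlset( sqlset_name => 'MY_STS' );

-- capture matching statements from the cursor cache into it
declare
  cur dbms_sqltune.sqlset_cursor;
begin
  open cur for
    select value(p)
    from table(dbms_sqltune.select_cursor_cache(
                 basic_filter => 'parsing_schema_name = ''SH''')) p;
  dbms_sqltune.load_sqlset( sqlset_name => 'MY_STS', populate_cursor => cur );
end;
/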

-- display contents of SQL Tuning Set                               
select sql_text from TABLE(DBMS_SQLTUNE.select_sqlset ( 'MY_STS' ));


SQL_TEXT
-------------------------------------------------------
SELECT c.cust_id, c.cust_city, s.amount_sold                                                                                                                 
  FROM sales s, customers c, products p , promotions po 
 WHERE s.cust_id=c.cust_id
   AND s.prod_id=p.prod_id
   AND s.promo_id =po.promo_id
   AND c.country_id='US'
   AND c.cust_state_province='CA'
   AND p.prod_name='O/S Documentation Set - English' 
   AND po.promo_name='internet promotion #29-350'



-- reset column usage information             
exec dbms_stats.reset_col_usage( USER, NULL );

--  gather information on columns used in filters, joins, etc, from the SQL
--  contained in the tuning set 'MY_STS'                                   
exec dbms_stats.seed_col_usage( 'MY_STS', USER );                          

-- display the results with ...                          
select dbms_stats.report_col_usage( USER, NULL ) from dual;


-- or with ...                                             

-- columns recorded as being used in join predicates, restricted to
-- objects owned by the current user
select 
  o.name               "TABLE",
  c.name               "COLUMN", 
  u.equijoin_preds     EQ_JOIN,
  u.nonequijoin_preds  NONEQ_JOIN
from 
        sys.col_usage$ u
  join  sys.obj$       o on u.obj# = o.obj#
  join  sys.col$       c on u.obj# = c.obj# and u.intcol# = c.intcol#
where
  o.obj# in (select object_id from user_objects)
and
  (u.equijoin_preds > 0 or u.nonequijoin_preds > 0);



TABLE      COLUMN       EQ_JOIN  NONEQ_JOIN
---------- ---------- ---------- ----------
PRODUCTS   PROD_ID             1          0
CUSTOMERS  CUST_ID             1          0
SALES      PROD_ID             1          0
SALES      CUST_ID             1          0
SALES      PROMO_ID            1          0
PROMOTIONS PROMO_ID            1          0




We can correlate this result with our list of sensitive columns to see whether any are involved in a join, which would then require the same masked values to be applied to the corresponding entries in each table.  This can be done by adding a relationship to the data model used when performing data masking.  Note that we do not need to build a complete data model, just one containing the relationships that are relevant for masking.
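
For example, if CUST_ID were one of our sensitive columns, the output above tells us that SALES and CUSTOMERS must be masked consistently.  A minimal sketch of one way to achieve this with a hypothetical mapping table (in practice, referential constraints may need to be disabled first, or the masking tool's own mechanism used):

-- build a one-to-one map from original to masked key values
create table cust_id_map as
select cust_id                                        old_id,
       row_number() over (order by dbms_random.value) new_id
from customers;

-- apply the same mapping to every table that joins on the column
update customers c
   set c.cust_id = (select m.new_id from cust_id_map m where m.old_id = c.cust_id);

update sales s
   set s.cust_id = (select m.new_id from cust_id_map m where m.old_id = s.cust_id);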

The column usage data above does not tell us which table/column combinations are joined to one another, and we cannot rely on column names, which may not match across tables.  To find the actual join conditions, you can query the SQL text in the SQL Tuning Set for the sensitive column names identified above.
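
For example, scanning the tuning set text for one flagged column (CUST_ID here) is often enough to reveal the actual join conditions:

-- find statements in the tuning set referencing a sensitive column
select sql_id, sql_text
from table(dbms_sqltune.select_sqlset( 'MY_STS' ))
where upper(sql_text) like '%CUST_ID%';

The final step is to run a functional test on the masked data to ensure that the results are as anticipated.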

Masking sensitive data not only allows you to use production data for test purposes but also allows you to share that data with a third party for data enrichment, off-site benchmarking or proof of concept purposes.

Tuesday, 5 February 2013

Start dropping hints

The most common reason for database performance degradation over time is data growth.  Databases have a habit of getting larger rather than smaller, and the database server has to work smarter to sift out and operate on the minimal set of data needed to satisfy queries.  Ask yourself: "If this database were smaller, would it run faster?"

The Oracle database calculates the best execution plan - the plan that attracts the least work - through the use of optimizer statistics.   These statistics tell the optimizer how large tables are, how many rows they contain, minimum and maximum values, cardinality (selectivity), etc.  Without these statistics the optimizer is blind and has to rely on some very basic assumptions.
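
For example, gathering statistics for a single table and inspecting what the optimizer will see is straightforward (SALES here is just an example table):

-- gather statistics for the table
exec dbms_stats.gather_table_stats( USER, 'SALES' );

-- table-level statistics
select table_name, num_rows, blocks, last_analyzed
from user_tables
where table_name = 'SALES';

-- column-level statistics
select column_name, num_distinct, low_value, high_value
from user_tab_col_statistics
where table_name = 'SALES';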

I often work with customers who are upgrading or migrating their databases, and I find that many do not maintain optimizer statistics, sometimes deliberately.  The primary concern is usually performance regressions, to the degree that they also try to peg the database back by setting the compatibility and optimizer version parameters to a previous release.

Of course, even with good statistics the optimizer may not choose the optimal, or even a good, plan, but that is the exception.  The accepted goal is to have less than 1% of your SQL relying on plan-influencing features such as SQL Plan Management or SQL Profiles.  The most common method of influencing an execution plan is to add hints to the SQL statement.  In practice, hints are used not because the optimizer cannot produce a good plan, but because it is compromised by inadequate statistics gathering and maintenance.
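
A rough way to check where you stand against that goal is to look at the cursor cache.  A sketch (11g or later; v$sql only reflects statements currently cached):

-- percentage of cached cursors relying on a SQL profile or plan baseline
select round( 100 * sum( case
                           when sql_profile is not null
                             or sql_plan_baseline is not null
                           then 1 else 0
                         end ) / count(*), 2 ) pct_influenced
from v$sql;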

Adding correct hints to a SQL statement is a skilled task and should be treated as a short-term measure, used only if a good plan cannot be produced with correct statistics.  The need for a hint in these circumstances may even be reason to raise a bug: a hint is a countermeasure for an optimizer limitation or defect.
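
For readers less familiar with hints, a typical example is a directive embedded in a special comment that forces a particular access path (the index name here is hypothetical):

-- force use of a specific index rather than letting the optimizer choose
select /*+ index(s sales_cust_ix) */ s.amount_sold
from sales s
where s.cust_id = :cust;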

It should be possible to preserve performance in spite of growing data volumes and shifting selectivity purely by leveraging the optimizer and feeding it good statistics.  If you have a lot of hinted SQL and out-of-date, missing or inadequate statistics (note that later database releases introduce more effective statistics gathering and recording), then you can easily test the impact of dropping those hints and letting the optimizer do its best with fresh statistics: enable pending statistics and set _optimizer_ignore_hints at session level.  This is most easily performed using a SQL Tuning Set and SQL Performance Analyzer.
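
A minimal sketch of the session-level part of such a test (SALES as an example table; pending statistics require 11g or later):

-- gather statistics as pending rather than publishing them immediately
exec dbms_stats.set_table_prefs( USER, 'SALES', 'PUBLISH', 'FALSE' );
exec dbms_stats.gather_table_stats( USER, 'SALES' );

-- in the test session, use the pending statistics and ignore embedded hints
alter session set optimizer_use_pending_statistics = true;
alter session set "_optimizer_ignore_hints" = true;

-- ... run the workload or SQL Performance Analyzer trial here ...

-- publish the statistics if the results are good
exec dbms_stats.publish_pending_stats( USER, 'SALES' );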

Give it a try - you may be surprised.  I have seen customers achieve orders-of-magnitude improvements.  Not only can they simplify their SQL, but database performance becomes something managed by the optimizer and the DBAs.