Structured Data | The Impact Of Good Table And Query Design

There are many ways to design tables/schemas and many ways to write SQL queries that execute against those tables/schemas. Some designs are better than others for various reasons, however, I think that frequently people underestimate the power of SQL (for both “good” and “evil”). All too often in data warehouses, I see tables designed for one specific report, or a very select few reports. These tables frequently resemble Microsoft Excel Spreadsheets (generally Pivot Tables), not good Dimensional (Star Schema) or Third Normal Form (3NF) schema design. The problem with such designs is that it severely limits the usefulness of that data, as queries that were not known at the time of design often time become problematic. The following is a simple one table example, derived from a field experience in which I discuss two table designs and provide the SQL queries to answer a question the business is seeking.

The Business Question

First lets start with the business question for which the answer is being sought. What customers meet the following criteria:

do not own PRODUCT1 or PRODUCT2 but have downloaded SOFTWARE
do not own PRODUCT2 and it has been more than 90 days between SOFTWARE download and their purchase of PRODUCT1

Version 1: The Column Based (Pivot) Table Design

For Version 1, there is a single row for each customer and each attribute has its own column. In this case there are 4 columns, each representing the most recent activity date for that product.

SQL> desc column_tab
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUSTOMER_ID                               NOT NULL NUMBER
 SOFTWARE_MAC_RECENCY_TS                            DATE
 SOFTWARE_WIN_RECENCY_TS                            DATE
 PRODUCT1_RECENCY_TS                                DATE
 PRODUCT2_RECENCY_TS                                DATE

SQL> select * from column_tab;

CUSTOMER_ID SOFTWARE_M SOFTWARE_W PRODUCT1_R PRODUCT2_R
----------- ---------- ---------- ---------- ----------
        100 2009-03-17            2008-11-17
        200 2009-03-17            2009-01-16
        300 2009-03-17            2008-10-08 2009-02-25
        400            2009-03-17 2008-11-07
        500 2009-03-17

5 rows selected.

SQL> select customer_id
  2  from   column_tab
  3  where  product2_recency_ts is null and
  4         (((software_win_recency_ts is not null or
  5            software_mac_recency_ts is not null) and
  6           product1_recency_ts is null) or
  7          ((software_win_recency_ts - product1_recency_ts) > 90 or
  8           (software_mac_recency_ts - product1_recency_ts) > 90));

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 4293700422

--------------------------------------------------------------------------------
| Id  | Operation         | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |            |     2 |    42 |     3   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| COLUMN_TAB |     2 |    42 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("PRODUCT2_RECENCY_TS" IS NULL AND ("PRODUCT1_RECENCY_TS"
              IS NULL AND ("SOFTWARE_MAC_RECENCY_TS" IS NOT NULL OR
              "SOFTWARE_WIN_RECENCY_TS" IS NOT NULL) OR
              "SOFTWARE_MAC_RECENCY_TS"-"PRODUCT1_RECENCY_TS">90 OR
              "SOFTWARE_WIN_RECENCY_TS"-"PRODUCT1_RECENCY_TS">90))

As you can see, the query construct to answer the business question is straight forward and requires just one pass over the table.

Version 2: The Row Based (Unpivot) Table, Take 1

In Version 2, there is a single row (tuple) which tracks the customer, product and the recency date. Unlike Version 1, none of the columns can be NULL.

SQL> desc row_tab
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUSTOMER_ID                               NOT NULL NUMBER
 RECENCY_TS                                NOT NULL DATE
 PRODUCT                                   NOT NULL VARCHAR2(32)

SQL> select * from row_tab;

CUSTOMER_ID RECENCY_TS PRODUCT
----------- ---------- --------------------------------
        100 2009-03-17 SOFTWARE_MAC
        200 2009-03-17 SOFTWARE_MAC
        300 2009-03-17 SOFTWARE_MAC
        500 2009-03-17 SOFTWARE_MAC
        400 2009-03-17 SOFTWARE_WIN
        100 2008-11-17 PRODUCT1
        200 2009-01-16 PRODUCT1
        300 2008-10-08 PRODUCT1
        400 2008-11-07 PRODUCT1
        300 2009-02-25 PRODUCT2

10 rows selected.

SQL> select a.customer_id
  2  from   row_tab a,
  3         (select customer_id,
  4                 product,
  5                 recency_ts
  6          from   row_tab
  7          where  product in ('SOFTWARE_MAC', 'SOFTWARE_WIN')) b
  8  where  a.customer_id not in (select customer_id
  9                               from   row_tab
 10                               where  product in ('PRODUCT1', 'PRODUCT2')) and
 11         a.customer_id = b.customer_id
 12  union
 13  select a.customer_id
 14  from   row_tab a,
 15         (select customer_id,
 16                 product,
 17                 recency_ts
 18          from   row_tab
 19          where  product in ('SOFTWARE_MAC', 'SOFTWARE_WIN')) b
 20  where  a.customer_id not in (select customer_id
 21                               from   row_tab
 22                               where  product = 'PRODUCT2') and
 23         a.customer_id = b.customer_id and
 24         (a.product = 'PRODUCT1' and
 25          b.recency_ts - a.recency_ts > 90);

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3517586312

---------------------------------------------------------------------------------
| Id  | Operation             | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |    11 |   368 |    22  (60)| 00:00:01 |
|   1 |  SORT UNIQUE          |         |    11 |   368 |    22  (60)| 00:00:01 |
|   2 |   UNION-ALL           |         |       |       |            |          |
|*  3 |    HASH JOIN ANTI     |         |    10 |   310 |    10  (10)| 00:00:01 |
|*  4 |     HASH JOIN         |         |    11 |   187 |     7  (15)| 00:00:01 |
|*  5 |      TABLE ACCESS FULL| ROW_TAB |     5 |    70 |     3   (0)| 00:00:01 |
|   6 |      TABLE ACCESS FULL| ROW_TAB |    10 |    30 |     3   (0)| 00:00:01 |
|*  7 |     TABLE ACCESS FULL | ROW_TAB |     5 |    70 |     3   (0)| 00:00:01 |
|*  8 |    HASH JOIN ANTI     |         |     1 |    58 |    10  (10)| 00:00:01 |
|*  9 |     HASH JOIN         |         |     1 |    44 |     7  (15)| 00:00:01 |
|* 10 |      TABLE ACCESS FULL| ROW_TAB |     4 |    88 |     3   (0)| 00:00:01 |
|* 11 |      TABLE ACCESS FULL| ROW_TAB |     5 |   110 |     3   (0)| 00:00:01 |
|* 12 |     TABLE ACCESS FULL | ROW_TAB |     1 |    14 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
   4 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
   5 - filter("PRODUCT"='SOFTWARE_MAC' OR "PRODUCT"='SOFTWARE_WIN')
   7 - filter("PRODUCT"='PRODUCT1' OR "PRODUCT"='PRODUCT2')
   8 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
   9 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
       filter("RECENCY_TS"-"A"."RECENCY_TS">90)
  10 - filter("A"."PRODUCT"='PRODUCT1')
  11 - filter("PRODUCT"='SOFTWARE_MAC' OR "PRODUCT"='SOFTWARE_WIN')
  12 - filter("PRODUCT"='PRODUCT2')

Version 2, Take 2

The way the query is written in Version 2, Take 1, it requires six accesses to the table. Partly this is because it uses a UNION. In this case the UNION can be removed and replaced with an OR branch.

SQL> select a.customer_id
  2  from   row_tab a,
  3         (select customer_id,
  4                 product,
  5                 recency_ts
  6          from   row_tab
  7          where  product in ('SOFTWARE_MAC', 'SOFTWARE_WIN')) b
  8  where  a.customer_id = b.customer_id and
  9         ((a.customer_id not in (select customer_id
 10                               from   row_tab
 11                               where  product in ('PRODUCT1', 'PRODUCT2')))
 12         or
 13         ((a.customer_id not in (select customer_id
 14                               from   row_tab
 15                               where  product = 'PRODUCT2') and
 16         (a.product = 'PRODUCT1' and
 17          b.recency_ts - a.recency_ts > 90))))
 18  /

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3327813549

-------------------------------------------------------------------------------
| Id  | Operation           | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |         |     1 |    44 |     7  (15)| 00:00:01 |
|*  1 |  FILTER             |         |       |       |            |          |
|*  2 |   HASH JOIN         |         |    11 |   484 |     7  (15)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| ROW_TAB |     5 |   110 |     3   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| ROW_TAB |    10 |   220 |     3   (0)| 00:00:01 |
|*  5 |   TABLE ACCESS FULL | ROW_TAB |     1 |    14 |     3   (0)| 00:00:01 |
|*  6 |   TABLE ACCESS FULL | ROW_TAB |     1 |    14 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter( NOT EXISTS (SELECT 0 FROM "ROW_TAB" "ROW_TAB" WHERE
              "CUSTOMER_ID"=:B1 AND ("PRODUCT"='PRODUCT1' OR "PRODUCT"='PRODUCT2'))
              OR  NOT EXISTS (SELECT 0 FROM "ROW_TAB" "ROW_TAB" WHERE
              "PRODUCT"='PRODUCT2' AND "CUSTOMER_ID"=:B2) AND
              "A"."PRODUCT"='PRODUCT1' AND "RECENCY_TS"-"A"."RECENCY_TS">90)
   2 - access("A"."CUSTOMER_ID"="CUSTOMER_ID")
   3 - filter("PRODUCT"='SOFTWARE_MAC' OR "PRODUCT"='SOFTWARE_WIN')
   5 - filter("CUSTOMER_ID"=:B1 AND ("PRODUCT"='PRODUCT1' OR
              "PRODUCT"='PRODUCT2'))
   6 - filter("PRODUCT"='PRODUCT2' AND "CUSTOMER_ID"=:B1)

This rewrite brings the table accesses down to four from six, so progress is being made, but I think we can do even better.

Version 2, Take 3

SQL is a very powerful language and there is usually more than one way to structure a query. Version 2, Take 1 uses a very literal translation of the business question and Take 2 just does a mild rewrite changing the UNION to an OR. In Version 2, Take 3, I am going to leverage some different, but very powerful functionality to yield the same results.

SQL> -- COLUMN_TAB can be expressed using ROW_TAB with MAX + CASE WHEN + GROUP BY:
SQL> select   customer_id,
  2           max (case
  3                   when product = 'SOFTWARE_MAC'
  4                      then recency_ts
  5                end) software_mac_recency_ts,
  6           max (case
  7                   when product = 'SOFTWARE_WIN'
  8                      then recency_ts
  9                end) software_win_recency_ts,
 10           max (case
 11                   when product = 'PRODUCT1'
 12                      then recency_ts
 13                end) product1_recency_ts,
 14           max (case
 15                   when product = 'PRODUCT2'
 16                      then recency_ts
 17                end) product2_recency_ts
 18  from     row_tab
 19  group by customer_id;

CUSTOMER_ID SOFTWARE_M SOFTWARE_W PRODUCT1_R PRODUCT2_R
----------- ---------- ---------- ---------- ----------
        100 2009-03-17            2008-11-17
        200 2009-03-17            2009-01-16
        300 2009-03-17            2008-10-08 2009-02-25
        400            2009-03-17 2008-11-07
        500 2009-03-17

5 rows selected.

SQL> -- The original query can be expressed as follows:
SQL> select customer_id
  2  from   (select   customer_id,
  3                   max (case
  4                           when product = 'SOFTWARE_MAC'
  5                              then recency_ts
  6                        end) software_mac_recency_ts,
  7                   max (case
  8                           when product = 'SOFTWARE_WIN'
  9                              then recency_ts
 10                        end) software_win_recency_ts,
 11                   max (case
 12                           when product = 'PRODUCT1'
 13                              then recency_ts
 14                        end) product1_recency_ts,
 15                   max (case
 16                           when product = 'PRODUCT2'
 17                              then recency_ts
 18                        end) product2_recency_ts
 19          from     row_tab
 20          group by customer_id)
 21  where  product2_recency_ts is null and
 22         (((software_win_recency_ts is not null or
 23            software_mac_recency_ts is not null) and
 24           product1_recency_ts is null) or
 25          ((software_win_recency_ts - product1_recency_ts) > 90 or
 26           (software_mac_recency_ts - product1_recency_ts) > 90)
 27         );

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 825621652

-------------------------------------------------------------------------------
| Id  | Operation           | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |         |     1 |    22 |     4  (25)| 00:00:01 |
|*  1 |  FILTER             |         |       |       |            |          |
|   2 |   HASH GROUP BY     |         |     1 |    22 |     4  (25)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| ROW_TAB |    10 |   220 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(MAX(CASE "PRODUCT" WHEN 'PRODUCT2' THEN "RECENCY_TS" END
              ) IS NULL AND ((MAX(CASE "PRODUCT" WHEN 'SOFTWARE_WIN' THEN
              "RECENCY_TS" END ) IS NOT NULL OR MAX(CASE "PRODUCT" WHEN
              'SOFTWARE_MAC' THEN "RECENCY_TS" END ) IS NOT NULL) AND MAX(CASE
              "PRODUCT" WHEN 'PRODUCT1' THEN "RECENCY_TS" END ) IS NULL OR MAX(CASE
              "PRODUCT" WHEN 'SOFTWARE_WIN' THEN "RECENCY_TS" END )-MAX(CASE
              "PRODUCT" WHEN 'PRODUCT1' THEN "RECENCY_TS" END )>90 OR MAX(CASE
              "PRODUCT" WHEN 'SOFTWARE_MAC' THEN "RECENCY_TS" END )-MAX(CASE
              "PRODUCT" WHEN 'PRODUCT1' THEN "RECENCY_TS" END )>90))

Rewriting the query as a CASE WHEN with a GROUP BY not only cleaned up the SQL, it also resulted in a single pass over the table. Version 2, Take 3 reduces the table access from four to one!

Version 2, Take 4: The PIVOT operator in 11g

In 11g the PIVOT operator was introduced and can simplify the query even more.

SQL> -- In 11g the PIVOT operator can be used, so COLUMN_TAB can be expressed as:
SQL> select *
  2  from row_tab
  3  pivot (max(recency_ts) for product in
  4         ('SOFTWARE_MAC' as software_mac_recency_ts,
  5          'SOFTWARE_WIN' as software_win_recency_ts,
  6          'PRODUCT1' as product1_recency_ts,
  7          'PRODUCT2' as product2_recency_ts));

CUSTOMER_ID SOFTWARE_M SOFTWARE_W PRODUCT1_R PRODUCT2_R
----------- ---------- ---------- ---------- ----------
        100 2009-03-17            2008-11-17
        200 2009-03-17            2009-01-16
        300 2009-03-17            2008-10-08 2009-02-25
        400            2009-03-17 2008-11-07
        500 2009-03-17

5 rows selected.

SQL> -- Using PIVOT the original query can be expressed as:
SQL> select customer_id
  2  from   row_tab
  3  pivot  (max(recency_ts) for product in
  4         ('SOFTWARE_MAC' as software_mac_recency_ts,
  5          'SOFTWARE_WIN' as software_win_recency_ts,
  6          'PRODUCT1' as product1_recency_ts,
  7          'PRODUCT2' as product2_recency_ts))
  8  where  product2_recency_ts is null and
  9         (((software_win_recency_ts is not null or
 10            software_mac_recency_ts is not null) and
 11           product1_recency_ts is null) or
 12          ((software_win_recency_ts - product1_recency_ts) > 90 or
 13           (software_mac_recency_ts - product1_recency_ts) > 90)
 14         );

CUSTOMER_ID
-----------
        100
        400
        500

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3127820873

--------------------------------------------------------------------------------
| Id  | Operation            | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |         |     1 |    22 |     4  (25)| 00:00:01 |
|*  1 |  FILTER              |         |       |       |            |          |
|   2 |   HASH GROUP BY PIVOT|         |     1 |    22 |     4  (25)| 00:00:01 |
|   3 |    TABLE ACCESS FULL | ROW_TAB |    10 |   220 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(MAX(CASE  WHEN ("PRODUCT"='PRODUCT2') THEN "RECENCY_TS"
              END ) IS NULL AND ((MAX(CASE  WHEN ("PRODUCT"='SOFTWARE_WIN') THEN
              "RECENCY_TS" END ) IS NOT NULL OR MAX(CASE  WHEN
              ("PRODUCT"='SOFTWARE_MAC') THEN "RECENCY_TS" END ) IS NOT NULL) AND
              MAX(CASE  WHEN ("PRODUCT"='PRODUCT1') THEN "RECENCY_TS" END ) IS NULL
              OR MAX(CASE  WHEN ("PRODUCT"='SOFTWARE_WIN') THEN "RECENCY_TS" END
              )-MAX(CASE  WHEN ("PRODUCT"='PRODUCT1') THEN "RECENCY_TS" END )>90 OR
              MAX(CASE  WHEN ("PRODUCT"='SOFTWARE_MAC') THEN "RECENCY_TS" END
              )-MAX(CASE  WHEN ("PRODUCT"='PRODUCT1') THEN "RECENCY_TS" END )>90))

The Big Picture

One thing that I did not touch on is the flexibility of the ROW_TAB design when it comes to evolution. Any number of products can be added without making any modifications to the loading process. In order to do this with the COLUMN_TAB a new column must be added for each new product. The other major difference between the two table designs is that ROW_TAB is insert only while COLUMN_TAB must be updated if the customer exists. Generally one wants to avoid updated in a data warehouse as 1) old data is usually over written and 2) updates are more expensive than inserts.

The other major thing I won’t discuss in detail is how to partition or index (if required) COLUMN_TAB. Think about this. With ROW_TAB it is very straight forward.

Summary

There are many ways to design tables and write queries. Some of them work well, some do not. Some appear impossible at first, only to appear more simple later. Literal translation of a business question into SQL is usually far from optimal. One needs to think about the question being asked, the shape of the data, and the options available to solve that problem as well as the trade offs of those solutions. Remember: table definitions do not have to look like Spreadsheets. Generally only the output of a query needs to.

Don’t get stuck in SQL-92. It is the year 2009. You should be writing your SQL using the constructs that are provided. Often times very complex data transformations can be done with just SQL. Leverage this power.

All experiments performed on 11.1.0.7