Oracle 11g: Extended Statistics
In the Real-World Performance Roundtable, Part 2: The Optimizer, Schema Statistics, SQL Tuning at Oracle OpenWorld 2006, I worked an example of how the optimizer can have difficulty estimating the correct cardinality when there is data correlation. (The Zodiac example can be found on pages 46-49 of the presentation.) In Oracle 11g, there has been some enhancements to help the optimizer deal with data correlation.
DBMS_STATS.CREATE_EXTENDED_STATS
Previously I blogged about the 11g enhancement to DBMS_STATS.AUTO_SAMPLE_SIZE
and the new algorithm for gathering NDV. One of the other enhancements to DBMS_STATS
is the CREATE_EXTENDED_STATS
function. It is this function that will allow us to tell the Optimizer that two or more columns have data that is correlated.
Zodiac Calendar Example
Let’s turn to the Zodiac calendar example to demonstrate where the functionality of DBMS_STATS.CREATE_EXTENDED_STATS
can be applied. As you may know, there is a correlation between the Zodiac Sign and the calendar month. Below are the Zodiac signs and the corresponding days of the month.
- Aries : March 21 - April 20
- Taurus : April 21 - May 21
- Gemini : May 22 - June 21
- Cancer : June 22 - July 22
- Leo : July 23 -August 21
- Virgo : August 22 - September 23
- Libra : September 24 - October 23
- Scorpio : October 24 - November 22
- Sagittarius : November 23 - December 22
- Capricorn : December 23 - January 20
- Aquarius : January 21 - February 19
- Pisces : February 20- March 20
For this test case I am going to load two tables, CALENDAR
and PERSON
. Below is a description of each.
SQL> desc calendar
Name Null? Type
----------------- -------- ------------
DATE_ID NOT NULL NUMBER(8)
MONTH NOT NULL VARCHAR2(16)
ZODIAC NOT NULL VARCHAR2(16)
SQL> desc person
Name Null? Type
----------------- -------- ------------
PERSON_ID NOT NULL NUMBER(10)
DATE_ID NOT NULL NUMBER(8)
The CALENDAR
table has 365 rows, one row for every day of the calendar year. The PERSON
table has 32,768 rows for each DAY_ID
(each day of the year) for a total of 11,960,320 rows.
There are a few indexes I’m building on the tables:
- Unique index on
PERSON(PERSON_ID)
- Unique index on
CALENDAR(DATE_ID)
- Non-Unique index on
PERSON(DATE_ID)
Now that the tables loaded and indexes created, it’s time to create the Extended Stats. Below is a portion of the documentation.
CREATE_EXTENDED_STATS Function
This function creates a column statistics entry in the system for a user specified column group or an expression in a table. Statistics for this extension will be gathered when user or auto statistics gathering job gathers statistics for the table. We call statistics for such an extension, “extended statistics”. This function returns the name of this newly created entry for the extension.
Syntax
DBMS_STATS.CREATE_EXTENDED_STATS (
ownname VARCHAR2,
tabname VARCHAR2,
extension VARCHAR2)
RETURN VARCHAR2;
Parameters
Table 127-8 CREATE_EXTENDED_STATS Function Parameters
ownname | Owner name of a table |
tabname | Name of the table |
extension | Can be either a column group or an expression. Suppose the specified table has two column `c1`, `c2`. An example column group can be "(`c1`, `c2`)" and an example expression can be "(`c1` + `c2`)". |
Return Values
This function returns the name of this newly created entry for the extension.
Since there is a correlation between the MONTH
and ZODIAC
columns in the CALENDAR
table, the column group for the extended statistics will be (MONTH, ZODIAC)
.
Here is the command to create the extended stats: SELECT DBMS_STATS.CREATE_EXTENDED_STATS(USER, 'CALENDAR', '(MONTH, ZODIAC)') FROM DUAL;
Now that we have the extended stats definition created, it’s time to gather stats. Here are the commands I’m using to gather stats:
BEGIN
DBMS_STATS.GATHER_TABLE_STATS
(
OWNNAME => USER
,TABNAME => 'CALENDAR'
,ESTIMATE_PERCENT => NULL
,METHOD_OPT => 'FOR ALL COLUMNS SIZE SKEWONLY'
);
END;
/
BEGIN
DBMS_STATS.GATHER_TABLE_STATS
(
OWNNAME => USER
,TABNAME => 'PERSON'
,ESTIMATE_PERCENT => NULL
);
END;
/
Lets look at the column stats on the two tables:
SELECT
TABLE_NAME,
COLUMN_NAME,
NUM_DISTINCT as NDV,
NUM_BUCKETS,
SAMPLE_SIZE,
HISTOGRAM
FROM
USER_TAB_COL_STATISTICS
ORDER BY 1,2;
TABLE_NAME COLUMN_NAME NDV NUM_BUCKETS SAMPLE_SIZE HISTOGRAM
---------- ------------------------------ -------- ----------- ----------- ---------------
CALENDAR DATE_ID 365 254 365 HEIGHT BALANCED
CALENDAR MONTH 12 12 365 FREQUENCY
CALENDAR SYS_STUWHPY_ZSVI_W3#C$I3EUUYB4 24 24 365 FREQUENCY
CALENDAR ZODIAC 12 12 365 FREQUENCY
PERSON DATE_ID 365 1 11960320 NONE
PERSON PERSON_ID 11960320 1 11960320 NONE
As you can see, there are column statistics gathered on column group of CALENDAR.(MONTH, ZODIAC)
represented by the SYS_STUWHPY_ZSVI_W3#C$I3EUUYB4
column.
The Moment of Truth
Will the extended statistics be enough to give the optimizer the information it needs to estimate an accurate number of rows? Let’s test it by running three test cases:
- How many people have a birth month of May?
- How many people have a Zodiac sign of Taurus?
- How many people have a birth month of May and a Zodiac sign of Taurus?
Each query is run with a /*+ gather_plan_statistics */
hint followed by SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));
The goal is to have the E-Rows (Optimizer Estimated Rows) be statistically accurate of the A-Rows (Actual Rows).
Below is the output from DBMS_XPLAN.DISPLAY_CURSOR
for each of the three test cases.
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------
SQL_ID 55qv2rt3k8b3w, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*)
from person p ,calendar c
where p.date_id = c.da te_id and month = 'may'
Plan hash value: 1463406140
--------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
| 2 | NESTED LOOPS | | 1 | 1015K| 1015K|
|* 3 | TABLE ACCESS FULL| CALENDAR | 1 | 31 | 31 |
|* 4 | INDEX RANGE SCAN | PERSON_N1 | 31 | 32768 | 1015K|
--------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter("MONTH"='may')
4 - access("P"."DATE_ID"="C"."DATE_ID")
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------
SQL_ID 8y54wtmy228r0, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*)
from person p ,calendar c
where p.date_id = c.date_id and zodiac = 'taurus'
Plan hash value: 1463406140
--------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
| 2 | NESTED LOOPS | | 1 | 1015K| 1015K|
|* 3 | TABLE ACCESS FULL| CALENDAR | 1 | 31 | 31 |
|* 4 | INDEX RANGE SCAN | PERSON_N1 | 31 | 32768 | 1015K|
--------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter("ZODIAC"='taurus')
4 - access("P"."DATE_ID"="C"."DATE_ID")
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------
SQL_ID 8ntkxs4ztb2rz, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*)
from person p ,calendar c
where p.date_id = c.date_id and zodiac = 'taurus' and month = 'may'
Plan hash value: 1463406140
--------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
| 2 | NESTED LOOPS | | 1 | 688K| 688K|
|* 3 | TABLE ACCESS FULL| CALENDAR | 1 | 21 | 21 |
|* 4 | INDEX RANGE SCAN | PERSON_N1 | 21 | 32768 | 688K|
--------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter(("ZODIAC"='taurus' AND "MONTH"='may'))
4 - access("P"."DATE_ID"="C"."DATE_ID")
Summary
As demonstrated, adding Extended Statistics and using Histograms allowed the Optimizer to accurately estimate the number of rows, even when there was data correlation. This is a very useful enhancement to assist the Optimizer when there is known data correlation.