NOTE: This article was originally created for version 7.3. Conceptually the ideas presented are
the same as in later versions but implementation details may have changed or have
been enhanced.
Histograms
==========
For uniformly distributed data, the cost-based approach makes fairly accurate
guesses at the cost of executing a particular statement. However, when the
data is not uniformly distributed, the optimizer cannot accurately estimate
the selectivity of a query. Beginning in release 7.3, for columns which do not
have uniform data distribution, Oracle will allow you to store histograms
describing the data distribution of a particular column.
When to Use Histograms
———————-
Histograms are stored in the dictionary and computed by using the DBMS_STATS
command on a particular column. Therefore, there is a maintenance and space
cost for using histograms. You should only compute histograms for columns
which you know have highly-skewed data distribution.
When to Not Use Histograms
————————–
Also, be aware that histograms, as well as all optimizer statistics, are
static. If the data distribution of a column changes frequently, it is
necessary to recompute the histogram for a given column. Histograms are not
useful for columns with the following characteristics:
o all predicates on the column use bind variables
o the column data is uniformly distributed
o the column is not used in WHERE clauses of queries
o the column is unique and is used only with equality predicates
How to Use Histograms
———————
Create histograms on columns that are frequently used in WHERE clauses of
queries and have a highly-skewed data distribution. You create a histogram
by using the ANALYZE and DBMS_STATS TABLE command for later versions. For example, if you want to create a 10-bucket histogram on the SAL column of the EMP table, issue the following
statement:
DBMS_STATS.GATHER_TABLE_STATS (NULL,’EMP’, method_opt => ‘FOR COLUMNS sal SIZE 10’);
ANALYZE TABLE emp COMPUTE STATISTICS FOR COLUMNS sal SIZE 10;
The SIZE keyword states the maximum number of buckets for the histogram.
You would create a histogram on the SAL column if there were an unusual
number of employees with the same salary and few employees with other
salaries.
The ‘FOR’ clause can be used with either COMPUTE STATISTICS or ESTIMATE
STATISTICS. The following clauses can be used with the ANALYZE TABLE command:
FOR TABLE
collect table statistics for the table
FOR ALL COLUMNS
collect column statistics for all columns in the table
FOR ALL INDEXED COLUMNMS
collect column statistics for all indexed columns in the table
FOR COLUMNS
collect column statistics for the specified columns
FOR ALL INDEXES
all indexes associated with the table will be analyzed
SIZE
specifies the maximum number of partitions (buckets) in the
histogram.
Default value: 75
Range of values: 1 – 254
For DBMS_STATS, syntax is the following:
method_opt
Accepts:
FOR ALL [INDEXED | HIDDEN] COLUMNS [size_clause]
*
FOR COLUMNS [size clause] column|attribute [size_clause] [,column|attribute [size_clause]…]
size_clause is defined as size_clause := SIZE {integer | REPEAT | AUTO | SKEWONLY}
– integer : Number of histogram buckets. Must be in the range [1,254].
– REPEAT : Collects histograms only on the columns that already have histograms.
– AUTO : Oracle determines the columns to collect histograms based on data distribution and the workload of the columns.
– SKEWONLY : Oracle determines the columns to collect histograms based on the data distribution of the columns.
For 11.2, there is following additional parameter:
– extension : can be either a column group in the format of (column_name, Colume_name [, …]) or an expression
In 10g and 11g, the default is FOR ALL COLUMNS SIZE AUTO. The default value can be changed using the SET_DATABASE_PREFS Procedure, SET_GLOBAL_PREFS Procedure, SET_SCHEMA_PREFS Procedure and SET_TABLE_PREFS Procedure.
Choosing the Number of Buckets for a Histogram
———————————————-
The default number of buckets is 75. 10g onwards, the default value is FOR ALL COLUMNS SIZE AUTO. This value provides an appropriate level of detail for most data distributions. However, since the number of buckets in the histogram, the sampling rate, and the data distribution all affect
the usefulness of a histogram, you may need to experiment with different
numbers of buckets to obtain the best results.
If the number of frequently occurring distinct values in a column is relatively
small, then it is useful to set the number of buckets to be greater than the
number of frequently occurring distinct values.
Viewing Histograms
——————
You can find information about existing histograms in the database through the
following data dictionary views:
USER_TAB_HISTOGRAMS, ALL_TAB_HISTOGRAMS, and DBA_TAB_HISTOGRAMS.
USER_PART_HISTOGRAMS, ALL_PART_HISTOGRAMS, and DBA_PART_HISTOGRAMS.
USER_SUBPART_HISTOGRAMS, ALL_SUBPART_HISTOGRAMS, and DBA_SUBPART_HISTOGRAMS.
The number of buckets in each column’s histogram is found in these dictionary views :
o USER_TAB_COL_STATISTICS, ALL_TAB_COL_STATISTICS,DBA_TAB_COL_STATISTICS
(extracted from USER_TAB_COLUMNS, ALL_TAB_COLUMNS, and DBA_TAB_COLUMNS)
o USER_PART_COL_STATISTICS,ALL_PART_COL_STATISTICS, DBA_PART_COL_STATISTICS,
o USER_SUBPART_COL_STATISTICS, ALL_SUBPART_COL_STATISTICS, DBA_SUBPART_COL_STATISTICS
These views have the same definition.
DBA_TAB_HISTOGRAMS
This view lists histograms on columns of all tables.
Column name Represents This
———————————————————
OWNER Owner of table
TABLE_NAME Table name
COLUMN_NAME Column name
ENDPOINT_NUMBER Endpoint number
ENDPOINT_VALUE Normalized endpoint values for this bucket
DBA_TAB_COLUMNS
This view contains information which describes columns of all tables.
(NOTE: Views and clusters, although included in this view are not relevant
to histograms.)
Column Name Represents This
———————————————————-
OWNER Owner of table
TABLE_NAME Table name
COLUMN_NAME Column name
DATA_TYPE Datatype of the column
DATA_LENGTH Length of the column
DATA_PRECISION Precision for NUMBER or FLOAT datatypes
DATA_SCALE Digits to right of decimal
NULLABLE NULL allowable?
COLUMN_ID Sequence no. of column
DEFAULT_LENGTH Length of default value
DATA_DEFAULT Default value
NUM_DISTINCT Number of distinct values for the column
LOW_VALUE Smallest value for the column, expressed in hex
for the internal representation ofthe first 32
bytes of the value
HIGH_VALUE Highest value for the column, expressed in hex for
the internal representation of the first 32 bytes
of the value
DENSITY Density of the column (a measure of how distinct
the values are)
NUM_NULLS The number of columns with null value
NUM_BUCKETS The number of buckets in the histogram
LAST_ANALYZED The date that analyze was last run on the table
SAMPLE_SIZE The amount of data sampled
The column LAST_ANALYZED is useful in determining the last time
statistics, with or without histograms, were computed. This is
often important to assess the reason for cost-based optimizer’s
choices of execution paths. All tables involved in a query must be
regularly analyzed as data changes.