In this article, I take a look at the analytic functions that SQL Server 2012 introduces for working with frequency distributions.
CUME_DIST
The CUME_DIST function returns the fraction of rows (greater than 0 and at most 1) whose ordering value is less than or equal to that of the current row. Rows with tied values share the same result.
The expression:
CUME_DIST() OVER(ORDER BY MyValue)
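Is equivalent (neglecting precision) to:
1.0 * COUNT(*)
OVER (ORDER BY MyValue
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
/ COUNT(*) OVER ()
The RANGE frame matters here: unlike ROWS, it includes every row tied with the current one, which is how CUME_DIST treats ties.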
Possible scenario: calculate the percentage of households whose income is not greater than the current one.
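As a minimal sketch of this scenario, the query below compares CUME_DIST with a manual running count. The @Households table variable, its columns, and the income figures are made up for illustration. The manual version uses a RANGE frame so that tied incomes are counted together, as CUME_DIST does.

```sql
-- Hypothetical sample data: table name, columns, and values are illustrative only.
DECLARE @Households TABLE (HouseholdId int PRIMARY KEY, Income int);
INSERT INTO @Households (HouseholdId, Income)
VALUES (1, 10000), (2, 20000), (3, 20000), (4, 30000), (5, 40000);

SELECT HouseholdId,
       Income,
       -- Built-in: fraction of rows with Income <= the current row's Income
       CUME_DIST() OVER (ORDER BY Income) AS CumeDist,
       -- Manual version: running count including ties, divided by the total row count
       1.0 * COUNT(*) OVER (ORDER BY Income
                            RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           / COUNT(*) OVER () AS ManualCumeDist
FROM @Households
ORDER BY Income, HouseholdId;
-- Both columns yield 0.2, 0.6, 0.6, 0.8, 1 for the five rows (up to formatting):
-- the two households at 20000 are tied, so each sees 3 of 5 rows at or below it.
```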
PERCENT_RANK
The PERCENT_RANK function is similar to the CUME_DIST function, but it measures the fraction of the other rows that rank strictly below the current row.
The expression:
PERCENT_RANK() OVER(ORDER BY MyValue)
Is equivalent to (neglecting integer division, and the single-row case where the denominator is zero and PERCENT_RANK returns 0):
( RANK() OVER (ORDER BY MyValue) - 1 )
/ ( COUNT(*) OVER () - 1 )
Possible scenario: for each household, calculate the percentage of the other households that earn less than the current one.
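A small sketch of this scenario, again on a made-up @Households table variable (names and values are assumptions, not from a real schema). The manual column multiplies by 1.0 to avoid integer division and wraps the denominator in NULLIF so that a single-row set yields NULL rather than a divide-by-zero error.

```sql
-- Hypothetical sample data: table name, columns, and values are illustrative only.
DECLARE @Households TABLE (HouseholdId int PRIMARY KEY, Income int);
INSERT INTO @Households (HouseholdId, Income)
VALUES (1, 10000), (2, 20000), (3, 20000), (4, 30000), (5, 40000);

SELECT HouseholdId,
       Income,
       -- Built-in: (rank - 1) / (row count - 1)
       PERCENT_RANK() OVER (ORDER BY Income) AS PercentRank,
       -- Manual version of the same formula
       1.0 * (RANK() OVER (ORDER BY Income) - 1)
           / NULLIF(COUNT(*) OVER () - 1, 0) AS ManualPercentRank
FROM @Households
ORDER BY Income, HouseholdId;
-- Both columns yield 0, 0.25, 0.25, 0.75, 1 (up to formatting).
-- Example: the household at 30000 has 3 of the 4 other households earning less.
```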
PERCENTILE_DISC
Returns the smallest value in the partition whose CUME_DIST (for the same ordering) is equal to or greater than the provided probability. For example:
PERCENTILE_DISC (0.4)
WITHIN GROUP ( ORDER BY MyValue ASC )
OVER()
A few remarks:
- NULL values are ignored, although this is not the case for the CUME_DIST function;
- The OVER clause is mandatory, although it may be empty;
- No ORDER BY is allowed in the OVER clause; there is a specific WITHIN GROUP clause to specify the ordering of the partition.
Possible scenario: what is the lowest income at or below which at least 10% of households fall?
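The remarks above can be tried out with a short sketch. The @Households table variable and its values are hypothetical. One row has a NULL income to show that PERCENTILE_DISC skips it. Because the function returns the same value on every row of the partition, DISTINCT collapses the output to a single row.

```sql
-- Hypothetical sample data: table name, columns, and values are illustrative only.
DECLARE @Households TABLE (HouseholdId int PRIMARY KEY, Income int NULL);
INSERT INTO @Households (HouseholdId, Income)
VALUES (1, 10000), (2, 20000), (3, 20000), (4, 30000), (5, 40000),
       (6, NULL);  -- ignored by PERCENTILE_DISC

SELECT DISTINCT
       -- Ordering goes in WITHIN GROUP; the OVER clause is required but may be empty
       PERCENTILE_DISC(0.1) WITHIN GROUP (ORDER BY Income ASC) OVER () AS P10,
       PERCENTILE_DISC(0.4) WITHIN GROUP (ORDER BY Income ASC) OVER () AS P40
FROM @Households;
-- Over the five non-NULL incomes, CUME_DIST is 0.2, 0.6, 0.6, 0.8, 1.
-- P10 = 10000 (first value with CUME_DIST >= 0.1)
-- P40 = 20000 (first value with CUME_DIST >= 0.4)
```

To get one percentile per group, a PARTITION BY can be placed in the OVER clause.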
PERCENTILE_CONT
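This is an interpolated version of PERCENTILE_DISC: when the requested percentile falls between two rows, it interpolates linearly between their values, so the result need not be a value present in the data. It shares the same syntax, and the same remarks as above apply.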
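To illustrate the difference, here is a sketch that computes the median both ways on a made-up set of four incomes (the @Households table variable and its values are assumptions for the example). With an even number of rows, the discrete version returns an actual value from the data, while the continuous version interpolates between the two middle values.

```sql
-- Hypothetical sample data: table name, columns, and values are illustrative only.
DECLARE @Households TABLE (HouseholdId int PRIMARY KEY, Income int);
INSERT INTO @Households (HouseholdId, Income)
VALUES (1, 10000), (2, 20000), (3, 30000), (4, 40000);

SELECT DISTINCT
       -- Discrete: smallest income whose CUME_DIST is >= 0.5
       PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY Income) OVER () AS MedianDisc,
       -- Continuous: linear interpolation between the two middle incomes
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Income) OVER () AS MedianCont
FROM @Households;
-- MedianDisc = 20000, MedianCont = 25000
```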