2012-02-20

SQL: Cumulative histogram from histogram

Quite often you are given the task of constructing a cumulative histogram from frequency distributions. One reason for such a task might be that you want to sample random numbers from the original histogram.

Sampling a random number between 0 and 1, the normalized cumulative histogram could be used together with a search algorithm to find the corresponding bin. The search could be made quite efficient since the cumulative distribution is monotonically increasing.

Suppose we have defined the following table in a database, i.e. a histogram with weights (or counts) sorted into bins.

CREATE TABLE histogram(bin INTEGER, weight INTEGER);

The following two SQL statements will both produce the original histogram and the cumulative histogram.

SELECT h1.bin AS Bin,
       h1.weight AS Histogram,
       ( SELECT SUM(h2.weight)
         FROM histogram h2
         WHERE h2.bin<=h1.bin ) AS Cumulative
 FROM histogram h1;
SELECT h1.bin AS Bin,
       h1.weight AS Histogram,
       SUM(h2.weight) AS Cumulative
 FROM histogram h1
 LEFT OUTER JOIN histogram h2 ON h2.bin<=h1.bin
 GROUP BY h1.bin;

Here is some sample data:

BEGIN TRANSACTION;
INSERT INTO "histogram" VALUES(1,10);
INSERT INTO "histogram" VALUES(2,20);
INSERT INTO "histogram" VALUES(3,25);
INSERT INTO "histogram" VALUES(4,5);
INSERT INTO "histogram" VALUES(5,15);
INSERT INTO "histogram" VALUES(6,1);
INSERT INTO "histogram" VALUES(7,4);
INSERT INTO "histogram" VALUES(8,10);
INSERT INTO "histogram" VALUES(9,15);
INSERT INTO "histogram" VALUES(10,100);
COMMIT;

Executing the above statements using SQLite as the database engine will produce the follow output:

Bin         Histogram   Cumulative
----------  ----------  ----------
1           10          10        
2           20          30        
3           25          55        
4           5           60        
5           15          75        
6           1           76        
7           4           80        
8           10          90        
9           15          105       
10          100         205