Quite often you are given the task of constructing a cumulative histogram from frequency distributions. One reason for such a task might be that you want to sample random numbers from the original histogram.
Sampling a random number between 0 and 1, the normalized cumulative histogram could be used together with a search algorithm to find the corresponding bin. The search could be made quite efficient since the cumulative distribution is monotonically increasing.
Suppose we have defined the following table in a database, i.e. a histogram with weights (or counts) sorted into bins.
CREATE TABLE histogram(bin INTEGER, weight INTEGER);
The following two SQL statements will both produce the original histogram and the cumulative histogram.
SELECT h1.bin AS Bin, h1.weight AS Histogram, ( SELECT SUM(h2.weight) FROM histogram h2 WHERE h2.bin<=h1.bin ) AS Cumulative FROM histogram h1;
SELECT h1.bin AS Bin, h1.weight AS Histogram, SUM(h2.weight) AS Cumulative FROM histogram h1 LEFT OUTER JOIN histogram h2 ON h2.bin<=h1.bin GROUP BY h1.bin;
Here is some sample data:
BEGIN TRANSACTION; INSERT INTO "histogram" VALUES(1,10); INSERT INTO "histogram" VALUES(2,20); INSERT INTO "histogram" VALUES(3,25); INSERT INTO "histogram" VALUES(4,5); INSERT INTO "histogram" VALUES(5,15); INSERT INTO "histogram" VALUES(6,1); INSERT INTO "histogram" VALUES(7,4); INSERT INTO "histogram" VALUES(8,10); INSERT INTO "histogram" VALUES(9,15); INSERT INTO "histogram" VALUES(10,100); COMMIT;
Executing the above statements using SQLite as the database engine will produce the follow output:
Bin Histogram Cumulative ---------- ---------- ---------- 1 10 10 2 20 30 3 25 55 4 5 60 5 15 75 6 1 76 7 4 80 8 10 90 9 15 105 10 100 205