

Extened sketch trial update#
The update speed behavior of the HLL sketches compared to the Theta-Alpha sketch will be similar to the following graph: It is important to understand that the bounds values returned by calling the getUpperBound(numStdDev) and getLowerBound(numStdDev) methodsĪre not hard limits, but attempts to measure meaningful “waist-lines” of distributions that theoretically can reach out to +/- infinity. There are no known closed-form mathematical solutions for these error distributions so we use lookup tables and empirical measurements to produce hopefully meaningful bounds. Returning meaningful bounds for low values of K is empirical and approximate. These normalized rank values were chosen because they allow easy comparison with confidence intervals that are commonly used in statistics. Nonetheless, choosing these quantile points allows us to also claim that the area between the +/- 1 Standard Deviation contours corresponds to 68% confidence īetween the +/- 2 Standard Deviation contours corresponds to 95.4% confidence, and between the +/- 3 Standard Deviation contours corresponds to 99.7% confidence. Q(.00135)) correspond to the normalized rank values at +/- 1,2 and 3 standard deviations of a Gaussian, which this is obviously not! The normalized rank values (the values inside the parentheses, e.g. This graph shows the quantile contours for the HLL sketch where LgK = 4 (K = 16). The following plot for a very small-sized HLL sketch is a good example: If K is small (K <= 4096, lgK<=12), and even if N is large, the distribution becomes quite distorted. Nonetheless, the Central Limit Theorem still applies if both K and N are large enough. In fact, it is quite complex and more related to the family of multinomial distributions. The underlying stochastic processes for unique count sketches (both HLL and Theta) do not produce a symmetric Gaussian error distribution at all. The actual value bounds can also be derived directly from the sketch itself by calling the getUpperBound(numStdDev)Īnd getLowerBound(numStdDev) methods. The reader of this chart can easily see that this size HLL sketch will have error “bounds” of +/- 1.3% with 95.4% confidence. The area between the brown and the purple curves represent +/- 3 SD, which correspons to a confidence level of 99.7%. The area between the red and the blue curves represent +/- 2 SD, which corresponds to a confidence level of 95.4%.

Therefore, the area between the orange and the green curves represent +/- 1 SD, which corresponds to a confidence level of 68.3%. (See “The Error Distribution Is Not Gaussian” below.) The different color curves are contours of the actual error distribution measured at normalized rank valuesĭerived from the Gaussian distribution at +/- 1, 2, and 3 standard deviations. The horizontal gridlines are configured to be +/- multiples of the base RSE. The base Relative Standard Error (RSE) for this sketch (at LgK = 14) is 0.0065 = 0.8326 / 2 7. This is due to a newly developed theory and estimator developed by Kevin Lang. This is partially due to the use of the HIP estimator for range above the transition point, which occurs at about 1500 on the graph.īelow this transition point the accuracy is near zero (an RSE of about 50 ppm), which is far better than any known implementation of HLL.

Will be about 20% = (0.8326/1.04 -1) more accurate than a conventional HLL sketch using Flajolet’s estimators (or derived estimators). The Factor = 0.8326 is directly relatable to the Flajolet alpha factor of 1.04.Īs a result, this plot demonstrates that this implementation of the HLL sketch The following plot was generated with LgK = 14 using 2 20 trials. The pitch-fork accuracy plot for any of the HLL sketch types (HLL_4, HLL_6, or HLL_8) are identical because the different sketch types are isomorphic to each other.
