Additional contribution by: Santanu Chatterjee, Trystan Leftwich, Bryan Naden.
A new and powerful method of computing percentile estimates on Big Data is now available to you! By combining the well-known t-Digest algorithm with AtScale’s semantic layer and smart aggregation features, AtScale addresses gaps in both the Business Intelligence and Big Data landscapes. Most BI tools have features to compute and display various percentiles (e.g., medians, interquartile ranges), but they move data for processing, which dramatically limits the size of the analysis. The Hadoop-based SQL engines (Hive, Impala, Spark) can compute approximate percentiles on large datasets; however, these expensive calculations are not aggregated and reused to answer similar queries. AtScale’s percentile estimates work with its semantic layer and aggregate tables to provide fast, accurate, and reusable results.
In this three-part blog series we discuss the benefits of percentile estimates and how to compute them in a Big Data environment. Subscribe today to learn the best practices of percentile estimation on Big Data and more. Let's dive right in!
To learn how to be a data-driven organization, check out this webinar!
Why Use Percentiles?
Percentiles can help you understand important aspects of your dataset, ranging from the data’s overall shape to single record ranking. If you find yourself in any of the following situations then percentiles can help you.
1. Do you wish your “average” calculation was resistant to outliers? The real estate industry reports the median home price because the average is heavily influenced by a few large or small sales. If they didn’t use the median, real estate analysts would need complicated logic to remove outlier data points to gain a true picture of the real estate market.
2. Do you want to see skew without being affected by it? We already know that we can use the median to resist outliers, but we don’t want to ignore them completely either. It would be nice to know whether a small number of extreme data points exist in the dataset, but how can we determine this quickly? Fortunately, you can tell if your data is skewed without plotting every point by comparing the average to the median within the context of the interquartile range (the range between the 25th and 75th percentiles); see the sketch after this list. This looks great in a dashboard, but before AtScale it was usually too expensive to compute on large datasets.
3. Do you need to identify tiers in your data? Let’s say that you want to identify your price-insensitive customers and market to them separately. You could rank your customers by average sales amount and select the subset that ranks above the 75th percentile of sales.
4. How does this one record compare to the rest? If your data is ranked then you can tell how a specific record compares to the rest. For example, if Jane Doe calls up and asks to speak to a manager regarding her $2500 order and you know that the 90th percentile threshold is $2000, then you instantly know that Jane is a high-value customer who ranks above the 90th percentile in sales amount.
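To make these four use cases concrete, here is a minimal sketch using NumPy on a small in-memory sample (the order amounts and variable names are purely illustrative). At Big Data scale this is exactly the kind of calculation you would want pushed down to the engine rather than run client-side:

```python
# Minimal illustration of the percentile use cases above (NumPy, toy data).
import numpy as np

order_amounts = np.array([120, 80, 95, 110, 105, 90, 130, 4000])  # one outlier

# 1. Outlier-resistant "average": the median barely moves, the mean jumps.
mean = order_amounts.mean()
median = np.percentile(order_amounts, 50)

# 2. Skew check: a mean far above the median, relative to the IQR, signals a
#    small number of extreme points without plotting every record.
q1, q3 = np.percentile(order_amounts, [25, 75])
print(f"mean={mean:.0f}, median={median:.0f}, IQR=[{q1:.0f}, {q3:.0f}]")

# 3. Tiers: a 75th-percentile threshold picks out the top customers.
top_tier_cutoff = np.percentile(order_amounts, 75)

# 4. Ranking one record: is a $2500 order above the 90th percentile?
p90 = np.percentile(order_amounts, 90)
print(f"$2500 order above the 90th percentile (${p90:.0f})? {2500 > p90}")
```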
Advantages of AtScale’s Percentile Estimates
AtScale implemented a modified version of Ted Dunning’s t-Digest algorithm that offers several important advantages over alternative methods used by existing BI tools and Hadoop query engines.
1. No Data Movement. The importance of this property cannot be overstated because moving data from the data lake to a BI tool is painfully slow. AtScale firmly believes that data should be analyzed in place whenever possible because data movement is a scalability killer. Simply put, moving data into Tableau or Qlik via extracts is completely infeasible for large-scale applications.
2. Works with Aggregates. AtScale’s version of the algorithm has the huge advantage of using your semantic model to speed up calculations. For example, if your model has a geography hierarchy, AtScale can aggregate sales to the state level and then answer a separate percentile question at the country level. Reading the comparatively small state aggregate table instead of the multi-billion-row fact table returns results in a fraction of the time; see the sketch after this list. Hive, Impala, and Spark’s percentile estimation functions cannot do this because they don’t have a semantic layer!
3. You Control the Accuracy vs. Resources Trade-off. AtScale provides very good estimates of your data; however, if desired, you can configure AtScale to create more accurate data sketches (think histograms) and reduce estimation error to less than 0.0001%.
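To see why mergeable sketches and an accuracy knob matter, here is a rough sketch using the open-source Python tdigest package. This illustrates the algorithm family, not AtScale’s actual implementation; the per-state data and the delta setting are illustrative assumptions. Digests built per partition (say, one per state) can be merged to answer a country-level percentile question without rescanning the fact table, and a smaller delta buys more accuracy at the cost of a larger sketch:

```python
# Rough sketch of mergeable percentile sketches with the open-source
# `tdigest` package (pip install tdigest). Illustrative only -- not
# AtScale's implementation.
import numpy as np
from tdigest import TDigest

# Toy "fact table" partitioned by state.
state_sales = {
    "CA": np.random.lognormal(4.0, 1.0, 20_000),
    "NY": np.random.lognormal(3.8, 1.1, 15_000),
    "TX": np.random.lognormal(3.9, 0.9, 18_000),
}

# Build one small digest per state. In this library, delta is the
# accuracy/size knob: smaller delta keeps more centroids.
state_digests = {}
for state, sales in state_sales.items():
    d = TDigest(delta=0.01)
    d.batch_update(sales)
    state_digests[state] = d

# Answer a country-level question by merging the state sketches,
# with no second pass over the raw rows.
country_digest = TDigest(delta=0.01)
for d in state_digests.values():
    country_digest = country_digest + d

print("estimated US median sale:", country_digest.percentile(50))

# Sanity check against the exact answer on the raw data.
all_sales = np.concatenate(list(state_sales.values()))
print("exact US median sale:    ", np.percentile(all_sales, 50))
```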
When it comes to Big Data, there is more to percentiles than just a simple statistical calculation. Hopefully you are imagining ways to apply percentile estimates to your Big Data. For more best practices for BI on Big Data, check out this webinar with Jen Underwood below.
In part two of this series we illustrate how to model percentile estimates in a Big Data environment using AtScale and Tableau. It is definitely easier than you think.