Here, we provide a prototype implementation of SplineSketch in Python and in Java, together with an experimental pipeline that evaluates its accuracy on synthetic and real-world datasets as well as its update, merge, and query times, in comparison with t-digest, KLL, GKAdaptive, and MomentSketch. DDSketch can optionally be included in the evaluation.
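To make "accuracy" concrete, here is a minimal sketch of one common way to score a quantile estimate, the normalized rank error. That the pipeline uses exactly this metric is an assumption, and the function name `rank_error` is hypothetical rather than taken from the repository.

```python
import bisect


def rank_error(sorted_data, estimate, q):
    """Normalized rank error of `estimate` as an answer to the q-quantile query.

    Hypothetical helper for illustration; not part of the repository.
    `sorted_data` must be sorted in ascending order.
    """
    n = len(sorted_data)
    # Range of ranks that `estimate` can legitimately occupy in the data.
    lo = bisect.bisect_left(sorted_data, estimate)
    hi = bisect.bisect_right(sorted_data, estimate)
    target = q * n
    if lo <= target <= hi:
        return 0.0
    # Distance from the target rank to the nearest admissible rank.
    return min(abs(lo - target), abs(hi - target)) / n
```

For example, on the data 1, 2, ..., 100, answering the median query with 60 yields a rank error of 0.09, while answering with 50 yields 0.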
SplineSketch in Java comes in two main versions:
- `SplineSketch.java` -- faster version without frequent items filtering.
- `SplineSketchMG.java` -- with frequent items filtering by the Misra-Gries sketch.

Additionally, `SplineSketchAdjustable.java` is a modification of `SplineSketch.java` that allows for changing the parameters and components of the sketch, intended for ablation studies.
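For readers unfamiliar with the Misra-Gries sketch, the following is a minimal textbook version in Python. It is an illustrative sketch only: the function name `misra_gries` and the parameter `k` are assumptions, and the code makes no claim about how `SplineSketchMG.java` actually implements or integrates the filtering.

```python
def misra_gries(stream, k):
    """Textbook Misra-Gries summary using at most k - 1 counters.

    Illustrative only; not the repository's implementation.
    Each returned count underestimates the true frequency by at most n / k,
    where n is the stream length.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Any item occurring more than n / k times is guaranteed to survive in the summary, which is what makes the structure suitable for detecting frequent items before they distort a quantile sketch.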
The plots/ directory contains results of the experiments.
Setup: Clone the repository and then run make to compile the Java wrappers that run the individual sketches.
There are five experimental pipelines, with parameters adjusted in the individual Python source files:
- Accuracy and running time experiments on synthetic datasets: run with `python run_experiments_IID.py`
- Accuracy and running time experiments on real-world datasets: download datasets as described below and then run with `python run_experiments_datasets.py` (optionally adjust the datasets in `load_<dataset>_data` functions)
- Update time experiment: run with `python run_experiments_update_time.py`
- Query time experiment: run with `python run_experiments_query_time.py`
- Ablation studies (set up inside the code): run with `python run_experiments_ablation.py`
All of these Python programs write a set of plots with results to the plots/ directory.
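If you want to run every pipeline in one go, a small driver along the following lines may help. It is a sketch under assumptions: the helper names `build_commands` and `run_all` are hypothetical, the scripts are assumed to be launched from the repository root after running make, and the real-world datasets are assumed to be already downloaded.

```python
import subprocess
import sys

# Script names as listed above.
PIPELINES = [
    "run_experiments_IID.py",
    "run_experiments_datasets.py",
    "run_experiments_update_time.py",
    "run_experiments_query_time.py",
    "run_experiments_ablation.py",
]


def build_commands(python=sys.executable, scripts=PIPELINES):
    """Return one argument list per pipeline script (hypothetical helper)."""
    return [[python, script] for script in scripts]


def run_all(commands):
    """Run the commands in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_all(build_commands())` from the repository root would then execute the pipelines sequentially; `check=True` makes a failing script raise an error instead of silently continuing.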
- HEPMASS dataset from UC Irvine ML Repository: download `all_train.csv.gz` and `all_test.csv.gz` and decompress both files into `datasets/hepmass/`
- Power dataset from UC Irvine ML Repository: download into `datasets/household_power_consumption/household_power_consumption.txt`
- Books dataset from SOSD (a benchmark for learned indexes): download using `download_books_dataset.sh`.
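Before launching the real-world experiments, it can be useful to check that the files are where the steps above put them. The snippet below is a minimal sketch: the helper `missing_datasets` is hypothetical, the decompressed HEPMASS file names (`all_train.csv`, `all_test.csv`) are an assumption, and the Books dataset is left out because its target path is determined by `download_books_dataset.sh`.

```python
from pathlib import Path

# Paths relative to the repository root. The HEPMASS names assume the .gz
# archives decompress to files without the .gz suffix.
EXPECTED_FILES = [
    "datasets/hepmass/all_train.csv",
    "datasets/hepmass/all_test.csv",
    "datasets/household_power_consumption/household_power_consumption.txt",
]


def missing_datasets(root="."):
    """Return the expected dataset files that do not exist under `root`."""
    root = Path(root)
    return [path for path in EXPECTED_FILES if not (root / path).is_file()]
```

An empty list returned by `missing_datasets()` means all listed files were found.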