This is one of many restarts from a pristine, test-focused columnar implementation from 2018. Each restart quickly specializes in a single domain, and the tests suffer.
This project is a custom columnar data processing library for the JVM, analogous to Pandas (Python), data.table (R), or Apache Spark (Scala/Java).
Its key characteristics are:
- Lazy Evaluation: The core data structure, `Cursor`, is not an in-memory collection but a lazy representation of a dataset, defined by its size and an accessor function. This is a powerful pattern for handling data larger than memory.
- Functional & Expressive DSL: It uses symbolic operators (`α` for map, `∑` for reduce) and a functional style, reminiscent of languages like APL or Scala's more functional libraries.
- High-Performance I/O: It has first-class support for memory-mapped files, both for text (fixed-width) and a custom binary `ISAM` format, designed for very fast access.
- Rich Metadata Context: It uses Kotlin's `CoroutineContext` in a novel way as a "property bag" or "blackboard" to carry rich metadata about data types, storage layout, and schema, making this information available throughout the system.
- In this library: `Cursor` is a `typealias` for `Vect0r<RowVec>`, where `Vect0r` is essentially a `(size, accessorFunction)` pair. This means that data isn't loaded until an accessor function is called for a specific row index. This is a core design decision for performance and memory management.
- Overlaps & Analogues:
  - Python: The API surface (slicing, grouping, pivoting) is very similar to Pandas. However, Pandas is generally eager, loading data into memory. The lazy nature of `Cursor` is much more like Dask DataFrame or Polars. Polars, in particular, uses a lazy expression system that builds a query plan and executes it all at once, which is a very similar philosophy.
  - Scala/Java: This is the fundamental model of Apache Spark. A Spark DataFrame is a lazily evaluated, distributed collection of rows with a known schema. Operations on a DataFrame build up a logical plan, which is only executed when an "action" (like `collect()` or `save()`) is called.
  - R: The `data.table` and `tidyverse` (`dplyr`) packages provide a similar high-level API but are primarily in-memory. For lazy evaluation, R users typically connect to a database using a library like `dbplyr`, which translates R code into SQL queries.
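The `(size, accessorFunction)` idea described above can be sketched in a few lines of Kotlin. This is a minimal illustration, not the library's code: `LazyVec`, `accessor`, and `lazyDemo` are hypothetical names standing in for `Vect0r` and its accessor, and the row data is synthetic.

```kotlin
// Hypothetical stand-in for Vect0r: a size plus an accessor function.
class LazyVec<T>(val size: Int, val accessor: (Int) -> T) {
    // Reading an element is the only thing that runs the accessor.
    operator fun get(i: Int): T = accessor(i)

    // Lazy map: composes a new accessor and touches no elements.
    fun <R> map(f: (T) -> R): LazyVec<R> = LazyVec(size) { i -> f(get(i)) }
}

// Returns the value read and how many times the underlying accessor ran.
fun lazyDemo(): Pair<Int, Int> {
    var reads = 0
    // A million "rows" that are computed on demand, never stored.
    val rows = LazyVec<Int>(1_000_000) { i -> reads++; i * 2 }
    val mapped = rows.map { it + 1 }   // builds a new accessor, reads nothing
    val value = mapped[3]              // triggers exactly one underlying read
    return value to reads
}
```

Calling `lazyDemo()` reads a single row even though the vector nominally holds a million, which is the property that makes the pattern suitable for data larger than memory.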
- In this library:
  - `cursor α { ... }`: An element-wise transformation (map/apply). The use of the Greek letter alpha is stylistic, evoking lambda calculus's alpha-conversion.
  - `cursor ∑ { ... }`: A reduction operation over aggregated data.
  - `c1[-"colA", -"colB"]`: Column selection by negation, a very convenient feature.
  - Fluent chaining of operations like `resample(...).pivot(...).group(...)`.
- Overlaps & Analogues:
  - Python (Pandas/Polars): The fluent chaining style is identical: `df.resample(...).pivot_table(...).groupby(...)`.
  - R (tidyverse): The "pipe" operator (`%>%`) achieves the same fluent, readable style: `data %>% group_by(col) %>% summarize(...)`.
  - Scala: The use of symbolic operators is very common in the Scala ecosystem (e.g., in libraries like Cats or ZIO) to create concise, powerful DSLs.
  - APL/J/K: These are array-oriented programming languages built almost entirely on terse, symbolic operators for data manipulation. This library's DSL borrows philosophical inspiration from them.
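Symbolic operators of this kind can be written in Kotlin as infix extension functions. The sketch below is an assumption about how such a DSL could be declared, not the library's actual definitions: `Col`, `at`, and `dslDemo` are hypothetical names. Because `∑` is a math symbol rather than a letter, the sketch quotes it in backticks; the library's own declaration may differ.

```kotlin
// Hypothetical minimal lazy column; not the library's Vect0r or Cursor.
class Col<T>(val size: Int, val at: (Int) -> T)

// `α` is a Greek letter, so it is a legal Kotlin identifier and can be infix.
infix fun <T, R> Col<T>.α(f: (T) -> R): Col<R> = Col(size) { i -> f(at(i)) }

// Left-to-right reduction; declared with backticks because ∑ is not a letter.
infix fun <T> Col<T>.`∑`(op: (T, T) -> T): T {
    require(size > 0) { "cannot reduce an empty column" }
    var acc = at(0)
    for (i in 1 until size) acc = op(acc, at(i))
    return acc
}

fun dslDemo(): Int {
    val prices = Col<Int>(4) { i -> i + 1 }   // 1, 2, 3, 4
    val doubled = prices α { it * 2 }         // 2, 4, 6, 8, still lazy
    return doubled `∑` { a, b -> a + b }      // forces evaluation and sums
}
```

The map step stays lazy and only the reduction walks the data, mirroring the split between transformation and action described for Spark above.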
- In this library: `NioMMap` and `ISAMCursor` provide direct access to data on disk via memory-mapping, bypassing much of the standard JVM I/O stack. The library defines its own binary `ISAM` (Indexed Sequential Access Method) format, complete with a `.meta` file for schema, which is a classic database technique. The `CellDriver` classes (`Tokenized`, `Fixed`) are codecs for reading/writing types to/from `ByteBuffer`.
- Overlaps & Analogues:
  - Databases: This is the core competency of columnar databases. The approach is very similar to how DuckDB (in-process) or ClickHouse manage data on disk. Creating a custom binary format is what high-performance systems do.
  - Python: Apache Arrow is the modern standard for language-agnostic, zero-copy, columnar in-memory data. Polars is built on the Arrow memory model, and DuckDB uses its own internal format while interoperating closely with Arrow. The memory-mapping concept is available in NumPy (`memmap`) and Arrow.
  - Java: Chronicle Map is a well-known Java library for creating off-heap, low-latency key-value stores that are memory-mapped. This library implements similar ideas from scratch using standard `java.nio`.
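Random access into a memory-mapped fixed-width file can be sketched with nothing but `java.nio`. The record layout here (a 4-byte `Int` id followed by an 8-byte `Double`) and the names `recordBytes`, `writeRecords`, `readRecord`, and `mmapDemo` are assumptions for illustration; they do not describe the library's `ISAM` format or its `.meta` schema file.

```kotlin
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.StandardOpenOption

// Hypothetical fixed-width layout: Int id (4 bytes) + Double value (8 bytes).
val recordBytes = 4 + 8

// Serialize rows back to back so record i starts at byte i * recordBytes.
fun writeRecords(path: Path, rows: List<Pair<Int, Double>>) {
    val buf = ByteBuffer.allocate(rows.size * recordBytes)
    for ((id, value) in rows) {
        buf.putInt(id)
        buf.putDouble(value)
    }
    Files.write(path, buf.array())
}

// Map the file read-only and decode one record by offset arithmetic.
fun readRecord(path: Path, index: Int): Pair<Int, Double> =
    FileChannel.open(path, StandardOpenOption.READ).use { ch ->
        val mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0L, ch.size())
        val offset = index * recordBytes
        mapped.getInt(offset) to mapped.getDouble(offset + 4)
    }

fun mmapDemo(): Pair<Int, Double> {
    val path = Files.createTempFile("records", ".bin")
    writeRecords(path, listOf(1 to 1.5, 2 to 2.5, 3 to 3.5))
    return readRecord(path, 2)   // decodes only the third record
}
```

Because every record has the same width, locating row `i` is a single multiplication, which is what lets a lazy accessor sit directly on top of the mapped buffer.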
- In this library: The `context` package (`Arity`, `RecordBoundary`, `Ordering`, `Scalar`) uses `CoroutineContext.Element` to create a type-safe, extensible "property bag". This context is attached to each cell's value (`RowVec` is a `Vect02<Any?, CellMeta>`), allowing any function to retrieve metadata about a cell's type, name, storage format, etc., without passing it explicitly.
- Overlaps & Analogues: This is a very creative use of a Kotlin-specific feature. The general pattern is known as a Context Object or Property Bag.
  - Functional Programming: This pattern is analogous to using a Reader Monad. A Reader Monad allows you to "inject" a shared environment or configuration into a computation, which can then be accessed by any function within that computation without being an explicit parameter.
  - Java: A less elegant way to achieve this would be with `ThreadLocal` variables holding a context object.
  - Web Frameworks: In Flask (Python), a global-like `request` object is available during the lifecycle of a request; in Express (Node.js), middleware shares a per-request `req` object along the handler chain, serving a similar purpose.
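The property-bag use of `CoroutineContext` needs only the Kotlin standard library, with no coroutines actually running. In the sketch below, `ColumnName`, `ByteWidth`, and `describe` are hypothetical; the library's own elements (`Arity`, `RecordBoundary`, `Ordering`, `Scalar`) are not reproduced here.

```kotlin
import kotlin.coroutines.AbstractCoroutineContextElement
import kotlin.coroutines.CoroutineContext

// Each metadata kind is an Element whose companion object serves as its typed key.
class ColumnName(val name: String) : AbstractCoroutineContextElement(ColumnName) {
    companion object Key : CoroutineContext.Key<ColumnName>
}

class ByteWidth(val bytes: Int) : AbstractCoroutineContextElement(ByteWidth) {
    companion object Key : CoroutineContext.Key<ByteWidth>
}

// Consumers look up only the elements they care about; absent ones are null.
fun describe(meta: CoroutineContext): String {
    val name = meta[ColumnName]?.name ?: "<unnamed>"
    val width = meta[ByteWidth]?.bytes?.let { "$it bytes" } ?: "variable width"
    return "$name ($width)"
}
```

Elements combine with `+`, so `describe(ColumnName("price") + ByteWidth(8))` yields `price (8 bytes)`, while `describe(ColumnName("note"))` falls back to `note (variable width)`. New metadata kinds can be added without changing any existing signature, which is the extensibility the pattern is valued for.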
- In this library: `Vect0r` is a functional, lazy list. `Pai2`, `Tripl3`, etc., are custom tuple interfaces. `Trie` and `ArrayMap` are specialized data structures for fast lookups. `ArrayMap` is an immutable, sorted map backed by an array, enabling O(log n) lookups via binary search.
- Overlaps & Analogues:
  - Custom Collections: Many high-performance libraries create their own collection types to escape the overhead or semantic limitations of standard library collections. Eclipse Collections (Java) and `boost::container` (C++) are prime examples.
  - Immutable Data Structures: Libraries like Vavr (Java) or Immutable.js (JavaScript) provide persistent, immutable data structures, which is a core tenet of functional programming and a feature of this library's design. The `Trie` is a classic example of such a structure.
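The sorted-array map described above can be illustrated as follows. `SortedArrayMap` and `arrayMapDemo` are hypothetical names, and the sketch shows only the binary-search lookup idea, not the library's `ArrayMap` implementation.

```kotlin
// Immutable map: keys are sorted once at construction, values stored in parallel.
class SortedArrayMap<K : Comparable<K>, V>(entries: Map<K, V>) {
    private val keys: List<K> = entries.keys.sorted()
    private val values: List<V> = keys.map { entries.getValue(it) }

    // O(log n) lookup: binarySearch returns a negative index when the key is absent.
    operator fun get(key: K): V? {
        val i = keys.binarySearch(key)
        return if (i >= 0) values[i] else null
    }
}

fun arrayMapDemo(): List<Int?> {
    val m = SortedArrayMap(mapOf("pear" to 3, "apple" to 1, "fig" to 2))
    return listOf(m["apple"], m["fig"], m["kiwi"])   // "kiwi" is missing
}
```

Compared with a hash map, this layout trades constant-time lookup for a compact, cache-friendly representation with ordered keys.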
This library is a fascinating and powerful piece of engineering. It doesn't just copy one tool; it synthesizes ideas from several domains:
- It has the user-friendly, high-level API of Pandas.
- It has the lazy evaluation model of Apache Spark or Dask.
- It has the low-level, high-performance I/O and custom storage formats of a columnar database.
- It uses a terse, functional DSL reminiscent of APL or Scala.
- It leverages a unique Kotlin feature (`CoroutineContext`) to implement a clean, extensible metadata system akin to what one might build with a Reader Monad in Haskell.
This is not a general-purpose replacement for Pandas or Spark but appears to be a highly specialized "boutique" library designed for extreme performance on specific, large-scale data processing tasks on the JVM, where control over memory layout and I/O is critical.