
An idiomatic Kotlin dataframe toolkit for data engineering tasks on datasets of any size


jnorthrup/columnar


This is one of many restarts from a pristine, tested 2018 columnar implementation. Each restart quickly specializes in a particular domain, and the tests suffer for it.

High-Level Summary

This project is a custom columnar data processing library for the JVM, analogous to Pandas (Python), data.table (R), or Apache Spark (Scala/Java).

Its key characteristics are:

  1. Lazy Evaluation: The core data structure, Cursor, is not an in-memory collection but a lazy representation of a dataset, defined by its size and an accessor function. This is a powerful pattern for handling data larger than memory.
  2. Functional & Expressive DSL: It uses symbolic operators (α for map, ∑ for reduce) and a functional style, reminiscent of languages like APL or Scala's more functional libraries.
  3. High-Performance I/O: It has first-class support for memory-mapped files, both for fixed-width text and for a custom binary ISAM format, designed for very fast access.
  4. Rich Metadata Context: It uses Kotlin's CoroutineContext in a novel way as a "property bag" or "blackboard" to carry rich metadata about data types, storage layout, and schema, making this information available throughout the system.

Feature-by-Feature Analysis and Overlaps

1. Core Abstraction: The Cursor as a Lazy DataFrame

  • In this library: Cursor is a typealias for Vect0r<RowVec>, where Vect0r is essentially a (size, accessorFunction) pair. This means that data isn't loaded until an accessor function is called for a specific row index. This is a core design decision for performance and memory management. (A minimal sketch of this shape follows the list below.)
  • Overlaps & Analogues:
    • Python: The API surface (slicing, grouping, pivoting) is very similar to Pandas. However, Pandas is generally eager, loading data into memory. The lazy nature of Cursor is much more like Dask DataFrame or Polars. Polars, in particular, uses a lazy expression system that builds a query plan and executes it all at once, which is a very similar philosophy.
    • Scala/Java: This is the fundamental model of Apache Spark. A Spark DataFrame is a lazily evaluated, distributed collection of rows with a known schema. Operations on a DataFrame build up a logical plan, which is only executed when an "action" (like collect() or save()) is called.
    • R: The data.table and tidyverse (dplyr) packages provide a similar high-level API but are primarily in-memory. For lazy evaluation, R users typically connect to a database using a library like dbplyr, which translates R code into SQL queries.
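
Here is a minimal, self-contained sketch of the (size, accessor) idea in plain Kotlin. The names LazyVec, Row, and LazyCursor are illustrative stand-ins for this explanation, not the library's actual Vect0r, RowVec, and Cursor declarations:

```kotlin
// Illustrative (size, accessor) pair -- a stand-in for the library's Vect0r/Cursor types.
typealias LazyVec<T> = Pair<Int, (Int) -> T>   // (row count, row accessor)
typealias Row = List<Any?>                     // simplified stand-in for RowVec
typealias LazyCursor = LazyVec<Row>

fun main() {
    // A million-row "dataset" that materializes nothing until a row is requested.
    val cursor: LazyCursor = 1_000_000 to { i: Int -> listOf(i, i * 2.0, "row-$i") }

    val (size, rowAt) = cursor
    println(size)       // 1000000
    println(rowAt(42))  // [42, 84.0, row-42] -- only this row is ever computed
}
```

Because the accessor is just a function, slices, projections, and joins can be expressed as new (size, accessor) pairs that delegate to the original one, which is what keeps operations cheap until rows are actually read.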

2. Functional & Expressive DSL

  • In this library:
    • cursor α { ... }: An element-wise transformation (map/apply). The use of the Greek letter alpha is stylistic, evoking lambda calculus's alpha-conversion.
    • cursor ∑ { ... }: A reduction operation over aggregated data.
    • c1[-"colA", -"colB"]: Column selection by negation, a very convenient feature.
    • Fluent chaining of operations like resample(...).pivot(...).group(...). (A sketch of this operator style follows the list below.)
  • Overlaps & Analogues:
    • Python (Pandas/Polars): The fluent chaining style is identical. df.resample(...).pivot_table(...).groupby(...).
    • R (tidyverse): The "pipe" operator (%>%) achieves the same fluent, readable style: data %>% group_by(col) %>% summarize(...).
    • Scala: The use of symbolic operators is very common in the Scala ecosystem (e.g., in libraries like Cats or ZIO) to create concise, powerful DSLs.
    • APL/J/K: These are array-oriented programming languages built almost entirely on terse, symbolic operators for data manipulation. This library's DSL borrows philosophical inspiration from them.
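
To make the operator style concrete, here is a hedged sketch of how α-style infix functions can be declared in Kotlin over the same (size, accessor) pair used above. The signatures are illustrative and do not reproduce the library's actual α and ∑ definitions on Cursor:

```kotlin
// Illustrative infix "symbolic" operators over a (size, accessor) pair.
typealias LazyVec<T> = Pair<Int, (Int) -> T>

// Element-wise transform (the DSL's α / map): stays lazy, nothing is computed yet.
infix fun <T, R> LazyVec<T>.α(f: (T) -> R): LazyVec<R> =
    first to { i: Int -> f(second(i)) }

// Reduction (the DSL's ∑): forces every element; assumes a non-empty vector.
infix fun <T> LazyVec<T>.`∑`(op: (T, T) -> T): T =
    (1 until first).fold(second(0)) { acc, i -> op(acc, second(i)) }

fun main() {
    val xs: LazyVec<Int> = 5 to { i: Int -> i + 1 }   // lazily 1..5
    val doubled = xs α { it * 2 }                     // still lazy
    println(doubled.second(3))                        // 8
    println(xs `∑` { a, b -> a + b })                 // 15
}
```

Note that ∑ is written with backticked identifiers here because it is not a letter character; α needs no such escaping.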

3. High-Performance I/O & Persistence (NioMMap, ISAMCursor)

  • In this library: NioMMap and ISAMCursor provide direct access to data on disk via memory-mapping, bypassing much of the standard JVM I/O stack. The library defines its own binary ISAM (Indexed Sequential Access Method) format, complete with a .meta file for schema, which is a classic database technique. The CellDriver classes (Tokenized, Fixed) are codecs for reading/writing types to/from ByteBuffer. (A sketch of the underlying memory-mapping mechanism follows this list.)
  • Overlaps & Analogues:
    • Databases: This is the core competency of columnar databases. The approach is very similar to how DuckDB (in-process) or ClickHouse manage data on disk. Creating a custom binary format is what high-performance systems do.
    • Python: Apache Arrow is the modern standard for language-agnostic, zero-copy, columnar in-memory data. Libraries like Polars and DuckDB are built on it. The memory-mapping concept is available in NumPy (memmap) and Arrow.
    • Java: Chronicle Map is a well-known Java library for creating off-heap, low-latency key-value stores that are memory-mapped. This library implements similar ideas from scratch using standard java.nio.
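
As a point of reference for the memory-mapping approach, the sketch below shows the bare java.nio mechanism (FileChannel.map plus fixed-width offsets) that this kind of design builds on. The file name and record width are assumptions for illustration; the library's actual NioMMap/ISAM layer adds a .meta schema file and per-type CellDriver codecs on top of this idea:

```kotlin
// Random access into a fixed-width text file via a memory mapping (plain java.nio sketch).
import java.io.RandomAccessFile
import java.nio.channels.FileChannel
import java.nio.charset.StandardCharsets

fun main() {
    val recordWidth = 32                              // assumed fixed record size in bytes
    RandomAccessFile("data.fwf", "r").use { raf ->    // "data.fwf" is a placeholder path
        val map = raf.channel.map(FileChannel.MapMode.READ_ONLY, 0L, raf.length())
        val rowCount = (raf.length() / recordWidth).toInt()

        // Row i is read straight out of the mapping; no other part of the file is touched.
        fun row(i: Int): String {
            val bytes = ByteArray(recordWidth)
            map.position(i * recordWidth)
            map.get(bytes)
            return String(bytes, StandardCharsets.UTF_8).trimEnd()
        }

        println("rows=$rowCount first=${row(0)}")
    }
}
```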

4. Metadata via CoroutineContext

  • In this library: The context package (Arity, RecordBoundary, Ordering, Scalar) uses CoroutineContext.Element to create a type-safe, extensible "property bag". This context is attached to each cell's value (RowVec is a Vect02<Any?, CellMeta>), allowing any function to retrieve metadata about a cell's type, name, storage format, etc., without passing it explicitly. (A sketch of this pattern follows the list below.)
  • Overlaps & Analogues: This is a very creative use of a Kotlin-specific feature. The general pattern is known as a Context Object or Property Bag.
    • Functional Programming: This pattern is analogous to using a Reader Monad. A Reader Monad allows you to "inject" a shared environment or configuration into a computation, which can then be accessed by any function within that computation without being an explicit parameter.
    • Java: A less elegant way to achieve this would be with ThreadLocal variables holding a context object.
    • Web Frameworks: In frameworks like Flask (Python) or Express (Node.js), a global-like request object is available during the lifecycle of a request, serving a similar purpose.
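
To show the mechanism in isolation, here is a small sketch of CoroutineContext used as a typed property bag. The element classes ColumnName and ColumnWidth are hypothetical examples standing in for the library's actual context elements (Arity, RecordBoundary, Ordering, Scalar):

```kotlin
import kotlin.coroutines.AbstractCoroutineContextElement
import kotlin.coroutines.CoroutineContext
import kotlin.coroutines.EmptyCoroutineContext

// Hypothetical metadata elements; each one is addressable by its own key.
class ColumnName(val name: String) : AbstractCoroutineContextElement(ColumnName) {
    companion object Key : CoroutineContext.Key<ColumnName>
}

class ColumnWidth(val bytes: Int) : AbstractCoroutineContextElement(ColumnWidth) {
    companion object Key : CoroutineContext.Key<ColumnWidth>
}

// Any function holding the context pulls out only the elements it cares about;
// each piece of metadata does not need to be a separate explicit parameter.
fun describe(meta: CoroutineContext): String {
    val name = meta[ColumnName]?.name ?: "<unnamed>"
    val width = meta[ColumnWidth]?.bytes ?: -1
    return "$name ($width bytes)"
}

fun main() {
    // Elements compose with `+`, exactly like ordinary coroutine contexts.
    val meta: CoroutineContext = EmptyCoroutineContext + ColumnName("price") + ColumnWidth(8)
    println(describe(meta))   // price (8 bytes)
}
```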

5. Data Structures (Vect0r, Trie, ArrayMap)

  • In this library:
    • Vect0r is a functional, lazy list.
    • Pai2, Tripl3, etc., are custom tuple interfaces.
    • Trie and ArrayMap are specialized data structures for fast lookups. ArrayMap is an immutable, sorted map backed by an array, enabling O(log n) lookups via binary search. (A sketch of this idea follows the list below.)
  • Overlaps & Analogues:
    • Custom Collections: Many high-performance libraries create their own collection types to escape the overhead or semantic limitations of standard library collections. Eclipse Collections (Java) and boost::container (C++) are prime examples.
    • Immutable Data Structures: Libraries like Vavr (Java) or Immutable.js (JavaScript) provide persistent, immutable data structures, which is a core tenet of functional programming and a feature of this library's design. The Trie is a classic example of such a structure.
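
The following sketch illustrates the sorted-array-plus-binary-search idea in isolation. SortedArrayMap is a hypothetical stand-in for this explanation, not the library's ArrayMap implementation:

```kotlin
// Immutable map backed by a sorted key array; lookups are O(log n) binary searches.
class SortedArrayMap<V>(entries: Map<String, V>) {
    private val keys: Array<String> = entries.keys.sorted().toTypedArray()
    private val values: List<V> = keys.map { entries.getValue(it) }

    operator fun get(key: String): V? {
        var lo = 0
        var hi = keys.size - 1
        while (lo <= hi) {
            val mid = (lo + hi) ushr 1
            val cmp = keys[mid].compareTo(key)
            when {
                cmp < 0 -> lo = mid + 1
                cmp > 0 -> hi = mid - 1
                else -> return values[mid]
            }
        }
        return null
    }
}

fun main() {
    val widths = SortedArrayMap(mapOf("price" to 8, "qty" to 4, "symbol" to 16))
    println(widths["qty"])      // 4
    println(widths["missing"])  // null
}
```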

Conclusion

This library is a fascinating and powerful piece of engineering. It doesn't just copy one tool; it synthesizes ideas from several domains:

  • It has the user-friendly, high-level API of Pandas.
  • It has the lazy evaluation model of Apache Spark or Dask.
  • It has the low-level, high-performance I/O and custom storage formats of a columnar database.
  • It uses a terse, functional DSL reminiscent of APL or Scala.
  • It leverages a unique Kotlin feature (CoroutineContext) to implement a clean, extensible metadata system akin to what one might build with a Reader Monad in Haskell.

This is not a general-purpose replacement for Pandas or Spark but appears to be a highly specialized "boutique" library designed for extreme performance on specific, large-scale data processing tasks on the JVM, where control over memory layout and I/O is critical.
