Skip to content

Blog post about processing larger than memory queries #17089

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

As part of trying to help people understand how to tune memory related settings in #17069, @2010YOUY01 pointed out #17069 (comment)

Let's make it concise now, I think adding a few more sentences of explanation might actually confuse those without the background knowledge. A tutorial-style doc is still needed to describe the full picture.

I agree and I think a blog post explaining how DataFusion processes larger than memory data sets would be really helpful to give people the broader context

Describe the solution you'd like

I suggest a blog on https://datafusion.apache.org/blog/ (make a PR to https://github.com/alamb/datafusion-site)

Describe alternatives you've considered

Some ideas

  1. Background on memory usage in DataFusion (the memory manager model and what takes large amounts of memory) -- can probably copy/paste from https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/trait.MemoryPool.html#memory-management-overview
  2. Memory Consumers Describe at a high level what consumes most memory in a plan (group by hash table, sorting, etc)
  3. Optimizations that DataFusion does to try and avoid needing memory (e.g. take advantage of pre-xisting sort orders, topk, etc)
  4. Spilling Sort: Provide an overview of the main spilling sort algorithm (sort in memory, spill to disk, merge pre-sorted runs, etc)
  5. Using Spilling Sort in Group By: Explain that the grouping operation uses the same underlying building block
  6. Call for help: 🎣 for people to figure out how to add spilling for joins and make the spilling hash faster

Additional context

I am very happy to help write this

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions