-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Open
Labels
enhancementNew feature or requestNew feature or request
Description
Is your feature request related to a problem or challenge?
- part of [Epic] DataFusion Blogs #14836
As part of trying to help people understand how to tune memory related settings in #17069, @2010YOUY01 pointed out #17069 (comment)
Let's make it concise now, I think adding a few more sentences of explanation might actually confuse those without the background knowledge. A tutorial-style doc is still needed to describe the full picture.
I agree and I think a blog post explaining how DataFusion processes larger than memory data sets would be really helpful to give people the broader context
Describe the solution you'd like
I suggest a blog on https://datafusion.apache.org/blog/ (make a PR to https://github.com/alamb/datafusion-site)
Describe alternatives you've considered
Some ideas
- Background on memory usage in DataFusion (the memory manager model and what takes large amounts of memory) -- can probably copy/paste from https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/trait.MemoryPool.html#memory-management-overview
- Memory Consumers Describe at a high level what consumes most memory in a plan (group by hash table, sorting, etc)
- Optimizations that DataFusion does to try and avoid needing memory (e.g. take advantage of pre-xisting sort orders, topk, etc)
- Spilling Sort: Provide an overview of the main spilling sort algorithm (sort in memory, spill to disk, merge pre-sorted runs, etc)
- Using Spilling Sort in Group By: Explain that the grouping operation uses the same underlying building block
- Call for help: 🎣 for people to figure out how to add spilling for joins and make the spilling hash faster
Additional context
I am very happy to help write this
connec
Metadata
Metadata
Assignees
Labels
enhancementNew feature or requestNew feature or request