
[BUG] select from table limit reads the full dataset and persists in memory.  #385

@ayushdg

Description


What happened:
When performing a SELECT * FROM table LIMIT 10 on a table read in via parquet, I notice the full dataset is read and persisted in memory on query execution.

What you expected to happen:
Nothing to happen at query execution; when the user does decide to persist/compute the result, only the relevant subset of the data should be read in.

Minimal Complete Verifiable Example:

from dask_cuda import LocalCUDACluster
from distributed import Client, wait
import cudf
import dask_cudf
from dask_sql import Context
import dask

write_data = False

if __name__ == "__main__":
    cluster = LocalCUDACluster()
    client = Client(cluster)
    c = Context()
    
    if write_data:
        dask.datasets.timeseries(start="2022-01-01", end="2024-01-01").to_parquet("test_data.parquet")


    ddf = dask_cudf.read_parquet("test_data.parquet")
    c.create_table("test", ddf, persist=False)

    # This persists the whole dataset in memory, even though `len(res) == 10` and
    # only those 10 rows are ever needed
    res = c.sql("SELECT * from test LIMIT 10")
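The expected behavior described above amounts to limit pushdown: the scan should stop pulling partitions once enough rows have been collected, rather than materializing the whole table first. A minimal pure-Python sketch of that idea (the `read_partition` reader and row counts here are hypothetical stand-ins for reading parquet row groups, not dask-sql internals):

```python
from itertools import islice

reads = []  # track which partitions were actually read

def read_partition(i):
    # Hypothetical partition reader: in practice this would load one
    # parquet partition/row group; here it yields synthetic rows.
    reads.append(i)
    return [{"partition": i, "row": r} for r in range(100)]

def scan_with_limit(n_partitions, limit):
    # Limit pushdown: stop pulling rows (and therefore partitions)
    # as soon as `limit` rows are collected, instead of reading and
    # persisting the full dataset first.
    def all_rows():
        for i in range(n_partitions):
            yield from read_partition(i)
    return list(islice(all_rows(), limit))

result = scan_with_limit(n_partitions=50, limit=10)
# With limit=10 and 100 rows per partition, only partition 0 is read;
# the remaining 49 partitions are never touched.
```

This is only a conceptual illustration of the requested behavior; how a lazy `LIMIT` would map onto Dask's task graph inside dask-sql is a separate design question.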

Anything else we need to know?:

Environment:

  • dask-sql version: 2022.1.0
  • Python version: 3.8
  • Operating System: ubuntu 18.04
  • Install method (conda, pip, source): conda

Metadata

    Labels

    bug: Something isn't working
    performance: Improvements to or issues with performance
    python: Affects Python API
