What happened:
When executing `SELECT * FROM table LIMIT 10` against a table created from a parquet file, the full dataset is read and persisted on the cluster at query execution time.
What you expected to happen:
Nothing to be read at query execution; when the user does decide to persist/compute the result, only the relevant subset of the data should be read in.
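For comparison, plain dask_cudf already behaves lazily in this situation (a minimal sketch, assuming the same test_data.parquet file as in the example below):

import dask_cudf

# Building the graph reads no data; read_parquet is lazy
ddf = dask_cudf.read_parquet("test_data.parquet")

# head() only loads from the partition(s) it actually needs,
# not the full dataset
subset = ddf.head(10)

I would expect `c.sql("SELECT * from test LIMIT 10")` to behave similarly.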
Minimal Complete Verifiable Example:
from dask_cuda import LocalCUDACluster
from distributed import Client
import dask
import dask_cudf
from dask_sql import Context

write_data = False

if __name__ == "__main__":
    cluster = LocalCUDACluster()
    client = Client(cluster)
    c = Context()

    # Only needed once, to generate the test dataset
    if write_data:
        dask.datasets.timeseries(start="2022-01-01", end="2024-01-01").to_parquet("test_data.parquet")

    ddf = dask_cudf.read_parquet("test_data.parquet")
    c.create_table("test", ddf, persist=False)

    # This persists the whole dataset in memory: even though the result
    # has only 10 rows, all of the underlying data ends up on the workers
    res = c.sql("SELECT * from test LIMIT 10")
Anything else we need to know?:
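A rough way to observe the over-eager persist (a sketch using the standard distributed client API; the variable names refer to the example above):

# Run after res = c.sql("SELECT * from test LIMIT 10")

# has_what() maps each worker address to the keys it currently holds;
# when the bug triggers, keys for every input partition show up here
persisted = client.has_what()
print("keys held on the cluster:", sum(len(k) for k in persisted.values()))

# Total process memory across workers, from the scheduler's metrics
info = client.scheduler_info()
total = sum(w["metrics"]["memory"] for w in info["workers"].values())
print(f"worker memory in use: {total / 1e9:.2f} GB")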
Environment:
- dask-sql version: 2022.1.0
- Python version: 3.8
- Operating System: Ubuntu 18.04
- Install method (conda, pip, source): conda