Skip to content

Conversation

@nils-braun
Copy link
Collaborator

Currently, a lot of dask computation is spent in renaming or assigning new columns instead of the "real" calculation, as - to be consistent with what the Relational Algebra gives to us - the columns of the intermediate dataframes are renamed quite often.

This WIP PR includes a new data type, which contains the real data as well as a frontend column mapping. Typically, physical plans will only operate on the data itself and do not care so much about the column names, so we can store the columns separately to the real data.

In my very small tests, this brings dask-sql approximately en-par with usual dask calls in simple cases, such as

SELECT a, MAX(b) FROM df

The PR still needs documentation and more unittests for the new classes.

@codecov-commenter
Copy link

codecov-commenter commented Sep 9, 2020

Codecov Report

Merging #29 into main will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##              main       #29    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           23        24     +1     
  Lines          656       770   +114     
  Branches        91       103    +12     
==========================================
+ Hits           656       770   +114     
Impacted Files Coverage Δ
dask_sql/context.py 100.00% <100.00%> (ø)
dask_sql/datacontainer.py 100.00% <100.00%> (ø)
dask_sql/physical/rel/base.py 100.00% <100.00%> (ø)
dask_sql/physical/rel/logical/aggregate.py 100.00% <100.00%> (ø)
dask_sql/physical/rel/logical/filter.py 100.00% <100.00%> (ø)
dask_sql/physical/rel/logical/join.py 100.00% <100.00%> (ø)
dask_sql/physical/rel/logical/project.py 100.00% <100.00%> (ø)
dask_sql/physical/rel/logical/sort.py 100.00% <100.00%> (ø)
dask_sql/physical/rel/logical/table_scan.py 100.00% <100.00%> (ø)
dask_sql/physical/rel/logical/union.py 100.00% <100.00%> (ø)
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 48ae97e...317e6f1. Read the comment docs.

@mrocklin
Copy link

mrocklin commented Sep 9, 2020 via email

@nils-braun
Copy link
Collaborator Author

@mrocklin That is of course also true! However, to make life a bit easier, there are many "unneeded" renames or column reordering done in dask-sql (which are only done for bookkeeping reasons).
No matter what, the renaming will still need to go to each of the pandas dataframes and actually do the renaming, so there will always be a cost (and most of the renames can be completely ignored because they are just there for bookkeeping).

Maybe I should not call it renaming: it is also creating columns out of already present columns (again, mostly for bookkeeping) - which will always involve some data copying if I want to do it directly on the dataframe. So it is definitely not dask's fault - but more my misusing it :-)

@nils-braun nils-braun changed the title [WIP] Increase Speed Increase Speed Sep 20, 2020
@nils-braun nils-braun merged commit ef9ed1e into main Sep 21, 2020
@nils-braun nils-braun deleted the feature/increase-speed branch September 21, 2020 20:39
rajagurunath added a commit to rajagurunath/dask-sql that referenced this pull request Sep 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants