Increase Speed #29

nils-braun · 2020-09-09T21:51:11Z

Currently, a lot of dask computation time is spent renaming or assigning columns instead of doing the "real" calculation, because the columns of the intermediate dataframes are renamed quite often to stay consistent with what the relational algebra gives us.

This WIP PR adds a new data type that contains the real data as well as a frontend column mapping. Typically, physical plans operate only on the data itself and do not care much about the column names, so the column names can be stored separately from the real data.

In my very small tests, this brings dask-sql approximately on par with the usual dask calls in simple cases, such as

SELECT a, MAX(b) FROM df

The PR still needs documentation and more unit tests for the new classes.

This container separates the real dataframe from the column names, so that far fewer column renames are needed.
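The idea of keeping the column names apart from the data can be sketched roughly as follows. This is only an illustration, not the code from this PR: the class and method names are hypothetical, and a plain dict of lists stands in for the dask dataframe so that the sketch runs without any dependencies. The point is that a rename only touches a small name mapping, while the data object is passed along unchanged until the very end.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ColumnContainer:
    """Hypothetical sketch: maps frontend (SQL-visible) names to backend columns."""

    frontend_columns: List[str]
    mapping: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        # Initially every frontend name points at the backend column of the same name.
        if not self.mapping:
            self.mapping = {col: col for col in self.frontend_columns}

    def rename(self, renames: Dict[str, str]) -> "ColumnContainer":
        """Return a new container with renamed frontend columns; no data is touched."""
        new_frontend = [renames.get(col, col) for col in self.frontend_columns]
        new_mapping = {
            renames.get(col, col): self.mapping[col] for col in self.frontend_columns
        }
        return ColumnContainer(new_frontend, new_mapping)

    def backend_name(self, frontend_name: str) -> str:
        return self.mapping[frontend_name]


@dataclass
class DataContainer:
    """Hypothetical sketch: the data plus its frontend column mapping."""

    data: Dict[str, list]  # stand-in for a dask dataframe
    columns: ColumnContainer

    def materialize(self) -> Dict[str, list]:
        """Apply the frontend names once, at the very end."""
        return {
            name: self.data[self.columns.backend_name(name)]
            for name in self.columns.frontend_columns
        }


table = {"a": [1, 1, 2], "b": [10, 20, 30]}
dc = DataContainer(table, ColumnContainer(["a", "b"]))

# Two consecutive renames only update the mapping; the data object is reused as-is.
renamed = DataContainer(
    dc.data, dc.columns.rename({"a": "EXPR$0"}).rename({"EXPR$0": "key"})
)
result = renamed.materialize()
```

In this sketch the intermediate name `EXPR$0` never reaches the data at all; only the final frontend names are applied when the result is materialized.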

codecov-commenter · 2020-09-09T21:53:25Z

Codecov Report

Merging #29 into main will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              main       #29    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           23        24     +1     
  Lines          656       770   +114     
  Branches        91       103    +12     
==========================================
+ Hits           656       770   +114

Impacted Files	Coverage Δ
dask_sql/context.py	`100.00% <100.00%> (ø)`
dask_sql/datacontainer.py	`100.00% <100.00%> (ø)`
dask_sql/physical/rel/base.py	`100.00% <100.00%> (ø)`
dask_sql/physical/rel/logical/aggregate.py	`100.00% <100.00%> (ø)`
dask_sql/physical/rel/logical/filter.py	`100.00% <100.00%> (ø)`
dask_sql/physical/rel/logical/join.py	`100.00% <100.00%> (ø)`
dask_sql/physical/rel/logical/project.py	`100.00% <100.00%> (ø)`
dask_sql/physical/rel/logical/sort.py	`100.00% <100.00%> (ø)`
dask_sql/physical/rel/logical/table_scan.py	`100.00% <100.00%> (ø)`
dask_sql/physical/rel/logical/union.py	`100.00% <100.00%> (ø)`
... and 7 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 48ae97e...317e6f1.

mrocklin · 2020-09-09T21:54:06Z

We could also try to make Dask Dataframe renaming faster?


nils-braun · 2020-09-10T20:12:48Z

@mrocklin That is of course also true! However, to make life a bit easier, dask-sql does many "unneeded" renames and column reorderings (which are only done for bookkeeping reasons).
No matter what, a rename still has to go to each of the pandas dataframes and actually rename the columns, so there will always be a cost (and most of the renames can be ignored completely because they are just there for bookkeeping).

Maybe I should not call it renaming: it also includes creating columns out of already present columns (again, mostly for bookkeeping), which will always involve some data copying if I do it directly on the dataframe. So it is definitely not dask's fault, but rather my misuse of it :-)
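The difference between creating columns directly on the data and only recording them in a mapping can be illustrated with a small sketch. It is not taken from dask-sql: the function names are hypothetical, a dict of lists stands in for a dataframe, and the explicit `list(...)` copy in the eager variant only mimics the per-column work a real dataframe would do.

```python
from typing import Dict, List, Tuple

Table = Dict[str, list]  # stand-in for a dataframe: backend column name -> values


def project_eager(table: Table, selection: List[Tuple[str, str]]) -> Table:
    """Build a new table for the projection: every selected column is copied."""
    return {new: list(table[old]) for old, new in selection}


def project_lazy(
    mapping: Dict[str, str], selection: List[Tuple[str, str]]
) -> Dict[str, str]:
    """Only record which backend column each new frontend name refers to."""
    return {new: mapping[old] for old, new in selection}


table = {"a": [1, 2, 3], "b": [4, 5, 6]}
mapping = {"a": "a", "b": "b"}  # frontend name -> backend name

# Roughly "SELECT b AS x, b AS y, a": one column is used twice and the order changes.
# Each tuple is (existing frontend name, new frontend name).
selection = [("b", "x"), ("b", "y"), ("a", "a")]

eager = project_eager(table, selection)  # three new column objects
lazy = project_lazy(mapping, selection)  # three dictionary entries, no data touched
```

In the lazy variant both `x` and `y` resolve to the same backend column `b`, so the duplicated and reordered columns cost nothing until the final result has to be produced.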

Increase Speed (dask-contrib#29)

nils-braun added 11 commits September 6, 2020 18:01

Start adding a datacontainer

87c9cf3

This container splits up the real dataframe from the column names so that we do not need to have that many column renames anymore

Use the new datatype in the rex classes

9048069

Use the new datatype in the context

0473dba

Use the new datatype in the physical plans

069c570

Remove the now outdated make_unique function

43854ef

Merge remote-tracking branch 'origin/main' into feature/increase-speed

f3962c5

Make sure to always have a str column

4c1483a

Add a shortcut to not create the same column again and again

503e7a1

Add a test for aliases

d7d08a7

Merge branch 'main' into feature/increase-speed

0942909

Re-add the column fixing

7872f1c

Some more comments

404eb9b

Merge branch 'main' into feature/increase-speed

317e6f1

nils-braun changed the title ~~[WIP] Increase Speed~~ Increase Speed Sep 20, 2020

nils-braun merged commit ef9ed1e into main Sep 21, 2020

nils-braun deleted the feature/increase-speed branch September 21, 2020 20:39

rajagurunath added a commit to rajagurunath/dask-sql that referenced this pull request Sep 22, 2020

Merge pull request #1 from nils-braun/main

381939c

Increase Speed (dask-contrib#29)

nils-braun mentioned this pull request Sep 26, 2020

Do only map known columns #47

Merged
