Increase Speed #29
This container separates the actual dataframe from the column names, so that we no longer need so many column renames.
Codecov Report
@@            Coverage Diff            @@
##              main       #29   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           23        24     +1
  Lines          656       770   +114
  Branches        91       103    +12
=========================================
+ Hits           656       770   +114
We could also try to make Dask Dataframe renaming faster?
On Wed, Sep 9, 2020 at 2:51 PM Nils Braun wrote:
Currently, a lot of dask computation is spent in renaming or assigning new
columns instead of the "real" calculation, as - to be consistent with what
the Relational Algebra gives to us - the columns of the intermediate
dataframes are renamed quite often.
This WIP PR includes a new data type, which contains the real data as well
as a frontend column mapping. Typically, physical plans only operate
on the data itself and do not care much about the column names, so we
can store the column names separately from the real data.
In my very small tests, this brings dask-sql approximately on par with
usual dask calls in simple cases, such as
SELECT a, MAX(b) FROM df
The PR still needs documentation and more unit tests for the new classes.
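To illustrate the idea described above, here is a minimal sketch of a container that keeps the data and a frontend column mapping side by side. It is not the code from this PR: the names `ColumnMapping` and `MappedFrame` are hypothetical, and a plain dict of lists stands in for the dask dataframe so the example runs without any dependencies.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class ColumnMapping:
    """Hypothetical frontend view: ordered frontend names -> backend names."""

    frontend: List[str]
    to_backend: Dict[str, Any]

    @classmethod
    def identity(cls, backend_names):
        # Frontend names are always strings; backend names are kept as they are.
        backend_names = list(backend_names)
        return cls(
            [str(n) for n in backend_names],
            {str(n): n for n in backend_names},
        )

    def rename(self, renames: Dict[str, str]) -> "ColumnMapping":
        # Only the mapping is rebuilt; the backend names stay untouched.
        frontend = [renames.get(n, n) for n in self.frontend]
        to_backend = {renames.get(n, n): self.to_backend[n] for n in self.frontend}
        return ColumnMapping(frontend, to_backend)


@dataclass(frozen=True)
class MappedFrame:
    """Hypothetical container: backend data plus its frontend column mapping."""

    data: Any  # stand-in for a dask dataframe (assumption for this sketch)
    columns: ColumnMapping

    def rename(self, renames: Dict[str, str]) -> "MappedFrame":
        # The data object is shared, so no dataframe operation is triggered.
        return MappedFrame(self.data, self.columns.rename(renames))

    def column(self, frontend_name: str):
        # Resolve the frontend name to the backend column when data is needed.
        return self.data[self.columns.to_backend[frontend_name]]


backend = {"a": [1, 2, 3], "b": [4, 5, 6]}
frame = MappedFrame(backend, ColumnMapping.identity(backend))
renamed = frame.rename({"a": "x"})
```

In this sketch `renamed` exposes the columns `x` and `b`, while `renamed.data` is still the very same object as `backend`. A real implementation would additionally have to translate frontend names back to backend names whenever a physical plan touches the dataframe.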
Commit Summary
- Start adding a datacontainer
- Use the new datatype in the rex classes
- Use the new datatype in the context
- Use the new datatype in the physical plans
- Remove the now outdated make_unique function
- Merge remote-tracking branch 'origin/main' into
feature/increase-speed
- Make sure to always have a str column
- Add a shortcut to not create the same column again and again
- Add a test for aliases
- Merge branch 'main' into feature/increase-speed
- Re-add the column fixing
File Changes
- *M* dask_sql/context.py
<https://github.com/nils-braun/dask-sql/pull/29/files#diff-e24da7f2d05ce5e756f8b18005259d6b>
(12)
- *A* dask_sql/datacontainer.py
<https://github.com/nils-braun/dask-sql/pull/29/files#diff-9e27689e82cff316fe346f7250f6774e>
(103)
- *M* dask_sql/physical/rel/base.py
<https://github.com/nils-braun/dask-sql/pull/29/files#diff-5dee315c9f804e069b66ad7808fbc164>
(25)
- *M* dask_sql/physical/rel/logical/aggregate.py
<https://github.com/nils-braun/dask-sql/pull/29/files#diff-b18dc18c9c43cced21e374c5131c3ccb>
(48)
- *M* dask_sql/physical/rel/logical/filter.py
<https://github.com/nils-braun/dask-sql/pull/29/files#diff-8d5bbae665048a25da3401b823d873d3>
(16)
- *M* dask_sql/physical/rel/logical/join.py
<https://github.com/nils-braun/dask-sql/pull/29/files#diff-5b9da44013148697d32fc2b524e7def5>
(46)
- *M* dask_sql/physical/rel/logical/project.py
<https://github.com/nils-braun/dask-sql/pull/29/files#diff-d803256bcc62458736d0c3ef1daae5d6>
(37)
- *M* dask_sql/physical/rel/logical/sort.py
<https://github.com/nils-braun/dask-sql/pull/29/files#diff-4eacf918e3883a719d3807b9c539e735>
(19)
- *M* dask_sql/physical/rel/logical/table_scan.py
<https://github.com/nils-braun/dask-sql/pull/29/files#diff-053d76a7590746c5ff6777748cfcca99>
(17)
- *M* dask_sql/physical/rel/logical/union.py
<https://github.com/nils-braun/dask-sql/pull/29/files#diff-3efb89603f8b7102c9a5282db606e2cd>
(40)
- *M* dask_sql/physical/rel/logical/values.py
<https://github.com/nils-braun/dask-sql/pull/29/files#diff-02554ae2825416394630d9bf846e535e>
(29)
- *M* dask_sql/physical/rex/base.py
<https://github.com/nils-braun/dask-sql/pull/29/files#diff-322bfe9fd3a1ffb88f08becc073eab43>
(4)
- *M* dask_sql/physical/rex/convert.py
<https://github.com/nils-braun/dask-sql/pull/29/files#diff-51d39e0b94d92400ad1da369cf035aa1>
(7)
- *M* dask_sql/physical/rex/core/call.py
<https://github.com/nils-braun/dask-sql/pull/29/files#diff-05275a1e08d68d6ed3bae9e6a2ef8c30>
(5)
- *M* dask_sql/physical/rex/core/input_ref.py
<https://github.com/nils-braun/dask-sql/pull/29/files#diff-4b7634934b23e2deae8681da89b799d3>
(9)
- *M* dask_sql/physical/rex/core/literal.py
<https://github.com/nils-braun/dask-sql/pull/29/files#diff-13da9a6367e3eab11f9aded422f70227>
(4)
- *M* tests/integration/test_select.py
<https://github.com/nils-braun/dask-sql/pull/29/files#diff-29a2a7eb9309e5e9f88b5c4538ac50c6>
(9)
Patch Links:
- https://github.com/nils-braun/dask-sql/pull/29.patch
- https://github.com/nils-braun/dask-sql/pull/29.diff
@mrocklin That is of course also true! However, to make life a bit easier, there are many "unneeded" renames and column reorderings done in […].
Maybe I should not call it renaming: it also creates new columns out of columns that are already present (again, mostly for bookkeeping), which always involves some data copying if I do it directly on the dataframe. So it is definitely not dask's fault, but rather my misuse of it :-)
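As a rough sketch of why this bookkeeping does not need to touch the data: an alias or a reordering can be expressed purely on a name mapping. The helpers `add_alias` and `reorder` below are hypothetical and not part of dask-sql; the mapping is assumed to be an ordered list of frontend names plus a dict from frontend to backend names.

```python
from typing import Dict, List, Tuple


def add_alias(
    frontend: List[str], to_backend: Dict[str, str], new_name: str, existing: str
) -> Tuple[List[str], Dict[str, str]]:
    """Expose an existing backend column under an additional frontend name."""
    new_to_backend = dict(to_backend)
    # The alias points at the same backend column; nothing is copied.
    new_to_backend[new_name] = to_backend[existing]
    return frontend + [new_name], new_to_backend


def reorder(
    frontend: List[str], to_backend: Dict[str, str], new_order: List[str]
) -> Tuple[List[str], Dict[str, str]]:
    """Select and reorder frontend columns without touching the backend data."""
    missing = [name for name in new_order if name not in to_backend]
    if missing:
        raise KeyError(f"Unknown frontend columns: {missing}")
    return list(new_order), {name: to_backend[name] for name in new_order}


frontend = ["a", "b"]
to_backend = {"a": "a", "b": "b"}

# "a_copy" is a second frontend name for the backend column "a".
frontend2, to_backend2 = add_alias(frontend, to_backend, "a_copy", "a")
# Keep only two columns, in a new order.
frontend3, to_backend3 = reorder(frontend2, to_backend2, ["a_copy", "b"])
```

Both helpers return new mapping objects and leave their inputs unchanged, so intermediate plans can share one backend dataframe.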
Increase Speed (dask-contrib#29)