Skip to content

Conversation

@charlesbluca
Copy link
Collaborator

As pointed out by @quasiben and @jdye64, the DataContainer.assign operation could definitely be simplified. This change it to do the same thing, but with only one getitem call and a rename operation instead of assign (not sure if we should change the name to reflect this). For comparison, here are the graphs for a basic SQL operation:

SELECT
     user_id, SUM(b) AS "S"
FROM user_table_1
GROUP BY user_id

Before this PR:

HighLevelGraph with 20 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x7fbc344ae590>
 0. from_pandas-b5125674539d8b5d1f754260e384e1fa
 1. assign-e29cf71863faee69539b6f78b8dbc8f3
 2. getitem-aa5bcd5e86836ebad1df2a90117e16f6
 3. fillna-94de4188d246caf9e1b4bf4f1debce03
 4. isnull-76ff5f3bc798aa3da45b6b16dae758fc
 5. inv-8c38b3dd8368fbb685b718f17180f00b
 6. aggregate-agg-838eda3f7eb4dbc264011a4f8e14a10b
 7. rename-c2e63c54b4d4fff51825029f94563613
 8. reset_index-3a34f3d8a1856ef12b299845da6afa8c
 9. rename-79dfdb294050405020a476146c8b5845
 10. getitem-be78c3b9d2c681e09cccdfdf783dbd1f
 11. eq-77e1f642c7c2a8e643f4a22b61af4e11
 12. inv-07fa50e82ef9e1e810819771a8e1ce2d
 13. getitem-1cc4911cff51b0537d276b6a9de6ebf7
 14. where-7cc4db419a8a2172e17accd935b851f4
 15. assign-8ac7e56123d06d1204acb6609e9541b6
 16. getitem-0ab1e73633488c2092c98ca87c31333c
 17. getitem-2a55368cca98a96e0d82198cd6bd7802
 18. assign-837d2902d726cbdd2f356860ab530f04
 19. getitem-f7c62a09776201bb03dae99feccd8138

And after:

HighLevelGraph with 18 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x7fdf157973a0>
 0. from_pandas-b5125674539d8b5d1f754260e384e1fa
 1. assign-00613e4c7fd862c515ec9466f5853ab4
 2. getitem-fe8d06a9e480fb219e18ea645619fcf4
 3. fillna-cd6269d74d7fcf966419cb47824e2345
 4. isnull-cfb67e5887841dafb6cb0b31d862f9ec
 5. inv-dadddcdaf4eb1433e56291f05e94a3ac
 6. aggregate-agg-a11343737fc93c7c4013b7e3402a0b9f
 7. rename-b85d14b90d651de541c4894a6734961b
 8. reset_index-6d84f857c2528a09277e0c92323afb46
 9. rename-47e52d1413aef810ff415a357c4f6517
 10. getitem-a5a39823bbc0db290fff73a3f3a13e7f
 11. eq-6aef9b02442f4aeff046edb02b7d965c
 12. inv-24a323e2479dbf18adadac7633a7020b
 13. getitem-95c062ea749e3feb0f95196d85463e6e
 14. where-01585d34c6d654f45baf4cca72d713c1
 15. assign-4428e7c82f5ad753650280b081ebb8ff
 16. getitem-6b0569fa23bbbb31e18bd7dac2a7031d
 17. rename-30fe5b74d8f4f3d9e867782189528611

self.column_container._frontend_backend_mapping[out_col]
for out_col in self.column_container.columns
]
].rename(columns={v: k for k, v in self.column_container.mapping()})
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the manipulation of mapping and use of internal _frontend_backend_mapping here shows that they could be changed to make this look a little simpler. I can dig into where else they are used and think about this some more

@codecov-commenter
Copy link

codecov-commenter commented Oct 21, 2021

Codecov Report

Merging #271 (b6ac6b7) into main (1bb37c9) will increase coverage by 0.06%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #271      +/-   ##
==========================================
+ Coverage   95.74%   95.80%   +0.06%     
==========================================
  Files          64       64              
  Lines        2750     2791      +41     
  Branches      412      424      +12     
==========================================
+ Hits         2633     2674      +41     
  Misses         75       75              
  Partials       42       42              
Impacted Files Coverage Δ
dask_sql/datacontainer.py 94.11% <100.00%> (+0.07%) ⬆️
dask_sql/physical/rex/core/call.py 99.10% <0.00%> (-0.01%) ⬇️
dask_sql/context.py 99.09% <0.00%> (-0.01%) ⬇️
dask_sql/cmd.py 100.00% <0.00%> (ø)
dask_sql/utils.py 100.00% <0.00%> (ø)
dask_sql/server/responses.py 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1bb37c9...b6ac6b7. Read the comment docs.

@charlesbluca charlesbluca merged commit 5615027 into dask-contrib:main Oct 21, 2021
@charlesbluca charlesbluca deleted the simplify-assign branch January 19, 2022 21:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants