
DaskSparseFrame getitem, todense and bugfix #79

Merged
kayibal merged 7 commits into master from getitem-fixes
Jun 14, 2019
Conversation

@michcio1234

No description provided.

It was possible that, although the computed type was SparseFrame, another type
was returned (if meta was not a SparseFrame).

Imports are not changed, just reorganized.
Support for dsp[index] syntax. It doesn't aim to work the same as in
pandas; it just maps __getitem__ onto partitions.
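The "just maps `__getitem__` onto partitions" idea can be sketched with a toy partitioned collection. Everything here is hypothetical illustration, not the actual sparsity internals:

```python
class PartitionedFrame:
    """Minimal sketch (hypothetical, not the real sparsity API):
    __getitem__ is simply forwarded to every partition, mirroring
    the PR's description of dsp[index] support."""

    def __init__(self, partitions):
        self.partitions = partitions  # list of dict-like partitions

    def map_partitions(self, func):
        # Apply func to each partition and wrap the results again.
        return PartitionedFrame([func(p) for p in self.partitions])

    def __getitem__(self, key):
        # No pandas-style alignment or index lookup: just forward the
        # key to each partition's own __getitem__.
        return self.map_partitions(lambda part: {key: part[key]})


pf = PartitionedFrame([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
sub = pf["a"]
print(sub.partitions)  # [{'a': 1}, {'a': 3}]
```

The trade-off noted in the commit message is visible here: the result's semantics are whatever the partition type's `__getitem__` does, with no cross-partition coordination.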
@michcio1234 michcio1234 requested a review from kayibal June 7, 2019 06:48
Previously it returned a DataFrame, even though in the case of a 1-column
non-empty SparseFrame it returned a Series.

Imports are only re-organized.
@codecov

codecov bot commented Jun 7, 2019

Codecov Report

Merging #79 into master will increase coverage by 0.27%.
The diff coverage is 95.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #79      +/-   ##
==========================================
+ Coverage   89.53%   89.81%   +0.27%     
==========================================
  Files           7        7              
  Lines        1204     1217      +13     
==========================================
+ Hits         1078     1093      +15     
+ Misses        126      124       -2
Impacted Files Coverage Δ
sparsity/sparse_frame.py 91.88% <100%> (+0.46%) ⬆️
sparsity/dask/core.py 85.3% <92.85%> (+0.32%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6736452...4c63a0c. Read the comment docs.

It works by mapping SparseFrame.todense onto partitions.
It was necessary to allow `map_partitions` to return types other
than SparseFrame, so the kwarg `cls` was added. This implies that one cannot
use the `cls` kwarg as an argument to the mapped function (because it will be
consumed by `map_partitions` and not passed to the mapped function).
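The `cls`-consumption caveat can be shown with a simplified stand-in for `map_partitions` (the signature below is a hypothetical sketch, not the library's actual one):

```python
def map_partitions(func, partitions, *args, cls=list, **kwargs):
    """Sketch of the `cls` kwarg described above: `cls` selects the type
    used to wrap the mapped partitions, so it is consumed by this
    function's signature and never reaches `func`."""
    return cls(func(p, *args, **kwargs) for p in partitions)


# cls=tuple controls the wrapping of the result; it is NOT forwarded
# to the mapped lambda, which is exactly the limitation noted above.
out = map_partitions(lambda p: p * 2, [1, 2, 3], cls=tuple)
print(out)  # (2, 4, 6)
```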
@michcio1234 michcio1234 changed the title DaskSparseFrame getitem and bugfix DaskSparseFrame getitem, todense and bugfix Jun 7, 2019

@kayibal kayibal left a comment


Nice new features, although this introduces some breaking changes. I just have a few remarks on the todense() method.

return map_partitions(func, self, meta, *args, **kwargs)

# noinspection PyTypeChecker
def todense(self):

I'd rather use to_delayed here, then apply todense on each delayed object and use dd.from_delayed to construct the dense dask collection.

-------
res: dd.DataFrame | dd.Series
"""
meta = dd.from_pandas(self._meta.todense(), npartitions=1)

Meta could be just self._meta.todense()

Author


Not really, we need a dask object here.

if idx is None:
return self.copy()
# we have a hidden zero column to replace missing indices (-1)
new_data = self._data.T[idx].T[:-1]

This should work on empty frames if scipy>=1.0.0
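The transpose-and-index trick in the snippet above (select columns by row-indexing the transpose) can be demonstrated with plain scipy; per this comment, it also works on zero-row matrices with scipy >= 1.0.0. The hidden-zero-column `[:-1]` slicing is omitted here for brevity:

```python
import numpy as np
from scipy import sparse

# Column selection via the transpose trick: transpose, row-index,
# transpose back.
data = sparse.csr_matrix(np.array([[1, 0, 2], [0, 3, 0]]))
idx = [2, 0]                      # select columns 2 and 0, in that order
picked = data.T[idx].T
print(picked.toarray().tolist())  # [[2, 1], [0, 0]]

# The same indexing on an empty (zero-row) frame's data:
empty = sparse.csr_matrix((0, 3))
print(empty.T[idx].T.shape)       # (0, 2)
```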

This leverages the dask.delayed object API to achieve the same result,
which was previously a hack between map_partitions and initializing
dd.DataFrame directly.
@kayibal kayibal merged commit e8fa03f into master Jun 14, 2019