Implement Dataset add_item #1870
Conversation
lhoestq left a comment
Nice!
The pyarrow table created from the dict may have different feature types than the current dataset, so you may want to cast the features of the table before concatenating.
To do so, please use:
# get the pyarrow type corresponding to the features
type = self.features.type
# build the schema of the table
schema = pa.schema({col_name: type[col_name].type for col_name in self._data.column_names})
# cast the table
table = table.cast(schema)

Also adding examples this way breaks some assumptions regarding __getstate__ for pickling.
In particular, one assumption is that the dataset is either fully in memory (dataset._data_files is empty) or can be reloaded from disk (using dataset._data_files).
This assumption was convenient to handle in-memory and on-disk datasets differently:
- an in-memory dataset can just be pickled/unpickled in memory
- an on-disk dataset can be unloaded to keep only the file paths when pickling, and then reloaded from disk when unpickling
So I think we'll need to refactor these things first (this is a mandatory thing to do anyway in my opinion).
Maybe we can have a design that allows a Dataset to have a Table that can be rebuilt from heterogeneous sources like in-memory tables or on-disk tables? This could also be further extended in the future.
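For reference, a minimal, self-contained sketch of the cast-before-concatenate suggestion above, using plain pyarrow; the column names and values are illustrative, and ds.column_names stands in for self._data.column_names:

import pyarrow as pa
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "b"], "label": [0, 1]})

# Build a pyarrow table from the new item; inferred types may differ from ds.features
table = pa.Table.from_pydict({"text": ["c"], "label": [2]})

# Get the pyarrow type corresponding to the dataset features
features_type = ds.features.type
# Build the target schema and cast the new table to it
schema = pa.schema({name: features_type[name].type for name in ds.column_names})
table = table.cast(schema)
# `table` now has the same feature types as the dataset and can be concatenated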
Thanks @lhoestq for your remarks. Yes, I agree there are still many issues to be tackled... This PR is just a starting point, so that we can discuss how Dataset should be generalized.

Sure! I opened an issue #1877 so we can discuss this specific aspect :)
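As a purely illustrative aside (not the actual datasets implementation), the in-memory vs. on-disk pickling split described above boils down to a pattern like the following, where the class name and the feather file format are assumptions:

import pyarrow as pa
import pyarrow.feather as feather

class PicklableTableHolder:  # hypothetical helper, not part of datasets
    def __init__(self, table, data_files=None):
        self.table = table
        self.data_files = data_files or []  # empty list means fully in memory

    def __getstate__(self):
        if self.data_files:
            # on-disk case: drop the table and keep only the file paths
            return {"data_files": self.data_files}
        # in-memory case: serialize the table contents directly
        return {"data": self.table.to_pydict(), "schema": self.table.schema}

    def __setstate__(self, state):
        if "data_files" in state:
            self.data_files = state["data_files"]
            # reload the table from disk when unpickling
            self.table = pa.concat_tables([feather.read_table(p) for p in state["data_files"]])
        else:
            self.data_files = []
            self.table = pa.Table.from_pydict(state["data"], schema=state["schema"])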
src/datasets/arrow_dataset.py
schema = pa.schema({col_name: type[col_name].type for col_name in self._data.column_names})
table = table.cast(schema)
# Concatenate tables
self._data = concat_tables([self._data, table])
Maybe here I could use ConcatenationTable.from_tables instead.
Or even better, ConcatenationTable.from_blocks.
Either concat_tables or ConcatenationTable.from_tables is fine :)
But ConcatenationTable.from_blocks only takes InMemoryTable or MemoryMappedTable objects as input, so it may fail if self._data is already a ConcatenationTable.
Good. I'll leave it as is then.
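For context, a short sketch of the two options discussed in this sub-thread, assuming the datasets.table helpers mentioned here (InMemoryTable, concat_tables, ConcatenationTable); exact signatures may vary between versions:

from datasets.table import ConcatenationTable, InMemoryTable, concat_tables

base = InMemoryTable.from_pydict({"text": ["a", "b"]})
extra = InMemoryTable.from_pydict({"text": ["c"]})

# Option 1: the generic helper used in the diff above
combined = concat_tables([base, extra])

# Option 2: build the ConcatenationTable explicitly
combined = ConcatenationTable.from_tables([base, extra])

# ConcatenationTable.from_blocks, by contrast, expects raw InMemoryTable or
# MemoryMappedTable blocks, so it is not a safe drop-in when the existing
# table is itself already a ConcatenationTable.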
lhoestq left a comment
Cool!
Also, that means that if someone calls add_item 100 times, we end up with 100 InMemoryTable objects.
Maybe we can have a consolidation step?
For example, we could merge successive InMemoryTable objects into one InMemoryTable inside a ConcatenationTable.
This would help speed up subsequent ConcatenationTable.slice calls, for example, since slicing iterates over table.blocks.
This should also speed up Dataset.__getitem__.
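Purely as an illustration of the consolidation idea (the helper below is hypothetical and not the implementation that landed in #2151), merging runs of consecutive in-memory blocks could look roughly like this:

import pyarrow as pa
from datasets.table import InMemoryTable

def consolidate_blocks(blocks):
    """Merge consecutive InMemoryTable blocks into single blocks (illustrative only)."""
    consolidated, pending = [], []
    for block in blocks:
        if isinstance(block, InMemoryTable):
            pending.append(block.table)  # underlying pyarrow.Table
        else:
            if pending:
                consolidated.append(InMemoryTable(pa.concat_tables(pending)))
                pending = []
            consolidated.append(block)  # e.g. a MemoryMappedTable is kept as-is
    if pending:
        consolidated.append(InMemoryTable(pa.concat_tables(pending)))
    return consolidated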
I am going to implement this consolidation step in #2151.

Sounds good!

I will pick this PR back up once the consolidation step has been implemented in #2151.
lhoestq left a comment
Thank you!
Co-authored-by: Quentin Lhoest <[email protected]>
Implement Dataset.add_item. Close #1854.
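A minimal usage sketch of the feature added by this PR (column names and values are illustrative):

from datasets import Dataset

ds = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})
ds = ds.add_item({"text": "new example", "label": 1})  # returns a new Dataset
print(len(ds))  # 3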