Conversation

@albertvillanova
Member

Implement Dataset.add_item.

Close #1854.

Member

@lhoestq lhoestq left a comment

Nice!

The pyarrow table created from the dict may have different feature types than the current dataset, so you may want to cast the features of the table before concatenating.
To do so, please use:

# get the pyarrow type corresponding to the features
features_type = self.features.type
# build the schema of the table from the current dataset's column names
schema = pa.schema({col_name: features_type[col_name].type for col_name in self._data.column_names})
# cast the new table to that schema
table = table.cast(schema)

Also, adding examples this way breaks some assumptions regarding __getstate__ for pickling.
In particular, one assumption is that the dataset is either fully in memory (dataset._data_files is empty), or the dataset can be reloaded from disk (using dataset._data_files).
This assumption was convenient to handle both in-memory and on-disk dataset differently:

  • in-memory dataset can just be pickled/unpickled in-memory
  • on-disk dataset could be unloaded to only keep the filepaths when pickling, and then reloaded from the disk when unpickling
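The two pickling paths described above can be sketched with a toy class (illustrative only, not the actual `datasets` implementation): `__getstate__` drops the in-memory payload when file paths are available, and `__setstate__` reloads from disk (stubbed here) on unpickling.

```python
import pickle


class TinyDataset:
    """Illustrative stand-in for the pickling assumption described above."""

    def __init__(self, data, data_files=None):
        self.data = data                      # payload held in memory
        self.data_files = data_files or []    # paths the data can be reloaded from

    def __getstate__(self):
        state = self.__dict__.copy()
        if self.data_files:
            # On-disk dataset: unload the payload, keep only the file paths.
            state["data"] = None
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        if self.data_files and self.data is None:
            # Reload from disk when unpickling (stubbed for the sketch).
            self.data = f"reloaded from {self.data_files}"


# In-memory dataset: pickled/unpickled as-is.
ds = TinyDataset(data=[1, 2, 3])
print(pickle.loads(pickle.dumps(ds)).data)  # [1, 2, 3]
```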

So I think we'll need to refactor these things first (this is a mandatory thing to do anyway in my opinion).

Maybe we can have a design that allows a Dataset to have a Table that can be rebuilt from heterogeneous sources, like in-memory tables or on-disk tables? This could also be further extended in the future.

@albertvillanova
Member Author

Thanks @lhoestq for your remarks. Yes, I agree there are still many issues to be tackled... This PR is just a starting point, so that we can discuss how Dataset should be generalized.

@lhoestq
Member

lhoestq commented Feb 15, 2021

Sure! I opened issue #1877 so we can discuss this specific aspect :)

schema = pa.schema({col_name: type[col_name].type for col_name in self._data.column_names})
table = table.cast(schema)
# Concatenate tables
self._data = concat_tables([self._data, table])
Member Author

Maybe here I could use ConcatenationTable.from_tables instead.

Member Author

Or even better, ConcatenationTable.from_blocks.

Member

Either concat_tables or ConcatenationTable.from_tables is fine :)

But ConcatenationTable.from_blocks only takes InMemoryTable or MemoryMappedTable objects as input, so it may fail if self._data is already a ConcatenationTable.

Member Author

Good. I'll leave it as it is then.

@albertvillanova albertvillanova marked this pull request as ready for review March 30, 2021 12:29
@albertvillanova albertvillanova added the enhancement New feature or request label Mar 30, 2021
Member

@lhoestq lhoestq left a comment

Cool!

Also, that means that if someone calls add_item 100 times, we end up with 100 InMemoryTable objects.

Maybe we can have a consolidation step?
For example, we could merge successive InMemoryTable objects into one InMemoryTable inside a ConcatenationTable.
This would help speed up subsequent ConcatenationTable.slice calls, for example, since they iterate over the table.blocks.
This should also speed up Dataset.__getitem__.

@albertvillanova
Member Author

I am going to implement this consolidation step in #2151.

@lhoestq
Member

lhoestq commented Apr 1, 2021

Sounds good!

@albertvillanova
Member Author

I will resume this PR once the consolidation step is implemented in #2151.

@albertvillanova albertvillanova modified the milestones: 1.6, 1.7 Apr 20, 2021
Member

@lhoestq lhoestq left a comment

Thank you!

Co-authored-by: Quentin Lhoest <[email protected]>
@albertvillanova albertvillanova merged commit 1f83a89 into huggingface:master Apr 23, 2021

Successfully merging this pull request may close these issues.

Feature Request: Dataset.add_item
