Implement Dataset add_item #1870
Conversation
lhoestq left a comment
Nice!
The pyarrow table created from the dict may have different feature types than the current dataset, so you may want to cast the features of the table before concatenating.
To do so, please use:
# get the pyarrow type corresponding to the features
type = self.features.type
# build the schema of the table
schema = pa.schema({col_name: type[col_name].type for col_name in self._data.column_names})
# cast the table
table = table.cast(schema)

Also adding examples this way breaks some assumptions regarding __getstate__ for pickling.
In particular, one assumption is that the dataset is either fully in memory (dataset._data_files is empty) or can be reloaded from disk (using dataset._data_files).
This assumption was convenient to handle in-memory and on-disk datasets differently:
- an in-memory dataset can just be pickled/unpickled in memory
- an on-disk dataset can be unloaded to keep only the file paths when pickling, and then reloaded from disk when unpickling
So I think we'll need to refactor these things first (this is a mandatory thing to do anyway in my opinion).
Maybe we can have a design that allows a Dataset to have a Table that can be rebuilt from heterogeneous sources like in-memory tables or on-disk tables? This could also be further extended in the future.
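For reference, a minimal, self-contained sketch of the cast-before-concatenate suggestion above, using plain pyarrow; the column names and values are illustrative, and ds.column_names stands in for self._data.column_names:

import pyarrow as pa
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "b"], "label": [0, 1]})

# Build a pyarrow table from the new item; inferred types may differ from ds.features
table = pa.Table.from_pydict({"text": ["c"], "label": [2]})

# Get the pyarrow type corresponding to the dataset features
features_type = ds.features.type
# Build the target schema and cast the new table to it
schema = pa.schema({name: features_type[name].type for name in ds.column_names})
table = table.cast(schema)
# `table` now has the same feature types as the dataset and can be concatenated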
Thanks @lhoestq for your remarks. Yes, I agree there are still many issues to be tackled... This PR is just a starting point, so that we can discuss how Dataset should be generalized.

Sure! I opened an issue #1877 so we can discuss this specific aspect :)
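As a purely illustrative aside (not the actual datasets implementation), the in-memory vs. on-disk pickling split described above boils down to a pattern like the following, where the class name and the feather file format are assumptions:

import pyarrow as pa
import pyarrow.feather as feather

class PicklableTableHolder:  # hypothetical helper, not part of datasets
    def __init__(self, table, data_files=None):
        self.table = table
        self.data_files = data_files or []  # empty list means fully in memory

    def __getstate__(self):
        if self.data_files:
            # on-disk case: drop the table and keep only the file paths
            return {"data_files": self.data_files}
        # in-memory case: serialize the table contents directly
        return {"data": self.table.to_pydict(), "schema": self.table.schema}

    def __setstate__(self, state):
        if "data_files" in state:
            self.data_files = state["data_files"]
            # reload the table from disk when unpickling
            self.table = pa.concat_tables([feather.read_table(p) for p in state["data_files"]])
        else:
            self.data_files = []
            self.table = pa.Table.from_pydict(state["data"], schema=state["schema"])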
src/datasets/arrow_dataset.py
schema = pa.schema({col_name: type[col_name].type for col_name in self._data.column_names})
table = table.cast(schema)
# Concatenate tables
self._data = concat_tables([self._data, table])
Maybe here I could use ConcatenationTable.from_tables instead.
Or even better, ConcatenationTable.from_blocks.
Either concat_tables or ConcatenationTable.from_tables is fine :)
But ConcatenationTable.from_blocks only takes InMemoryTable or MemoryMappedTable objects as input, so it may fail if self._data is already a ConcatenationTable.
Good. I'll leave it as is then.
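For context, a short sketch of the two options discussed in this sub-thread, assuming the datasets.table helpers mentioned here (InMemoryTable, concat_tables, ConcatenationTable); exact signatures may vary between versions:

from datasets.table import ConcatenationTable, InMemoryTable, concat_tables

base = InMemoryTable.from_pydict({"text": ["a", "b"]})
extra = InMemoryTable.from_pydict({"text": ["c"]})

# Option 1: the generic helper used in the diff above
combined = concat_tables([base, extra])

# Option 2: build the ConcatenationTable explicitly
combined = ConcatenationTable.from_tables([base, extra])

# ConcatenationTable.from_blocks, by contrast, expects raw InMemoryTable or
# MemoryMappedTable blocks, so it is not a safe drop-in when the existing
# table is itself already a ConcatenationTable.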
lhoestq left a comment
Cool!
Also, that means that if someone calls add_item 100 times, we end up with 100 InMemoryTable objects.
Maybe we can have a consolidation step?
For example, we could merge successive InMemoryTable objects into one InMemoryTable inside a ConcatenationTable.
This would help speed up subsequent ConcatenationTable.slice calls, for example, since slicing iterates over table.blocks.
This should also speed up Dataset.__getitem__.
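Purely as an illustration of the consolidation idea (the helper below is hypothetical and not the implementation that landed in #2151), merging runs of consecutive in-memory blocks could look roughly like this:

import pyarrow as pa
from datasets.table import InMemoryTable

def consolidate_blocks(blocks):
    """Merge consecutive InMemoryTable blocks into single blocks (illustrative only)."""
    consolidated, pending = [], []
    for block in blocks:
        if isinstance(block, InMemoryTable):
            pending.append(block.table)  # underlying pyarrow.Table
        else:
            if pending:
                consolidated.append(InMemoryTable(pa.concat_tables(pending)))
                pending = []
            consolidated.append(block)  # e.g. a MemoryMappedTable is kept as-is
    if pending:
        consolidated.append(InMemoryTable(pa.concat_tables(pending)))
    return consolidated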
I am going to implement this consolidation step in #2151.

Sounds good!

I will pick this PR back up once the consolidation step has been implemented in #2151.
lhoestq left a comment
Thank you!
Co-authored-by: Quentin Lhoest <[email protected]>
Implement Dataset.add_item. Close #1854.
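A minimal usage sketch of the feature added by this PR (column names and values are illustrative):

from datasets import Dataset

ds = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})
ds = ds.add_item({"text": "new example", "label": 1})  # returns a new Dataset
print(len(ds))  # 3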