@andersjohanandreassen
Contributor

This PR adds all BIG-bench json tasks to huggingface/datasets.

andersjohanandreassen and others added 30 commits March 14, 2022 00:21
@lhoestq
Member

lhoestq commented May 6, 2022

Now the last question: should we host the dataset under google/bigbench, @andersjohanandreassen?

I think it would be nicer: this way you and anyone on your team can update the dataset card whenever you want without going through a GitHub PR. You just need to join the https://huggingface.co/google page using your Google email :)

@andersjohanandreassen
Contributor Author

Hi @lhoestq,

Thank you so much for the help! I really appreciate it!!!

After some discussion with the other bigbench organizers, I think there is a slight preference for bigbench to not be under google/bigbench since this is a collaboration with researchers from many different institutions/organizations beyond Google.

I see the drawback with the updates to the dataset card having to go through a PR, but hopefully that won't be very frequent.

We're finalizing the release of the bigbench API on PyPI. Once that's done, I just need to update setup.py with the correct dependency, and then I think we are ready to merge.

@lhoestq
Member

lhoestq commented May 16, 2022

Ok perfect, thank you !

@lhoestq
Member

lhoestq commented May 16, 2022

I noticed that installing the dependencies takes forever in the latest Windows CI run. Was there any recent change in the bigbench dependencies?

@andersjohanandreassen
Contributor Author

Oh, sorry! I just double-checked the dependencies, and it seems at least one that should have been removed is still there. One new dependency was also added.
Let me remove those again. I'll ping you here when it's updated.

Member

@lhoestq lhoestq left a comment

Awesome thanks for fixing the deps :) I think this is ready to merge

@lhoestq
Member

lhoestq commented May 30, 2022

It looks like there is a circular dependency in bigbench at https://storage.googleapis.com/public_research_data/bigbench/bigbench-0.0.1.tar.gz

>>> import bigbench.api.util as bb_utils
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/circleci/.pyenv/versions/3.6.15/lib/python3.6/site-packages/bigbench/api/util.py", line 29, in <module>
    import bigbench.models.query_logging_model as query_logging_model
  File "/home/circleci/.pyenv/versions/3.6.15/lib/python3.6/site-packages/bigbench/models/query_logging_model.py", line 23, in <module>
    import bigbench.api.util as util
AttributeError: module 'bigbench.api' has no attribute 'util'
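
For readers unfamiliar with this failure mode, here is a minimal sketch of two modules that import each other. It uses a hypothetical package named `demo_pkg` that only mirrors the shape of the traceback. To my understanding, on Python 3.6 the `import a.b.c as m` form looks the submodule up as an attribute of its parent package, and that attribute is only set once the submodule has finished importing, so it fails when the import cycle re-enters. The `from package import submodule` form falls back to `sys.modules` and tolerates the cycle. This is one common workaround and not necessarily how bigbench was fixed.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical package layout; `demo_pkg` only mirrors the shape of the traceback.
FILES = {
    "demo_pkg/__init__.py": "",
    "demo_pkg/api/__init__.py": "",
    "demo_pkg/models/__init__.py": "",
    "demo_pkg/api/util.py": """
        from demo_pkg.models import query_logging_model

        def helper():
            return "util.helper"
    """,
    "demo_pkg/models/query_logging_model.py": """
        from demo_pkg.api import util

        def describe():
            # Resolved at call time, after both modules have finished loading.
            return util.helper()
    """,
}


def run_demo() -> str:
    """Write the package to a temp directory, import it, and call through the cycle."""
    root = Path(tempfile.mkdtemp())
    for rel_path, source in FILES.items():
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(textwrap.dedent(source))
    sys.path.insert(0, str(root))
    try:
        importlib.invalidate_caches()
        util = importlib.import_module("demo_pkg.api.util")
        return util.query_logging_model.describe()
    finally:
        sys.path.remove(str(root))


if __name__ == "__main__":
    print(run_demo())
```

The key point is that neither module touches the other's attributes at import time; `util.helper()` is only looked up when `describe()` is called, by which point both modules are fully initialized.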

Member

@lhoestq lhoestq left a comment

Awesome thanks for fixing it ! Let me know if there is something else to do before we merge.

In particular, let me know if you plan to make a release on PyPI so that users can use that version instead of the one hosted on GCS. For now we can merge using the GCS URL, but please avoid changing the file at that URL if possible ^^

@andersjohanandreassen
Contributor Author

Hi @lhoestq ,
I think we are ready to merge!

I have one minor question that I haven't been able to figure out:
Is there a way to prevent verify_infos from triggering? I have max_examples as an argument to allow selecting a fixed subset of the datasets (some of the tasks have a very large number of examples). But this variable is not specified by the configs, so it raises a NonMatchingSplitsSizesError.
I wasn't able to work my way around this, but perhaps there is a way to bypass this that I'm not seeing?
If this cannot be done, I'm happy to ignore this for now.
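
To illustrate the mismatch described above, here is a minimal sketch that assumes the verification compares the number of generated examples against a size recorded for the full split. Every name in it (`SplitSizeMismatch`, `generate_split`, `check_split_sizes`) is hypothetical and stands in for the library's behavior; it is not the huggingface/datasets implementation.

```python
from typing import Dict, List, Optional


class SplitSizeMismatch(Exception):
    """Hypothetical stand-in for NonMatchingSplitsSizesError."""


def generate_split(examples: List[dict], max_examples: Optional[int] = None) -> List[dict]:
    """Return all examples, or only the first `max_examples` when a cap is given."""
    if max_examples is None:
        return examples
    return examples[:max_examples]


def check_split_sizes(recorded: Dict[str, int], generated: Dict[str, List[dict]]) -> None:
    """Raise if any generated split differs in size from its recorded count."""
    mismatches = {
        name: (recorded.get(name), len(rows))
        for name, rows in generated.items()
        if recorded.get(name) != len(rows)
    }
    if mismatches:
        raise SplitSizeMismatch(f"recorded vs. generated sizes differ: {mismatches}")


# Assumed setup: sizes were recorded once for the full task.
full_task = [{"idx": i} for i in range(1000)]
recorded_sizes = {"default": len(full_task)}

# Without a cap, the generated split matches the recorded size.
check_split_sizes(recorded_sizes, {"default": generate_split(full_task)})

# With a cap, the generated split is smaller than the recorded size.
try:
    check_split_sizes(recorded_sizes, {"default": generate_split(full_task, max_examples=100)})
    subset_rejected = False
except SplitSizeMismatch:
    subset_rejected = True
```

Under this assumption, any argument that changes the number of generated examples without being part of the recorded configuration will trip the check.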

Regarding PyPI, we are working on a release there, but I'm told there is a problem with the upload. We are not sure when it will be resolved, and it's not in my control.
I think merging this PR with the GCS URL is a great idea, and I will open a new PR when the PyPI version is ready.

@lhoestq
Member

lhoestq commented Jun 8, 2022

Cool ! Merging then :D

> Is there a way to prevent verify_infos from triggering? I have max_examples as an argument to allow selecting a fixed subset of the datasets (some of the tasks have a very large number of examples). But this variable is not specified by the configs, so it raises a NonMatchingSplitsSizesError.

This is a bug. I opened an issue for it, and it should be easy to fix :)

@lhoestq lhoestq merged commit 72e8679 into huggingface:master Jun 8, 2022
@lhoestq
Member

lhoestq commented Jun 8, 2022

The bigbench page is available here! https://huggingface.co/datasets/bigbench

I think we can update the dataset viewer to install bigbench, but since this is production code I'd rather use the PyPI version of bigbench once it comes out.
