Member

@lewtun lewtun commented Jul 9, 2021

This PR converts the return type of all sklearn metrics to be Python float instead of numpy.float64.

The reason is that our Hub evaluation framework converts benchmark-specific metrics to YAML (example), and yaml.dump serializes a numpy.float64 as an opaque Python object tag instead of a plain number:

import yaml
from datasets import load_metric

metric = load_metric("accuracy")
score = metric.compute(predictions=[0,1], references=[0,1])
print(yaml.dump(score["accuracy"])) # output below
# !!python/object/apply:numpy.core.multiarray.scalar
# - !!python/object/apply:numpy.dtype
#   args:
#   - f8
#   - false
#   - true
#   state: !!python/tuple
#   - 3
#   - <
#   - null
#   - null
#   - null
#   - -1
#   - -1
#   - 0
# - !!binary |
#   AAAAAAAA8D8=
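
For comparison, here is a minimal sketch of what the cast changes, assuming only that numpy and PyYAML are installed. It does not use load_metric; `raw_score` is a hypothetical stand-in for the numpy.float64 value that the sklearn-backed metrics returned before this PR.

```python
import numpy as np
import yaml

# Hypothetical stand-in for a metric value returned as numpy.float64.
raw_score = np.float64(1.0)

# Casting to a built-in float is the behaviour this PR aims for.
clean_score = float(raw_score)

# yaml.dump has no dedicated representer for numpy.float64, so it falls
# back to a generic Python-object tag; a built-in float is emitted as a
# plain YAML number.
raw_yaml = yaml.dump({"accuracy": raw_score})
clean_yaml = yaml.dump({"accuracy": clean_score})

print(clean_yaml)
```

The dictionary wrapper is only there to make the output look like a metrics entry; the same difference shows up when dumping the bare scalar.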

@lewtun lewtun requested review from albertvillanova and lhoestq July 9, 2021 09:48
Member Author

lewtun commented Jul 9, 2021

I opened an issue on the sklearn repo to understand why numpy.float64 is the default: scikit-learn/scikit-learn#20490

Member

@lhoestq lhoestq left a comment

Thanks ! :)

Member

lhoestq commented Jul 9, 2021

It may be surprising at first to call tolist() on numpy scalars, but it works: it returns the equivalent Python scalar ^^
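
A short sketch of that behaviour, assuming only numpy. The helper name `to_python` is hypothetical and not part of the datasets library. One plausible reason to prefer tolist() over float() is that the same call also handles array-valued results, although the thread does not state this.

```python
import numpy as np

def to_python(value):
    """Hypothetical helper: convert numpy scalars or arrays to built-in types."""
    # np.generic covers numpy scalars, np.ndarray covers arrays;
    # both provide tolist(), which returns built-in Python objects.
    if isinstance(value, (np.generic, np.ndarray)):
        return value.tolist()
    return value

scalar_result = to_python(np.float64(0.5))          # numpy scalar -> float
array_result = to_python(np.array([0.25, 0.75]))    # array -> list of floats
untouched = to_python(0.5)                          # already a Python float

print(type(scalar_result), type(array_result), type(untouched))
```

For a scalar, `value.item()` and `float(value)` would give the same built-in float; tolist() is just the variant mentioned in the comment above.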

@lhoestq lhoestq merged commit 060dc85 into huggingface:master Jul 9, 2021
@lewtun lewtun deleted the cast-numpy-to-python-types branch July 9, 2021 13:05
Member

lhoestq commented Jul 9, 2021

did the same for Pearsonr here: #2614

@albertvillanova albertvillanova added this to the 1.10 milestone Jul 12, 2021


3 participants