Skip to content

Conversation

@YooSungHyun
Copy link
Contributor

@YooSungHyun YooSungHyun commented May 26, 2022

first of all, please look #4323

why i can not use {"path","array","sampling_rate"}
because sf.write(format="wav") and sf.read(BytesIO) is changed my pcm data value
maybe, i think wav got header but, pcm is not.
and variable naming, pcm data is "byte" type. so, "array" name is not fair i think

so, i use scipy lib and numpy (that is huggingface dependency)
and refer to @lhoestq answered,

  1. encode -> using sampling_rate and pcm byte -> wav style byte (scipy.wavfile.write to byte)
  2. byte converting using fairseq style pcm audio read FileAudioDataset
  3. decode -> read wavfile.read

that way is not screw up my pcm byte to float data, and another audio type(wav) safety

please check!

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome thanks ! Could you also add tests in tests/features/test_audio.py ?

Maybe add a small pcm file in tests/features/data and check that everything works as expected in tests cases like test_audio_encode_example_pcm and test_audio_decode_example_pcm for example.

@mariosasko mariosasko linked an issue Jun 1, 2022 that may be closed by this pull request
@mariosasko
Copy link
Collaborator

@lhoestq Maybe I'm missing something, but what's the reason to read and encode PCM files to WAV in Audio.encode_example. Isn't the whole purpose of the decodable types to operate on raw files whenever possible? IMO this PR should only modify Audio.decode_example to support PCM files/bytes decoding.

@lhoestq
Copy link
Member

lhoestq commented Jun 1, 2022

Because the PCM file is not enough, we also need the sampling_rate associated to it. Therefore the two alternatives are either:

  • convert to WAV
  • add a sampling_rate field to the Audio arrow storage (not sure how it would behave for backward compatibility though)

@mariosasko
Copy link
Collaborator

But scipy.io.wavfile.read, which is used for reading such files, returns a file's sampling rate. The only tricky part is resampling to a different sampling rate than the default one.

@lhoestq
Copy link
Member

lhoestq commented Jun 1, 2022

How does it get the sampling rate of a PCM file then ? According to SO it's not possible to infer it from the file alone

@YooSungHyun
Copy link
Contributor Author

Awesome thanks ! Could you also add tests in tests/features/test_audio.py ?

Maybe add a small pcm file in tests/features/data and check that everything works as expected in tests cases like test_audio_encode_example_pcm and test_audio_decode_example_pcm for example.

@lhoestq how can i test test_audio.py? where is "main" func?
do you have some example or guideline?

@YooSungHyun
Copy link
Contributor Author

YooSungHyun commented Jun 2, 2022

But scipy.io.wavfile.read, which is used for reading such files, returns a file's sampling rate. The only tricky part is resampling to a different sampling rate than the default one.

@mariosasko @lhoestq
thanks for comment!

First of all, "PCM file" can not read alone to any audio library.
"PCM file" has not any audio META information header. (it just purely audio byte data. therefore, we don't have to encoding and decoding)
but, "PCM file" is audio extension, so we can use datasets.Audio

if you want to read "PCM file" to audio file likely, it have to needs additional parameter. (channel, sampling_rate, else....)
but, in many situation, we only know sampling_rate for PCM

and, if we want to use datasets.Audio for "PCM file", we must process encode_example.
therefore, i have to use sampling_rate for encoding for making wav-style byte. (we only know sampling_rate)

In my source code, I don't compare sampling rate(datasets.Audio's self.sampling_rate and read pcm sampling_rate(value["sampling_rate"])) and checking mono
@mariosasko ! do you want to process resampling and making mono? then i can modify my source

@YooSungHyun YooSungHyun requested a review from lhoestq June 2, 2022 09:40
@lhoestq
Copy link
Member

lhoestq commented Jun 9, 2022

There is no "main" function in test scripts :) To run a test script you must use the pytest command:

pytest tests/features/test_audio.py

to run only one function you can also do

pytest tests/features/test_audio.py::test_audio_feature_type_to_arrow

for example

@YooSungHyun
Copy link
Contributor Author

@lhoestq
maybe, if i write test code, i have to commit test_audio.py and send pr?
because, we need to keep test_audio_encode_example_pcm and test_audio_decode_example_pcm method after my pr merged?

@lhoestq
Copy link
Member

lhoestq commented Jun 13, 2022

You can add your tests in this PR with the other changes you did

@YooSungHyun
Copy link
Contributor Author

YooSungHyun commented Jun 14, 2022

@lhoestq
test complete & commit my test_audio.py

AND, some change in my code.

audio.py
i think "sampling_rate" is already Audio object initial variable. so, we don`t have to use input parameter.

test_audio.py
we can check "PCM" file to path (exactly, extenstion)
so, test case has to know path. if only have bytes, we don`t know that is "PCM" or not

@YooSungHyun
Copy link
Contributor Author

@lhoestq
and, why circleci raised exception?
maybe, repo url is not found!
PLZ, CHK!

@YooSungHyun
Copy link
Contributor Author

@lhoestq
hello????

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tests ! This is awesome :)

I left two comments. Also to fix the CI feel free need to merge the master branch into yours

@YooSungHyun
Copy link
Contributor Author

@lhoestq
test_audio.py
if we don`t use path in pcm, test-case need to be changed
so, we check path just None

@YooSungHyun
Copy link
Contributor Author

YooSungHyun commented Jun 24, 2022

i'm merge branch already and multiprocess in setup.py but circleci error only win version
image
how can i fixed it?

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can ignore the multiprocess error in the CI, this is unrelated to your PR and we're working on it ;)

Thanks a lot for the changes, I spotted one thing that requires a small change: when writing the WAV data, we should use the sampling rate from the input example dictionary (the same that contains "path") - and not the sampling rate from the Audio attribute.

Indeed the sampling rate of the Audio attribute is the expected sampling rate when decoding the example. When we encode it we must specify the sampling rate of the PCM file

YooSungHyun and others added 8 commits June 29, 2022 14:27
self.sampling_rate is for decode. so, we have to get value`s sampling_rate

Co-authored-by: Quentin Lhoest <[email protected]>
we have to know sampling_rate in input values variable

Co-authored-by: Quentin Lhoest <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>
@YooSungHyun
Copy link
Contributor Author

@lhoestq thx for comment!
test_audio.py test complete. it runs sucessfully
and, self.get("sampling_rate") -> value.get("sampling_rate") changed

and, some comment is not agreed to me, plz check my sub comment!

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you ! It looks all good now.

I just removed the scipy import, since we already import soundfile:

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Jul 7, 2022

The documentation is not available anymore as the PR was closed or merged.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding support for PCM :)

Let's merge this one. The CI fails are unrelated to this PR and fixed on main

@lhoestq lhoestq merged commit 693418a into huggingface:main Jul 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Audio can not find value["bytes"]

4 participants