preprocess with PIL the full val dataset and save binary by lidanqing-vv · Pull Request #16529 · PaddlePaddle/Paddle

lidanqing-vv · 2019-03-28T12:44:39Z

test=develop
Preprocess the full ILSVRC2012 validation set with Python PIL and save it to a binary file, so that the preprocessing is aligned with test_calibration (INT8 v1).
Provide the full validation data for analyzer_int8_resnet50_test in PR #16399.

test=develop

lidanqing-vv · 2019-03-28T12:47:37Z

@luotao1 Could you please check whether DATA_DIR is OK?

luotao1 · 2019-03-28T13:01:04Z

Could you add the wget and unzip commands? Otherwise, users won't know how to get the original dataset. You can refer to test_calibration.py.
Then I can use preprocess.py alone to do the wget, unzip and preprocessing. Maybe rename preprocess.py to full_ILSVRC2012_val.py?

luotao1 · 2019-03-28T13:05:27Z

paddle/fluid/inference/tests/api/preprocess.py

+SIZE_FLOAT32 = 4
+SIZE_INT64 = 8
+
+DATA_DIR = '/data/ILSVRC2012'


If I run cd build; python ../paddle/fluid/inference/tests/api/preprocess.py, where will the output data go?

Currently it outputs to /data/ILSVRC2012/data.bin. I also wanted to ask about this: where should I put the output data.bin?

/data/ILSVRC2012/data.bin is not a good choice, since we don't have write permission on /data. How about ./data/ILSVRC2012/data.bin?

@lidanqing-intel you can put it in .cache like V1 did:

Paddle/python/paddle/fluid/contrib/tests/test_calibration.py

Line 120 in 2632327

self.cache_folder = os.path.expanduser('~/.cache/paddle/dataset/' +

I think .cache is OK.

OK, I will use .cache.

@luotao1 The data.bin path will be ~/.cache/int8_full_val_bin/data.bin. Is that OK? I am worried that putting it directly in ~/.cache/ may cause some misunderstanding.

Since V1 puts its data into ~/.cache/paddle/dataset/int8/download, how about putting the file at ~/.cache/paddle/dataset/int8/download/int8_full_val.bin?

Yes! Agree.

I see the unzipped dataset is not in this location.
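The location proposed in this thread can be sketched as follows. This is only an illustration: the function name `output_bin_path` is hypothetical and not taken from the PR, and the directory and file name are the ones suggested in the comments above, not necessarily what the merged script uses.

```python
import os

# Directory used by the INT8 v1 test, as quoted in the thread above.
CACHE_DIR = os.path.expanduser('~/.cache/paddle/dataset/int8/download')


def output_bin_path(cache_dir=CACHE_DIR, file_name='int8_full_val.bin'):
    """Return the output file path, creating the cache directory if needed.

    Hypothetical helper: writing under the user's home directory avoids
    needing write permission on a system path such as /data.
    """
    if not os.path.isdir(cache_dir):
        os.makedirs(cache_dir)
    return os.path.join(cache_dir, file_name)
```

Keeping the file under the same directory as V1 means both tests share one cache root, which was the reason given for not writing directly into ~/.cache/.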

lidanqing-vv · 2019-03-28T13:12:01Z

Could you add the wget and unzip commands? Otherwise, users won't know how to get the original dataset. You can refer to test_calibration.py.
Then I can use preprocess.py alone to do the wget, unzip and preprocessing. Maybe rename preprocess.py to full_ILSVRC2012_val.py?

OK, I will add the wget and unzip part. I could change the name to full_ILSVRC2012_val.py, but the script is about preprocessing. Maybe 'full_ILSVRC2012_val_preprocess.py'? What do you think?

luotao1 · 2019-03-28T13:19:43Z

Maybe 'full_ILSVRC2012_val_preprocess.py'? What do you think?

I think it's OK.

lidanqing-vv · 2019-03-28T13:49:07Z

Could you add the wget and unzip commands? Otherwise, users won't know how to get the original dataset. You can refer to test_calibration.py.
Then I can use preprocess.py alone to do the wget, unzip and preprocessing. Maybe rename preprocess.py to full_ILSVRC2012_val.py?

Do I need to provide an option to download only 100 val images, or is downloading the full val set enough?

luotao1 · 2019-03-28T14:30:22Z

only downloading full val is enough

lidanqing-vv · 2019-03-28T14:54:12Z

only downloading full val is enough

ok

lidanqing-vv · 2019-03-28T15:52:23Z

@bingyanghuang The generated file is ~/.cache/paddle/dataset/int8/download/data/ILSVRC2012/int8_full_val.bin

test=develop

lidanqing-vv · 2019-03-28T23:25:07Z

only downloading full val is enough

Done

bingyanghuang · 2019-03-29T06:21:46Z

paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py

+    with open(file_list) as flist:
+        lines = [line.strip() for line in flist]
+        num_images = len(lines)
+        if not os.path.exists(output_file):


We cannot rely only on the existence of the output file. Generating "data.bin" takes a long time, so the process may not finish: the user may stop it, or an error such as "no space left" may occur. Such interruptions leave an incomplete file in the folder, and the next time this Python script is run it will not regenerate the output binary file.
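One possible way to address this concern is sketched below, assuming Python 3. It combines two ideas: check the file size against the size implied by the layout instead of checking existence only, and write to a temporary file that is renamed only after the write succeeds. The helper names, the `.tmp` suffix, the int64 header, and the 3x224x224 float32 image shape are all assumptions made for illustration; only `SIZE_FLOAT32` and `SIZE_INT64` come from the diff.

```python
import os

SIZE_FLOAT32 = 4
SIZE_INT64 = 8


def expected_size(num_images, image_elems=3 * 224 * 224):
    """Size in bytes of <num_images><all images><all labels> (assumed layout)."""
    return (SIZE_INT64 + num_images * image_elems * SIZE_FLOAT32 +
            num_images * SIZE_INT64)


def is_complete(output_file, num_images, image_elems=3 * 224 * 224):
    """True only if the file exists and has exactly the expected size."""
    return (os.path.isfile(output_file) and
            os.path.getsize(output_file) == expected_size(num_images,
                                                          image_elems))


def write_atomically(output_file, write_fn):
    """Write via a temporary file; rename it only after write_fn succeeds."""
    tmp_file = output_file + '.tmp'
    try:
        with open(tmp_file, 'w+b') as of:
            write_fn(of)
        os.replace(tmp_file, output_file)
    finally:
        # After a successful rename the temporary file no longer exists;
        # after a failure this removes the partial file.
        if os.path.exists(tmp_file):
            os.remove(tmp_file)
```

With this sketch, the guard `if not os.path.exists(output_file)` would become `if not is_complete(output_file, num_images)`, so an interrupted run is regenerated on the next invocation.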

luotao1 · 2019-03-29T07:25:34Z

paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py

+        num_images = len(lines)
+        if not os.path.exists(output_file):
+            print(
+                'Preprocessing to binary file...<num_images><all images><all labels>...\n'


This print message is hard to understand.
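The message appears to describe the layout of the binary file: the image count first, then all images, then all labels. A minimal sketch of that layout using only the standard library is given below. The function names, the little-endian byte order, and the int64 type of the count are assumptions for illustration; the actual script may write the data differently (for example with NumPy in native byte order).

```python
import struct


def write_bin(of, images, labels):
    """Write <num_images><all images><all labels> to an open binary file.

    images: list of equally sized lists of floats; labels: list of ints.
    """
    num_images = len(images)
    of.write(struct.pack('<q', num_images))  # int64 image count
    for image in images:  # float32 pixel values, image by image
        of.write(struct.pack('<%df' % len(image), *image))
    of.write(struct.pack('<%dq' % num_images, *labels))  # int64 labels


def read_bin(f, image_elems):
    """Read back a file written by write_bin; returns (images, labels)."""
    num_images, = struct.unpack('<q', f.read(8))
    images = [
        list(struct.unpack('<%df' % image_elems, f.read(4 * image_elems)))
        for _ in range(num_images)
    ]
    labels = list(struct.unpack('<%dq' % num_images, f.read(8 * num_images)))
    return images, labels
```

A message such as "Writing binary file: image count, then all images, then all labels" would state the same thing in words.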

luotao1 · 2019-03-29T07:25:58Z

paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py

+            print(
+                'Preprocessing to binary file...<num_images><all images><all labels>...\n'
+            )
+            with open(output_file, "w+b") as of:


Please print a progress message every 1000 images.
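A progress message of the kind requested here could look like the sketch below. The helper name `report_progress` and the message text are hypothetical; the interval of 1000 comes from the review comment.

```python
import sys


def report_progress(done, total, every=1000, out=sys.stdout):
    """Print one line each time `every` more images are done, and at the end.

    Returns True if a line was printed, which makes the helper easy to test.
    """
    if done % every == 0 or done == total:
        out.write('%d/%d images processed\n' % (done, total))
        out.flush()
        return True
    return False
```

Inside the preprocessing loop it would be called as `report_progress(idx + 1, num_images)` after each image is written, where `idx` is the zero-based loop index.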

luotao1

I will merge it first; please refine it later.

lidanqing-vv added 3 commits March 28, 2019 06:38

preprocess with PIL the full val dataset and save binary

57f51e5

test=develop

change script file name and data_dir location

894aa9b

test=develop

add wget and unzip part and change data_dir

b46e467

test=develop

luotao1 reviewed Mar 28, 2019

lidanqing-vv closed this Mar 28, 2019

lidanqing-vv reopened this Mar 28, 2019

luotao1 added Intel int8 labels Mar 28, 2019

luotao1 mentioned this pull request Mar 28, 2019

MKLDNN INT8 v2 readme.md #16515

Merged

fix some bugs of unzip and reading val list

0d65699

test=develop

wojtuss mentioned this pull request Mar 28, 2019

create a test for quantized resnet50 - extended #16532

Closed

bingyanghuang reviewed Mar 29, 2019

luotao1 reviewed Mar 29, 2019

luotao1 approved these changes Mar 29, 2019

luotao1 merged commit 8f7b588 into PaddlePaddle:develop Mar 29, 2019

wojtuss mentioned this pull request Mar 29, 2019

fix dataset reading and add support for full dataset #16559

Merged

This was referenced Mar 29, 2019

Review fix for PR 16529 preprocess with PIL the full val dataset and save binary #16562

Closed

fix preprocess script with processbar, integrity check and logs #16608

Merged

lidanqing-vv deleted the lidanqing/preprocess-data branch June 22, 2022 06:00
