preprocess with PIL the full val dataset and save binary #16529

Merged
luotao1 merged 4 commits into PaddlePaddle:develop from lidanqing-vv:lidanqing/preprocess-data
Mar 29, 2019

Conversation

@lidanqing-vv
Contributor

test=develop
Preprocess the full ILSVRC2012 validation set with Python PIL and save it to a binary file, to align with test_calibration (INT8 v1).
Provide the full validation data for analyzer_int8_resnet50_test in PR #16399.

@lidanqing-vv
Contributor Author

@luotao1 Please check whether DATA_DIR is OK.

@luotao1
Contributor

luotao1 commented Mar 28, 2019

Could you add the wget and unzip commands? Otherwise, users won't know how to get the original dataset. You can refer to test_calibration.py.
Then I can use preprocess.py to do the wget, unzip, and preprocessing steps. Maybe rename preprocess.py to full_ILSVRC2012_val.py?
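
A minimal sketch of the download-and-unpack step requested above, using only the Python standard library instead of shelling out to wget and unzip. The helper name `fetch_and_unpack`, its parameters, and the archive file name are hypothetical; the dataset URL is left as an argument because this thread does not state it.

```python
import os
import shutil
import urllib.request


def fetch_and_unpack(url, target_dir, archive_name):
    """Download `url` into `target_dir` and unpack it there.

    Hypothetical helper: the real dataset URL is not given in this thread,
    so the caller must supply it. The archive format (.zip, .tar.gz, ...)
    is inferred from the extension of `archive_name`.
    """
    os.makedirs(target_dir, exist_ok=True)
    archive_path = os.path.join(target_dir, archive_name)
    # Fetch the archive to a local file (plays the role of wget).
    urllib.request.urlretrieve(url, archive_path)
    # Unpack next to the archive (plays the role of unzip / tar -x).
    shutil.unpack_archive(archive_path, target_dir)
    return target_dir
```

Whether the real script should call wget through a subprocess, as test_calibration.py is said to do, is a separate choice; this sketch only shows the order of the steps.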

SIZE_FLOAT32 = 4
SIZE_INT64 = 8

DATA_DIR = '/data/ILSVRC2012'

Contributor

If I run cd build; python ../paddle/fluid/inference/tests/api/preprocess.py, where will the output data go?

Contributor Author

Currently it outputs to /data/ILSVRC2012/data.bin. I wanted to ask about this too: where should I put the output data.bin?

Contributor

/data/ILSVRC2012/data.bin is not a good choice, since we don't have write permission for /data. How about ./data/ILSVRC2012/data.bin?

Contributor

@lidanqing-intel you can put it in .cache, like V1 did:

self.cache_folder = os.path.expanduser('~/.cache/paddle/dataset/' +

Contributor

I think .cache is OK.

Contributor Author

OK, I will use .cache.

Contributor Author

@luotao1 The data.bin path will be ~/.cache/int8_full_val_bin/data.bin. Is that OK? I am worried that putting it directly in ~/.cache/ may cause confusion.

Contributor

Since V1 puts its data into ~/.cache/paddle/dataset/int8/download, how about putting the file at ~/.cache/paddle/dataset/int8/download/int8_full_val.bin?

Contributor Author

> Since V1 puts its data into ~/.cache/paddle/dataset/int8/download, how about putting the file at ~/.cache/paddle/dataset/int8/download/int8_full_val.bin?

Yes! Agree.
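
A small sketch of how the location agreed on in this thread could be resolved in code, assuming the script builds the path with `os.path.expanduser` as the V1 snippet above does. The function name `output_path` and its `home` parameter are hypothetical; `home` exists only so the function can be pointed at a scratch directory.

```python
import os


def output_path(home="~"):
    """Return the agreed location of the preprocessed binary file.

    `home` is a hypothetical parameter; the default resolves to the
    current user's home directory.
    """
    cache_dir = os.path.join(
        os.path.expanduser(home), ".cache", "paddle", "dataset", "int8",
        "download")
    # Create the cache folder on first use so the later write cannot fail
    # because the directory is missing.
    os.makedirs(cache_dir, exist_ok=True)
    return os.path.join(cache_dir, "int8_full_val.bin")
```

Keeping the file under the user's home directory avoids the permission problem raised earlier for /data.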

Contributor

I see that the unzipped dataset is not in this location.

@lidanqing-vv
Contributor Author

> Could you add the wget and unzip commands? Otherwise, users won't know how to get the original dataset. You can refer to test_calibration.py.
> Then I can use preprocess.py to do the wget, unzip, and preprocessing steps. Maybe rename preprocess.py to full_ILSVRC2012_val.py?

OK, I will add the wget and unzip part. I could rename the script to full_ILSVRC2012_val.py, but it is about preprocessing. Maybe 'full_ILSVRC2012_val_preprocess.py'? What do you think?

lidanqing-vv reopened this Mar 28, 2019
@luotao1
Contributor

luotao1 commented Mar 28, 2019

> Maybe 'full_ILSVRC2012_val_preprocess.py'? What do you think?

I think it's OK.

@lidanqing-vv
Contributor Author

> Could you add the wget and unzip commands? Otherwise, users won't know how to get the original dataset. You can refer to test_calibration.py.
> Then I can use preprocess.py to do the wget, unzip, and preprocessing steps. Maybe rename preprocess.py to full_ILSVRC2012_val.py?

Do I need to provide an option to download only the 100 val images, or is downloading the full val set enough?

@luotao1
Contributor

luotao1 commented Mar 28, 2019

Downloading only the full val set is enough.

@lidanqing-vv
Contributor Author

> Downloading only the full val set is enough.

ok

@lidanqing-vv
Contributor Author

lidanqing-vv commented Mar 28, 2019

@bingyanghuang The generated file is ~/.cache/paddle/dataset/int8/download/data/ILSVRC2012/int8_full_val.bin

@lidanqing-vv
Contributor Author

> Downloading only the full val set is enough.

Done

with open(file_list) as flist:
    lines = [line.strip() for line in flist]
    num_images = len(lines)
    if not os.path.exists(output_file):

Contributor

We cannot rely only on the existence of the output file. Generating "data.bin" takes a long time, so the process may be interrupted before it finishes, either because the user stops it or because an error such as "no space left" occurs. Such an interruption leaves an incomplete file in the folder, and the next time you run this Python script it will not regenerate the output binary file.
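
One possible way to address this concern, sketched under assumptions: write to a temporary file in the same directory and rename it to the final name only after everything has been written. With that scheme the existence check on the output file becomes trustworthy. The helper name `write_atomically` is hypothetical and is not part of this PR.

```python
import os
import tempfile


def write_atomically(output_file, chunks):
    """Write `chunks` (an iterable of bytes) to `output_file` all-or-nothing."""
    out_dir = os.path.dirname(os.path.abspath(output_file))
    # The temporary file lives in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
        # Rename only after every chunk was written successfully.
        os.replace(tmp_path, output_file)
    except BaseException:
        # Interrupted (Ctrl-C, "no space left", ...): drop the partial file
        # so the next run starts from scratch.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

An alternative would be to compare the size of an existing file against the size expected for the full dataset and regenerate it on a mismatch.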

num_images = len(lines)
if not os.path.exists(output_file):
    print(
        'Preprocessing to binary file...<num_images><all images><all labels>...\n'

Contributor

This print message is hard to understand.
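
To make the `<num_images><all images><all labels>` layout in that message concrete, here is a toy sketch using the standard `struct` module. It assumes the image count and labels are stored as int64 and the pixels as float32, which the `SIZE_INT64` and `SIZE_FLOAT32` constants suggest but this thread does not confirm. Little-endian byte order, the function names, and the tiny image size are further assumptions made for illustration.

```python
import struct

SIZE_FLOAT32 = 4
SIZE_INT64 = 8


def pack_dataset(images, labels):
    """Serialize as <num_images><all images><all labels>."""
    assert len(images) == len(labels)
    blob = struct.pack("<q", len(images))             # int64 image count
    for img in images:
        blob += struct.pack("<%df" % len(img), *img)  # float32 pixels
    blob += struct.pack("<%dq" % len(labels), *labels)  # int64 labels
    return blob


def unpack_dataset(blob, floats_per_image):
    """Read back a blob written by pack_dataset."""
    (num_images,) = struct.unpack_from("<q", blob, 0)
    offset = SIZE_INT64
    images = []
    for _ in range(num_images):
        fmt = "<%df" % floats_per_image
        images.append(list(struct.unpack_from(fmt, blob, offset)))
        offset += floats_per_image * SIZE_FLOAT32
    labels = list(struct.unpack_from("<%dq" % num_images, blob, offset))
    return images, labels
```

With this layout the total file size is `SIZE_INT64 + num_images * floats_per_image * SIZE_FLOAT32 + num_images * SIZE_INT64` bytes, which a reader can use to compute the offset of any image or label.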

print(
    'Preprocessing to binary file...<num_images><all images><all labels>...\n'
)
with open(output_file, "w+b") as of:

Contributor

Please print a progress message every 1000 images.
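
A minimal sketch of the requested progress reporting. The function name, the `handle` callback, and the message format are hypothetical; the `report` parameter defaults to `print` and exists so the output can be captured.

```python
def process_with_progress(items, handle, every=1000, report=print):
    """Call `handle` on each item and report progress every `every` items."""
    total = len(items)
    for idx, item in enumerate(items, start=1):
        handle(item)
        # Report at each multiple of `every`, and once more at the very end
        # so the final count is always shown.
        if idx % every == 0 or idx == total:
            report("%d/%d images processed" % (idx, total))
```

In the preprocessing script, `handle` would stand for the per-image work of loading, transforming, and writing one image.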

Contributor

luotao1 left a comment

I'll merge it first; please refine it later.

3 participants