Commit 5c50c8f

Annbless authored and Ben-Louis committed
[Feature] Support ViTPose (open-mmlab#1937)
1 parent df7b3e8 commit 5c50c8f

23 files changed (+2,275 / -35 lines)

README.md

Lines changed: 21 additions & 20 deletions
```diff
@@ -130,26 +130,27 @@ A summary can be found in the [Model Zoo](https://mmpose.readthedocs.io/en/0.x/m
 <details open>
 <summary><b>Supported algorithms:</b></summary>
 
-- [x] [DeepPose](https://mmpose.readthedocs.io/en/0.x/papers/algorithms.html#deeppose-cvpr-2014) (CVPR'2014)
-- [x] [CPM](https://mmpose.readthedocs.io/en/0.x/papers/backbones.html#cpm-cvpr-2016) (CVPR'2016)
-- [x] [Hourglass](https://mmpose.readthedocs.io/en/0.x/papers/backbones.html#hourglass-eccv-2016) (ECCV'2016)
-- [x] [SimpleBaseline3D](https://mmpose.readthedocs.io/en/0.x/papers/algorithms.html#simplebaseline3d-iccv-2017) (ICCV'2017)
-- [x] [Associative Embedding](https://mmpose.readthedocs.io/en/0.x/papers/algorithms.html#associative-embedding-nips-2017) (NeurIPS'2017)
-- [x] [HMR](https://mmpose.readthedocs.io/en/0.x/papers/algorithms.html#hmr-cvpr-2018) (CVPR'2018)
-- [x] [SimpleBaseline2D](https://mmpose.readthedocs.io/en/0.x/papers/algorithms.html#simplebaseline2d-eccv-2018) (ECCV'2018)
-- [x] [HRNet](https://mmpose.readthedocs.io/en/0.x/papers/backbones.html#hrnet-cvpr-2019) (CVPR'2019)
-- [x] [VideoPose3D](https://mmpose.readthedocs.io/en/0.x/papers/algorithms.html#videopose3d-cvpr-2019) (CVPR'2019)
-- [x] [HRNetv2](https://mmpose.readthedocs.io/en/0.x/papers/backbones.html#hrnetv2-tpami-2019) (TPAMI'2019)
-- [x] [MSPN](https://mmpose.readthedocs.io/en/0.x/papers/backbones.html#mspn-arxiv-2019) (ArXiv'2019)
-- [x] [SCNet](https://mmpose.readthedocs.io/en/0.x/papers/backbones.html#scnet-cvpr-2020) (CVPR'2020)
-- [x] [HigherHRNet](https://mmpose.readthedocs.io/en/0.x/papers/backbones.html#higherhrnet-cvpr-2020) (CVPR'2020)
-- [x] [RSN](https://mmpose.readthedocs.io/en/0.x/papers/backbones.html#rsn-eccv-2020) (ECCV'2020)
-- [x] [InterNet](https://mmpose.readthedocs.io/en/0.x/papers/algorithms.html#internet-eccv-2020) (ECCV'2020)
-- [x] [VoxelPose](https://mmpose.readthedocs.io/en/0.x/papers/algorithms.html#voxelpose-eccv-2020) (ECCV'2020)
-- [x] [LiteHRNet](https://mmpose.readthedocs.io/en/0.x/papers/backbones.html#litehrnet-cvpr-2021) (CVPR'2021)
-- [x] [ViPNAS](https://mmpose.readthedocs.io/en/0.x/papers/backbones.html#vipnas-cvpr-2021) (CVPR'2021)
-- [x] [DEKR](https://mmpose.readthedocs.io/en/0.x/papers/algorithms.html#dekr-cvpr-2021) (CVPR'2021)
-- [x] [CID](https://mmpose.readthedocs.io/en/0.x/papers/algorithms.html#cid-cvpr-2022) (CVPR'2022)
+- [x] [DeepPose](https://mmpose.readthedocs.io/en/latest/papers/algorithms.html#deeppose-cvpr-2014) (CVPR'2014)
+- [x] [CPM](https://mmpose.readthedocs.io/en/latest/papers/backbones.html#cpm-cvpr-2016) (CVPR'2016)
+- [x] [Hourglass](https://mmpose.readthedocs.io/en/latest/papers/backbones.html#hourglass-eccv-2016) (ECCV'2016)
+- [x] [SimpleBaseline3D](https://mmpose.readthedocs.io/en/latest/papers/algorithms.html#simplebaseline3d-iccv-2017) (ICCV'2017)
+- [x] [Associative Embedding](https://mmpose.readthedocs.io/en/latest/papers/algorithms.html#associative-embedding-nips-2017) (NeurIPS'2017)
+- [x] [HMR](https://mmpose.readthedocs.io/en/latest/papers/algorithms.html#hmr-cvpr-2018) (CVPR'2018)
+- [x] [SimpleBaseline2D](https://mmpose.readthedocs.io/en/latest/papers/algorithms.html#simplebaseline2d-eccv-2018) (ECCV'2018)
+- [x] [HRNet](https://mmpose.readthedocs.io/en/latest/papers/backbones.html#hrnet-cvpr-2019) (CVPR'2019)
+- [x] [VideoPose3D](https://mmpose.readthedocs.io/en/latest/papers/algorithms.html#videopose3d-cvpr-2019) (CVPR'2019)
+- [x] [HRNetv2](https://mmpose.readthedocs.io/en/latest/papers/backbones.html#hrnetv2-tpami-2019) (TPAMI'2019)
+- [x] [MSPN](https://mmpose.readthedocs.io/en/latest/papers/backbones.html#mspn-arxiv-2019) (ArXiv'2019)
+- [x] [SCNet](https://mmpose.readthedocs.io/en/latest/papers/backbones.html#scnet-cvpr-2020) (CVPR'2020)
+- [x] [HigherHRNet](https://mmpose.readthedocs.io/en/latest/papers/backbones.html#higherhrnet-cvpr-2020) (CVPR'2020)
+- [x] [RSN](https://mmpose.readthedocs.io/en/latest/papers/backbones.html#rsn-eccv-2020) (ECCV'2020)
+- [x] [InterNet](https://mmpose.readthedocs.io/en/latest/papers/algorithms.html#internet-eccv-2020) (ECCV'2020)
+- [x] [VoxelPose](https://mmpose.readthedocs.io/en/latest/papers/algorithms.html#voxelpose-eccv-2020) (ECCV'2020)
+- [x] [LiteHRNet](https://mmpose.readthedocs.io/en/latest/papers/backbones.html#litehrnet-cvpr-2021) (CVPR'2021)
+- [x] [ViPNAS](https://mmpose.readthedocs.io/en/latest/papers/backbones.html#vipnas-cvpr-2021) (CVPR'2021)
+- [x] [DEKR](https://mmpose.readthedocs.io/zh_CN/latest/papers/algorithms.html#dekr-cvpr-2021) (CVPR'2021)
+- [x] [CID](https://mmpose.readthedocs.io/zh_CN/latest/papers/algorithms.html#cid-cvpr-2022) (CVPR'2022)
+- [x] [ViTPose](https://mmpose.readthedocs.io/en/latest/papers/algorithms.html#vitpose-neurips-2022) (Neurips'2022)
 
 </details>
```
Lines changed: 169 additions & 0 deletions (new config file; its ViT-Base settings, embed_dims=768 with 12 layers and 12 heads, match the vitpose_base_coco_256x192.py config listed in the results table below)
@@ -0,0 +1,169 @@
```python
_base_ = [
    '../../../../_base_/default_runtime.py',
    '../../../../_base_/datasets/coco.py'
]
evaluation = dict(interval=10, metric='mAP', save_best='AP')

optimizer = dict(
    type='AdamW',
    lr=5e-4,
    betas=(0.9, 0.999),
    weight_decay=0.1,
    constructor='LayerDecayOptimizerConstructor',
    paramwise_cfg=dict(
        num_layers=12,
        layer_decay_rate=0.75,
    ))

optimizer_config = dict(grad_clip=dict(max_norm=1., norm_type=2))

# learning policy
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[170, 200])
total_epochs = 210
target_type = 'GaussianHeatmap'
channel_cfg = dict(
    num_output_channels=17,
    dataset_joints=17,
    dataset_channel=[
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
    ],
    inference_channel=[
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
    ])

# model settings
model = dict(
    type='TopDown',
    pretrained=None,
    backbone=dict(
        type='VisionTransformer',
        img_size=(256, 192),
        patch_size=16,
        embed_dims=768,
        # Optional in train
        padding=2,
        num_layers=12,
        num_heads=12,
        mlp_ratio=4,
        drop_path_rate=0.3,
        final_norm=True,
    ),
    keypoint_head=dict(
        type='TopdownHeatmapSimpleHead',
        in_channels=768,
        num_deconv_layers=2,
        num_deconv_filters=(256, 256),
        num_deconv_kernels=(4, 4),
        extra=dict(final_conv_kernel=1, ),
        out_channels=channel_cfg['num_output_channels'],
        loss_keypoint=dict(type='JointsMSELoss', use_target_weight=True)),
    train_cfg=dict(),
    test_cfg=dict(
        flip_test=True,
        post_process='default',
        shift_heatmap=False,
        target_type=target_type,
        modulate_kernel=11,
        use_udp=True))

data_cfg = dict(
    image_size=[192, 256],
    heatmap_size=[48, 64],
    num_output_channels=channel_cfg['num_output_channels'],
    num_joints=channel_cfg['dataset_joints'],
    dataset_channel=channel_cfg['dataset_channel'],
    inference_channel=channel_cfg['inference_channel'],
    soft_nms=False,
    nms_thr=1.0,
    oks_thr=0.9,
    vis_thr=0.2,
    use_gt_bbox=False,
    det_bbox_thr=0.0,
    bbox_file='data/coco/person_detection_results/'
    'COCO_val2017_detections_AP_H_56_person.json',
)

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='TopDownGetBboxCenterScale', padding=1.25),
    dict(type='TopDownRandomShiftBboxCenter', shift_factor=0.16, prob=0.3),
    dict(type='TopDownRandomFlip', flip_prob=0.5),
    dict(
        type='TopDownHalfBodyTransform',
        num_joints_half_body=8,
        prob_half_body=0.3),
    dict(
        type='TopDownGetRandomScaleRotation', rot_factor=40, scale_factor=0.5),
    dict(type='TopDownAffine', use_udp=True),
    dict(type='ToTensor'),
    dict(
        type='NormalizeTensor',
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]),
    dict(
        type='TopDownGenerateTarget',
        sigma=2,
        encoding='UDP',
        target_type=target_type),
    dict(
        type='Collect',
        keys=['img', 'target', 'target_weight'],
        meta_keys=[
            'image_file', 'joints_3d', 'joints_3d_visible', 'center', 'scale',
            'rotation', 'bbox_score', 'flip_pairs'
        ]),
]

val_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='TopDownGetBboxCenterScale', padding=1.25),
    dict(type='TopDownAffine', use_udp=True),
    dict(type='ToTensor'),
    dict(
        type='NormalizeTensor',
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]),
    dict(
        type='Collect',
        keys=['img'],
        meta_keys=[
            'image_file', 'center', 'scale', 'rotation', 'bbox_score',
            'flip_pairs'
        ]),
]

test_pipeline = val_pipeline

data_root = 'data/coco'
data = dict(
    samples_per_gpu=64,
    workers_per_gpu=4,
    val_dataloader=dict(samples_per_gpu=32),
    test_dataloader=dict(samples_per_gpu=32),
    train=dict(
        type='TopDownCocoDataset',
        ann_file=f'{data_root}/annotations/person_keypoints_train2017.json',
        img_prefix=f'{data_root}/train2017/',
        data_cfg=data_cfg,
        pipeline=train_pipeline,
        dataset_info={{_base_.dataset_info}}),
    val=dict(
        type='TopDownCocoDataset',
        ann_file=f'{data_root}/annotations/person_keypoints_val2017.json',
        img_prefix=f'{data_root}/val2017/',
        data_cfg=data_cfg,
        pipeline=val_pipeline,
        dataset_info={{_base_.dataset_info}}),
    test=dict(
        type='TopDownCocoDataset',
        ann_file=f'{data_root}/annotations/person_keypoints_val2017.json',
        img_prefix=f'{data_root}/val2017/',
        data_cfg=data_cfg,
        pipeline=test_pipeline,
        dataset_info={{_base_.dataset_info}}),
)
```
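The `constructor='LayerDecayOptimizerConstructor'` entry above assigns earlier transformer blocks a smaller learning rate than later ones. As a rough illustration (not the constructor's actual implementation, whose parameter grouping lives inside MMPose), the per-layer scale is commonly computed as `layer_decay_rate ** (num_layers + 1 - layer_id)`:

```python
# Illustrative sketch only: how layer-wise LR decay is typically computed for a
# 12-layer ViT backbone. The real parameter grouping is handled internally by
# LayerDecayOptimizerConstructor and may differ in detail.
base_lr = 5e-4
num_layers = 12
layer_decay_rate = 0.75

# layer 0 = patch embedding, layers 1..12 = transformer blocks,
# layer 13 = keypoint head (scale 1.0, i.e. no decay).
for layer_id in range(num_layers + 2):
    scale = layer_decay_rate ** (num_layers + 1 - layer_id)
    print(f'layer {layer_id:2d}: lr = {base_lr * scale:.2e}')
```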
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
<!-- [ALGORITHM] -->

<details>
<summary align="right"><a href="https://arxiv.org/abs/2204.12484">ViTPose (Neurips'2022)</a></summary>

```bibtex
@inproceedings{
  xu2022vitpose,
  title={Vi{TP}ose: Simple Vision Transformer Baselines for Human Pose Estimation},
  author={Yufei Xu and Jing Zhang and Qiming Zhang and Dacheng Tao},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022},
}
```

</details>

<!-- [DATASET] -->

<details>
<summary align="right"><a href="https://link.springer.com/chapter/10.1007/978-3-319-10602-1_48">COCO (ECCV'2014)</a></summary>

```bibtex
@inproceedings{lin2014microsoft,
  title={Microsoft coco: Common objects in context},
  author={Lin, Tsung-Yi and Maire, Michael and Belongie, Serge and Hays, James and Perona, Pietro and Ramanan, Deva and Doll{\'a}r, Piotr and Zitnick, C Lawrence},
  booktitle={European conference on computer vision},
  pages={740--755},
  year={2014},
  organization={Springer}
}
```

</details>
The backbone models are pre-trained using MAE. The small-size pre-trained backbone can be found at [link](https://github.com/ViTAE-Transformer/ViTPose). The base, large, and huge pre-trained backbones can be found at [link](https://github.com/facebookresearch/mae).

Results on COCO val2017 with a detector having human AP of 56.4 on the COCO val2017 dataset
| Arch | Input Size | AP | AP<sup>50</sup> | AP<sup>75</sup> | AR | AR<sup>50</sup> | ckpt | log |
| :--- | :--------: | :---: | :-------------: | :-------------: | :---: | :-------------: | :--------: | :-------: |
| [ViTPose-S](/configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/vitpose_small_coco_256x192.py) | 256x192 | 0.738 | 0.903 | 0.813 | 0.792 | 0.940 | [ckpt](<>) | [log](<>) |
| [ViTPose-B](/configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/vitpose_base_coco_256x192.py) | 256x192 | 0.758 | 0.907 | 0.832 | 0.811 | 0.946 | [ckpt](<>) | [log](<>) |
| [ViTPose-L](/configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/vitpose_large_coco_256x192.py) | 256x192 | 0.783 | 0.914 | 0.852 | 0.835 | 0.953 | [ckpt](<>) | [log](<>) |
| [ViTPose-H](/configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/vitpose_huge_coco_256x192.py) | 256x192 | 0.791 | 0.917 | 0.857 | 0.841 | 0.954 | [ckpt](<>) | [log](<>) |
| [ViTPose-Simple-S](/configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/vitpose_simple_small_coco_256x192.py) | 256x192 | 0.735 | 0.900 | 0.811 | 0.789 | 0.940 | [ckpt](<>) | [log](<>) |
| [ViTPose-Simple-B](/configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/vitpose_simple_base_coco_256x192.py) | 256x192 | 0.755 | 0.906 | 0.829 | 0.809 | 0.946 | [ckpt](<>) | [log](<>) |
| [ViTPose-Simple-L](/configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/vitpose_simple_large_coco_256x192.py) | 256x192 | 0.782 | 0.914 | 0.853 | 0.834 | 0.953 | [ckpt](<>) | [log](<>) |
| [ViTPose-Simple-H](/configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/vitpose_simple_huge_coco_256x192.py) | 256x192 | 0.789 | 0.916 | 0.856 | 0.840 | 0.954 | [ckpt](<>) | [log](<>) |
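For a quick sanity check, the configs in the table can be run with the MMPose 0.x top-down inference API. A minimal sketch, assuming a ViTPose checkpoint has been downloaded locally (the checkpoint path, image name, and bounding box below are placeholders, not part of this commit):

```python
# Minimal sketch (MMPose 0.x API); paths and the person box are placeholders.
from mmpose.apis import (inference_top_down_pose_model, init_pose_model,
                         vis_pose_result)

config_file = ('configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/'
               'vitpose_base_coco_256x192.py')
checkpoint_file = 'checkpoints/vitpose_base_coco_256x192.pth'  # placeholder

model = init_pose_model(config_file, checkpoint_file, device='cuda:0')

# One hard-coded person box in xywh format; in practice these boxes would come
# from a person detector (e.g. the AP 56.4 detection results used above).
person_results = [{'bbox': [50, 50, 200, 400]}]

pose_results, _ = inference_top_down_pose_model(
    model,
    'demo.jpg',
    person_results,
    format='xywh',
    dataset='TopDownCocoDataset')

vis_pose_result(
    model,
    'demo.jpg',
    pose_results,
    dataset='TopDownCocoDataset',
    out_file='vis_demo.jpg')
```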
