
Commit 2036075

Distrib (#573)
* [WIP] Added cifar10 distributed example
* [WIP] Metric with all reduce decorator and tests
* [WIP] Added tests for accumulation metric
* [WIP] Updated with reinit_is_reduced
* [WIP] Distrib adaptation for other metrics
* [WIP] Warnings for EpochMetric and Precision/Recall when distrib
* Updated metrics and tests to run on distributed configuration
  - Tested on 2 GPUs, single node
  - Added cmd in .travis.yml to indicate how to test locally
  - Updated travis to run tests in 4 processes
* Minor fixes and cosmetics
* Fixed bugs and improved contrib/cifar10 example
* Updated docs
* Fixes issue #543 (#572): the previous confusion-matrix implementation failed when the target
  contained non-contiguous indices. The new implementation is adapted from torchvision's
  https://github.com/pytorch/vision/blob/master/references/segmentation/utils.py#L75-L117.
  This commit also removes the case of targets shaped (batch_size, num_categories, ...) where
  num_categories excludes the background class. Confusion-matrix computation is possible almost
  similarly for (batch_size, ...), but when the target is all zeros (0, ..., 0), i.e. no classes except
  the background class, the confusion matrix does not count any true/false predictions.
* Update confusion_matrix.py
* Update metrics.rst
* Updated docs and set device as "cuda" in distributed instead of raising an error
* [WIP] Fix missing _is_reduced in precision/recall, with tests
* Updated other tests
* Added mlflow logger (#558)
  - Added mlflow logger without tests
  - Added mlflow tests, updated mlflow logger code and other tests
  - Updated docs and added mlflow in travis
  - Added tests for mlflow OptimizerParamsHandler; additionally added OptimizerParamsHandler for plx, with tests
* Update to PyTorch v1.2.0 (#580)
  - Update .travis.yml
  - Fixed tests and improved travis
* Fix SSL problem of failing travis (#581)
  - Fixes SSL problem to download model weights
  - Fixed travis for deploy and nightly
* Fixes #583 (#584)
* Fixes docs build warnings (#585)
* Return removable handle from Engine.add_event_handler() (#588)
  - Add feature tests for engine.add_event_handler returning removable event handles
  - Return RemovableEventHandle from Engine.add_event_handler
  - Fix up removable event handle test in Python 2.7: explicitly trigger gc, allowing cycle detection
    between engine and state, in the removable-handle weakref test (Python 2.7 cycle detection
    appears to be less aggressive than Python 3+)
  - Add removable event handler docs: autodoc configuration for RemovableEventHandle; expand the
    "concepts" documentation with an event-remove example following the event-add example
  - Update concepts.rst
* Updated travis and renamed tbptt test gpu -> cuda
1 parent 4d13db2 commit 2036075

56 files changed: 3,185 additions & 505 deletions


.travis.yml

Lines changed: 18 additions & 9 deletions
@@ -5,8 +5,8 @@ python:
   - "3.6"
 
 env:
-  - PYTORCH_PACKAGE=pytorch-cpu
-  - PYTORCH_PACKAGE=pytorch-nightly-cpu
+  - PYTORCH_CHANNEL=pytorch
+  - PYTORCH_CHANNEL=pytorch-nightly
 
 stages:
   - Lint check
@@ -25,25 +25,27 @@ before_install: &before_install
   - conda update -q conda
   # Useful for debugging any issues with conda
   - conda info -a
-  - conda create -q -n test-environment -c pytorch python=$TRAVIS_PYTHON_VERSION $PYTORCH_PACKAGE
+  - conda create -q -n test-environment pytorch cpuonly torchvision python=$TRAVIS_PYTHON_VERSION -c $PYTORCH_CHANNEL
   - source activate test-environment
   - if [[ $TRAVIS_PYTHON_VERSION == 2.7 ]]; then pip install enum34; fi
   # Test contrib dependencies
-  - pip install tqdm scikit-learn tensorboardX visdom polyaxon-client
+  - pip install tqdm scikit-learn tensorboardX visdom polyaxon-client mlflow
   # Futures should be already installed via visdom -> tornado -> futures
   # Let's reinstall it anyway to be sure
   - if [[ $TRAVIS_PYTHON_VERSION == 2.7 ]]; then pip install futures; fi
 
 install:
   - python setup.py install
-  - pip install numpy mock pytest codecov pytest-cov
+  - pip install numpy mock pytest codecov pytest-cov pytest-xdist
   # Examples dependencies
   - pip install matplotlib pandas
-  - conda install torchvision-cpu -c pytorch
   - pip install gym==0.10.11
 
 script:
-  - py.test --cov ignite --cov-report term-missing
+  - CUDA_VISIBLE_DEVICES="" py.test --tx 4*popen//python=python$TRAVIS_PYTHON_VERSION --cov ignite --cov-report term-missing -vvv tests/
+  # Run test on cuda device
+  # As no GPUs on travis -> all tests will be skipped
+  - CUDA_VISIBLE_DEVICES=0 py.test --cov ignite --cov-append --cov-report term-missing -vvv tests/ -k "on_cuda"
 
   # Smoke tests for the examples
   # Mnist
@@ -69,8 +71,15 @@ script:
 
   #fast-neural-style
   #train
+  - if [[ $TRAVIS_PYTHON_VERSION == 2.7 ]]; then mkdir -p /home/travis/.cache/torch/checkpoints/ && wget "https://download.pytorch.org/models/vgg16-397923af.pth" -O/home/travis/.cache/torch/checkpoints/vgg16-397923af.pth; fi
   - python examples/fast_neural_style/neural_style.py train --epochs 1 --cuda 0 --dataset test --dataroot . --image_size 32 --style_image examples/fast_neural_style/images/style_images/mosaic.jpg --style_size 32
 
+  # tests for distributed ops
+  # As no GPUs on travis -> all tests will be skipped
+  # 2 is the number of processes <-> number of available GPUs
+  - export WORLD_SIZE=2
+  - py.test --cov ignite --cov-append --cov-report term-missing --dist=each --tx $WORLD_SIZE*popen//python=python$TRAVIS_PYTHON_VERSION tests -m distributed -vvv
+
 after_success:
   - codecov
 
@@ -114,7 +123,7 @@ jobs:
   - stage: Deploy
     python: "3.6"
     env:
-      - PYTORCH_PACKAGE=pytorch-cpu
+      - PYTORCH_CHANNEL=pytorch
     if: tag IS present
 
     # Use previously defined before_install
@@ -168,7 +177,7 @@ jobs:
   - stage: Nightly
     python: "3.6"
     env:
-      - PYTORCH_PACKAGE=pytorch-nightly-cpu
+      - PYTORCH_CHANNEL=pytorch-nightly
     if: branch = nightly
     # Use previously defined before_install
     before_install: *before_install
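The distributed stanza added above drives pytest-xdist: `--dist=each --tx N*popen` starts N local worker processes, one per simulated rank. As a sketch of how to reproduce this locally (assuming a 2-GPU machine with `pytest-xdist` installed; paths and markers taken from the diff above, the Python version is illustrative):

```shell
# Sketch: run the distributed metric tests locally (assumes 2 GPUs and pytest-xdist).
# WORLD_SIZE <-> number of processes <-> number of available GPUs.
export WORLD_SIZE=2
py.test --dist=each --tx $WORLD_SIZE*popen//python=python3.6 tests -m distributed -vvv
```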

docs/source/concepts.rst

Lines changed: 27 additions & 0 deletions
@@ -88,6 +88,33 @@ Attaching an event handler is simple using method :meth:`~ignite.engine.Engine.a
 
     trainer.add_event_handler(Events.COMPLETED, on_training_ended, mydata)
 
+Event handlers can be detached via :meth:`~ignite.engine.Engine.remove_event_handler` or via the :class:`~ignite.engine.RemovableEventHandler`
+reference returned by :meth:`~ignite.engine.Engine.add_event_handler`. This can be used to reuse a configured engine for multiple loops:
+
+.. code-block:: python
+
+    model = ...
+    train_loader, validation_loader, test_loader = ...
+
+    trainer = create_supervised_trainer(model, optimizer, loss)
+    evaluator = create_supervised_evaluator(model, metrics={'acc': Accuracy()})
+
+    def log_metrics(engine, title):
+        print("Epoch: {} - {} accuracy: {:.2f}"
+              .format(trainer.state.epoch, title, engine.state.metrics['acc']))
+
+    @trainer.on(Events.EPOCH_COMPLETED)
+    def evaluate(trainer):
+        with evaluator.add_event_handler(Events.COMPLETED, log_metrics, "train"):
+            evaluator.run(train_loader)
+
+        with evaluator.add_event_handler(Events.COMPLETED, log_metrics, "validation"):
+            evaluator.run(validation_loader)
+
+        with evaluator.add_event_handler(Events.COMPLETED, log_metrics, "test"):
+            evaluator.run(test_loader)
+
+    trainer.run(train_loader, max_epochs=100)
 
 .. Note ::
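The documentation added above relies on `add_event_handler` returning an object that can both remove the handler and act as a context manager. The mechanism can be sketched in plain Python, independent of ignite (this is a toy illustration, not ignite's actual implementation; `MiniEngine` and `RemovableHandle` are invented names):

```python
class RemovableHandle:
    """Sketch of a removable handler reference: remove() detaches the
    handler, and using it as a context manager detaches on exit."""

    def __init__(self, engine, event, handler):
        self._engine = engine
        self._event = event
        self._handler = handler

    def remove(self):
        self._engine.handlers[self._event].remove(self._handler)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.remove()


class MiniEngine:
    """Toy engine: registers handlers per event and can fire them."""

    def __init__(self):
        self.handlers = {}

    def add_event_handler(self, event, handler):
        self.handlers.setdefault(event, []).append(handler)
        return RemovableHandle(self, event, handler)

    def fire(self, event):
        for h in list(self.handlers.get(event, [])):
            h()


engine = MiniEngine()
calls = []

with engine.add_event_handler("completed", lambda: calls.append("train")):
    engine.fire("completed")   # handler attached -> records "train"

engine.fire("completed")       # handler was removed on exit -> records nothing
print(calls)                   # ['train']
```

The `with` form mirrors the evaluator example in the diff: the same engine is reused for several loops, each with its own temporary logging handler.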

docs/source/contrib/handlers.rst

Lines changed: 16 additions & 0 deletions
@@ -23,13 +23,29 @@ tensorboard_logger
     :members:
     :inherited-members:
 
+See `tensorboardX mnist example <https://github.com/pytorch/ignite/blob/master/examples/contrib/mnist/mnist_with_tensorboard_logger.py>`_
+and `CycleGAN and EfficientNet notebooks <https://github.com/pytorch/ignite/tree/master/examples/notebooks>`_ for detailed usage.
+
+
 visdom_logger
 -------------
 
 .. automodule:: ignite.contrib.handlers.visdom_logger
     :members:
     :inherited-members:
 
+See `visdom mnist example <https://github.com/pytorch/ignite/blob/master/examples/contrib/mnist/mnist_with_visdom_logger.py>`_
+for detailed usage.
+
+
+mlflow_logger
+-------------
+
+.. automodule:: ignite.contrib.handlers.mlflow_logger
+    :members:
+    :inherited-members:
+
+
 tqdm_logger
 -----------
 

docs/source/engine.rst

Lines changed: 4 additions & 0 deletions
@@ -14,3 +14,7 @@ ignite.engine
     :undoc-members:
 
 .. autoclass:: State
+
+.. autoclass:: RemovableEventHandler
+    :members:
+    :undoc-members:

docs/source/examples.rst

Lines changed: 2 additions & 0 deletions
@@ -8,6 +8,8 @@ to display how it helps to write compact and full-featured training loops in a f
 - `DCGAN <https://github.com/pytorch/ignite/tree/master/examples/gan>`_
 - `Reinforcement Learning <https://github.com/pytorch/ignite/tree/master/examples/reinforcement_learning>`_
 - `Fast Neural Style <https://github.com/pytorch/ignite/tree/master/examples/fast_neural_style>`_
+- `Distributed Cifar10 <https://github.com/pytorch/ignite/tree/master/examples/contrib/cifar10>`_
+
 
 Notebooks:

docs/source/faq.rst

Lines changed: 1 addition & 3 deletions
@@ -100,16 +100,14 @@ do this, the most simple is the following:
     def update_fn(engine, batch):
         model.train()
 
-        if engine.state.iteration % accumulation_steps == 0:
-            optimizer.zero_grad()
-
         x, y = prepare_batch(batch, device=device, non_blocking=non_blocking)
         y_pred = model(x)
         loss = criterion(y_pred, y) / accumulation_steps
         loss.backward()
 
         if engine.state.iteration % accumulation_steps == 0:
             optimizer.step()
+            optimizer.zero_grad()
 
         return loss.item()
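The fix above moves `optimizer.zero_grad()` to just after `optimizer.step()`, so gradients accumulated across `accumulation_steps` iterations are never cleared mid-accumulation. The corrected schedule can be checked with a framework-free sketch (plain Python counters stand in for gradients and the optimizer; all names here are illustrative):

```python
accumulation_steps = 4

class ToyOptimizer:
    """Stands in for a real optimizer: tracks an accumulated 'gradient'
    and how many times step() was applied."""
    def __init__(self):
        self.grad = 0.0
        self.steps = 0

    def step(self):
        self.steps += 1

    def zero_grad(self):
        self.grad = 0.0

optimizer = ToyOptimizer()
grads_at_step = []

for iteration in range(1, 9):  # engine.state.iteration is 1-based
    # loss / accumulation_steps, then backward() -> gradient accumulates
    optimizer.grad += 1.0 / accumulation_steps

    if iteration % accumulation_steps == 0:
        grads_at_step.append(optimizer.grad)
        optimizer.step()
        optimizer.zero_grad()  # reset only after stepping

print(optimizer.steps)    # 2 steps over 8 iterations
print(grads_at_step)      # [1.0, 1.0] -- the full accumulated gradient each time
```

With the original (pre-fix) ordering, `zero_grad()` ran at the start of the same iterations that call `step()`, wiping the gradients accumulated over the preceding iterations before they were applied.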

docs/source/metrics.rst

Lines changed: 150 additions & 33 deletions
@@ -7,65 +7,182 @@ fashion without having to store the entire output history of a model.
 In practice a user needs to attach the metric instance to an engine. The metric
 value is then computed using the output of the engine's `process_function`:
 
-.. code-block:: python
+.. code-block:: python
 
-    def process_function(engine, batch):
-        # ...
-        return y_pred, y
+    def process_function(engine, batch):
+        # ...
+        return y_pred, y
 
-    engine = Engine(process_function)
-    metric = Accuracy()
-    metric.attach(engine, "accuracy")
+    engine = Engine(process_function)
+    metric = Accuracy()
+    metric.attach(engine, "accuracy")
 
 If the engine's output is not in the format `y_pred, y`, the user can
 use the `output_transform` argument to transform it:
 
+.. code-block:: python
+
+    def process_function(engine, batch):
+        # ...
+        return {'y_pred': y_pred, 'y_true': y, ...}
+
+    engine = Engine(process_function)
+
+    def output_transform(output):
+        # `output` variable is returned by above `process_function`
+        y_pred = output['y_pred']
+        y = output['y_true']
+        return y_pred, y  # output format is according to `Accuracy` docs
+
+    metric = Accuracy(output_transform=output_transform)
+    metric.attach(engine, "accuracy")
+
+
+.. Note ::
+
+   Most of the implemented metrics are adapted to distributed computations and reduce their internal states across the GPUs
+   before computing the metric value. This can be helpful to run the evaluation on multiple nodes/GPU instances with a
+   distributed data sampler. The following code snippet shows in detail how to adapt metrics:
+
 .. code-block:: python
 
-    def process_function(engine, batch):
-        # ...
-        return {'y_pred': y_pred, 'y_true': y, ...}
+    device = "cuda:{}".format(local_rank)
+    model = torch.nn.parallel.DistributedDataParallel(model,
+                                                      device_ids=[local_rank, ],
+                                                      output_device=local_rank)
+    test_sampler = DistributedSampler(test_dataset)
+    test_loader = DataLoader(test_dataset, batch_size=batch_size, sampler=test_sampler,
+                             num_workers=num_workers, pin_memory=True)
 
-    engine = Engine(process_function)
+    evaluator = create_supervised_evaluator(model, metrics={'accuracy': Accuracy(device=device)}, device=device)
 
-    def output_transform(output):
-        # `output` variable is returned by above `process_function`
-        y_pred = output['y_pred']
-        y = output['y_true']
-        return y_pred, y  # output format is according to `Accuracy` docs
 
-    metric = Accuracy(output_transform=output_transform)
-    metric.attach(engine, "accuracy")
+Metric arithmetics
+------------------
 
 Metrics could be combined together to form new metrics. This could be done through arithmetics, such
 as ``metric1 + metric2``, use PyTorch operators, such as ``(metric1 + metric2).pow(2).mean()``,
 or use a lambda function, such as ``MetricsLambda(lambda a, b: torch.mean(a + b), metric1, metric2)``.
 
 For example:
 
-.. code-block:: python
+.. code-block:: python
 
-    precision = Precision(average=False)
-    recall = Recall(average=False)
-    F1 = (precision * recall * 2 / (precision + recall)).mean()
+    precision = Precision(average=False)
+    recall = Recall(average=False)
+    F1 = (precision * recall * 2 / (precision + recall)).mean()
 
-.. note:: This example computes the mean of F1 across classes. To combine
-    precision and recall to get F1 or other F metrics, we have to be careful
-    that `average=False`, i.e. to use the unaveraged precision and recall,
-    otherwise we will not be computing F-beta metrics.
+.. note:: This example computes the mean of F1 across classes. To combine
+   precision and recall to get F1 or other F metrics, we have to be careful
+   that `average=False`, i.e. to use the unaveraged precision and recall,
+   otherwise we will not be computing F-beta metrics.
 
 Metrics also support indexing operation (if metric's result is a vector/matrix/tensor). For example, this can be useful to compute mean metric (e.g. precision, recall or IoU) ignoring the background:
 
-.. code-block:: python
+.. code-block:: python
+
+    cm = ConfusionMatrix(num_classes=10)
+    iou_metric = IoU(cm)
+    iou_no_bg_metric = iou_metric[:9]  # We assume that the background index is 9
+    mean_iou_no_bg_metric = iou_no_bg_metric.mean()
+    # mean_iou_no_bg_metric.compute() -> tensor(0.12345)
+
+How to create a custom metric
+-----------------------------
+
+To create a custom metric one needs to create a new class inheriting from :class:`~ignite.metrics.Metric` and override
+three methods:
+
+- `reset()` : resets internal variables and accumulators
+- `update(output)` : updates internal variables and accumulators with provided batch output `(y_pred, y)`
+- `compute()` : computes the custom metric and returns the result
+
+For example, we would like to implement for illustration purposes a multi-class accuracy metric with some
+specific condition (e.g. ignore user-defined classes):
+
+.. code-block:: python
+
+    from ignite.metrics import Metric
+    from ignite.exceptions import NotComputableError
+
+    # These decorators help with distributed settings
+    from ignite.metrics.metric import sync_all_reduce, reinit_is_reduced
+
+
+    class CustomAccuracy(Metric):
+
+        def __init__(self, ignored_class, output_transform=lambda x: x, device=None):
+            self.ignored_class = ignored_class
+            self._num_correct = None
+            self._num_examples = None
+            super(CustomAccuracy, self).__init__(output_transform=output_transform, device=device)
+
+        @reinit_is_reduced
+        def reset(self):
+            self._num_correct = 0
+            self._num_examples = 0
+            super(CustomAccuracy, self).reset()
+
+        @reinit_is_reduced
+        def update(self, output):
+            y_pred, y = output
+
+            indices = torch.argmax(y_pred, dim=1)
+
+            mask = (y != self.ignored_class)
+            mask &= (indices != self.ignored_class)
+            y = y[mask]
+            indices = indices[mask]
+            correct = torch.eq(indices, y).view(-1)
+
+            self._num_correct += torch.sum(correct).item()
+            self._num_examples += correct.shape[0]
+
+        @sync_all_reduce("_num_examples", "_num_correct")
+        def compute(self):
+            if self._num_examples == 0:
+                raise NotComputableError('CustomAccuracy must have at least one example before it can be computed.')
+            return self._num_correct / self._num_examples
+
+
+We imported the necessary classes :class:`~ignite.metrics.Metric`, :class:`~ignite.exceptions.NotComputableError` and
+decorators to adapt the metric for the distributed setting. In the `reset` method, we reset internal variables `_num_correct`
+and `_num_examples` which are used to compute the custom metric. In the `update` method we define how to update
+the internal variables. And finally in the `compute` method, we compute the metric value.
+
+We can check this implementation in a simple case:
+
+.. code-block:: python
+
+    import torch
+    torch.manual_seed(8)
+
+    m = CustomAccuracy(ignored_class=3)
+
+    batch_size = 4
+    num_classes = 5
+
+    y_pred = torch.rand(batch_size, num_classes)
+    y = torch.randint(0, num_classes, size=(batch_size, ))
+
+    m.update((y_pred, y))
+    res = m.compute()
+
+    print(y, torch.argmax(y_pred, dim=1))
+    # Out: tensor([2, 2, 2, 3]) tensor([2, 1, 0, 0])
+
+    print(m._num_correct, m._num_examples, res)
+    # Out: 1 3 0.3333333333333333
+
+
+Metrics and distributed computations
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-    cm = ConfusionMatrix(num_classes=10)
-    iou_metric = IoU(cm)
-    iou_no_bg_metric = iou_metric[:9]  # We assume that the background index is 9
-    mean_iou_no_bg_metric = iou_no_bg_metric.mean()
-    # mean_iou_no_bg_metric.compute() -> tensor(0.12345)
+In the above example, the `CustomAccuracy` constructor has a `device` argument and the `reset`, `update`, `compute` methods are decorated with `reinit_is_reduced`, `sync_all_reduce`. The purpose of these features is to adapt metrics to distributed computations on CUDA devices, assuming the backend supports the `"all_reduce" operation <https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_reduce>`_. The user can specify the device (by default, `cuda`) at the metric's initialization. This device can be used to store internal variables and to collect results from all participating devices. More precisely, in the above example we added `@sync_all_reduce("_num_examples", "_num_correct")` over the `compute` method. This means that when `compute` is called, the metric's internal variables `self._num_examples` and `self._num_correct` are summed up over all participating devices. Therefore, once collected, these internal variables can be used to compute the final metric value.
 
 
-Complete list of metrics:
+Complete list of metrics
+------------------------
 
 - :class:`~ignite.metrics.Accuracy`
 - :class:`~ignite.metrics.Average`
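The reset/update/compute contract documented in the new metrics.rst section is independent of torch. As a dependency-free sketch of the same accumulation pattern (an accuracy that ignores one class; `SimpleIgnoringAccuracy` is an invented illustrative name, not ignite's API):

```python
class SimpleIgnoringAccuracy:
    """Sketch of the Metric contract: reset() clears accumulators,
    update() folds in one batch, compute() returns the final value."""

    def __init__(self, ignored_class):
        self.ignored_class = ignored_class
        self.reset()

    def reset(self):
        self._num_correct = 0
        self._num_examples = 0

    def update(self, output):
        y_pred, y = output  # lists of predicted / true class indices
        for p, t in zip(y_pred, y):
            # mirror the masking in CustomAccuracy: skip pairs touching the ignored class
            if p == self.ignored_class or t == self.ignored_class:
                continue
            self._num_correct += int(p == t)
            self._num_examples += 1

    def compute(self):
        if self._num_examples == 0:
            raise ValueError("at least one example is required")
        return self._num_correct / self._num_examples


m = SimpleIgnoringAccuracy(ignored_class=3)
# same predictions / targets as the torch example above
m.update(([2, 1, 0, 0], [2, 2, 2, 3]))
print(m._num_correct, m._num_examples, m.compute())  # 1 3 0.3333333333333333
```

Only the (0, 3) pair is masked out, leaving one correct prediction out of three counted examples, which matches the documented output of `CustomAccuracy`.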
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+output
+cifar10

0 commit comments