doc(katib): update push-based metrics collector. #3844
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -6,16 +6,23 @@ weight = 40 | |
|
|
||
| This guide describes how Katib metrics collector works. | ||
|
|
||
| ## Metrics Collector | ||
| ## Overview | ||
|
|
||
| There are two ways to collect metrics: | ||
|
|
||
| 1. Pull-based: collects the metrics using a _sidecar_ container. A sidecar is a utility container that supports | ||
| the main container in the Kubernetes Pod. | ||
|
|
||
| 2. Push-based: users push the metrics directly to Katib DB in the training scripts. | ||
|
|
||
| In the `metricsCollectorSpec` section of the Experiment YAML configuration file, you can | ||
| define how Katib should collect the metrics from each Trial, such as the accuracy and loss metrics. | ||
|
|
||
| Your training code can record the metrics into `StdOut` or into arbitrary output files. Katib | ||
| collects the metrics using a _sidecar_ container. A sidecar is a utility container that supports | ||
| the main container in the Kubernetes Pod. | ||
| ## Pull-based Metrics Collector | ||
|
|
||
| To define the metrics collector for your Experiment: | ||
| Your training code can record the metrics into `StdOut` or into arbitrary output files. | ||
|
|
||
| To define the pull-based metrics collector for your Experiment: | ||
|
|
||
| 1. Specify the collector type in the `.collector.kind` field. | ||
| Katib's metrics collector supports the following collector types: | ||
|
|
@@ -29,7 +36,7 @@ To define the metrics collector for your Experiment: | |
| metrics must be line-separated by `epoch` or `step` as follows, and the key for timestamp must | ||
| be `timestamp`: | ||
|
|
||
| ``` | ||
| ```json | ||
| {"epoch": 0, "foo": "bar", "fizz": "buzz", "timestamp": "2021-12-02T14:27:51"} | ||
| {"epoch": 1, "foo": "bar", "fizz": "buzz", "timestamp": "2021-12-02T14:27:52"} | ||
| {"epoch": 2, "foo": "bar", "fizz": "buzz", "timestamp": "2021-12-02T14:27:53"} | ||
|
|
@@ -51,9 +58,6 @@ To define the metrics collector for your Experiment: | |
| in the `.collector.customCollector` field. Check the | ||
| [custom metrics collector example](https://github.com/kubeflow/katib/blob/ea46a7f2b73b2d316b6b7619f99eb440ede1909b/examples/v1beta1/metrics-collector/custom-metrics-collector.yaml#L14-L36). | ||
|
|
||
| - `None`: Specify this value if you don't need to use Katib's metrics collector. For example, | ||
| your training code may handle the persistent storage of its own metrics. | ||
|
|
||
| 2. Write code in your training container to print the metrics, or save them to a file, in the format | ||
| specified in the `.source.filter.metricsFormat` field. The default metrics format value is: | ||
|
|
||
|
|
@@ -79,3 +83,46 @@ To define the metrics collector for your Experiment: | |
| recall=0.55 | ||
| precision=.5 | ||
| ``` | ||
|
|
||
| ## Push-based Metrics Collector | ||
|
|
||
| Your training code needs to call the [`report_metrics()`](https://github.com/kubeflow/katib/blob/e251a07cb9491e2d892db306d925dddf51cb0930/sdk/python/v1beta1/kubeflow/katib/api/report_metrics.py#L26) function in the Python SDK to record metrics. | ||
| The `report_metrics()` function parses the metrics in the `metrics` field into a gRPC request, automatically adds the current timestamp, and sends the request to the Katib DB Manager. | ||
|
|
||
| Before that, install the `kubeflow-katib` package in your training container. | ||
|
|
||
| To define the push-based metrics collector for your Experiment, you have two options: | ||
|
|
||
| - YAML File | ||
|
|
||
| 1. Specify the collector type `Push` in the `.collector.kind` field. | ||
|
|
||
| 2. Write code in your training container to call `report_metrics()` to report metrics. | ||
|
|
||
| - [`tune`](https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/api/katib_client.py#L166) function | ||
|
|
||
| Use the `tune` function and specify the `metrics_collector_config` field. You can refer to the following example: | ||
|
Member
You might need to explain how
Member
Author
SGTM. I'll add the explanation to |
||
|
|
||
| ``` | ||
| import kubeflow.katib as katib | ||
|
Member
Let's use simple example here and remove all unnecessary parameters and function calls.
Member
Author
@andreyvelich Sorry, I may not fully understand what you mean "unnecessary parameters and function calls". I copied the example in the get-started chapter and replaced
Member
I think, we should just show example how user can use

    import kubeflow.katib as katib

    def objective(parameters):
        import time
        import kubeflow.katib as katib
        time.sleep(5)
        result = 4 * int(parameters["a"])
        # Push metrics to Katib DB.
        katib.report_metrics({"result": result})

    katib.KatibClient().tune(
        name="push-metrics-exp",
        objective=objective,
        parameters={"a": katib.search.int(min=10, max=20)},
        objective_metric_name="result",
        max_trial_count=2,
        metrics_collector_config={"kind": "Push"},
        # When SDK is released, replace it with packages_to_install=["kubeflow-katib==0.18.0"]
        packages_to_install=["git+https://github.com/kubeflow/katib.git@master#subdirectory=sdk/python/v1beta1"],
    )

That should allow user to focus on important changes they need to make to try this out.
Member
Author
@andreyvelich Thanks for your clarification! I'll update the blog. |
||
|
|
||
| def objective(parameters): | ||
|     import time | ||
|     import kubeflow.katib as katib | ||
|     time.sleep(5) | ||
|     result = 4 * int(parameters["a"]) | ||
|     # Push metrics to Katib DB. | ||
|     katib.report_metrics({"result": result}) | ||
|
|
||
| katib.KatibClient(namespace="kubeflow").tune( | ||
|     name="push-metrics-exp", | ||
|     objective=objective, | ||
|     parameters={"a": katib.search.int(min=10, max=20)}, | ||
|     objective_metric_name="result", | ||
|     max_trial_count=2, | ||
|     metrics_collector_config={"kind": "Push"}, | ||
|     # When SDK is released, replace it with packages_to_install=["kubeflow-katib==0.18.0"]. | ||
|     # Currently, the training container should have `git` package to install this SDK. | ||
|     packages_to_install=["git+https://github.com/kubeflow/katib.git@master#subdirectory=sdk/python/v1beta1"], | ||
| ) | ||
| ``` | ||
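For contrast with the push-based example in the diff, here is a minimal sketch of what the training side of the pull-based flow might look like when metrics go to `StdOut`. It only mirrors the `name=value` lines shown in the pull-based section (`recall=0.55`, `precision=.5`). The helper names `format_metrics` and `log_metrics` are hypothetical and not part of the Katib SDK, and the exact pattern the collector matches is whatever `.source.filter.metricsFormat` specifies.

```python
import sys


def format_metrics(metrics):
    """Render each metric as a `name=value` line (hypothetical helper, not Katib API)."""
    return "\n".join(f"{name}={value}" for name, value in metrics.items())


def log_metrics(metrics, stream=sys.stdout):
    """Write the metric lines to the stream and flush so a reader sees them promptly."""
    print(format_metrics(metrics), file=stream, flush=True)


if __name__ == "__main__":
    # Example values only; a real training loop would log its own metrics per epoch or step.
    log_metrics({"recall": 0.55, "precision": 0.5})
```

With the pull-based collector, the training code only prints; with `Push`, the same values would instead be passed to `katib.report_metrics()`, as in the example in the diff.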