Initial version of defining the interfaces to accept metrics #15913
wangxin merged 12 commits into sonic-net:master
Conversation
|
The pre-commit check detected issues in the files touched by this pull request. Detailed pre-commit check results: To run the pre-commit checks locally, you can follow below steps:
|
| # Metrics data are organized into the hierarchies below | ||
| # ResourceMetrics | ||
| # ├── ResourceID | ||
| # └── ScopeMetrics |
There was a problem hiding this comment.
I don't think we need the level of ScopeMetrics.
There was a problem hiding this comment.
My thought is
Resource level: all metrics from one test run
Scope level: all metrics belonging to one device
Metric level: all metrics belonging to one category
I might be wrong. Let's discuss this topic tomorrow.
|
|
|
|
||
| ############################## Report Metrics ############################## | ||
|
|
||
| class MetricReporterFactory: |
There was a problem hiding this comment.
move factory to another file, so we can override easily.
There was a problem hiding this comment.
With this change, we can do this in another file:
    class MetricReporterFactory:
        def create_metrics_reporter(self):
            return OtelMetricReporter(...)

    class OtelMetricReporter:
        def emit(self, ...):
            # Real implementation goes here; each customer can define their own.
| # ├── TestID | ||
| # └── DeviceMetrics | ||
| # ├── DeviceID | ||
| # └── Metric |
There was a problem hiding this comment.
Create a generic Metric class that represents a single metric, which contains:
- Description/labels: name, description, unit, ...
- Value: a single layer is good enough with inheritance.
- Reporter: a reference to the MetricsReporter. The metric registers itself with the reporter when created, so the reporter can gather all metrics after all values have been updated.
    class Metric...:
        def __init__(self, name, ...., reporter):
            reporter.add_metric(self)
            ....

    class GaugeMetric(Metric):
        def __init__(self, name, ...., reporter):
            super().__init__(...)
            self.value = 0

        def set(self, v):
            self.value = v
    ....
    reporter = MetricReporterFactory(...).build()
    port_rx = GaugeMetric(...., reporter)
    port_rx.set(123)
    reporter.report(time)
There was a problem hiding this comment.
Hence, ultimately the final code for people to use would be:
    metrics = {
        "PortRx": GaugeMetric(......, reporter),
        ....
    }
    for r in csv:
        for c in r:
            metrics[c.title].set(c.value)
        reporter.report(time)
| # software version. They are also from the same test case identified by test_run_id. | ||
| class TestMetrics: | ||
| def __init__(self, testbed_name, os_version, testcase_name, test_run_id): | ||
| self.testbed_name = testbed_name |
There was a problem hiding this comment.
All these fields can be moved to the reporter, since they are shared by all metrics.
There was a problem hiding this comment.
TestMetrics itself can be removed, once we add the per metric class.
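One way to read the suggestions in this thread, as a minimal sketch rather than the code that was merged: the reporter owns the test-wide metadata, and each metric registers itself with the reporter when it is constructed. The names `SketchReporter`, `add_metric`, and `common_metadata`, as well as the record layout returned by `report`, are assumptions made for illustration.

```python
from typing import Dict, List, Optional


class SketchReporter:
    """Hypothetical reporter that owns the metadata shared by every metric."""

    def __init__(self, common_metadata: Dict[str, str]):
        # For example: testbed_name, os_version, testcase_name, test_run_id.
        self.common_metadata = dict(common_metadata)
        self.metrics: List["Metric"] = []

    def add_metric(self, metric: "Metric") -> None:
        self.metrics.append(metric)

    def report(self, timestamp: int) -> List[dict]:
        # Build one flat record per registered metric (assumed layout).
        return [
            {"timestamp": timestamp, "name": m.name, "unit": m.unit,
             "value": m.value, **self.common_metadata}
            for m in self.metrics
        ]


class Metric:
    """Base class for a single metric; registers itself with the reporter."""

    def __init__(self, name: str, description: str, unit: str,
                 reporter: SketchReporter):
        self.name = name
        self.description = description
        self.unit = unit
        self.value: Optional[float] = None
        reporter.add_metric(self)


class GaugeMetric(Metric):
    def __init__(self, name: str, description: str, unit: str,
                 reporter: SketchReporter):
        super().__init__(name, description, unit, reporter)
        self.value = 0

    def set(self, value: float) -> None:
        self.value = value


reporter = SketchReporter({"testbed_name": "tb1", "test_run_id": "run-1"})
port_rx = GaugeMetric("PortRx", "Port RX counter", "packets", reporter)
port_rx.set(123)
records = reporter.report(timestamp=1_700_000_000_000_000_000)
```

With this shape a separate `TestMetrics` container has nothing left to hold, which appears to be the point of the comment above.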
| @@ -0,0 +1,147 @@ | |||
| # This file defines the interfaces that snappi tests accept external metrics. | |||
There was a problem hiding this comment.
All common label names are missing too, e.g. PortId, QueueId, PSUId, ...
Otherwise it will be very hard to create a unified dashboard, because each test could use its own names, causing problems in filters.
|
|
| name, | ||
| description, | ||
| unit, | ||
| timestamp, |
There was a problem hiding this comment.
The following fields are common to the entire test, so they can be moved into the reporter as common metadata:
- testbed_name
- os_version
- testcase_name
- test_run_id
The following field is common to all metrics in a single report action, so it can be lifted into the parameters of the reporter's report function:
- timestamp
The purpose of the following field is not clear; we need to rename it to make it clear:
- component_id
There was a problem hiding this comment.
maybe the timestamp here means the test_start_time?
|
|
||
| class Metric: | ||
| def __init__(self, | ||
| name, |
| testcase_name, test_run_id, device_id, component_id, reporter, metadata, metrics) | ||
|
|
||
| # Additional fields for GaugeMetric | ||
| self.metrics = metrics or {} |
There was a problem hiding this comment.
each Metric should only represent a single metric. If we are trying to create something that holds all metrics, it should be 1 layer above, say MetricCollections / MetricList / Metrics or whatever.
There was a problem hiding this comment.
the purpose of this field is not too clear...
| @@ -0,0 +1,103 @@ | |||
| # This file defines the interfaces that snappi tests accept external metrics. | |||
| import logging | |||
There was a problem hiding this comment.
This file should not be part of intf_utils, because it is not related to interfaces.
| # Temporary code to report metrics | ||
| print(f"Reporting metrics at {timestamp}") | ||
| for metric in self.metrics: | ||
| print(metric) |
There was a problem hiding this comment.
It would be great to create a new abstract function for us to override.
| self.reporter = OtelMetricReporter(self.connection) | ||
| return self.reporter | ||
|
|
||
| class OtelMetricReporter: |
There was a problem hiding this comment.
The reporter should not be limited to Otel.
| name (str): metric name (e.g., psu power, sensor temperature, port stats, etc.) | ||
| description (str): brief description of the metric | ||
| unit (str): metric unit (e.g., seconds, bytes) | ||
| timestamp (int): UNIX Epoch time in nanoseconds when the metric is collected |
There was a problem hiding this comment.
If the timestamp is for logging the collection time, the reporter already has it, so this field can be removed.
| unit (str): metric unit (e.g., seconds, bytes) | ||
| timestamp (int): UNIX Epoch time in nanoseconds when the metric is collected | ||
| device_id (str): switch device ID | ||
| component_id (str): ID of the component (e.g., psu, sensor, port, etc.), where metrics are produced |
There was a problem hiding this comment.
This can be ignored, since the components are included in the name and we won't use it for filtering either.
There was a problem hiding this comment.
Please check out my email
| description (str): brief description of the metric | ||
| unit (str): metric unit (e.g., seconds, bytes) | ||
| timestamp (int): UNIX Epoch time in nanoseconds when the metric is collected | ||
| device_id (str): switch device ID |
There was a problem hiding this comment.
this can be lifted up to reporter, since it is common to all
| self.reporter = OtelMetricReporter(self.connection) | ||
| return self.reporter | ||
|
|
||
| class OtelMetricReporter: |
| pass | ||
|
|
||
|
|
||
| class KustoReporter: |
There was a problem hiding this comment.
let's not limit the implementation to kusto
| @@ -0,0 +1,89 @@ | |||
| # This file defines the interfaces that snappi tests accept external metrics. | |||
There was a problem hiding this comment.
the definitions of the metric names and meta are missing in the file, we need to get them defined and show a unified format. this will be used for crafting the dashboards.
| Returns: | ||
| An instance of the specified metrics reporter. | ||
| """ | ||
| if data_type == "metrics": |
There was a problem hiding this comment.
will be better to split this into 2 functions instead of using magic string.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| "psu.id": "psu1", | ||
| "model": "PWR-2422-HV-RED", | ||
| "serial": "6A011010142349Q"} | ||
|
|
| description = "PSU power reading", | ||
| unit = "W", | ||
| reporter = reporter) | ||
| power.set_gauge_metric(scope_labels, 222.00) |
There was a problem hiding this comment.
    # Create a metric and pass it to the reporter
    vol = GaugeMetric(name="Voltage",
                      description="PSU voltage reading",
                      unit="V",
                      reporter=reporter)
    # Create a metric and pass it to the reporter
    cur = GaugeMetric(name="Current",
                      description="PSU current reading",
                      unit="A",
                      reporter=reporter)
    # Create a metric and pass it to the reporter
    power = GaugeMetric(name="Power",
                        description="PSU power reading",
                        unit="W",
                        reporter=reporter)
    scope_labels["psu.id"] = "PSU 1"
    vol.set_gauge_metric(scope_labels, 12.09)
    cur.set_gauge_metric(scope_labels, 18.38)
    power.set_gauge_metric(scope_labels, 222.00)
    scope_labels["psu.id"] = "PSU 2"
    vol.set_gauge_metric(scope_labels, 12.10)
    cur.set_gauge_metric(scope_labels, 17.72)
    power.set_gauge_metric(scope_labels, 214.00)
| name: str, | ||
| description: str, | ||
| unit: str, | ||
| reporter: MetricReporterFactory): |
| return (f"Metric(name={self.name!r}, " | ||
| f"description={self.description!r}, " | ||
| f"unit={self.unit!r}, " | ||
| f"reporter={self.reporter!r})") |
There was a problem hiding this comment.
The reporter might not be convertible to a string.
| # Initialize the base class | ||
| super().__init__(name, description, unit, reporter) | ||
|
|
||
| def set_gauge_metric(self, scope_labels: Dict[str, str], value: Union[int, str, float]): |
There was a problem hiding this comment.
rename function to record, we need to support multiple metrics.
|
|
||
| class MetricReporterFactory: | ||
| def __init__(self): | ||
| self.reporter = None |
| reporter = factory.create_metrics_reporter(resource_labels) | ||
|
|
||
| scope_labels = { | ||
| "device.id": "str-7060x6-64pe-stress-02", |
There was a problem hiding this comment.
Label names need to be standardized for our test cases. Otherwise, there is no way to build standard dashboards.
| # Temporary code initializing a RecordsReporter | ||
| # will be replaced with a real initializer such as Kusto | ||
| self.resource_labels = resource_labels | ||
| self.timestamp = int(time.time() * 1_000_000_000) # epoch time in nanoseconds |
There was a problem hiding this comment.
timestamp should not be here.
There was a problem hiding this comment.
it should be report function parameter.
| self.resource_labels = resource_labels | ||
| self.timestamp = int(time.time() * 1_000_000_000) # epoch time in nanoseconds | ||
| self.records = [] | ||
|
|
There was a problem hiding this comment.
need function to push records into the self.records list.
| Abstract method to report records at a given timestamp. | ||
| Subclasses must override this method. | ||
| """ | ||
| pass |
There was a problem hiding this comment.
The report function is usually written this way:
    def report(self):
        incoming_records = self.records
        self.records = []
        self.process_incoming_records(incoming_records)
| @@ -0,0 +1,64 @@ | |||
| """ | |||
There was a problem hiding this comment.
rename this file to metrics.py
There was a problem hiding this comment.
The point is to consider the usage:
    from metrics_utils.metrics_accepter import Metric, GaugeMetric  # The code looks weird here
    from utils.metrics import GaugeMetric  # This looks more natural
    from metrics_utils.metrics import GaugeMetric  # This works too.
|
|
||
| #from metrics_accepter import Metric, GaugeMetric | ||
|
|
||
| class MetricReporterFactory: |
There was a problem hiding this comment.
move factory to a dedicated file. for reporter, we can leave in this file or move to metrics.py, no strong opinion in that.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| @@ -0,0 +1,13 @@ | |||
| { | |||
| "allowed_labels": [ | |||
| "testbed.id", | |||
There was a problem hiding this comment.
Let's make them constants in code.
There was a problem hiding this comment.
Do you mean changing it to this in metrics.py?
    # Only these labels are allowed
    ALLOWED_LABELS = {
        "testbed.id",
        "os.version",
        "testrun.id",
        "testcase",
        "device.id",
        "psu.id",
        "port.id",
        "sensor.id",
        "queue.id",
    }

    class MetricsReporter:
        def __init__(self, resource_labels: Dict[str, str]):
            for label in resource_labels:
                if label not in ALLOWED_LABELS:
                    raise LabelError(f"Invalid label: {label}.")
            # Temporary code initializing a MetricsReporter
            # will be replaced with a real initializer such as OpenTelemetry
            self.resource_labels = resource_labels
            self.metrics = []
tests/snappi_tests/utils/examples.py
Outdated
|
|
||
| """ | ||
| resource_labels = { | ||
| "testbed.id": "sonic_stress_testbed", |
There was a problem hiding this comment.
Label keys should be constants instead of using literals.
There was a problem hiding this comment.
2 approaches:
- Add
      ALLOWED_LABELS = {
          "testbed.id",
          "os.version",
          "testrun.id",
          "testcase",
          "device.id",
          "psu.id",
          "port.id",
          "sensor.id",
          "queue.id",
      }
  in metrics.py and keep this place unchanged.
- Add
      ALLOWED_LABELS = {
          "TESTBED_ID": "testbed.id",
          "OS_VERSION": "os.version",
          "TESTCASE": "testcase",
          "TESTRUN_ID": "testrun.id",
          ... ...
      }
  in metrics.py and change this place to
      resource_labels = {
          ALLOWED_LABELS["TESTBED_ID"]: "sonic_stress_testbed",
          ALLOWED_LABELS["OS_VERSION"]: "11.2.3",
          ALLOWED_LABELS["TESTCASE"]: "stress_test1",
          ALLOWED_LABELS["TESTRUN_ID"]: "202412101217"
      }
Which way do you prefer?
There was a problem hiding this comment.
Take os version as an example:
    from typing import Final

    METRIC_LABEL_TEST_TESTBED: Final[str] = "test.testbed"
    METRIC_LABEL_TEST_BRANCH: Final[str] = "test.branch"
    METRIC_LABEL_TEST_CASE: Final[str] = "test.testcase"
    METRIC_LABEL_TEST_FILE: Final[str] = "test.test_file"
    ...
    METRIC_LABEL_DEVICE_ID: Final[str] = "device.id"
    METRIC_LABEL_DEVICE_PORT_ID: Final[str] = "device.port.id"
    METRIC_LABEL_DEVICE_QUEUE_ID: Final[str] = "device.queue.id"
    METRIC_LABEL_DEVICE_PSU_ID: Final[str] = "device.psu.id"
    ...
    resource_labels = {
        METRIC_LABEL_TEST_TESTBED: "abc",
        METRIC_LABEL_TEST_BRANCH: "202411",
        METRIC_LABEL_TEST_CASE: "mock-case",
        METRIC_LABEL_TEST_FILE: "mock-test.py",
        ...
    }
    ...
    scope_labels[METRIC_LABEL_DEVICE_PSU_ID] = "PSU 1"
    voltage.record(scope_labels, 12.09)
There was a problem hiding this comment.
please make sure to check the design doc I shared with you for adding the required labels.
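The two ideas in this thread (label keys as constants, and rejecting labels outside an allowed set) can be combined. The following is only a sketch under assumptions: the three constants are copied from the comment above, while `KNOWN_LABELS`, `validate_labels`, and the `LabelError` definition are hypothetical names, and the authoritative label list is the one in the design doc, which is not reproduced here.

```python
from typing import Dict, Final, FrozenSet

# Label keys copied from the review comment above.
METRIC_LABEL_TEST_TESTBED: Final[str] = "test.testbed"
METRIC_LABEL_DEVICE_ID: Final[str] = "device.id"
METRIC_LABEL_DEVICE_PSU_ID: Final[str] = "device.psu.id"

# Hypothetical set of accepted keys; deliberately incomplete in this sketch.
KNOWN_LABELS: Final[FrozenSet[str]] = frozenset({
    METRIC_LABEL_TEST_TESTBED,
    METRIC_LABEL_DEVICE_ID,
    METRIC_LABEL_DEVICE_PSU_ID,
})


class LabelError(ValueError):
    """Raised when a label key is not one of the shared constants (assumed)."""


def validate_labels(labels: Dict[str, str]) -> None:
    """Reject any label key that is not in KNOWN_LABELS."""
    unknown = sorted(set(labels) - KNOWN_LABELS)
    if unknown:
        raise LabelError(f"Invalid label(s): {', '.join(unknown)}")


# Keys built from the constants pass validation.
validate_labels({METRIC_LABEL_DEVICE_ID: "dut-1",
                 METRIC_LABEL_DEVICE_PSU_ID: "PSU 1"})

# A free-form literal such as "psu.id" is caught instead of silently
# creating a second spelling of the same label.
try:
    validate_labels({"psu.id": "PSU 1"})
    rejected = False
except LabelError:
    rejected = True
```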
tests/snappi_tests/utils/metrics.py
Outdated
| """ | ||
|
|
||
|
|
||
| class TestResultsReporter: |
There was a problem hiding this comment.
This is not test result, which usually refers to pass/fail sort of things
There was a problem hiding this comment.
What do we want to name it then? How about TestStatus?
tests/snappi_tests/utils/metrics.py
Outdated
| stashed_test_results = self.test_results | ||
| self.test_results = [] | ||
|
|
||
| """ |
There was a problem hiding this comment.
Are these removed accidentally and forgot to put back?
There was a problem hiding this comment.
I don't quite understand you. In the commented code
"""
print(f"Current time (ns): {current_time}")
pprint(self.resource_labels)
pprint(stashed_metrics)
process_stashed_metrics(current_time, stashed_metrics)
"""
The first 3 lines are for my own testing purpose only. process_stashed_metrics() will later be replaced with real code to emit the metrics to InfluxDB.
There was a problem hiding this comment.
There is no way at the language level to override commented-out code; here we need to provide a "virtual function" for the subclass to implement.
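In Python, the "virtual function" asked for here is commonly expressed with `abc.abstractmethod`. The sketch below is an assumption about how that could look, not the merged code: `BaseReporter`, `stash_record`, and `InMemoryReporter` are illustrative names, and `process_stashed_metrics` is borrowed from the commented-out call quoted earlier in this thread.

```python
import time
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple


class BaseReporter(ABC):
    """Hypothetical base class: collects records and hands them to a hook."""

    def __init__(self, resource_labels: Dict[str, str]):
        self.resource_labels = dict(resource_labels)
        self.records: List[dict] = []

    def stash_record(self, labels: Dict[str, str], value: float) -> None:
        # Copy the labels so later changes by the caller do not leak in.
        self.records.append({"labels": dict(labels), "value": value})

    def report(self, timestamp: Optional[int] = None) -> None:
        current_time = time.time_ns() if timestamp is None else timestamp
        # Swap the list out first so new records can arrive while the
        # stashed batch is being processed.
        stashed, self.records = self.records, []
        self.process_stashed_metrics(current_time, stashed)

    @abstractmethod
    def process_stashed_metrics(self, timestamp: int,
                                metrics: List[dict]) -> None:
        """Backend-specific emit step; subclasses must implement this."""


class InMemoryReporter(BaseReporter):
    """Example subclass that keeps emitted batches in a list."""

    def __init__(self, resource_labels: Dict[str, str]):
        super().__init__(resource_labels)
        self.emitted: List[Tuple[int, List[dict]]] = []

    def process_stashed_metrics(self, timestamp: int,
                                metrics: List[dict]) -> None:
        self.emitted.append((timestamp, metrics))


reporter = InMemoryReporter({"test.testbed": "abc"})
reporter.stash_record({"device.id": "dut-1"}, 12.09)
reporter.report(timestamp=42)
```

Because `BaseReporter` declares an abstract method, instantiating it directly raises `TypeError`, so a backend that forgets to implement the hook fails at construction time rather than silently dropping metrics.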
tests/snappi_tests/utils/metrics.py
Outdated
| if timestamp is not None: | ||
| current_time = timestamp | ||
| else: | ||
| current_time = time.time_ns() |
There was a problem hiding this comment.
Can this be moved to parameter?
There was a problem hiding this comment.
Is this what you meant?
current_time = timestamp or time.time_ns()
There was a problem hiding this comment.
have you tried this?
def report(self, timestamp=time.time_ns()):
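A detail worth checking for both variants quoted in this thread: Python evaluates a default argument once, when the `def` statement runs, so `timestamp=time.time_ns()` freezes the time at definition; and `timestamp or time.time_ns()` replaces an explicit `0` because `0` is falsy. The sketch below uses free functions named `report_eager` and `report_lazy` (hypothetical names) to show the difference and the usual `None` sentinel alternative.

```python
import time


def report_eager(timestamp=time.time_ns()):
    # The default was computed once, when this function was defined.
    return timestamp


def report_lazy(timestamp=None):
    # None sentinel: the clock is read on every call, and an explicit
    # timestamp of 0 is still respected.
    return time.time_ns() if timestamp is None else timestamp


first = report_eager()
time.sleep(0.05)
second = report_eager()   # same value as `first`, despite the sleep
later = report_lazy()     # reads the clock at call time
explicit_zero = report_lazy(0)
```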
tests/snappi_tests/utils/metrics.py
Outdated
| self.resource_labels = resource_labels | ||
| self.test_results = [] | ||
|
|
||
| def stash_test_results(self, labels: Dict[str, str], value: Union[int, str, float]): |
tests/snappi_tests/utils/metrics.py
Outdated
| self.resource_labels = resource_labels | ||
| self.metrics = [] | ||
|
|
||
| def stash_metric(self, new_metric: 'GaugeMetric', labels: Dict[str, str], value: Union[int, str, float]): |
There was a problem hiding this comment.
second parameter type is better to be the base class
tests/snappi_tests/utils/metrics.py
Outdated
|
|
||
| def stash_metric(self, new_metric: 'GaugeMetric', labels: Dict[str, str], value: Union[int, str, float]): | ||
| # add a new metric | ||
| self.metrics.append({"labels": labels, "value": value}) |
There was a problem hiding this comment.
labels will need to be deep copied
There was a problem hiding this comment.
Change it to
# Deep copy the labels to ensure stored data is immutable
copied_labels = deepcopy(labels)
# Add the new metric
self.metrics.append({"labels": copied_labels, "value": value})
Do I understand you correctly?
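The reason a copy is needed shows up in the usage pattern from earlier in this review, where one `scope_labels` dict is mutated between calls. A small sketch, with the two list names chosen only for illustration:

```python
from copy import deepcopy

stored_by_reference = []
stored_by_copy = []

scope_labels = {"device.psu.id": "PSU 1"}

# Record a reading for PSU 1 in both styles.
stored_by_reference.append({"labels": scope_labels, "value": 12.09})
stored_by_copy.append({"labels": deepcopy(scope_labels), "value": 12.09})

# The caller now reuses the same dict for the next PSU.
scope_labels["device.psu.id"] = "PSU 2"

# The record stored by reference has silently changed to "PSU 2";
# the copied record still says "PSU 1".
label_in_reference = stored_by_reference[0]["labels"]["device.psu.id"]
label_in_copy = stored_by_copy[0]["labels"]["device.psu.id"]
```

For a flat `Dict[str, str]` a shallow `dict(labels)` would behave the same way; `deepcopy` additionally covers nested values if the label type ever changes.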
sm-xu
left a comment
There was a problem hiding this comment.
Please review. Thanks!
| @@ -0,0 +1,20 @@ | |||
|
|
|||
There was a problem hiding this comment.
nit: remove empty line.
I wonder why pre-commit didn't fail for this. CI did fail due to static analysis; it might be better to check that.
tests/snappi_tests/utils/metrics.py
Outdated
| from typing import List, Dict, Union | ||
|
|
||
| # Function to load allowed labels from a JSON file | ||
| def load_allowed_labels(filename="allowed_labels.json"): |
There was a problem hiding this comment.
this could be removed once moved to constants.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
As discussed, please make the design doc update in the README file for this folder in a separate PR. |
| def __init__(self): | ||
| return | ||
|
|
||
| def create_periodic_metrics_reporter(common_labels: Dict[str, str]): |
There was a problem hiding this comment.
Should it be @staticmethod?
| def create_periodic_metrics_reporter(common_labels: Dict[str, str]): | ||
| return (PeriodicMetricsReporter(common_labels)) | ||
|
|
||
| def create_final_metrics_reporter(common_labels: Dict[str, str]): |
There was a problem hiding this comment.
Should it be @staticmethod?
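The practical difference the question points at can be seen by calling the factory method on an instance. In this sketch `PeriodicMetricsReporter` is a stand-in with an assumed constructor, and the two factory class names are hypothetical; only the method name and its `common_labels` parameter come from the diff.

```python
from typing import Dict


class PeriodicMetricsReporter:
    """Stand-in for the reporter class named in the diff."""

    def __init__(self, common_labels: Dict[str, str]):
        self.common_labels = common_labels


class FactoryWithoutDecorator:
    # No `self` and no decorator: works when called on the class only.
    def create_periodic_metrics_reporter(common_labels: Dict[str, str]):
        return PeriodicMetricsReporter(common_labels)


class FactoryWithStaticMethod:
    @staticmethod
    def create_periodic_metrics_reporter(common_labels: Dict[str, str]):
        return PeriodicMetricsReporter(common_labels)


labels = {"device.id": "switch-1"}

# Called on the class, both versions work.
from_class = FactoryWithoutDecorator.create_periodic_metrics_reporter(labels)

# Called on an instance, the undecorated version receives the instance as
# an extra first argument and raises TypeError.
try:
    FactoryWithoutDecorator().create_periodic_metrics_reporter(labels)
    instance_call_failed = False
except TypeError:
    instance_call_failed = True

# With @staticmethod, the instance call behaves the same as the class call.
from_instance = FactoryWithStaticMethod().create_periodic_metrics_reporter(labels)
```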
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Cherry-pick PR to msft-202412: Azure/sonic-mgmt.msft#45 |
Description of PR
Summary:
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
How did you do it?
How did you verify/test it?
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation