Skip to content

[202411][Mellanox] Support new field supporting_bulk_counter_groups in DEVICE__METADATA|localhost for Mellanox-SN4700-V64#22224

Closed
ayurkiv-nvda wants to merge 1 commit intosonic-net:202411from
ayurkiv-nvda:master_upstream_bulk_counter_v64
Closed

[202411][Mellanox] Support new field supporting_bulk_counter_groups in DEVICE__METADATA|localhost for Mellanox-SN4700-V64#22224
ayurkiv-nvda wants to merge 1 commit intosonic-net:202411from
ayurkiv-nvda:master_upstream_bulk_counter_v64

Conversation

@ayurkiv-nvda
Copy link
Contributor

Why I did it

Support new field "supporting_bulk_counter_groups" in DEVICE_METADATA|localhost for Mellanox-SN4700. This change is required to optimize and speed up counters initialization.

Work item tracking
  • Microsoft ADO (number only):

How I did it

Support new field "supporting_bulk_counter_groups" in DEVICE_METADATA|localhost for Mellanox-SN4280-O28. This change is required to optimize and speed up counters initialization.

How to verify it

run config reload and check initialization sequence

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ayurkiv-nvda ayurkiv-nvda changed the title [Mellanox] Support new field supporting_bulk_counter_groups in DEVICE__METADATA|localhost for Mellanox-SN4700-V64 [202411_RC][Mellanox] Support new field supporting_bulk_counter_groups in DEVICE__METADATA|localhost for Mellanox-SN4700-V64 Apr 3, 2025
@ayurkiv-nvda ayurkiv-nvda changed the title [202411_RC][Mellanox] Support new field supporting_bulk_counter_groups in DEVICE__METADATA|localhost for Mellanox-SN4700-V64 [202411][Mellanox] Support new field supporting_bulk_counter_groups in DEVICE__METADATA|localhost for Mellanox-SN4700-V64 Apr 3, 2025
@ayurkiv-nvda ayurkiv-nvda changed the base branch from master to 202411 April 3, 2025 12:39
…_METADATA|localhost for Mellanox-SN4700-V64

Signed-off-by: Andriy Yurkiv <ayurkiv@nvidia.com>
@ayurkiv-nvda ayurkiv-nvda force-pushed the master_upstream_bulk_counter_v64 branch from cdc56cb to d55a4b1 Compare April 3, 2025 12:47
@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@bingwang-ms
Copy link
Contributor

bingwang-ms commented Apr 12, 2025

@qiluo-msft @SuvarnaMeenakshi Could you help review? I remember earlier the create_only_config_db_buffers change caused some problem in snmp queue counters (16 queue counters are hardcoded).

@stephenxs
Copy link
Collaborator

@qiluo-msft @SuvarnaMeenakshi Could you help review? I remember earlier the create_only_config_db_buffers change caused some problem in snmp queue counters (16 queue counters are hardcoded).

@bingwang-ms Could you please remind me what the issue was? If it was that the queue 7 counters were not polled, it should be fixed by the following PR sonic-net/sonic-swss#3334

Thanks

@qiluo-msft
Copy link
Collaborator

@bingwang-ms @kperumalbfn Seems like the purpose is to change 202411 Mellanox-SN4700-V64 golden config. Changing image default behavior is not encourage, instead we should change our golden config genration logic.

@bingwang-ms
Copy link
Contributor

bingwang-ms commented Apr 16, 2025

@qiluo-msft @SuvarnaMeenakshi Could you help review? I remember earlier the create_only_config_db_buffers change caused some problem in snmp queue counters (16 queue counters are hardcoded).

@bingwang-ms Could you please remind me what the issue was? If it was that the queue 7 counters were not polled, it should be fixed by the following PR sonic-net/sonic-swss#3334

Thanks

Hi @stephenxs
I'm talking about an issue in SNMP code. Earlier there was some hardcoded logic in SNMP to always read 16 queues (8 UC + 8 MC). If we enabled create_only_config_db_buffers, then the number of queues is changed since we don't have MC queues. Not sure if that issue has been fixed. @SuvarnaMeenakshi

Added the link to the issue #17448

@bingwang-ms
Copy link
Contributor

@bingwang-ms @kperumalbfn Seems like the purpose is to change 202411 Mellanox-SN4700-V64 golden config. Changing image default behavior is not encourage, instead we should change our golden config genration logic.

Hi @ayurkiv-nvda, can you please reply to the comment from Qi?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the behavior if only supporting_bulk_counter_groups is present while create_only_config_db_buffers.json is missing?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @bingwang-ms
In that case the bulk won't work correctly because

  1. We only use configured queues. All those queues support bulk counter polling
  2. Without create_only_config_db_buffers, flex counter will bulk-poll all queue counters, some of which do not support bulk polling.
  3. It can fail if we try to bulk-poll objects with mixed supporting and unsupporting objects. There is a logic to dynamically determine whether bulk is supported on an object and remove the objects that do not support bulk poll during initialization. This guarantees that all objects polled in bulk mode at run time support bulk polling, like the scenario where create_only_config_db_buffers is not provided.
  4. However, by providing supporting_bulk_counter_groups, the logic in 3 is disabled. There will be runtime error.

In master, create_only_config_db_buffers is not a must but still recommended because it is faster

@ayurkiv-nvda
Copy link
Contributor Author

Seems like the purpose is to change 202411 Mellanox-SN4700-V64 golden config. Changing image default behavior is not encourage, instead we should change our golden config genration logic.

Hello @qiluo-msft @bingwang-ms
Can you please provide more details?
As far as I know, we can put /etc/sonic/golden_config_db.json to override configDB after load minigraph.
But I am afraid I am not familiar with golden config generation logic. Is that something MSFT internal?

@bingwang-ms
Copy link
Contributor

Seems like the purpose is to change 202411 Mellanox-SN4700-V64 golden config. Changing image default behavior is not encourage, instead we should change our golden config genration logic.

Hello @qiluo-msft @bingwang-ms Can you please provide more details? As far as I know, we can put /etc/sonic/golden_config_db.json to override configDB after load minigraph. But I am afraid I am not familiar with golden config generation logic. Is that something MSFT internal?

Hi @ayurkiv-nvda , it's not expected to update config_db dynamically as config_db is the ground truth of configuration. If a feature needs to be enabled/disabled by config_db, then the suggestion is to let user make the decision.
Coming back to this change, I think we can only keep the change in supporting_bulk_counter_groups and remove the create_only_config_db_buffers.json change. The feature can be enabled by injecting configuration by user. Please let me know how do you think.
cc @qiluo-msft

@ayurkiv-nvda
Copy link
Contributor Author

Hello @bingwang-ms. Thank you for response!

Coming back to this change, I think we can only keep the change in supporting_bulk_counter_groups and remove the create_only_config_db_buffers.json change. The feature can be enabled by injecting configuration by user. Please let me know how do you think.

Just want to make sure that we are aligned
@stephenxs recently added details regarding #22224 (comment) . There are might be problems in this scenario. I think in this case, the user has more responsibility to properly setup config

Thanks

@bingwang-ms
Copy link
Contributor

Just to make sure I fully understand the problem.
If we only have supporting_bulk_counter_groups without create_only_config_db_buffers.json, my assumption is the feature is disabled but no function impact (expect for the slowness). Is that true? As @stephenxs said there is runtime error. That's a little concerning.

@stephenxs
Copy link
Collaborator

Just to make sure I fully understand the problem.

If we only have supporting_bulk_counter_groups without create_only_config_db_buffers.json, my assumption is the feature is disabled but no function impact (expect for the slowness). Is that true? As @stephenxs said there is runtime error. That's a little concerning.

Hi @bingwang-ms
This is the limitations of current solution. We do not check bulk capabilities during initiation so we need to guarantee all counters support bulk. So we need to have create_only_config_db_buffers together
We also have a formal solution which is only in 202412 and above. create_only_config_db_buffers is no longer mandatory but is still recommended

@bingwang-ms
Copy link
Contributor

Just to make sure I fully understand the problem.
If we only have supporting_bulk_counter_groups without create_only_config_db_buffers.json, my assumption is the feature is disabled but no function impact (expect for the slowness). Is that true? As @stephenxs said there is runtime error. That's a little concerning.

Hi @bingwang-ms This is the limitations of current solution. We do not check bulk capabilities during initiation so we need to guarantee all counters support bulk. So we need to have create_only_config_db_buffers together We also have a formal solution which is only in 202412 and above. create_only_config_db_buffers is no longer mandatory but is still recommended

Hi @stephenxs, what will happen if create_only_config_db_buffers is not present while supporting_bulk_counter_groups is present? It's concerning as we can't guarantee that create_only_config_db_buffers is always consumed by orchanget before supporting_bulk_counter_groups.

@stephenxs
Copy link
Collaborator

Just to make sure I fully understand the problem.
If we only have supporting_bulk_counter_groups without create_only_config_db_buffers.json, my assumption is the feature is disabled but no function impact (expect for the slowness). Is that true? As @stephenxs said there is runtime error. That's a little concerning.

Hi @bingwang-ms This is the limitations of current solution. We do not check bulk capabilities during initiation so we need to guarantee all counters support bulk. So we need to have create_only_config_db_buffers together We also have a formal solution which is only in 202412 and above. create_only_config_db_buffers is no longer mandatory but is still recommended

Hi @stephenxs, what will happen if create_only_config_db_buffers is not present while supporting_bulk_counter_groups is present? It's concerning as we can't guarantee that create_only_config_db_buffers is always consumed by orchanget before supporting_bulk_counter_groups.

@bingwang-ms
The flow guarantees it.

  1. OA consumes create_only_config_db_buffers and then initializes the corresponding counters.
  2. At the same time, SAIredis handles supporting_bulk_counter_groups, storing this information in its internal data structure.
  3. SAIredis receives "initialize counter" operations from OA and starts to handle them.
    In this stage, SAIredis needs the information stored in step 2.

It does not guarantee the order between 1 and 2.
But it guarantees the order between 1 and 3, and 2 and 3.

@bingwang-ms
Copy link
Contributor

bingwang-ms commented May 6, 2025

Discussed offline

  1. In 202411, both create_only_config_db_buffers and supporting_bulk_counter_groups are required. If only supporting_bulk_counter_groups is present, the query counter for MC queue will result in exception as bulk query is not support for MC queue.
  2. In master and above, only create_only_config_db_buffers is required. supporting_bulk_counter_groups is no longer required. But setting it will not result in exception.
  3. Base on 1 and 2, the decision is MSFT to generate configuration to enable create_only_config_db_buffers and supporting_bulk_counter_groups for 202405 and above.

I will check if there is a way in test to enable both create_only_config_db_buffers and supporting_bulk_counter_groups.

cc @qiluo-msft @stephenxs @lolyu

@ayurkiv-nvda
Copy link
Contributor Author

ayurkiv-nvda commented Jun 3, 2025

Close this PR as it is not needed for MSFT for 202411

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants