[Mellanox] Enabling Per-Queue, Per-Priority, Per-Direction QoS Buffer/Scheduler/WRED Configurations#9
[Mellanox] Enabling Per-Queue, Per-Priority, Per-Direction QoS Buffer/Scheduler/WRED Configurations#9
Conversation
60db428 to
bc25b1d
Compare
device/mellanox/x86_64-nvidia_sn5600-r0/Mellanox-SN5600-C256S1/buffers_defaults_objects.j2
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
I don't think we should change this part.
The logic should be in per SKU templates. Otherwise, all other vendors with system port will be broken.
Let's move it to per SKU templates.
There was a problem hiding this comment.
OK, will keep egress_lossy_profile, and make egress_lossy_profile_uplink per sku.
There was a problem hiding this comment.
This should be in per SKU templates as well.
files/build_templates/qos_config.j2
Outdated
There was a problem hiding this comment.
We should not test platform/SKU name in the common code.
I suggest defining such a flag in a per-SKU template file like trimming.j2 and then including it in per-SKU qos and buffer templates.
There was a problem hiding this comment.
OK, agree, that is good way to differentiate.
There was a problem hiding this comment.
Done, added trimming_moose.j2 and trimming_bison.j2, and make symbolic links with trimming.j2
files/build_templates/qos_config.j2
Outdated
There was a problem hiding this comment.
Why do you change common files?
We already have per SKU WRED profile defined in per SKU QoS mappings.
There was a problem hiding this comment.
OK, will keep common file as original, to not impact other vendors.
files/build_templates/qos_config.j2
Outdated
There was a problem hiding this comment.
need to check backward compatibility. I'm not sure whether other vendors ahve been using it.
There was a problem hiding this comment.
Yes, other vendors like Arista is using it. Maybe I can update naming to generate_direction_based_queue_per_sku()
There was a problem hiding this comment.
Done, add new macro generate_direction_based_queue_per_sku(port, 'uplink') instead.
device/mellanox/x86_64-nvidia_sn5600-r0/Mellanox-SN5600-C256S1/buffers_defaults_objects.j2
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
backward compatibility?
There was a problem hiding this comment.
Done, use generate_direction_based_queue_per_sku(port, 'uplink') instead.
02a3727 to
3d48587
Compare
There was a problem hiding this comment.
we do not need it for generic SKU. MSFT SKU only.
There was a problem hiding this comment.
No need for a generic SKU. MSFT SKU only.
There was a problem hiding this comment.
Usually, we define such files inside one SKU folder and make symbolic links in the rest SKU folders.
device/mellanox/x86_64-nvidia_sn5600-r0/Mellanox-SN5600-C256S1/qos.json.j2
Outdated
Show resolved
Hide resolved
device/mellanox/x86_64-nvidia_sn5600-r0/Mellanox-SN5600-C256S1/qos.json.j2
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
let's use dynamic 0 as HBA is not disabled
There was a problem hiding this comment.
let's use dynamic 0 as HBA is not disabled
There was a problem hiding this comment.
let's use dynamic 7 as HBA is not disabled
device/mellanox/x86_64-nvidia_sn5600-r0/Mellanox-SN5600-C256S1/qos.json.j2
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Don't touch irrelevant code
f353b44 to
da3439e
Compare
jianyuewu
left a comment
There was a problem hiding this comment.
Local tested with config qos reload, seems it is good, no impact to swss.
cef6906 to
8ee4fe8
Compare
device/mellanox/x86_64-nvidia_sn5600-r0/Mellanox-SN5600-C256S1/buffers_defaults_objects.j2
Outdated
Show resolved
Hide resolved
device/mellanox/x86_64-nvidia_sn5600-r0/Mellanox-SN5600-C256S1/buffers_defaults_objects.j2
Outdated
Show resolved
Hide resolved
device/mellanox/x86_64-nvidia_sn5600-r0/Mellanox-SN5600-C256S1/qos.json.j2
Outdated
Show resolved
Hide resolved
device/mellanox/x86_64-nvidia_sn5600-r0/Mellanox-SN5600-C256S1/qos.json.j2
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
The issue isn't fixed.
Please check it.
device/mellanox/x86_64-nvidia_sn5600-r0/Mellanox-SN5600-C256S1/qos.json.j2
Outdated
Show resolved
Hide resolved
device/mellanox/x86_64-nvidia_sn5600-r0/Mellanox-SN5600-C256S1/qos.json.j2
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
You removed rest fields in PORT_QOS_MAP. Everything will be broken.
You should make sure existing fields in port qos map won't be impacted.
on top of it, you need to add tc_to_dscp_map.
There was a problem hiding this comment.
Understand, I'll still keep original dscp_to_tc_map.
There was a problem hiding this comment.
not only dscp_to_tc, but all the maps.
There was a problem hiding this comment.
I see, indeed, there are many tables there.
{% if (generate_port_qos_map is defined) %}
{{- generate_port_qos_map(port_names_active) }}
{% else %}
"PORT_QOS_MAP": {
{% if generate_global_dscp_to_tc_map is defined %}
{{- generate_global_dscp_to_tc_map() }}
{% elif require_global_dscp_to_tc_map %}
"global": {
"dscp_to_tc_map" : "AZURE"
}{% if PORT_ACTIVE %},{% endif %}
{% endif %}
{% for port in PORT_ACTIVE %}
"{{ port }}": {
{% if 'type' in DEVICE_METADATA['localhost'] and DEVICE_METADATA['localhost']['type'] in backend_device_types and 'storage_device' in DEVICE_METADATA['localhost'] and DEVICE_METADATA['localhost']['storage_device'] == 'true' %}
"dot1p_to_tc_map" : "AZURE",
{% else %}
{# Apply separated DSCP_TO_TC_MAP to uplink ports on ToR and Leaf #}
{% if different_dscp_to_tc_map and tunnel_qos_remap_enable %}
{% if ('type' in DEVICE_METADATA['localhost']) and (DEVICE_METADATA['localhost']['type'] == 'LeafRouter') and (port not in port_names_list_extra_queues) %}
"dscp_to_tc_map" : "AZURE_UPLINK",
{% elif ('subtype' in DEVICE_METADATA['localhost']) and (DEVICE_METADATA['localhost']['subtype'] == 'DualToR') and (port in port_names_list_extra_queues) %}
"dscp_to_tc_map" : "AZURE_UPLINK",
{% else %}
"dscp_to_tc_map" : "AZURE",
{% endif %}
{% else %}
"dscp_to_tc_map" : "AZURE",
{% endif %}
{% endif %}
{# Apply separated TC_TO_QUEUE_MAP to uplink ports on ToR #}
{% if different_tc_to_queue_map and tunnel_qos_remap_enable and port in port_names_list_extra_queues %}
"tc_to_queue_map" : "AZURE_UPLINK",
{% else %}
"tc_to_queue_map" : "AZURE",
{% endif %}
{% if asic_type in pfc_to_pg_map_supported_asics %}
{% if port in port_names_list_extra_queues %}
"pfc_to_pg_map" : "AZURE_DUALTOR",
{% else %}
"pfc_to_pg_map" : "AZURE",
{% endif %}
{% endif %}
{% if port not in PORT_DPC %}
{% if port in port_names_list_extra_queues %}
"pfc_enable" : "2,3,4,6",
{% else %}
"pfc_enable" : "{{ LOSSLESS_TC|join(',') }}",
{% endif %}
"pfcwd_sw_enable" : "{{ LOSSLESS_TC|join(',') }}",
{% endif %}
{% if port not in PORT_DPC %}
"tc_to_pg_map" : "AZURE",
{% else %}
"tc_to_pg_map" : "AZURE_DPC",
{% endif %}
"pfc_to_queue_map": "AZURE"
}{% if not loop.last %},{% endif %}
{% endfor %}
},
{% endif %}There was a problem hiding this comment.
where is this macro called?
There was a problem hiding this comment.
Removed this macro in new patch.
There was a problem hiding this comment.
Let's change switch trimming enable to "support_tc_8"?
They don't want to expose words like "trimming"
There was a problem hiding this comment.
Unlike PFC relevant mappings which are bit maps, setting such a mapping to an empty string is problematic because it is not defined.
Let's not to define tc to dscp field if it is not applicable.
There was a problem hiding this comment.
We don't need to change anything for O128 SKU
It's not a MRC SKU
files/build_templates/qos_config.j2
Outdated
There was a problem hiding this comment.
You may want to remove it
They don't want to expose trimming stuffs
device/mellanox/x86_64-nvidia_sn5600-r0/Mellanox-SN5600-C256S1/qos.json.j2
Outdated
Show resolved
Hide resolved
67da368 to
dcef779
Compare
a2e7cbc to
ddc7356
Compare
Update BUFFER_QUEUE, QUEUE, SCHEDULER, WRED_PROFILE, BUFFER_PROFILE. Phase 1 change without trimming related.
Add TC_TO_DSCP and SWITCH_TRIMMING.
Add per direction config. Add trimming flag. Add size field in BUFFER_PROFILE. Restore common config, put new update in per sku config. Signed-off-by: Jianyue Wu <jianyuew@nvidia.com>
Update dynamic_th, DSCP_TO_TC_MAP, WRED_PROFILE. Update generate_port_qos_map for tc_to_dscp_map. Signed-off-by: Jianyue Wu <jianyuew@nvidia.com>
Update port qos map. Update wred map Q4, Q5, same as Q1-3. Remove port_config.j2, put content in qos_config.j2 directly. Signed-off-by: Jianyue Wu <jianyuew@nvidia.com>
Signed-off-by: Jianyue Wu <jianyuew@nvidia.com>
Update traffic config. Signed-off-by: Jianyue Wu <jianyuew@nvidia.com>
Signed-off-by: Jianyue Wu <jianyuew@nvidia.com>
Signed-off-by: Jianyue Wu <jianyuew@nvidia.com>
ddc7356 to
637d917
Compare
Signed-off-by: Jianyue Wu <jianyuew@nvidia.com>
Signed-off-by: Jianyue Wu <jianyuew@nvidia.com>
cc2746b to
652df64
Compare
Signed-off-by: Jianyue Wu <jianyuew@nvidia.com>
Also update wred profile value for moose and bison.
Signed-off-by: Jianyue Wu <jianyuew@nvidia.com>
<!--
Please make sure you've read and understood our contributing guidelines:
https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md
** Make sure all your commits include a signature generated with `git commit -s` **
If this is a bug fix, make sure your description includes "fixes #xxxx", or
"closes #xxxx" or "resolves #xxxx"
Please provide the following information:
-->
#### Why I did it
Adding the below fix from FRR FRRouting/frr#17297
This is to fix the following crash which is a statistical issue
```
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007fccd7351e2c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7fccd6faf7c0 (LWP 36))]
(gdb) bt
#0 0x00007fccd7351e2c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fccd7302fb2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007fccd72ed472 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007fccd75bb3a9 in _zlog_assert_failed (xref=xref@entry=0x7fccd7652380 <_xref.16>, extra=extra@entry=0x0) at ../lib/zlog.c:678
#4 0x00007fccd759b2fe in route_node_delete (node=<optimized out>) at ../lib/table.c:352
#5 0x00007fccd759b445 in route_unlock_node (node=0x0) at ../lib/table.h:258
#6 route_next (node=<optimized out>) at ../lib/table.c:436
#7 route_next (node=node@entry=0x56029d89e560) at ../lib/table.c:410
#8 0x000056029b6b6b7a in if_lookup_by_name_per_ns (ns=ns@entry=0x56029d873d90, ifname=ifname@entry=0x7fccc0029340 "PortChannel1020")
at ../zebra/interface.c:312
#9 0x000056029b6b8b36 in zebra_if_dplane_ifp_handling (ctx=0x7fccc0029310) at ../zebra/interface.c:1867
#10 zebra_if_dplane_result (ctx=0x7fccc0029310) at ../zebra/interface.c:2221
#11 0x000056029b7137a9 in rib_process_dplane_results (thread=<optimized out>) at ../zebra/zebra_rib.c:4810
#12 0x00007fccd75a0e0d in thread_call (thread=thread@entry=0x7ffe8e553cc0) at ../lib/thread.c:1990
#13 0x00007fccd7559368 in frr_run (master=0x56029d65a040) at ../lib/libfrr.c:1198
#14 0x000056029b6ac317 in main (argc=9, argv=0x7ffe8e5540d8) at ../zebra/main.c:478
```
##### Work item tracking
- Microsoft ADO **(number only)**:
#### How I did it
Added patch.
#### How to verify it
Running BGP tests.
<!--
If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012.
-->
#### Which release branch to backport (provide reason below if selected)
<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->
- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305
#### Tested branch (Please provide the tested image version)
<!--
- Please provide tested image version
- e.g.
- [x] 20201231.100
-->
- [ ] <!-- image version 1 -->
- [ ] <!-- image version 2 -->
#### Description for the changelog
<!--
Write a short (one line) summary that describes the changes in this
pull request for inclusion in the changelog:
-->
<!--
Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
-->
#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->
#### A picture of a cute animal (not mandatory but encouraged)
…omatically (sonic-net#638) #### Why I did it src/sonic-swss-common ``` * 8820804 - (HEAD -> 202412, origin/HEAD, origin/202412) Handle 'bulkget' in consumer_table_pops.lua (sonic-net#970) (#11) (7 hours ago) [mssonicbld] * 8ee90ad - Handle 'bulkget' in consumer_table_pops.lua (sonic-net#970) (#10) (31 hours ago) [mssonicbld] * d302b11 - Handle 'bulkget' in consumer_table_pops.lua (sonic-net#970) (#9) (2 days ago) [mssonicbld] ``` #### How I did it #### How to verify it #### Description for the changelog
…tomatically (sonic-net#637) #### Why I did it src/sonic-linux-kernel ``` * 8c0b9cf - (HEAD -> 202412, origin/HEAD, origin/202412) [optoe] Reset page select byte to 0 before upper memory access on page 0h (sonic-net#464) (#11) (7 hours ago) [mssonicbld] * a68d9f2 - [optoe] Reset page select byte to 0 before upper memory access on page 0h (sonic-net#464) (#10) (31 hours ago) [mssonicbld] * f18e59c - [optoe] Reset page select byte to 0 before upper memory access on page 0h (sonic-net#464) (#9) (2 days ago) [mssonicbld] ``` #### How I did it #### How to verify it #### Description for the changelog
…D automatically (sonic-net#898) #### Why I did it src/sonic-platform-daemons ``` * b19ad82 - (HEAD -> 202412, origin/202412) Refactor port_port_sfp_info_to_db() to read key:value pairs directly from port_info_dict (#9) (20 hours ago) [mssonicbld] ``` #### How I did it #### How to verify it #### Description for the changelog
Why I did it
It would provide flexibility in the configurations and fine tuning of QoS Parameters.
How I did it
Update BUFFER_QUEUE, QUEUE, SCHEDULER, WRED_PROFILE, BUFFER_PROFILE and trimming related configs.
How to verify it
Verify with config qos reload, and check related json files, and SDK config.
Which release branch to backport (provide reason below if selected)
Tested branch (Please provide the tested image version)
202412