[chassis][psud] Move the PSU parent information generation to the loop run function from the initialization function#576

Merged
yxieca merged 1 commit into sonic-net:master from yejianquan:jianquanye/fix_psud
Dec 17, 2024

Conversation

@yejianquan
Contributor

Description

Move the PSU parent information generation to the loop run function from the initialization function
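The effect of the move can be illustrated with a minimal sketch. It is not the actual psud code: `FakeTable`, `PsuDaemonSketch`, `run_once`, and the field values are hypothetical stand-ins, and the table is an in-memory dict rather than STATE_DB. It only shows why writing the parent information on every loop iteration lets entries reappear after another process removes them.

```python
# Hypothetical, simplified model of the change; all names are illustrative
# and the table is an in-memory stand-in, not the real STATE_DB API.

class FakeTable:
    """In-memory stand-in for a STATE_DB table (assumption for this sketch)."""

    def __init__(self):
        self.rows = {}

    def set(self, key, fields):
        self.rows[key] = dict(fields)

    def delete(self, key):
        self.rows.pop(key, None)


class PsuDaemonSketch:
    def __init__(self, table, psu_names):
        self.table = table
        self.psu_names = psu_names
        # Before the change, the parent information was written once here.
        # After the change, nothing is written at initialization time.

    def _update_psu_entity_info(self):
        # Write the parent/position information for every PSU.
        for index, name in enumerate(self.psu_names, start=1):
            self.table.set(name, {
                "parent_name": "chassis 1",          # assumed value
                "position_in_parent": str(index),
            })

    def run_once(self):
        # Called on every iteration of the main loop, so entries removed
        # by another daemon are written again on the next pass.
        self._update_psu_entity_info()


table = FakeTable()
daemon = PsuDaemonSketch(table, ["PSU 1", "PSU 2"])
daemon.run_once()
table.delete("PSU 1")   # simulates the removal described in #575
daemon.run_once()       # the next loop pass restores the entry
print(sorted(table.rows))
```

With initialization-time generation only, the deleted entry would stay missing until psud itself restarted.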

Motivation and Context

Fixes #575

How Has This Been Tested?

Tested on a Cisco chassis: the PHYSICAL_ENTITY_INFO|PSU * entries are re-inserted after thermalctld restarts.
Also monitored the state DB memory usage for hours, and it works well:

admin@x-sup-2:~$ date && sudo ip netns exec asic0 /usr/bin/redis-cli INFO memory
Mon Dec 16 12:54:24 AM UTC 2024
# Memory
used_memory:3102440
used_memory_human:2.96M
used_memory_rss:19279872
used_memory_rss_human:18.39M
used_memory_peak:6644264
used_memory_peak_human:6.34M
used_memory_peak_perc:46.69%
used_memory_overhead:1595144
used_memory_startup:914032
used_memory_dataset:1507296
used_memory_dataset_perc:68.88%
allocator_allocated:3591336
allocator_active:4112384
allocator_resident:7749632
total_system_memory:32826040320
total_system_memory_human:30.57G
used_memory_lua:34816
used_memory_vm_eval:34816
used_memory_lua_human:34.00K
used_memory_scripts_eval:632
number_of_cached_scripts:1
number_of_functions:0
number_of_libraries:0
used_memory_vm_functions:32768
used_memory_vm_total:67584
used_memory_vm_total_human:66.00K
used_memory_functions:200
used_memory_scripts:832
used_memory_scripts_human:832B
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.15
allocator_frag_bytes:521048
allocator_rss_ratio:1.88
allocator_rss_bytes:3637248
rss_overhead_ratio:2.49
rss_overhead_bytes:11530240
mem_fragmentation_ratio:6.29
mem_fragmentation_bytes:16215752
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_total_replication_buffers:0
mem_clients_slaves:0
mem_clients_normal:554040
mem_cluster_links:0
mem_aof_buffer:0
mem_allocator:jemalloc-5.3.0
active_defrag_running:0
lazyfree_pending_objects:0
lazyfreed_objects:0

admin@x-sup-2:~$ date && sudo ip netns exec asic0 /usr/bin/redis-cli INFO memory
Mon Dec 16 03:05:14 AM UTC 2024
# Memory
used_memory:3182072
used_memory_human:3.03M
used_memory_rss:19447808
used_memory_rss_human:18.55M
used_memory_peak:6644264
used_memory_peak_human:6.34M
used_memory_peak_perc:47.89%
used_memory_overhead:1694744
used_memory_startup:914032
used_memory_dataset:1487328
used_memory_dataset_perc:65.58%
allocator_allocated:3607320
allocator_active:4112384
allocator_resident:7954432
total_system_memory:32826040320
total_system_memory_human:30.57G
used_memory_lua:34816
used_memory_vm_eval:34816
used_memory_lua_human:34.00K
used_memory_scripts_eval:632
number_of_cached_scripts:1
number_of_functions:0
number_of_libraries:0
used_memory_vm_functions:32768
used_memory_vm_total:67584
used_memory_vm_total_human:66.00K
used_memory_functions:200
used_memory_scripts:832
used_memory_scripts_human:832B
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.14
allocator_frag_bytes:505064
allocator_rss_ratio:1.93
allocator_rss_bytes:3842048
rss_overhead_ratio:2.44
rss_overhead_bytes:11493376
mem_fragmentation_ratio:6.19
mem_fragmentation_bytes:16304072
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_total_replication_buffers:0
mem_clients_slaves:0
mem_clients_normal:654600
mem_cluster_links:0
mem_aof_buffer:0
mem_allocator:jemalloc-5.3.0
active_defrag_running:0
lazyfree_pending_objects:0
lazyfreed_objects:0
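The two samples above were taken about 2 hours 11 minutes apart. A short sketch for comparing them follows; the helper names `parse_info` and `int_delta` are made up for this illustration, and only a few fields are copied in from the output above.

```python
def parse_info(text):
    """Parse `key:value` lines from `redis-cli INFO` output into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, section headers such as "# Memory", and noise.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def int_delta(before, after, key):
    """Return after[key] - before[key] for an integer-valued field."""
    return int(after[key]) - int(before[key])


# Values copied from the two samples above (12:54:24 AM and 03:05:14 AM).
first = parse_info("""# Memory
used_memory:3102440
used_memory_overhead:1595144
used_memory_dataset:1507296
mem_clients_normal:554040
""")
second = parse_info("""# Memory
used_memory:3182072
used_memory_overhead:1694744
used_memory_dataset:1487328
mem_clients_normal:654600
""")

for key in ("used_memory", "used_memory_overhead",
            "used_memory_dataset", "mem_clients_normal"):
    print(f"{key}: {int_delta(first, second, key):+d} bytes")
```

Between the two samples, used_memory grows by 79,632 bytes while used_memory_dataset shrinks by 19,968 bytes; the increase sits in used_memory_overhead (+99,600 bytes), which is close to the change in mem_clients_normal (+100,560 bytes). That reading suggests the stored data itself did not grow, though two samples are not enough to rule out a slow trend.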

Additional Information (Optional)

…p run function from the initialization function
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@yejianquan
Contributor Author

@abdosi @sdszhang @cyw233 @vperumal @anamehra for viz

@abdosi
Contributor

abdosi commented Dec 16, 2024

@mlok-nokia @arlakshm: please check whether there is any other impact from this change.

@yxieca yxieca merged commit 0d79916 into sonic-net:master Dec 17, 2024
mssonicbld pushed a commit to mssonicbld/sonic-platform-daemons that referenced this pull request Dec 18, 2024
…p run function from the initialization function (sonic-net#576)
@mssonicbld
Collaborator

Cherry-pick PR to 202405: #578

mssonicbld pushed a commit that referenced this pull request Dec 18, 2024
…p run function from the initialization function (#576)
vvolam pushed a commit to vvolam/sonic-platform-daemons that referenced this pull request Jan 3, 2025
…p run function from the initialization function (sonic-net#576)
Collaborator

@prgeor prgeor left a comment

@yejianquan did we investigate why thermalctld was deleting the PSU entry from the DB on restart? This code seems inefficient in that psud will keep updating the DB even when there is no PSU update.

We should have fixed thermalctld instead.

@yejianquan
Contributor Author

yejianquan commented Jan 27, 2025

> @yejianquan did we investigate why thermalctld was deleting the PSU entry from the DB on restart? This code seems inefficient in that psud will keep updating the DB even when there is no PSU update.
>
> We should have fixed thermalctld instead.

Hi @prgeor, yes, please take a look at #575:
the PHYSICAL_ENTITY_INFO|PSU * entries were removed when thermalctld restarted.
In my testing, the function is lightweight, and it only affects chassis devices.
But yes, a fix in thermalctld would be better.
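For context on the efficiency concern raised above, one possible way to avoid redundant writes is to compare the stored row with the desired one and write only on a mismatch. This is a sketch of that general idea and not what this PR implements; `FakeTable` and `ensure_entity_info` are hypothetical names, and the table is an in-memory stand-in.

```python
# Hypothetical sketch: write only when the row is missing or stale.
# FakeTable is an in-memory stand-in, not the real STATE_DB API.

class FakeTable:
    def __init__(self):
        self.rows = {}
        self.writes = 0   # counts writes so the saving is visible

    def get(self, key):
        return self.rows.get(key)

    def set(self, key, fields):
        self.rows[key] = dict(fields)
        self.writes += 1


def ensure_entity_info(table, key, fields):
    """Write `fields` under `key` only if the stored row differs.

    Returns True if a write happened, False otherwise.
    """
    if table.get(key) == fields:
        return False
    table.set(key, fields)
    return True


table = FakeTable()
info = {"parent_name": "chassis 1", "position_in_parent": "1"}  # assumed values

ensure_entity_info(table, "PSU 1", info)   # first pass: row missing, writes
ensure_entity_info(table, "PSU 1", info)   # second pass: unchanged, no write
table.rows.pop("PSU 1")                    # simulates removal by another daemon
ensure_entity_info(table, "PSU 1", info)   # row missing again, writes
print(table.writes)
```

The trade-off is that each loop pass still performs a read per PSU, so whether this is cheaper than an unconditional write depends on the database client and is not something the thread establishes.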

@yejianquan yejianquan deleted the jianquanye/fix_psud branch January 27, 2025 23:21
prgeor pushed a commit that referenced this pull request Feb 6, 2025
…evice is in detaching mode (#546)

* Skip logging the warning, if device is in detaching mode

* Add detach_info table and unittests

* Fix unit tests

* Increase code coverage

* Remove unused header import

* Fix dict get values

* Increase code coverage

* Increase test coverage

* [SmartSwitch] Extend implementation of the DPU chassis daemon. (#563)

* Addition of DPU Chassis for thermalctld (#564)

* [stormond] Added new dynamic field 'last_sync_time' to STATE_DB (#535)

* Added new dynamic field 'last_sync_time' that shows when STORAGE_INFO for disk was last synced to STATE_DB

* Moved 'start' message to actual starting point of the daemon

* Added functions for formatted and epoch time for user friendly time display

* Made changes per prgeor review comments

* Pivot to SysLogger for all logging

* Increased log level so that they are seen in syslogs

* Code coverage improvement

* [lag_id] Add lagid to free_list when LC absent for 30 minutes (#542)

When LC is absent for 30 minutes, the database cleanup kicks in. When LagId is released, it needs to be appended to the SYSTEM_LAG_IDS_FREE_LIST

This PR works with the following 2 PRs:
sonic-net/sonic-swss#3303
sonic-net/sonic-buildimage#20369

Signed-off-by: mlok <[email protected]>

* Fixed bug in chassisd causing incorrect number of ASICs in CHASSIS_STATE_DB (#560)

Fixed the bug in chassisd due to which incorrect number of ASICs were being pushed to CHASSIS_STATE_DB.

* thermalctld: Add support for fans on non-CPU modules (#555)

* thermalctld: Add support for fans on non-CPU modules

* Add module fan to unit tests

* Advanced Azure pipeline to Bookworm (#572)

Description
This PR advances the azure pipeline on sonic_platform_daemons from bullseye to bookworm. This fixes the issue where sonic-platform-daemons azp is having some issues due to upgrade to bookworm. See Pipelines - Run 20241210.8 logs for details.

* Take non-CMIS xcvrs out of lpmode in SFF Manager (#565)

Description
Fix non-CMIS transceivers in down state by bringing them out of low power mode in the SFF Manager Task.
This is intended to work together with the change in sonic-net/sonic-buildimage#20886.

Motivation and Context
Non-CMIS transceivers were not functioning correctly when put into Low Power mode. So XCVRD now brings them out of lpmode.

How Has This Been Tested?
Loaded an image containing this change alongside the change from sonic-net/sonic-buildimage#20886 on an Arista chassis containing a Clearwater2 linecard.
Verified that without this image some interfaces were in a down state but with the image all interfaces came up as expected.

* Added SmartSwitch support in chassisd and enabling chassisd  (#467)

Added SmartSwitch support in chassisd and enabling chassisd

* [chassis][psud] Move the PSU parent information generation to the loop run function from the initialization function (#576)

Description
Move the PSU parent information generation to the loop run function from the initialization function

Motivation and Context
Fixes #575

How Has This Been Tested?
Tested on a Cisco chassis: the PHYSICAL_ENTITY_INFO|PSU * entries are re-inserted after thermalctld restarts.
Also monitored the state DB memory usage for hours, and it works well:

* [chassisd] Address the chassisd crash issue and add UT for it (#573)

Description
On the Nokia platform, the slot name of the Supervisor is the string "A" instead of a number. Converting it with "int" can fail and produce a backtrace. The slot value should be used in checks without any conversion. This fixes sonic-net/sonic-buildimage#21131

Motivation and Context
Modify _get_module_info so that it does not convert "slot" to a string value, and modify the code so that it does not convert the slot value before any checks; the value returned by get_slot() is used directly. Also add the unit test test_moduleupdater_check_slot_string() to validate this.

How Has This Been Tested?
Tested on 202405 branch


Signed-off-by: mlok <[email protected]>

* Fix a comment

---------

Signed-off-by: mlok <[email protected]>
Co-authored-by: Oleksandr Ivantsiv <[email protected]>
Co-authored-by: Gagan Punathil Ellath <[email protected]>
Co-authored-by: Ashwin Srinivasan <[email protected]>
Co-authored-by: Marty Y. Lok <[email protected]>
Co-authored-by: Vivek Verma <[email protected]>
Co-authored-by: Patrick MacArthur <[email protected]>
Co-authored-by: Peter Bailey <[email protected]>
Co-authored-by: rameshraghupathy <[email protected]>
Co-authored-by: Jianquan Ye <[email protected]>
@gregoryboudreau
Contributor

Can this fix be brought into 202411? It was brought into master and 202405 but the underlying issue referenced in #575 is still present in 202411: https://github.com/sonic-net/sonic-platform-daemons/blob/202411/sonic-psud/scripts/psud#L409

@prgeor @yejianquan


Development

Successfully merging this pull request may close these issues.

[chassis] PSU keys(generated by psud) got removed by the restart of thermalctld and won't auto recover.

6 participants