Commit b865f3b

JibinBao authored and mssonicbld committed
Add test plan and tests for liquid cooling leakage detection (#20792)
Signed-off-by: mssonicbld <sonicbld@microsoft.com>
1 parent d495447 commit b865f3b

15 files changed

Lines changed: 1040 additions & 255 deletions
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
# Liquid cooling leakage detection test plan

* [Overview](#Overview)
* [HLD](#HLD)
* [Scope](#Scope)
* [Testbed](#Testbed)
* [Setup configuration](#Setup%20configuration)
* [Test](#Test)
* [TODO](#TODO)
* [Open questions](#Open%20questions)

## Overview
Liquid cooling technology has become essential for cooling equipment efficiently and keeping it operating properly. To address the potential dangers of a liquid cooling leak, it is crucial to have a monitoring mechanism that alerts the system immediately when a leak occurs. The purpose of this test is to verify the functionality of leakage detection.

### HLD
- Feature HLD: https://github.com/sonic-net/SONiC/pull/2032/

### Scope
The test verifies the functionality of leakage detection on devices that have a liquid cooling system.

### Testbed
Any

### Setup configuration
Common tests configuration:
- Check whether the device has a liquid cooling system. If it does, run the following tests; otherwise skip them.
- A device has a liquid cooling system when the key enable_liquid_cooling exists in pmon_daemon_control.json and its value is true.
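
The support check described above can be sketched as follows. This is a minimal illustration rather than the test's actual implementation: it assumes the content of pmon_daemon_control.json has already been read into a string, and the helper name `has_liquid_cooling` is hypothetical.

```python
import json


def has_liquid_cooling(pmon_daemon_control_text):
    """Return True only if enable_liquid_cooling exists and is true.

    `pmon_daemon_control_text` is assumed to be the raw JSON content of
    pmon_daemon_control.json; this helper name is hypothetical.
    """
    control = json.loads(pmon_daemon_control_text)
    # A missing key and an explicit false are treated the same way: skip.
    return control.get("enable_liquid_cooling") is True


print(has_liquid_cooling('{"enable_liquid_cooling": true}'))
print(has_liquid_cooling('{"skip_ledd": true}'))
```

A test would typically call such a helper once during setup and skip the whole module when it returns False.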

Common tests cleanup:
- No.


## Test
### Test case #1 test_verify_liquid_senors_number_and_status
#### Test objective
Verify that the number of liquid sensors equals the configured number and that the status of each sensor is OK
#### Test steps
* Verify that the number of liquid sensors equals the configured number
* Verify there are no leaks
  * Check that the status of all leak sensors is 'NO' in the output of the 'show platform leakage status' command
  * Check that the status of all leak sensors is 'OK' in the output of the 'show system-health detail' command
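
The sensor count and no-leak checks can be sketched as below. This is a sketch under assumptions: the command output is taken to be already parsed into a list of dicts with `name` and `leak` keys, and the helper name `check_no_leaks` is hypothetical.

```python
def check_no_leaks(leakage_rows, expected_count):
    """Check parsed rows of 'show platform leakage status'.

    Each row is assumed to be a dict with 'name' and 'leak' keys.
    Returns a list of problems; an empty list means the check passed.
    """
    problems = []
    if len(leakage_rows) != expected_count:
        problems.append(
            f"expected {expected_count} sensors, found {len(leakage_rows)}")
    for row in leakage_rows:
        # Compare case-insensitively, since the CLI casing may vary.
        if row["leak"].lower() != "no":
            problems.append(f"{row['name']} reports leak={row['leak']}")
    return problems


rows = [{"name": "leakage1", "leak": "NO"}, {"name": "leakage2", "leak": "NO"}]
print(check_no_leaks(rows, expected_count=2))
```

Collecting every problem before failing makes a single test run report all misbehaving sensors at once.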

### Test case #2 test_mock_liquid_leak_event
#### Test objective
1. Mock a liquid leak event and verify that the DUT responds correctly
2. Mock the fix of the liquid leak event and verify that the DUT responds correctly
#### Test steps
* Randomly select one or several sensors on which to mock a leak event. Taking leakage1 as an example:
  * Save the value of /var/run/hw-management/system/leakage1 and unlink it
  * Create a file /var/run/hw-management/system/leakage1
  * Echo 0 to /var/run/hw-management/system/leakage1 to mock the leak event
* Sleep for liquid_cooling_update_interval (the default value is 0.5s)
* Verify that state db has been updated to 'YES' for the mocked sensors
* Verify that syslog has the corresponding GNMI event log indicating that the liquid leakage event occurred, and that the message has been sent out
* Verify there are leaks for the mocked sensors
  * Check that the status of the mocked sensors is 'Yes' in the output of the 'show platform leakage status' command
  * Check that the status of the mocked sensors is 'Not OK' in the output of the 'show system-health detail' command
* Restore the liquid sensor
* Sleep for liquid_cooling_update_interval
* Verify that state db has been updated to 'No' for the mocked sensors
* Verify that syslog has the corresponding GNMI event log indicating that the liquid leakage event has been fixed
* Verify that the leaks for the mocked sensors have been fixed
  * Check that the status of the mocked sensors is 'NO' in the output of the 'show platform leakage status' command
  * Check that the status of the mocked sensors is 'OK' in the output of the 'show system-health detail' command
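
The save, unlink, create, write, and restore sequence can be sketched as a context manager. This is an illustration only: it runs on a scratch directory instead of /var/run/hw-management, assumes the sensor path is either a symlink or a regular file, and the name `mocked_leak` is hypothetical.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def mocked_leak(sensor_path):
    """Temporarily replace a sensor file with a regular file containing 0.

    Assumption: the sensor path is either a symlink or a regular file, and
    writing 0 marks a leak (as in the steps above).
    """
    link_target = os.readlink(sensor_path) if os.path.islink(sensor_path) else None
    saved_value = None
    if link_target is None:
        with open(sensor_path) as f:
            saved_value = f.read()
    os.unlink(sensor_path)
    try:
        with open(sensor_path, "w") as f:
            f.write("0\n")
        yield
    finally:
        # Restore the sensor exactly as it was found.
        if os.path.lexists(sensor_path):
            os.unlink(sensor_path)
        if link_target is not None:
            os.symlink(link_target, sensor_path)
        else:
            with open(sensor_path, "w") as f:
                f.write(saved_value)


# Demonstration on a scratch directory instead of /var/run/hw-management.
with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "real_leakage1")
    sensor = os.path.join(tmp, "leakage1")
    with open(real, "w") as f:
        f.write("1\n")
    os.symlink(real, sensor)
    with mocked_leak(sensor):
        with open(sensor) as f:
            print(f.read().strip())  # value while the leak is mocked
    with open(sensor) as f:
        print(f.read().strip())  # value after restore
```

Restoring in a `finally` clause means the sensor is put back even when a verification step in the middle of the test fails.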

### Test case #3 Extend check_sysfs
#### Test objective
Extend check_sysfs so that the liquid cooling leakage sysfs is verified after the DUT reboots or reloads its config
#### Test steps
* Extend check_sysfs to check the sysfs entries related to liquid cooling leakage

### Test case #4 Platform API get_name
#### Test objective
Verify that get_name returns the correct value
#### Test steps
* Call get_name and verify that it returns the correct value, such as leakage1, leakage2, ...

### Test case #5 Platform API is_leak
#### Test objective
Verify that is_leak returns the correct value
#### Test steps
* Call is_leak and verify that it returns False

### Test case #6 Platform API get_leak_sensor_status
#### Test objective
Verify that get_leak_sensor_status returns the correct value
#### Test steps
* Call get_leak_sensor_status and verify that it returns an empty list

### Test case #7 Platform API get_num_leak_sensors
#### Test objective
Verify that get_num_leak_sensors returns the correct value
#### Test steps
* Call get_num_leak_sensors and verify that the return value equals the number of leak sensors defined in platform.json

### Test case #8 Platform API get_all_leak_sensors
#### Test objective
Verify that get_all_leak_sensors returns the correct value
#### Test steps
* Call get_all_leak_sensors and verify that the return value matches the number of leak sensors defined in platform.json
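
The platform API checks in test cases #4 to #8 can be illustrated against stand-in objects. Everything here beyond the method names is an assumption: `FakeLeakSensor` and `FakeLiquidCooling` are hypothetical stubs, which object owns which method is a guess, and `get_all_leak_sensors` is assumed to return a list whose length is compared with the configured sensor count.

```python
class FakeLeakSensor:
    """Hypothetical stand-in for a platform leak sensor object."""

    def __init__(self, name, leaking=False):
        self._name = name
        self._leaking = leaking

    def get_name(self):
        return self._name

    def is_leak(self):
        return self._leaking


class FakeLiquidCooling:
    """Hypothetical stand-in for the object that owns the leak sensors."""

    def __init__(self, sensors):
        self._sensors = list(sensors)

    def get_num_leak_sensors(self):
        return len(self._sensors)

    def get_all_leak_sensors(self):
        return list(self._sensors)

    def get_leak_sensor_status(self):
        # Assumed semantics: names of the sensors that currently report a leak.
        return [s.get_name() for s in self._sensors if s.is_leak()]


def check_platform_api(cooling, expected_num):
    """Run the checks from test cases #4 to #8 on a healthy device."""
    sensors = cooling.get_all_leak_sensors()
    assert cooling.get_num_leak_sensors() == expected_num
    assert len(sensors) == expected_num
    for index, sensor in enumerate(sensors, start=1):
        assert sensor.get_name() == f"leakage{index}"
        assert sensor.is_leak() is False
    assert cooling.get_leak_sensor_status() == []


cooling = FakeLiquidCooling(FakeLeakSensor(f"leakage{i}") for i in (1, 2))
check_platform_api(cooling, expected_num=2)
```

In a real run, `expected_num` would come from platform.json and the objects from the DUT's platform API rather than from stubs.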


## TODO


## Open questions
Lines changed: 260 additions & 0 deletions
@@ -0,0 +1,260 @@
import logging
import json
import os
import re
import pytest
import ast
from tests.common.helpers.sensor_control_test_helper import BaseMocker
from tests.common.helpers.assertions import pytest_require as pyrequire
from tests.common.helpers.dut_utils import check_container_state
from tests.common.helpers.gnmi_utils import gnmi_container
from tests.common import config_reload
# The interval of EVENT_PUBLISHED is 60 seconds by default.
# To leave some buffer, the timeout for the gnmi LD event is set to 90 seconds.
WAIT_GNMI_LD_EVENT_TIMEOUT = 90
# To leave some buffer for the thread timeout, the timeout for the gnmi event is set to 120 seconds.
WAIT_GNMI_EVENT_TIMEOUT = WAIT_GNMI_LD_EVENT_TIMEOUT + 30


class LiquidLeakageMocker(BaseMocker):
    """
    Liquid leakage mocker. Vendors should implement this class to provide a liquid leakage mocker.
    This class can mock the liquid leakage detection status.
    """

    def mock_leakage(self):
        """
        Change the mocked liquid leakage detection status to 'Leakage'.
        :return:
        """
        pass

    def mock_no_leakage(self):
        """
        Change the mocked liquid leakage detection status to 'No Leakage'.
        :return:
        """
        pass

    def verify_leakage(self):
        """
        Verify that the DUT reports a leakage.
        :return:
        """
        pass

    def verify_no_leakage(self):
        """
        Verify that the DUT reports no leakage.
        :return:
        """
        pass


def get_leakage_status(dut):
    """
    Get the leakage status of the DUT.
    :param dut: DUT object representing a SONiC switch under test.
    :return: The leakage status of the DUT.
    """
    return dut.show_and_parse("show platform leakage status")


def get_leakage_status_in_health_system(dut):
    """
    Get the leakage entries from the system health status of the DUT.
    :param dut: DUT object representing a SONiC switch under test.
    :return: The system health entries whose name starts with 'leakage'.
    """
    system_health_status = dut.show_and_parse("sudo show system-health detail")
    system_health_leakage_status_list = []
    for status in system_health_status:
        if status['name'].startswith('leakage'):
            system_health_leakage_status_list.append(status)
    logging.info(f"System health leakage status list: {system_health_leakage_status_list}")
    return system_health_leakage_status_list


def get_state_db(dut):
    """
    Dump STATE_DB of the DUT.
    :param dut: DUT object representing a SONiC switch under test.
    :return: The content of STATE_DB as a dict.
    """
    return ast.literal_eval(dut.shell('sonic-db-dump -n STATE_DB -y')['stdout'])


def verify_leakage_status(dut, leakage_index_list, expected_status):
    """
    Verify the leak status of the DUT.
    :param dut: DUT object representing a SONiC switch under test.
    :param leakage_index_list: Indexes of the leak sensors to check.
    :param expected_status: Expected leak status of the checked sensors.
    :return: True if all checks pass; otherwise an assertion fails.
    """
    logging.info(f"Verify leakage status of {leakage_index_list} is : {expected_status}")
    leakage_status_list = get_leakage_status(dut)
    failed_leakage_list = []
    success_leakage_list = []
    for index in leakage_index_list:
        for leak_status in leakage_status_list:
            if leak_status['name'] == f"leakage{index}":
                if leak_status['leak'].lower() != expected_status.lower():
                    failed_leakage_list.append(index)
                    logging.info(f"Leakage status is not as expected: {leak_status}")
                else:
                    success_leakage_list.append(index)
                    logging.info(f"Leakage status is as expected: {leak_status}")
    assert len(failed_leakage_list) == 0, f"Leakage status is not as expected: {failed_leakage_list}"
    assert len(success_leakage_list) == len(leakage_index_list), \
        f"Not all leakage status are detected: test leakage index list: {leakage_index_list}, " \
        f"success leakage index list: {success_leakage_list}"
    return True


def verify_leakage_status_in_health_system(dut, leakage_index_list, expected_status):
    """
    Verify the leakage status in health system of the DUT.
    :param dut: DUT object representing a SONiC switch under test.
    :param leakage_index_list: Indexes of the leak sensors to check.
    :param expected_status: Expected health status of the checked sensors.
    :return: True if all checks pass; otherwise an assertion fails.
    """
    logging.info(f"Verify leakage status in health system of {leakage_index_list} is: {expected_status}")
    health_system_leakage_status_list = get_leakage_status_in_health_system(dut)
    failed_leakage_list = []
    success_leakage_list = []
    for index in leakage_index_list:
        for leak_status in health_system_leakage_status_list:
            if f"leakage{index}" == leak_status['name']:
                if leak_status['status'].lower() != expected_status.lower():
                    failed_leakage_list.append(index)
                    logging.info(f"Leakage status in health system is not as expected: {leak_status}")
                else:
                    success_leakage_list.append(index)
                    logging.info(f"Leakage status in health system is as expected: {leak_status}")
    assert len(failed_leakage_list) == 0, f"Leakage status is not as expected: {failed_leakage_list}"
    assert len(success_leakage_list) == len(leakage_index_list), \
        f"Not all leakage status are detected: test leakage index list: {leakage_index_list}, " \
        f"success leakage index list: {success_leakage_list}"
    return True


def verify_leakage_status_in_state_db(dut, leakage_index_list, expected_status):
    """
    Verify the leakage status in state db of the DUT.
    :param dut: DUT object representing a SONiC switch under test.
    :param leakage_index_list: Indexes of the leak sensors to check.
    :param expected_status: Expected leak_status value in state db (compared exactly).
    :return: True if all checks pass; otherwise an assertion fails.
    """
    logging.info(f"Verify leakage status in state db of {leakage_index_list} is: {expected_status}")
    state_db = get_state_db(dut)
    failed_leakage_list = []
    success_leakage_list = []
    for index in leakage_index_list:
        leak_status = state_db.get(f"LIQUID_COOLING_INFO|leakage{index}", {}).get("value", {}).get("leak_status")
        if leak_status != expected_status:
            failed_leakage_list.append(index)
            logging.info(f"Leakage status in state db is not as expected: {leak_status}")
        else:
            success_leakage_list.append(index)
            logging.info(f"Leakage status in state db is as expected: {leak_status}")
    assert len(failed_leakage_list) == 0, f"Leakage status is not as expected: {failed_leakage_list}"
    assert len(success_leakage_list) == len(leakage_index_list), \
        f"Not all leakage status are detected: test leakage index list: {leakage_index_list}, " \
        f"success leakage index list: {success_leakage_list}"
    return True


def verify_gnmi_msg_is_sent(leakage_index_list, gnmi_result, msg_type):
    """
    Verify that the expected gnmi msg was sent for each sensor.
    :param leakage_index_list: Indexes of the leak sensors to check.
    :param gnmi_result: gnmi result of the DUT.
    :param msg_type: "leaking" for a leak event; any other value checks for the recovery event.
    :return: True if all checks pass; otherwise an assertion fails.
    """
    logging.info(
        f"Verify gnmi msg is sent for {leakage_index_list} with type: {msg_type} \n gnmi result: {gnmi_result}")
    msg_common_prefix = "sonic-events-host:liquid-cooling-leak"
    for index in leakage_index_list:
        if msg_type == "leaking":
            expected_msg_regex = f".*{msg_common_prefix}.*sensor report leaking event.*leakage{index}.*"
        else:
            expected_msg_regex = f".*{msg_common_prefix}.*leaking sensor report recoveried.*leakage{index}.*"
        assert re.search(expected_msg_regex, gnmi_result), f"Gnmi msg is not as expected: {gnmi_result}"
    return True


def startmonitor_gnmi_event(duthost, ptfhost):
    """
    Monitor the gnmi event of the DUT.
    :param duthost: DUT object representing a SONiC switch under test.
    :param ptfhost: PTF object representing the PTF host used to run the gnmi client.
    :return: The stdout of the gnmi subscribe command.
    """
    dut_mgmt_ip = duthost.mgmt_ip
    timeout = WAIT_GNMI_LD_EVENT_TIMEOUT
    gnmi_subscribe_cmd = f"python /root/gnxi/gnmi_cli_py/py_gnmicli.py -g -t {dut_mgmt_ip} -p 50052 -m subscribe \
        -x all[heartbeat=2] -xt EVENTS -o ndastreamingservertest --subscribe_mode 0 --submode 1 --interval 0 \
        --update_count 0 --create_connections 1 --filter_event_regex sonic-events-host --timeout {timeout} "
    result = ptfhost.shell(gnmi_subscribe_cmd, module_ignore_errors=True)['stdout']
    logging.info(f"gnmi subscribe cmd: {gnmi_subscribe_cmd} \n gnmi event result: {result}")
    return result


def get_pmon_daemon_control_dict(dut):
    """
    Get the pmon daemon control dict of the DUT.
    :param dut: DUT object representing a SONiC switch under test.
    :return: The pmon daemon control dict of the DUT.
    """
    pmon_daemon_control_file_path = os.path.join(
        "/usr/share/sonic/device", dut.facts["platform"], "pmon_daemon_control.json")
    return json.loads(dut.shell(f"cat {pmon_daemon_control_file_path} ")['stdout'])


def is_liquid_cooling_system_supported(dut):
    """
    Check if the liquid cooling system is supported on the DUT.
    :param dut: DUT object representing a SONiC switch under test.
    :return: True if the liquid cooling system is supported, False otherwise.
    """
    pmon_daemon_control_dict = get_pmon_daemon_control_dict(dut)
    if pmon_daemon_control_dict.get("enable_liquid_cooling"):
        logging.info("Liquid cooling system is supported")
        return True
    else:
        logging.info("Liquid cooling system is not supported")
        return False


def get_liquid_cooling_update_interval(dut):
    """
    Get the liquid cooling update interval of the DUT.
    :param dut: DUT object representing a SONiC switch under test.
    :return: The liquid cooling update interval of the DUT.
    """
    pmon_daemon_control_dict = get_pmon_daemon_control_dict(dut)
    return pmon_daemon_control_dict.get("liquid_cooling_update_interval")


@pytest.fixture(scope="function")
def setup_gnmi_server(duthosts, rand_one_dut_hostname, localhost, ptfhost):
    '''
    Setup GNMI server with client certificates
    '''
    duthost = duthosts[rand_one_dut_hostname]

    # Check if GNMI is enabled on the device
    pyrequire(
        check_container_state(duthost, gnmi_container(duthost), should_be_running=True),
        "Test was not supported on devices which do not support GNMI!")
    duthost.shell("sonic-db-cli CONFIG_DB hset 'GNMI|gnmi' port 50052")
    duthost.shell("sonic-db-cli CONFIG_DB hset 'GNMI|gnmi' client_auth true")
    duthost.shell("sonic-db-cli CONFIG_DB hset 'GNMI|certs' ca_crt /etc/sonic/telemetry/dsmsroot.cer")
    duthost.shell(
        "sonic-db-cli CONFIG_DB hset 'GNMI|certs' server_crt /etc/sonic/telemetry/streamingtelemetryserver.cer")
    duthost.shell(
        "sonic-db-cli CONFIG_DB hset 'GNMI|certs' server_key /etc/sonic/telemetry/streamingtelemetryserver.key")
    duthost.shell('sonic-db-cli CONFIG_DB HSET "GNMI|gnmi" "client_auth" "false"')
    duthost.shell('sudo systemctl reset-failed gnmi')
    duthost.shell('sudo service gnmi restart')

    yield

    logging.info("Recover gnmi config")
    config_reload(duthost, safe_reload=True)