[xcvrd] Catch TypeError and KeyError exception to protect xcvrd from crashing#283
[xcvrd] Catch TypeError and KeyError exception to protect xcvrd from crashing#283keboliu wants to merge 1 commit intosonic-net:masterfrom
Conversation
Signed-off-by: Kebo Liu <[email protected]>
|
@prgeor this is critical fix for 202205. can you please approve? |
| ('supported_min_laser_freq', str(port_info_dict['supported_min_laser_freq']) | ||
| if 'supported_min_laser_freq' in port_info_dict else 'N/A') | ||
| ]) | ||
| except (KeyError, TypeError) as e: |
There was a problem hiding this comment.
I see each field is checked if available in port_info_dict[] before accessing, so why KeyError is required? Also why TypeError req. if it just accessing the dictionary field value?
There was a problem hiding this comment.
I see each field is checked if available in port_info_dict[] before accessing, so why KeyError is required? Also why TypeError req. if it just accessing the dictionary field value?
This try to fix below crash
syslog.1:Jul 15 10:42:31.594198 r-leopard-56 INFO pmon#supervisord: xcvrd Process Process-2:
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd Traceback (most recent call last):
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd self.run()
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd self._target(*self._args, **self._kwargs)
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1679, in task_worker
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd rc = post_port_sfp_info_to_db(logical_port, self.port_mapping, self.xcvr_table_helper.get_intf_tbl(asic_index), transceiver_dict)
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 306, in post_port_sfp_info_to_db
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd fvs = swsscommon.FieldValuePairs(
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd File "/usr/lib/python3/dist-packages/swsscommon/swsscommon.py", line 229, in __init__
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd _swsscommon.FieldValuePairs_swiginit(self, _swsscommon.new_FieldValuePairs(*args))
syslog.1:Jul 15 10:42:31.598074 r-leopard-56 INFO pmon#supervisord: xcvrd TypeError: Wrong number or type of arguments for overloaded function 'new_FieldValuePairs'.
syslog.1:Jul 15 10:42:31.598074 r-leopard-56 INFO pmon#supervisord: xcvrd Possible C/C++ prototypes are:
syslog.1:Jul 15 10:42:31.598074 r-leopard-56 INFO pmon#supervisord: xcvrd std::vector< std::pair< std::string,std::string > >::vector()
syslog.1:Jul 15 10:42:31.598074 r-leopard-56 INFO pmon#supervisord: xcvrd std::vector< std::pair< std::string,std::string > >::vector(std::vector< std::pair< std::string,std::string > > const &)
syslog.1:Jul 15 10:42:31.598088 r-leopard-56 INFO pmon#supervisord: xcvrd std::vector< std::pair< std::string,std::string > >::vector(std::vector< std::pair< std::string,std::string > >::size_type)
syslog.1:Jul 15 10:42:31.598088 r-leopard-56 INFO pmon#supervisord: xcvrd std::vector< std::pair< std::string,std::string > >::vector(std::vector< std::pair< std::string,std::string > >::size_type,std::vector< std::pair< std::string,std::string > >::value_type const &)
from this backtrace, I suppose at least one of the port_info_dict[] failed to fetch the value, either because of key error, or because none object of port_info_dict. I add the 'try' ... catch try to see what exactly the exception is, but the crash will disappear and no exception caught, if I remove the try..,catch I can see it again, I am not able to know the exact exception, so I have to add both of the exceptions here.
There was a problem hiding this comment.
DOM table will be update with wrong values and user won't come to know this (i don't see anyone check the log unless any issue). Can we dig more which filed is causing the failure? I guess this is something to do with a specific optics
There was a problem hiding this comment.
@prgeor I do more tests today, I think it's not related to the cable, I have two testbeds with the same type of cable installed, but can only find issue on one of the testbeds, so I tend to believe this is a timing issue. I tried test to protect type and vendor_rev with change ('type', port_info_dict['type'] if type in port_info_dict else 'N/A') and ('vendor_rev', port_info_dict['vendor_rev'] if vendor_rev in port_info_dict else 'N/A') , seems I am not able to reproduce the crash anymore. So I would suggest catching KeyError, and skipping updating DB in case of exception, so xcvrd will be able to update the DB in the next iteration. What do you say?
There was a problem hiding this comment.
@prgeor I do more tests today, I think it's not related to the cable, I have two testbeds with the same type of cable installed, but can only find issue on one of the testbeds, so I tend to believe this is a timing issue. I tried test to protect
typeandvendor_revwith change('type', port_info_dict['type'] if type in port_info_dict else 'N/A')and('vendor_rev', port_info_dict['vendor_rev'] if vendor_rev in port_info_dict else 'N/A'), seems I am not able to reproduce the crash anymore. So I would suggest catchingKeyError, and skipping updating DB in case of exception, so xcvrd will be able to update the DB in the next iteration. What do you say?
@keboliu lets proceed with catching KeyError and also print the error message on keyError to know which field is giving error.
There was a problem hiding this comment.
@keboliu do you mind using dictionary's get() method for each of these fields,? That would make the code more readable?
There was a problem hiding this comment.
I see some more failures during the test, and need some more time to investigate.
|
close this PR and open sonic-net/sonic-platform-common#305 to fix the crash |
Signed-off-by: Kebo Liu [email protected]
Description
Catch TypeError and KeyError exception to protect xcvrd from crashing
Motivation and Context
This is to fix a crash issue: sonic-net/sonic-buildimage#11525
How Has This Been Tested?
run sonic-mgmt sfp reset test case: platform_tests/sfp/test_sfputil.py::test_check_sfputil_reset to make sure xcvrd will not crash
Additional Information (Optional)