Skip to content

Fix LogAnalyzer to force load regular expressions from common files#5193

Merged
wangxin merged 1 commit intosonic-net:masterfrom
ppikh:fix_loganalyzer
Mar 23, 2022
Merged

Fix LogAnalyzer to force load regular expressions from common files#5193
wangxin merged 1 commit intosonic-net:masterfrom
ppikh:fix_loganalyzer

Conversation

@ppikh
Copy link
Contributor

@ppikh ppikh commented Feb 22, 2022

Signed-off-by: Petro Pikh [email protected]

Description of PR

Fix LogAnalyzer to force load regular expressions from common files

Looks like commit #3235 caused issue that LA has empty list of match errors expressions.

Summary: Fix LogAnalyzer to force load regular expressions from common files
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 201911

Approach

What is the motivation for this PR?

LogAnalyzer did not work, this PR is fix for LA

How did you do it?

See code

How did you verify/test it?

Executed tests which using LA, checked that LA failed in case when we have errors in logs

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@ppikh ppikh requested a review from a team as a code owner February 22, 2022 08:25
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to run "load_common_config" in lines 55-56 while creating analyzer?

And line 33-35 should be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @wangxin If add "load_common_config" into line 55-56 - self.match_regex will be empty after line 57.
"load_common_config" - should be called after "parallel_run"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ppikh I tested below patch, it works from my side. Can you try that?

diff --git a/tests/common/plugins/loganalyzer/__init__.py b/tests/common/plugins/loganalyzer/__init__.py
index 32015d16f7..9f80fb3901 100644
--- a/tests/common/plugins/loganalyzer/__init__.py
+++ b/tests/common/plugins/loganalyzer/__init__.py
@@ -30,18 +30,13 @@ def analyzer_add_marker(analyzers, node=None, results=None):
     loganalyzer = analyzers[node.hostname]
     logging.info("Add start marker into DUT syslog for host {}".format(node.hostname))
     marker = loganalyzer.init()
-    logging.info("Load config and analyze log for host {}".format(node.hostname))
-    # Read existed common regular expressions located with legacy loganalyzer module
-    loganalyzer.load_common_config()
     results[node.hostname] = marker
 
-
 @reset_ansible_local_tmp
 def analyze_logs(analyzers, markers, node=None, results=None):
     dut_analyzer = analyzers[node.hostname]
     dut_analyzer.analyze(markers[node.hostname])
 
-
 @pytest.fixture(autouse=True)
 def loganalyzer(duthosts, request):
     if request.config.getoption("--disable_loganalyzer") or "disable_loganalyzer" in request.keywords:
@@ -53,7 +48,10 @@ def loganalyzer(duthosts, request):
     analyzers = {}
     parallel_run(analyzer_logrotate, [], {}, duthosts, timeout=120)
     for duthost in duthosts:
-        analyzers[duthost.hostname] = LogAnalyzer(ansible_host=duthost, marker_prefix=request.node.name)
+        analyzer = LogAnalyzer(ansible_host=duthost, marker_prefix=request.node.name)
+        analyzer.load_common_config()
+        analyzers[duthost.hostname] = analyzer
+
     markers = parallel_run(analyzer_add_marker, [analyzers], {}, duthosts, timeout=120)
 
     yield analyzers
@@ -63,4 +61,3 @@ def loganalyzer(duthosts, request):
         return
     logging.info("Starting to analyse on all DUTs")
     parallel_run(analyze_logs, [analyzers, markers], {}, duthosts, timeout=120)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @wangxin
I did fixes, it works.
Thanks

@wangxin
Copy link
Collaborator

wangxin commented Feb 23, 2022

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@liat-grozovik
Copy link
Collaborator

/azp run Azure.sonic-mgmt

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Force loading regular expression from the common files

@ppikh ppikh force-pushed the fix_loganalyzer branch from 18ea0b9 to 468875b Compare March 18, 2022 16:17
@wangxin
Copy link
Collaborator

wangxin commented Mar 21, 2022

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ppikh
Copy link
Contributor Author

ppikh commented Mar 21, 2022

Hi @wangxin @roysr-nv @liat-grozovik
Checkers after this change will never pass due to errors in logs:

LogAnalyzerError: {'match_messages': {'/tmp/syslog.vlab-01.2022-03-21-05:27:42': ["Mar 21 05:27:36.714638 vlab-01 ERR teamd#tlm_teamd: :- get_dump: Can't get dump for LAG 'PortChannel999'. Skipping\n"]}, 'total': {'expected_match': 0, 'expected_missing_match': 0, 'match': 1}, 'match_files': {'/tmp/syslog.vlab-01.2022-03-21-05:27:42': {'expected_match': 0, 'match': 1}}, 'expect_messages': {'/tmp/syslog.vlab-01.2022-03-21-05:27:42': []}, 'unused_expected_regexp': []}
LogAnalyzerError: {'match_messages': {'/tmp/syslog.vlab-01.2022-03-21-05:04:08': ['Mar 21 05:03:41.679273 vlab-01 ERR sonic_yang: Data Loading Failed:Leafref "../../LOOPBACK_INTERFACE_LIST/name" of value "Loopback0" points to a non-existing leaf.\n', 'Mar 21 05:03:41.742485 vlab-01 ERR sonic_yang: Data Loading Failed:Leafref "../../LOOPBACK_INTERFACE_LIST/name" of value "Loopback0" points to a non-existing leaf.\n']}, 'total': {'expected_match': 0, 'expected_missing_match': 0, 'match': 2}, 'match_files': {'/tmp/syslog.vlab-01.2022-03-21-05:04:08': {'expected_match': 0, 'match': 2}}, 'expect_messages': {'/tmp/syslog.vlab-01.2022-03-21-05:04:08': []}, 'unused_expected_regexp': []}
LogAnalyzerError: {'match_messages': {'/tmp/syslog.vlab-01.2022-03-21-04:42:01': ['Mar 21 04:40:11.071343 vlab-01 ERR syncd#syncd: :- translateVidToRid: unable to get RID for VID oid:0x5c0000000006c9\n']}, 'total': {'expected_match': 0, 'expected_missing_match': 0, 'match': 1}, 'match_files': {'/tmp/syslog.vlab-01.2022-03-21-04:42:01': {'expected_match': 0, 'match': 1}}, 'expect_messages': {'/tmp/syslog.vlab-01.2022-03-21-04:42:01': []}, 'unused_expected_regexp': []}

Can someone merge this commit?

@wangxin
Copy link
Collaborator

wangxin commented Mar 23, 2022

Because loganalyzer was broken for a long time, many error logs sneaked in. We should have this PR merged anyway because PR test failures were caused by error logs sneaked in recently. Extra efforts are needed to address those error logs. We should not blame loganalyzer for the surfaced new test errors after this PR is merged.

@wangxin wangxin merged commit 03cccf7 into sonic-net:master Mar 23, 2022
@liuh-80
Copy link
Contributor

liuh-80 commented Mar 23, 2022

@ppikh, seems this PR break some UT, here is one of my test PR: https://github.com/Azure/sonic-buildimage/pull/10329/checks?check_run_id=5656011837

Issue seems related with change in tests/common/plugins/loganalyzer/init.py

@ppikh
Copy link
Contributor Author

ppikh commented Mar 23, 2022

Hi @liuh-80
It did not break.
Your tests failed due to errors in logs:

LogAnalyzerError: {'match_messages': {'/tmp/syslog.vlab-01.2022-03-23-06:31:04': ["Mar 23 06:30:48.077089 vlab-01 ERR teamd#tlm_teamd: :- get_dump: Can't get dump for LAG 'PortChannel999'. Skipping\n"]}, 'total': {'expected_match': 0, 'expected_missing_match': 0, 'match': 1}, 'match_files': {'/tmp/syslog.vlab-01.2022-03-23-06:31:04': {'expected_match': 0, 'match': 1}}, 'expect_messages': {'/tmp/syslog.vlab-01.2022-03-23-06:31:04': []}, 'unused_expected_regexp': []}

@wen587
Copy link
Contributor

wen587 commented Mar 23, 2022

Hi I met similiar issue. https://dev.azure.com/mssonic/build/_build/results?buildId=83555&view=logs&j=7ecf62ce-878c-5860-615d-9d0f5ab291f1&t=50051680-b4c8-5a0e-b7f9-3520ce72ee4a&l=1860

ERR sonic_yang: Data Loading Failed:All Keys are not parsed in TACPLUS#012dict_keys(['global'])\n"] I met this failure before during my local test, it is because configDB contains empty table which result the test does not pass yang validation.
However, previous generic_config_updater/test_aaa.py tests will also validate yang each time. So previous passing test, such as test_aaa_tc4_remove PASSED, means there is no empty table during the testing environment, otherwise it will fail anyway.
So it confused me that why the testing environment was changed unexpected during the test somehow.

@liuh-80
Copy link
Contributor

liuh-80 commented Mar 24, 2022

@ppikh, seems this PR break some UT, here is one of my test PR: https://github.com/Azure/sonic-buildimage/pull/10329/checks?check_run_id=5656011837

Issue seems related with change in tests/common/plugins/loganalyzer/init.py

Thanks, I check the UT source code and found the UT failed by design, https://github.com/Azure/sonic-mgmt/blob/2333bbcc25ef419580e744fe2c5d79463f5c9afc/tests/pc/test_po_update.py

Seems the UT not failed before, because the log analyzer not catch the message before this change.

I'm not author of this UT and there seems other UT also failed. I will wait for the UT fixed.

@wen587
Copy link
Contributor

wen587 commented Mar 24, 2022

Hi I met similiar issue. https://dev.azure.com/mssonic/build/_build/results?buildId=83555&view=logs&j=7ecf62ce-878c-5860-615d-9d0f5ab291f1&t=50051680-b4c8-5a0e-b7f9-3520ce72ee4a&l=1860

ERR sonic_yang: Data Loading Failed:All Keys are not parsed in TACPLUS#012dict_keys(['global'])\n"] I met this failure before during my local test, it is because configDB contains empty table which result the test does not pass yang validation. However, previous generic_config_updater/test_aaa.py tests will also validate yang each time. So previous passing test, such as test_aaa_tc4_remove PASSED, means there is no empty table during the testing environment, otherwise it will fail anyway. So it confused me that why the testing environment was changed unexpected during the test somehow.

I did some research and found that the GCU will first sort the json_change to make sure it follows yang model rules. When some sorting doesn't comfort to the yang model rules, it will stop current sorting and just try other sorting method.
The log analyzer caught the ERR and fail immediately.
To fix GCU test, I can add disable log analyzer for GCU tests.

@liat-grozovik
Copy link
Collaborator

liat-grozovik commented Mar 24, 2022 via email

@wen587
Copy link
Contributor

wen587 commented Mar 24, 2022

Hi @liat-grozovik , for GCU test it is by design. As it will discard incorrect ordering and finally provide a correct ordering for JsonPatch. It is the previous incorrect ordering fail the log analyzer check. However, it doesn't fail because it will find a correct ordering and apply it.
I will discuss with GCU team to see if there is better way to handle it.

wangxin pushed a commit that referenced this pull request Mar 27, 2022
…5193)

What is the motivation for this PR?
Looks like commit #3235 caused issue that LA has empty list of match errors expressions.

How did you do it?
Fix LogAnalyzer to force load regular expressions from common files

How did you verify/test it?
Executed tests which using LA, checked that LA failed in case when we have errors in logs

Signed-off-by: Petro Pikh <[email protected]>
wangxin added a commit that referenced this pull request Mar 30, 2022
wangxin added a commit that referenced this pull request Mar 30, 2022
… files (#5193)" (#5433)

This reverts commit 03cccf7.

Reverts #5193

After this fix was merged, PR test keeps failing because of errors in syslog.

We spent some effort trying to temporarily ignore the errors. However, the list seems endless.
Please refer to:

[loganalyzer]add log patterns to the common ignore #5411
Add loganalyzer ignore regex for GCU #5391
We need a way to temporarily unblock PR testing. Let's revert this fix for now. Then I'll submit another PR to fix the
loganalyzer issue together with a complete ignore list.
wangxin added a commit that referenced this pull request Mar 30, 2022
… files (#5193)" (#5433)

This reverts commit 03cccf7.

Reverts #5193

After this fix was merged, PR test keeps failing because of errors in syslog.

We spent some effort trying to temporarily ignore the errors. However, the list seems endless.
Please refer to:

[loganalyzer]add log patterns to the common ignore #5411
Add loganalyzer ignore regex for GCU #5391
We need a way to temporarily unblock PR testing. Let's revert this fix for now. Then I'll submit another PR to fix the
loganalyzer issue together with a complete ignore list.
wangxin pushed a commit to wangxin/sonic-mgmt that referenced this pull request Apr 2, 2022
Loganalyzer was broken in PR sonic-net#3235. The issue is that common config was loaded
in subprocess for adding marks to syslog. After the subprocess exited, the
common config is lost.

PR sonic-net#5193 tried to fix this issue. However, because of many new error logs sneaked
in when log analyzer was not working, PR testing started to fail by these error
logs after PR sonic-net#5193 was merged.

PR sonic-net#5191 and sonic-net#5411 tried to workaround the PR testing failure to unblock PR
testing. PR sonic-net#5191 is to address the GCU related error logs and was merged.
PR sonic-net#5411 tried to add other error logs to the common ignore list. But the effort
took too long because the ignore list seemed endless.

To unblock PR testing as soon as possible, the original fix sonic-net#5193 was reverted
in sonic-net#5433.

This PR tries to complete the work left over from sonic-net#5411 and sonic-net#5433.

Changes:
1. Fix the log analyzer common config not loaded issue.
2. Temporarily add error logs to the common ignore list.
3. Improve the logging of log analyzer and parallel_run
4. PR testing t0_part2 takes much more time than t0_part1 after the GCU test
   scripts are added. This change re-balanced t0 part1&part2 testing by moving
   some of the tests from part2 to part1.
5. Sorted the PR testing scripts in alphabetic order.

Signed-off-by: Xin Wang <[email protected]>
wangxin added a commit that referenced this pull request Apr 11, 2022
What is the motivation for this PR?
Loganalyzer was broken in PR #3235. The issue is that common config was loaded
in subprocess for adding marks to syslog. After the subprocess exited, the
common config is lost.

PR #5193 tried to fix this issue. However, because of many new error logs sneaked
in when log analyzer was not working, PR testing started to fail by these error
logs after PR #5193 was merged.

PR #5391 and #5411 tried to work around the PR testing failure to unblock PR
testing. PR #5391 is to address the GCU related error logs and was merged.
PR #5411 tried to add other error logs to the common ignore list. But the effort
took too long because the ignore list seemed endless.

To unblock PR testing as soon as possible, the original fix #5193 was reverted
in #5433.

This PR tries to complete the work left over from #5411 and #5433.

How did you do it?
Changes:
* Fix the log analyzer common config not loaded issue.
* Temporarily add error logs to the common ignore list.
* Improve the logging of log analyzer and parallel_run
* PR testing t0_part2 takes much more time than t0_part1 after the GCU test
  scripts are added. This change re-balanced t0 part1&part2 testing by moving
  some of the tests from part2 to part1.
* Sorted the PR testing scripts in alphabetic order.

How did you verify/test it?
Tested run a few test scripts with log analyzer enabled on KVM testbed.

Signed-off-by: Xin Wang <[email protected]>
wangxin added a commit that referenced this pull request Apr 11, 2022
What is the motivation for this PR?
Loganalyzer was broken in PR #3235. The issue is that common config was loaded
in subprocess for adding marks to syslog. After the subprocess exited, the
common config is lost.

PR #5193 tried to fix this issue. However, because of many new error logs sneaked
in when log analyzer was not working, PR testing started to fail by these error
logs after PR #5193 was merged.

PR #5391 and #5411 tried to work around the PR testing failure to unblock PR
testing. PR #5391 is to address the GCU related error logs and was merged.
PR #5411 tried to add other error logs to the common ignore list. But the effort
took too long because the ignore list seemed endless.

To unblock PR testing as soon as possible, the original fix #5193 was reverted
in #5433.

This PR tries to complete the work left over from #5411 and #5433.

How did you do it?
Changes:
* Fix the log analyzer common config not loaded issue.
* Temporarily add error logs to the common ignore list.
* Improve the logging of log analyzer and parallel_run
* PR testing t0_part2 takes much more time than t0_part1 after the GCU test
  scripts are added. This change re-balanced t0 part1&part2 testing by moving
  some of the tests from part2 to part1.
* Sorted the PR testing scripts in alphabetic order.

How did you verify/test it?
Tested run a few test scripts with log analyzer enabled on KVM testbed.

Signed-off-by: Xin Wang <[email protected]>
xwjiang-ms pushed a commit to xwjiang-ms/sonic-mgmt that referenced this pull request Apr 13, 2022
…onic-net#5193)

What is the motivation for this PR?
Looks like commit sonic-net#3235 caused issue that LA has empty list of match errors expressions.

How did you do it?
Fix LogAnalyzer to force load regular expressions from common files

How did you verify/test it?
Executed tests which using LA, checked that LA failed in case when we have errors in logs

Signed-off-by: Petro Pikh <[email protected]>
xwjiang-ms pushed a commit to xwjiang-ms/sonic-mgmt that referenced this pull request Apr 13, 2022
… files (sonic-net#5193)" (sonic-net#5433)

This reverts commit 03cccf7.

Reverts sonic-net#5193

After this fix was merged, PR test keeps failing because of errors in syslog.

We spent some effort trying to temporarily ignore the errors. However, the list seems endless.
Please refer to:

[loganalyzer]add log patterns to the common ignore sonic-net#5411
Add loganalyzer ignore regex for GCU sonic-net#5391
We need a way to temporarily unblock PR testing. Let's revert this fix for now. Then I'll submit another PR to fix the
loganalyzer issue together with a complete ignore list.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants