[portsorch] fix PortsOrch::allPortsReady() returns true when it should not #1103
qiluo-msft merged 11 commits into sonic-net:master
orchagent/orchdaemon.cpp
Outdated
  * That is ensured implicitly by the order of map key, "LAG_TABLE" is smaller than "VLAN_TABLE" in lexicographic order.
  */
- m_orchList = { gSwitchOrch, gCrmOrch, gBufferOrch, gPortsOrch, gIntfsOrch, gNeighOrch, gRouteOrch, copp_orch, tunnel_decap_orch, qos_orch, wm_orch, policer_orch };
+ m_orchList = { gSwitchOrch, gCrmOrch, gPortsOrch, gBufferOrch, gIntfsOrch, gNeighOrch, gRouteOrch, copp_orch, tunnel_decap_orch, qos_orch, wm_orch, policer_orch };
This is a better order as BufferOrch::doTask() needs to wait for PortsOrch::isInitDone() to be true to proceed
It is interesting that BufferOrch was placed behind PortsOrch earlier, as of PR #515.
retest this please
Retest this please
orchagent/orchdaemon.cpp
Outdated
  * That is ensured implicitly by the order of map key, "LAG_TABLE" is smaller than "VLAN_TABLE" in lexicographic order.
  */
- m_orchList = { gSwitchOrch, gCrmOrch, gBufferOrch, gPortsOrch, gIntfsOrch, gNeighOrch, gRouteOrch, copp_orch, tunnel_decap_orch, qos_orch, wm_orch, policer_orch };
+ m_orchList = { gSwitchOrch, gCrmOrch, gPortsOrch, gBufferOrch, gIntfsOrch, gNeighOrch, gRouteOrch, copp_orch, tunnel_decap_orch, qos_orch, wm_orch, policer_orch };
gPortsOrch: Do not reorder if you don't have a strong reason. Any order should work for warm-reboot if we iterate enough rounds.
@qiluo-msft @prsunny I agree that it is just a matter of enough iterations; however, this order makes more sense, at least to me, so it's more of a cosmetic change.
Besides, I now think that with this order 3 iterations instead of 4 are needed in warm boot; let me check.
Are you concerned that this may break something?
Let me know if you have proof that this goes from 4 iterations to 3. Otherwise I don't see any improvement.
Please check the additional commit, which brings it down to 3 iterations.
In any case, let me move this change into a separate PR.
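The point about list order versus number of rounds can be illustrated with a toy model. This is only a sketch and not orchagent code: `ToyOrch`, `roundsToConverge`, and the dependency rule are hypothetical. It assumes each orch is visited once per round in list order and that a dependent orch skips its work until the orch it depends on has finished.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an orch: it can only finish once the orch it
// depends on (if any) has already finished.
struct ToyOrch {
    std::string name;
    int dependsOn;  // index into the list, or -1 for no dependency
    bool done;
};

// Visits the list in order once per round, the way a loop over an orch list
// would, and returns how many rounds pass until every orch is done.
int roundsToConverge(std::vector<ToyOrch> orchs) {
    int rounds = 0;
    bool allDone = false;
    while (!allDone) {
        ++rounds;
        allDone = true;
        bool progressed = false;
        for (std::size_t i = 0; i < orchs.size(); ++i) {
            ToyOrch &o = orchs[i];
            const bool depReady =
                o.dependsOn < 0 || orchs[static_cast<std::size_t>(o.dependsOn)].done;
            if (!o.done && depReady) {
                o.done = true;
                progressed = true;
            }
            allDone = allDone && o.done;
        }
        // A dependency cycle would otherwise make this loop spin forever.
        assert(progressed || allDone);
    }
    return rounds;
}
```

In this toy, both orders converge, and placing the dependency first saves a round. The counts apply to the model only and say nothing about the 4 versus 3 iterations discussed in this thread.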
This PR will be an important bug fix. To make lives easier, could you please provide a VS test case that fails on the current version but is fixed by this PR?
Stepan, as you analyzed in sonic-net/sonic-mgmt#834 (comment), the design of m_pendingPortSet has an issue. If we can populate all physical ports into m_pendingPortSet in the PortsOrch::bake() phase, we can drop the change to move the section of code down pasted below:
82207af to 51e8707
@qiluo-msft @wendani Not everything in this PR can be fixed by initializing m_pendingPortSet in bake(). Another issue is that the PFC watchdog start action is not protected by allPortsReady(), which causes errors in the logs in warm boot. It is fixed as part of this PR and is not related to m_pendingPortSet.
@qiluo-msft
retest this please
Regarding the test, I don't mean an end-to-end simulation of a warm-reboot plus PFC storm. Maybe a Google Test (unit test) could help here. You could just prepare some mock data in Redis, or mock Redis, and let orchagent consume it. This test case should fail on the old code but pass on your new code.
@qiluo-msft A Google Test (unit test) would be simpler to create than a VS test, but the tests in tests/mock_tests fail to compile on recent master and I don't see them being run on PRs. Are these tests expected to work?
I recently fixed the google test. If you check tests/mock_tests are not currently included
…d not
Warm start flow before the change:
1st iteration:
- BufferOrch::doTask(): returns since PortInitDone hasn't arrived yet
- PortsOrch::doTask(): processes all of PORT_TABLE until the PortInitDone flag;
  m_pendingPortSet is still empty and m_portInitDone is true,
  so allPortsReady() will return true
- AnyOrch::doTask(): checks gPortsOrch->allPortsReady()
2nd iteration:
- BufferOrch::doTask(): now buffers are applied
This causes BufferOrch to override PfcWdOrch's zero-buffer profile.
The change swaps BufferOrch and PortsOrch in m_orchList, because the 1st
BufferOrch iteration will always skip processing, and the swap eliminates the possibility
of m_pendingPortSet not being filled with ports after m_initDone is set to true.
Signed-off-by: Stepan Blyschak <[email protected]>
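The premature "ready" window described in this commit message can be reproduced in a toy model. This is a hedged sketch, not the real PortsOrch: the class `ToyPortsOrch` and its helper methods are hypothetical, and the shape of the check (init flag set and no pending ports) is an assumption drawn from the description above. The `bake()` method mirrors the idea, raised elsewhere in this PR, of filling m_pendingPortSet before any doTask() round runs.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy model of the readiness check; hypothetical, not the real PortsOrch.
class ToyPortsOrch {
public:
    // Warm-start pre-population: record every known port as pending before
    // any doTask() round has a chance to ask allPortsReady().
    void bake(const std::vector<std::string> &ports) {
        m_pendingPortSet.insert(ports.begin(), ports.end());
    }

    // Models the arrival of the PortInitDone flag.
    void setInitDone() { m_initDone = true; }

    // Models a port leaving the pending set once it has been handled.
    void portConfigured(const std::string &alias) { m_pendingPortSet.erase(alias); }

    // Assumed shape of the check: init flag set and nothing left pending.
    bool allPortsReady() const { return m_initDone && m_pendingPortSet.empty(); }

private:
    bool m_initDone = false;
    std::set<std::string> m_pendingPortSet;
};
```

Without a call to `bake()`, setting the init flag alone makes `allPortsReady()` return true while no port has been tracked yet, which is the window the commit message describes. With `bake()`, the check stays false until each port has been marked configured.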
… started in warm boot
It appeared that the PFC watchdog relied on buggy behaviour of PortsOrch::allPortsReady(). With PortsOrch::allPortsReady() fixed, the watchdog action tries to start before the watchdog was started, because allPortsReady() in PfcWdOrch::doTask() returned false. Before the fix, the watchdog was started earlier, because allPortsReady() reported that ports were ready when they were not.
Signed-off-by: Stepan Blyschak <[email protected]>
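The ordering problem in this commit message can be sketched as follows. This is an illustrative toy under stated assumptions, not the real PfcWdOrch or PfcWdSwOrch: `ToyPfcWd` and its methods are hypothetical, and the guard shown is only one plausible way to keep the action from running before the watchdog is started.

```cpp
#include <cassert>

// Hypothetical model of a watchdog whose start depends on port readiness.
struct ToyPfcWd {
    bool wdStarted = false;
    bool actionStarted = false;
    int errors = 0;

    // Watchdog start is deferred until ports are ready.
    void configTask(bool portsReady) {
        if (portsReady) {
            wdStarted = true;
        }
    }

    // Unguarded action: counts an error when the watchdog is not started yet.
    void actionTaskUnguarded() {
        if (!wdStarted) {
            ++errors;
            return;
        }
        actionStarted = true;
    }

    // Guarded action: waits for the same readiness condition as the start.
    void actionTaskGuarded(bool portsReady) {
        if (!portsReady) {
            return;  // retry in a later round
        }
        actionTaskUnguarded();
    }
};
```

In the toy, the unguarded path records an error when the ports are not yet ready, while the guarded path simply waits for a later round and then starts the action cleanly.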
This reverts commit 84f80e0.
Signed-off-by: Stepan Blyschak <[email protected]>
8de2a78 to 6a2073b
Without fix:
retest this please
@stepanblyschak This change cannot be cherry-picked cleanly into the 201811 branch. Please create a PR for the 201811 branch. Thanks!
…onic-net#1103) Signed-off-by: Danny Allen <[email protected]>
$(top_srcdir)/orchagent/orchdaemon.cpp \
$(top_srcdir)/orchagent/orch.cpp \
$(top_srcdir)/orchagent/notifications.cpp \
$(top_srcdir)/orchagent/routeorch.cpp \
$(top_srcdir)/orchagent/neighorch.cpp \
$(top_srcdir)/orchagent/intfsorch.cpp \
$(top_srcdir)/orchagent/portsorch.cpp \
$(top_srcdir)/orchagent/copporch.cpp \
$(top_srcdir)/orchagent/tunneldecaporch.cpp \
$(top_srcdir)/orchagent/qosorch.cpp \
$(top_srcdir)/orchagent/bufferorch.cpp \
$(top_srcdir)/orchagent/mirrororch.cpp \
$(top_srcdir)/orchagent/fdborch.cpp \
$(top_srcdir)/orchagent/aclorch.cpp \
$(top_srcdir)/orchagent/saihelper.cpp \
$(top_srcdir)/orchagent/switchorch.cpp \
$(top_srcdir)/orchagent/pfcwdorch.cpp \
$(top_srcdir)/orchagent/pfcactionhandler.cpp \
$(top_srcdir)/orchagent/policerorch.cpp \
$(top_srcdir)/orchagent/crmorch.cpp \
$(top_srcdir)/orchagent/request_parser.cpp \
$(top_srcdir)/orchagent/vrforch.cpp \
$(top_srcdir)/orchagent/countercheckorch.cpp \
$(top_srcdir)/orchagent/vxlanorch.cpp \
$(top_srcdir)/orchagent/vnetorch.cpp \
$(top_srcdir)/orchagent/dtelorch.cpp \
$(top_srcdir)/orchagent/flexcounterorch.cpp \
$(top_srcdir)/orchagent/watermarkorch.cpp \
$(top_srcdir)/orchagent/chassisorch.cpp \
$(top_srcdir)/orchagent/sfloworch.cpp
@stepanblyschak @Pterosaur @nazariig @lguohan please build orchagent as a library (*.la) so that it is one file. orchagent is already compiled, and here in the tests you are compiling it again. That is the same compilation done twice, which doubles the compilation time: it now takes 19 minutes to compile, and using it as a lib it could take 10 min.
…d not (sonic-net#1103)
* [portsorch] fix PortsOrch::allPortsReady() returns true when it should not
  Warm start flow before the change:
  1st iteration:
  - BufferOrch::doTask(): returns since PortInitDone hasn't arrived yet
  - PortsOrch::doTask(): processes all of PORT_TABLE until the PortInitDone flag;
    m_pendingPortSet is still empty and m_portInitDone is true,
    so allPortsReady() will return true
  - AnyOrch::doTask(): checks gPortsOrch->allPortsReady()
  2nd iteration:
  - BufferOrch::doTask(): now buffers are applied
  This causes BufferOrch to override PfcWdOrch's zero-buffer profile.
  The change swaps BufferOrch and PortsOrch in m_orchList, because the 1st BufferOrch iteration will always skip processing, and the swap eliminates the possibility of m_pendingPortSet not being filled with ports after m_initDone is set to true.
* remove extra newline
* [pfcwdorch] fix PfcWdSwOrch::doTask() starts WD action when WD wasn't started in warm boot
  It appeared that the PFC watchdog relied on buggy behaviour of PortsOrch::allPortsReady(). With PortsOrch::allPortsReady() fixed, the watchdog action tries to start before the watchdog was started, because allPortsReady() in PfcWdOrch::doTask() returned false. Before the fix, the watchdog was started earlier, because allPortsReady() reported that ports were ready when they were not.
* [portsorch] populate m_pendingPortSet in PortsOrch::bake()
* [portsorch] optimize to 3 iterations instead of 4
* Revert "[portsorch] optimize to 3 iterations instead of 4"
* revert change of order in m_orchList
* [mock_tests] fix tests build
* [mock_test] create unittest for PortsOrch::allPortsReady cold/warm flows
* [orchdaemon] fix removed sfloworch
* [mock_tests] make mock_tests run on "make check"