Conversation
|
please fix vs test failure. #Resolved |
|
can you explain the idea in the commit message, it will be better for us to understand and do the review #Resolved |
| DBConnector cfgDb(CONFIG_DB, DBConnector::DEFAULT_UNIXSOCKET, 0); | ||
| DBConnector appl_db(APPL_DB, DBConnector::DEFAULT_UNIXSOCKET, 0); | ||
| DBConnector state_db(STATE_DB, DBConnector::DEFAULT_UNIXSOCKET, 0); | ||
| ProducerStateTable p(&appl_db, APP_PORT_TABLE_NAME); |
There was a problem hiding this comment.
why change to producer state table? #Resolved
There was a problem hiding this comment.
Actually change to ProducerTable. The entries in APP_PORT_TABLE_NAME are sequence sensitive because there are event entries.
In reply to: 205855555 [](ancestors = 205855555)
There was a problem hiding this comment.
what event entries? #Resolved
There was a problem hiding this comment.
orchagent/orch.cpp
Outdated
| } | ||
|
|
||
| // TODO: Table should be const | ||
| void Orch::addExistingData(Table *table) |
There was a problem hiding this comment.
can you add one line description what addExistingData do? #Resolved
There was a problem hiding this comment.
jipanyang
left a comment
There was a problem hiding this comment.
Another change needed for portorch warm start is bool PortsOrch::initializePort(Port &p). If it is warm start, db oper_status should not be set down. We may use bool isWarmStart() common function as that for other orchagent as needed.
| DBConnector cfgDb(CONFIG_DB, DBConnector::DEFAULT_UNIXSOCKET, 0); | ||
| DBConnector appl_db(APPL_DB, DBConnector::DEFAULT_UNIXSOCKET, 0); | ||
| DBConnector state_db(STATE_DB, DBConnector::DEFAULT_UNIXSOCKET, 0); | ||
| ProducerStateTable p(&appl_db, APP_PORT_TABLE_NAME); |
There was a problem hiding this comment.
Could you do the ProducerStateTable to ProducerTable change in a separate PR? It is not strictly related to warm reboot. #Resolved
There was a problem hiding this comment.
I could, however, this PR will depend on that new one. ProducerTable is correct choice for port table.
In reply to: 205866896 [](ancestors = 205866896)
There was a problem hiding this comment.
Your first suggestion is actually pretty good. Let's keep ProducerStateTable in this PR.
In reply to: 206734165 [](ancestors = 206734165,205866896)
orchagent/orch.cpp
Outdated
|
|
||
| while (it != m_consumerMap.end()) | ||
| { | ||
| consumer = (Consumer*)(it->second.get()); |
There was a problem hiding this comment.
The change overlap with part of #546 . One of the problem here is that other types of class may be in m_consumerMap too and they don't have getTableName() methods. casting them to Consumer may cause problem.
The other reason to add getName()/setName() for Executor is to facilitate debugging. #Resolved
There was a problem hiding this comment.
Actually the name is inside ConsumerMap keys. Refine the code so the conversion is safe now.
In reply to: 205867876 [](ancestors = 205867876)
| if (!foundPortConfigDone || !foundPortInitDone) | ||
| { | ||
| SWSS_LOG_NOTICE("No port table, fallback to cold start"); | ||
| cleanPortTable(keys); |
There was a problem hiding this comment.
There is #547 to have common function for all swss processes to check whether a process is doing warm start .
We agreed to discuss the schema design. The basic idea there is to assume that for cold start, the corresponding WARM_RESTART_TABLE entry cleared, then we create the entry and set restart_count to 0.
Every time the process is warm restarted, restart_count is incremented by 1.
Not sure about the purpose of cleanPortTable() here.
#Resolved
There was a problem hiding this comment.
We could still discuss that design. No confliction here.
Not sure about the purpose of cleanPortTable() here.
If the redis leftover data for warm reboot is corrupted, clean up everything left and fallback to cold start.
In reply to: 205869453 [](ancestors = 205869453)
| return false; | ||
| } | ||
|
|
||
| addExistingData(m_portTable.get()); |
There was a problem hiding this comment.
portorch also handling LAG, LAG member, vlan and vlan members:
addExistingData(db, APP_LAG_TABLE_NAME);
addExistingData(db, APP_LAG_MEMBER_TABLE_NAME);
addExistingData(db, APP_VLAN_TABLE_NAME);
addExistingData(db, APP_VLAN_MEMBER_TABLE_NAME); #Resolved
| vector<Selectable*> getSelectables(); | ||
|
|
||
| // add the existing table data (left by warm reboot) to the consumer todo task list. | ||
| bool addExistingData(Table *table); |
There was a problem hiding this comment.
There are quite a few places to use addExistingData() where only table name is available. It should be better to use table name directly. #Resolved
| if (m_portCount != keys.size() - 2) | ||
| { | ||
| // Invalid port table | ||
| SWSS_LOG_ERROR("Invalid port table: m_portCount"); |
There was a problem hiding this comment.
In what scenario will this case be hit? #Resolved
There was a problem hiding this comment.
There was a problem hiding this comment.
If redis database is corrupted, should not a whole system cold boot be performed instead of doing partial fix here and putting the system into unknown state? Supporting Individual process cold boot alone is another area. #Pending
There was a problem hiding this comment.
It is another design choice, and I am ok with it.
If it will become our final decision, I am happy to iterate on it.
In reply to: 206646838 [](ancestors = 206646838)
| Orch::addExecutor("PORT_STATUS_NOTIFICATIONS", portStatusNotificatier); | ||
|
|
||
| // Try warm start | ||
| bake(); |
There was a problem hiding this comment.
Within PortsOrch::initializePort(Port &p), for warm start, oper_status should not be changed. #Resolved
There was a problem hiding this comment.
I checked code in void PortsOrch::doPortTask(Consumer &consumer).
During warm start, m_portListLaneMap will have all existing ports from SAI query. So there is no new port created, and no new port initialized.
The function PortsOrch::initializePort(Port &p) will not be called.
In reply to: 206311005 [](ancestors = 206311005)
There was a problem hiding this comment.
Not sure if there is misunderstanding here, the warm restart testing I have done so far didn't skip PortsOrch::initializePort(Port &p). Even for cold start, m_portListLaneMap will have all the exiting physical ports. #Resolved
There was a problem hiding this comment.
I am talking about this code snippet:
for (auto it = m_lanesAliasSpeedMap.begin(); it != m_lanesAliasSpeedMap.end();)
{
bool port_created = false;
if (m_portListLaneMap.find(it->first) == m_portListLaneMap.end())
{In reply to: 206363163 [](ancestors = 206363163)
There was a problem hiding this comment.
This is for port breakout. PortsOrch::initPort() still will be called. #Resolved
ebdaba9 to
27e0b6b
Compare
27e0b6b to
c394548
Compare
| } | ||
|
|
||
| // TODO: Table should be const | ||
| void Consumer::refillToSync(Table* table) |
There was a problem hiding this comment.
It looks to me if we remove "bool Orch::addExistingData(Table table)" and use bool Orch::addExistingData(const string& tableName) only,
void Consumer::refillToSync(Table table) is not needed and quite a few line of redundant code function may be eliminated. #Resolved
There was a problem hiding this comment.
void Consumer::refillToSync(Table* table)
is used for optimization if we have a table object already.
In reply to: 207438102 [](ancestors = 207438102)
| } | ||
|
|
||
| doPortConfigDoneTask(tuples); | ||
| if (m_portCount != keys.size() - 2) |
There was a problem hiding this comment.
After "PortConfigDone", for some platform, port may be removed with removePort(it->second), could it cause this check invalid in case of warm start?
if (platform && (strstr(platform, BFN_PLATFORM_SUBSTRING) || strstr(platform, MLNX_PLATFORM_SUBSTRING)))
{
if (!removePort(it->second))
{
throw runtime_error("PortsOrch initialization failure.");
}
} #Resolved
There was a problem hiding this comment.
removePort() will never del entry in PORT_TABLE. So the answer is no.
In reply to: 207439672 [](ancestors = 207439672)
There was a problem hiding this comment.
In your change, doPortConfigDoneTask() is also called at construct phase,
having m_portConfigDone set prematurely will change the processing flow of void PortsOrch::doPortTask(Consumer &consumer).
/* Once all ports received, go through the each port and perform appropriate actions:
* 1. Remove ports which don't exist anymore
* 2. Create new ports
* 3. Initialize all ports
*/
if (m_portConfigDone && (m_lanesAliasSpeedMap.size() == m_portCount)) <--- will be false before all of the ports handled.
{
}
if (!m_portConfigDone) <------- m_portConfigDone is true. it continue the flow.
{
it = consumer.m_toSync.erase(it); <----- While if m_portConfigDone is false, the request will be deleted, not good for the one phase warm restore.
continue;
}
........
Port p;
if (!getPort(alias, p))
{
SWSS_LOG_ERROR("Failed to get port id by alias:%s", alias.c_str()); <--- will hit here, since initPort not called yet.
}
Why do you want to mess with the port init orders? We might hit more subtle sequence problems.
This also reminds me that the previous switching from ProducerState/ConsumerState to Producer/Consumer for portTable might have some potential issue for cold boot,
the processing after "PortConfigDone" will not get all fields of the port and make the config like fec, AN void. The old ProducerState/ConsumerState happened to get all
fields of a port key every time. #Resolved
There was a problem hiding this comment.
You are correct. I removed calling doPortConfigDoneTask in ctor. You are talking about other PR, please follow up in that PR page so I can follow up and iterate. Thanks!
In reply to: 207479060 [](ancestors = 207479060)
41ac079 to
995b195
Compare
Signed-off-by: Qi Luo <[email protected]>
Signed-off-by: Qi Luo <[email protected]>
Signed-off-by: Qi Luo <[email protected]>
Signed-off-by: Qi Luo <[email protected]>
Signed-off-by: Qi Luo <[email protected]>
995b195 to
2840988
Compare
jipanyang
left a comment
There was a problem hiding this comment.
Though I have different opinions on some of the function implementation details as listed in previous comments, I didn't see logical issue with current version, functionally it should work.
With the latest PortsOrch::doPortTask(), one of the side effects we probably missed is that:
After PortConfigDone, while waiting for "PortInitDone" and the first gBufferOrch->isPortReady(alias),
the complete m_lanesAliasSpeedMap may be populated again, so initPort() will be called more than once for the same port.
We may verify that via enabling INFO level debug, likely "SWSS_LOG_INFO("Port has already been initialized before alias:%s", alias.c_str());" will be hit. It is not fatal.
orchagent/portsorch.h
Outdated
| map<string, Port>& getAllPorts(); | ||
| bool bake(); | ||
| void cleanPortTable(const vector<string>& keys); | ||
| void doPortConfigDoneTask(const vector<FieldValueTuple>& tuples); |
There was a problem hiding this comment.
Unused function signature. #Resolved
Signed-off-by: Qi Luo <[email protected]>
Signed-off-by: Qi Luo <[email protected]>
…nic-net#551) * [scripts]: add support to db_migrator for non-default unix socket * add '-s' option to db_migrator Signed-off-by: Lawrence Lee <[email protected]> * [scripts]: make db_migrator follow python convention * change variable names Signed-off-by: Lawrence Lee <[email protected]>
* Fix addExistingData consumer converstion * Add more addExistingData() * Warm reboot for PortsOrch * Remove calling doPortConfigDoneTask in ctor * Remove unused function signature
…t#551) Adding missing FABRIC_PORT_TABLE, COUNTERS_FABRIC_QUEUE_NAME_MAP and COUNTERS_FABRIC_PORT_NAME_MAP to schema.h Signed-off-by: Maxime Lorrillere [email protected] Signed-off-by: Maxime Lorrillere <[email protected]>
The non-warm reboot behavior is backward compatible, and tested in lab.
The idea is best effort warm reboot based on left over entries in PORT_TABLE.
The warm reboot is not end-to-end tested.