Warm reboot for PortsOrch #551

Merged
qiluo-msft merged 8 commits into sonic-net:master from qiluo-msft:qiluo/warmport
Aug 8, 2018

Contributor

@qiluo-msft qiluo-msft commented Jul 27, 2018

The non-warm reboot behavior is backward compatible, and tested in lab.

The idea is a best-effort warm reboot based on leftover entries in PORT_TABLE.

  1. During a cold reboot, the whole table is empty, so the original behavior is kept.
  2. During a warm reboot, the previous port entries are left in the table, together with event entries such as PortConfigDone and PortInitDone. PortsOrch adds them to its consumer queue so they propagate downstream to syncd and the ASIC.
  3. During a warm reboot, if any corruption is found in the table, PortsOrch cleans up the table and falls back to a cold reboot.
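
The three cases above can be sketched as a small classification function. This is a minimal sketch, not the PR's code: `StartMode` and `classifyPortTable` are hypothetical names, and `portCount` is assumed to be known to the caller. The checks mirror the `foundPortConfigDone`/`foundPortInitDone` test and the `m_portCount != keys.size() - 2` test quoted later in this review.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type for this sketch; it is not part of PortsOrch.
enum class StartMode { Cold, Warm, CorruptedFallBackToCold };

// Classify the keys left over in PORT_TABLE.
// portCount is assumed to be the number of ports the caller expects.
StartMode classifyPortTable(const std::vector<std::string>& keys, std::size_t portCount)
{
    // Case 1: nothing was left over, so this is a cold reboot.
    if (keys.empty())
        return StartMode::Cold;

    const bool hasConfigDone =
        std::find(keys.begin(), keys.end(), "PortConfigDone") != keys.end();
    const bool hasInitDone =
        std::find(keys.begin(), keys.end(), "PortInitDone") != keys.end();

    // Case 3: an event entry is missing, so the leftover data is not usable.
    if (!hasConfigDone || !hasInitDone)
        return StartMode::CorruptedFallBackToCold;

    // Case 3 again: besides the two event entries, every key should be a port.
    if (portCount != keys.size() - 2)
        return StartMode::CorruptedFallBackToCold;

    // Case 2: the table looks complete, so the entries can be replayed.
    return StartMode::Warm;
}
```

A caller would clean the table and continue with cold-start behavior whenever the result is `CorruptedFallBackToCold`.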

The warm reboot is not end-to-end tested.

Contributor

lguohan commented Jul 27, 2018

please fix vs test failure. #Resolved

Contributor

lguohan commented Jul 27, 2018

Can you explain the idea in the commit message? It will make it easier for us to understand and review. #Resolved

DBConnector cfgDb(CONFIG_DB, DBConnector::DEFAULT_UNIXSOCKET, 0);
DBConnector appl_db(APPL_DB, DBConnector::DEFAULT_UNIXSOCKET, 0);
DBConnector state_db(STATE_DB, DBConnector::DEFAULT_UNIXSOCKET, 0);
ProducerStateTable p(&appl_db, APP_PORT_TABLE_NAME);

Contributor

@lguohan lguohan Jul 27, 2018

why change to producer state table? #Resolved

Contributor Author

Actually, the change is to ProducerTable. The entries in APP_PORT_TABLE_NAME are order-sensitive because they include event entries.


In reply to: 205855555 [](ancestors = 205855555)

Contributor

@lguohan lguohan Jul 27, 2018

what event entries? #Resolved

Contributor Author

PortConfigDone, PortInitDone


In reply to: 205859948 [](ancestors = 205859948)
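
The ordering concern raised in this thread can be illustrated with two toy channel models. This is a sketch only: `OrderedQueueModel` and `KeyedStateModel` are hypothetical stand-ins, not the swss-common classes, and the assumption is that a state-style table coalesces writes per key without keeping arrival order.

```cpp
#include <deque>
#include <map>
#include <string>
#include <vector>

// Simplified models for illustration only; these are NOT the swss-common classes.

// Queue-style channel: every write is kept, in arrival order.
struct OrderedQueueModel
{
    std::deque<std::string> ops;

    void set(const std::string& key) { ops.push_back(key); }

    std::vector<std::string> drain()
    {
        std::vector<std::string> out(ops.begin(), ops.end());
        ops.clear();
        return out;
    }
};

// State-style channel: writes are coalesced per key and arrival order is not kept.
// std::map iterates in key order, which stands in for "some order other than arrival".
struct KeyedStateModel
{
    std::map<std::string, int> pending;  // key -> number of coalesced writes

    void set(const std::string& key) { ++pending[key]; }

    std::vector<std::string> drain()
    {
        std::vector<std::string> out;
        for (const auto& kv : pending)
            out.push_back(kv.first);
        pending.clear();
        return out;
    }
};
```

Writing `Ethernet4`, `PortConfigDone`, `Ethernet0` in that order, the queue model hands back the event entry in the middle where it was written, while the keyed model hands it back last. That loss of position relative to the event entry is the sequencing issue under discussion.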

}

// TODO: Table should be const
void Orch::addExistingData(Table *table)

Contributor

@lguohan lguohan Jul 27, 2018

Can you add a one-line description of what addExistingData does? #Resolved

Contributor Author

Actually it is in the header file.


In reply to: 205862889 [](ancestors = 205862889)

Contributor

lguohan commented Jul 27, 2018

		notify(SUBJECT_TYPE_PORT_CHANGE, static_cast<void *>(&update));

change tab to spaces. #Resolved


Refers to: orchagent/portsorch.cpp:1163 in d363262. [](commit_id = d363262, deletion_comment = False)

Contributor

@jipanyang jipanyang left a comment

Another change needed for PortsOrch warm start is in bool PortsOrch::initializePort(Port &p): on a warm start, the DB oper_status should not be set to down. We may use a common bool isWarmStart() function, as is done for other orchagent components, as needed.

DBConnector cfgDb(CONFIG_DB, DBConnector::DEFAULT_UNIXSOCKET, 0);
DBConnector appl_db(APPL_DB, DBConnector::DEFAULT_UNIXSOCKET, 0);
DBConnector state_db(STATE_DB, DBConnector::DEFAULT_UNIXSOCKET, 0);
ProducerStateTable p(&appl_db, APP_PORT_TABLE_NAME);

Contributor

@jipanyang jipanyang Jul 27, 2018

Could you do the ProducerStateTable to ProducerTable change in a separate PR? It is not strictly related to warm reboot. #Resolved

Contributor Author

I could; however, this PR would then depend on the new one. ProducerTable is the correct choice for the port table.


In reply to: 205866896 [](ancestors = 205866896)

Contributor Author

Let's start another PR for it. Please help #556


In reply to: 205866896 [](ancestors = 205866896)

Contributor Author

@qiluo-msft qiluo-msft Aug 8, 2018

Your first suggestion is actually pretty good. Let's keep ProducerStateTable in this PR.


In reply to: 206734165 [](ancestors = 206734165,205866896)


while (it != m_consumerMap.end())
{
    consumer = (Consumer*)(it->second.get());

Contributor

@jipanyang jipanyang Jul 27, 2018

This change overlaps with part of #546. One problem here is that other classes may also be in m_consumerMap, and they don't have a getTableName() method; casting them to Consumer may cause problems.
The other reason to add getName()/setName() to Executor is to facilitate debugging. #Resolved

Contributor Author

Actually, the name is in the ConsumerMap keys. I refined the code so the conversion is now safe.


In reply to: 205867876 [](ancestors = 205867876)
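
The casting concern can be shown with a checked downcast. This is a minimal sketch under stated assumptions: the `Executor`, `Consumer` and `Notifier` types below are simplified stand-ins, `consumerTableNames` is a hypothetical helper, and `dynamic_cast` is one possible approach rather than the fix applied in this PR (which, per the reply above, relies on the map keys).

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for the orchagent types; only the shape matters here.
struct Executor
{
    virtual ~Executor() = default;
};

struct Consumer : Executor
{
    explicit Consumer(std::string name) : m_tableName(std::move(name)) {}
    const std::string& getTableName() const { return m_tableName; }

private:
    std::string m_tableName;
};

// Hypothetical executor that is not a Consumer and has no getTableName().
struct Notifier : Executor
{
};

// Collect table names only from entries that really are Consumers.
std::vector<std::string> consumerTableNames(
    const std::map<std::string, std::shared_ptr<Executor>>& executors)
{
    std::vector<std::string> names;
    for (const auto& kv : executors)
    {
        // dynamic_cast yields nullptr for non-Consumer executors, unlike a C-style cast.
        if (auto* consumer = dynamic_cast<Consumer*>(kv.second.get()))
            names.push_back(consumer->getTableName());
    }
    return names;
}
```

With a map holding one `Consumer` and one `Notifier`, only the `Consumer` entry contributes a table name and the other entry is skipped instead of being misread.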

if (!foundPortConfigDone || !foundPortInitDone)
{
    SWSS_LOG_NOTICE("No port table, fallback to cold start");
    cleanPortTable(keys);

Contributor

@jipanyang jipanyang Jul 27, 2018

There is #547, which adds a common function for all swss processes to check whether a process is doing a warm start.
We agreed to discuss the schema design. The basic idea there is that on a cold start the corresponding WARM_RESTART_TABLE entry has been cleared, so we create the entry and set restart_count to 0.
Every time the process is warm restarted, restart_count is incremented by 1.

Not sure about the purpose of cleanPortTable() here.

#Resolved
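
The restart_count idea described in this comment could look roughly like the following. It is a sketch only: a `std::map` stands in for WARM_RESTART_TABLE, and `registerStart` is a hypothetical helper, not the function proposed in #547.

```cpp
#include <map>
#include <string>

// Stand-in for WARM_RESTART_TABLE: process name -> restart_count.
using WarmRestartTable = std::map<std::string, int>;

// Record a process start and report whether it counts as a warm start.
// Assumption: a cold start has cleared the entry beforehand.
bool registerStart(WarmRestartTable& table, const std::string& process)
{
    auto it = table.find(process);
    if (it == table.end())
    {
        // No entry: cold start. Create the entry with restart_count 0.
        table[process] = 0;
        return false;
    }

    // Entry present: warm restart. Increment restart_count by 1.
    ++it->second;
    return true;
}
```

Under this model, the first call for a process reports a cold start and leaves the count at 0, and each later call reports a warm start and bumps the count.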

Contributor Author

We can still discuss that design. There is no conflict here.

"Not sure about the purpose of cleanPortTable() here."
If the leftover Redis data for warm reboot is corrupted, cleanPortTable() cleans up everything that is left and falls back to cold start.


In reply to: 205869453 [](ancestors = 205869453)

return false;
}

addExistingData(m_portTable.get());

Contributor

@jipanyang jipanyang Jul 30, 2018

PortsOrch also handles LAGs, LAG members, VLANs and VLAN members:

addExistingData(db, APP_LAG_TABLE_NAME);
addExistingData(db, APP_LAG_MEMBER_TABLE_NAME);
addExistingData(db, APP_VLAN_TABLE_NAME);
addExistingData(db, APP_VLAN_MEMBER_TABLE_NAME); #Resolved

vector<Selectable*> getSelectables();

// add the existing table data (left by warm reboot) to the consumer todo task list.
bool addExistingData(Table *table);

Contributor

@jipanyang jipanyang Jul 30, 2018

There are quite a few places that use addExistingData() where only the table name is available. It would be better to use the table name directly. #Resolved
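
What "add the existing table data to the consumer todo task list" means can be sketched as follows. The types are simplified stand-ins, not the orchagent ones: `PendingTask`, `ToSync` and this free-function `refillToSync` are hypothetical, and keeping an already-pending task instead of overwriting it is a design assumption of the sketch.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using FieldValues = std::vector<std::pair<std::string, std::string>>;

// Hypothetical pending task: an operation name plus the row's fields.
struct PendingTask
{
    std::string op;
    FieldValues fields;
};

// Stand-in for a consumer's pending-task list, keyed by table key.
using ToSync = std::map<std::string, PendingTask>;

// Queue every row left in the table as a "SET" task.
// Rows whose key already has a pending task are left untouched (assumption).
// Returns the number of rows queued.
std::size_t refillToSync(ToSync& toSync,
                         const std::map<std::string, FieldValues>& tableRows)
{
    std::size_t added = 0;
    for (const auto& row : tableRows)
    {
        if (toSync.emplace(row.first, PendingTask{"SET", row.second}).second)
            ++added;
    }
    return added;
}
```

An overload that takes only a table name would first open the table and then do the same refill, which is the shape the comment above argues for.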

if (m_portCount != keys.size() - 2)
{
    // Invalid port table
    SWSS_LOG_ERROR("Invalid port table: m_portCount");

Contributor

@jipanyang jipanyang Jul 30, 2018

In what scenario will this case be hit? #Resolved

Contributor Author

Redis database corrupted.


In reply to: 206024024 [](ancestors = 206024024)

Contributor

@jipanyang jipanyang Jul 31, 2018

If the Redis database is corrupted, shouldn't a whole-system cold boot be performed instead of doing a partial fix here and putting the system into an unknown state? Supporting cold boot of an individual process is a separate area. #Pending

Contributor Author

@qiluo-msft qiluo-msft Aug 3, 2018

It is another design choice, and I am ok with it.
If it will become our final decision, I am happy to iterate on it.


In reply to: 206646838 [](ancestors = 206646838)

Orch::addExecutor("PORT_STATUS_NOTIFICATIONS", portStatusNotificatier);

// Try warm start
bake();

Contributor

@jipanyang jipanyang Jul 30, 2018

Within PortsOrch::initializePort(Port &p), for warm start, oper_status should not be changed. #Resolved

Contributor Author

I checked code in void PortsOrch::doPortTask(Consumer &consumer).

During warm start, m_portListLaneMap will have all existing ports from SAI query. So there is no new port created, and no new port initialized.

The function PortsOrch::initializePort(Port &p) will not be called.


In reply to: 206311005 [](ancestors = 206311005)

Contributor

@jipanyang jipanyang Jul 31, 2018

Not sure if there is a misunderstanding here: the warm restart testing I have done so far didn't skip PortsOrch::initializePort(Port &p). Even for a cold start, m_portListLaneMap will have all the existing physical ports. #Resolved

Contributor Author

@qiluo-msft qiluo-msft Jul 31, 2018

I am talking about this code snippet:

for (auto it = m_lanesAliasSpeedMap.begin(); it != m_lanesAliasSpeedMap.end();)
{
    bool port_created = false;

    if (m_portListLaneMap.find(it->first) == m_portListLaneMap.end())
    {

In reply to: 206363163 [](ancestors = 206363163)

Contributor

@jipanyang jipanyang Jul 31, 2018

This is for port breakout. PortsOrch::initPort() will still be called. #Resolved
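
The distinction drawn in this exchange, between creating a port and initializing it, can be sketched like this. It reflects one reading of the thread, not the actual PortsOrch code: `Lanes`, `PortPlan` and `planPorts` are hypothetical, with `configured` playing the role of m_lanesAliasSpeedMap and `reportedByAsic` the role of m_portListLaneMap.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

using Lanes = std::set<int>;

struct PortPlan
{
    std::vector<std::string> created;      // aliases whose lane set the ASIC did not report
    std::vector<std::string> initialized;  // aliases that go through initialization
};

// Walk the configured lane sets. A port is created only when its lane set is
// unknown to the ASIC (for example after a breakout), but every configured
// port is initialized, whether or not it had to be created.
PortPlan planPorts(const std::map<Lanes, std::string>& configured,
                   const std::set<Lanes>& reportedByAsic)
{
    PortPlan plan;
    for (const auto& entry : configured)
    {
        if (reportedByAsic.find(entry.first) == reportedByAsic.end())
            plan.created.push_back(entry.second);

        plan.initialized.push_back(entry.second);
    }
    return plan;
}
```

So even when no port is created because the ASIC already reports every lane set, the initialization path still runs for each port, which is why the oper_status concern applies to warm start.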

Contributor Author

Thanks! Got your idea, and it is really a bug. I am using another PR to fix it. Please help #555


In reply to: 206644401 [](ancestors = 206644401)

@qiluo-msft qiluo-msft force-pushed the qiluo/warmport branch 2 times, most recently from ebdaba9 to 27e0b6b Compare August 1, 2018 01:44
@sonic-net sonic-net deleted a comment from lguohan Aug 1, 2018
}

// TODO: Table should be const
void Consumer::refillToSync(Table* table)

Contributor

@jipanyang jipanyang Aug 3, 2018

It looks to me that if we remove "bool Orch::addExistingData(Table *table)" and use only bool Orch::addExistingData(const string& tableName), then void Consumer::refillToSync(Table *table) is not needed and quite a few lines of redundant code may be eliminated. #Resolved

Contributor Author

void Consumer::refillToSync(Table* table)

is used for optimization if we have a table object already.


In reply to: 207438102 [](ancestors = 207438102)

}

doPortConfigDoneTask(tuples);
if (m_portCount != keys.size() - 2)

Contributor

@jipanyang jipanyang Aug 3, 2018

After "PortConfigDone", on some platforms a port may be removed with removePort(it->second). Could that make this check invalid in the case of warm start?

if (platform && (strstr(platform, BFN_PLATFORM_SUBSTRING) || strstr(platform, MLNX_PLATFORM_SUBSTRING)))
{
    if (!removePort(it->second))
    {
        throw runtime_error("PortsOrch initialization failure.");
    }
}

#Resolved

Contributor Author

removePort() never deletes an entry in PORT_TABLE. So the answer is no.


In reply to: 207439672 [](ancestors = 207439672)

Contributor

@jipanyang jipanyang Aug 3, 2018

In your change, doPortConfigDoneTask() is also called in the construction phase.
Having m_portConfigDone set prematurely will change the processing flow of void PortsOrch::doPortTask(Consumer &consumer).

            /* Once all ports received, go through the each port and perform appropriate actions:
             * 1. Remove ports which don't exist anymore
             * 2. Create new ports
             * 3. Initialize all ports
             */
            if (m_portConfigDone && (m_lanesAliasSpeedMap.size() == m_portCount))     <--- will be false before all of the ports handled.
            {
            }
            
            if (!m_portConfigDone)              <------- m_portConfigDone is true. it continue the flow.
            {
                it = consumer.m_toSync.erase(it);      <----- While if m_portConfigDone is false, the request will be deleted, not good for the one phase warm restore.
                continue;
            }
            
            ........
            Port p;
            if (!getPort(alias, p))
            {
                SWSS_LOG_ERROR("Failed to get port id by alias:%s", alias.c_str());   <--- will hit here, since initPort not called yet.
            }

Why do you want to change the port init order? We might hit more subtle sequencing problems.

This also reminds me that the earlier switch from ProducerState/ConsumerState to Producer/Consumer for portTable might have a potential issue for cold boot: the processing after "PortConfigDone" will not get all fields of the port, which would void config such as FEC and AN. The old ProducerState/ConsumerState happened to get all fields of a port key every time. #Resolved
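
The "all fields every time" point can be illustrated with two toy delivery functions. This is a sketch based only on the behavior described in the comment above: `Fields`, `deliverDelta` and `deliverState` are hypothetical names and do not model the swss-common implementation.

```cpp
#include <map>
#include <string>

using Fields = std::map<std::string, std::string>;

// Delta-style delivery: the consumer sees only the fields carried by this one operation.
Fields deliverDelta(const Fields& op)
{
    return op;
}

// State-style delivery: the operation is merged into the stored row,
// and the consumer sees the whole row.
Fields deliverState(Fields& storedRow, const Fields& op)
{
    for (const auto& kv : op)
        storedRow[kv.first] = kv.second;
    return storedRow;
}
```

If a row already holds `speed` and `fec` and a later operation sets only `admin_status`, the delta-style consumer never sees `fec` in that operation, while the state-style consumer sees all three fields.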

Contributor Author

@qiluo-msft qiluo-msft Aug 4, 2018

You are correct. I removed the call to doPortConfigDoneTask in the ctor. Your second point is about another PR; please follow up on that PR's page so I can iterate there. Thanks!


In reply to: 207479060 [](ancestors = 207479060)

@qiluo-msft qiluo-msft force-pushed the qiluo/warmport branch 2 times, most recently from 41ac079 to 995b195 Compare August 8, 2018 00:01
Contributor

@jipanyang jipanyang left a comment

Though I have different opinions on some of the implementation details, as listed in previous comments, I didn't see a logical issue with the current version; functionally it should work.

With the latest PortsOrch::doPortTask(), one side effect we probably missed is this:
after PortConfigDone, while waiting for "PortInitDone" and the first gBufferOrch->isPortReady(alias), the complete m_lanesAliasSpeedMap may be populated again, so initPort() will be called more than once for the same port.
We can verify that by enabling INFO-level logging; "SWSS_LOG_INFO("Port has already been initialized before alias:%s", alias.c_str());" will likely be hit. It is not fatal.
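
Why a repeated initPort() call is not fatal can be sketched with an idempotent guard. This is a minimal sketch: `PortInitTracker` is a hypothetical class, and it only assumes what the quoted log message suggests, namely that an already-initialized port is detected and skipped.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical tracker that makes port initialization idempotent per alias.
class PortInitTracker
{
public:
    // Returns true if initialization work was done on this call,
    // false if the alias had already been initialized before.
    bool initPort(const std::string& alias)
    {
        // std::set::insert reports whether the element was newly inserted.
        return m_initialized.insert(alias).second;
    }

    std::size_t initializedCount() const { return m_initialized.size(); }

private:
    std::set<std::string> m_initialized;
};
```

A second `initPort("Ethernet0")` returns false and changes nothing, which corresponds to the case where the INFO log line would be emitted.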

map<string, Port>& getAllPorts();
bool bake();
void cleanPortTable(const vector<string>& keys);
void doPortConfigDoneTask(const vector<FieldValueTuple>& tuples);

Contributor

@jipanyang jipanyang Aug 8, 2018

Unused function signature. #Resolved

EdenGri pushed a commit to EdenGri/sonic-swss that referenced this pull request Feb 28, 2022
…nic-net#551)

oleksandrivantsiv pushed a commit to oleksandrivantsiv/sonic-swss that referenced this pull request Mar 1, 2023
Janetxxx pushed a commit to Janetxxx/sonic-swss that referenced this pull request Nov 10, 2025
* Fix addExistingData consumer converstion
* Add more addExistingData()
* Warm reboot for PortsOrch
* Remove calling doPortConfigDoneTask in ctor
* Remove unused function signature
jianyuewu pushed a commit to jianyuewu/sonic-swss that referenced this pull request Dec 24, 2025
…t#551)
