Warm reboot: store FDB entries and warm start #615

Merged
lguohan merged 20 commits into sonic-net:master from qiluo-msft:qiluo/fdbstate on Sep 24, 2018

@qiluo-msft
Contributor

@qiluo-msft qiluo-msft commented Sep 12, 2018

Inspired by #558
The motivation of this PR:

  1. orchagent should not read ASIC DB directly.
  2. orchagent should not assume that syncd records FDB entries at all. It should store FDB entries based on the FDB notifications it observes, and use them during warm start.
  3. We believe orchagent holds the ground truth of FDB entries during warm start.
  4. orchagent will call create_fdb_entry for dynamic entries, so syncd doesn't need to treat them specially in its comparison logic.
  5. orchagent will receive FDB notifications, map bridge_port_id to a port name, and store the result in StateDB for warm restore.
  6. orchagent will stop FDB learning/moving/aging before warm stop. (TBD in another PR)
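
The notification handling in items 2 and 5 can be sketched roughly as follows. This is only an illustration in Python with in-memory dictionaries: the real orchagent is C++ and writes to Redis. The function name `store_fdb_event`, the event strings, and the bridge port ID value are hypothetical; the `Vlan<id>:<mac>` key and the `port`/`type` fields follow the snippets shown later in this review.

```python
# Hypothetical stand-ins: `state_table` plays the role of the StateDB FDB table,
# `bridge_port_to_name` the bridge_port_id -> port name mapping kept by orchagent.
def store_fdb_event(state_table, bridge_port_to_name, event, vlan_id, mac, bridge_port_id):
    """Record one observed FDB notification so it can be replayed on warm start."""
    key = f"Vlan{vlan_id}:{mac}"
    if event in ("learned", "moved"):
        # Resolve the SAI bridge port to a port name before storing,
        # because object IDs are not meaningful to the restore path.
        port_name = bridge_port_to_name[bridge_port_id]
        state_table[key] = {"port": port_name, "type": "dynamic"}
    elif event in ("aged", "flushed"):
        # The entry left the hardware table, so drop the stored copy as well.
        state_table.pop(key, None)
    else:
        raise ValueError(f"unknown FDB event: {event}")
```

Storing the port name rather than the bridge port ID is what lets the stored entries be reused after the restart, when they are fed back as ordinary FDB table input.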

Tested:

  1. vs test: store FDB notification in StateDB
  2. vs test: warm start with FDB in StateDB

@lguohan
Contributor

lguohan commented Sep 12, 2018

Where do you use refreshFdbEntries()? #Resolved

Contributor

@jipanyang jipanyang left a comment

How are the FDB entries restored? It looks like the FDB entries are saved in stateDB.

Could you separate code restructuring from the feature implementation in general?

There are quite a few changes not related to FDB restore.

}

void FdbOrch::update(sai_fdb_event_t type, const sai_fdb_entry_t* entry, sai_object_id_t bridge_port_id)
void FdbOrch::refreshFdbEntries()
Contributor

@jipanyang jipanyang Sep 12, 2018

If it is for FDB reconciliation, a separate PR should be prepared.
FDB reconciliation requires stale entry removal, which may be done implicitly in syncd or explicitly in orchagent. #Resolved

Contributor

@lguohan lguohan Sep 14, 2018

We plan to do this reconciliation by disabling hardware learning on the ASIC during warm reboot. #Resolved

@qiluo-msft
Contributor Author

qiluo-msft commented Sep 14, 2018

To answer @jipanyang's question:

> How are the FDB entries restored? It looks like the FDB entries are saved in stateDB.
> Could you separate code restructuring from the feature implementation in general?
> There are quite a few changes not related to FDB restore.

The function refreshFdbEntries() will restore FDB entries from StateDB. The actual work is deserializing the entries and updating the observing orchs.
The other code changes are just refactoring. #Closed

@jipanyang
Contributor

jipanyang commented Sep 14, 2018

@qiluo-msft I couldn't find where refreshFdbEntries() is called. Mixing code refactoring with a new feature implementation makes code review pretty hard.

@lguohan Regarding "we plan to do this reconciliation by disabling hw learn during warm-reboot on the asic": in that case, unplanned warm restart won't work, because swss won't be able to signal the ASIC before an unplanned restart. #Resolved

@lguohan
Contributor

lguohan commented Sep 17, 2018

@jipanyang, unplanned restart is not a goal here. It is very difficult to recover the system after a crash. #Resolved

@lguohan
Contributor

lguohan commented Sep 17, 2018

@qiluo-msft, please let us know when refreshFdbEntries() is actually used so that we can review the code; both Jipan and I asked the same question. There is not much to review for now. #Resolved

@qiluo-msft
Contributor Author

In the latest iteration, all warm start logic is moved into bake(), which is called by orchdaemon.


In reply to: 420544357 [](ancestors = 420544357)

@qiluo-msft
Contributor Author

In the latest iteration, all warm start logic is moved into bake(), which is called by orchdaemon. The refactoring has been removed.


In reply to: 421503216 [](ancestors = 421503216)

@qiluo-msft
Contributor Author

qiluo-msft commented Sep 19, 2018

Please help review and meanwhile I will fix the unit test. #Closed

@sonic-net sonic-net deleted a comment from lguohan Sep 19, 2018
}
WarmStart::setWarmStartState("orchagent", WarmStart::RESTORED);
return true;
return ts.empty();
Contributor

@jipanyang jipanyang Sep 19, 2018

true is 1. "return !ts.empty();" or do further update? #Resolved

Contributor Author

If ts is empty, the warm restore is successful.


In reply to: 218952757 [](ancestors = 218952757)

Contributor

@jipanyang jipanyang Sep 19, 2018

Got it, thanks #Resolved
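
The return-value convention discussed above can be illustrated with a small sketch. It assumes, and this is an assumption not confirmed by the diff excerpt, that `ts` collects the tasks still pending in each orch after the restore pass. `FakeOrch`, `dump_pending_tasks`, and `warm_restore_validation` are hypothetical Python stand-ins for the C++ code.

```python
class FakeOrch:
    """Hypothetical orch that only remembers which tasks it could not process."""

    def __init__(self, pending):
        self.pending = list(pending)

    def dump_pending_tasks(self):
        return list(self.pending)


def warm_restore_validation(orchs):
    """Return True only when no orch has leftover tasks after restore."""
    ts = []
    for orch in orchs:
        ts.extend(orch.dump_pending_tasks())
    # An empty list means every restored entry was consumed, i.e. success.
    return len(ts) == 0
```

Under that assumption, `return ts.empty();` reads as "succeeded because nothing is left over", which is why it is not negated.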

std::vector<FieldValueTuple> fvs;
fvs.push_back(FieldValueTuple("port", portName));
fvs.push_back(FieldValueTuple("type", "dynamic"));
m_fdbStateTable.set(key, fvs);
Contributor

@jipanyang jipanyang Sep 19, 2018

If FDB entries are saved to stateDB, will stateDB be restored during system warm reboot? #Resolved

Contributor Author

Do you mean: will stateDB in Redis persist across warm reboot? Yes, it should.

Or do you mean: will fdborch restore from stateDB? Yes, it restores in bake().


In reply to: 218964895 [](ancestors = 218964895)

Contributor

@lguohan lguohan Sep 19, 2018

stateDB will be cleared during system warm reboot. That is a problem. #Resolved

Contributor

@lguohan lguohan Sep 19, 2018

I will save the FDB entries from stateDB in the warm-reboot script and restore them on the boot-up path. @jipanyang, thanks for the catch. #Pending
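
A rough sketch of that save/restore idea, for illustration only. It models stateDB as a plain dictionary and uses a JSON file as the carrier; the function names, the `FDB_TABLE` key prefix with a `|` table separator, and the file format are all assumptions, not what the warm-reboot script actually does.

```python
import json


def save_fdb_state(state_db, path, prefix="FDB_TABLE"):
    """Dump only the FDB entries of a dict-like state DB to a JSON file."""
    entries = {k: v for k, v in state_db.items() if k.startswith(prefix)}
    with open(path, "w") as f:
        json.dump(entries, f)
    return len(entries)


def restore_fdb_state(state_db, path):
    """Load previously saved FDB entries back into a (freshly cleared) state DB."""
    with open(path) as f:
        entries = json.load(f)
    state_db.update(entries)
    return len(entries)
```

The save step would run before stateDB is cleared, and the restore step before orchagent starts, so that the warm start logic finds the entries in place.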

Contributor

@jipanyang jipanyang Sep 19, 2018

Would it be easier if the data is put in a separate table in appDB so no special handling is needed for stateDB? #Resolved

Contributor Author

I agree it would be easier as an implementation detail. However, stateDB is designed for SWSS state, while applDB is much older and was originally designed for inter-process communication between SWSS applications. So I decided to store the entries in stateDB.

BTW, there is already an FDB_TABLE (ProducerStateTable) in applDB for user custom FDB entries.


In reply to: 218993140 [](ancestors = 218993140)

[],
[("SAI_FDB_ENTRY_ATTR_TYPE", "SAI_FDB_ENTRY_TYPE_DYNAMIC"),
("SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID", iface_2_bridge_port_id["Ethernet64"]),
]
Contributor

@jipanyang jipanyang Sep 20, 2018

Was the entry removed during the orchagent restart? If not, another check, such as the CRM counter value, might be better. #Resolved

Contributor Author

If you are asking about the entry in ASIC DB, no. On warm start, orchagent creates the dynamic FDB entry, and syncd's apply view matches it against the old view and keeps it.
I will add a CRM test.


In reply to: 219020615 [](ancestors = 219020615)


finally:
# disable warm restart
# TODO: use cfg command to config it
Contributor

@jipanyang jipanyang Sep 20, 2018

"config warm_restart enable swss" is available. #Pending

Contributor Author

I am following the tests in test_warm_reboot.py. I will update this later if those tests are updated.


In reply to: 219020740 [](ancestors = 219020740)


try:
# restart orchagent
dvs.runcmd(['sh', '-c', 'supervisorctl restart orchagent'])
Contributor

@jipanyang jipanyang Sep 20, 2018

Would it be better to do swss start/stop? If it is for warm restart testing, test_warm_reboot.py might be a good place for all related test cases.

```
# start processes in SWSS
def start_swss(dvs):
    dvs.runcmd(['sh', '-c', 'supervisorctl start orchagent; supervisorctl start portsyncd; supervisorctl start intfsyncd; \
        supervisorctl start neighsyncd; supervisorctl start intfmgrd; supervisorctl start vlanmgrd; \
        supervisorctl start buffermgrd; supervisorctl start arp_update'])

# stop processes in SWSS
def stop_swss(dvs):
    dvs.runcmd(['sh', '-c', 'supervisorctl stop orchagent; supervisorctl stop portsyncd; supervisorctl stop intfsyncd; \
        supervisorctl stop neighsyncd;  supervisorctl stop intfmgrd; supervisorctl stop vlanmgrd; \
        supervisorctl stop buffermgrd; supervisorctl stop arp_update'])
```

#Resolved

Contributor Author

I changed it to restart swss.
The test also covers FDB, so I kept it here.


In reply to: 219020965 [](ancestors = 219020965)

string portName = port.m_alias;

// ref: https://github.com/Azure/sonic-swss/blob/master/doc/swss-schema.md#fdb_table
string key = "Vlan" + to_string(vlan_id) + ":" + mac.to_string();
Contributor

@lguohan lguohan Sep 21, 2018

Regarding the ":" separator: should this be '|'? #Resolved

Contributor Author

There is currently a trick: all entries stored in the StateDB FDB_TABLE are fed into ApplDB during warm start, so I keep the same key pattern in both.


In reply to: 219357862 [](ancestors = 219357862)
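
To make the key pattern concrete, here is a small Python sketch of building and parsing a `Vlan<id>:<mac>` key as in the C++ line under review. The helper names `make_fdb_key` and `parse_fdb_key` are hypothetical. Because the MAC address itself contains colons, the parser splits only on the first one.

```python
def make_fdb_key(vlan_id, mac):
    """Build the key in the same 'Vlan<id>:<mac>' pattern used by the diff above."""
    return f"Vlan{vlan_id}:{mac}"


def parse_fdb_key(key):
    """Split a key back into (vlan_id, mac); only the first ':' is a separator."""
    vlan, mac = key.split(":", 1)
    if not vlan.startswith("Vlan"):
        raise ValueError(f"not an FDB key: {key}")
    return int(vlan[len("Vlan"):]), mac
```

For example, `parse_fdb_key(make_fdb_key(2, "52:54:00:25:06:e9"))` gives back `(2, "52:54:00:25:06:e9")`.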

Contributor

I still feel it is more convenient to save the data in appDB using a new table such as ASIC_FDB_TABLE, so as to eliminate all the exception handling (stateDB save/restore during system warm reboot, and the special ":" separator in stateDB).


size_t refilled = consumer->refillToSync(&m_fdbStateTable);
SWSS_LOG_NOTICE("Add warm input FDB State: %s, %zu", APP_FDB_TABLE_NAME, refilled);
return true;
Contributor

@lguohan lguohan Sep 21, 2018

Does this trigger an FDB notification? Is the consumer here the app DB? How do the FDB entries in state DB get translated into FDB notifications? #Resolved

Contributor Author

  1. Are you asking whether syncd receives this FDB entry and sends orchagent a notification? No.
  2. Yes, the consumer is the app DB.
  3. They are never translated.

In reply to: 219421390 [](ancestors = 219421390)
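
The behaviour described in this answer can be pictured with a simplified model. This is not the actual `Consumer::refillToSync` implementation, which is C++ in sonic-swss; the Python class below and its `to_sync` dictionary are assumptions made purely to illustrate the idea. Stored entries are queued as ordinary SET operations on the app DB consumer, and no FDB notification is generated along the way.

```python
class Consumer:
    """Simplified, hypothetical model of an orch consumer with a pending-work queue."""

    def __init__(self):
        # key -> (operation, fields); processed later by the owning orch
        self.to_sync = {}

    def refill_to_sync(self, table):
        """Queue every entry of a dict-like table as a SET and return how many were queued."""
        refilled = 0
        for key, fields in table.items():
            self.to_sync[key] = ("SET", dict(fields))
            refilled += 1
        return refilled
```

In this model the StateDB entries look, to the orch, exactly like entries a user had written to the app DB FDB table, which is why no translation into notifications is needed.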

Signed-off-by: Qi Luo <[email protected]>
@sonic-net sonic-net deleted a comment from lguohan Sep 23, 2018
@lguohan lguohan merged commit f1e1109 into sonic-net:master Sep 24, 2018
@qiluo-msft qiluo-msft deleted the qiluo/fdbstate branch September 24, 2018 17:56
@qiluo-msft qiluo-msft restored the qiluo/fdbstate branch September 24, 2018 17:56
oleksandrivantsiv pushed a commit to oleksandrivantsiv/sonic-swss that referenced this pull request Mar 1, 2023
Janetxxx pushed a commit to Janetxxx/sonic-swss that referenced this pull request Nov 10, 2025
* Refactor

Signed-off-by: Qi Luo <[email protected]>

* Refactor

Signed-off-by: Qi Luo <[email protected]>

* Refactor

Signed-off-by: Qi Luo <[email protected]>

* Refactor

Signed-off-by: Qi Luo <[email protected]>

* Store fdb notification immediately after popping from NotificationConsumer

Signed-off-by: Qi Luo <[email protected]>

* Add syncUpFdb()

Signed-off-by: Qi Luo <[email protected]>

* Refactor

Signed-off-by: Qi Luo <[email protected]>

* Add vlan ping test

Signed-off-by: Qi Luo <[email protected]>

* Refine test: verify AsicDB and StateDB on fdb entries

Signed-off-by: Qi Luo <[email protected]>

* Refactor test

* Add FDB state sync up and test

Signed-off-by: Qi Luo <[email protected]>

* Revert many refactoring, and store FDB notification with port/vlan info

Signed-off-by: Qi Luo <[email protected]>

* Fix test

Signed-off-by: Qi Luo <[email protected]>

* Refine macro name

Signed-off-by: Qi Luo <[email protected]>

* Add test on CRM counters

Signed-off-by: Qi Luo <[email protected]>

* Restart swss in test

Signed-off-by: Qi Luo <[email protected]>

* Fix bugs: increase iterator before erase, no double dec CRM counter

Signed-off-by: Qi Luo <[email protected]>

* Fix test

Signed-off-by: Qi Luo <[email protected]>

* Split fdb tests into 2 files

Signed-off-by: Qi Luo <[email protected]>

4 participants