While making the required changes for Pectra, learning in detail how Beacon works (and how we should use it for ephemeral content), and doing the recent post-Pectra deployment, I discovered that the Beacon network is not very robust.
Here are some of the things that I think we should improve/fix, in no particular order:
1. Sync
Light Client sync still fails from time to time. If we are not following the head of the chain, we can't support ephemeral content.
I think that Light Client sync should be a prerequisite for starting the rest of the Portal Network activity.
The two main issues that cause sync to fail (from what I have observed):
- A fresh client can fail because it can't find the Bootstrap, or one of the LightClientUpdate objects for a sync committee period (~27h) since the Bootstrap.
  - This might not be a big problem because:
    - The user can always specify a trusted root that is close to the head of the chain (or we can obtain one from centralized sources)
    - If we fix the other problems, the network will become more robust, the content will be available on the network, and this problem will go away
- When syncing (even right after restarting trin), sync fails if we can't obtain the LightClientOptimisticUpdate that corresponds to the most recent slot.
  - This is somewhat related to issue 2 below
  - We can loosen this restriction and require an update that is not necessarily at the very head of the chain, but close to it (e.g. within the last 32 slots)
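To make the loosened restriction concrete, here is a minimal sketch of a "close enough to the head" check. The constants are the mainnet consensus values (which is also where the ~27h period comes from: 256 epochs * 32 slots * 12 s = 98,304 s). The function names and the `MAX_HEAD_LAG_SLOTS` tolerance are assumptions for illustration, not existing trin code.

```rust
/// Mainnet consensus constants.
const SECONDS_PER_SLOT: u64 = 12;
const SLOTS_PER_EPOCH: u64 = 32;
const EPOCHS_PER_SYNC_COMMITTEE_PERIOD: u64 = 256;
/// 8192 slots, i.e. 98,304 seconds or roughly 27.3 hours.
const SLOTS_PER_PERIOD: u64 = SLOTS_PER_EPOCH * EPOCHS_PER_SYNC_COMMITTEE_PERIOD;

/// Hypothetical tolerance: how far behind the head an optimistic update
/// may be while still letting sync succeed (the "last 32 slots" idea).
const MAX_HEAD_LAG_SLOTS: u64 = 32;

/// Slot that the wall clock says we should be at.
fn current_slot(genesis_time: u64, now: u64) -> u64 {
    now.saturating_sub(genesis_time) / SECONDS_PER_SLOT
}

/// Sync committee period that a slot belongs to.
fn sync_committee_period(slot: u64) -> u64 {
    slot / SLOTS_PER_PERIOD
}

/// Accept an update that is at or behind the head by at most the tolerance.
/// Updates that claim to be from the future are rejected.
fn is_close_to_head(update_slot: u64, head_slot: u64) -> bool {
    update_slot <= head_slot && head_slot - update_slot <= MAX_HEAD_LAG_SLOTS
}
```

With a check like this, sync would only fail when nothing from the last 32 slots can be found, instead of failing whenever the single most recent update is missing.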
2. Keeping up with the head of the chain
It is unclear why, but LightClient sometimes lags behind the head of the chain, occasionally by as much as 5-10 minutes.
One of the reasons is that we always try to get the most recent LightClientOptimisticUpdate, while I believe we should try to get any update that is more recent than the one we know.
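A rough sketch of the "any update newer than what we know" idea follows. The `fetch` closure stands in for whatever lookup the light client performs (local store or network). Both it and `find_newer_update` are hypothetical names, not part of trin.

```rust
/// Walk back from the current slot towards the slot we already know and
/// return the first update that can be fetched. Any hit is strictly newer
/// than `known_slot`, so it always moves the client forward.
fn find_newer_update<T>(
    known_slot: u64,
    current_slot: u64,
    mut fetch: impl FnMut(u64) -> Option<T>,
) -> Option<(u64, T)> {
    let mut slot = current_slot;
    // `slot > known_slot` implies `slot >= 1`, so the decrement cannot underflow.
    while slot > known_slot {
        if let Some(update) = fetch(slot) {
            return Some((slot, update));
        }
        slot -= 1;
    }
    None
}
```

In practice the number of slots tried per round would probably need a cap, since each miss may cost a network lookup.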
3. Random gossip / retrieval
According to the spec, the Beacon network should use random gossip and retrieval.
I didn't check, but I believe that we do neighbourhood gossip/retrieval, simply because I think that the Overlay Service doesn't have both implemented (or doesn't give us a way to say which one to use).
This might not be a big problem, as long as we are consistent. But it hurts some other parts, see 4 below.
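For illustration, here is one way an overlay could let each subnetwork choose its strategy. This is only a sketch under simplifying assumptions: node and content IDs are shrunk to `u64` (real ones are 256-bit), randomness is supplied by the caller through the `pick` closure so the example needs no external crate, and none of these names come from trin's Overlay Service.

```rust
#[derive(Clone, Copy, Debug)]
enum GossipStrategy {
    /// Offer to the peers closest to the content ID (XOR distance).
    Neighborhood,
    /// Offer to peers chosen at random, regardless of distance.
    Random,
}

/// Choose up to `n` peers to gossip a piece of content to.
/// `pick(len)` should return a random index source; it is reduced modulo `len`.
fn select_gossip_targets(
    strategy: GossipStrategy,
    content_id: u64,
    peers: &[u64],
    n: usize,
    mut pick: impl FnMut(usize) -> usize,
) -> Vec<u64> {
    let mut pool = peers.to_vec();
    match strategy {
        GossipStrategy::Neighborhood => {
            pool.sort_by_key(|peer| peer ^ content_id);
            pool.truncate(n);
            pool
        }
        GossipStrategy::Random => {
            let mut chosen = Vec::new();
            while chosen.len() < n && !pool.is_empty() {
                let index = pick(pool.len()) % pool.len();
                // `swap_remove` avoids picking the same peer twice.
                chosen.push(pool.swap_remove(index));
            }
            chosen
        }
    }
}
```

The same switch would apply to retrieval: the Beacon network would be configured with the random variant, while networks that rely on radius-based storage would keep the neighbourhood one.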
4. Light-Client should be a first-class citizen in trin
Currently the light-client implementation (a helios fork) sits separately on its own and uses the Portal Network as a replacement for an HTTP endpoint.
This makes interaction between the Portal Network and LightClient awkward, and it has a few downsides, for example:
Locking issue
Every slot (more precisely, at slot_timestamp + 8 secs), we call the light client's advance() function, which fetches the most recent finality and optimistic updates and updates the internal state. We hold a lock (code) during the entire process, and this can take a while because we frequently have to do RecursiveFindContent (because we don't do random gossip, we are unlikely to have the content available locally).
That means that whenever we need any information from the LightClient (which we do when we are offered some content and have to verify it), we wait for this lock to be released.
Async
Accessing the most recent known finalized/optimistic header and/or historical summaries shouldn't require an async call, as we should have this info available in memory and ready. Fixing this probably means refactoring how HeaderOracle works as well, but making LightClient a first-class member of the Beacon network should help.
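Both points (the long-held lock and the async accessors) could be addressed with a pattern along these lines. This is a simplified, synchronous sketch using `std::sync::RwLock`. The real code is async and may use different lock types, and `HeadSnapshot`, `LightClient::snapshot`, and the `fetch` closure are hypothetical stand-ins. The idea is that the slow fetch runs without any lock held, and the write lock is taken only to swap in the result.

```rust
use std::sync::RwLock;

/// The small piece of state that validators need to read frequently.
#[derive(Clone, Debug, PartialEq)]
struct HeadSnapshot {
    optimistic_slot: u64,
    finalized_slot: u64,
}

struct LightClient {
    head: RwLock<HeadSnapshot>,
}

impl LightClient {
    fn new(initial: HeadSnapshot) -> Self {
        Self { head: RwLock::new(initial) }
    }

    /// Cheap, non-async read of the latest known state.
    /// The read lock is held only for the duration of the clone.
    fn snapshot(&self) -> HeadSnapshot {
        self.head.read().expect("lock poisoned").clone()
    }

    /// `fetch` represents the slow part (e.g. RecursiveFindContent).
    /// No lock is held while it runs, so readers are never blocked by it.
    fn advance(&self, fetch: impl FnOnce(&HeadSnapshot) -> Option<HeadSnapshot>) {
        let current = self.snapshot();
        let Some(next) = fetch(&current) else {
            return;
        };
        // Short critical section: only the swap happens under the write lock.
        let mut head = self.head.write().expect("lock poisoned");
        if next.optimistic_slot > head.optimistic_slot {
            *head = next;
        }
    }
}
```

The trade-off is that readers may briefly see a state that is one update old, which seems acceptable for content validation.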
5. Beacon network content validation
While making changes, I observed several issues with BeaconValidator. Some of the issues:
- we might have to validate content without relying on LightClient being in sync, and we can't do that
- we don't always check that the content value matches the content key, or that the merkle proof is correct
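As an example of checks that need no synced LightClient, the sketch below validates an optimistic update against its content key using only the two objects themselves. The types are heavily simplified stand-ins (real keys and values are SSZ containers), and the assumption that the key carries the signature slot should be checked against the spec. The slot-ordering rule is the one from the consensus light client spec, where the signature slot must be greater than the attested slot.

```rust
#[derive(Debug, PartialEq)]
enum ValidationError {
    /// The slot encoded in the content key differs from the one in the value.
    KeyValueMismatch { key_slot: u64, value_slot: u64 },
    /// The update was signed at or before the slot it attests to.
    InvalidSlotOrder,
}

/// Simplified stand-in for the content key.
struct OptimisticUpdateKey {
    signature_slot: u64,
}

/// Simplified stand-in for the content value.
struct OptimisticUpdate {
    attested_slot: u64,
    signature_slot: u64,
}

/// Checks that depend only on the key and the value, not on sync state.
fn validate_stateless(
    key: &OptimisticUpdateKey,
    value: &OptimisticUpdate,
) -> Result<(), ValidationError> {
    if key.signature_slot != value.signature_slot {
        return Err(ValidationError::KeyValueMismatch {
            key_slot: key.signature_slot,
            value_slot: value.signature_slot,
        });
    }
    if value.signature_slot <= value.attested_slot {
        return Err(ValidationError::InvalidSlotOrder);
    }
    Ok(())
}
```

Signature and merkle proof verification would come on top of this. The point is only that the cheap consistency checks can run first and unconditionally.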
6. Fork dependency
Most of the Beacon and LightClient logic assumes that the content format matches the current fork (it used to be Deneb, now it is Electra). That makes a smooth fork transition impossible.
This includes the Beacon bridge, validator, store, and some parts of LightClient.
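One way to avoid assuming "the current fork" is to derive the fork from the slot of the content being handled. The sketch below uses the mainnet activation epochs. `ForkName` and `fork_at_slot` are illustrative names, and only the forks from Bellatrix onwards are listed, which is an arbitrary cut-off for the example.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ForkName {
    Bellatrix,
    Capella,
    Deneb,
    Electra,
}

const SLOTS_PER_EPOCH: u64 = 32;

/// Mainnet fork activation epochs, in ascending order.
const FORK_SCHEDULE: [(u64, ForkName); 4] = [
    (144_896, ForkName::Bellatrix),
    (194_048, ForkName::Capella),
    (269_568, ForkName::Deneb),
    (364_032, ForkName::Electra),
];

/// Fork that was active at `slot`, or `None` for slots before the first
/// fork in the schedule.
fn fork_at_slot(slot: u64) -> Option<ForkName> {
    let epoch = slot / SLOTS_PER_EPOCH;
    FORK_SCHEDULE
        .iter()
        .rev()
        .find(|(activation_epoch, _)| epoch >= *activation_epoch)
        .map(|(_, fork)| *fork)
}
```

Decoding, validation, and storage could then branch on the returned fork, so content from just before and just after a fork boundary can be handled side by side during the transition.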