Make Beacon Network more robust #1815

@morph-dev

While making the changes required for Pectra, learning in detail how Beacon works (and how we should use it for ephemeral content), and doing the recent post-Pectra deployment, I discovered that the Beacon network is not very robust.

Here are some of the things that I think we should improve/fix, in no particular order:

1. Sync

Light Client sync still fails from time to time. If we are not following the head of the chain, we can't support ephemeral content.

I think that Light Client sync should be a prerequisite for starting the rest of the Portal Network activity.

Two main issues cause sync to fail (from my observation):

  • A fresh client can fail because it can't find the Bootstrap or one of the LightClientUpdate objects for a period (~27h) since the Bootstrap.
    • This might not be a big problem because:
      • The user can always specify a trusted root that is close to the head of the chain (or we can obtain one from centralized sources)
      • If we fix the other problems, the network will become more robust, content will be available on the network, and this problem will go away
  • When syncing (even after just restarting trin), sync will fail if we can't obtain the LightClientOptimisticUpdate that corresponds to the most recent slot
    • This is somewhat related to issue 2 below
    • We can loosen this restriction and require an update that is not necessarily at the very head of the chain, but close to it (e.g. within the last 32 slots)
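
A minimal sketch of what the loosened check could look like. The constant and function names are hypothetical (not existing trin code), and the tolerance of 32 slots is just the example value from above:

```rust
/// Hypothetical tolerance: how many slots behind the wall-clock head an
/// optimistic update may be while still being accepted to finish sync.
const MAX_SLOTS_BEHIND_HEAD: u64 = 32;

/// Returns true if an update at `update_slot` is recent enough, given that
/// the wall-clock head is at `current_slot`.
fn is_close_enough_to_head(update_slot: u64, current_slot: u64) -> bool {
    // `checked_sub` yields None for updates that claim to be from the
    // future, which are rejected.
    current_slot
        .checked_sub(update_slot)
        .map_or(false, |behind| behind <= MAX_SLOTS_BEHIND_HEAD)
}
```

With this, sync would succeed with any update in the range `[current_slot - 32, current_slot]` instead of requiring the one for the exact head slot.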

2. Keeping up with the head of the chain

It's unclear why, but the LightClient sometimes lags behind the head of the chain, sometimes by as much as 5-10 minutes.

One of the reasons is that we always try to get the most recent LightClientOptimisticUpdate, while I believe we should try to get any that is more recent than the one we know.
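
Roughly, the selection rule I have in mind is sketched below. `OptimisticUpdate` is a hypothetical, simplified stand-in for the real LightClientOptimisticUpdate type, and `pick_newer_update` is not an existing function:

```rust
/// Hypothetical, simplified stand-in for a LightClientOptimisticUpdate.
#[derive(Debug, Clone, PartialEq, Eq)]
struct OptimisticUpdate {
    slot: u64,
}

/// Picks the newest candidate that is strictly newer than `known_slot`.
/// Returns None only if no candidate makes progress at all.
fn pick_newer_update(
    known_slot: u64,
    candidates: &[OptimisticUpdate],
) -> Option<&OptimisticUpdate> {
    candidates
        .iter()
        .filter(|update| update.slot > known_slot)
        .max_by_key(|update| update.slot)
}
```

The point is that any candidate that moves us forward is accepted, so a missing update for the very latest slot no longer stalls the client.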

3. Random gossip / retrieval

According to the spec, the Beacon network should use random gossip and retrieval.
I didn't check, but I believe that we do neighbourhood gossip/retrieval, because I think the Overlay Service doesn't have both implemented (or a way for us to say which one to use).

This might not be a big problem, as long as we are consistent. But it hurts some other parts, see 4 below.

4. Light-Client should be a first-class citizen in trin

Currently the light-client implementation (helios fork) sits separately on its own and uses the Portal Network as a replacement for an HTTP endpoint.

This makes interaction between the Portal Network and the LightClient harder than it should be, and there are a few downsides, for example:

Locking issue

Every slot (more precisely, at slot_timestamp + 8 secs), we call the light client's advance() function (an operation that fetches the most recent finality and optimistic updates and updates internal state). During the entire process, we hold a lock (code), and this can take a while because we frequently have to do RecursiveFindContent (because we don't do random gossip, we are unlikely to have the content available locally).

That means that if we need any information from the LightClient (which we do when we are offered some content and have to verify it), we have to wait for this lock to be released.
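
One possible shape for avoiding this, as a sketch only: do the slow fetch without holding the lock and take the write lock just to apply the result. The real code is async and more involved; `LightClientState`, the `fetch` closure (standing in for the network lookup) and this `advance` signature are all hypothetical and use `std::sync` for simplicity:

```rust
use std::sync::RwLock;

/// Hypothetical, simplified light-client state.
#[derive(Debug, Default)]
struct LightClientState {
    optimistic_slot: u64,
}

/// Advances the state without holding the lock during the slow fetch.
/// `fetch` receives the currently known slot and returns a newer slot,
/// if it found one. Returns true if the state was updated.
fn advance<F>(state: &RwLock<LightClientState>, fetch: F) -> bool
where
    F: FnOnce(u64) -> Option<u64>,
{
    // Short read lock: copy what the fetch needs; the guard is dropped
    // at the end of this statement.
    let known_slot = state.read().unwrap().optimistic_slot;

    // Slow part (network lookup), no lock held.
    let Some(new_slot) = fetch(known_slot) else {
        return false;
    };

    // Short write lock: re-check, because the state may have changed
    // while we were fetching.
    let mut guard = state.write().unwrap();
    if new_slot > guard.optimistic_slot {
        guard.optimistic_slot = new_slot;
        true
    } else {
        false
    }
}
```

Validation of offered content would then only contend with the brief read and write sections, not with the whole lookup.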

Async

Accessing the most recent known finalized/optimistic header and/or historical summaries shouldn't require an async call, as we should have this info available in memory and ready. Fixing this probably means refactoring how HeaderOracle works as well, but making LightClient a first-class member of the beacon network should help.
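
As an illustration of the kind of access I mean, here is a sketch of an in-memory snapshot with a plain synchronous getter. `HeadSnapshot` and `HeadCache` are hypothetical names, and the snapshot only carries slots to keep the example small:

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical snapshot of what the light client currently knows.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct HeadSnapshot {
    finalized_slot: u64,
    optimistic_slot: u64,
}

/// Cheaply cloneable handle shared between the light client (writer)
/// and validators (readers).
#[derive(Clone)]
struct HeadCache(Arc<RwLock<HeadSnapshot>>);

impl HeadCache {
    fn new(initial: HeadSnapshot) -> Self {
        Self(Arc::new(RwLock::new(initial)))
    }

    /// Synchronous read: copies the snapshot out, so the lock is held
    /// only for the duration of the copy.
    fn get(&self) -> HeadSnapshot {
        *self.0.read().unwrap()
    }

    /// Called by the light client after it has applied an update.
    fn set(&self, snapshot: HeadSnapshot) {
        *self.0.write().unwrap() = snapshot;
    }
}
```

Readers call `get()` without awaiting anything, and the light client calls `set()` once per applied update.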

5. Beacon network content validation

While making changes, I observed several issues with BeaconValidator. Some of the issues:

  • we might have to validate content without relying on the LightClient being in sync, and currently we can't do that
  • we don't always check that the content value matches the content key, or that the merkle proof is correct
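
For the key/value check, a sketch of the minimum I would expect, using an optimistic update keyed by slot as the example. All types here are hypothetical simplifications, not the actual BeaconValidator or content key types, and merkle proof verification is out of scope for this snippet:

```rust
/// Hypothetical, simplified content key for an optimistic update.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct OptimisticUpdateKey {
    slot: u64,
}

/// Hypothetical, simplified decoded content value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct DecodedOptimisticUpdate {
    slot: u64,
}

#[derive(Debug, PartialEq, Eq)]
enum ValidationError {
    KeyValueMismatch { key_slot: u64, value_slot: u64 },
}

/// Rejects content whose decoded value does not belong to the key it
/// was offered or stored under.
fn check_value_matches_key(
    key: OptimisticUpdateKey,
    value: DecodedOptimisticUpdate,
) -> Result<(), ValidationError> {
    if key.slot == value.slot {
        Ok(())
    } else {
        Err(ValidationError::KeyValueMismatch {
            key_slot: key.slot,
            value_slot: value.slot,
        })
    }
}
```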

6. Fork dependency

Most of the Beacon and LightClient logic assumes that the content format matches the current fork (used to be Deneb, now Electra). That makes a smooth fork transition impossible.

This includes the Beacon bridge, validator, store and some parts of the LightClient.
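
A sketch of the direction I would go: derive the fork from the slot of the content instead of assuming the current one. `Fork`, `ForkSchedule` and `fork_at_slot` are hypothetical names, only two forks are modelled, and the activation epoch is passed in rather than hardcoded:

```rust
/// Number of slots per epoch on mainnet.
const SLOTS_PER_EPOCH: u64 = 32;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Fork {
    Deneb,
    Electra,
}

/// Hypothetical fork schedule; the activation epoch comes from the
/// network config instead of being baked into the decoding logic.
struct ForkSchedule {
    electra_epoch: u64,
}

impl ForkSchedule {
    /// Returns the fork whose content format applies at `slot`.
    fn fork_at_slot(&self, slot: u64) -> Fork {
        let epoch = slot / SLOTS_PER_EPOCH;
        if epoch >= self.electra_epoch {
            Fork::Electra
        } else {
            Fork::Deneb
        }
    }
}
```

The bridge, validator and store could then pick the decoder based on `fork_at_slot(content_slot)`, so content from both sides of a fork boundary is handled during the transition.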
