
Same pubsub messages propagated multiple times #1043

@oed

Description


Incident report from 3Box Labs team

The following issue was observed in our js-ipfs nodes, which serve the Ceramic network. We have one shared pubsub topic across these nodes (and nodes run by other parties in the Ceramic network).

  • Version:
"libp2p": "^0.32.0",
"libp2p-gossipsub": "^0.11.1",
"ipfs-core": "~0.7.1",
"ipfs-core-types": "~0.5.1",
"ipfs-http-gateway": "~0.4.2",
"ipfs-http-server": "~0.5.1"
  • Platform:
    We run js-ipfs in a Docker container hosted on AWS Fargate on Ubuntu Linux. The container instance has 4096 CPU units and 8192 MiB of RAM. An application load balancer sits in front of the instance to send/receive internet traffic.
    Our ipfs daemon is configured with the DHT disabled and pubsub enabled. We swarm connect via wss multiaddresses.

  • Subsystem:
    pubsub

Severity:

High

Description:

Nodes in the Ceramic network use js-libp2p-pubsub, as exposed by js-ipfs, to propagate updates about data in the network. On Nov 30 we observed some very strange behavior in the pubsub system that caused our nodes to crash repeatedly, even as they were restarted.

What we observed

In a time period of roughly 6h we observed multiple pubsub messages being sent over and over again in the network. We know these are duplicates because they share the same seqno (a field generated by pubsub). The peers that create the messages also include a timestamp in the message payload when they are first sent, and we observed messages created at the beginning of the 6h time period still being propagated at the end of it.
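To illustrate the dedup logic at play here, the sketch below shows a minimal seen-cache keyed by (from, seqno) with a TTL, which is the general mechanism gossipsub-style pubsub uses to drop duplicates. This is an illustrative assumption about the mechanism, not the actual js-libp2p-gossipsub code; the class and method names are hypothetical.

```javascript
// Minimal sketch (hypothetical, not the real gossipsub implementation):
// a message is a duplicate if its (from, seqno) pair was already seen
// within the cache's TTL window.
class SeenCache {
  constructor(ttlMs) {
    this.ttlMs = ttlMs;
    this.seen = new Map(); // msgId -> expiry timestamp (ms)
  }

  // Returns true on first sighting within the TTL, false for duplicates.
  observe(from, seqno, now = Date.now()) {
    const id = `${from}:${seqno}`;
    const expiry = this.seen.get(id);
    if (expiry !== undefined && expiry > now) return false; // duplicate
    this.seen.set(id, now + this.ttlMs);
    return true;
  }
}

const cache = new SeenCache(120_000);
console.log(cache.observe('peerA', '0001')); // true  (first sighting)
console.log(cache.observe('peerA', '0001')); // false (duplicate dropped)
```

If the real seen-cache TTL is shorter than the time a message keeps circulating, an old message can be re-accepted and re-forwarded as if it were new, which would be consistent with what we observed.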

[Plot: count of duplicate pubsub messages over time, with a table of per-message sighting counts below.]
In the plot above, the duplicated messages are shown. The number of duplicate messages starts rising after 12 PM and grows steadily until around 5 PM. We believe this is because older messages are still being sent around as new ones are introduced. As can be seen in the table below the plot, many messages were seen over 100 times by a single node.
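For reference, counts like those in the table can be derived by grouping a node's received-message log by seqno. This is a hedged sketch of our analysis approach, not production code; the input shape is an assumption.

```javascript
// Sketch: given one node's log of received messages, count how many
// times each seqno was seen, sorted with the heaviest repeats first.
function countDuplicates(observations) {
  // observations: array of { seqno: string, receivedAt: number }
  const counts = new Map();
  for (const { seqno } of observations) {
    counts.set(seqno, (counts.get(seqno) ?? 0) + 1);
  }
  // Sort descending by count so heavily duplicated messages surface first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

const log = [
  { seqno: 'aa01', receivedAt: 1 },
  { seqno: 'aa01', receivedAt: 2 },
  { seqno: 'bb02', receivedAt: 3 },
  { seqno: 'aa01', receivedAt: 4 },
];
console.log(countDuplicates(log)); // [ [ 'aa01', 3 ], [ 'bb02', 1 ] ]
```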

Additional information

  • By the end of this 6h window we saw over 25k messages per minute, far above our normal steady state of around 100.
  • We believe there are only ~20 nodes on our network subscribed to the pubsub topic in question.
  • The js-ipfs nodes we run for Ceramic run with the DHT turned off. We instead maintain a "peerlist" containing the swarm peers of every node on the network, and on startup we manually swarm connect to all of them.
  • We observed that two multiaddresses in the above peerlist shared the same PeerId. We are 90% certain that these nodes do not share the same private key, and that one of the operators instead simply copied another node's multiaddress by mistake.
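A shared PeerId in the peerlist can be caught at startup by grouping entries on the trailing /p2p/&lt;id&gt; component of each multiaddr. The function below is a hypothetical check we could add to our connect script; the example addresses are made up.

```javascript
// Hypothetical startup check: flag peerlist entries whose multiaddrs
// claim the same PeerId (the component after the final /p2p/).
function findSharedPeerIds(multiaddrs) {
  const byId = new Map();
  for (const addr of multiaddrs) {
    const m = addr.match(/\/p2p\/([^/]+)$/);
    if (!m) continue; // skip entries without a PeerId suffix
    const id = m[1];
    if (!byId.has(id)) byId.set(id, []);
    byId.get(id).push(addr);
  }
  // Keep only PeerIds claimed by more than one address.
  return new Map([...byId].filter(([, addrs]) => addrs.length > 1));
}

// Made-up peerlist entries for illustration:
const peerlist = [
  '/dns4/a.example.com/tcp/443/wss/p2p/QmPeerOne',
  '/dns4/b.example.com/tcp/443/wss/p2p/QmPeerTwo',
  '/dns4/c.example.com/tcp/443/wss/p2p/QmPeerOne', // same PeerId as a.example.com
];
console.log(findSharedPeerIds(peerlist).size); // 1
```

Two distinct hosts announcing the same PeerId could plausibly confuse peer-scoped state (connections, seen-message attribution), so a check like this seems worth running before swarm connecting.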

Open questions

We still don't really know what the inciting incident was that kicked off this whole event. We've been running for months now without ever seeing anything like this. There haven't been any changes to our ipfs deployment or setup in several weeks/months, nor have there been any changes to the ceramic code that affects our use of pubsub or IPFS in several weeks/months either.

We also don't really know how or why the system recovered 6 hours after the problem started. Question for libp2p maintainers: are there any 6-hour timeouts anywhere in the system that might be relevant?

Happy to provide any additional information that would be useful to decipher what happened here, let us know!

cc @stbrody @v-stickykeys @smrz2001 @zachferland

Paging @vasco-santos @wemeetagain, maybe you have some insights as to what could have been going on here?
