Incident report from 3Box Labs team
The following issue was observed in our js-ipfs nodes, which serve the Ceramic network. We have one shared pubsub topic across these nodes (and nodes run by other parties in the Ceramic network).
- Version:
"libp2p": "^0.32.0",
"libp2p-gossipsub": "^0.11.1",
"ipfs-core": "~0.7.1",
"ipfs-core-types": "~0.5.1",
"ipfs-http-gateway": "~0.4.2",
"ipfs-http-server": "~0.5.1"-
Platform:
We run js-ipfs in a Docker container hosted on AWS Fargate on Ubuntu Linux. The container task is provisioned with 4096 CPU units (4 vCPU) and 8192 MiB of RAM. An application load balancer sits in front of the instance to send/receive internet traffic.
Our ipfs daemon is configured with the DHT disabled and pubsub enabled. We swarm connect via wss multiaddresses.
- Subsystem:
pubsub
Severity:
High
Description:
Nodes in the Ceramic network use js-libp2p-pubsub, as exposed by js-ipfs, to propagate updates about data in the network. On Nov 30 we observed some very strange behavior in the pubsub system that caused our nodes to crash repeatedly, even as they were restarted.
What we observed
Over a period of roughly 6 hours we observed multiple pubsub messages being sent over and over again in the network. We know these are duplicates because they share the same seqno (a field generated by pubsub). The peers that create the messages also include a timestamp in the message payload when they are first sent, and we observed messages created at the beginning of the 6-hour window still being propagated at the end of it.
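For illustration, here is a rough sketch of how duplicates can be tallied by keying on the seqno of each delivered message. This is not our actual monitoring code; the function and topic name are hypothetical, and an ipfs-core instance is assumed.

```js
// Hypothetical helper: count how many times each seqno is delivered to this node.
// `ipfs` is assumed to be an ipfs-core instance; the topic name is a placeholder.
async function watchForDuplicates (ipfs, topic = '/ceramic/example-topic') {
  const seen = new Map()

  await ipfs.pubsub.subscribe(topic, (msg) => {
    // msg.seqno is a Uint8Array; hex-encode it so it can serve as a map key
    const id = Buffer.from(msg.seqno).toString('hex')
    const count = (seen.get(id) || 0) + 1
    seen.set(id, count)

    if (count > 1) {
      console.log(`duplicate seqno ${id} from ${msg.from}, seen ${count} times`)
    }
  })
}
```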
[Plot: number of duplicate pubsub messages over time, with a table of per-message duplicate counts below it]
The plot above shows the messages that were duplicated. The number of duplicate messages starts climbing after 12 PM and grows steadily until around 5 PM. We believe this is because older messages are still being sent around as new ones are introduced. As can be seen in the table below the plot, many messages are seen over 100 times by a single node.
Additional information
- By the end of this 6h window we saw over 25k messages per minute, which is a lot given our normal steady state of around 100 messages per minute.
- We believe there are only ~20 nodes on our network subscribed to the pubsub topic in question
- The js-ipfs nodes we run for Ceramic run with the DHT turned off. We instead maintain a "peerlist" containing the swarm multiaddresses of every node on the network, and on startup we manually swarm connect to all of those peers (a rough sketch of this startup flow follows this list).
- We observed that two multiaddresses in the above peerlist shared the same PeerId. We are 90% certain that these nodes do not share the same private key and instead one of these peers simply incorrectly copied their multiaddress.
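For context, here is a minimal sketch of what that startup flow looks like. The config shape follows the libp2p 0.32-era option format and the peerlist contents are placeholder values, not our exact production setup.

```js
const IPFS = require('ipfs-core')

async function start () {
  // DHT disabled, pubsub (gossipsub) enabled -- roughly how our daemon is configured
  const ipfs = await IPFS.create({
    libp2p: {
      config: {
        dht: { enabled: false },   // no DHT, so peer discovery is manual
        pubsub: { enabled: true }
      }
    }
  })

  // "peerlist" of wss multiaddresses for every node on the network (placeholder values)
  const peerlist = [
    '/dns4/node-1.example.com/tcp/4012/wss/p2p/QmExamplePeerId1',
    '/dns4/node-2.example.com/tcp/4012/wss/p2p/QmExamplePeerId2'
  ]

  // On startup, manually swarm connect to every peer in the list
  for (const addr of peerlist) {
    try {
      await ipfs.swarm.connect(addr)
    } catch (err) {
      console.warn(`failed to connect to ${addr}`, err)
    }
  }

  return ipfs
}
```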
Open questions
We still don't really know what the inciting incident was that kicked off this whole event. We've been running for months now without ever seeing anything like this. There haven't been any changes to our ipfs deployment or setup in several weeks/months, nor any changes to the Ceramic code affecting our use of pubsub or IPFS in that time.
We also don't really know how or why the system recovered 6 hours after the problem started. Question for libp2p maintainers: are there any 6-hour timeouts anywhere in the system that might be relevant?
Happy to provide any additional information that would be useful to decipher what happened here, let us know!
cc @stbrody @v-stickykeys @smrz2001 @zachferland
Paging @vasco-santos @wemeetagain, maybe you have some insights as to what could have been going on here?