@KyleAMathews
Contributor

No description provided.

This commit implements a fallback mode that allows Electric to serve
shape data even when logical replication is not available. When the
replication client is not ready, shape requests will query the database
directly (similar to initial snapshots) and return the data to clients.

Key changes:

Server-side:
- StatusMonitor: Track replication availability in status response
- Api: Detect fallback mode and serve data via direct DB queries
- Request/Response: Add fallback_mode field to track request state
- Response headers: Add 'electric-fallback-mode' header to indicate
  fallback polling mode to clients

Client-side:
- Add FALLBACK_MODE_HEADER constant for detecting fallback responses

Implementation details:
- When replication_client_ready is false, requests enter fallback mode
- Fallback requests use Shapes.query_subset to query DB directly
- Data is formatted as insert operations in the shape log format
- Responses include up_to_date=true and fallback_mode=true
- Clients receive 'electric-fallback-mode: true' header
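The response shape described above can be sketched as follows. This is only an illustration in TypeScript, not the Elixir server code from this PR: `buildFallbackResponse`, the `Row` type, and the key format are hypothetical, and the message layout is an assumption modelled on the shape log format (change messages followed by an up-to-date control message).

```typescript
// Hypothetical sketch: names and key format are illustrative assumptions,
// not the actual Electric server implementation.
type Row = Record<string, unknown>

interface ChangeMessage {
  key: string
  value: Row
  headers: { operation: 'insert' }
}

interface ControlMessage {
  headers: { control: 'up-to-date' }
}

type LogEntry = ChangeMessage | ControlMessage

interface FallbackResponse {
  headers: Record<string, string>
  body: LogEntry[]
}

// Header name taken from the PR description.
const FALLBACK_MODE_HEADER = 'electric-fallback-mode'

function buildFallbackResponse(
  table: string,
  primaryKey: string,
  rows: Row[]
): FallbackResponse {
  // Every row returned by the direct DB query becomes an insert operation.
  const inserts = rows.map(
    (row): ChangeMessage => ({
      key: `${table}/${String(row[primaryKey])}`, // assumed key format
      value: row,
      headers: { operation: 'insert' },
    })
  )
  // The log ends with an up-to-date marker so the client stops waiting.
  const upToDate: ControlMessage = { headers: { control: 'up-to-date' } }
  return {
    headers: { [FALLBACK_MODE_HEADER]: 'true' },
    body: [...inserts, upToDate],
  }
}
```

Because every row is emitted as an insert, a client that already holds data would presumably need to treat a fallback response as a fresh snapshot rather than as an incremental update.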

This allows clients to continue receiving data during replication
failures and provides a foundation for polling-based fallback
mechanisms with configurable intervals.

Related to implementing status monitoring and graceful degradation
when Electric cannot connect to logical replication.

This commit adds a status endpoint and client-side support for
detecting and handling fallback mode with automatic recovery.

Server-side changes:
- StatusPlug: New /v1/status endpoint returning server status
- Cache-Control: 5-second caching for CDN efficiency
- Returns: status (live/fallback/starting), replication_available,
  connection state, and shape subsystem state
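A minimal sketch of what the status payload might look like, written in TypeScript for illustration only. The PR description names `status` and `replication_available`; the `connection` and `shape` field names, the rule used in `deriveStatus`, and the exact `cache-control` directive are assumptions.

```typescript
// Hypothetical sketch of the /v1/status payload. Only `status` and
// `replication_available` are named in the PR description; the rest is assumed.
type ServerStatus = 'live' | 'fallback' | 'starting'

interface SubsystemState {
  connectionReady: boolean
  replicationClientReady: boolean
  shapesReady: boolean
}

interface StatusBody {
  status: ServerStatus
  replication_available: boolean
  connection: 'up' | 'down' // assumed field name and values
  shape: 'up' | 'down' // assumed field name and values
}

interface StatusResponse {
  headers: Record<string, string>
  body: StatusBody
}

function deriveStatus(state: SubsystemState): ServerStatus {
  // Assumed rule: nothing can be served until the DB connection and the
  // shape subsystem are up; after that, replication decides live vs fallback.
  if (!state.connectionReady || !state.shapesReady) return 'starting'
  return state.replicationClientReady ? 'live' : 'fallback'
}

function buildStatusResponse(state: SubsystemState): StatusResponse {
  return {
    headers: {
      'content-type': 'application/json',
      // Five-second caching as described; the exact directive is an assumption.
      'cache-control': 'public, max-age=5',
    },
    body: {
      status: deriveStatus(state),
      replication_available: state.replicationClientReady,
      connection: state.connectionReady ? 'up' : 'down',
      shape: state.shapesReady ? 'up' : 'down',
    },
  }
}
```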

Client-side changes (TypeScript):
- Fallback mode detection: Reads electric-fallback-mode header
- Auto status polling: Polls /v1/status every 60 seconds when in
  fallback mode
- Auto-recovery: Automatically switches back to live mode when
  server replication is restored
- Cleanup: Stops polling on unsubscribe or reset

Client behavior:
- When fallback mode detected via header, starts status polling
- Status endpoint polled every 60s (configurable)
- When server returns to "live" status, triggers shape refresh
- Reconnects to live replication automatically
- CDN caches status responses for 5s to minimize server load
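The client behaviour listed above could look roughly like the sketch below. It is not the code in this PR: `FallbackMonitor`, `fetchStatus`, and `onLive` are hypothetical names, and the status fetch and shape refresh are injected as callbacks so the polling logic stands on its own.

```typescript
// Header name and status values come from the PR description;
// the class and its options are hypothetical.
const FALLBACK_MODE_HEADER = 'electric-fallback-mode'

type ServerStatus = 'live' | 'fallback' | 'starting'

interface HeaderReader {
  get(name: string): string | null
}

interface MonitorOptions {
  fetchStatus: () => Promise<ServerStatus> // e.g. wraps GET /v1/status
  onLive: () => void // e.g. triggers a shape refresh
  intervalMs?: number // defaults to 60 seconds
}

class FallbackMonitor {
  private timer: ReturnType<typeof setInterval> | undefined
  private readonly opts: MonitorOptions
  private readonly intervalMs: number

  constructor(opts: MonitorOptions) {
    this.opts = opts
    this.intervalMs = opts.intervalMs ?? 60000
  }

  get polling(): boolean {
    return this.timer !== undefined
  }

  // Call with the headers of every shape response.
  handleResponse(headers: HeaderReader): void {
    const inFallback = headers.get(FALLBACK_MODE_HEADER) === 'true'
    if (inFallback && !this.polling) {
      this.timer = setInterval(() => void this.pollOnce(), this.intervalMs)
    } else if (!inFallback) {
      this.stop()
    }
  }

  // One status check: stops polling and notifies once the server is live.
  async pollOnce(): Promise<void> {
    try {
      const status = await this.opts.fetchStatus()
      if (status === 'live' && this.polling) {
        this.stop()
        this.opts.onLive()
      }
    } catch {
      // Status endpoint unreachable: keep polling and try again next tick.
    }
  }

  // Also called on unsubscribe or reset.
  stop(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer)
      this.timer = undefined
    }
  }
}
```

Injecting `fetchStatus` keeps the sketch independent of any particular HTTP client; in practice it would presumably wrap a request to `/v1/status` and read the `status` field.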

This provides a complete fallback solution:
1. Server detects replication unavailable → returns fallback data
2. Client detects fallback header → starts polling status
3. Server replication restored → status endpoint reflects change
4. Client polls status → detects live mode → auto-reconnects
5. Seamless transition back to real-time replication

Example usage:
```typescript
const stream = new ShapeStream({
  url: 'http://localhost:3000/v1/shape',
  params: { table: 'items' }
})

// Automatically handles fallback mode and recovery
stream.subscribe(messages => {
  // Receives data in both live and fallback modes
  console.log(messages)
})
```

CDN caching (5s cache-control) will handle the load, so clients can
poll more frequently for faster recovery when replication is restored.
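A back-of-the-envelope sketch of why the cache makes faster polling cheap. It assumes an idealized CDN in which each edge cache forwards at most one status request to the origin per TTL window; `originRequestsPerMinute` and its parameters are hypothetical.

```typescript
// Idealized model (assumption): each edge cache revalidates with the origin
// at most once per TTL window, no matter how many clients poll it.
function originRequestsPerMinute(
  clients: number,
  pollIntervalSec: number,
  cacheTtlSec: number,
  edgeCaches: number = 1
): number {
  const withoutCache = clients * (60 / pollIntervalSec)
  const cacheCeiling = edgeCaches * (60 / cacheTtlSec)
  return Math.min(withoutCache, cacheCeiling)
}

// 10,000 clients polling every 10 seconds behind a single 5-second cache:
// 60,000 requests per minute reach the CDN, at most 12 reach the origin.
console.log(originRequestsPerMinute(10000, 10, 5))
```

Under this model the origin load depends on the TTL and the number of edge caches rather than on the client polling interval, which is the reasoning behind allowing more frequent polling.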

@KyleAMathews marked this pull request as draft on November 6, 2025 19:58
@codecov

codecov bot commented Nov 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.94%. Comparing base (2747a71) to head (73ce117).
⚠️ Report is 41 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3402      +/-   ##
==========================================
+ Coverage   69.70%   73.94%   +4.23%     
==========================================
  Files         181       21     -160     
  Lines        9826      756    -9070     
  Branches      352        0     -352     
==========================================
- Hits         6849      559    -6290     
+ Misses       2975      197    -2778     
+ Partials        2        0       -2     
| Flag | Coverage Δ |
|---|---|
| elixir | 73.94% <ø> (+7.21%) ⬆️ |
| elixir-client | 73.94% <ø> (-0.53%) ⬇️ |
| packages/experimental | ? |
| packages/react-hooks | ? |
| packages/typescript-client | ? |
| packages/y-electric | ? |
| postgres-140000 | ? |
| postgres-150000 | ? |
| postgres-170000 | ? |
| postgres-180000 | ? |
| sync-service | ? |
| typescript | ? |
| unit-tests | 73.94% <ø> (+4.23%) ⬆️ |

Flags with carried forward coverage won't be shown.


Add comprehensive RFC document explaining the query fallback mode feature:

- Problem statement and motivation
- Architecture and component design
- Detailed implementation specifics
- Data flow diagrams
- Trade-offs and alternatives considered
- Testing and deployment strategy
- Performance impact analysis
- Future work and extensibility

The RFC serves as both documentation and design rationale for the
feature, covering server-side (Elixir) and client-side (TypeScript)
components.
@balegas
Contributor

balegas commented Nov 10, 2025

Mixed feelings about this one. Obviously it's nice to continue serving data, but wouldn't it provide an inconsistent experience? How would it work to retrieve snapshots without live updates? E.g. you get the first page and receive no more live updates for it, but then you get a second page that is more recent in time.

I understand we do progressive snapshots, but I presume we don't want to rely on that. Even if we do, it means we're putting more and more load on the database, with a fallback query for each page.

@msfstef
Contributor

msfstef commented Nov 10, 2025

I feel that PG replication is as reliable as a PG read replica, with the exception that we don't yet have a read-only mode that would let us serve data while the replication stream is inactive.

If we knew a replication slot exists and where we last left it, we could also create new shapes (i.e. just the snapshots) without replication active since we know that we can "resume" them once we resume replication, and that brings us closer to a read replica behaviour as well.

In short I think we should aim to provide what someone would expect from a read replica, but in change stream format.

@KyleAMathews
Contributor Author

> If we knew a replication slot exists and where we last left it, we could also create new shapes (i.e. just the snapshots) without replication active since we know that we can "resume" them once we resume replication, and that brings us closer to a read replica behaviour as well.

Ooo yes! That's a great point.

@KyleAMathews
Contributor Author

Closing this PR and converting it to issue #3470 for further discussion and design consideration. The issue includes the original RFC content, team feedback from @balegas and @msfstef, and additional context from internal discussions about resync constraints and architectural alternatives.
