Add query fallback when Electric disconnects #3402

KyleAMathews · 2025-11-06T19:58:26Z

No description provided.

This commit implements a fallback mode that allows Electric to serve shape data even when logical replication is not available. When the replication client is not ready, shape requests will query the database directly (similar to initial snapshots) and return the data to clients.

Key changes:

Server-side:
- StatusMonitor: Track replication availability in status response
- Api: Detect fallback mode and serve data via direct DB queries
- Request/Response: Add fallback_mode field to track request state
- Response headers: Add 'electric-fallback-mode' header to indicate fallback polling mode to clients

Client-side:
- Add FALLBACK_MODE_HEADER constant for detecting fallback responses

Implementation details:
- When replication_client_ready is false, requests enter fallback mode
- Fallback requests use Shapes.query_subset to query DB directly
- Data is formatted as insert operations in the shape log format
- Responses include up_to_date=true and fallback_mode=true
- Clients receive 'electric-fallback-mode: true' header

This allows clients to continue receiving data during replication failures and provides a foundation for polling-based fallback mechanisms with configurable intervals.

Related to implementing status monitoring and graceful degradation when Electric cannot connect to logical replication.
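To make the "formatted as insert operations" step more concrete, here is a minimal sketch of how rows from a direct query could be turned into shape-log-style messages that end with an up-to-date control message. The real server code is Elixir and is not reproduced in this PR description. The `Row` type, the `toFallbackLog` and `fallbackHeaders` helpers, and the `table/pk` key format are assumptions for illustration only. Only the header name comes from the commit message.

```typescript
// Hypothetical sketch: helper names and the key format are illustrative,
// not Electric's actual server implementation.
type Row = Record<string, string | number | boolean | null>

interface ChangeMessage {
  key: string
  value: Row
  headers: { operation: 'insert' }
}

interface ControlMessage {
  headers: { control: 'up-to-date' }
}

type LogMessage = ChangeMessage | ControlMessage

// Header name taken from the commit message.
const FALLBACK_MODE_HEADER = 'electric-fallback-mode'

// Turn the rows returned by a direct DB query into insert messages,
// then close the batch with an up-to-date control message.
function toFallbackLog(table: string, pkColumn: string, rows: Row[]): LogMessage[] {
  const inserts = rows.map((row): ChangeMessage => ({
    key: `${table}/${String(row[pkColumn])}`, // assumed key format
    value: row,
    headers: { operation: 'insert' },
  }))
  return [...inserts, { headers: { control: 'up-to-date' } }]
}

// Response headers that mark the payload as coming from fallback mode.
function fallbackHeaders(): Record<string, string> {
  return { [FALLBACK_MODE_HEADER]: 'true' }
}
```

Because every fallback response is a full query result rather than a delta, a client that receives two such responses in a row sees the same rows as inserts twice. The sketch does not attempt to deduplicate them.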

This commit adds a status endpoint and client-side support for detecting and handling fallback mode with automatic recovery.

Server-side changes:
- StatusPlug: New /v1/status endpoint returning server status
- Cache-Control: 5-second caching for CDN efficiency
- Returns: status (live/fallback/starting), replication_available, connection state, and shape subsystem state

Client-side changes (TypeScript):
- Fallback mode detection: Reads electric-fallback-mode header
- Auto status polling: Polls /v1/status every 60 seconds when in fallback mode
- Auto-recovery: Automatically switches back to live mode when server replication is restored
- Cleanup: Stops polling on unsubscribe or reset

Client behavior:
- When fallback mode detected via header, starts status polling
- Status endpoint polled every 60s (configurable)
- When server returns to "live" status, triggers shape refresh
- Reconnects to live replication automatically
- CDN caches status responses for 5s to minimize server load

This provides a complete fallback solution:
1. Server detects replication unavailable → returns fallback data
2. Client detects fallback header → starts polling status
3. Server replication restored → status endpoint reflects change
4. Client polls status → detects live mode → auto-reconnects
5. Seamless transition back to real-time replication

Example usage:

```typescript
import { ShapeStream } from '@electric-sql/client'

const stream = new ShapeStream({
  url: 'http://localhost:3000/v1/shape',
  params: { table: 'items' }
})

// Automatically handles fallback mode and recovery
stream.subscribe(messages => {
  // Receives data in both live and fallback modes
  console.log(messages)
})
```
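The five-step recovery flow can be sketched as a small state machine. This is not the actual `ShapeStream` internals, which are not shown in this PR description. The `ClientMode`, `ClientEvent`, `ClientAction`, and `transition` names are hypothetical. Resetting to live mode on unsubscribe is also an assumption. Only the header value and the live/fallback/starting statuses come from the commit message.

```typescript
// Hypothetical client-side state machine; not the real ShapeStream code.
type ClientMode = 'live' | 'fallback'
type ServerStatus = 'live' | 'fallback' | 'starting'

type ClientEvent =
  | { type: 'shape-response'; fallbackHeader: string | null }
  | { type: 'status-polled'; status: ServerStatus }
  | { type: 'unsubscribed' }

type ClientAction = 'start-status-polling' | 'stop-status-polling' | 'refresh-shape'

interface FallbackTransition {
  mode: ClientMode
  actions: ClientAction[]
}

function transition(mode: ClientMode, event: ClientEvent): FallbackTransition {
  switch (event.type) {
    case 'shape-response':
      // Step 2: the fallback header switches the client into polling mode.
      if (mode === 'live' && event.fallbackHeader === 'true') {
        return { mode: 'fallback', actions: ['start-status-polling'] }
      }
      return { mode, actions: [] }
    case 'status-polled':
      // Step 4: only a "live" status ends fallback; "starting" keeps polling.
      if (mode === 'fallback' && event.status === 'live') {
        return { mode: 'live', actions: ['stop-status-polling', 'refresh-shape'] }
      }
      return { mode, actions: [] }
    case 'unsubscribed':
      // Cleanup: stop polling if it was running (reset to live is assumed).
      return { mode: 'live', actions: mode === 'fallback' ? ['stop-status-polling'] : [] }
  }
}
```

Keeping the transitions pure makes the recovery path testable without timers or network calls. A caller would run the returned actions against its own polling timer and stream.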

CDN caching (5s cache-control) will handle the load, so clients can poll more frequently for faster recovery when replication is restored.
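A back-of-the-envelope estimate shows why the polling interval matters little for origin load once responses are cached. The function below is a sketch under idealized assumptions: every client goes through the CDN, each cache node revalidates at most once per TTL, and concurrent misses are collapsed into one origin request. The function name and the `cacheNodes` parameter are hypothetical.

```typescript
// Idealized upper bound on /v1/status requests that reach the origin per minute.
function originStatusRequestsPerMinute(
  clients: number,
  pollIntervalSeconds: number,
  cacheTtlSeconds: number,
  cacheNodes = 1,
): number {
  // What the clients send to the CDN in total.
  const clientRequests = clients * (60 / pollIntervalSeconds)
  // What the CDN can forward: at most one refresh per TTL per cache node.
  const cacheCeiling = cacheNodes * (60 / cacheTtlSeconds)
  return Math.min(clientRequests, cacheCeiling)
}

// 10,000 clients behind one cache node with a 5s TTL:
console.log(originStatusRequestsPerMinute(10_000, 60, 5)) // 12
console.log(originStatusRequestsPerMinute(10_000, 10, 5)) // 12
```

Under these assumptions the origin sees the same ceiling at both intervals, so a shorter interval mainly shortens the time clients take to notice recovery. Real CDNs with many edge nodes or no request collapsing would forward more.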

codecov · 2025-11-06T20:00:23Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.94%. Comparing base (2747a71) to head (73ce117).
⚠️ Report is 41 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #3402      +/-   ##
==========================================
+ Coverage   69.70%   73.94%   +4.23%
==========================================
  Files         181       21     -160
  Lines        9826      756    -9070
  Branches      352        0     -352
==========================================
- Hits         6849      559    -6290
+ Misses       2975      197    -2778
+ Partials        2        0       -2
```

| Flag | Coverage Δ | |
|---|---|---|
| elixir | `73.94% <ø> (+7.21%)` | ⬆️ |
| elixir-client | `73.94% <ø> (-0.53%)` | ⬇️ |
| packages/experimental | `?` | |
| packages/react-hooks | `?` | |
| packages/typescript-client | `?` | |
| packages/y-electric | `?` | |
| postgres-140000 | `?` | |
| postgres-150000 | `?` | |
| postgres-170000 | `?` | |
| postgres-180000 | `?` | |
| sync-service | `?` | |
| typescript | `?` | |
| unit-tests | `73.94% <ø> (+4.23%)` | ⬆️ |

Flags with carried forward coverage won't be shown.

Add comprehensive RFC document explaining the query fallback mode feature:
- Problem statement and motivation
- Architecture and component design
- Detailed implementation specifics
- Data flow diagrams
- Trade-offs and alternatives considered
- Testing and deployment strategy
- Performance impact analysis
- Future work and extensibility

The RFC serves as both documentation and design rationale for the feature, covering server-side (Elixir) and client-side (TypeScript) components.

balegas · 2025-11-10T15:14:12Z

Mixed feelings about this one. Obviously it's nice to continue serving data, but wouldn't it provide an inconsistent experience? How does it work to retrieve snapshots but not have live updates? E.g. you get the first page, with no more live updates for it, but then you get a second page that is more recent in time.

I understand we do progressive snapshots, but I presume we don't want to rely on that. Even if we do, it means we're putting more and more load on the database with a fallback query for each page.

msfstef · 2025-11-10T15:36:24Z

I feel that the PG replication is as reliable as a PG read replica, with the exception that we don't yet have a read-only mode that would let us serve data while the replication stream is inactive.

If we knew a replication slot exists and where we last left it, we could also create new shapes (i.e. just the snapshots) without replication active since we know that we can "resume" them once we resume replication, and that brings us closer to a read replica behaviour as well.

In short I think we should aim to provide what someone would expect from a read replica, but in change stream format.

KyleAMathews · 2025-11-10T15:39:51Z

> If we knew a replication slot exists and where we last left it, we could also create new shapes (i.e. just the snapshots) without replication active since we know that we can "resume" them once we resume replication, and that brings us closer to a read replica behaviour as well.

Ooo yes! That's a great point.

KyleAMathews · 2025-11-19T16:28:14Z

Closing this PR and converting it to issue #3470 for further discussion and design consideration. The issue includes the original RFC content, team feedback from @balegas and @msfstef, and additional context from internal discussions about resync constraints and architectural alternatives.

claude added 3 commits November 6, 2025 19:46

chore: Change status polling interval from 60s to 10s

b189e86


KyleAMathews marked this pull request as draft November 6, 2025 19:58

KyleAMathews mentioned this pull request Nov 19, 2025

Add query fallback when Electric disconnects #3470

Open

12 tasks

KyleAMathews closed this Nov 19, 2025
