Add rate limiting for single-order reconciliation queries
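
The per-cycle query budget added in this commit can be illustrated with a minimal, self-contained sketch. It is not the engine's code: `run_cycle`, `fake_query`, and the plain list/dict state are hypothetical stand-ins. Only the control flow is meant to mirror the change to `_check_orders_consistency`: a retry threshold, a per-cycle cap, and a delay that is skipped after the final allowed query.

```python
import asyncio


async def run_cycle(pending, retries, *, retry_threshold, max_queries, delay_ms, query):
    """Run one simulated reconciliation cycle and return the IDs queried."""
    queried = []
    for order_id in list(pending):  # Iterate over a copy; pending is mutated below
        count = retries.get(order_id, 0)
        if count < retry_threshold:
            retries[order_id] = count + 1  # Not yet at threshold: keep waiting
            continue
        if len(queried) >= max_queries:
            retries[order_id] = count + 1  # Budget spent: defer to the next cycle
            continue
        retries.pop(order_id, None)
        await query(order_id)
        pending.remove(order_id)  # Assumption: the query always resolves the order
        queried.append(order_id)
        # Pause between queries, but not after the last one the budget allows
        if len(queried) < max_queries and delay_ms > 0:
            await asyncio.sleep(delay_ms / 1000.0)
    return queried


async def demo():
    pending = [f"O-{i}" for i in range(10)]
    retries = {}

    async def fake_query(order_id):  # Hypothetical stand-in for a venue request
        return None

    per_cycle = []
    while pending:
        queried = await run_cycle(
            pending,
            retries,
            retry_threshold=0,
            max_queries=3,
            delay_ms=1,
            query=fake_query,
        )
        per_cycle.append(len(queried))
    return per_cycle


per_cycle = asyncio.run(demo())
print(per_cycle)  # [3, 3, 3, 1]
```

With the default settings (10 queries per cycle, 100 ms delay), a cycle sleeps at most 9 times, roughly 900 ms in total, and a backlog of N orders at the retry threshold takes ceil(N / 10) cycles to clear, assuming no further orders go missing in the meantime.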

cjdsellers · cjdsellers · commit 5b5fd4ad7e33 · 2025-10-05T21:06:22.000+10:00
diff --git a/RELEASES.md b/RELEASES.md
@@ -7,6 +7,7 @@ This will be the final release with support for Python 3.11.
 ### Enhancements
 - Added support for `OrderBookDepth10` requests (#2955), thanks @faysou
 - Added support for quotes from book depths (#2977), thanks @faysou
+- Added execution engine rate limiting for single-order reconciliation queries
 - Added Renko bar aggregator (#2941), thanks @faysou
 - Added `time_range_generator` for on-the-fly data subscriptions (#2952), thanks @faysou
 - Added `__repr__` to `NewsEvent` (#2958), thanks @MK27MK
diff --git a/docs/concepts/live.md b/docs/concepts/live.md
@@ -215,6 +215,8 @@ The execution engine reuses a single retry counter (`_recon_check_retries`) for
 
 When the open-order loop exhausts its retries, the engine issues one targeted `GenerateOrderStatusReport` probe before applying a terminal state. If the venue returns the order, reconciliation proceeds and the retry counter resets automatically.
 
+**Single-order query protection**: To avoid exhausting venue rate limits when many orders need individual queries, the engine caps single-order queries per reconciliation cycle via `max_single_order_queries_per_cycle` (default: 10) and defers any remaining orders to the next cycle. It also waits a configurable delay (`single_order_query_delay_ms`, default: 100 ms) between consecutive single-order queries. Together, these settings let the engine work through hundreds of orders missed by bulk queries over several cycles without overwhelming the venue API.
+
 Orders that age beyond `open_check_lookback_mins` rely on this targeted probe. Keep the lookback generous for venues with short history windows, and consider increasing `open_check_threshold_ms` if venue timestamps lag the local clock so recently updated orders are not marked missing prematurely.
 
 This ensures the trading node maintains a consistent execution state even under unreliable conditions.
@@ -228,8 +230,10 @@ This ensures the trading node maintains a consistent execution state even under
 | `open_check_open_only`              | True           | When enabled, only open orders are requested during checks; if disabled, full order history is fetched (resource-intensive).         |
 | `open_check_lookback_mins`          | 60&nbsp;min    | Lookback window (minutes) for order status polling during continuous reconciliation. Only orders modified within this window are considered. |
 | `open_check_threshold_ms`           | 5,000&nbsp;ms  | Minimum time since the order's last cached event before open-order checks act on venue discrepancies (missing, mismatched status, etc.). |
-| `open_check_missing_retries`        | 5&nbsp;retries | Maximum retries before resolving an order that is open in cache but not found at venue. Prevents false positives from race conditions. |
-| `reconciliation_startup_delay_secs` | 10.0&nbsp;s    | Additional delay (seconds) applied *after* startup reconciliation completes before starting continuous reconciliation loop. Provides time for additional system stabilization. |
+| `open_check_missing_retries`           | 5&nbsp;retries | Maximum retries before resolving an order that is open in cache but not found at venue. Prevents false positives from race conditions. |
+| `max_single_order_queries_per_cycle`   | 10             | Maximum number of single-order queries per reconciliation cycle. Prevents rate limit exhaustion when many orders fail bulk query checks. |
+| `single_order_query_delay_ms`          | 100&nbsp;ms    | Delay (milliseconds) between single-order queries to prevent rate limit exhaustion. |
+| `reconciliation_startup_delay_secs`    | 10.0&nbsp;s    | Additional delay (seconds) applied *after* startup reconciliation completes before starting continuous reconciliation loop. Provides time for additional system stabilization. |
 | `own_books_audit_interval_secs`     | None           | Sets the interval (in seconds) between audits of own order books against public ones. Verifies synchronization and logs errors for inconsistencies. |
 
 :::warning
diff --git a/docs/integrations/bitmex.md b/docs/integrations/bitmex.md
@@ -368,7 +368,7 @@ Exceeding BitMEX rate limits returns HTTP 429 and may trigger temporary IP bans;
 All requests automatically consume both the global burst bucket and the rolling minute bucket. Endpoints that have their own minute quota (e.g. `/api/v1/order`) also queue against that per-route key, so repeated calls with different parameters still share a single rate bucket.
 
 :::info
-For more details on rate limiting, see the official documentation: <https://www.bitmex.com/app/restAPI#Rate-Limits>.
+For more details on rate limiting, see the [BitMEX API documentation on rate limits](https://www.bitmex.com/app/restAPI#Limits).
 :::
 
 ### Rate-limit headers
diff --git a/nautilus_trader/live/config.py b/nautilus_trader/live/config.py
@@ -122,8 +122,7 @@ class LiveExecEngineConfig(ExecEngineConfig, frozen=True):
         The interval (seconds) between checks for open orders at the venue.
         If there is a discrepancy then an order status report is generated and reconciled.
         A recommended setting is between 5-10 seconds, consider API rate limits and the additional
-        request weights.
-        If no value is specified then the open order checking task is not started.
+        request weights. If no value is specified then the open order checking task is not started.
     open_check_open_only : bool, default True
         If True, the **check_open_orders** requests only currently open orders from the venue.
         If False, it requests the entire order history, which can be a heavy API call.
@@ -138,6 +137,11 @@ class LiveExecEngineConfig(ExecEngineConfig, frozen=True):
         The maximum number of retries before resolving an order that is open in cache but
         not found at the venue. This prevents race conditions where orders are resolved too
         quickly due to network delays or venue processing time.
+    max_single_order_queries_per_cycle : PositiveInt, default 10
+        The maximum number of single-order queries to perform per reconciliation cycle.
+        Prevents rate limit exhaustion when many orders fail bulk query checks.
+    single_order_query_delay_ms : NonNegativeInt, default 100
+        The delay (milliseconds) between single-order queries to prevent rate limit exhaustion.
     reconciliation_startup_delay_secs : PositiveFloat, default 10.0
         The additional delay (seconds) applied AFTER startup reconciliation
         completes before starting the continuous reconciliation loop. This provides time
@@ -193,6 +197,8 @@ class LiveExecEngineConfig(ExecEngineConfig, frozen=True):
     open_check_lookback_mins: PositiveInt = 60
     open_check_threshold_ms: NonNegativeInt = 5_000
     open_check_missing_retries: NonNegativeInt = 5
+    max_single_order_queries_per_cycle: PositiveInt = 10
+    single_order_query_delay_ms: NonNegativeInt = 100
     reconciliation_startup_delay_secs: PositiveFloat = 10.0
     purge_closed_orders_interval_mins: PositiveInt | None = None
     purge_closed_orders_buffer_mins: NonNegativeInt | None = None
diff --git a/nautilus_trader/live/execution_engine.py b/nautilus_trader/live/execution_engine.py
@@ -180,6 +180,8 @@ def __init__(
         self.open_check_lookback_mins: int = config.open_check_lookback_mins
         self.open_check_threshold_ms: int = config.open_check_threshold_ms
         self.open_check_missing_retries: int = config.open_check_missing_retries
+        self.max_single_order_queries_per_cycle: int = config.max_single_order_queries_per_cycle
+        self.single_order_query_delay_ms: int = config.single_order_query_delay_ms
         self.reconciliation_startup_delay_secs: float = config.reconciliation_startup_delay_secs
         self.purge_closed_orders_interval_mins = config.purge_closed_orders_interval_mins
         self.purge_closed_orders_buffer_mins = config.purge_closed_orders_buffer_mins
@@ -205,6 +207,8 @@ def __init__(
         self._log.info(f"{config.open_check_lookback_mins=}", LogColor.BLUE)
         self._log.info(f"{config.open_check_threshold_ms=}", LogColor.BLUE)
         self._log.info(f"{config.open_check_missing_retries=}", LogColor.BLUE)
+        self._log.info(f"{config.max_single_order_queries_per_cycle=}", LogColor.BLUE)
+        self._log.info(f"{config.single_order_query_delay_ms=}", LogColor.BLUE)
         self._log.info(f"{config.reconciliation_startup_delay_secs=}", LogColor.BLUE)
         self._log.info(f"{config.purge_closed_orders_interval_mins=}", LogColor.BLUE)
         self._log.info(f"{config.purge_closed_orders_buffer_mins=}", LogColor.BLUE)
@@ -633,14 +637,14 @@ async def _resolve_order_not_found_at_venue(self, order: Order) -> None:
         no record of it, which typically means the order was never successfully placed
         or was rejected.
 
-        Before marking as rejected, performs a targeted query to check if the order
+        Before marking as rejected, performs a single-order query to check if the order
         exists but was missed due to API timing/processing delays.
 
         """
         ts_now = self._clock.timestamp_ns()
 
         self._log.debug(
-            f"Performing targeted query for {order.client_order_id!r} before marking as REJECTED",
+            f"Performing single-order query for {order.client_order_id!r} before marking as REJECTED",
             LogColor.BLUE,
         )
 
@@ -1032,6 +1036,10 @@ async def _check_orders_consistency(self) -> None:
             missing_at_venue: set[ClientOrderId] = open_order_ids - venue_reported_ids
             ts_now = self._clock.timestamp_ns()
 
+            # Count single-order queries this cycle to prevent rate limit exhaustion
+            targeted_queries_count = 0
+            logged_limit_warning = False
+
             for client_order_id in missing_at_venue:
                 order = self._cache.order(client_order_id)
                 if order is None:
@@ -1058,12 +1066,43 @@ async def _check_orders_consistency(self) -> None:
 
                 retries = self._recon_check_retries.get(client_order_id, 0)
                 if retries >= self.open_check_missing_retries:
+                    if targeted_queries_count >= self.max_single_order_queries_per_cycle:
+                        self._recon_check_retries[client_order_id] = retries + 1
+
+                        if not logged_limit_warning:
+                            # Count how many orders at threshold are being deferred
+                            orders_at_threshold_remaining = (
+                                sum(
+                                    1
+                                    for cid in missing_at_venue
+                                    if self._recon_check_retries.get(cid, 0)
+                                    >= self.open_check_missing_retries
+                                )
+                                - targeted_queries_count
+                            )
+                            self._log.warning(
+                                f"Reached max single-order queries ({self.max_single_order_queries_per_cycle}) "
+                                f"this cycle, deferring {orders_at_threshold_remaining} order(s) at threshold to next cycle",
+                                LogColor.YELLOW,
+                            )
+                            logged_limit_warning = True
+
+                        continue  # Skip query but continue processing other orders
+
                     self._log.warning(
-                        f"Order {client_order_id!r} not found at venue after {retries} retries, performing targeted query",
+                        f"Order {client_order_id!r} not found at venue after {retries} retries, performing single-order query",
                         LogColor.YELLOW,
                     )
                     self._clear_recon_tracking(client_order_id, drop_last_query=False)
                     await self._resolve_order_not_found_at_venue(order)
+                    targeted_queries_count += 1
+
+                    # Add delay between single-order queries (skip after final query)
+                    if (
+                        targeted_queries_count < self.max_single_order_queries_per_cycle
+                        and self.single_order_query_delay_ms > 0
+                    ):
+                        await asyncio.sleep(self.single_order_query_delay_ms / 1000.0)
                 else:
                     self._recon_check_retries[client_order_id] = retries + 1
                     self._log.debug(
diff --git a/tests/integration_tests/live/test_live_reconciliation.py b/tests/integration_tests/live/test_live_reconciliation.py
@@ -736,3 +736,206 @@ async def test_concurrent_order_reconciliation(
     assert orders[2].filled_qty == Quantity.from_int(30_000)  # Verify complete fill
     assert orders[3].status == OrderStatus.REJECTED  # Venue reported REJECTED
     assert orders[4].status == OrderStatus.CANCELED  # Venue reported CANCELED
+
+
+@pytest.mark.asyncio()
+async def test_targeted_query_limiting(
+    msgbus,
+    cache,
+    clock,
+    trader_id,
+    account_id,
+    order_factory,
+):
+    """
+    Test that single-order queries are limited per cycle to prevent rate limit
+    exhaustion.
+
+    Simulates a scenario where:
+    1. Many orders fail the bulk query check
+    2. Single-order queries are needed for each order
+    3. System limits queries per cycle to prevent rate limit errors
+
+    """
+    # Arrange - Configure engine with low limits for testing
+    config = LiveExecEngineConfig(
+        open_check_interval_secs=1.0,
+        open_check_open_only=False,  # Full history mode so missing orders are detected
+        max_single_order_queries_per_cycle=3,  # Low limit for testing
+        single_order_query_delay_ms=50,  # Small delay for testing
+        open_check_missing_retries=0,  # Immediately trigger single-order queries
+    )
+
+    exec_engine = LiveExecutionEngine(
+        loop=asyncio.get_running_loop(),
+        msgbus=msgbus,
+        cache=cache,
+        clock=clock,
+        config=config,
+    )
+
+    exec_client = MockLiveExecutionClient(
+        loop=asyncio.get_running_loop(),
+        client_id=ClientId(SIM.value),
+        venue=SIM,
+        account_type=AccountType.CASH,
+        base_currency=USD,
+        instrument_provider=InstrumentProvider(),
+        msgbus=msgbus,
+        cache=cache,
+        clock=clock,
+    )
+
+    exec_engine.register_client(exec_client)
+    exec_engine.start()
+
+    # Create 10 orders and add them to cache as ACCEPTED
+    orders = []
+    for _ in range(10):
+        order = order_factory.limit(
+            instrument_id=AUDUSD_SIM.id,
+            order_side=OrderSide.BUY,
+            quantity=AUDUSD_SIM.make_qty(100),
+            price=AUDUSD_SIM.make_price(1.0),
+        )
+        cache.add_order(order)
+        order.apply(TestEventStubs.order_submitted(order))
+        order.apply(TestEventStubs.order_accepted(order))
+        cache.update_order(order)
+        orders.append(order)
+
+    # Mock returns empty reports (all orders "missing at venue")
+    # No reports added to exec_client, so generate_order_status_reports returns []
+
+    # Act - Run check_orders_consistency which should limit single-order queries
+    await exec_engine._check_orders_consistency()
+
+    # Assert - Only 3 single-order queries should have been attempted (max_single_order_queries_per_cycle)
+    # Since single-order queries return None, orders should be resolved as REJECTED
+    await eventually(lambda: len([o for o in orders if o.status == OrderStatus.REJECTED]) == 3)
+
+    # Run another cycle to process more orders
+    await exec_engine._check_orders_consistency()
+    await eventually(lambda: len([o for o in orders if o.status == OrderStatus.REJECTED]) == 6)
+
+    # Run one more cycle
+    await exec_engine._check_orders_consistency()
+    await eventually(lambda: len([o for o in orders if o.status == OrderStatus.REJECTED]) == 9)
+
+    # Final cycle for the last order
+    await exec_engine._check_orders_consistency()
+    await eventually(lambda: len([o for o in orders if o.status == OrderStatus.REJECTED]) == 10)
+
+    # Cleanup
+    exec_engine.stop()
+    await eventually(lambda: exec_engine.is_stopped)
+
+
+@pytest.mark.asyncio()
+async def test_targeted_query_limiting_with_retry_accumulation(
+    msgbus,
+    cache,
+    clock,
+    trader_id,
+    account_id,
+    order_factory,
+):
+    """
+    Test that orders accumulate retries even when max_single_order_queries_per_cycle is
+    reached, ensuring reconciliation progresses over multiple cycles.
+
+    Simulates a scenario where:
+    1. Many orders need reconciliation simultaneously
+    2. Rate limits prevent querying all at once
+    3. Orders continue accumulating retries while waiting
+    4. All orders eventually get reconciled
+
+    """
+    # Arrange - Configure with realistic retry threshold
+    config = LiveExecEngineConfig(
+        open_check_interval_secs=1.0,
+        open_check_open_only=False,  # Full history mode
+        max_single_order_queries_per_cycle=3,  # Low limit for testing
+        single_order_query_delay_ms=10,  # Small delay for testing
+        open_check_missing_retries=5,  # Realistic retry threshold
+    )
+
+    exec_engine = LiveExecutionEngine(
+        loop=asyncio.get_running_loop(),
+        msgbus=msgbus,
+        cache=cache,
+        clock=clock,
+        config=config,
+    )
+
+    exec_client = MockLiveExecutionClient(
+        loop=asyncio.get_running_loop(),
+        client_id=ClientId(SIM.value),
+        venue=SIM,
+        account_type=AccountType.CASH,
+        base_currency=USD,
+        instrument_provider=InstrumentProvider(),
+        msgbus=msgbus,
+        cache=cache,
+        clock=clock,
+    )
+
+    exec_engine.register_client(exec_client)
+    exec_engine.start()
+
+    # Create 10 orders, all ACCEPTED (missing at venue)
+    orders = []
+    for _ in range(10):
+        order = order_factory.limit(
+            instrument_id=AUDUSD_SIM.id,
+            order_side=OrderSide.BUY,
+            quantity=AUDUSD_SIM.make_qty(100),
+            price=AUDUSD_SIM.make_price(1.0),
+        )
+        cache.add_order(order)
+        order.apply(TestEventStubs.order_submitted(order))
+        order.apply(TestEventStubs.order_accepted(order))
+        cache.update_order(order)
+        orders.append(order)
+
+    # Cycle 1: All orders get retry count 1 (none ready for query yet)
+    await exec_engine._check_orders_consistency()
+    for order in orders:
+        assert exec_engine._recon_check_retries.get(order.client_order_id, 0) == 1
+    assert all(o.status == OrderStatus.ACCEPTED for o in orders)
+
+    # Cycles 2-5: Retry counts accumulate to 5 (threshold)
+    for cycle in range(2, 6):
+        await exec_engine._check_orders_consistency()
+        for order in orders:
+            assert exec_engine._recon_check_retries.get(order.client_order_id, 0) == cycle
+
+    # Cycle 6: All 10 orders now at threshold (5), but only 3 can be queried
+    # First 3 get queried and resolved, remaining 7 increment to retry count 6
+    await exec_engine._check_orders_consistency()
+    await eventually(lambda: len([o for o in orders if o.status == OrderStatus.REJECTED]) == 3)
+
+    # Check that the remaining 7 orders have retry count incremented
+    for order in orders:
+        if order.status == OrderStatus.ACCEPTED:
+            # These hit the limit, got retries incremented but not queried
+            assert exec_engine._recon_check_retries.get(order.client_order_id, 0) == 6
+
+    # Cycle 7: 3 more get queried (total 6 resolved), remaining 4 at retry 7
+    await exec_engine._check_orders_consistency()
+    await eventually(lambda: len([o for o in orders if o.status == OrderStatus.REJECTED]) == 6)
+
+    # Cycle 8: 3 more (total 9), 1 remaining at retry 8
+    await exec_engine._check_orders_consistency()
+    await eventually(lambda: len([o for o in orders if o.status == OrderStatus.REJECTED]) == 9)
+
+    # Cycle 9: Last order resolved
+    await exec_engine._check_orders_consistency()
+    await eventually(lambda: len([o for o in orders if o.status == OrderStatus.REJECTED]) == 10)
+
+    # All orders eventually processed
+    await eventually(lambda: all(o.status == OrderStatus.REJECTED for o in orders))
+
+    # Cleanup
+    exec_engine.stop()
+    await eventually(lambda: exec_engine.is_stopped)