Add new setting for standby_mode to allow replicating security configuration (system) indices#92

Open
cwperks wants to merge 5 commits into main from ccr-security-replication

@cwperks cwperks commented May 1, 2026

Description

This PR introduces a plugins.security.standby_mode setting that enables the security plugin to operate in a read-only standby mode, designed for disaster recovery (DR) architectures where a follower cluster receives replicated security configuration via Cross-Cluster Replication (CCR).

Background: CCR currently does not support system indices (indices whose names start with .). A companion change to the CCR plugin removes this restriction for the Security index path. Even with that restriction lifted, the security plugin must know that it is operating as a follower: it should not bootstrap its own security index, should not accept config mutations, and should refresh its caches from the replicated index.

This PR also includes the Security-side transport and authorization support needed for actual CCR replication of .opendistro_security. CCR bootstrap restore can create the follower shard before the security config documents are available; the documents may arrive through CCR replay shortly afterward. Security therefore needs a narrow pre-initialization path for both CCR restore and CCR replay so the replicated config can land before Security has initialized.

What standby mode does:

  • Blocks security config mutation APIs (PUT/POST/PATCH/DELETE to /_plugins/_security/api/*) with a clear 403 response: "Security configuration is read-only because this cluster is in standby mode."
  • Allows config reads (GET requests) — roles, users, etc. can still be queried
  • Skips security index bootstrap/initialization — the security index is expected to arrive via CCR replication, not be created locally
  • Polls the replicated security index every 5 seconds to detect changes and refresh auth/authz caches — necessary because CCR-replicated writes don't trigger the normal ConfigUpdateAction broadcast
  • Authentication and authorization continue working using the replicated configuration
  • Allows narrowly scoped CCR restore/replay before Security initialization only for the exact registered system index marked by CCR
  • Can be toggled dynamically via cluster settings — standby enforcement paths read the current plugins.security.standby_mode value, so a follower cluster can be promoted by disabling standby mode without restarting nodes
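
As an illustration of the first two bullets, here is a minimal sketch of a mutation gate that rejects non-GET requests while standby mode is on. The class name StandbyMutationGate and its check method are hypothetical and are not the code in AbstractApiAction.java; only the non-GET rule, the 403 status, and the message text come from this description.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration only; not the PR's AbstractApiAction code.
final class StandbyMutationGate {
    static final String MESSAGE =
        "Security configuration is read-only because this cluster is in standby mode.";

    // Reads the *current* standby value on every call, so dynamic changes apply immediately.
    private final BooleanSupplier standbyMode;

    StandbyMutationGate(BooleanSupplier standbyMode) {
        this.standbyMode = standbyMode;
    }

    /** Returns the rejection message (to be sent with HTTP 403) if the request is blocked. */
    Optional<String> check(String httpMethod) {
        boolean isRead = "GET".equals(httpMethod.toUpperCase(Locale.ROOT));
        if (standbyMode.getAsBoolean() && !isRead) {
            return Optional.of(MESSAGE);
        }
        return Optional.empty();
    }
}
```

Passing a supplier rather than a boolean captured at startup is what lets the gate follow a dynamic setting change without a node restart.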

CCR replicated system index handling:

  • CCR marks same-index system-index replication work with a thread context entry: opensearch.ccr.replicated_system_index = <exact index name>
  • Security preserves that marker across transport using an internal header
  • Security only relaxes standby system-index protection when:
    • standby mode is enabled,
    • the CCR marker is present,
    • the request target exactly matches the marker,
    • and the target is a registered system index according to SystemIndexRegistry.
  • The pre-init path covers:
    • RestoreSnapshotRequest for the exact marked registered system index
    • CCR replay action indices:data/write/plugins/replication/changes for the exact marked registered system index

This avoids adding a broad allowlist for dot indices or arbitrary system indices. The source of truth remains plugin system-index registration.
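
The four conditions above can be sketched as a single predicate. This is a simplified illustration: the class ReplicatedSystemIndexCheck is hypothetical, and a plain Set of index names stands in for the SystemIndexRegistry lookup, whose real API is not shown in this description.

```java
import java.util.Set;

// Hypothetical sketch; a Set<String> stands in for the SystemIndexRegistry lookup.
final class ReplicatedSystemIndexCheck {

    /**
     * @param standbyMode             current value of plugins.security.standby_mode
     * @param marker                  value of opensearch.ccr.replicated_system_index, or null if absent
     * @param targetIndex             the index the request targets
     * @param registeredSystemIndices names of registered system indices (assumed representation)
     */
    static boolean allows(boolean standbyMode,
                          String marker,
                          String targetIndex,
                          Set<String> registeredSystemIndices) {
        return standbyMode                                     // standby mode is enabled
            && marker != null                                  // the CCR marker is present
            && marker.equals(targetIndex)                      // exact match, no patterns or prefixes
            && registeredSystemIndices.contains(targetIndex);  // registered system index
    }
}
```

Because every condition must hold, a marker naming an unregistered dot index, or a request targeting a different index than the marker, keeps the normal system-index protection.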

Setting: plugins.security.standby_mode (boolean, default false, dynamic, sensitive)
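
A shared holder for a dynamic boolean setting could look roughly like the sketch below. It is plain Java with no OpenSearch dependency: StandbyModeHolder and updateConsumer are hypothetical names, and the actual StandbyModeSetting.java may be structured differently. The assumption is that the plugin passes the update consumer to its cluster settings listener.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

// Hypothetical stand-in for a dynamic setting holder; not the PR's StandbyModeSetting class.
final class StandbyModeHolder implements BooleanSupplier {
    static final String KEY = "plugins.security.standby_mode";

    // volatile so that enforcement paths on other threads see updates promptly
    private volatile boolean enabled;

    StandbyModeHolder(boolean initialValue) {
        this.enabled = initialValue;
    }

    /** Callback to register with whatever delivers cluster setting updates (assumed wiring). */
    Consumer<Boolean> updateConsumer() {
        return newValue -> this.enabled = newValue;
    }

    @Override
    public boolean getAsBoolean() {
        return enabled;
    }
}
```

Components that enforce standby behavior would hold a reference to the same instance and call getAsBoolean() on each request instead of caching the value.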

Promotion workflow:

  1. Stop CCR replication (makes the security index writable)
  2. Set plugins.security.standby_mode: false via the cluster settings API
  3. Security stops standby config polling and resumes normal security-index ownership behavior
  4. Security config mutation APIs become available again without a node restart
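
Step 2 can be sketched as a request to the cluster settings API, built here with the JDK HTTP client. The base URL is a placeholder, the choice of a persistent rather than transient setting is an assumption, and authentication is omitted. Stopping CCR replication (step 1) is not shown.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper that only builds the request; sending and auth are left to the caller.
final class PromoteStandby {
    // Assumption: the setting is applied as a persistent cluster setting.
    static final String BODY =
        "{\"persistent\":{\"plugins.security.standby_mode\":false}}";

    static HttpRequest disableStandbyRequest(String baseUrl) {
        return HttpRequest.newBuilder()
            .uri(URI.create(baseUrl + "/_cluster/settings"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(BODY))
            .build();
    }
}
```

For example, PromoteStandby.disableStandbyRequest("https://localhost:9200") yields a PUT to /_cluster/settings that an admin client could send with java.net.http.HttpClient.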

Related work:

Changes

| File | Change |
| --- | --- |
| ConfigConstants.java | New SECURITY_STANDBY_MODE constant plus CCR replicated-system-index marker/header constants |
| StandbyModeSetting.java | New dynamic setting holder for plugins.security.standby_mode, shared by standby enforcement paths |
| OpenSearchSecurityPlugin.java | Register dynamic/sensitive setting; register cluster settings listener; pass shared standby setting into Security components |
| AbstractApiAction.java | Block mutation requests (non-GET) based on the current standby mode value |
| SecurityApiDependencies.java / SecurityRestApiActions.java | Wire the shared standby setting into REST API actions |
| ConfigurationRepository.java | Skip bootstrap in standby mode; poll replicated security index every 5s; stop polling and resume normal ownership behavior when standby is disabled dynamically |
| SecurityInterceptor.java | Preserve CCR replicated-system-index marker across same-node and cross-node transport requests |
| SecurityRequestHandler.java | Restore CCR replicated-system-index marker from transport header into thread context |
| SecurityFilter.java | Allow narrowly scoped standby CCR restore/replay before Security configuration is initialized, based on the current standby mode value |
| SystemIndexAccessEvaluator.java | Legacy evaluator allows marked standby CCR access to registered system indices only, based on the current standby mode value |
| PrivilegesEvaluatorImpl.java | Legacy wiring passes ThreadContext; next-gen evaluator preserves system-index protections except for marked standby CCR access; both use the current standby mode value |
| StandbyModeSettingTest.java | Unit test verifies the standby setting tracks cluster setting updates |
| ConfigurationRepositoryTest.java | Unit tests verify dynamic standby mode changes affect security-index initialization behavior |
| StandbyModeTests.java | Integration tests: auth works, mutations blocked, reads allowed |
| StandbyModeCcrSimulationTests.java | End-to-end test simulating CCR: leader cluster config is copied to standby, standby picks it up via polling, auth works with replicated config |
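
The 5-second polling listed for ConfigurationRepository.java could be sketched as follows. This is an assumption-heavy illustration, not the PR's code: StandbyConfigPoller is a hypothetical class, the fingerprint supplier stands in for however the replicated index's state is summarized (for example a version or hash of the config documents), and reloadCaches stands in for the cache refresh.

```java
import java.util.Objects;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch of change detection by polling; not the PR's ConfigurationRepository code.
final class StandbyConfigPoller {
    private final Supplier<String> fingerprint; // assumed summary of the replicated config state
    private final Runnable reloadCaches;        // assumed hook that refreshes auth/authz caches
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "standby-config-poller");
            t.setDaemon(true); // do not keep the JVM alive just for polling
            return t;
        });
    private ScheduledFuture<?> task;
    private String lastSeen;

    StandbyConfigPoller(Supplier<String> fingerprint, Runnable reloadCaches) {
        this.fingerprint = fingerprint;
        this.reloadCaches = reloadCaches;
    }

    /** Checks once; reloads caches and returns true only if the fingerprint changed. */
    synchronized boolean pollOnce() {
        String current = fingerprint.get();
        if (Objects.equals(current, lastSeen)) {
            return false;
        }
        lastSeen = current;
        reloadCaches.run();
        return true;
    }

    /** Starts polling every 5 seconds (e.g. when standby mode is enabled). */
    synchronized void start() {
        if (task == null) {
            task = scheduler.scheduleWithFixedDelay(() -> {
                try {
                    pollOnce();
                } catch (RuntimeException e) {
                    // swallow so one failed poll does not cancel future runs
                }
            }, 0, 5, TimeUnit.SECONDS);
        }
    }

    /** Stops polling (e.g. when standby mode is disabled dynamically). */
    synchronized void stop() {
        if (task != null) {
            task.cancel(false);
            task = null;
        }
    }
}
```

Polling is needed here because, as noted in the description, CCR-replicated writes do not trigger the normal ConfigUpdateAction broadcast.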

Testing

  • 8 integration tests across 2 test classes, all passing
  • StandbyModeTests — single cluster with standby mode enabled, verifies API behavior
  • StandbyModeCcrSimulationTests — two clusters (leader + standby), simulates CCR by copying security config documents from leader to standby's security index, verifies standby detects and loads the replicated config
  • StandbyModeSettingTest — verifies plugins.security.standby_mode updates are tracked dynamically
  • ConfigurationRepositoryTest — verifies security-index initialization behavior follows the current standby mode value

Local validation with the companion CCR branch also passed:

cd /Users/craigperkins/Projects/OpenSearch/security
./gradlew compileJava publishZipToMavenLocal -Dbuild.snapshot=true

cd /Users/craigperkins/Projects/OpenSearch/cross-cluster-replication
./gradlew integTest --tests org.opensearch.replication.integ.rest.SecurityStandbyModeIT -Psecurity=true -PnumNodes=1

The CCR integration test verifies actual .opendistro_security replication into a standby follower cluster and confirms Security initializes from the replicated config.

Additional focused validation for dynamic standby mode:

cd /Users/craigperkins/Projects/OpenSearch/security
./gradlew compileJava compileTestJava
./gradlew test --tests org.opensearch.security.setting.StandbyModeSettingTest --tests org.opensearch.security.configuration.ConfigurationRepositoryTest --tests org.opensearch.security.filter.SecurityFilterTests

Future Work

  • Replace 5s polling with an index change listener or another direct config-change signal for lower-latency cache refresh
  • Promote standby_mode to a core cluster-level setting (cluster.standby_mode) that all plugins can respect
  • Add more plugin system-index coverage once other plugins are validated for standby behavior
  • Job Scheduler suppression on standby clusters
  • Promotion API / status endpoint
  • Active-to-standby failback flow: support and test dynamically enabling standby mode on a cluster that previously owned the security index

Check List

  • New functionality includes testing
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…uration (system) indices

Signed-off-by: Craig Perkins <cwperx@amazon.com>
cwperks added 3 commits May 2, 2026 11:28
Signed-off-by: Craig Perkins <craig5008@gmail.com>
Signed-off-by: Craig Perkins <craig5008@gmail.com>
Signed-off-by: Craig Perkins <craig5008@gmail.com>
Signed-off-by: Craig Perkins <craig5008@gmail.com>