feat: add field-level access restrictions, config file support, and sensitive field scanner #14

Merged
ergut merged 36 commits into ergut:main from drharunyuksel:hy/feat-field-restrictions-and-scanner
Apr 8, 2026

@drharunyuksel
Contributor

@drharunyuksel drharunyuksel commented Mar 23, 2026

Why This Matters

Data warehouses often contain highly sensitive information — patient records, social security numbers, financial data, personal contact details, and authentication secrets. When an AI agent has direct access to query a BigQuery data warehouse, there is no human in the loop to prevent it from reading sensitive columns. A simple query like SELECT * FROM patients could expose thousands of PII/PHI records in a single response.

LLM inference happens in the cloud. When the agent runs a query, the results are sent to the LLM provider's servers (Anthropic, OpenAI, etc.) for processing — they leave your network. BigQuery IAM controls who can reach your data; field restrictions control what the AI agent surfaces into LLM responses. These are different protection boundaries.

This PR gives administrators fine-grained control over which tables and columns an AI agent can access, ensuring sensitive data stays protected while still allowing the AI to perform useful analytical queries on non-sensitive fields.

Security Model: Cooperative Guardrails, Not a SQL Firewall

The field restrictions and table allowlists in this server are designed as cooperative guardrails for AI agents, not as a hard security boundary against adversarial attackers.

When the agent encounters a restriction error, it reads the guidance in the error message and reformulates its query — using aggregate functions, EXCEPT clauses, or simply dropping the restricted field. In practice, AI agents cooperate immediately and consistently.

This system uses regex-based SQL analysis to detect restricted field usage. We performed penetration testing during development and fixed several bypass vectors (struct-alias expansion, comma-join evasion, implicit SELECT *). The enforcement logic is designed to fail closed (block ambiguous queries rather than allow them). Could a deliberately crafted adversarial query still slip through? Possibly — but AI agents don't write adversarial queries. They write straightforward SQL to answer the user's question. The only time we saw data leak was during our own manual penetration testing with intentionally crafted bypass queries that no AI agent would produce in normal operation.

For environments requiring strict compliance guarantees, combine these guardrails with BigQuery's native column-level security and authorized views.

Addressing Review Feedback

This update addresses all three points from your review:

1. SQL Parsing Robustness

"Regex-based SQL parsing can be bypassed through CTEs, subqueries, or aliasing. I would like to understand how robust it is against these patterns."

We performed extensive penetration testing and fixed several bypass vectors:

  • Struct-alias bypass (SELECT t FROM users AS t) — returns entire row as STRUCT, now detected and blocked
  • Comma-join evasion (FROM table1, table2) — second table was invisible to old regex, now correctly extracted
  • CTE chains (WITH a AS (...), b AS (SELECT * FROM a)) — restricted field references inside CTEs are detected
  • Alias shadowing (FROM restricted_table AS safe SELECT safe.restricted_field) — aliases are resolved back to real tables
  • Implicit SELECT * (FROM table |> LIMIT 10) — no SELECT clause means all columns returned, now treated as violation
  • Fail-closed design — if extractReferencedTables returns empty on a data query, the query is rejected rather than allowed
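
The fail-closed rule in the last bullet can be illustrated with a minimal sketch. The function name extractReferencedTables comes from the bullet above, but its body, the regex, the Verdict type, and the "contains FROM" heuristic for spotting a data query are assumptions made for illustration, not the PR's implementation.

```typescript
// Illustrative sketch only: the regex and helper bodies are assumptions, not the PR's code.

type Verdict = { allowed: true } | { allowed: false; reason: string };

// The text after FROM/JOIN runs until the next clause keyword, pipe operator, paren, or end of input.
const TABLE_LIST = new RegExp(
  "\\b(?:FROM|JOIN)\\s+([^();]+?)" +
    "(?=\\b(?:WHERE|GROUP|ORDER|LIMIT|HAVING|UNION|JOIN|ON|USING|INNER|LEFT|RIGHT|FULL|CROSS)\\b|\\|>|[();]|$)",
  "gi",
);

/** Collect table names after FROM or JOIN, including comma-joined tables. */
function extractReferencedTables(sql: string): string[] {
  const tables: string[] = [];
  TABLE_LIST.lastIndex = 0;
  let match: RegExpExecArray | null;
  while ((match = TABLE_LIST.exec(sql)) !== null) {
    for (const part of match[1].split(",")) {
      // Keep the table path and drop any alias ("users AS t" -> "users").
      const name = part.trim().split(/\s+/)[0].replace(/`/g, "").toLowerCase();
      if (name !== "" && !tables.includes(name)) tables.push(name);
    }
  }
  return tables;
}

/** Reject a data query when no table could be identified, instead of letting it through. */
function enforceFailClosed(sql: string): Verdict {
  const looksLikeDataQuery = /\bFROM\b/i.test(sql); // assumed heuristic
  if (looksLikeDataQuery && extractReferencedTables(sql).length === 0) {
    return { allowed: false, reason: "Could not determine which tables the query reads" };
  }
  return { allowed: true };
}
```

The point of the design is the direction of the default: when the parser cannot tell what a query touches, the query is refused, so a parsing gap produces a false rejection and not a leak.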

2. Test Coverage

"I would prefer to have test coverage before we merge. Could you please add tests, especially for the edge cases?"

Added 92 unit tests in src/sql-enforcement.test.ts using vitest, covering:

  • Unit tests for all SQL parsing helpers
  • Cooperative guardrail tests (standard query patterns)
  • Adversarial bypass tests (struct-alias, nested CTEs, comma-joins, alias shadowing, subqueries)
  • BigQuery pipe syntax penetration tests (EXTEND, SET, DROP, RENAME, AGGREGATE)
  • allowedTables enforcement tests (allowlist, fail-closed, CTE filtering, INFORMATION_SCHEMA exemption)

3. allowedTables Feature

"An allowedTables list in config.json would let users restrict the agent to a specific subset."

Implemented as a full protection mode. Users set protectionMode: "allowedTables" with an allowedTables array. Queries against any unlisted table are rejected immediately. Optional per-table field restrictions via preventedFieldsInAllowedTables. INFORMATION_SCHEMA queries are always allowed for schema discovery.

What's Included

Protection Modes

The server now supports three protection modes, configured via protectionMode in config.json:

| Mode | Description | When active |
| --- | --- | --- |
| off | No protection: all tables and fields accessible | No --config-file flag, or explicit "protectionMode": "off" |
| allowedTables | Table allowlist: only listed tables can be queried | Explicit "protectionMode": "allowedTables" |
| autoProtect | Auto-scans for sensitive fields, enforces preventedFields | Explicit, or config without protectionMode key (backward compat) |
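
One way to model the three modes is a discriminated union keyed on protectionMode. The sketch below is an assumption about shape only: the key names follow the config examples in this PR, while resolveProtection, asStringArray, and asFieldMap are hypothetical helpers.

```typescript
// Sketch under assumptions: key names mirror the config examples; helper names are hypothetical.

type FieldMap = Record<string, string[]>;

type ProtectionConfig =
  | { protectionMode: "off" }
  | { protectionMode: "allowedTables"; allowedTables: string[]; preventedFieldsInAllowedTables: FieldMap }
  | { protectionMode: "autoProtect"; preventedFields: FieldMap };

function asStringArray(value: unknown): string[] {
  return Array.isArray(value) ? value.filter((v): v is string => typeof v === "string") : [];
}

function asFieldMap(value: unknown): FieldMap {
  const result: FieldMap = {};
  if (typeof value === "object" && value !== null && !Array.isArray(value)) {
    for (const [table, fields] of Object.entries(value)) {
      result[table] = asStringArray(fields);
    }
  }
  return result;
}

/** `raw` is the parsed config file, or null when no --config-file flag was given. */
function resolveProtection(raw: Record<string, unknown> | null): ProtectionConfig {
  if (raw === null) return { protectionMode: "off" };
  // A config without the key falls back to autoProtect for backward compatibility.
  const mode = raw["protectionMode"] ?? "autoProtect";
  switch (mode) {
    case "off":
      return { protectionMode: "off" };
    case "allowedTables":
      return {
        protectionMode: "allowedTables",
        allowedTables: asStringArray(raw["allowedTables"]),
        preventedFieldsInAllowedTables: asFieldMap(raw["preventedFieldsInAllowedTables"]),
      };
    case "autoProtect":
      return { protectionMode: "autoProtect", preventedFields: asFieldMap(raw["preventedFields"]) };
    default:
      throw new Error(`Unknown protectionMode: ${String(mode)}`);
  }
}
```

With this shape, code that handles a query can switch on protectionMode and the compiler only exposes the fields that belong to the active mode.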

Field-Level Access Restrictions

Define preventedFields (in autoProtect mode) or preventedFieldsInAllowedTables (in allowedTables mode) to block the AI agent from accessing sensitive columns:

{
  "protectionMode": "autoProtect",
  "maximumBytesBilled": "1000000000",
  "preventedFields": {
    "healthcare.patients": ["first_name", "last_name", "ssn", "date_of_birth", "email"],
    "billing.transactions": ["credit_card_number", "bank_account"]
  }
}

When the agent tries to query a restricted field:

SELECT first_name, last_name, diagnosis FROM healthcare.patients

The server blocks the query and returns a clear, instructive error:

Restricted fields detected — table "healthcare.patients" has restricted columns:
"first_name", "last_name", "ssn", "date_of_birth", "email".
You can only use these columns inside ["count", "countif", "avg", "sum"]
aggregate functions or exclude them with SELECT * EXCEPT (...).

The error lists ALL restricted fields for the table (not just the violated ones), so the agent can fix the query in one try without a retry loop.
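
As a rough illustration of that behavior, the sketch below builds a message from the full restricted list for the table, regardless of which field triggered the violation. The function name and the exact wording are assumptions that approximate the error shown above.

```typescript
// Hypothetical helper: wording approximates the error shown above, not the server's exact string.

const ALLOWED_AGGREGATES = ["count", "countif", "avg", "sum"];

function formatRestrictionError(table: string, restrictedFields: string[]): string {
  const quote = (items: string[]): string => items.map((item) => `"${item}"`).join(", ");
  return (
    `Restricted fields detected: table "${table}" has restricted columns: ${quote(restrictedFields)}. ` +
    `You can only use these columns inside [${quote(ALLOWED_AGGREGATES)}] aggregate functions ` +
    `or exclude them with SELECT * EXCEPT (...).`
  );
}

// The caller passes every restricted field for the table, not only the ones the query touched.
console.log(formatRestrictionError("healthcare.patients", ["first_name", "last_name", "ssn"]));
```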

Supported query patterns:

| Query Pattern | Behavior |
| --- | --- |
| SELECT restricted_col FROM table | Blocked with error message |
| SELECT * FROM table | Blocked (would expose restricted fields) |
| SELECT * EXCEPT(restricted_cols) FROM table | Allowed |
| COUNT(restricted_col), AVG(...), SUM(...), COUNTIF(...) | Allowed (aggregates don't expose individual values) |
| MIN(restricted_col), MAX(restricted_col) | Blocked (returns actual individual values) |
| SELECT non_restricted_col FROM table | Allowed |
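
The rows above can be reproduced with a small checker over a SELECT list. This is a simplified sketch under stated assumptions: findExposedFields and stripAllowedAggregates are hypothetical names, the input is the SELECT list only, and any bare `*` (including multiplication) is treated as a star so the check errs toward blocking.

```typescript
// Simplified sketch: function names are hypothetical and the parsing is deliberately conservative.

const ALLOWED_AGGREGATES = ["count", "countif", "avg", "sum"];

function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

/** Remove whole calls to allowed aggregates, following nested parentheses. */
function stripAllowedAggregates(selectList: string): string {
  const callStart = new RegExp(`\\b(?:${ALLOWED_AGGREGATES.join("|")})\\s*\\(`, "gi");
  let out = "";
  let pos = 0;
  let m: RegExpExecArray | null;
  while ((m = callStart.exec(selectList)) !== null) {
    let depth = 1;
    let i = m.index + m[0].length;
    while (i < selectList.length && depth > 0) {
      if (selectList[i] === "(") depth++;
      else if (selectList[i] === ")") depth--;
      i++;
    }
    out += selectList.slice(pos, m.index) + " ";
    pos = i;
    callStart.lastIndex = i; // resume scanning after the stripped call
  }
  return out + selectList.slice(pos);
}

/** Return the restricted fields that the SELECT list would expose as raw values. */
function findExposedFields(selectList: string, restricted: string[]): string[] {
  const exposed: string[] = [];
  const add = (field: string): void => {
    if (!exposed.includes(field)) exposed.push(field);
  };
  let rest = stripAllowedAggregates(selectList);
  // A star exposes every restricted field that is not named in its EXCEPT list.
  rest = rest.replace(/\*\s*(?:EXCEPT\s*\(([^)]*)\))?/gi, (_whole: string, exceptList?: string) => {
    const excluded = (exceptList ?? "").split(",").map((name) => name.trim().toLowerCase());
    for (const field of restricted) {
      if (!excluded.includes(field.toLowerCase())) add(field);
    }
    return " ";
  });
  // Anything left that names a restricted field is a direct reference (this includes MIN/MAX).
  for (const field of restricted) {
    if (new RegExp(`\\b${escapeRegExp(field)}\\b`, "i").test(rest)) add(field);
  }
  return exposed;
}
```

An empty result means the SELECT list is acceptable; a non-empty result would trigger the restriction error.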

Table-Level Restrictions (allowedTables mode)

{
  "protectionMode": "allowedTables",
  "maximumBytesBilled": "10000000000",
  "allowedTables": [
    "analytics.page_views",
    "analytics.sessions",
    "reporting.daily_summary"
  ],
  "preventedFieldsInAllowedTables": {
    "analytics.page_views": ["user_ip", "user_agent"]
  }
}

Queries against any unlisted table are rejected immediately. INFORMATION_SCHEMA queries are always allowed for schema discovery.
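
A minimal sketch of the allowlist decision follows. It assumes the referenced tables have already been extracted from the query, matches names exactly after normalization, and treats any path containing an INFORMATION_SCHEMA segment as exempt; checkAllowedTables and the other names are hypothetical.

```typescript
// Hypothetical sketch: exact-match comparison and the exemption test are assumptions.

type TableVerdict = { allowed: boolean; rejected: string[]; reason?: string };

function normalizeTable(name: string): string {
  return name.replace(/`/g, "").trim().toLowerCase();
}

function isInformationSchema(table: string): boolean {
  return normalizeTable(table).split(".").includes("information_schema");
}

function checkAllowedTables(referenced: string[], allowedTables: string[]): TableVerdict {
  if (referenced.length === 0) {
    // Fail closed: an unparseable query is rejected rather than allowed.
    return { allowed: false, rejected: [], reason: "No tables could be identified in the query" };
  }
  const allow = allowedTables.map((table) => normalizeTable(table));
  const rejected = referenced.filter(
    (table) => !isInformationSchema(table) && !allow.includes(normalizeTable(table)),
  );
  if (rejected.length > 0) {
    return { allowed: false, rejected, reason: `Tables not in allowedTables: ${rejected.join(", ")}` };
  }
  return { allowed: true, rejected: [] };
}
```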

Automated Sensitive Field Scanner

Automatically discovers sensitive columns across all BigQuery datasets by querying INFORMATION_SCHEMA.COLUMNS with configurable SQL LIKE patterns. Runs on server startup when the config is stale (based on lastScannedAt timestamp). The merge is additive-only — manually added restrictions are never removed.

First startup — running sensitive field scan...
Scanning all datasets for sensitive fields...
Found 1166 sensitive column(s) across 278 table(s)
Scan complete: config updated with 278 tables.
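
The staleness check and the additive merge could look roughly like the sketch below. lastScannedAt is the timestamp named above; the frequency parameter corresponds to the sensitiveFieldScanFrequencyDays setting mentioned in the commit log (0 disables scanning). The function names and exact comparison are assumptions.

```typescript
// Sketch under assumptions: function names and the >= comparison are illustrative.

type FieldMap = Record<string, string[]>;

const MS_PER_DAY = 24 * 60 * 60 * 1000;

function isScanStale(lastScannedAt: string | undefined, frequencyDays: number, now: Date): boolean {
  if (frequencyDays <= 0) return false; // scanning disabled
  if (lastScannedAt === undefined) return true; // never scanned: first startup
  const last = Date.parse(lastScannedAt);
  if (Number.isNaN(last)) return true; // unreadable timestamp: rescan to be safe
  return now.getTime() - last >= frequencyDays * MS_PER_DAY;
}

/** Additive merge: scan results only add fields; nothing already in the config is removed. */
function mergePreventedFields(existing: FieldMap, scanned: FieldMap): FieldMap {
  const merged: FieldMap = {};
  for (const [table, fields] of Object.entries(existing)) {
    merged[table] = [...fields];
  }
  for (const [table, fields] of Object.entries(scanned)) {
    const current = merged[table] ?? [];
    merged[table] = current.concat(fields.filter((field) => !current.includes(field)));
  }
  return merged;
}
```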

Graceful Startup on Missing Config

When --config-file points to a missing file, the server no longer crashes silently. It starts and returns an actionable error on every query:

Config file not found: /path/to/config.json. Your MCP server is configured with
--config-file, which requires a valid config file. To fix this: (1) create a config
file at the path above (see the example in the repository), or (2) correct the path
in --config-file, or (3) remove the --config-file flag from your MCP server settings
to run without protection.

Without --config-file, the server runs in simple/off mode. It no longer auto-discovers config.json in the working directory, avoiding collisions with unrelated config files in user projects.
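
The three startup outcomes described here (no flag, flag with a valid file, flag with a missing file) can be sketched as a small state type. Everything below is an illustrative assumption: the names are hypothetical, the file check is injected as a callback so the example stays self-contained, and the error text is abbreviated.

```typescript
// Hypothetical sketch: state names, function names, and the shortened error text are assumptions.

type StartupState =
  | { kind: "off" }
  | { kind: "configured"; configPath: string }
  | { kind: "missingConfig"; error: string };

function resolveStartup(
  configFileFlag: string | undefined,
  fileExists: (path: string) => boolean,
): StartupState {
  // No flag: run unprotected and do not look for config.json in the working directory.
  if (configFileFlag === undefined) return { kind: "off" };
  if (!fileExists(configFileFlag)) {
    return {
      kind: "missingConfig",
      error:
        `Config file not found: ${configFileFlag}. Create the file, correct the path in ` +
        `--config-file, or remove the --config-file flag to run without protection.`,
    };
  }
  return { kind: "configured", configPath: configFileFlag };
}

/** The server stays up, but every query returns the actionable error while the config is missing. */
function handleQuery(state: StartupState, runQuery: () => string): string {
  if (state.kind === "missingConfig") return state.error;
  return runQuery();
}
```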

Backward Compatibility

  • No breaking changes. Without --config-file, the server behaves identically to v1.0.3.
  • Existing config files without protectionMode default to autoProtect.
  • The scanner only runs in autoProtect mode.

Files Changed

| File | Description |
| --- | --- |
| src/index.ts | Protection mode system, config loading, graceful startup |
| src/sql-enforcement.ts | New: extracted SQL enforcement module (field + table restrictions) |
| src/sql-enforcement.test.ts | New: 92 unit tests |
| src/sensitive-field-scanner.ts | lastScannedAt timestamp for scan freshness |
| config.json.example | Combined all modes into single example file |
| README.md | Security model, protection modes, updated query patterns |
| package.json | Added vitest, test scripts |
| tsconfig.json | Excluded test files from compilation |

Manual Test Results

Test 1: Simple/Off Mode (no --config-file flag)

Setup: Remove --config-file flag from MCP settings, config file may or may not exist.
Expected: Server runs in simple/off mode — all queries execute without protection.
Result: ✅ SELECT address FROM myproject.users returns data

Test 2: autoProtect Mode

Setup: --config-file flag pointing to valid config with protectionMode: "autoProtect".
Expected: Auto-scans for sensitive fields on first startup, blocks restricted fields with guidance.
Result: ✅ address column detected as sensitive and blocked with aggregate guidance

Test 3: allowedTables Mode

Setup: --config-file flag pointing to valid config with protectionMode: "allowedTables", allowedTables containing only myproject.users, and preventedFieldsInAllowedTables restricting address, email, first_name, last_name.
Expected: Only listed tables are queryable; restricted fields within allowed tables are blocked.
Result: ✅ Non-allowed table myproject.internal.restricted_table blocked; first_name in myproject.users blocked with guidance; id (unrestricted) allowed

Test 4: Off Mode (with config)

Setup: --config-file flag pointing to valid config with protectionMode: "off".
Expected: All protection bypassed, queries execute normally.
Result: ✅ SELECT address FROM myproject.users returns data

Test 5: Missing Config File (with flag)

Setup: --config-file flag points to non-existent path.
Expected: Server starts but all queries are blocked with helpful error message directing user to fix the path or remove the flag.
Result: ✅ Error returned: "Config file not found... To fix this: (1) create a config file..., (2) correct the path in --config-file, or (3) remove the --config-file flag"

Test 6: Simple Mode (no flag, config exists)

Setup: No --config-file flag, valid config file exists on disk but is ignored.
Expected: Server ignores existing config, runs in simple/off mode.
Result: ✅ SELECT address FROM myproject.users returns data

Test Plan

  • 92 unit tests pass (npm test)
  • TypeScript builds cleanly (npm run build)
  • Manual test 1: Simple/off mode (no flag) — all queries allowed
  • Manual test 2: autoProtect mode — sensitive fields auto-discovered and blocked
  • Manual test 3: allowedTables mode — table allowlist + field restrictions enforced
  • Manual test 4: Off mode (with config) — all protection bypassed
  • Manual test 5: Missing config file — server starts, queries return helpful error
  • Manual test 6: No flag with config on disk — config ignored, simple mode
  • Backward compatible — config without protectionMode defaults to autoProtect

drharunyuksel and others added 25 commits October 9, 2025 11:37
Introduce configurable field restriction system to prevent querying of sensitive columns. Users can specify a JSON file mapping table names to restricted column names. The system validates restriction file accessibility during startup and enforces restrictions at query time by parsing SQL statements and blocking queries that contain restricted fields. This provides an additional security layer for organizations needing to limit access to specific data columns while allowing broader table access.
Add support for using restricted columns within aggregate functions like count, sum, avg, min, and max. The field restriction enforcement now distinguishes between direct column access and aggregated usage, allowing sensitive fields to be used in statistical queries while preventing direct access to individual values.

Also adds support for detecting SELECT * queries and table aliases when enforcing restrictions, and updates the error message to clearly communicate the allowed aggregate functions.
…tions

Enhance the field restriction system to support SELECT * EXCEPT (...) syntax,
allowing users to exclude specific sensitive columns when using star (*) in
their queries. This adds flexibility to query writing while maintaining
security by preventing access to restricted fields.

The implementation includes parsing of EXCEPT clauses, tracking of star usages
with their qualifiers and excluded columns, and validation logic to ensure
restricted fields are properly excluded or aggregated. The error message has
also been updated to inform users about the EXCEPT option.
- Add shared scanner module (sensitive-field-scanner.ts) with BigQuery
  INFORMATION_SCHEMA scan, merge logic, and daily staleness check
- Add standalone CLI script (scan-sensitive-fields.ts) for manual runs
- Integrate daily scan into server startup — runs on first connection
  of the day, skips on subsequent starts
- Add scan-fields npm script
- Update config.json with 278 tables of sensitive field restrictions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- config.json contains environment-specific HIPAA field mappings,
  should not be committed
- Add config.template.json as a starting point for new users
- The scan-fields script auto-populates config.json on first run

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add sensitiveFieldPatterns to config for user-extensible LIKE patterns
- Add sensitiveFieldScanFrequencyDays to config (default: 1, set 0 to disable)
- Scanner reads both settings from config, falls back to defaults if missing
- Standalone CLI always runs regardless of frequency setting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ensitive field scanner

- Add config.json support for centralized server configuration
  (optional — server uses safe defaults without a config file)
- Add field-level access restrictions via preventedFields config
  to block queries from accessing sensitive columns (PII/PHI)
- Support SELECT *, SELECT * EXCEPT, and aggregate functions
  in field restriction enforcement
- Add automated sensitive field scanner that discovers sensitive
  columns by querying BigQuery INFORMATION_SCHEMA.COLUMNS
- Add configurable scan patterns and frequency
- Add standalone CLI tool: npm run scan-fields
- Auto-scan on server startup with configurable frequency
- Move maximumBytesBilled from per-query parameter to server config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

Add comprehensive documentation for field-level access restrictions,
automated sensitive field scanner, custom patterns, and configurability.
Also add AGENTS.md to gitignore.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	.gitignore
#	README.md
#	src/index.ts
Prevents the scanner from creating an unexpected config.json when
the user never provided one. The scan now only runs if a config file
already exists or was explicitly passed via --config-file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The extractSelectClause function only matched standard SQL (SELECT ... FROM)
pattern. Pipe syntax queries (FROM table |> SELECT *) bypassed the star
detection entirely, allowing restricted fields to be returned via SELECT *.

Now extracts SELECT clauses from both standard and pipe syntax queries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…alse positives

- extractSelectClause now detects SELECT clauses in pipe syntax (|> SELECT)
  in addition to standard SQL (SELECT ... FROM)
- Strip EXCEPT clauses before scanning for direct field references,
  preventing false positives when restricted fields appear in EXCEPT lists

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove MIN/MAX from allowed aggregates — they return actual individual
  values (e.g. MIN(name) leaks a real name). Only COUNT, COUNTIF, AVG,
  SUM are now allowed on restricted fields.
- Strip SQL comments and string literals before checking field references,
  preventing false positives when restricted field names appear in
  comments (-- patient_name) or strings ('Dr patient_name Clinic').
- Support complex aggregate expressions like COUNTIF(field IS NOT NULL)
  and COUNT(DISTINCT field) which were previously blocked as false positives.
- Detect implicit SELECT * in pipe syntax queries with no SELECT clause
  (e.g. FROM table |> LIMIT 10).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
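
The comment and string-literal stripping described in this commit could be sketched as a single pass over the query text. This is an assumption about the approach, not the committed code: stripCommentsAndStrings is a hypothetical name, and triple-quoted or raw strings containing quote characters are not handled.

```typescript
// Hypothetical sketch: a character scanner that blanks out comments and string literals.

function stripCommentsAndStrings(sql: string): string {
  let out = "";
  let i = 0;
  while (i < sql.length) {
    const ch = sql[i];
    const two = sql.slice(i, i + 2);
    if (two === "--" || ch === "#") {
      // Line comment: skip to the end of the line (the newline itself is kept).
      while (i < sql.length && sql[i] !== "\n") i++;
    } else if (two === "/*") {
      // Block comment: skip to the closing marker, or to the end if it is unterminated.
      const end = sql.indexOf("*/", i + 2);
      i = end === -1 ? sql.length : end + 2;
      out += " ";
    } else if (ch === "'" || ch === '"') {
      // String literal: skip the body, honoring backslash escapes, and emit an empty literal.
      i++;
      while (i < sql.length && sql[i] !== ch) {
        if (sql[i] === "\\") i++;
        i++;
      }
      i++; // closing quote
      out += "''";
    } else {
      out += ch;
      i++;
    }
  }
  return out;
}
```

Running field detection on the stripped text avoids false positives from names that only appear in comments or literals.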
…CP client support

- Add fork notice at the top linking back to upstream
- Replace Smithery/npx quick install with clone-and-build instructions
  pointing to this fork (upstream package lacks security features)
- Add warning that Smithery/npx installs the original without field restrictions
- Update all clone URLs from ergut/ to drharunyuksel/
- Replace "Claude Desktop only" references with "any MCP-compatible client"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove sponsorship section (not applicable to this fork)
- Update author section to credit original author Salih Ergüt
  and add Harun Yüksel as fork maintainer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Owner

@ergut ergut left a comment

Hi Harun, thank you for this PR. It's clear you put a lot of thought and effort into it, and the documentation is really well done.

After reviewing it, I have a question about the core use case. This MCP server runs locally over stdio, so the person configuring the restricted fields is the same person (or agent) running the queries. Anyone can just edit the config file or skip it entirely. What scenario are you thinking of where this provides real protection?

If this were a remote MCP server where an admin deploys it for other users, I could see the value. But as a local server, I'm not sure field restrictions can be meaningfully enforced on the client side.

I want to keep the server focused and minimal, so I'd need a clear use case before adding this. If you see an angle I'm missing, I'm happy to hear it.

Thanks again for the contribution.

drharunyuksel and others added 2 commits March 27, 2026 11:44
Align fork README with PR branch content. Adds the 'Which Setup Is
Right for You?' table and explains the two deployment modes clearly:
Simple Mode (npx/Smithery, no config) and Protected Mode (with
--config-file for PHI/PII environments). Also explains why LLM
inference in the cloud makes field restrictions meaningful even for
local server deployments.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Renamed 'Developer Setup' to 'Local Build' to clarify it's a valid
deployment option (not just for contributors). Added separate config
examples for Simple Mode (no --config-file) and Protected Mode (with
--config-file), so all three deployment methods consistently support
both modes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@drharunyuksel
Contributor Author

drharunyuksel commented Mar 27, 2026

Hi @ergut

Thanks for the thoughtful review. You're right to push on this.

On the use case:

The threat model isn't "a human bypassing the config file." It's about what happens when an AI agent queries BigQuery autonomously. LLM inference happens in the cloud: when the agent runs a query, the results are sent to the LLM provider's servers (Anthropic, OpenAI, etc.) for processing. They leave your network.

Consider a healthcare data analyst using Claude or Cursor with this MCP server to explore patient data. They ask something innocent like "how many patients were admitted last month?" The AI agent autonomously writes and runs SELECT * FROM patients. That query returns thousands of rows containing names, emails, dates of birth, SSNs, and medical record numbers; and every single one of those values is sent to Anthropic's or OpenAI's cloud servers to generate the next response. The data has now left your network and been processed by a third-party cloud provider. Under HIPAA, that's a reportable data breach.

This isn't a hypothetical. It's the default behavior of any AI agent with unrestricted BigQuery access. The agent isn't malicious: it's just doing its job. But there's no human in the loop reviewing each query before it executes.

BigQuery IAM controls who can reach your data. Field restrictions control what the AI agent surfaces into LLM responses. These are completely different protection boundaries. IAM can't prevent the AI from reading a column it has permission to access. Field restrictions can.

The AI agent interacts only through MCP tools: it has no filesystem access to modify or skip the config file. So the restrictions are genuinely enforceable against the agent, even in a local stdio setup.

On the two deployment modes:

We updated the README to clearly define two modes:

  • Simple Mode: No config.json → server starts with 1GB query limit, no field restrictions. Anyone who clones the repo gets this by default since config.json is in .gitignore and only config.json.example ships with the repo.
  • Protected Mode: config.json present (via --config-file or auto-discovered in the working directory) → field restrictions enforced, scanner runs on startup.

The two modes don't conflict. We tested both locally with Node.js running the same compiled dist/index.js:

  • Simple Mode: SELECT address FROM users → returned data freely
  • Protected Mode: SELECT address FROM users → blocked with a clear error message
  • Protected Mode safe alternative: SELECT * EXCEPT(address, ...) → returned data with sensitive fields excluded

On minimalism:

The feature is fully opt-in. Without a config.json, the server behaves identically to v1.0.3: zero behavior change for existing users. The scanner only runs when a config file is present and stale. No new required dependencies.

Happy to adjust anything if you see a simpler way to structure this.

@drharunyuksel drharunyuksel force-pushed the hy/feat-field-restrictions-and-scanner branch from 562daa6 to a4f067e Compare March 27, 2026 09:36
drharunyuksel and others added 2 commits March 27, 2026 12:48
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
npx @ergut/mcp-bigquery-server installs the upstream package which does
not yet support --config-file. Updated Option 2 to use the local fork
build with node dist/index.js. Added a note linking to the open PR.
Will revert to npx once the PR is merged and a new version is published.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Owner

@ergut ergut left a comment

Hi Harun, thanks for the detailed explanation. I had the threat model wrong. I was thinking about a human bypassing the config, not the agent itself being the untrusted party. The point about query results being sent to LLM provider servers is a real concern, especially in regulated environments. That makes the use case clear.

A few things before we move forward:

The SQL parsing logic is now load-bearing from a privacy standpoint. Regex-based SQL parsing can be bypassed through CTEs, subqueries, or aliasing. Before merging, I would like to understand how robust it is against these patterns.

Given the large amount of new code, I would prefer to have test coverage before we merge. Could you please add tests, especially for the edge cases?

One extension worth considering: table-level restrictions. I have datasets with many tables where only a few are relevant for analysis. An allowedTables list in config.json would let users restrict the agent to a specific subset. It is a simpler enforcement problem than field restrictions and would be useful on its own. Would you be open to adding that?

drharunyuksel and others added 5 commits March 29, 2026 10:39
preventedFields was empty and the file existed with a fresh mtime,
causing the staleness check to skip the scan entirely — exposing
PHI/PII fields for up to 24 hours after initial deployment.

Now the scan always runs when preventedFields is empty, populating
protection immediately on first startup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…lds enforcement

- Extract SQL enforcement logic into src/sql-enforcement.ts module
- Add three protection modes: off, allowedTables, autoProtect
- Add restrictedFields support with aggregate-only access control
- Add comprehensive test suite (92 tests) including pipe syntax support
- Update config.json.example with all three protection modes
- Improve error messages with actionable fix guidance

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace isEmpty + file mtime check with a lastScannedAt timestamp written
to config.json after each scan. This ensures the first-run scan always
executes regardless of preventedFields content — fixing the case where
users copy config.json.example with placeholder entries and the scan
is incorrectly skipped.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…covery

- Server now starts even when --config-file points to a missing file,
  returning an actionable error on every query instead of crashing silently
- Without --config-file flag, server runs in off mode — no longer
  auto-discovers config.json in working directory to avoid collisions
  with unrelated config files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Auto-discovery of config.json in working directory was removed, so the
README now states that --config-file must be passed explicitly to enable
protection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@drharunyuksel drharunyuksel force-pushed the hy/feat-field-restrictions-and-scanner branch from 7dc6613 to f345ca4 Compare April 1, 2026 12:19
@drharunyuksel
Contributor Author

drharunyuksel commented Apr 1, 2026

Hi @ergut, thanks for the clear feedback. Here's what we've done to address each point:

1. SQL Parsing Robustness

First, an important framing point: these field restrictions are cooperative guardrails for AI agents, not a SQL firewall. The threat model is straightforward — when an AI agent queries BigQuery, the results are sent to the LLM provider's servers (Anthropic, OpenAI, etc.). Field restrictions prevent the agent from inadvertently including sensitive columns (PII, PHI, secrets) in those results. When the agent encounters a restriction error, it reads the guidance in the error message and reformulates its query. In practice, AI agents cooperate immediately and consistently.

That said, we took robustness seriously. We ran extensive penetration testing and found (and fixed) several bypass vectors:

  • Struct-alias bypass: SELECT t FROM users AS t returns the entire row as a STRUCT. Now detected and blocked.
  • Comma-join evasion: FROM table1, table2 made the second table invisible to the old regex. Fixed with comma-aware extraction.
  • CTE chains and subqueries: restricted field references inside CTEs and nested subqueries are now caught.
  • Alias shadowing: FROM restricted_table AS safe SELECT safe.restricted_field is resolved back to the real table.
  • Implicit SELECT *: FROM table |> LIMIT 10 (no SELECT clause) returns all columns. Now treated as a violation.
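
To make the struct-alias case concrete, here is a minimal sketch of how a bare table alias in the SELECT list might be detected. It is an illustration under assumptions, not the module's code: the function names are hypothetical, comma joins and subqueries are ignored, and the SELECT list is split on commas without parsing expressions.

```typescript
// Hypothetical sketch: detects "SELECT t FROM users AS t", where the alias returns the whole row.

const NOT_AN_ALIAS = [
  "where", "join", "on", "group", "order", "limit", "having", "union",
  "inner", "left", "right", "full", "cross", "as", "using",
];

/** Map each alias to its table. The alias is read in a lookahead so a following JOIN is not consumed. */
function extractAliases(sql: string): Map<string, string> {
  const aliases = new Map<string, string>();
  const re = /\b(?:FROM|JOIN)\s+([\w.`-]+)(?:\s+AS)?\s+(?=(\w+))/gi;
  let m: RegExpExecArray | null;
  while ((m = re.exec(sql)) !== null) {
    const table = m[1].replace(/`/g, "").toLowerCase();
    const alias = m[2].toLowerCase();
    if (!NOT_AN_ALIAS.includes(alias)) aliases.set(alias, table);
  }
  return aliases;
}

/** Return the tables whose alias appears as a bare item in the SELECT list. */
function findWholeRowSelections(sql: string): string[] {
  const select = /\bSELECT\s+([\s\S]+?)\s+FROM\b/i.exec(sql);
  if (select === null) return [];
  const aliases = extractAliases(sql);
  const hits: string[] = [];
  for (const item of select[1].split(",")) {
    const table = aliases.get(item.trim().toLowerCase());
    if (table !== undefined && !hits.includes(table)) hits.push(table);
  }
  return hits;
}
```

A non-empty result would be treated like SELECT * on the listed tables, since the alias expands to every column.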

Are there still edge cases where a very unusual SQL construct could slip through? Possibly — regex-based parsing has limits. But here's the thing: AI agents don't write unusual SQL and they don't try to hack or penetrate the database. They write straightforward queries to answer the user's question. The only time we saw restricted data leak through was during our own manual penetration testing, where we intentionally crafted bypass queries like SELECT t FROM users AS t — queries that no AI agent would produce in normal operation. In real usage, the agent hits a restriction, reads the error guidance, and reformulates its query — every time.

We've added a "Security Model" section to the README that's transparent about this:

This system uses regex-based SQL analysis to detect restricted field usage. We performed penetration testing during development and fixed several bypass vectors. However, regex-based parsing cannot guarantee coverage of every possible SQL construct. The enforcement logic is designed to fail closed, but it is not equivalent to a database-level security policy.

For environments requiring strict compliance guarantees, we recommend combining these guardrails with BigQuery's native column-level security and authorized views.

2. Test Coverage

Added 92 unit tests using vitest in src/sql-enforcement.test.ts. Coverage includes:

  • Unit tests for all SQL parsing helpers
  • Cooperative guardrail tests (standard query patterns, aggregates, EXCEPT)
  • Adversarial bypass tests (struct-alias, nested CTEs, comma-joins, alias shadowing)
  • BigQuery pipe syntax penetration tests (EXTEND, SET, DROP, RENAME, AGGREGATE)
  • allowedTables enforcement tests (allowlist, fail-closed, CTE filtering, INFORMATION_SCHEMA exemption)

All SQL enforcement logic has been extracted into a dedicated src/sql-enforcement.ts module for testability.

3. allowedTables

Implemented as a full protection mode. The server now supports three modes via protectionMode in config.json:

  • off — no restrictions
  • allowedTables — only listed tables can be queried, with optional per-table field restrictions via preventedFieldsInAllowedTables
  • autoProtect — the original behavior (auto-scan + preventedFields)

Backward compatible: existing configs without protectionMode default to autoProtect.

We also improved startup resilience — if --config-file points to a missing file, the server starts but blocks all queries with an actionable error message instead of crashing silently. Without --config-file, it runs in simple/off mode.

I've updated both the PR description and the README to reflect all of this — including the security model rationale, protection modes documentation, corrected query pattern tables (MIN/MAX now listed as blocked), manual test results (6 scenarios, all passing), and the complete test plan. Please take a look.

feat: add protection modes, SQL robustness fixes, and test coverage
Owner

@ergut ergut left a comment

Hi @drharunyuksel, thanks for the thorough update. The test coverage is solid and the adversarial cases are well thought out: CTEs, struct-alias bypass, comma-joins, alias shadowing, pipe syntax variants are all covered. The discriminated union design for the protection modes is clean too. Good work overall.

A few things to address before we merge:

The sensitive field scanner interpolates patterns from the config file directly into SQL without validation. If someone puts a crafted string in sensitiveFieldPatterns, it gets injected into the INFORMATION_SCHEMA query. The impact is limited but it should be fixed, either by validating that patterns match a safe format before use, or by using parameterized queries.

The scanner hardcodes region-us and location: 'US' in sensitive-field-scanner.ts. Any non-US deployment will silently fail the auto-scan on startup, leaving preventedFields empty. The config.location is already threaded through the rest of the codebase so it just needs to be passed into the scanner as well.

The --maximum-bytes-billed CLI flag is silently ignored when a config file is present. loadConfiguration reads the value from the file only and the CLI value is discarded. The fix is to apply the CLI value on top of what loadConfiguration returns, if it is set.

One design question worth discussing: field names appearing in WHERE or ORDER BY clauses currently block the query even though the field is not being returned. For example, SELECT id FROM users WHERE email = 'x@example.com' would be blocked. Is this intentional? If so, it should be documented clearly, since it will surprise users.

…n support, CLI override

- Validate sensitiveFieldPatterns against safe-character allowlist before SQL interpolation
- Thread --location CLI flag through scanner (main server + manual scan-fields script)
- CLI --maximum-bytes-billed now overrides config file value (applied after config reload)
- Fail closed in autoProtect when scanner fails and preventedFields is empty
- Document WHERE/ORDER BY blocking behavior in README query pattern table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
drharunyuksel added a commit to drharunyuksel/mcp-bigquery-server that referenced this pull request Apr 8, 2026
@drharunyuksel
Contributor Author

Hi @ergut, thanks for the detailed review. Here's what we've done to address each point:

1. SQL Injection in Scanner

Fixed. scanSensitiveFields now validates every pattern against a safe-character allowlist (/^[a-zA-Z0-9_%.\-]+$/) before building the SQL. Patterns containing quotes, semicolons, spaces, or any other character that could break out of a LIKE clause are rejected with a clear error message. The scanner won't run if any pattern fails validation.

We chose input validation over parameterized queries since LIKE patterns have a very constrained format — there's no legitimate reason for a pattern to contain ', ;, or whitespace.
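
A minimal sketch of that check, using the allowlist regex quoted above. The names SAFE_PATTERN and validateSensitiveFieldPatterns are made up for this example and are not necessarily what the scanner uses.

```typescript
// Allowlist quoted in the comment above: letters, digits, underscore,
// percent, dot, and hyphen only.
const SAFE_PATTERN = /^[a-zA-Z0-9_%.\-]+$/;

// Hypothetical helper: throws if any pattern contains a character outside
// the allowlist, so the caller never builds SQL from an unchecked pattern.
function validateSensitiveFieldPatterns(patterns: string[]): void {
  const rejected = patterns.filter((p) => !SAFE_PATTERN.test(p));
  if (rejected.length > 0) {
    const shown = rejected.map((p) => JSON.stringify(p)).join(", ");
    throw new Error(
      `Unsafe sensitiveFieldPatterns rejected: ${shown}. ` +
        "Allowed characters are letters, digits, _ % . and -",
    );
  }
}
```

Validating the whole list before building any SQL matches the behavior described above, where a single bad pattern stops the scan.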

2. Hardcoded US Region in Scanner

Fixed. scanSensitiveFields and runDailyScanIfNeeded now accept a location parameter, which is passed from config.location (the existing --location CLI flag, defaulting to 'US'). Both the INFORMATION_SCHEMA region in the SQL (region-{location}) and the BigQuery query location option now use this value. Non-US deployments will scan the correct regional INFORMATION_SCHEMA.
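
As an illustration of the region threading, the qualifier could be derived from the location like this. The helper name informationSchemaColumns is hypothetical, and the lowercasing and character check are assumptions added for this sketch, not behavior confirmed by the PR.

```typescript
// Hypothetical helper: builds the region-qualified INFORMATION_SCHEMA view
// name from a location such as "US", "EU", or "europe-west1".
function informationSchemaColumns(location: string = "US"): string {
  // Assumption: the location is checked before interpolation, in the same
  // spirit as the pattern validation, since it also ends up in the SQL text.
  if (!/^[a-zA-Z0-9-]+$/.test(location)) {
    throw new Error(`Invalid location: ${JSON.stringify(location)}`);
  }
  // Assumption: the qualifier is lowercased, as in `region-us`.
  return `\`region-${location.toLowerCase()}\`.INFORMATION_SCHEMA.COLUMNS`;
}
```

The same location value would also be passed as the query's location option so the job runs in the matching region.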

3. --maximum-bytes-billed CLI Flag Override

Fixed. After loadConfiguration() returns, the CLI --maximum-bytes-billed value (if provided) now overrides the config file value. This follows the standard convention where CLI flags take precedence over config file settings.
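
The precedence can be sketched as below. ServerConfig, CliArgs, and applyCliOverrides are hypothetical names for this example; the actual return type of loadConfiguration() is not shown in this thread.

```typescript
// Hypothetical shapes for illustration only.
interface ServerConfig {
  maximumBytesBilled?: string;
  location?: string;
}

interface CliArgs {
  maximumBytesBilled?: string; // set only when --maximum-bytes-billed is passed
}

// Applies the CLI value on top of the file-based config. If the flag was
// not passed, the config file value is left untouched.
function applyCliOverrides(fileConfig: ServerConfig, cli: CliArgs): ServerConfig {
  if (cli.maximumBytesBilled === undefined) {
    return fileConfig;
  }
  return { ...fileConfig, maximumBytesBilled: cli.maximumBytesBilled };
}
```

Running this step after every config reload, not just at startup, is what keeps the CLI value from being lost when the file is re-read.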

4. WHERE / ORDER BY Blocking

This is intentional. The full SQL query text is sent to the LLM provider as part of the conversation — so SELECT id FROM users WHERE email = 'patient@example.com' means the restricted value appears in the prompt sent to the cloud, even though BigQuery doesn't return it in the results. Additionally, if the agent is writing a WHERE filter on a restricted field, it means the agent already has or is probing for that value.

We've added this to the README with a clear note in the query pattern table explaining why WHERE, ORDER BY, and other non-SELECT references are blocked.

Additional Hardening

We ran an internal adversarial code review and found two more issues beyond the four review items. Both are now fixed:

5. autoProtect fails closed on scanner failure

Previously, if the scanner failed in autoProtect mode (bad pattern, network error, permissions issue) and preventedFields was empty, the server would silently serve queries with no restrictions — effectively unprotected. Now, when the scanner throws and preventedFields is empty, the server blocks all queries with an actionable error message explaining how to fix it. If existing preventedFields were already populated from a previous scan, those continue to work normally.

Note: a successful scan that finds zero sensitive columns is fine — it means the user's tables don't have matching column names yet. The fail-closed behavior only activates on scanner failure, not on empty results.
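
The decision described above comes down to three cases, sketched here. All type and function names (ScanOutcome, Decision, resolveAutoProtect) are hypothetical, and merging new scan results into the existing preventedFields is an assumption of this sketch.

```typescript
// Hypothetical types for illustration; not the server's actual internals.
type FieldMap = Record<string, string[]>;
type ScanOutcome = { ok: true; found: FieldMap } | { ok: false; error: string };
type Decision =
  | { blockAll: false; preventedFields: FieldMap }
  | { blockAll: true; reason: string };

function resolveAutoProtect(existing: FieldMap, scan: ScanOutcome): Decision {
  if (scan.ok) {
    // A successful scan is trusted even when it finds nothing.
    // Assumption: new findings are merged over the existing entries.
    return { blockAll: false, preventedFields: { ...existing, ...scan.found } };
  }
  if (Object.keys(existing).length > 0) {
    // Scanner failed, but restrictions from an earlier scan still apply.
    return { blockAll: false, preventedFields: existing };
  }
  // Scanner failed and there is nothing to enforce: fail closed.
  return {
    blockAll: true,
    reason:
      `Sensitive field scan failed (${scan.error}) and preventedFields is empty. ` +
      "Fix the scanner error or populate preventedFields, then restart.",
  };
}
```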

6. Manual scanner location support

The standalone npm run scan-fields script was still hardcoded to region-us. It now accepts --location <region> and passes it to scanSensitiveFields, consistent with how the main server handles --location. Without the flag, it defaults to US.

Manual Test Results

We tested all changes against a live BigQuery instance using the local build, running the MCP server from a clean test directory (simulating a fresh install).

| # | Scenario | Result |
|---|----------|--------|
| 1 | Simple mode: no config file, no --config-file flag. Queries ran without restrictions. | Pass |
| 2 | autoProtect mode: empty preventedFields, scanner ran on first startup, discovered sensitive fields across datasets. Restricted field → blocked, SELECT * EXCEPT(restricted) → allowed. | Pass |
| 3 | allowedTables mode: allowed table → allowed, disallowed table → rejected, restricted field on allowed table → blocked, aggregate on restricted field → allowed. | Pass |
| 4 | --maximum-bytes-billed CLI override: config file set to 10GB, CLI set to 1 byte. Query rejected by BigQuery billing limit → CLI overrode config file. | Pass |
| 5 | SQL injection pattern validation: malicious pattern "'; DROP TABLE foo; --" in config. Scanner rejected it and did not run. | Pass |
| 6 | Fail-closed on scanner failure: scanner failed (bad pattern) with empty preventedFields. All queries blocked with actionable error message. | Pass |
| 7 | Scanner failure with existing restrictions: scanner failed but pre-populated preventedFields continued to work. Non-restricted → allowed, restricted → blocked. | Pass |
| 8 | CLI override in autoProtect mode: config file set to 10GB, CLI set to 1 byte in autoProtect mode. CLI override persists after config reload. | Pass |

Owner

@ergut ergut left a comment

Hi @drharunyuksel, all four points are addressed and verified. The pattern validation, region threading, CLI override, and fail-closed behavior are all in the code and the tests pass. The WHERE/ORDER BY design decision makes sense and is now documented.

This is a well-executed contribution. Merging now.

@ergut ergut merged commit 8781a84 into ergut:main Apr 8, 2026
@drharunyuksel
Contributor Author

Thank you @ergut! I appreciate the thorough review process. The three rounds of feedback made the implementation significantly more robust. Happy to contribute more in the future.

@ergut
Owner

ergut commented Apr 9, 2026

Thanks @drharunyuksel! You were patient throughout the process and delivered a high quality result. Hope to see more from you.
