Skip to content

Harden dev deploy pipeline and add OpenTelemetry instrumentation#84

Merged
se02035 merged 17 commits into
devfrom
bug/fix-tf-deploy-issue
May 1, 2026
Merged

Harden dev deploy pipeline and add OpenTelemetry instrumentation#84
se02035 merged 17 commits into
devfrom
bug/fix-tf-deploy-issue

Conversation

@se02035

@se02035 se02035 commented Apr 24, 2026

Copy link
Copy Markdown
Owner

Harden dev deploy pipeline and add OpenTelemetry instrumentation

🎫 GitHub Issue

Type of Task

  • New feature
  • Bug fix
  • Refactoring
  • UI/UX improvement

Description

  • Terraform (dev platform): Tighten Cloud Run multi-container wiring (OTel collector sidecar), App Hub discovered-service handling, optional Firestore PITR, IAM and variable surface, and Terraform/GitHub workflow behavior (including plan/apply refresh handling where documented).
  • Application: Add OpenTelemetry bootstrap (src/app/otel.py), wire it in main.py, extend logging configuration for correlation and third-party loggers, and add unit tests. Document OTEL-related environment variables in .env.example.
  • Repository hygiene: Ignore local Terraform plan artifacts in .gitignore; small README and script touch-ups aligned with deploy and observability.

Refs #53

🌟 Impact

  • UX: No direct UI change; production-like telemetry improves operability when OTLP is configured.
  • Performance: Negligible overhead when OTEL is disabled; sidecar adds Cloud Run resource use as already modeled in Terraform.
  • Security: No broadening of public access by default; session and backend secrets remain environment-driven. Review Terraform IAM deltas in the diff.
  • Maintenance: Clearer split between Terraform-managed platform settings and CI-deployed app configuration; tests lock in logging and OTEL behavior.

ℹ️ Additional Information

  • After merging, run a dev Terraform plan/apply if your remote state is behind this branch, then confirm GitHub Actions deploy and optional Terraform workflows against the dev environment.
  • Ensure GitHub Environment variables (for example CLOUD_RUN_SET_ENV_VARS, image bases, and project/region) stay aligned with README.md and Terraform outputs.

💥 Issues

  • Cloud Run environment variables set only by gcloud run deploy (outside Terraform) can be overwritten when Terraform updates the service template; prefer defining durable env in Terraform or redeploying via the Deploy to environment workflow after infra applies.
  • If deploy smoketests still hit 403 on private Cloud Run, confirm the workflow uses an identity token for curl and that the deploy service account has roles/run.invoker on the service (or enable unauthenticated access where policy allows).

🏞️ Demo

Not applicable (infrastructure, workflow, and backend instrumentation changes only).

se02035 added 15 commits April 24, 2026 12:43
Initialize OTEL in app lifespan with OTLP when configured, extend
logging to bridge stdlib and third-party loggers, and add tests. Document
OTEL env vars in .env.example. Ignore local Terraform plan outputs in
.gitignore. Trim trailing whitespace in platform module README.

Made-with: Cursor
@se02035 se02035 added the enhancement New feature or request label Apr 24, 2026
@coderabbitai

coderabbitai Bot commented Apr 24, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1ba59199-b9a1-4716-974f-ef7603d39f87

📥 Commits

Reviewing files that changed from the base of the PR and between dfe901b and 51a53ad.

📒 Files selected for processing (2)
  • src/app/otel.py
  • tests/test_otel.py

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • OpenTelemetry (OTLP) support with configurable exporting and automatic app instrumentation
    • Log–trace correlation added to structured logs
    • Optional App Hub registration and Firestore point-in-time recovery toggle
  • Improvements

    • Cloud Run: tuned startup probes and explicit container targeting for deploys
    • Terraform/CI: default denies unauthenticated invocations, new deploy toggles, updated IAM guidance and plan refresh behavior
    • Collector image defaults to a pinned public Google image
  • Tests

    • New tests covering OTEL setup/instrumentation and log correlation
  • Documentation

    • README and Terraform docs updated with usage and operational guidance

Walkthrough

Adds configurable OpenTelemetry initialization and log–trace correlation, conditional App Hub registration/adoption, Firestore PITR/deletion policy inputs, Cloud Run probe and sidecar adjustments, Terraform workflow/CI inputs and docs updates, and accompanying tests and minor deployment script/workflow tweaks.

Changes

Cohort / File(s) Summary
OpenTelemetry code & deps
src/app/otel.py, src/app/logging_config.py, src/app/main.py, pyproject.toml
New OTEL helper (setup, active, reset, instrument_app), log→trace correlation processor, app imports/initializes OTEL on startup, and OTEL dependencies added.
OTEL tests
tests/test_otel.py, tests/test_logging_config.py
New unit tests covering OTEL setup idempotency, tracing/metrics activation, instrumentation behavior, and log correlation with/without span and GOOGLE_CLOUD_PROJECT.
Cloud Run deploy scripts/workflow
.github/workflows/deploy-env.yml, tools/deploy_cloud_run.sh
gcloud run deploy now passes --container app to target the app container in multi-container services.
Terraform workflows & CI
.github/workflows/terraform.yml, .gitignore, infra/terraform/README.md
Terraform deploy plan uses -refresh=false; workflow accepts optional TF env vars for unauthenticated/App Hub; tfplan artifacts ignored; docs updated with IAM/drift guidance.
Terraform environment inputs
infra/terraform/environments/dev/main.tf, .../terraform.tfvars.example, .../variables.tf, infra/terraform/fixtures/terraform-ci.tfvars
Passes App Hub and Firestore flags into module; example tfvars expanded; allow_unauthenticated default changed to false in dev and CI fixture updated.
Module variables & locals
infra/terraform/modules/smart-events-platform/variables.tf, locals.tf
New variables for App Hub and Firestore (PITR/deletion policy); otel_collector_image defaulting logic changed to public digest-pinned Google image via locals; variable descriptions updated.
Module resource changes
infra/terraform/modules/smart-events-platform/apphub.tf, cloud_run.tf, firestore.tf, iam.tf, observability.tf
Conditional/adoptable App Hub creation and discovery, truncated effective service_id, lifecycle.ignore_changes for imported App Hub metadata; Cloud Run updated probes/startup behavior and deletion_protection=false; Firestore configured from variables; added monitoring IAM and conditional App Hub admin IAM; uptime check depends on monitoring IAM binding.
Documentation & examples
README.md, infra/terraform/modules/smart-events-platform/README.md, .env.example
Docs updated with OTEL env examples and usage, Terraform IAM guidance, App Hub/Firestore operational notes, sample tfvars entries, and updated Cloud Run/CI guidance.

Sequence Diagram

sequenceDiagram
    participant App as FastAPI App
    participant OTEL as OTEL SDK
    participant Instr as FastAPI/HTTPX Instrumentors
    participant Collector as OTLP Collector
    participant CloudLog as Cloud Logging

    Note over App,OTEL: Startup
    App->>OTEL: setup_opentelemetry()
    OTEL->>OTEL: init TracerProvider & MeterProvider
    OTEL->>Collector: register OTLP span/metric exporters
    App->>Instr: instrument_app(app)

    Note over App,OTEL: Request flow
    App->>OTEL: start span
    App->>App: emit log
    App->>OTEL: (log processor) get_current_span()
    OTEL-->>App: span context
    App->>CloudLog: emit JSON with logging.googleapis.com/trace/spanId/trace_sampled
    OTEL->>Collector: export spans/metrics
    Collector->>CloudLog: telemetry ingestion
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Poem

🐰
Spans and logs in cozy rows I weave,
Sidecars hum while App Hub takes its leave,
Firestore sleeps with PITR at bay,
Probes tiptoe softly so workers may play,
I twitch my whiskers and celebrate telemetry!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 48.39% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely summarizes the main changes: hardening the dev deploy pipeline and adding OpenTelemetry instrumentation, which aligns with the substantial Terraform/workflow and application-level changes throughout the PR.
Description check ✅ Passed The description comprehensively covers the PR's scope: Terraform platform improvements, application OTel bootstrap and logging, repository hygiene, and operational guidance. It directly addresses multiple file changes and provides context for reviewers.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch

Review rate limit: 3/5 reviews remaining, refill in 22 minutes and 1 second.

Comment @coderabbitai help to get the list of available commands and usage tips.

@se02035 se02035 self-assigned this Apr 24, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/terraform.yml (1)

184-190: 🧹 Nitpick | 🔵 Trivial

Document operational implications of -refresh=false in the workflow.

Using -refresh=false avoids the Monitoring API refresh race condition (as documented in lines 86-88), but it means Terraform won't detect out-of-band drift during CI runs. If resources are modified outside Terraform (e.g., via console or gcloud), the plan won't reflect those changes.

Consider adding a periodic refresh-enabled plan (manual or scheduled) to detect drift, or document in the README that local runs should use -refresh=true to catch drift.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/terraform.yml around lines 184 - 190, The CI terraform
plan call that uses -refresh=false (the terraform plan invocation producing
tfplan.binary and using variables like project_id, region,
terraform_state_bucket_name, infra_deployment_ref and EXTRA) should be
accompanied by documentation and/or an additional workflow job to surface drift:
update the README or repository docs to explain that -refresh=false avoids the
Monitoring API race but prevents detection of out-of-band drift, and add either
a scheduled/manual workflow that runs terraform plan without -refresh=false (or
a documented local run instruction to use -refresh=true) so drift is
periodically detected and operators know how to run a refresh-enabled plan.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@infra/terraform/modules/smart-events-platform/variables.tf`:
- Around line 167-177: Add validation blocks to the two variables to reject
values with leading/trailing whitespace and optionally enforce an
allowed-character pattern; for example in variable "apphub_application_id" and
"apphub_service_id" add a validation { condition = var.apphub_application_id ==
"" || trimspace(var.apphub_application_id) == var.apphub_application_id &&
(can(regex("^[a-zA-Z0-9-_]+$", var.apphub_application_id)) ?
length(regexall("^[a-zA-Z0-9-_]+$", var.apphub_application_id)) == 1 : false)
error_message = "apphub_application_id must not include leading/trailing
whitespace and may only contain letters, digits, hyphens or underscores" } (and
the equivalent using var.apphub_service_id for that variable), using trimspace
and regex to detect/trap invalid input.

In `@infra/terraform/README.md`:
- Line 15: Update the guidance that currently recommends granting project-wide
roles/iam.serviceAccountAdmin; instead instruct maintainers to grant
iam.serviceAccounts.getIamPolicy and iam.serviceAccounts.setIamPolicy only on
the specific runtime/service account resources (using
google_service_account_iam_member or resource-scoped IAM bindings) or create a
minimal custom role with just those permissions, and replace the explicit
project-level roles/iam.serviceAccountAdmin recommendation with this
resource-scoped or custom-role approach while keeping mentions of
iam.serviceAccounts.getIamPolicy / setIamPolicy and
google_service_account_iam_member so readers can locate the relevant Terraform
resources.

In `@README.md`:
- Around line 368-370: Cloud Run section still implies the baked-in collector is
the default while the later Terraform text correctly states the default is
Google’s public otelcol-google image; update the Cloud Run wording so both
places consistently state the default is the public otelcol-google image and
that baking your own image is optional — if you want the repo’s
otelcol-google-config.yaml baked in you must build/push to your GAR and set the
module input otel_collector_image to the full image URI; ensure the Cloud Run
text mirrors the Terraform paragraph (public image default, baking is
recommended only when you need exact config) and mention otel_collector_image
and otelcol-google-config.yaml so readers can find the override instructions.
- Line 335: The heading "Terraform deploy identity (IAM)" is using #### which
skips from the previous ## level and triggers markdownlint MD001; change that
heading to ### (increase level by one) so headings progress ## → ### → ... and
the file's heading hierarchy is consistent; update the "Terraform deploy
identity (IAM)" heading text accordingly to use three hashes and run
markdownlint to verify the MD001 warning is resolved.

In `@src/app/otel.py`:
- Around line 39-49: The _should_init function currently uses multiple ifs and a
final explicit return False; simplify by returning a single boolean expression:
check _is_sdk_disabled() and the three env vars (OTEL_EXPORTER_OTLP_ENDPOINT,
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, OTEL_EXPORTER_OTLP_METRICS_ENDPOINT)
directly so the function becomes a concise one-liner that negates
_is_sdk_disabled() and ORs the three environment checks (calling .strip() on
each) to determine the result.

In `@tests/test_otel.py`:
- Around line 70-76: The pytest.parametrization currently passes parameter names
as a comma-separated string to `@pytest.mark.parametrize` ("key, value"); change
that to a tuple of names (e.g., ("key", "value")) in the pytest.mark.parametrize
call so the decorator uses a tuple for parameter names; update the decorator
where the three OTEL endpoint pairs are provided to use ("key", "value") instead
of "key, value".

---

Outside diff comments:
In @.github/workflows/terraform.yml:
- Around line 184-190: The CI terraform plan call that uses -refresh=false (the
terraform plan invocation producing tfplan.binary and using variables like
project_id, region, terraform_state_bucket_name, infra_deployment_ref and EXTRA)
should be accompanied by documentation and/or an additional workflow job to
surface drift: update the README or repository docs to explain that
-refresh=false avoids the Monitoring API race but prevents detection of
out-of-band drift, and add either a scheduled/manual workflow that runs
terraform plan without -refresh=false (or a documented local run instruction to
use -refresh=true) so drift is periodically detected and operators know how to
run a refresh-enabled plan.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: bf90b31e-7fa9-495c-ab13-5de6bd78189b

📥 Commits

Reviewing files that changed from the base of the PR and between 43777ee and d414c3b.

📒 Files selected for processing (25)
  • .env.example
  • .github/workflows/deploy-env.yml
  • .github/workflows/terraform.yml
  • .gitignore
  • README.md
  • infra/terraform/README.md
  • infra/terraform/environments/dev/main.tf
  • infra/terraform/environments/dev/terraform.tfvars.example
  • infra/terraform/environments/dev/variables.tf
  • infra/terraform/fixtures/terraform-ci.tfvars
  • infra/terraform/modules/smart-events-platform/README.md
  • infra/terraform/modules/smart-events-platform/apphub.tf
  • infra/terraform/modules/smart-events-platform/cloud_run.tf
  • infra/terraform/modules/smart-events-platform/firestore.tf
  • infra/terraform/modules/smart-events-platform/iam.tf
  • infra/terraform/modules/smart-events-platform/locals.tf
  • infra/terraform/modules/smart-events-platform/observability.tf
  • infra/terraform/modules/smart-events-platform/variables.tf
  • pyproject.toml
  • src/app/logging_config.py
  • src/app/main.py
  • src/app/otel.py
  • tests/test_logging_config.py
  • tests/test_otel.py
  • tools/deploy_cloud_run.sh

Comment thread infra/terraform/modules/smart-events-platform/variables.tf
Comment thread infra/terraform/README.md Outdated
Comment thread README.md Outdated
Comment thread README.md
Comment thread src/app/otel.py Outdated
Comment thread tests/test_otel.py
- Validate apphub_application_id and apphub_service_id (trim, charset).
- Prefer scoped SA IAM over project serviceAccountAdmin in Terraform READMEs;
  document terraform plan refresh vs drift for CI and operators.
- Align README Cloud Run collector wording with public otelcol-google default;
  fix Terraform IAM heading level (MD001); note CI -refresh=false trade-off.
- Collapse _should_init to a single expression; parametrize OTEL tests with a
  tuple of parameter names.

Co-authored-by: Cursor <cursoragent@cursor.com>
@coderabbitai coderabbitai Bot added the bug Something isn't working label May 1, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/app/otel.py`:
- Around line 66-70: The current assignment of _service uses
(os.environ.get("OTEL_SERVICE_NAME") or "smart-events").strip(), which yields an
empty string if OTEL_SERVICE_NAME is present but only whitespace; update the
logic so you first read and strip the environment value, then if the stripped
value is falsy fallback to "smart-events" before passing to Resource.create
(affecting _service, SERVICE_NAME usage and the Resource.create call); in short:
strip the raw env var first, then apply the default when the stripped result is
empty.
- Around line 39-48: Current _should_init() and setup_opentelemetry() conflate
traces and metrics so an unconfigured signal falls back to default exporters and
instrument_app() uses the generic active() flag; split activation by adding
per-signal checks (e.g., _should_init_traces() and _should_init_metrics() or
have _should_init() return two booleans), change setup_opentelemetry() to only
instantiate OTLPSpanExporter when the traces endpoint env var
(OTEL_EXPORTER_OTLP_ENDPOINT or OTEL_EXPORTER_OTLP_TRACES_ENDPOINT) is set and
pass that endpoint to OTLPSpanExporter, and only instantiate OTLPMetricExporter
when the metrics endpoint env var (OTEL_EXPORTER_OTLP_METRICS_ENDPOINT) is set
and pass that endpoint to OTLPMetricExporter; finally update instrument_app() to
gate instrumentation on the tracing-enabled flag (the new
_should_init_traces()/traces boolean) instead of the generic active() so only
tracing instrumentation runs when tracing is explicitly enabled.
- Around line 91-115: The current reset_opentelemetry_state_for_tests() helper
cannot restore globals because opentelemetry.trace.set_tracer_provider and
opentelemetry.metrics.set_meter_provider are one-shot; update the test strategy
instead of trying to call those setters repeatedly: leave
reset_opentelemetry_state_for_tests() alone or remove its calls to
otel_trace.set_tracer_provider and otel_metrics.set_meter_provider, and change
the autouse fixture in tests/test_otel.py to patch (mock) the functions
opentelemetry.trace.set_tracer_provider and
opentelemetry.metrics.set_meter_provider at import time (or run tests in a
subprocess) so each test controls what provider is installed before
setup_opentelemetry() runs; alternative approach is to add a private test-only
hook that setup_opentelemetry() will accept (e.g., a parameter or internal
function name) to inject fresh providers without calling the one-shot setters.
Ensure you reference reset_opentelemetry_state_for_tests(), setup_opentelemetry,
and the mocked symbols otel_trace.set_tracer_provider /
otel_metrics.set_meter_provider when making changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: f3c33f57-8a31-4797-b0f9-6eb8d0677c4a

📥 Commits

Reviewing files that changed from the base of the PR and between d414c3b and dfe901b.

📒 Files selected for processing (6)
  • .github/workflows/terraform.yml
  • README.md
  • infra/terraform/README.md
  • infra/terraform/modules/smart-events-platform/variables.tf
  • src/app/otel.py
  • tests/test_otel.py

Comment thread src/app/otel.py Outdated
Comment thread src/app/otel.py Outdated
Comment thread src/app/otel.py
- Strip OTEL_SERVICE_NAME before defaulting to smart-events
- Initialize traces and metrics only when their OTLP endpoints are set;
  pass explicit endpoints to gRPC exporters
- Gate FastAPI/HTTPX instrumentation on trace OTLP only
- Simplify test reset and mock provider setters in test_otel autouse

Co-authored-by: Cursor <cursoragent@cursor.com>
@se02035 se02035 merged commit 87c852e into dev May 1, 2026
12 checks passed
se02035 added a commit that referenced this pull request May 2, 2026
* Add optional local LLM proxy dev tooling (#62)

* feat(tools): add optional local LLM proxy MVP

- Add tools/local-llm-proxy with LiteLLM, compose, env example, scripts
- Add CI workflow for shell/yaml/compose validation on path changes
- Ignore tools/local-llm-proxy/.env for local secrets

Refs #61

Made-with: Cursor

* chore: add shellcheck pre-commit hook

- Use shellcheck-py v0.11.0.1 for repo-wide shell linting
- Fix deploy_cloud_run env-var join without a subshell (SC2030/SC2031)
- Log expected Ollama model in test-setup.sh (SC2034)

Made-with: Cursor

* fix(tools): satisfy shellcheck SC2034 in test-setup.sh

Declare expected Ollama model inside main() so CI shellcheck sees usage.

Made-with: Cursor

* PR comment added

* chore(local-llm-proxy): harden proxy tooling and CI lint

- Point yamllint at root .yamllint; lint lib.sh in workflow; trigger on .yamllint
- Document CURSOR_TO_LITELLM_KEY as dev-only placeholder in .env.example
- Pin LiteLLM image to v1.83.7-stable; add compose env defaults
- Share err/log via lib.sh; fix shellcheck for script dir and dynamic source
- README: require curl for test-setup.sh; test-setup: curl timeouts and model log

Made-with: Cursor

* pr comments added

* fixed pr comments

* added ngrok support

* feat(local-llm-proxy): add Ngrok tunnel, PostgreSQL DB, and hardware-accelerated Ollama support

- Integrated Ngrok for secure public URL exposure of the LiteLLM proxy.
- Added a PostgreSQL container for LiteLLM UI data persistence.
- Reverted to native Ollama support on Mac to leverage Apple Silicon GPU acceleration.
- Created `proxy.sh` wrapper script for unified service lifecycle management (start/stop/restart).
- Updated healthchecks to use Python for container compatibility.
- Hardened model loading and test scripts for local development workflows.

Made-with: Cursor

* feat(local-llm-proxy): automate virtual key seeding and harden proxy scripts

- Rename `CURSOR_TO_LITELLM_KEY` to `LITELLM_MASTER_KEY` to align with LiteLLM terminology.
- Update `proxy.sh` to automatically seed a virtual key (`local-proxy-key`) restricted to configured models upon startup.
- Harden `proxy.sh` with robust `jq` parsing and authenticated model fetching to prevent initialization errors.
- Fix root `.env` line endings to resolve Pydantic boolean parsing failures.
- Update documentation and `.gitignore` for the new virtual key persistence.

Made-with: Cursor

* Feat/devtooling localllm (#63)

* feat(tools): add optional local LLM proxy MVP

- Add tools/local-llm-proxy with LiteLLM, compose, env example, scripts
- Add CI workflow for shell/yaml/compose validation on path changes
- Ignore tools/local-llm-proxy/.env for local secrets

Refs #61

Made-with: Cursor

* chore: add shellcheck pre-commit hook

- Use shellcheck-py v0.11.0.1 for repo-wide shell linting
- Fix deploy_cloud_run env-var join without a subshell (SC2030/SC2031)
- Log expected Ollama model in test-setup.sh (SC2034)

Made-with: Cursor

* fix(tools): satisfy shellcheck SC2034 in test-setup.sh

Declare expected Ollama model inside main() so CI shellcheck sees usage.

Made-with: Cursor

* PR comment added

* chore(local-llm-proxy): harden proxy tooling and CI lint

- Point yamllint at root .yamllint; lint lib.sh in workflow; trigger on .yamllint
- Document CURSOR_TO_LITELLM_KEY as dev-only placeholder in .env.example
- Pin LiteLLM image to v1.83.7-stable; add compose env defaults
- Share err/log via lib.sh; fix shellcheck for script dir and dynamic source
- README: require curl for test-setup.sh; test-setup: curl timeouts and model log

Made-with: Cursor

* pr comments added

* fixed pr comments

* added ngrok support

* feat(local-llm-proxy): add Ngrok tunnel, PostgreSQL DB, and hardware-accelerated Ollama support

- Integrated Ngrok for secure public URL exposure of the LiteLLM proxy.
- Added a PostgreSQL container for LiteLLM UI data persistence.
- Reverted to native Ollama support on Mac to leverage Apple Silicon GPU acceleration.
- Created `proxy.sh` wrapper script for unified service lifecycle management (start/stop/restart).
- Updated healthchecks to use Python for container compatibility.
- Hardened model loading and test scripts for local development workflows.

Made-with: Cursor

* feat(local-llm-proxy): automate virtual key seeding and harden proxy scripts

- Rename `CURSOR_TO_LITELLM_KEY` to `LITELLM_MASTER_KEY` to align with LiteLLM terminology.
- Update `proxy.sh` to automatically seed a virtual key (`local-proxy-key`) restricted to configured models upon startup.
- Harden `proxy.sh` with robust `jq` parsing and authenticated model fetching to prevent initialization errors.
- Fix root `.env` line endings to resolve Pydantic boolean parsing failures.
- Update documentation and `.gitignore` for the new virtual key persistence.

Made-with: Cursor

* moved the local-llm-proxy to a separate repo

* Add configurable initial ramp users for Locust load tests (#64)

* initial implementation

* feat(load-test): add configurable initial ramp user count

Expose SMART_EVENTS_LOADTEST_INITIAL_USERS and workflow_dispatch input so the first Locust step can differ from per-step growth; document env vars and ramp math.

Made-with: Cursor

* fixed pr comments

* implemented pr feedback

* made load tests more robust

* fix(load): ramp docs, duration stop, and signup bootstrap stats

- Document SMART_EVENTS_LOADTEST_STEP_TIME and MAX_DURATION; stop the shape when run_time >= max_duration.

- Cap default initial users to max_users; only reject INITIAL_USERS > max when set explicitly.

- Use catch_response for /api/signup/verify (200/409 success); richer signup failure text; log loaded attendee count.

- Shorten unknown-attendee signup error log to satisfy ruff line length.

Made-with: Cursor

* Create infrastructure deployment using IaC (#70)

* added caveman skill and rule

* feat: add Terraform deployment scaffolding and strengthen session loading

Introduce Terraform modules, environment configs, and CI workflows for GCP infrastructure while updating session routing/loader and Firestore seeding logic to better handle track and timeslot data from local and Firestore-backed catalogs.

Made-with: Cursor

* fixed an bug around sesison schedule view

* fix: CI hygiene, Terraform hardening, and stricter session parsing

- Drop duplicate tflint --recursive before pre-commit in build and lint workflows.
- Ignore nested Terraform state; mark dev platform output sensitive; align tfvars example.
- Add firestore_location_id, pin utils module via git ref, optional uptime alert policy.
- Tighten CI deploy SA IAM; idempotent tfstate bootstrap; conditional Cloud Run unauth flag.
- Accept only string track IDs in pages and sessions loader; safer timeslot fallback.
- Seed Firestore: ship sample override only for Locust sessions dir; fix ruff line length.
- Test seed catalog selection by comparing session ids to shipped sample files.

Made-with: Cursor

* test: do not require Locust sessions dir on disk for seed path test

tests/load/data/ is gitignored, so CI checkouts lack that folder. The shipped-demo
override is keyed by resolved path equality; assert catalog ids match sample without
asserting load_test_dir.is_dir().

Made-with: Cursor

* Harden deploy and IaC naming; share JSON list helpers

- Normalize CLOUD_RUN_ALLOW_UNAUTH to lowercase before treating as true in deploy workflow.
- Pad Terraform service account account_id when under GCP minimum length; clarify environment validation message.
- Add app.utils.json_helpers (as_dict_list, as_str_list) and use from pages and sessions_loader.
- Normalize track IDs with strip().lower() without redundant str().

Made-with: Cursor

* Add linux_amd64 provider checksums to dev Terraform lock file

terraform init on GitHub Actions (linux_amd64) was rewriting the lock file
during pre-commit terraform_validate, which fails the hook when tracked
files change. Align the dev environment lock file with providers lock for
darwin_arm64 and linux_amd64 so CI matches local behavior.

Made-with: Cursor

* Use Firestore ids as canonical; strict session track handling

- FirestoreReadOnlyStore: always set document id from snapshot.id.
- FirestoreSessionStore: always set row id from doc.id.
- Sessions page: match_track and track catalog accept only string ids/tracks.

Made-with: Cursor

* enable prelease builds from feature branches

* fixed linting issues

* refactored GH action

* added unit tests

* added support for firestore enumlator to run integration tests

* fixed JRE version for integration tests

* Fix Firestore emulator leakage and login verification (#73)

* added config to disable firestore emulator

* fixed GH action bug

* fixed PR comments

* fixed pr comments

* fixed pr comments

* Sessions: multi-select tracks, ordered columns, and deselect all (#75)

* initial bug fix

* feat(sessions): multi-select tracks, ordered columns, deselect all

- Support repeated track= params; preserve selection order for schedule columns
- Add track=none for explicit empty selection; query helpers for HTMX chips
- Track chips: selected indicator, data-track-key; remove All-tracks Alpine bar
- Session schedule layout: column selection from selected track keys
- E2E: Playwright mac-x64/arm64 symlink shim; update extended coverage and sessions UI
- Ruff: wrap long startup log line in main.py

Made-with: Cursor

* fixed pr comments

* fixed pr comments

* Fix empty My Events to Sessions navigation regressions (#78)

* bug fixed

* pr comments fixed

* Implement app monitoring and OTel collector sidecar (#79)

* implemented app montoring with app hub setup

* ruleset updated

* fixed failing GH actions

* be more restrictive on the linting check side

* implemented PR feedback

* Add event program schema validation and session catalog hardening

- Validate main event document against event_program.schema.json
- Normalize Firestore-style session docs; optional event arg for load_sessions
- Per-request session catalog reload middleware for fresh Firestore reads
- Sessions page day key fallback when timeslot omits start; schedule grid UTC fix
- API catalog-diagnostics endpoint; seed_firestore validates events
- README: EVENT_PROGRAM_SCHEMA_PATH; E2E language switch force-click stability

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add Cursor Cloud backlog workflow rule and prompt (#88)

Co-authored-by: Cursor <cursoragent@cursor.com>

* Harden dev deploy pipeline and add OpenTelemetry instrumentation (#84)

* fixed permission issues.

* fixing auth issue

* updated

* updated

* updated

* updated

* updated

* updated

* updated

* updated

* updated

* updated

* updated

* updated

* Add OpenTelemetry setup and log correlation

Initialize OTEL in app lifespan with OTLP when configured, extend
logging to bridge stdlib and third-party loggers, and add tests. Document
OTEL env vars in .env.example. Ignore local Terraform plan outputs in
.gitignore. Trim trailing whitespace in platform module README.

Made-with: Cursor

* Harden App Hub tfvars, clarify IAM and OTEL docs, simplify OTEL init

- Validate apphub_application_id and apphub_service_id (trim, charset).
- Prefer scoped SA IAM over project serviceAccountAdmin in Terraform READMEs;
  document terraform plan refresh vs drift for CI and operators.
- Align README Cloud Run collector wording with public otelcol-google default;
  fix Terraform IAM heading level (MD001); note CI -refresh=false trade-off.
- Collapse _should_init to a single expression; parametrize OTEL tests with a
  tuple of parameter names.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix OpenTelemetry env handling and per-signal OTLP setup

- Strip OTEL_SERVICE_NAME before defaulting to smart-events
- Initialize traces and metrics only when their OTLP endpoints are set;
  pass explicit endpoints to gRPC exporters
- Gate FastAPI/HTTPX instrumentation on trace OTLP only
- Simplify test reset and mock provider setters in test_otel autouse

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* Make login code requirement configurable (#90)

* Implement configurable login code requirement

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* Harden login code defaults and errors

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* Make login error tests force strict mode

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* Stabilize language switch E2E test

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* Config-driven toggles: Information, Profile, and AI assistant (#91)

* Add feature-toggle settings (information/profile/ai_assistant)

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* Gate optional features in routes, nav, and assistant shell

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* Add feature toggle tests and README deployment-modes section

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* Address PR #91 review: include session-ratings in always-available list

- README: add /api/session-ratings* to the 'always available' sentence
  alongside /sessions, /api/sessions, /my-schedule, /api/schedule*.
- tests/test_feature_toggles.py: add
  test_session_ratings_api_reachable_when_all_flags_off (placed next to
  test_sessions_api_reachable_when_all_flags_off) to lock in that
  GET /api/session-ratings stays reachable when every feature flag is off.

Verified against api.py L958 and L974: neither GET /api/session-ratings
nor PUT /api/session-ratings/{session_id} declares require_feature_enabled,
so they really are always available.

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* Address PR #91 review: clarify default-on wording in README

The parenthetical 'flip a flag to false (or omit it)' was ambiguous: it
read as 'to disable, set false or omit', which is the opposite of actual
behavior (defaults are True, so omitting the variable keeps the feature
enabled). Spell it out and reaffirm the stable 404 body shape.

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* fix(e2e): stabilize language switch + mobile nav locators

Disable hx-boost on language links so full reload avoids detached-node clicks.
Scope data-app-nav targets by viewport (no or_ strict violations) and open
the drawer via Alpine navOpen with menu fallback after animated closes.

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* Document Cloud Agent VM gotchas in AGENTS.md and cursor rules (#92)

* Document Cloud Agent VM gotchas in AGENTS.md and rules

- AGENTS.md: refresh-inherited-venv, backlog-rule lives on dev,
  Terraform pre-commit carve-out, verify-pr.sh wrapper false positive.
- python-venv-pip.mdc: 'Refreshing an inherited venv' subsection.
- commit-workflow.mdc: env-only hook carve-out (Terraform/tflint),
  cross-reference to verify-pr.sh false positive.
- cloud-backlog-agent-workflow.mdc: warn when started on main,
  cross-reference env-only hook and verify-pr.sh carve-outs in
  Phase 2 instructions.

These changes are doc-only and pulled from issues hit while
implementing #68 in PR #91.

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* Fix flaky test_language_change after htmx-boost re-render race

The language switcher dropdown is rendered by Alpine inside the header.
After a language switch, htmx-boost replaces #app-shell-main and the
language-switcher component is re-mounted with open=false. The previous
test sequence (click button -> click force=True on link) raced against
that re-mount: the dropdown had already been closed (or detached) by the
time Playwright resolved/clicked the link, producing the
'element was detached from the DOM, retrying' timeout seen on GHA run
https://github.com/se02035/[REDACTED]/actions/runs/25235275354.

Replace the two ad-hoc click pairs with a small open_switcher_and_pick
helper that:
- waits for the switcher button to display the *current* locale (so the
  previous render has settled),
- clicks the button and waits up to ~1.5s for the target link to be
  visible, retrying the click up to 5x if the htmx swap closed the
  dropdown again,
- then clicks the link.

No application/template changes; the language switcher behavior itself
is unchanged. Verified locally with 5x desktop runs and 3x mobile runs
in a row, plus a full tests/e2e/ pass on both viewports (36 + 36).

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* fix(e2e): stabilize language switcher test against htmx swap races

Scope locators to #language-switcher-mount and retry open-dropdown +
click when Alpine remounts after hx-boost swaps (fixes CI timeout on
EN link click).

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* fix(e2e): click language links via DOM when Alpine hides menu

Playwright actionability requires visible targets; CI timed out on
?lang=en because x-show kept the link hidden after intermittent reopen.
Use evaluate(el.click()) after visible opener retries (matches prior dev
helper), with wait_for(attached) on the final path.

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* fix(e2e): wait for visible language link on header switch fallback

Replace attached-only wait so Alpine x-show dropdown must be open before
evaluate(el.click()) on the fallback path.

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* Split README into docs guides (#96)

* docs: split README into guides under docs/

Move long-form content to focused markdown files (local development,
testing, configuration, infrastructure, deployment, data seeding).
Keep a lean root README with quick start and documentation index.
Update AGENTS.md, infra/terraform README, and Cursor rules to link to
the new docs.

Refs #95

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs: target Unix shells; drop Windows setup notes

State Unix/macOS/Linux scope in README and dev guides; remove Windows
venv and E2E env examples from commit-workflow and python-venv-pip.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add QR login scanner for attendee ID (issue 86) (#97)

* feat(login): add QR login scanner with tests and PWA cache update

Refs #86

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* fix(login): a11y QR dialog, scanner retry state, SW lazy static JS

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* fix(qr): guard load after dialog close, clear scan status on errors

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* fix(qr): validate decode before close, guard html5.start resolution

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* fix(qr): clear Html5Qrcode after stop promise settles

Refs #86

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* test(e2e): stabilize QR scanner dialog visibility on CI

The stub used delayMs 20 so the modal could close before Playwright
observed showModal on slow runners. Use a longer stub delay and
explicit visibility timeouts for the dialog and heading.

Refs #86

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* Fix QR login scanner E2E by repairing JS parse errors

The file header comment contained `**/` which terminated the block
comment early and broke the entire script in the browser. Replace
Promise.prototype.finally with then/catch for older Chromium. E2E
server env now sets non-placeholder legal emails; add Playwright
coverage for deny-permission, lib load failure, and insecure-context
paths.

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* test: expand QR scanner coverage (static + E2E)

Add unit checks for qr-login-scanner.js Playwright-compat patterns and E2E
cases for html5 retry after load failure and stub path without attendee-id.

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* fix(qr-login-scanner): show scanning status on retry

Match openQrLoginScanner by setting data-msg-scanning on the status line
before stopScanner and stub or html5 load paths; failure paths still clear it.

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* fix(qr-scanner): apply review feedback improvements

- Hide retry button when insecure-context error is shown (retrying
  cannot fix a non-HTTPS context so the button was misleading UX).

- Remove redundant double-stop calls in startRealScanner onOk
  then/catch: after html5.stop() resolves the scanner is already
  stopped; early-return on stale state instead of calling stop() again.

- Fix applyDecoded always returning true: the final fallback branch
  (no passcode field, no form) now returns false to match the declared
  contract.

- Move QR-trigger click binding from inline onclick to a JS
  addEventListener in DOMContentLoaded, keyed off [data-qr-login-trigger].
  Removes the inline handler from login.html and improves resilience
  (no uncaught ReferenceError if the script fails to load).

- Remove redundant role="dialog" attribute from the <dialog> element;
  the element's implicit ARIA role is already dialog.

- Extend test_qr_login_scanner_js.py with a new guard test that
  asserts the E2E test-hook localStorage key constants have not been
  silently renamed.

- Update test_login_page_includes_qr_login_scanner_markup to assert
  the absence of the removed inline onclick and the narrowed role check.

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* fix(tests): apply inline code-review findings

- conftest.py: seed ATT_ID_SIGNUP_LOGOUT with account_activated=False so
  test_user_signup_login_and_logout exercises the real first-time signup
  modal.  All other E2E users remain pre-activated for direct passcode
  login.  Consolidate the two allowlist_data imports into one.

- test_ui_flows.py: update _open_qr_login_scanner docstring to reflect
  that the QR trigger is now bound via addEventListener in DOMContentLoaded,
  not an inline onclick attribute (changed in the previous review commit).

- test_api.py: extract the fetch handler body with regex and assert that
  cache.put appears inside it, not just anywhere in sw.js, so a regression
  that removes runtime caching would be caught by this test.

Co-authored-by: Oliver L <se02035@users.noreply.github.com>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Oliver L <se02035@users.noreply.github.com>

* feat(infra): destroy dev terraform stack via GitHub Action (#103)

* feat(infra): add manual destroy workflow for dev terraform stack

Add .github/workflows/terraform-destroy.yml so the dev infra can be torn
down from a GitHub Action, mirroring the deploy job's auth/init wiring
but running terraform destroy -auto-approve against
infra/terraform/environments/dev.

Safety:
- workflow_dispatch only; required confirm input must equal 'destroy-dev'
- shares concurrency group 'terraform-dev' with deploy so destroy and
  apply cannot race the same state
- runs in the dev GitHub Environment, so existing reviewer rules apply

Document the new workflow under 'Bootstrap and remote state' in
docs/infrastructure.md, including the firestore_deletion_policy caveat
and the post-destroy refresh of GitHub Environment vars / GCP_SA_KEY.

Refs #83

* docs(infra): correct firestore destroy default to DELETE in dev

The dev root (infra/terraform/environments/dev/variables.tf) defaults
firestore_deletion_policy to DELETE, not the platform module's ABANDON.
Update the destroy workflow note and docs/infrastructure.md to reflect
that destroy WILL delete the Firestore database (and its data) in GCP
unless the user sets firestore_deletion_policy = "ABANDON" in the dev
terraform.tfvars before destroying.

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Oliver L <se02035@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant