feat(api): add scheduled cleanup task for dataset_queries #35734
Echo0ff wants to merge 3 commits into langgenius:main
Conversation
The dataset_queries table grows without bound because every RAG retrieval and hit-test inserts a row. This adds a configurable Celery Beat task (clean_dataset_queries_task) that deletes rows older than a retention period (default 60 days) in batches, gated by ENABLE_CLEAN_DATASET_QUERIES_TASK. Retention is clamped to max(config, PLAN_SANDBOX_CLEAN_DAY_SETTING) to avoid breaking clean_unused_datasets_task, which reads DatasetQuery.created_at. Also adds a created_at index on dataset_queries via alembic migration to keep the delete scan performant as the table grows.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
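For context, a minimal sketch of what such a batched deletion could look like (the helper name, session handling, and exact query shape are assumptions for illustration, not the code in this PR):

```python
from datetime import datetime, timedelta

from extensions.ext_database import db
from models.dataset import DatasetQuery


def _delete_old_dataset_queries(retention_days: int, batch_size: int) -> int:
    """Delete dataset_queries rows older than the retention window, batch by batch."""
    cutoff = datetime.utcnow() - timedelta(days=retention_days)
    total_deleted = 0
    while True:
        # Fetch a limited batch of ids first so each DELETE stays small
        # and can use the new created_at index.
        rows = (
            db.session.query(DatasetQuery.id)
            .filter(DatasetQuery.created_at < cutoff)
            .limit(batch_size)
            .all()
        )
        if not rows:
            break
        db.session.query(DatasetQuery).filter(
            DatasetQuery.id.in_([r.id for r in rows])
        ).delete(synchronize_session=False)
        db.session.commit()
        total_deleted += len(rows)
    return total_deleted
```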
Hi, when the Redis lock is already held, the cleanup task prints a 'skipped' message but then re-raises the LockError, so routine lock contention gets reported as a task failure.
Severity: remediation recommended | Category: reliability
How to fix: Return on lock contention instead of re-raising.
We noticed a couple of other issues in this PR as well - happy to share if helpful.
Found by Qodo code review
…task
Re-raising LockError after printing a skip message caused false task failures for normal lock contention. Return instead to exit cleanly.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
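In other words, the lock-contention path now exits cleanly instead of propagating the error. A rough sketch of the resulting pattern (the lock key and log wording are assumptions, not the exact code):

```python
import logging

from configs import dify_config
from extensions.ext_redis import redis_client

logger = logging.getLogger(__name__)


def clean_dataset_queries_task():
    # The lock key name here is an assumption for illustration.
    lock = redis_client.lock(
        "clean_dataset_queries_task_lock",
        timeout=dify_config.CLEAN_DATASET_QUERIES_LOCK_TTL,
    )
    if not lock.acquire(blocking=False):
        # Another worker already holds the lock: log and return instead of
        # re-raising, so routine contention is not reported as a task failure.
        logger.info("clean_dataset_queries_task is already running, skipping this run")
        return
    try:
        # ... batched deletion of old dataset_queries rows runs here ...
        pass
    finally:
        lock.release()
```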
Thanks for the review! Fixed the LockError re-raise. Regarding "We noticed a couple of other issues" — could you please share those as well? Happy to fix them in this PR.
Summary
Fixes #35733
The `dataset_queries` table grows without bound because every RAG retrieval and hit-testing operation inserts a row. There is currently no periodic cleanup task for this table — the only deletions happen when an entire dataset is removed via `clean_dataset_task`. This PR adds a configurable Celery Beat task (`clean_dataset_queries_task`) that deletes rows older than a retention period in batches.
Changes
- `api/schedule/clean_dataset_queries_task.py` — New task: Redis lock + batch deletion with automatic retention clamping and a warning log when the configured retention falls below the safe threshold
- `api/configs/feature/__init__.py` — 4 new config fields in `CeleryScheduleTasksConfig` (see the sketch after this list):
  - `ENABLE_CLEAN_DATASET_QUERIES_TASK` (default: `False`)
  - `CLEAN_DATASET_QUERIES_RETENTION_DAYS` (default: `60`)
  - `CLEAN_DATASET_QUERIES_BATCH_SIZE` (default: `500`)
  - `CLEAN_DATASET_QUERIES_LOCK_TTL` (default: `3600`)
- `api/extensions/ext_celery.py` — Register in `beat_schedule` at hour=5 to avoid collision with existing tasks at 0/2/3/4
- `api/models/dataset.py` — Add a `created_at` index declaration to `DatasetQuery.__table_args__`
- `api/migrations/versions/2026_04_30_1600-67b5709d7d0a_add_dataset_queries_created_at_idx.py` — Alembic migration for the new index
- `api/tests/unit_tests/schedule/test_clean_dataset_queries_task.py` — 4 unit tests
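As a rough illustration of the configuration and beat registration (field descriptions, the crontab expression, the task path, and how the beat dict is wired in are assumptions, not the PR's exact code):

```python
# api/configs/feature/__init__.py (sketch)
from pydantic import Field
from pydantic_settings import BaseSettings


class CeleryScheduleTasksConfig(BaseSettings):
    # ... existing schedule-task settings ...

    ENABLE_CLEAN_DATASET_QUERIES_TASK: bool = Field(
        description="Enable scheduled cleanup of old dataset_queries rows",
        default=False,
    )
    CLEAN_DATASET_QUERIES_RETENTION_DAYS: int = Field(
        description="Delete dataset_queries rows older than this many days",
        default=60,
    )
    CLEAN_DATASET_QUERIES_BATCH_SIZE: int = Field(
        description="Rows deleted per batch",
        default=500,
    )
    CLEAN_DATASET_QUERIES_LOCK_TTL: int = Field(
        description="Redis lock TTL in seconds guarding concurrent runs",
        default=3600,
    )


# api/extensions/ext_celery.py (sketch): guard the beat entry behind the enable flag
from celery.schedules import crontab

from configs import dify_config

beat_schedule = {}  # placeholder for the dict ext_celery hands to the Celery app config
if dify_config.ENABLE_CLEAN_DATASET_QUERIES_TASK:
    beat_schedule["clean_dataset_queries_task"] = {
        "task": "schedule.clean_dataset_queries_task.clean_dataset_queries_task",
        # hour=5 avoids the 0/2/3/4 slots used by the existing scheduled tasks
        "schedule": crontab(minute="0", hour="5"),
    }
```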
Key constraint
`clean_unused_datasets_task` reads `DatasetQuery.created_at` to determine whether a dataset has been queried recently (threshold = `PLAN_SANDBOX_CLEAN_DAY_SETTING`, default 30 days). The new task uses a default retention of 60 days (> 30). If a user manually sets retention below 30, the task clamps it up to the threshold and logs a warning.
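A sketch of that clamping, assuming both values are exposed on `dify_config` (the helper name and log wording are illustrative assumptions):

```python
import logging

from configs import dify_config

logger = logging.getLogger(__name__)


def _effective_retention_days() -> int:
    """Clamp the configured retention so it never undercuts clean_unused_datasets_task."""
    configured = dify_config.CLEAN_DATASET_QUERIES_RETENTION_DAYS
    floor = dify_config.PLAN_SANDBOX_CLEAN_DAY_SETTING  # threshold used by clean_unused_datasets_task
    if configured < floor:
        logger.warning(
            "CLEAN_DATASET_QUERIES_RETENTION_DAYS=%s is below PLAN_SANDBOX_CLEAN_DAY_SETTING=%s; using %s",
            configured,
            floor,
            floor,
        )
        return floor
    return configured
```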
Test plan
- `make lint` passes
- `basedpyright api/schedule/clean_dataset_queries_task.py` — 0 errors
- `ENABLE_CLEAN_DATASET_QUERIES_TASK=true`, start celery beat, confirm the task is scheduled and deletes in batches

🤖 Generated with Claude Code