
Add an API to list changes to quarantine state of media#19558

Open
turt2live wants to merge 41 commits into develop from travis/list-quarantined-media-mk2

Conversation

@turt2live
Member

@turt2live turt2live commented Mar 14, 2026

Fixes #19352

(See issue for history of this feature and previous PRs)

First, a naive implementation of the endpoint was introduced, but it quickly ran into query performance issues and long startup times, leading to its removal. It also didn't actually work: it failed to expose media when it was "unquarantined". A partial fix was attempted, and the suggested direction coming out of that is to use a stream instead of a timestamp column.

This PR re-introduces the API building on the previous feedback:

  • Adds a stream which tracks when media becomes (un)quarantined.
  • Runs a background update to capture already-quarantined media.
  • Adds a new admin API to return rows from the stream table.

We track both quarantine and unquarantine actions in the stream to allow downstream consumers to process the records appropriately. Namely, to allow our Synapse exchange in HMA to remove hashes for unquarantined media.

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

@turt2live turt2live force-pushed the travis/list-quarantined-media-mk2 branch from e76dc83 to 6328e78 March 16, 2026 19:27
@turt2live turt2live force-pushed the travis/list-quarantined-media-mk2 branch from 258768e to f1a35fa March 16, 2026 19:32
@turt2live turt2live marked this pull request as ready for review March 16, 2026 19:34
@turt2live turt2live requested a review from a team as a code owner March 16, 2026 19:34
turt2live added a commit that referenced this pull request Mar 18, 2026
Just something I noticed while working on
#19558

We start the function by setting `total_media_quarantined` to zero, then
do the work for `media_ids` and add the number of affected rows, zero it
out again (**bug**), do the work for `hashes` and add the number of
affected rows, then return `total_media_quarantined`.

async def on_GET(self, request: SynapseRequest) -> tuple[int, JsonDict]:
await assert_requester_is_admin(self.auth, request)

from_id = parse_integer(request, "from", default=0)
Contributor

It looks like you can paginate from the beginning of time which may be desirable.

Someone might only care about catching up on the stream from now, looking forwards. With the current setup, they would have to paginate through years of history or guess stream IDs, which are meant to be opaque (not guessed).

Since this is an admin thing, they could manually look up the current value in the database table 🤷 Perhaps this only requires a little SQL snippet in our documentation on how to do that lookup.
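Such a documentation snippet might look like the following sketch. The table and column names (`quarantined_media_changes`, `stream_id`) are assumptions based on this PR's discussion, and SQLite stands in for Postgres:

```python
import sqlite3

# Illustrative only: the table/column names are assumptions from this PR.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quarantined_media_changes "
    "(stream_id INTEGER PRIMARY KEY, media_id TEXT)"
)
conn.executemany(
    "INSERT INTO quarantined_media_changes VALUES (?, ?)",
    [(1, "abc"), (2, "def"), (3, "ghi")],
)

# The lookup an admin would run to start consuming "from now".
# COALESCE handles the empty-table case by starting at 0.
(current_position,) = conn.execute(
    "SELECT COALESCE(MAX(stream_id), 0) FROM quarantined_media_changes"
).fetchone()
```

The admin would then pass `current_position` as the initial `from` value and only see changes made after that point.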

Member Author

I would normally agree, though unfortunately this PR is tied to relatively limited availability to expand the scope. A future PR can expand the capabilities of this API as needed.

ORDER BY media_origin, media_id
LIMIT ? OFFSET ?
""",
(batch_size, last_row_num),
Contributor

I fear using OFFSET will be extremely slow once this table grows to a large number of rows.

I don't have experience using OFFSET in Postgres, but I have experienced problems with it in MongoDB, for example.

My first instinct would be to use media_id if order doesn't matter.

Member Author

We use OFFSET in other background updates, so I'm not worried.

Contributor

I also think some of those other background updates may be suspect (bad performance). I will need more opinions or knowledge here.

Member Author

fwiw, we (the consumers of this API) aren't concerned with it being slow to populate. It can take months if it needs to. If it's so bad performance-wise that it'll likely cause a production outage though, that's a different story.

Contributor

@MadLittleMods MadLittleMods Mar 31, 2026

I think my suspicions with OFFSET are applicable, https://use-the-index-luke.com/no-offset

These queries will slow to a crawl once you're paginating thousands or millions of rows deep, and cause undue load on the database.

Member Author

I'll need some direction on what changes are required here. I'm fine with it being slow as a one-time operation, but I also don't want to wake anyone up at 2am for it.

Contributor

Potential solution mentioned in the first reply (using media_id) and explained in https://use-the-index-luke.com/no-offset as well. Example: `WHERE media_id > ? ORDER BY media_id`, recording the last media_id in the progress.
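A runnable sketch of that keyset approach (SQLite stands in for Postgres here, the table shape is simplified to the two relevant columns, and the Synapse background-update plumbing is omitted):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE local_media_repository "
    "(media_id TEXT PRIMARY KEY, quarantined_by TEXT)"
)
# Quarantine every other row for demonstration purposes.
conn.executemany(
    "INSERT INTO local_media_repository VALUES (?, ?)",
    [
        (f"media_{i:04d}", "@admin:example.org" if i % 2 else None)
        for i in range(10)
    ],
)

def paginate_quarantined(conn, batch_size):
    """Yield quarantined media ids batch by batch without OFFSET.

    Instead of `LIMIT ? OFFSET ?` (which rescans all skipped rows on every
    batch), we remember the last media_id seen and seek past it with an
    indexed range scan: `WHERE media_id > ?`.
    """
    last_media_id = ""  # the progress value a background update would persist
    while True:
        rows = conn.execute(
            """
            SELECT media_id FROM local_media_repository
            WHERE quarantined_by IS NOT NULL AND media_id > ?
            ORDER BY media_id
            LIMIT ?
            """,
            (last_media_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield [r[0] for r in rows]
        last_media_id = rows[-1][0]
```

Each batch does the same amount of work regardless of how deep into the table it is, which is the property OFFSET lacks.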

Member Author

To clarify: is removing OFFSET a requirement for getting this PR merged?

Comment on lines +191 to +193
async def _flag_existing_quarantined_media(
self, progress: JsonDict, batch_size: int
) -> int:
Contributor

Why do we care about filling in historical data for quarantined_media_changes?

Contributor

@MadLittleMods MadLittleMods Mar 31, 2026

Resolved without reply

Member Author

I addressed this in the diff below this thread. We want historical data because we need it for the use case.

Contributor

This is the closest thing I can find:

# Register a background update to flag already-quarantined media in the quarantine
# media changes table. This is to populate the API endpoint which consumes the
# table with initial data that callers expect (namely, a list of currently
# quarantined media).

Which doesn't explain why we care. Please expand on the use case

Member Author

The why is because the callers expect to receive data. I'm not sure how much clearer I can be - please suggest precise wording you'd like to see.

FROM remote_media_cache
WHERE quarantined_by IS NOT NULL

ORDER BY media_origin, media_id
Contributor

There isn't a good index for ORDER BY media_origin, media_id:

$ psql --username=postgres synapse
synapse=# \d+ local_media_repository
                                         Table "public.local_media_repository"
        Column        |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
----------------------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 media_id             | text    |           |          |         | extended |             |              |
 media_type           | text    |           |          |         | extended |             |              |
 media_length         | integer |           |          |         | plain    |             |              |
 created_ts           | bigint  |           |          |         | plain    |             |              |
 upload_name          | text    |           |          |         | extended |             |              |
 user_id              | text    |           |          |         | extended |             |              |
 quarantined_by       | text    |           |          |         | extended |             |              |
 url_cache            | text    |           |          |         | extended |             |              |
 last_access_ts       | bigint  |           |          |         | plain    |             |              |
 safe_from_quarantine | boolean |           | not null | false   | plain    |             |              |
 authenticated        | boolean |           | not null | false   | plain    |             |              |
 sha256               | text    |           |          |         | extended |             |              |
 quarantined_ts       | bigint  |           |          |         | plain    |             |              |
Indexes:
    "local_media_repository_media_id_key" UNIQUE CONSTRAINT, btree (media_id)
    "local_media_repository_sha256" btree (sha256) WHERE sha256 IS NOT NULL
    "local_media_repository_url_idx" btree (created_ts) WHERE url_cache IS NOT NULL
    "users_have_local_media" btree (user_id, created_ts)
Not-null constraints:
    "local_media_repository_safe_from_quarantine_not_null" NOT NULL "safe_from_quarantine"
    "local_media_repository_authenticated_not_null" NOT NULL "authenticated"

Member Author

indeed. Is this something I'm expected to address?

Contributor

Yes, although I think we can switch to something better -> #19558 (comment)

Member Author

Ordering by media_id does not change the query plan. It'll still do a sequential scan of both tables because of WHERE quarantined_by IS NOT NULL. Adding an index to that column has historically caused different problems (making media unavailable on matrix.org for long enough that Backend killed the query and backed out the migration).

(query plan not disclosed publicly because it shows private information about quarantined media on matrix.org)
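For reference, the index in question would be a partial index along these lines. The sketch below (against SQLite, with hypothetical names) shows the planner picking it up for the `IS NOT NULL` predicate, though as noted above, building such an index on a large production table is exactly the risky part:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE local_media_repository "
    "(media_id TEXT PRIMARY KEY, quarantined_by TEXT)"
)
conn.executemany(
    "INSERT INTO local_media_repository VALUES (?, ?)",
    [("a", None), ("b", "@admin:example.org"), ("c", None)],
)

# A partial index covering only the quarantined rows: it stays small, and
# the planner can answer `quarantined_by IS NOT NULL` from it directly.
conn.execute(
    "CREATE INDEX quarantined_media_idx ON local_media_repository (media_id) "
    "WHERE quarantined_by IS NOT NULL"
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT media_id FROM local_media_repository "
    "WHERE quarantined_by IS NOT NULL"
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
```

In Postgres the equivalent would be built with `CREATE INDEX CONCURRENTLY` to avoid blocking writes during the build, which is the failure mode described above.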

Contributor

Do we even need to do this? -> #19558 (comment)

(query plan not disclosed publicly because it shows private information about quarantined media on matrix.org)

Just need to sanitize a few media_ids? What's the real problem?

Member Author

The query plan doesn't show media IDs, it shows the count of quarantined media.

Contributor

Cross-link for discussing sensitivity of quarantined media count: #19558 (comment)

Member Author

@turt2live turt2live Mar 31, 2026

For clarity: I'm considering this the same as #19558 (comment) - we (the consumers) aren't concerned with speed here, but if there's likely to be a production incident then we'll fix it here.

Comment on lines +205 to +213
SELECT NULL AS media_origin, media_id
FROM local_media_repository
WHERE quarantined_by IS NOT NULL

UNION

SELECT media_origin, media_id
FROM remote_media_cache
WHERE quarantined_by IS NOT NULL
Contributor

We might as well handle these separately? Does order matter? I assume not.

Member Author

If there's benefit to doing it separately, sure. I don't believe there is. Ordering can't be applied to the individual queries in a UNION.

Contributor

Easier to reason about. Smaller transactions.

This might happen naturally if we start paginating by media_id in each table anyway.
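A sketch of what per-table pagination could look like as a resumable two-phase background update. The names and the progress-dict shape are hypothetical, and `fetch_batch` stands in for the keyset-paginated SELECT against one table:

```python
# Hypothetical sketch of running the backfill as two sequential phases
# (local table first, then remote) with a resumable progress dict, in the
# spirit of Synapse background updates. Names are illustrative, not the
# PR's actual code.

TABLES = ["local_media_repository", "remote_media_cache"]

def run_batch(progress, batch_size, fetch_batch):
    """Process one batch; returns (rows, new_progress, done).

    `fetch_batch(table, last_media_id, batch_size)` stands in for a
    keyset-paginated SELECT (`WHERE media_id > ? ORDER BY media_id`)
    against a single table.
    """
    table = progress.get("table", TABLES[0])
    last_media_id = progress.get("last_media_id", "")
    rows = fetch_batch(table, last_media_id, batch_size)
    if rows:
        return rows, {"table": table, "last_media_id": rows[-1]}, False
    # This table is exhausted; move on to the next one, or finish.
    next_idx = TABLES.index(table) + 1
    if next_idx < len(TABLES):
        return [], {"table": TABLES[next_idx], "last_media_id": ""}, False
    return [], progress, True
```

Each batch touches only one table, which keeps transactions small, and the progress dict survives restarts the same way existing background updates do.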

Member Author

How much does this thread block the PR from merging?

@turt2live
Member Author

@MadLittleMods this should be ready for review. Apologies if it wasn't clarified earlier, but this PR is tied to a strict timebox on my end - if major changes are needed, we'll have to coordinate our teams to make that happen.

I've resolved threads which I believe are resolved. Unresolved threads appear to need your reply. Due to the number of comments on this PR, I won't see comments in resolved threads - please open new threads rather than un-resolving them if they need my attention.

# return the `from` value.
next_batch = changes[-1].stream_id if len(changes) > 0 else from_id

return HTTPStatus.OK, {"next_batch": next_batch, "rows": rows}
Contributor

Potential better name

Suggested change
return HTTPStatus.OK, {"next_batch": next_batch, "rows": rows}
return HTTPStatus.OK, {"next_batch": next_batch, "changes": serialized_changes}

Member Author

It's a non-trivial amount of work to rename the JSON field that's returned - how important is it to call it something other than rows?

# `from` is exclusive, so don't +1 this. We also know the last record will have
# the highest stream ID, so use that one. If there aren't any records, just
# return the `from` value.
next_batch = changes[-1].stream_id if len(changes) > 0 else from_id
Contributor

Suggested change
next_batch = changes[-1].stream_id if len(changes) > 0 else from_id
next_batch = changes[-1].stream_id if len(changes) > 0 else to_id

(comment also needs updating)

Previous discussion, #19558 (comment)

Member Author

This suggestion has the potential to introduce a bug where records could end up being lost. When there are no changes, from_id is a more reliable stream position than whatever to_id is.

Contributor

to_id will be the current position of the worker.

If we didn't find any rows between from_id and to_id, then it's safe to return to_id

Member Author

If from_id is in the future though, then we'd be going backwards with to_id. That is undesirable.

On a scale of 1 to 10, where 10 is "this cannot merge without this change", where does this thread sit?
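To make the disagreement concrete, here is a minimal model of the two behaviours (pure illustration, not Synapse code; `changes` is modelled as a list of stream IDs):

```python
# Minimal model of the two proposed `next_batch` behaviours in this thread.

def next_batch_keep_from(changes, from_id):
    """PR behaviour: stand still at `from` when no rows were returned."""
    return changes[-1] if changes else from_id

def next_batch_advance_to(changes, from_id, to_id):
    """Suggested behaviour: jump ahead to the worker's current position."""
    return changes[-1] if changes else to_id

# The author's concern: if a caller passes a `from` that is somehow ahead
# of the worker's current position, advancing to `to_id` moves the caller
# backwards, so a later call would re-deliver already-seen rows.
```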

Comment on lines +831 to +832
# We expect to continue from `from` because we have no rows
self.assertEqual(0, channel.json_body["next_batch"])
Contributor

Language needs adjusting once code is updated to use to_id (see other discussion)

@turt2live turt2live requested a review from MadLittleMods March 31, 2026 19:23
@MadLittleMods MadLittleMods removed their request for review March 31, 2026 21:19
@turt2live turt2live requested a review from MadLittleMods April 2, 2026 00:29


Development

Successfully merging this pull request may close these issues.

Admin API to list quarantined media

2 participants