Add an API to list changes to quarantine state of media #19558
Conversation
Sticky events were used as a reference when creating this.
Just something I noticed while working on #19558: we start the function by setting `total_media_quarantined` to zero, then do work on the `media_ids` and add the number of affected rows, zero it out (**bug**), do work on `hashes` and add the number of affected rows, then return `total_media_quarantined`.

### Pull Request Checklist

* [x] Pull request is based on the develop branch
* [x] Pull request includes a [changelog file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog). The entry should:
  - Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from `EventStore` to `EventWorkerStore`.".
  - Use markdown where necessary, mostly for `code blocks`.
  - End with either a period (.) or an exclamation mark (!).
  - Start with a capital letter.
  - Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
* [x] [Code style](https://element-hq.github.io/synapse/latest/code_style.html) is correct (run the [linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters))
```python
async def on_GET(self, request: SynapseRequest) -> tuple[int, JsonDict]:
    await assert_requester_is_admin(self.auth, request)

    from_id = parse_integer(request, "from", default=0)
```
It looks like you can paginate from the beginning of time, which may be desirable.
Someone might only care about catching up on the stream from now, looking forwards. With the current setup, they would have to paginate through years of history or guess stream IDs, which are meant to be opaque (not guessed).
Since this is an admin thing, they could manually look up the current value in the database table 🤷 Perhaps this only requires a little SQL snippet in our documentation showing how to do that lookup.
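As a sketch of what such a documentation snippet could look like (the table and column names here are assumptions based on this PR's schema, not confirmed):

```sql
-- Hypothetical lookup: find the current head of the quarantine-change stream
-- so an admin can start paginating "from now" instead of from zero.
-- Table/column names are assumed from this PR and may differ.
SELECT COALESCE(MAX(stream_id), 0) AS current_stream_id
FROM quarantined_media_changes;
```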
I would normally agree, though unfortunately this PR is tied to a relatively limited availability window, which restricts expanding the scope. A future PR can expand the capabilities of this API as needed.
```python
    ORDER BY media_origin, media_id
    LIMIT ? OFFSET ?
    """,
    (batch_size, last_row_num),
```
I fear that using OFFSET will be extremely slow once this table grows to a large number of rows.
I don't have experience using OFFSET in Postgres, but I have run into problems with MongoDB, for example.
My first instinct would be to paginate by media_id, if order doesn't matter.
We use OFFSET in other background updates, so I'm not worried.
I also think some of those other background updates may be suspect (bad performance). I'll need more opinions or knowledge here.
fwiw, we (the consumers of this API) aren't concerned with it being slow to populate. It can take months if it needs to. If it's so bad performance-wise that it would likely cause a production outage, though, that's a different story.
I think my suspicions about OFFSET are applicable: https://use-the-index-luke.com/no-offset
These queries will slow to a crawl once you start paginating thousands/millions of rows out, and they'll cause undue load on the database.
I'll need some direction on what changes are required here. I'm fine with it being slow as a one-time operation, but I also don't want to wake anyone up at 2am for it.
A potential solution was mentioned in the first reply (paginating by media_id) and is explained in https://use-the-index-luke.com/no-offset as well. Example: `WHERE media_id > ? ORDER BY media_id`, recording the last media_id in the progress.
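To make the keyset approach concrete, here is a minimal sketch using an in-memory SQLite table as a stand-in for `local_media_repository`. Instead of OFFSET, each batch resumes from the last `media_id` seen (which the real background update would record as its progress). The helper name and data are illustrative, not the PR's actual code.

```python
import sqlite3

# Stand-in table: 10 media rows, odd-numbered ones quarantined.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE local_media_repository (media_id TEXT PRIMARY KEY, quarantined_by TEXT)"
)
conn.executemany(
    "INSERT INTO local_media_repository VALUES (?, ?)",
    [(f"media_{i:03d}", "@admin:example.com" if i % 2 else None) for i in range(10)],
)

def fetch_batch(last_media_id: str, batch_size: int) -> list[str]:
    # Keyset pagination: seek past the last key seen rather than skipping rows.
    rows = conn.execute(
        """
        SELECT media_id FROM local_media_repository
        WHERE quarantined_by IS NOT NULL AND media_id > ?
        ORDER BY media_id
        LIMIT ?
        """,
        (last_media_id, batch_size),
    ).fetchall()
    return [r[0] for r in rows]

seen = []
progress = ""  # empty string sorts before every media_id
while batch := fetch_batch(progress, 2):
    seen.extend(batch)
    progress = batch[-1]  # record the keyset cursor, not a row offset

print(seen)  # ['media_001', 'media_003', 'media_005', 'media_007', 'media_009']
```

Each query touches only `batch_size` rows past the cursor, so the cost per batch stays flat as the table grows, unlike OFFSET which re-scans all skipped rows.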
To clarify: is removing OFFSET a requirement for getting this PR merged?
```python
async def _flag_existing_quarantined_media(
    self, progress: JsonDict, batch_size: int
) -> int:
```
Why do we care about filling in historical data for `quarantined_media_changes`?
Resolved without reply
I addressed this in the diff below this thread. We want historical data because we need it for the use case.
This is the closest thing I can find:
`synapse/synapse/storage/databases/main/room.py`, lines 193 to 196 at b5eafbc
Which doesn't explain why we care. Please expand on the use case.
The "why" is that the callers expect to receive this data. I'm not sure how much clearer I can be - please suggest the precise wording you'd like to see.
```sql
FROM remote_media_cache
WHERE quarantined_by IS NOT NULL
ORDER BY media_origin, media_id
```
There isn't a good index to support `ORDER BY media_origin, media_id`:
```
$ psql --username=postgres synapse
synapse=# \d+ local_media_repository
                                Table "public.local_media_repository"
        Column        |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
----------------------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 media_id             | text    |           |          |         | extended |             |              |
 media_type           | text    |           |          |         | extended |             |              |
 media_length         | integer |           |          |         | plain    |             |              |
 created_ts           | bigint  |           |          |         | plain    |             |              |
 upload_name          | text    |           |          |         | extended |             |              |
 user_id              | text    |           |          |         | extended |             |              |
 quarantined_by       | text    |           |          |         | extended |             |              |
 url_cache            | text    |           |          |         | extended |             |              |
 last_access_ts       | bigint  |           |          |         | plain    |             |              |
 safe_from_quarantine | boolean |           | not null | false   | plain    |             |              |
 authenticated        | boolean |           | not null | false   | plain    |             |              |
 sha256               | text    |           |          |         | extended |             |              |
 quarantined_ts       | bigint  |           |          |         | plain    |             |              |
Indexes:
    "local_media_repository_media_id_key" UNIQUE CONSTRAINT, btree (media_id)
    "local_media_repository_sha256" btree (sha256) WHERE sha256 IS NOT NULL
    "local_media_repository_url_idx" btree (created_ts) WHERE url_cache IS NOT NULL
    "users_have_local_media" btree (user_id, created_ts)
Not-null constraints:
    "local_media_repository_safe_from_quarantine_not_null" NOT NULL "safe_from_quarantine"
    "local_media_repository_authenticated_not_null" NOT NULL "authenticated"
```
Indeed. Is this something I'm expected to address?
Yes, although I think we can switch to something better -> #19558 (comment)
Ordering by media_id does not change the query plan. It will still sequentially scan both tables because of `WHERE quarantined_by IS NOT NULL`. Adding an index to that column has historically caused different problems (making media unavailable on matrix.org for long enough that Backend killed the query and backed out the migration).
(query plan not disclosed publicly because it shows private information about quarantined media on matrix.org)
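For anyone wanting to check the behaviour against their own database, a standard EXPLAIN against the backfill's shape of query shows whether a sequential scan is chosen (this is an illustrative sketch; the PR's actual query may differ):

```sql
-- Illustrative only: inspect the plan for the backfill-style query locally.
EXPLAIN (ANALYZE, BUFFERS)
SELECT media_id
FROM local_media_repository
WHERE quarantined_by IS NOT NULL
ORDER BY media_id;
```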
Do we even need to do this? -> #19558 (comment)
> (query plan not disclosed publicly because it shows private information about quarantined media on matrix.org)

Just need to sanitize a few media IDs? What's the real problem?
The query plan doesn't show media IDs, it shows the count of quarantined media.
Cross-link for discussing sensitivity of quarantined media count: #19558 (comment)
For clarity: I'm treating this the same as #19558 (comment) - we, the consumers, aren't concerned with speed here, but if a production incident is likely then we'll fix it here.
```sql
SELECT NULL AS media_origin, media_id
FROM local_media_repository
WHERE quarantined_by IS NOT NULL

UNION

SELECT media_origin, media_id
FROM remote_media_cache
WHERE quarantined_by IS NOT NULL
```
We might as well handle these separately? Does order matter? I assume not.
If there's benefit to doing it separately, sure. I don't believe there is. Ordering can't be done on the queries in a union.
Easier to reason about. Smaller transactions.
This might happen anyway if we start paginating by media_id in each table.
How much does this thread block the PR from merging?
@MadLittleMods this should be ready for review. Apologies if it wasn't clarified earlier, but this PR is tied to a strict timebox on my end - if major changes are needed, we'll have to coordinate between our teams to make that happen. I've resolved the threads I believe are resolved; the unresolved threads appear to need your reply. Due to the number of comments on this PR, I won't see comments in resolved threads - please open new threads rather than un-resolving old ones if something needs my attention.
```python
# return the `from` value.
next_batch = changes[-1].stream_id if len(changes) > 0 else from_id

return HTTPStatus.OK, {"next_batch": next_batch, "rows": rows}
```
Potential better name:
```diff
-return HTTPStatus.OK, {"next_batch": next_batch, "rows": rows}
+return HTTPStatus.OK, {"next_batch": next_batch, "changes": serialized_changes}
```
It's a non-trivial amount of work to rename the JSON field that's returned - how important is it to call it something other than `rows`?
```python
# `from` is exclusive, so don't +1 this. We also know the last record will have
# the highest stream ID, so use that one. If there aren't any records, just
# return the `from` value.
next_batch = changes[-1].stream_id if len(changes) > 0 else from_id
```
```diff
-next_batch = changes[-1].stream_id if len(changes) > 0 else from_id
+next_batch = changes[-1].stream_id if len(changes) > 0 else to_id
```
(comment also needs updating)
Previous discussion, #19558 (comment)
This suggestion has the potential to introduce a bug where records could end up being lost. When there are no changes, `from_id` is a more reliable stream position than whatever `to_id` happens to be.
`to_id` will be the current position of the worker.
If we didn't find any rows between `from_id` and `to_id`, then it's safe to return `to_id`.
If `from_id` is in the future though, then returning `to_id` would move the cursor backwards. That is undesirable.
On a scale of 1 to 10, where 10 is "this cannot merge without this change", where does this thread sit?
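To make the two options in this thread concrete, here is a toy sketch of the token logic (the function name and shapes are illustrative, not the PR's actual code). `changes` stands in for the page of stream IDs found in the window between `from_id` and `to_id`:

```python
def next_batch_token(changes: list[int], from_id: int, to_id: int) -> int:
    """Pick the pagination token to hand back to the caller."""
    if changes:
        # The last record has the highest stream ID, so resume after it.
        return changes[-1]
    # Reviewer's suggestion: an empty page means the whole (from_id, to_id]
    # window was scanned, so it's safe to skip ahead to to_id. The author's
    # version returns from_id instead, which never moves the cursor past data
    # but re-scans the same empty window on the next call.
    return to_id

print(next_batch_token([3, 7, 9], from_id=0, to_id=10))  # 9
print(next_batch_token([], from_id=0, to_id=10))         # 10
```

The trade-off is exactly the one debated above: skipping to `to_id` avoids re-scanning empty windows, while `from_id` is the conservative choice if the two values could ever be out of order.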
```python
# We expect to continue from `from` because we have no rows
self.assertEqual(0, channel.json_body["next_batch"])
```
Language needs adjusting once the code is updated to use `to_id` (see the other discussion).
Fixes #19352
(See issue for history of this feature and previous PRs)
This PR re-introduces the API, building on the previous feedback:
We track both quarantine and unquarantine actions in the stream so that downstream consumers can process the records appropriately - namely, so that our Synapse exchange in HMA can remove hashes for unquarantined media.