T&S is building tooling that matches media against "known unsafe" media, and would like to regularly import quarantined media as one of several data sets in that tool. The rough script pattern would be to query for newly quarantined media, download each item, then process the file and upload either the file itself or artifacts derived from it into the tool. An example of this can be seen at https://github.com/matrix-org/hma-matrix/blob/4f0b9676beb7b5d72b2d55ae7034609593f5fba3/matrix_exchanges/synapse_quarantined.py
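The script pattern described above could look roughly like the sketch below. It is only an illustration: `import_quarantined_media`, `list_page`, `download`, and `process_and_upload` are hypothetical names, and the token-based paging is an assumption, since no listing API exists yet. The callables are injected so the loop does not depend on any particular endpoint shape.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical page shape: (media IDs on this page, token for the next page or None).
ListPage = Tuple[Iterable[str], Optional[str]]


def import_quarantined_media(
    list_page: Callable[[Optional[str]], ListPage],
    download: Callable[[str], bytes],
    process_and_upload: Callable[[str, bytes], None],
    since: Optional[str] = None,
) -> Optional[str]:
    """Import every quarantined media item listed after `since`.

    Returns the last token seen so the next run can resume from it.
    """
    token = since
    while True:
        # Ask the (assumed) listing API for the next page of quarantined media.
        media_ids, next_token = list_page(token)
        for media_id in media_ids:
            content = download(media_id)
            # Hash, match, or upload the file or its artifacts into the tool.
            process_and_upload(media_id, content)
        if next_token is None:
            # No further pages: hand back the position reached so far.
            return token
        token = next_token
```

Persisting the returned token between runs would let a scheduled job pick up only newly quarantined media, while passing `since=None` on the first run would also cover media that was already quarantined.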
The prior art here is a bit lengthy, but it should be possible to implement this as a net-new feature. Currently, there is no API that T&S can use to list quarantined media. The new API will need to return already-quarantined media too, so that T&S's tool can import those media objects as well.
First, a naive implementation of the endpoint was introduced, but it quickly ran into slow queries and long startup times, leading to its removal. It also didn't work correctly: it failed to expose media when it was "unquarantined". A partial fix was then attempted, where the suggested direction is to use a stream instead of a timestamp column.
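To illustrate the stream idea, here is a toy in-memory sketch. It is not Synapse's stream implementation; `QuarantineStream`, `set_quarantined`, and `changes_since` are hypothetical names. The assumption is that every quarantine state change, including an unquarantine, is assigned the next position in a monotonically increasing sequence, so a client polling from its last position sees both kinds of change.

```python
from typing import Dict, List, Tuple


class QuarantineStream:
    """Toy model: each quarantine state change gets the next stream ID."""

    def __init__(self) -> None:
        self._next_id = 1
        # media_id -> (stream ID of the latest change, quarantined now?)
        self._latest: Dict[str, Tuple[int, bool]] = {}

    def set_quarantined(self, media_id: str, quarantined: bool) -> int:
        """Record a state change and return the stream ID assigned to it."""
        stream_id = self._next_id
        self._next_id += 1
        self._latest[media_id] = (stream_id, quarantined)
        return stream_id

    def changes_since(
        self, from_id: int, limit: int = 100
    ) -> Tuple[List[Tuple[str, bool]], int]:
        """Return up to `limit` changes after `from_id`, oldest first,
        plus the stream ID to resume from on the next call."""
        rows = sorted(
            (sid, mid, q)
            for mid, (sid, q) in self._latest.items()
            if sid > from_id
        )[:limit]
        next_id = rows[-1][0] if rows else from_id
        return [(mid, q) for _, mid, q in rows], next_id
```

In this model, polling from position 0 returns all currently known media states, which covers already-quarantined media, and a later unquarantine shows up as a new entry with `quarantined` set to `False`.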