---
title: Event notification schema discussion
summary: Event notifications schema discussion
date: 2025-06-29
jira: HDDS-13513
status: design
author: Colm Dougan, Donal Magennis
---
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

# Overview

This document outlines the schema requirements for event notifications
within Ozone. It also discusses whether two widely used event
notification schemas (S3 and HDFS) are suitable as a basis for the
transmission format for notifications within Ozone.

# General schema requirements

## File/Directory creation/modification

Event notifications should be raised to inform consumers of completed
operations that modify the filesystem, specifically the following requests:

### CreateRequest

We should emit some form of **create** event.

Required fields:
- path (volume + bucket + key)
- isfile

Nice-to-have fields:
- overwrite
- recursive

### CreateFileRequest

We should emit some form of **create** event.

Required fields:
- path (volume + bucket + key)
- isfile

Nice-to-have fields:
- overwrite
- recursive

### CreateDirectoryRequest

We should emit some form of **create** event.

Required fields:
- path (volume + bucket + key)
- isfile

### CommitKeyRequest

We should emit some form of **commit/close** event.

Required fields:
- path (volume + bucket + key)

Nice-to-have fields:
- data size
- hsync?

### DeleteKeyRequest

We should emit some form of **delete** event.

Required fields:
- path (volume + bucket + key)

Nice-to-have fields:
- recursive (if known)

### RenameKeyRequest

We should emit some form of **rename** event.

Required fields:
- fromPath (volume + bucket + key)
- toPath (volume + bucket + toKeyName)

Nice-to-have fields:
- recursive (if known)
- is directory (if known)

NOTE: in the case of an FSO directory rename there is a dilemma
(discussed later in this document). Should we emit a single event for
the directory rename, specifying only the old/new directory names? Or
should we emit granular events for all the child objects impacted by
the rename?
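
The sketch below shows roughly how these per-request field lists could be enforced when an event is built. It encodes them as required and nice-to-have field sets and validates a payload against them. The field spellings (`isfile`, `dataSize`, `hsync`, `isDirectory`, etc.) and the `build_event` helper are hypothetical. They are not existing Ozone APIs.

```python
from typing import Any, Dict, FrozenSet, Tuple

# Hypothetical field names, transcribed from the requirement lists above.
# Each request type maps to (event kind, required fields, nice-to-have fields).
EVENT_FIELDS: Dict[str, Tuple[str, FrozenSet[str], FrozenSet[str]]] = {
    "CreateRequest": ("create", frozenset({"path", "isfile"}),
                      frozenset({"overwrite", "recursive"})),
    "CreateFileRequest": ("create", frozenset({"path", "isfile"}),
                          frozenset({"overwrite", "recursive"})),
    "CreateDirectoryRequest": ("create", frozenset({"path", "isfile"}),
                               frozenset()),
    "CommitKeyRequest": ("commit", frozenset({"path"}),
                         frozenset({"dataSize", "hsync"})),
    "DeleteKeyRequest": ("delete", frozenset({"path"}),
                         frozenset({"recursive"})),
    "RenameKeyRequest": ("rename", frozenset({"fromPath", "toPath"}),
                         frozenset({"recursive", "isDirectory"})),
}


def build_event(request_type: str, **fields: Any) -> Dict[str, Any]:
    """Build an event dict, rejecting missing required or unknown fields."""
    kind, required, optional = EVENT_FIELDS[request_type]
    provided = set(fields)
    missing = required - provided
    if missing:
        raise ValueError(f"{request_type}: missing required {sorted(missing)}")
    unknown = provided - required - optional
    if unknown:
        raise ValueError(f"{request_type}: unknown fields {sorted(unknown)}")
    return {"event": kind, **fields}
```

For example, `build_event("CommitKeyRequest", path="/vol1/bucket1/key1", dataSize=1024)` returns `{"event": "commit", "path": "/vol1/bucket1/key1", "dataSize": 1024}`. Omitting `toPath` from a `RenameKeyRequest` raises `ValueError`.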

## ACLs

Event notifications should be raised to inform consumers that ACL changes
have happened. The relevant requests are:

* AddAclRequest
* SetAclRequest
* RemoveAclRequest

The fields provided could vary depending on implementation complexity.

At a minimum, consumers need to be informed that "some ACL update
happened" to a certain key (or prefix).

Ideally the notification would include the full context of the change made
by the request, perhaps by mirroring the full request details as a JSON
sub-object, e.g. (other event fields omitted):

```json
{
    "acls": [
        {
            "type": "GROUP",
            "name": "mygroup",
            "rights": "\u0000\u0001",
            "aclScope": "ACCESS"
        }
    ]
}
```

The precise details would need to be revisited with guidance from the
community. This example is only intended to set broad-brush expectations.
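
Below is a minimal sketch of an ACL notification that carries the minimal information, namely which key or prefix was touched and by which request. It can optionally attach the mirrored request details in the JSON shape shown above. The `acl_event` helper and the `eventName`/`path` field names are hypothetical. The sketch also assumes `rights` arrives as raw bytes and maps each byte to the same code point, so `json.dumps` renders it as `\u0000\u0001`.

```python
import json
from typing import Any, Dict, List, Optional


def acl_event(request_type: str, path: str,
              acls: Optional[List[Dict[str, Any]]] = None) -> str:
    """Serialize a hypothetical ACL-change notification as JSON.

    Only request_type and path are mandatory ("some ACL update happened");
    the mirrored ACL entries are attached when available.
    """
    if request_type not in ("AddAclRequest", "SetAclRequest", "RemoveAclRequest"):
        raise ValueError(f"not an ACL request: {request_type}")
    event: Dict[str, Any] = {"eventName": request_type, "path": path}
    if acls is not None:
        event["acls"] = [
            {
                "type": acl["type"],
                "name": acl["name"],
                # Assumption: raw rights bytes map 1:1 to code points, which
                # json.dumps then escapes as \u0000, \u0001, ...
                "rights": acl["rights"].decode("latin-1"),
                "aclScope": acl["aclScope"],
            }
            for acl in acls
        ]
    return json.dumps(event)
```

Consumers that only need the minimal requirement can ignore the optional `acls` array.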

## SetTimes

Event notifications should be raised to inform consumers that
mtime/atime has changed, as per **SetTimesRequest**.
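
A SetTimes notification could be as small as the path plus the new timestamps. The sketch below makes two hypothetical assumptions: mtime/atime are epoch milliseconds, and `None` means "not changed by this request". The real semantics should follow whatever SetTimesRequest defines. `SetTimesEvent` is an illustrative name only.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class SetTimesEvent:
    """Hypothetical payload for a SetTimesRequest notification."""
    path: str                    # volume + bucket + key
    mtime: Optional[int] = None  # assumed epoch millis; None = unchanged
    atime: Optional[int] = None  # assumed epoch millis; None = unchanged

    def changed_fields(self) -> List[str]:
        """Names of the timestamps this event actually updates."""
        return [name for name in ("mtime", "atime")
                if getattr(self, name) is not None]

    def to_dict(self) -> Dict[str, Any]:
        """Plain dict form, ready for JSON serialization."""
        return asdict(self)
```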

# Transmission format

This section discusses two widely used transmission formats for event
notifications (S3 and HDFS) and their suitability as candidates for
adoption within Ozone.

These are not assumed to be the only options available. However, they
are good examples against which to test our requirements and discuss
trade-offs.

## 1. S3 Event Notification schema

The S3 event notification schema:

[https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-event-types-and-destinations.html#supported-notification-event-types](https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-event-types-and-destinations.html#supported-notification-event-types)

has become a de facto standard for change notifications in S3-compatible
storage services such as S3 itself, Ceph, MinIO, etc.

Notification events are produced as a list of JSON records.

To illustrate, consider a sample "create" event from the Ceph documentation
(https://docs.ceph.com/en/quincy/radosgw/notifications/#events):

```json
{"Records":[
    {
        "eventVersion":"2.1",
        "eventSource":"ceph:s3",
        "awsRegion":"us-east-1",
        "eventTime":"2019-11-22T13:47:35.124724Z",
        "eventName":"ObjectCreated:Put",
        "userIdentity":{
            "principalId":"tester"
        },
        "requestParameters":{
            "sourceIPAddress":""
        },
        "responseElements":{
            "x-amz-request-id":"503a4c37-85eb-47cd-8681-2817e80b4281.5330.903595",
            "x-amz-id-2":"14d2-zone1-zonegroup1"
        },
        "s3":{
            "s3SchemaVersion":"1.0",
            "configurationId":"mynotif1",
            "bucket":{
                "name":"mybucket1",
                "ownerIdentity":{
                    "principalId":"tester"
                },
                "arn":"arn:aws:s3:us-east-1::mybucket1",
                "id":"503a4c37-85eb-47cd-8681-2817e80b4281.5332.38"
            },
            "object":{
                "key":"myimage1.jpg",
                "size":"1024",
                "eTag":"37b51d194a7513e45b56f6524f2d51f2",
                "versionId":"",
                "sequencer": "F7E6D75DC742D108",
                "metadata":[],
                "tags":[]
            }
        },
        "eventId":""
    }
]}
```

As shown above, there are a number of boilerplate fields describing
various aspects of the completed operation. A few fundamental aspects are
worth highlighting:

1. The "key" identifies the key that the operation was performed on.

2. The "eventName" identifies the type of operation that was
   performed. The two most notable event names are **ObjectCreated:Put** and
   **ObjectRemoved:Delete**, which pertain to key creation and deletion respectively.

3. Operation-specific fields can be included within the "object" sub-object (in
   the example above, the "size" and "eTag" of the created object are included).
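
The sketch below makes the three points above concrete. It shows how a consumer might pull the event name, key and operation-specific fields out of a notification payload using only the standard library. It is not tied to any S3 SDK, and `summarize_records` is a hypothetical helper. `SAMPLE` is a trimmed copy of the Ceph example. Ceph reports `size` as a string there, so the sketch normalizes it to an int.

```python
import json
from typing import Any, Dict, List


def summarize_records(payload: str) -> List[Dict[str, Any]]:
    """Extract eventName, bucket, key and size from each S3-style record."""
    summaries = []
    for record in json.loads(payload)["Records"]:
        obj = record["s3"]["object"]
        summaries.append({
            "eventName": record["eventName"],
            "bucket": record["s3"]["bucket"]["name"],
            "key": obj["key"],
            # The Ceph sample encodes size as a string; normalize to int.
            "size": int(obj["size"]) if "size" in obj else None,
        })
    return summaries


SAMPLE = """{"Records":[{"eventName":"ObjectCreated:Put",
  "s3":{"bucket":{"name":"mybucket1"},
        "object":{"key":"myimage1.jpg","size":"1024",
                  "eTag":"37b51d194a7513e45b56f6524f2d51f2"}}}]}"""
```

`summarize_records(SAMPLE)` returns `[{"eventName": "ObjectCreated:Put", "bucket": "mybucket1", "key": "myimage1.jpg", "size": 1024}]`.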

## Applicability to Ozone

For non-FSO Ozone buckets/operations there is a clear mapping between
operations such as CreateKey / CommitKey / DeleteKey / RenameKey and the
standard S3 event notification semantics.

Examples:

1. CommitKey could be mapped to an ObjectCreated:Put "/path/to/keyToCreate" notification event.

2. DeleteKey could be mapped to an ObjectRemoved:Delete "/path/to/keyToDelete" notification event.

3. RenameKey (assuming a file-based key) would produce two events under standard S3 event notification semantics:

   - an ObjectRemoved:Delete event for the source path of the rename
   - an ObjectCreated:Put event for the destination path of the rename
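
The mapping above can be sketched as a small translation function for non-FSO operations. The operation names come from the text. The function itself and its `(eventName, key)` output shape are hypothetical. A rename is expanded into the delete/put pair described in item 3. CreateKey is left unmapped in this sketch, since item 1 attaches the Put to CommitKey.

```python
from typing import List, Optional, Tuple

Event = Tuple[str, str]  # (S3 eventName, key)


def to_s3_events(op: str, key: str, to_key: Optional[str] = None) -> List[Event]:
    """Translate a non-FSO Ozone key operation into S3-style events."""
    if op == "CommitKey":
        return [("ObjectCreated:Put", key)]
    if op == "DeleteKey":
        return [("ObjectRemoved:Delete", key)]
    if op == "RenameKey":
        if to_key is None:
            raise ValueError("RenameKey needs a destination key")
        # S3 has no rename: emit a delete of the source, then a put of the target.
        return [("ObjectRemoved:Delete", key), ("ObjectCreated:Put", to_key)]
    # CreateKey (and anything else) emits nothing in this sketch: per item 1,
    # the Put is only announced once the key is committed.
    return []
```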

Adopting S3 event notification semantics within Ozone would be
challenging in at least two areas:

### 1. FSO hierarchical operations that impact multiple child keys

Example: directory renames

To illustrate, let's say we have the following simple directory structure:

```
  /vol1/bucket1/myfiles/f1
  /vol1/bucket1/myfiles/f2
  /vol1/bucket1/myfiles/subdir/f1
```

If a user performs a directory rename such as:

```
  ozone fs -mv /vol1/bucket1/myfiles /vol1/bucket1/myfiles-RENAMED
```

Under standard S3 event notification semantics, we would expect six
notifications to be emitted in that case:

```
  eventName=ObjectRemoved:Delete, key=/vol1/bucket1/myfiles/f1
  eventName=ObjectRemoved:Delete, key=/vol1/bucket1/myfiles/f2
  eventName=ObjectRemoved:Delete, key=/vol1/bucket1/myfiles/subdir/f1
  eventName=ObjectCreated:Put, key=/vol1/bucket1/myfiles-RENAMED/f1
  eventName=ObjectCreated:Put, key=/vol1/bucket1/myfiles-RENAMED/f2
  eventName=ObjectCreated:Put, key=/vol1/bucket1/myfiles-RENAMED/subdir/f1
```

However, suppose notifications are produced simply from Ratis state
machine events. Then all we would have to go on from the RenameKeyRequest
would be the fromKeyName and toKeyName of the directory being renamed
(i.e. the *parent* of the impacted child objects). The child objects
themselves would not be available.

Therefore, to produce notifications for FSO directory renames we would need
to weigh compatibility with the standard S3 rename semantics against a
custom event type for directory renames.

#### Most compatible approach

For directory renames, we could add some extra processing before
notification events are emitted. This step would "gather together" the
child objects impacted by the rename, before the change is committed to
the DB. It would then emit a delete/create pair of events for each key
(as described above).

Pros:
- standard S3 event notification rename semantics

Cons:
- additional processing to pull together the events. This could mean an
  unknown amount of additional processing for large directory renames.
- could be a performance drag if performed on the OM leader
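
The "gather together" step can be sketched as a prefix rewrite over the impacted keys. The sketch makes a hypothetical assumption: the full list of child keys under the source directory has already been obtained, e.g. from a listing taken before the rename is committed. How and where that listing is produced is exactly the cost raised in the cons above. `expand_dir_rename` is an illustrative name. Applied to the `myfiles` example, it produces the six events listed earlier.

```python
from typing import Iterable, List, Tuple


def expand_dir_rename(src_dir: str, dst_dir: str,
                      child_keys: Iterable[str]) -> List[Tuple[str, str]]:
    """Expand one directory rename into S3-style delete/put events per child.

    child_keys must be the full keys of every object under src_dir; obtaining
    them is the (potentially expensive) gathering step discussed above.
    """
    prefix = src_dir.rstrip("/") + "/"
    new_prefix = dst_dir.rstrip("/") + "/"
    children = sorted(child_keys)
    for key in children:
        if not key.startswith(prefix):
            raise ValueError(f"{key} is not under {src_dir}")
    deletes = [("ObjectRemoved:Delete", key) for key in children]
    puts = [("ObjectCreated:Put", new_prefix + key[len(prefix):])
            for key in children]
    return deletes + puts
```

Deletes are emitted before puts to match the listing above. Interleaving a delete/put pair per key would be an equally valid choice.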

#### Custom event type

Conversely, the S3 schema was designed for non-hierarchical namespaces,
so we could opt not to be fully compliant with existing S3 event
notification semantics. Instead, we could create a custom event extension
(e.g. ObjectRenamed:*) and emit just a single event for a directory
rename, specifying only the old and new paths of the renamed directory:

e.g.
```
  eventName=ObjectRenamed:Renamed, fromKey=myfiles, toKey=myfiles-RENAMED
```

It would then be up to the notification consumer to handle the
different rename event semantics (i.e. only the parent names are
notified, not the impacted child objects).
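
To illustrate the burden this places on consumers, here is a minimal sketch of a consumer that keeps its own index of known keys. `KeyIndex` is hypothetical and nothing like it exists in Ozone. The consumer applies a single directory-level rename event by rewriting every key under the old prefix.

```python
from typing import Dict


class KeyIndex:
    """Hypothetical consumer-side index: key -> metadata (e.g. size)."""

    def __init__(self) -> None:
        self.entries: Dict[str, int] = {}

    def apply_rename(self, from_key: str, to_key: str) -> int:
        """Apply one ObjectRenamed-style event; return how many keys moved.

        Handles both a plain key rename and a directory rename, where only
        the directory names are notified and children must be derived here.
        """
        src, dst = from_key.rstrip("/"), to_key.rstrip("/")
        moved: Dict[str, int] = {}
        for key in list(self.entries):
            if key == src:
                moved[dst] = self.entries.pop(key)
            elif key.startswith(src + "/"):
                moved[dst + key[len(src):]] = self.entries.pop(key)
        self.entries.update(moved)
        return len(moved)
```

Here the per-child work that the most compatible approach would do on the server side moves to every consumer.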

HDFS inotify uses the same semantics for its directory rename event
(see below).

Pros:
- no additional processing when emitting events

Cons:
- non-standard S3 event notification semantics

NOTE: directory rename is just one example of a hierarchical FSO
operation that impacts child objects. Other hierarchical FSO operations
may need to be catered for in a similar way (recursive delete?).

### 2. Metadata changes

The standard S3 event notification schema does not have provision for
notifying about metadata changes.

Therefore, one option for supporting metadata change notifications would
be to add a custom event type, e.g. ObjectMetadataUpdated:*

It is worth noting that Ceph already has some custom extensions,
so there is some precedent for this:
https://docs.ceph.com/en/latest/radosgw/s3-notification-compatibility/#event-types
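
One possible shape for such an extension, purely as an illustration: the metadata requests listed earlier in this document map onto hypothetical `ObjectMetadataUpdated:<subtype>` event names. The subtype names are invented here and are not taken from S3 or Ceph. A consumer could then subscribe either to one subtype or to the whole `ObjectMetadataUpdated:*` family, in the same wildcard style as S3 event types such as `ObjectCreated:*`.

```python
from fnmatch import fnmatchcase
from typing import Dict

# Hypothetical custom event names; none of these exist in S3 itself.
METADATA_EVENT_NAMES: Dict[str, str] = {
    "AddAclRequest": "ObjectMetadataUpdated:Acl",
    "SetAclRequest": "ObjectMetadataUpdated:Acl",
    "RemoveAclRequest": "ObjectMetadataUpdated:Acl",
    "SetTimesRequest": "ObjectMetadataUpdated:Times",
}


def metadata_event_name(request_type: str) -> str:
    """Return the custom event name for a metadata-changing request."""
    try:
        return METADATA_EVENT_NAMES[request_type]
    except KeyError:
        raise ValueError(f"{request_type} is not a metadata request") from None


def matches(subscription: str, event_name: str) -> bool:
    """Wildcard filter check: 'ObjectMetadataUpdated:*' matches every subtype."""
    return fnmatchcase(event_name, subscription)
```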

## 2. HDFS event schema

The HDFS inotify event notification schema

[https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/hdfs/inotify/package-summary.html](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/hdfs/inotify/package-summary.html)

allows an HDFS client with suitable privileges to poll the HDFS NameNode
for notifications about changes to the filesystem across the entire
cluster (i.e. there is no granular per-directory subscription).

The notifications use a binary protocol (protobuf). The protobuf specs
for the notification events can be found here:

https://github.com/apache/hadoop/blob/3d905f9cd07d118f5ea0c8485170f5ebefb84089/hadoop-hdfs-project/hadoop-hdfs-client/src/main/proto/inotify.proto#L62

## Applicability to Ozone

Since HDFS is a hierarchical filesystem, there is a natural mapping to
the FSO operations within Ozone.

For example:

* a directory rename is emitted as a RenameEvent
  (https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/hdfs/inotify/Event.RenameEvent.html) with
  srcPath=/path/to/old-dir, dstPath=/path/to/new-dir (i.e. there is no
  expectation that the impact on child objects will be notified)

* a recursive delete is emitted as a single UnlinkEvent
  (https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/hdfs/inotify/Event.UnlinkEvent.html)
  on the deleted directory itself, not on each child

* metadata changes (such as changes to permissions, replication,
  owner/group, ACLs, xattrs, etc.) are sent via a MetadataUpdateEvent
  (https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/hdfs/inotify/Event.MetadataUpdateEvent.html)

This would be a good starting point for Ozone, but it would require some
bespoke changes. For example, Ozone ACLs do not have a one-to-one mapping
to HDFS concepts.
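
The sketch below gives a rough idea of what that mapping could look like. It uses event type names from the HDFS inotify API: the `Event.EventType` values CREATE, CLOSE, RENAME, UNLINK and METADATA, the `CreateEvent.INodeType` values FILE and DIRECTORY, and the `MetadataUpdateEvent.MetadataType` values ACLS and TIMES. These are mirrored as plain Python strings rather than the actual Java classes. The Ozone-to-HDFS pairings follow the discussion above but are only illustrative. Treating CreateRequest/CreateFileRequest as always creating files is an assumption. The ACL entries are where the bespoke changes mentioned above would be needed.

```python
from typing import Dict, NamedTuple, Optional


class HdfsMapping(NamedTuple):
    event_type: str                      # inotify Event.EventType name
    inode_type: Optional[str] = None     # CreateEvent.INodeType, CREATE only
    metadata_type: Optional[str] = None  # MetadataUpdateEvent.MetadataType


# Illustrative pairing of Ozone OM requests with HDFS inotify event types.
OZONE_TO_INOTIFY: Dict[str, HdfsMapping] = {
    "CreateRequest": HdfsMapping("CREATE", inode_type="FILE"),      # assumed
    "CreateFileRequest": HdfsMapping("CREATE", inode_type="FILE"),  # assumed
    "CreateDirectoryRequest": HdfsMapping("CREATE", inode_type="DIRECTORY"),
    "CommitKeyRequest": HdfsMapping("CLOSE"),
    "DeleteKeyRequest": HdfsMapping("UNLINK"),  # one event, even if recursive
    "RenameKeyRequest": HdfsMapping("RENAME"),  # srcPath/dstPath only
    # Ozone ACLs do not map one-to-one onto HDFS ACLs: needs bespoke handling.
    "AddAclRequest": HdfsMapping("METADATA", metadata_type="ACLS"),
    "SetAclRequest": HdfsMapping("METADATA", metadata_type="ACLS"),
    "RemoveAclRequest": HdfsMapping("METADATA", metadata_type="ACLS"),
    "SetTimesRequest": HdfsMapping("METADATA", metadata_type="TIMES"),
}


def rename_event(src_path: str, dst_path: str) -> Dict[str, str]:
    """A directory rename becomes a single RENAME event; children are implied."""
    return {"eventType": OZONE_TO_INOTIFY["RenameKeyRequest"].event_type,
            "srcPath": src_path, "dstPath": dst_path}
```

Unlike the S3 sketches, `rename_event("/vol1/bucket1/myfiles", "/vol1/bucket1/myfiles-RENAMED")` yields one event regardless of how many children the directory has.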

Pros:
- clear mapping for FSO and non-FSO operations such as directory renames
- caters for metadata operations by design (although it would require some
  customization)

Cons:
- not as ubiquitous across storage solutions as the S3 event notification schema