Commit da8a5e5 (parent 593c816)
HDDS-13513 Ozone Event Notification Design: includes event notification schema design

---
title: Event notification schema discussion
summary: Event notification schema discussion
date: 2025-06-29
jira: HDDS-13513
status: design
author: Colm Dougan, Donal Magennis
---
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

## Overview

This document outlines the schema requirements for event notification
within Ozone and discusses the suitability of two widely used event
notification schemas (S3 and HDFS) as candidates for the transmission
format of notifications within Ozone.

# General schema requirements

## File/Directory creation/modification

Event notifications should be raised to inform consumers of completed
operations which modify the filesystem, specifically for the following
requests:

#### CreateRequest

We should emit some **create** event.

Required fields:
- path (volume + bucket + key)
- isfile

Nice-to-have fields:
- overwrite
- recursive
#### CreateFileRequest

We should emit some **create** event.

Required fields:
- path (volume + bucket + key)
- isfile

Nice-to-have fields:
- overwrite
- recursive
#### CreateDirectoryRequest

We should emit some **create** event.

Required fields:
- path (volume + bucket + key)
- isfile
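To make the field requirements above concrete, a minimal create event could look like the following sketch (the field names are illustrative only, not a committed schema):

```json
{
  "eventType": "create",
  "path": "/vol1/bucket1/path/to/key",
  "isfile": true,
  "overwrite": false,
  "recursive": false
}
```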
#### CommitKeyRequest

We should emit some **commit/close** event.

Required fields:
- path (volume + bucket + key)

Nice-to-have fields:
- data size
- hsync?
#### DeleteKeyRequest

We should emit some **delete** event.

Required fields:
- path (volume + bucket + key)

Nice-to-have fields:
- recursive (if known)
#### RenameKeyRequest

We should emit some **rename** event.

Required fields:
- fromPath (volume + bucket + key)
- toPath (volume + bucket + toKeyName)

Nice-to-have fields:
- recursive (if known)
- is directory (if known)

NOTE: in the case of an FSO directory rename there is a dilemma
(discussed later in this document) as to whether we should emit a single
event for a directory rename (specifying only the old/new directory
names) or whether we should emit granular events for all the child
objects impacted by the rename.
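As a sketch, a single-event rename notification carrying the required fields might look like this (again, field names are illustrative only):

```json
{
  "eventType": "rename",
  "fromPath": "/vol1/bucket1/oldKeyName",
  "toPath": "/vol1/bucket1/newKeyName",
  "isDirectory": false,
  "recursive": false
}
```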

## ACLs

Event notifications should be raised to inform consumers that ACL events
have happened. The relevant requests are:

* AddAclRequest
* SetAclRequest
* RemoveAclRequest

The fields provided could vary based on the implementation complexity.

Minimally we have a requirement that we be informed that "some ACL
update happened" to a certain key (or prefix).

Ideally the details would include the full context of the change made
as per the request (perhaps by mirroring the full request details as a
JSON sub-object), e.g.:

```json
...

"acls": [
  {
    "type": "GROUP",
    "name": "mygroup",
    "rights": "\000\001",
    "aclScope": "ACCESS"
  }
]
```

We would need to revisit the precise details with guidance from the
community, but this sets broad-brush expectations.
## SetTimes

Event notifications should be raised to inform consumers that
mtime/atime has changed, as per **SetTimesRequest**.
# Transmission format

This section discusses two widely used transmission formats for event
notifications (S3 and HDFS) and their suitability as candidates for
adoption within Ozone.

It is not assumed that these are the only options available, but they
are good examples to test against our requirements and to discuss
trade-offs.
## 1. S3 Event Notification schema

The S3 event notification schema:

[https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-event-types-and-destinations.html#supported-notification-event-types](https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-event-types-and-destinations.html#supported-notification-event-types)

has become a standard for change notifications in S3-compatible storage
services such as S3 itself, Ceph, MinIO, etc.

Notification events are produced as a list of JSON records.

To illustrate, we can look at a sample "create" event from the Ceph docs
(https://docs.ceph.com/en/quincy/radosgw/notifications/#events):
```json
{"Records":[
    {
        "eventVersion":"2.1",
        "eventSource":"ceph:s3",
        "awsRegion":"us-east-1",
        "eventTime":"2019-11-22T13:47:35.124724Z",
        "eventName":"ObjectCreated:Put",
        "userIdentity":{
            "principalId":"tester"
        },
        "requestParameters":{
            "sourceIPAddress":""
        },
        "responseElements":{
            "x-amz-request-id":"503a4c37-85eb-47cd-8681-2817e80b4281.5330.903595",
            "x-amz-id-2":"14d2-zone1-zonegroup1"
        },
        "s3":{
            "s3SchemaVersion":"1.0",
            "configurationId":"mynotif1",
            "bucket":{
                "name":"mybucket1",
                "ownerIdentity":{
                    "principalId":"tester"
                },
                "arn":"arn:aws:s3:us-east-1::mybucket1",
                "id":"503a4c37-85eb-47cd-8681-2817e80b4281.5332.38"
            },
            "object":{
                "key":"myimage1.jpg",
                "size":"1024",
                "eTag":"37b51d194a7513e45b56f6524f2d51f2",
                "versionId":"",
                "sequencer": "F7E6D75DC742D108",
                "metadata":[],
                "tags":[]
            }
        },
        "eventId":"",
        "opaqueData":"[email protected]"
    }
]}
```

As we can see above, there are a number of boilerplate fields informing
us of various aspects of the completed operation, but there are a few
fundamental aspects to highlight:

1. the "key" informs us of the key that the operation was performed on.

2. the "eventName" informs us of the type of operation that was
performed. The two most notable eventNames are **ObjectCreated:Put** and
**ObjectRemoved:Delete**, which pertain to key creation and deletion
respectively.

3. operation-specific fields can be included within the "object"
sub-object (in the above example we can see that the "size" and "eTag"
of the created object are included).
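Consumers of this schema typically dispatch on "eventName" and "key". A minimal sketch of extracting those fields from a payload (using a trimmed-down record for brevity; real payloads carry the extra boilerplate fields shown in the Ceph sample above):

```python
import json

# Trimmed-down payload; field names match the Ceph sample above.
sample = (
    '{"Records":[{"eventName":"ObjectCreated:Put",'
    '"s3":{"bucket":{"name":"mybucket1"},'
    '"object":{"key":"myimage1.jpg","size":"1024"}}}]}'
)

def summarize(records_json):
    """Extract (eventName, bucket, key) for each record in the payload."""
    doc = json.loads(records_json)
    return [
        (r["eventName"], r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in doc["Records"]
    ]
```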
## Applicability to Ozone

For non-FSO Ozone buckets and operations there is a clear mapping
between operations such as CreateKey / CommitKey / DeleteKey /
RenameKey and the standard S3 event notification semantics.

Examples:

1. CommitKey could be mapped to an ObjectCreated:Put
"/path/to/keyToCreate" notification event

2. DeleteKey could be mapped to an ObjectRemoved:Delete
"/path/to/keyToDelete" notification event

3. RenameKey (assuming a file-based key) in standard S3 event
notification semantics would produce 2 events:

- an ObjectRemoved:Delete event for the source path of the rename
- an ObjectCreated:Put event for the destination path of the rename
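The three examples above can be sketched as a lookup table (the request names mirror the OM requests discussed earlier; the mapping itself is a sketch of this design discussion, not a committed design):

```python
# Hypothetical mapping from Ozone OM request types to the S3 event
# name(s) a completed request would translate to.
S3_EVENTS_FOR_REQUEST = {
    "CommitKeyRequest": ["ObjectCreated:Put"],
    "DeleteKeyRequest": ["ObjectRemoved:Delete"],
    # A file rename fans out into a delete of the source key followed
    # by a create of the destination key under standard S3 semantics.
    "RenameKeyRequest": ["ObjectRemoved:Delete", "ObjectCreated:Put"],
}

def s3_events_for(request_type):
    """Return the S3 event names emitted for a completed OM request."""
    return S3_EVENTS_FOR_REQUEST.get(request_type, [])
```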

The challenge in adopting S3 event notification semantics within Ozone
lies in at least two areas:

### 1. FSO hierarchical operations which impact multiple child keys

Example: directory renames

To illustrate with an example, let's say we have the following simple
directory structure:

```
/vol1/bucket1/myfiles/f1
/vol1/bucket1/myfiles/f2
/vol1/bucket1/myfiles/subdir/f1
```

If a user performs a directory rename such as:

```
ozone fs -mv /vol1/bucket1/myfiles /vol1/bucket1/myfiles-RENAMED
```

then within standard S3 event notification semantics we would expect to
see 6 notifications emitted:

```
eventName=ObjectRemoved:Delete, key=/vol1/bucket1/myfiles/f1
eventName=ObjectRemoved:Delete, key=/vol1/bucket1/myfiles/f2
eventName=ObjectRemoved:Delete, key=/vol1/bucket1/myfiles/subdir/f1
eventName=ObjectCreated:Put, key=/vol1/bucket1/myfiles-RENAMED/f1
eventName=ObjectCreated:Put, key=/vol1/bucket1/myfiles-RENAMED/f2
eventName=ObjectCreated:Put, key=/vol1/bucket1/myfiles-RENAMED/subdir/f1
```

However, with an approach of simply producing notifications based on
Ratis state machine events, all we would have to go on from the
RenameKeyRequest would be the fromKeyName and the toKeyName of the
renamed directory itself, i.e. the *parent* of the impacted child
objects (and not the children themselves).

Therefore, to produce notifications using the standard S3 event
notification semantics for FSO directory renames, we would need to
consider the trade-offs between compatibility with the normal S3
semantics for renames and a custom event type for directory renames.
### Most compatible approach

We could introduce some additional processing before emitting
notification events in the case of a directory rename, which "gathers
together" (prior to the change being committed to the DB) the child
objects impacted by the directory rename and emits pairs of
delete/create events for each key (as described above).
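The fan-out described above can be sketched as follows (how the child keys are enumerated efficiently is the open design question; this only illustrates the event expansion):

```python
def expand_directory_rename(child_keys, src_dir, dst_dir):
    """Expand one directory rename into per-child delete/create pairs.

    `child_keys` is the list of keys under the renamed directory,
    gathered before the change is committed to the DB.
    """
    events = []
    for key in sorted(child_keys):
        # Standard S3 rename semantics: delete the source key, then
        # create the destination key.
        events.append(("ObjectRemoved:Delete", key))
        events.append(("ObjectCreated:Put", dst_dir + key[len(src_dir):]))
    return events
```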

Pros:
- standard S3 event notification rename semantics

Cons:
- additional processing to pull together the events; this could mean an
  unknown amount of extra work for large directory renames
- could be a performance drag if performed on the leader
### Custom event type

Conversely, we could opt not to pursue full compliance with existing S3
event notification semantics, since the schema was designed for
non-hierarchical storage, and instead create some custom event
extension (e.g. ObjectRenamed:*) and emit just a single event for
directory renames which specifies only the parent paths impacted by the
rename, e.g.:

```
eventName=ObjectRenamed:Rename, fromKey=myfiles, toKey=myfiles-RENAMED
```

It would then be up to the notification consumer to deal with the
different rename event semantics (i.e. that only the parent names are
notified and not the impacted child objects).

This is the same semantics as used by the HDFS inotify directory rename
event (see below).
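For illustration, a consumer maintaining its own key index could apply such a single parent-level rename event by rewriting prefixes (a sketch; the event field names and the set-based index are assumptions for this example):

```python
def apply_directory_rename(index, from_key, to_key):
    """Apply a single directory-rename event to a consumer-side index.

    The notification names only the parent paths, so the consumer is
    responsible for locating the affected children in its own state.
    `index` is a set of key paths the consumer already knows about.
    """
    prefix = from_key.rstrip("/") + "/"
    affected = [k for k in index if k == from_key or k.startswith(prefix)]
    for k in affected:
        index.discard(k)
        if k == from_key:
            index.add(to_key)
        else:
            index.add(to_key.rstrip("/") + "/" + k[len(prefix):])
    return index
```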

Pros:
- no additional processing when emitting events

Cons:
- non-standard S3 event notification semantics

NOTE: directory rename is just one example of a hierarchical FSO
operation which impacts child objects. There may be other hierarchical
FSO operations in Ozone which will need to be catered for in a similar
way (recursive delete?).
### 2. Metadata changes

The standard S3 event notification schema does not have provision for
notifying about metadata changes.

Therefore, to support notifying about metadata changes, one option
would be to add a custom event type, e.g. ObjectMetadataUpdated:*

It is worth noting here that Ceph has some custom extensions, so there
is some precedent for that:
https://docs.ceph.com/en/latest/radosgw/s3-notification-compatibility/#event-types

## 2. HDFS event schema

The HDFS inotify event notification schema

[https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/hdfs/inotify/package-summary.html](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/hdfs/inotify/package-summary.html)

allows an HDFS client with suitable privileges to poll the HDFS
namenode for notifications pertaining to changes on the filesystem
across the entire cluster (i.e. there is no granular per-directory
subscription).

The notifications use a binary protocol (protobuf). The protobuf specs
for the notification events can be found here:

https://github.com/apache/hadoop/blob/3d905f9cd07d118f5ea0c8485170f5ebefb84089/hadoop-hdfs-project/hadoop-hdfs-client/src/main/proto/inotify.proto#L62

## Applicability to Ozone

Since HDFS is a hierarchical filesystem there is a natural mapping to
the FSO operations within Ozone.

For example:

* a directory rename is emitted as a RenameEvent
(https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/hdfs/inotify/Event.RenameEvent.html)
with srcPath=/path/to/old-dir, dstPath=/path/to/new-dir (i.e. there is
no expectation that the impact on child objects will be notified)

* a recursive delete is emitted as an UnlinkEvent
(https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/hdfs/inotify/Event.UnlinkEvent.html)
on the parent

* metadata changes (such as changes to permissions, replication,
owner/group, ACLs, xattrs, etc.) are sent via a MetadataUpdateEvent
(https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/hdfs/inotify/Event.MetadataUpdateEvent.html)

This would be a good starting point for Ozone but would require some
bespoke changes, as ACLs, for example, do not have a one-to-one mapping
to HDFS concepts.

Pros:
- clear mapping for FSO and non-FSO operations such as directory renames
- caters for metadata operations by design (although some customization
  would be required)

Cons:
- not ubiquitous across storage solutions in the way that the S3 Event
  Notification schema is
