Commit f1489b4

cl_intel_split_work_group_barrier (#765)
* initial draft for cl_intel_split_work_group_barrier
* update version to v1.0.0
1 parent 9c65735 commit f1489b4

1 file changed: 209 additions & 0 deletions
@@ -0,0 +1,209 @@
:data-uri:
:sectanchors:
:icons: font
:source-highlighter: coderay
// TODO: try rouge?

= cl_intel_split_work_group_barrier

== Name Strings

`cl_intel_split_work_group_barrier`

== Contact

Ben Ashbaugh, Intel (ben 'dot' ashbaugh 'at' intel 'dot' com)

== Contributors

// spell-checker: disable
Ben Ashbaugh, Intel +
Eugene Chereshnev, Intel +
John Pennycook, Intel
// spell-checker: enable

== Notice

Copyright (c) 2022 Intel Corporation. All rights reserved.

== Status

Shipping

== Version

Built On: {docdate} +
Version: 1.0.0

== Dependencies

This extension is written against the OpenCL 3.0 C Language specification and the OpenCL SPIR-V Environment specification, V3.0.10.

This extension requires OpenCL 1.0.

Some OpenCL C function overloads added by this extension require OpenCL C 2.0 or newer.

== Overview

This extension adds built-in functions to split a `barrier` or `work_group_barrier` function in OpenCL C into two separate operations:
the first indicates that a work-item has "arrived" at a barrier but should continue executing,
and the second indicates that a work-item should "wait" for all of the work-items to arrive at the barrier before executing further.

Splitting a barrier operation may improve performance and may provide a closer match to "latch" or "barrier" operations in other parallel languages such as C++20.
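
As a rough host-side illustration of the two phases, the following C11 sketch models a single use of a split barrier with an atomic counter. It is not part of this extension: the names `split_barrier_model`, `model_init`, `model_arrive`, and `model_wait_ready` are hypothetical, and a real implementation is provided by the device compiler and hardware.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical single-use model of a split barrier for one work-group. */
typedef struct {
    atomic_uint arrived;  /* number of work-items that have arrived */
    unsigned expected;    /* work-group size */
} split_barrier_model;

static void model_init(split_barrier_model *b, unsigned work_group_size)
{
    atomic_init(&b->arrived, 0u);
    b->expected = work_group_size;
}

/* "Arrive": record the arrival with release ordering and return at once,
 * so the caller can continue with independent work. */
static void model_arrive(split_barrier_model *b)
{
    atomic_fetch_add_explicit(&b->arrived, 1u, memory_order_release);
}

/* "Wait" predicate: true once every work-item has arrived. The load uses
 * acquire ordering. A real wait would block until this becomes true. */
static bool model_wait_ready(split_barrier_model *b)
{
    return atomic_load_explicit(&b->arrived, memory_order_acquire)
           == b->expected;
}
```

In this model, work placed between `model_arrive` and the point where `model_wait_ready` is checked can overlap with other work-items still approaching the barrier, which is the potential performance benefit described above.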

== New API Functions

None.

== New API Enums

None.

== New API Types

None.

== New OpenCL C Functions

[source]
----
void intel_work_group_barrier_arrive(cl_mem_fence_flags flags);
void intel_work_group_barrier_wait(cl_mem_fence_flags flags);

// For OpenCL C 2.0 or newer:
void intel_work_group_barrier_arrive(cl_mem_fence_flags flags, memory_scope scope);
void intel_work_group_barrier_wait(cl_mem_fence_flags flags, memory_scope scope);
----

== Modifications to the OpenCL C Specification

=== Add to Table 19 - Built-in Work-Group Synchronization Functions

[caption="Table 19. "]
.Built-in Work-Group Synchronization Functions
[cols="1a,2",options="header"]
|====
| *Function*
| *Description*

|[source]
----
void intel_work_group_barrier_arrive(
    cl_mem_fence_flags flags);
void intel_work_group_barrier_wait(
    cl_mem_fence_flags flags);

// For OpenCL C 2.0 or newer:
void intel_work_group_barrier_arrive(
    cl_mem_fence_flags flags,
    memory_scope scope);
void intel_work_group_barrier_wait(
    cl_mem_fence_flags flags,
    memory_scope scope);
----
| For these functions, if any work-item in a work-group arrives at a barrier, behavior is undefined unless all work-items in the work-group arrive at the barrier.
If any work-item in a work-group waits on a barrier, behavior is undefined unless all work-items in the work-group wait on the barrier.

If a barrier arrive function is inside of a conditional statement and any work-item in the work-group enters the conditional statement and arrives at the barrier, behavior is undefined unless all work-items enter the conditional and arrive at the barrier.
If a barrier wait function is inside of a conditional statement and any work-item in the work-group enters the conditional statement and waits on the barrier, behavior is undefined unless all work-items enter the conditional and wait on the barrier.

If a barrier arrive function is inside of a loop and any work-item arrives at the barrier for an iteration of the loop, behavior is undefined unless all work-items arrive at the barrier for the same iteration of the loop.
If a barrier wait function is inside of a loop and any work-item waits on the barrier for an iteration of the loop, behavior is undefined unless all work-items wait on the barrier for the same iteration of the loop.

Behavior is undefined if a work-item waits on a barrier before arriving at a barrier.
After a work-item arrives at a barrier, behavior is undefined if the work-item arrives at another barrier before waiting on a barrier.
After a work-item waits on a barrier, behavior is undefined if the work-item waits on another barrier before arriving at a barrier.

The `intel_work_group_barrier_arrive` and `intel_work_group_barrier_wait` functions specify which memory operations from before arriving at the barrier must be visible to work-items after waiting on the barrier by using the _flags_ and _scope_ arguments.

The _flags_ argument specifies the memory address spaces to which the memory ordering constraints apply.
This is a bitfield that can be zero or a combination of the following values:

`CLK_LOCAL_MEM_FENCE`: for `local` memory accesses. +
`CLK_GLOBAL_MEM_FENCE`: for `global` memory accesses. +
`CLK_IMAGE_MEM_FENCE`: for image memory accesses. For this flag, the value of _scope_ must be `memory_scope_work_group` or behavior is undefined.

The _scope_ argument describes the work-items to which the memory ordering constraints apply.
If no _scope_ argument is provided, the _scope_ is `memory_scope_work_group`.

If the _flags_ argument differs between the barrier arrive function and the barrier wait function, then only memory operations for the address spaces specified by the intersection of the two _flags_ arguments must be visible.

If the _scope_ argument differs between the barrier arrive function and the barrier wait function, then the memory ordering constraints only apply to work-items described by the narrower of the two _scope_ arguments.

For each call to these functions, the values of _flags_ and _scope_ must be the same for all work-items in the work-group.
|====
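
The per-work-item ordering rules in the table (arrive first, then strictly alternate arrive and wait) can be pictured as a two-state machine. The following host-side C sketch checks a recorded sequence of calls for one work-item against those three rules only. The names `barrier_op` and `sequence_is_defined` are hypothetical, and the sketch does not model the work-group-wide, conditional, or loop rules.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one split barrier call made by a work-item. */
typedef enum { OP_ARRIVE, OP_WAIT } barrier_op;

/* Returns true if the sequence obeys the per-work-item ordering rules:
 *  - no wait before the first arrive,
 *  - no second arrive before a wait,
 *  - no second wait before another arrive.
 * A trailing arrive with no matching wait is not rejected here. */
static bool sequence_is_defined(const barrier_op *ops, size_t count)
{
    bool arrived_not_waited = false;

    for (size_t i = 0; i < count; ++i) {
        if (ops[i] == OP_ARRIVE) {
            if (arrived_not_waited) {
                return false;  /* arrive followed by another arrive */
            }
            arrived_not_waited = true;
        } else {
            if (!arrived_not_waited) {
                return false;  /* wait without a preceding arrive */
            }
            arrived_not_waited = false;
        }
    }
    return true;
}
```

For example, the sequence arrive, wait, arrive, wait passes this check, while a sequence that starts with a wait or contains two consecutive arrives does not.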

== Modifications to the OpenCL SPIR-V Environment Specification

=== Add a new section 5.2.X - `cl_intel_split_work_group_barrier`

If the OpenCL environment supports the extension `cl_intel_split_work_group_barrier` then the environment must accept modules that declare use of the extension `SPV_INTEL_split_barrier` and that declare the SPIR-V capability *SplitBarrierINTEL*.

For the instructions *OpControlBarrierArriveINTEL* and *OpControlBarrierWaitINTEL* added by the extension:

* _Scope_ for _Execution_ must be *WorkGroup*.
* Valid values for _Scope_ for _Memory_ are the same as for *OpControlBarrier*.

For the instruction *OpControlBarrierArriveINTEL*, the memory-order constraint in _Memory Semantics_ must be *Release*.

For the instruction *OpControlBarrierWaitINTEL*, the memory-order constraint in _Memory Semantics_ must be *Acquire*.
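
A validator could check the two memory-order requirements above by masking the memory-order bits of the _Memory Semantics_ operand. The C sketch below assumes the standard SPIR-V bit values for *Acquire*, *Release*, *AcquireRelease*, *SequentiallyConsistent*, and *WorkgroupMemory*. The enumerator and function names are hypothetical, and the sketch does not check the scope requirements.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values follow the SPIR-V Memory Semantics mask; the enumerator
 * names are illustrative and not taken from any header. */
enum {
    SEM_ACQUIRE                 = 0x2,
    SEM_RELEASE                 = 0x4,
    SEM_ACQUIRE_RELEASE         = 0x8,
    SEM_SEQUENTIALLY_CONSISTENT = 0x10,
    SEM_WORKGROUP_MEMORY        = 0x100,
    SEM_ORDER_MASK = SEM_ACQUIRE | SEM_RELEASE | SEM_ACQUIRE_RELEASE |
                     SEM_SEQUENTIALLY_CONSISTENT
};

/* OpControlBarrierArriveINTEL: the memory-order constraint must be Release. */
static bool arrive_semantics_valid(uint32_t semantics)
{
    return (semantics & SEM_ORDER_MASK) == SEM_RELEASE;
}

/* OpControlBarrierWaitINTEL: the memory-order constraint must be Acquire. */
static bool wait_semantics_valid(uint32_t semantics)
{
    return (semantics & SEM_ORDER_MASK) == SEM_ACQUIRE;
}
```

Storage-class bits such as `SEM_WORKGROUP_MEMORY` are ignored by the mask, so *Release* combined with *WorkgroupMemory* is accepted for the arrive instruction.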

== Issues

. Do we need to support all of the features of C++20 barriers (completion functions, arrival tokens, etc.)?
+
--
*RESOLVED*: Not in this extension.
--

. Do we need to support subgroup split barriers?
+
--
*RESOLVED*: Not in this extension.
--

. Do we need to document formal changes to the memory model?
+
--
*RESOLVED*: Not initially.
Informally, the barrier wait for one work-item synchronizes-with the barrier arrives for the other work-items in the work-group.
--

. What are the memory order constraints for a split barrier?
+
--
*RESOLVED*: Arriving at a split barrier will effectively be a release memory fence and waiting on a barrier will effectively be an acquire memory fence.

Alternatively, both arriving and waiting could be sequentially consistent memory fences, but acquire and release are sufficient for most use-cases and may perform better.
If a sequentially consistent fence is required instead, applications can use an ordinary non-split barrier, or insert explicit memory fences before arriving at the split barrier and after waiting on a split barrier.
--

. What should behavior be if the flags or scope arguments differ between the barrier arrive and the barrier wait?
+
--
*RESOLVED*: The address spaces will be the intersection of the flags, and the memory scope will be the narrower of the two scopes.
This is the same behavior that would be observed with a release fence before arriving at the barrier and an acquire fence after waiting on the barrier.

Alternatively, this scenario could be undefined behavior, but this appears to be unnecessary.
--
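
The resolution of the last issue can be illustrated with a small host-side C model. It computes the effective address spaces as the bitwise intersection of the two flags values and the effective scope as the narrower of the two scopes. All names and numeric values below are hypothetical placeholders, not the actual `CLK_*` or `memory_scope_*` definitions.

```c
/* Placeholder flag bits; the real CLK_*_MEM_FENCE values are defined
 * by the OpenCL C implementation. */
typedef unsigned int fence_flags_model;
enum {
    MODEL_LOCAL_FENCE  = 1,
    MODEL_GLOBAL_FENCE = 2,
    MODEL_IMAGE_FENCE  = 4
};

/* Placeholder scopes, ordered from narrowest to widest. */
typedef enum {
    MODEL_SCOPE_WORK_ITEM,
    MODEL_SCOPE_SUB_GROUP,
    MODEL_SCOPE_WORK_GROUP,
    MODEL_SCOPE_DEVICE,
    MODEL_SCOPE_ALL_SVM_DEVICES
} scope_model;

/* Address spaces whose prior memory operations must be visible. */
static fence_flags_model effective_flags(fence_flags_model arrive_flags,
                                         fence_flags_model wait_flags)
{
    return arrive_flags & wait_flags;
}

/* Work-items to which the ordering constraints apply. */
static scope_model effective_scope(scope_model arrive_scope,
                                   scope_model wait_scope)
{
    return arrive_scope < wait_scope ? arrive_scope : wait_scope;
}
```

Under this model, arriving with local and global flags and waiting with only the global flag orders global memory accesses only, and arriving with device scope while waiting with work-group scope yields work-group scope.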

== Revision History

[cols="5,15,15,70"]
[grid="rows"]
[options="header"]
|========================================
|Version|Date|Author|Changes
|0.9.0|2022-01-11|Ben Ashbaugh|*Initial revision*
|0.9.1|2022-02-07|Ben Ashbaugh|Added "intel" prefix to split barrier functions.
|1.0.0|2022-09-06|Ben Ashbaugh|Updated version.
|========================================

//************************************************************************
//Other formatting suggestions:
//
//* Use *bold* text for host APIs, or [source] syntax highlighting.
//* Use `mono` text for device APIs, or [source] syntax highlighting.
//* Use `mono` text for extension names, types, or enum values.
//* Use _italics_ for parameters.
//************************************************************************
