diff --git a/programming_examples/README.md b/programming_examples/README.md index 8ab90c6ea12..ef8082aab86 100644 --- a/programming_examples/README.md +++ b/programming_examples/README.md @@ -18,11 +18,15 @@ Each IRON example has one or more implementations: They are organized into the following directories: -## [getting_started](./getting_started) +## [getting_started](./getting_started) Designs tailored to the new user experience that span from basic applications such as SAXPY to more complicated ones such as tiled matrix multiplication, for the NPU in Ryzen™ AI. -## [basic](./basic) +## [algorithms](./algorithms) + +Higher-level algorithm templates (transform, for_each, and parallel variants) that handle Workers, ObjectFIFOs, and data movement automatically for common element-wise dataflow patterns on the NPU in Ryzen™ AI. + +## [basic](./basic) Basic building blocks to understand the NPU architecture and first steps towards building applications for the NPU in Ryzen™ AI. diff --git a/programming_examples/basic/README.md b/programming_examples/basic/README.md index d591ba4a23d..4bdf285b12c 100644 --- a/programming_examples/basic/README.md +++ b/programming_examples/basic/README.md @@ -12,17 +12,30 @@ These programming examples provide a good starting point to illustrate how to build commonly used compute kernels (both single-core and multi-core data processing pipelines). They serve to highlight how designs can be described in Python and lowered through the mlir-aie tool flow to an executable that runs on the NPU. [Passthrough Kernel](./passthrough_kernel) and [Vector Scalar Mul](./vector_scalar_mul) are good designs to get started with. Please see [section 3](../../programming_guide/section-3/) of the [programming guide](../../programming_guide/) for a more detailed guide on developing designs. 
-* [Passthrough DMAs](./passthrough_dmas) - This design demonstrates data movement to implement a memcpy operation using object FIFOs just using DMAs without involving the AIE core. -* [Passthrough Kernel](./passthrough_kernel) - This design demonstrates a simple AIE implementation for vectorized memcpy on a vector of integer involving AIE core kernel programming. -* [DMA Transpose](./dma_transpose) - Transposes a matrix with the Shim DMA using `npu_dma_memcpy_nd` -* [Vector Scalar Add](./vector_scalar_add) - Single tile performs a very simple `+` operation where the kernel loads data from local memory, increments the value by `1` and stores it back. -* [Vector Scalar Mul](./vector_scalar_mul) - Single tile performs `vector * scalar` of size `4096`. The kernel does a `1024` vector multiply and is invoked multiple times to complete the full `vector * scalar` compute. +* [Passthrough DMAs](./passthrough_dmas) - Data movement memcpy using object FIFOs via DMAs only, without involving the AIE core. +* [Passthrough Kernel](./passthrough_kernel) - Vectorized memcpy via a single AIE core kernel. +* [Passthrough PyKernel](./passthrough_pykernel) - Memcpy where the AIE kernel is written as an inline Python function rather than a C++ external function. +* [Passthrough DMAs PLIO](./passthrough_dmas_plio) - **Targets the Xilinx VCK5000, not Ryzen AI NPU.** Demonstrates PLIO-connected soft DMAs in programmable logic. +* [DMA Transpose](./dma_transpose) - Matrix transpose using the Shim DMA with `npu_dma_memcpy_nd`. +* [DMA Transpose Packet](./dma_transpose_packet) - Matrix transpose using packet-switched DMA flows. +* [Chaining Channels](./chaining_channels) - Demonstrates chaining multiple DMA buffer descriptors in sequence on a single channel. +* [Combined Transpose](./combined_transpose) - Matrix transpose combining Shim DMA strides with AIE core VSHUFFLE instructions. +* [Shuffle Transpose](./shuffle_transpose) - Matrix transpose using only AIE core VSHUFFLE instructions. 
+* [Vector Scalar Add](./vector_scalar_add) - Single tile increments every element of a vector by `1`. +* [Vector Scalar Mul](./vector_scalar_mul) - Single tile performs `vector * scalar` of size `4096` in `1024`-element chunks. +* [Vector Scalar Add Runlist](./vector_scalar_add_runlist) - Vector scalar add using the run-list execution model. * [Vector Vector Add](./vector_vector_add) - Single tile performs `vector + vector` of size `1024`. +* [Vector Vector Add BDs Init Values](./vector_vector_add_BDs_init_values) - Vector addition with buffer descriptors pre-initialized with values. * [Vector Vector Modulo](./vector_vector_modulo) - Single tile performs `vector % vector` of size `1024`. * [Vector Vector Multiply](./vector_vector_mul) - Single tile performs `vector * vector` of size `1024`. -* [Vector Reduce Add](./vector_reduce_add) - Single tile performs a reduction of a vector to return the `sum` of the elements. -* [Vector Reduce Max](./vector_reduce_max) - Single tile performs a reduction of a vector to return the `max` of the elements. -* [Vector Reduce Min](./vector_reduce_min) - Single tile performs a reduction of a vector to return the `min` of the elements. -* [Vector Exp](./vector_exp) - A simple element-wise exponent function, using the look up table capabilities of the AI Engine. -* [Matrix Scalar Add](./matrix_scalar_add) - Single tile performs `matrix * vector` with matrix size of `16x8`. -* [Matrix Multiplication](./matrix_multiplication) - This directory contains multiple designs spanning: single core and multi-core (whole array) matrix-matrix multiplication, and matrix-vector multiplication designs. It also contains sweep infrastructure for benchmarking. +* [Vector Reduce Add](./vector_reduce_add) - Single tile reduction returning the `sum` of a vector. +* [Vector Reduce Max](./vector_reduce_max) - Single tile reduction returning the `max` of a vector. 
+* [Vector Reduce Min](./vector_reduce_min) - Single tile reduction returning the `min` of a vector. +* [Vector Exp](./vector_exp) - Element-wise $e^x$ using the AIE look-up table capability. +* [Matrix Scalar Add](./matrix_scalar_add) - Single tile adds a scalar constant to every element of a `16x8` matrix. +* [Matrix Multiplication](./matrix_multiplication) - Single-core, multi-core (whole array), and matrix-vector multiply designs, plus sweep benchmarking infrastructure. +* [Row Wise Bias Add](./row_wise_bias_add) - Adds a bias vector to each row of a matrix using DMA tiling. +* [Event Trace](./event_trace) - Demonstrates the AIE hardware trace unit for measuring kernel cycle counts and stall events. See also [Section 4b](../../programming_guide/section-4/section-4b/) of the programming guide. +* [Packet Switch](./packet_switch) - Demonstrates packet-switched routing for multiplexing multiple data streams over shared interconnect. +* [Tiling Exploration](./tiling_exploration) - Interactive exploration of `TensorAccessPattern` and `TensorTiler2D` for n-dimensional DMA tiling. Includes visualization tools. +* [Memcpy](./memcpy) - **Exercise design.** A parameterized multi-column memcpy with an intentionally unoptimized runtime sequence. The goal is to add task groups to achieve peak bandwidth. See [getting_started/00_memcpy](../getting_started/00_memcpy/) for the reference solution. diff --git a/programming_examples/basic/memcpy/README.md b/programming_examples/basic/memcpy/README.md index 72b71171068..376ca275174 100644 --- a/programming_examples/basic/memcpy/README.md +++ b/programming_examples/basic/memcpy/README.md @@ -1,8 +1,10 @@ ---- +--- # **Memcpy** -The `memcpy.py` design is a highly parallel, parameterized design that uses shim DMAs in every NPU column. It enables both compute and bypass modes to help you analyze performance charactaristics. 
+> **Exercise Design:** The runtime sequence in `memcpy.py` is intentionally left unoptimized — drain operations run serially rather than in parallel, which limits measured bandwidth. Your task is to restructure the runtime sequence using `task_group()` to achieve full concurrency across all columns and channels. See Step 4 below for guidance, and [getting_started/00_memcpy/memcpy.py](../../../getting_started/00_memcpy/memcpy.py) for the reference solution. + +The `memcpy.py` design is a highly parallel, parameterized implementation that uses shim DMAs in every NPU column. It enables both compute and bypass modes to help you analyze performance characteristics. --- diff --git a/programming_examples/basic/passthrough_kernel/README.md index 0ee505b996a..e75f057c112 100644 --- a/programming_examples/basic/passthrough_kernel/README.md +++ b/programming_examples/basic/passthrough_kernel/README.md @@ -10,7 +10,7 @@ # Passthrough Kernel: -This IRON design flow example, called "Passthrough Kernel", demonstrates a simple AIE implementation for vectorized memcpy on a vector of integers. In this design, a single AIE core performs the memcpy operation on a vector with a default length `4096`. The kernel is configured to work on `1024` element-sized subvectors and is invoked multiple times to complete the full copy. The example consists of two primary design files: `aie2.py` and `passThrough.cc`, and a testbench `test.cpp` or `test.py`. +This IRON design flow example, called "Passthrough Kernel", demonstrates a simple AIE implementation for vectorized memcpy on a vector of integers. In this design, a single AIE core performs the memcpy operation on a vector with a default length `4096`. The kernel is configured to work on `1024` element-sized subvectors and is invoked multiple times to complete the full copy.
The example consists of two primary design files: `passthrough_kernel.py` and `passThrough.cc`, and a testbench `test.cpp` or `test.py`. ## Source Files Overview diff --git a/programming_examples/basic/passthrough_pykernel/README.md b/programming_examples/basic/passthrough_pykernel/README.md index eb61a7fc5af..7d04d71d698 100644 --- a/programming_examples/basic/passthrough_pykernel/README.md +++ b/programming_examples/basic/passthrough_pykernel/README.md @@ -10,7 +10,7 @@ # Passthrough Kernel: -This IRON design flow example, called "Passthrough Kernel", demonstrates a simple AIE implementation for a non-vectorized (scalar) memcpy on a vector of integers. In this design, a single AIE core performs the memcpy operation on a vector with a default length `4096`. The kernel, defined in Python code as a function, is configured to work on `1024` element-sized subvectors and is invoked multiple times to complete the full copy. The example consists of two primary design files: `passthrough_pykernel.py` and `passThrough.cc`, and a testbench `test.cpp` or `test.py`. +This IRON design flow example, called "Passthrough Kernel", demonstrates a simple AIE implementation for a non-vectorized (scalar) memcpy on a vector of integers. In this design, a single AIE core performs the memcpy operation on a vector with a default length `4096`. The kernel, defined in Python code as a function, is configured to work on `1024` element-sized subvectors and is invoked multiple times to complete the full copy. The example consists of a primary design file `passthrough_pykernel.py` and a testbench `test.cpp` or `test.py`. ## Source Files Overview @@ -31,7 +31,7 @@ This IRON design flow example, called "Passthrough Kernel", demonstrates a simpl This simple example effectively passes data through a single compute tile in the NPU's AIE array. The design is described as shown in the figure to the right. The overall design flow is as follows: 1. 
An object FIFO called "of_in" connects a Shim Tile to a Compute Tile, and another called "of_out" connects the Compute Tile back to the Shim Tile. 1. The runtime data movement is expressed to read `4096` uint8_t data from host memory to the compute tile and write the `4096` data back to host memory. -1. The compute tile acquires this input data in "object" sized (`1024`) blocks from "of_in" and copies them to another output "object" it has acquired from "of_out". A scalar kernel defined via a Python fucntion is invoked on the Compute Tile's AIE core to copy the data from the input "object" to the output "object". +1. The compute tile acquires this input data in "object" sized (`1024`) blocks from "of_in" and copies them to another output "object" it has acquired from "of_out". A scalar kernel defined via a Python function is invoked on the Compute Tile's AIE core to copy the data from the input "object" to the output "object". 1. After the copy is performed, the Compute Tile releases the "objects", allowing the DMAs (abstracted by the object FIFO) to transfer the data back to host memory and copy additional blocks into the Compute Tile, "of_out" and "of_in" respectively. It is important to note that the Shim Tile and Compute Tile DMAs move data concurrently, and the Compute Tile's AIE Core also processes data concurrently with the data movement. This is made possible by expressing `depth` is `2` when constructing the `ObjectFifo`, for example, `ObjectFifo(line_ty, depth=2)` to denote ping-pong buffers. By default, the depth is `2` in recognition of this common pattern. @@ -64,7 +64,7 @@ This design performs a memcpy operation on a vector of input data. 
The AIE desig ## Usage -### Compile the desing: +### Compile the design: To compile the design: diff --git a/programming_examples/basic/vector_exp/README.md b/programming_examples/basic/vector_exp/README.md index a69468b12d0..805169200bd 100644 --- a/programming_examples/basic/vector_exp/README.md +++ b/programming_examples/basic/vector_exp/README.md @@ -46,7 +46,7 @@ env use_placed=1 make To compile the C++ testbench: ```shell -make text_exp.exe +make vector_exp.exe ``` To run the design: diff --git a/programming_examples/basic/vector_scalar_mul/README.md b/programming_examples/basic/vector_scalar_mul/README.md index 61ccf3073bb..d68a0a97eb5 100644 --- a/programming_examples/basic/vector_scalar_mul/README.md +++ b/programming_examples/basic/vector_scalar_mul/README.md @@ -22,9 +22,9 @@ This IRON design flow example, called "Vector Scalar Multiplication", demonstrat 1. `vector_scalar_mul_jit.py`: A JIT version that passes `scale.cc` to the transform algorithm. JIT compilation allows combining the host code with AIE design into one file. -1. `test.cpp`: This C++ code is a testbench for the Vector Scalar Multiplication design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the script verifies the memcpy results and optionally outputs trace data. +1. `test.cpp`: This C++ code is a testbench for the Vector Scalar Multiplication design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the testbench verifies the vector scalar multiply results against a CPU reference and optionally outputs trace data. -1. `test.py`: This Python code is a testbench for the Vector Scalar Multiplication design example. 
The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the script verifies the memcpy results and optionally outputs trace data. +1. `test.py`: This Python code is a testbench for the Vector Scalar Multiplication design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the testbench verifies the vector scalar multiply results against a CPU reference and optionally outputs trace data. ## Design Overview diff --git a/programming_examples/getting_started/00_memcpy/memcpy.py b/programming_examples/getting_started/00_memcpy/memcpy.py index 756ae5900c7..53503239ac0 100644 --- a/programming_examples/getting_started/00_memcpy/memcpy.py +++ b/programming_examples/getting_started/00_memcpy/memcpy.py @@ -122,6 +122,16 @@ def core_fn(of_in, of_out, passThroughLine): # Create a TensorAccessPattern for each channel to describe the data movement. # The pattern chops the data in equal chunks and moves them in parallel across # the columns and channels. + # + # TensorAccessPattern arguments (see programming_guide/section-2/section-2c/ + # for a full explanation of data layout transformations): + # tensor_dims : logical shape of the full transfer buffer — (1, size) + # offset : starting element index into that buffer for this chunk + # sizes : [dim3, dim2, dim1, dim0] — number of elements in each + # dimension. [1, 1, 1, chunk] means a single 1-D transfer + # of `chunk` elements (the higher dimensions are unused). + # strides : [dim3, dim2, dim1, dim0] — step between elements in each + # dimension. [0, 0, 0, 1] means contiguous (stride-1) access. 
taps = [ TensorAccessPattern( (1, size), diff --git a/programming_examples/getting_started/01_SAXPY/saxpy.cc b/programming_examples/getting_started/01_SAXPY/saxpy.cc index eda7c188fbd..7455ef6ef59 100644 --- a/programming_examples/getting_started/01_SAXPY/saxpy.cc +++ b/programming_examples/getting_started/01_SAXPY/saxpy.cc @@ -15,11 +15,11 @@ #include #include -#define REL_WRITE 0 -#define REL_READ 1 - #include +// NOTE: Both kernels below are hardcoded for N=4096 elements. The Python +// design file (saxpy.py) must be called with a tensor of exactly this size. +// Calling with any other size will produce silently incorrect results. extern "C" { void saxpy(bfloat16 *restrict x, bfloat16 *restrict y, bfloat16 *restrict z) { event0(); @@ -39,6 +39,10 @@ void saxpy(bfloat16 *restrict x, bfloat16 *restrict y, bfloat16 *restrict z) { event1(); } +// saxpy_scalar: a non-vectorized reference implementation of SAXPY. +// Useful for verifying correctness and understanding the algorithm before +// examining the vectorized version above. Can be selected from Python by +// changing the ExternalFunction name from "saxpy" to "saxpy_scalar". 
void saxpy_scalar(bfloat16 *x, bfloat16 *y, bfloat16 *z) { event0(); float a = 3.f; diff --git a/programming_examples/getting_started/01_SAXPY/saxpy.py b/programming_examples/getting_started/01_SAXPY/saxpy.py index 23ef83a3e8e..81af611bfbe 100644 --- a/programming_examples/getting_started/01_SAXPY/saxpy.py +++ b/programming_examples/getting_started/01_SAXPY/saxpy.py @@ -10,11 +10,9 @@ import os import aie.iron as iron -from aie.iron import ExternalFunction, jit -from aie.iron import Kernel, ObjectFifo, Program, Runtime, Worker +from aie.iron import ExternalFunction +from aie.iron import ObjectFifo, Program, Runtime, Worker from aie.iron.placers import SequentialPlacer -from aie.iron.controlflow import range_ -from aie.helpers.taplib import TensorAccessPattern, TensorTiler2D from aie.utils.config import cxx_header_path @@ -86,8 +84,10 @@ def core_body(of_x, of_y, of_z, saxpy_kernel): def main(): - # Define tensor shapes and data types - data_size = 2048 + # Define tensor shapes and data types. + # NOTE: saxpy.cc hardcodes the loop bound to 4096 elements. This value + # must match data_size or the kernel will produce silently wrong results. 
+ data_size = 4096 element_type = bfloat16 # Construct an input tensor and an output zeroed tensor @@ -100,21 +100,14 @@ def main(): # to the kernel will use the same compiled kernel and loaded code objects saxpy(input0, input1, output) - # Check the correctness of the result and print + # Check the correctness of the result and print any mismatches ref_vec = [3 * input0[i] + input1[i] for i in range(data_size)] errors = 0 - for index, (actual, ref) in enumerate( - zip( - output, - ref_vec, - ) - ): + for index, (actual, ref) in enumerate(zip(output, ref_vec)): if actual != ref: print(f"Error at {index}: {actual} != {ref}") errors += 1 - else: - print(f"Correct output at {index}: {actual} == {ref}") # If the result is correct, exit with a success code # Otherwise, exit with a failure code diff --git a/programming_examples/getting_started/02_vector_reduce_max/vector_reduce_max_1col.py b/programming_examples/getting_started/02_vector_reduce_max/vector_reduce_max_1col.py index ee3e330e962..bdf6d390bd5 100644 --- a/programming_examples/getting_started/02_vector_reduce_max/vector_reduce_max_1col.py +++ b/programming_examples/getting_started/02_vector_reduce_max/vector_reduce_max_1col.py @@ -60,6 +60,9 @@ def vector_reduce_max(input0, output): else: of_a_offsets = [0] + # split() distributes one large ObjectFIFO into n_cores smaller ones, + # each starting at the given offset. See programming_guide/section-2/section-2b/ + # for more detail on ObjectFIFO distribute and join patterns. in_fifos = of_in.cons().split( of_a_offsets, obj_types=[op_ty] * n_cores, @@ -94,7 +97,10 @@ def vector_reduce_max(input0, output): include_dirs=[cxx_header_path()], ) - def start_core_body(of_in, of_out, reduce_fn, nextC_buffer, tmp_buffer): + # final_core_body: runs on the last core in the cascade. This core does not + # read results from a downstream neighbor — it is the terminal node that + # writes the final maximum value to the output ObjectFIFO. 
+ def final_core_body(of_in, of_out, reduce_fn, nextC_buffer, tmp_buffer): elem_out = of_out.acquire(1) for _ in range_(num_iter): elem_in = of_in.acquire(1) @@ -142,7 +148,7 @@ def core_body(of_in, of_out, in0, reduce_fn, nextC_buffer, tmp_buffer): else: workers.append( Worker( - start_core_body, + final_core_body, fn_args=[ in_fifos[i].cons(), out_fifos[i].prod(), @@ -178,7 +184,9 @@ def main(): out_size = 4 element_type = bfloat16 - assert out_size == 4, "Output buffer must be size 4 (4 bytes = 1 integer)." + assert ( + out_size == 4 + ), "Output buffer must be size 4 (4 bytes = 2 bfloat16 elements)." in_tensor_size = in_size // element_type(0).nbytes out_tensor_size = out_size // element_type(0).nbytes @@ -192,8 +200,9 @@ def main(): # to the kernel will use the same compiled kernel and loaded code objects vector_reduce_max(input0, output) - # Check the correctness of the result and print - ref_max = 0 + # Check the correctness of the result and print. + # Initialize to -inf so the reference is correct for all-negative inputs. + ref_max = bfloat16(float("-inf")) for i in input0: if i > ref_max: ref_max = i diff --git a/programming_examples/getting_started/03_matrix_multiplication_single_core/matrix_multiplication_single_core.py b/programming_examples/getting_started/03_matrix_multiplication_single_core/matrix_multiplication_single_core.py index c8b212688b7..c6ea0e5a999 100644 --- a/programming_examples/getting_started/03_matrix_multiplication_single_core/matrix_multiplication_single_core.py +++ b/programming_examples/getting_started/03_matrix_multiplication_single_core/matrix_multiplication_single_core.py @@ -46,7 +46,9 @@ def matrix_multiplication_single_core(input0, input1, output): # The following ObjectFIFOs route m*k-, k*n-, and m*n-sized subtiles # (objects) to/from the compute cores via mem tiles, rearranging their data - # into r*s-, s*t-, and r*t-sized sub-subtiles. + # into r*s-, s*t-, and r*t-sized sub-subtiles. 
The data layout transformations + # performed by the mem tile DMAs are explained in detail in + # programming_guide/section-2/section-2c/ (Data Layout Transformations). fifo_A_L3L2 = ObjectFifo(a_ty, name="A_L3L2") tap_A_L2L1 = TensorTiler2D.group_tiler((m, k), (r, s), (m // r, k // s))[0] @@ -61,6 +63,16 @@ def matrix_multiplication_single_core(input0, input1, output): ) fifo_C_L1L2 = ObjectFifo(c_ty, name="C_L1L2") + # tap_C_L1L2 describes the inverse tiling transform that unpacks the C tile + # from the kernel's intrinsic layout (r*t sub-tiles) back to row-major order. + # TensorAccessPattern arguments (see programming_guide/section-2/section-2c/ + # for a full explanation of n-dimensional data layout transformations): + # tensor_dims : logical shape of one C tile — (m, n) + # offset : 0 (start from the beginning of the tile) + # sizes : [m//r, r, n//t, t] — iterate over (m/r) groups of r rows, + # each with (n/t) groups of t columns → visits all r*t sub-tiles + # strides : [r*n, t, r*t, 1] — step sizes matching the sub-tile layout + # produced by the MMUL kernel intrinsic tap_C_L1L2 = TensorAccessPattern( tensor_dims=(m, n), offset=0, @@ -113,7 +125,9 @@ def core_fn(of_a, of_b, of_c, matmul): # The data movement patterns from DRAM divide the input matrices (sizes # M*K, K*N) into m*k- and k*n-sized subtiles and produce output into C in # m*n-sized subtiles. Each single "task group" encompasses all data - # movement required for a single row of the output matrix. + # movement required for a single row of the output matrix. See + # programming_guide/section-2/section-2f/ for practical examples of + # multi-level (L3→L2→L1) data movement patterns. 
a_taps = TensorTiler2D.group_tiler( (M, K), (m, k), (1, K // k), pattern_repeat=(N // n) diff --git a/programming_examples/getting_started/README.md b/programming_examples/getting_started/README.md index 111f8bec710..801b5e8d33d 100644 --- a/programming_examples/getting_started/README.md +++ b/programming_examples/getting_started/README.md @@ -12,6 +12,8 @@ These programming examples provide a good starting point for those new to NPU programming with IRON, and aim to provide an overview of the IRON and NPU capabilities. All the designs are self-contained and operate on fixed problem sizes for simplicity. Please see the [programming guide](../../programming_guide/) for a more detailed guide on developing designs. +## Examples + * [Memcpy](./00_memcpy/) - This design demonstrates a highly parallel, parameterized implementation of a memcpy operation that uses shim DMAs in every NPU column with the goal to measure memory bandwidth across the full NPU and evaluate how well a design utilizes available memory bandwidth across multiple columns and channels. * [SAXPY](./01_SAXPY/) - This design demonstrates an implementation of a SAXPY operation (i.e. $Z = a*X + Y$) with both scalar and vectorized kernels. * [Vector Reduce Max](./02_vector_reduce_max/) - This design demonstrates a vector reduce max implementation using a distributed, parallel approach across multiple AIE cores in one NPU column. 
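The `sizes`/`strides` comments added in the hunks above (the contiguous memcpy chunk pattern and the matmul C-tile unpacking pattern) can be sanity-checked off-device. Below is a plain-Python sketch that emulates the four-dimensional `[dim3, dim2, dim1, dim0]` sizes/strides semantics by enumerating the flattened offsets an access pattern visits. It deliberately does not use the `aie.helpers.taplib` API, and `visit_order` is a name invented for this illustration only.

```python
from itertools import product

def visit_order(sizes, strides):
    """Flattened element offsets visited by a 4-D [dim3, dim2, dim1, dim0]
    sizes/strides access pattern, outermost dimension iterated first."""
    return [
        sum(i * s for i, s in zip(idx, strides))
        for idx in product(*(range(n) for n in sizes))
    ]

# The contiguous 1-D chunk pattern from the memcpy comment:
# sizes [1, 1, 1, chunk], strides [0, 0, 0, 1] visits offsets 0..chunk-1.
chunk = 8
assert visit_order([1, 1, 1, chunk], [0, 0, 0, 1]) == list(range(chunk))

# The C-tile unpacking pattern from the matmul design, on a tiny
# (m, n) = (4, 4) tile with (r, t) = (2, 2) sub-tiles:
# sizes [m//r, r, n//t, t], strides [r*n, t, r*t, 1].
m, n, r, t = 4, 4, 2, 2
order = visit_order([m // r, r, n // t, t], [r * n, t, r * t, 1])
# Every element of the m*n tile is visited exactly once.
assert sorted(order) == list(range(m * n))
```

Because the pattern visits each element exactly once it is a permutation of the tile, and therefore invertible, which is what allows the mem tile DMA to unpack the kernel's intrinsic layout back to row-major order.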
diff --git a/programming_examples/ml/README.md b/programming_examples/ml/README.md index f9525e3e446..9d5875a0eb8 100644 --- a/programming_examples/ml/README.md +++ b/programming_examples/ml/README.md @@ -10,14 +10,24 @@ # Machine Learning Examples -| Design name | Data type | Description | +| Design name | Data type | Description | |-|-|-| -| [Eltwise Add](../../programming_examples/ml/eltwise_add/) | bfloat16 | An element by element addition of two vectors | -| [Eltwise Mul](../../programming_examples/ml/eltwise_mul/) | i32 | An element by element multiplication of two vectors | -| [ReLU](../../programming_examples/ml/relu/) | bfloat16 | Rectified linear unit (ReLU) activation function on a vector| -| [Softmax](../../programming_examples/ml/softmax/) | bfloat16 | Softmax operation on a matrix | +| [Eltwise Add](../../programming_examples/ml/eltwise_add/) | bfloat16 | An element by element addition of two vectors | +| [Eltwise Mul](../../programming_examples/ml/eltwise_mul/) | i32 | An element by element multiplication of two vectors | +| [GeLU](../../programming_examples/ml/gelu/) | bfloat16 | Gaussian Error Linear Unit (GeLU) activation function on a vector | +| [SiLU](../../programming_examples/ml/silu/) | bfloat16 | Sigmoid Linear Unit (SiLU) activation function on a vector | +| [SwiGLU](../../programming_examples/ml/swiglu/) | bfloat16 | Swish-Gated Linear Unit (SwiGLU) activation function on a vector | +| [ReLU](../../programming_examples/ml/relu/) | bfloat16 | Rectified linear unit (ReLU) activation function on a vector| +| [Softmax](../../programming_examples/ml/softmax/) | bfloat16 | Softmax operation on a matrix | +| [LayerNorm](../../programming_examples/ml/layernorm/) | bfloat16 | Layer normalization on a matrix | +| [RMSNorm](../../programming_examples/ml/rmsnorm/) | bfloat16 | Root Mean Square layer normalization on a matrix | +| [RoPE](../../programming_examples/ml/rope/) | bfloat16 | Rotary Position Embedding on a matrix | +| [Scale 
Shift](../../programming_examples/ml/scale_shift/) | bfloat16 | Element-wise scale (multiply) and shift (add) on vectors | | [Conv2D](../../programming_examples/ml/conv2d) | i8 | A single core 2D convolution for CNNs | +| [Conv2D 14x14](../../programming_examples/ml/conv2d_14x14) | i8 | A multi-core 2D convolution for 14x14 feature maps | | [Conv2D+ReLU](../../programming_examples/ml/conv2d_fused_relu) | i8 | A Conv2D with a ReLU fused at the vector register level | |[Bottleneck](../../programming_examples/ml/bottleneck/)|ui8|A Bottleneck Residual Block is a variant of the residual block that utilizes three convolutions, using 1x1, 3x3, and 1x1 filter sizes, respectively. The implementation features fusing of multiple kernels and dataflow optimizations, highlighting the unique architectural capabilities of AI Engines| |[ResNet](../../programming_examples/ml/resnet/)|ui8|ResNet with offloaded conv2_x layers. The implementation features depth-first implementation of multiple bottleneck blocks across multiple NPU columns.| +|[Magika](../../programming_examples/ml/magika/)|bfloat16|Magika file-type detection model inference on the NPU.| +|[Block Datatypes](../../programming_examples/ml/block_datatypes/)|various|Examples demonstrating block floating point and other block datatypes on the NPU.| diff --git a/programming_examples/ml/conv2d_14x14/README.md b/programming_examples/ml/conv2d_14x14/README.md index a44d18a15dd..59e5946d837 100644 --- a/programming_examples/ml/conv2d_14x14/README.md +++ b/programming_examples/ml/conv2d_14x14/README.md @@ -18,34 +18,34 @@ This optimized design is currently targeting a single AIE core and uses memtile The data layout at each stage of the design is as follows: ### Sub-kernel ([conv2dk14.cc](../../../aie_kernels/aie2p/conv2dk14.cc)) -For each vector multiply (vmul) on a strix device for uint8/int8 datatypes, we perform a 8x8x8 matrix multiplcation. 
The format of the data for each vmul is as follows: +For each vector multiply (vmul) on a strix device for uint8/int8 datatypes, we perform an 8x8x8 matrix multiplication. The format of the data for each vmul is as follows: * Inputs/Activations - {T8}{P2} * Weights - {P2}{C8} * Outputs - {T8}{C8} -Defintions +Definitions * P2 - 2 pixels consisting of rgba. So that would be {r0 g0 b0 a0} and {r1 g1 b1 a1} * T8 - 8 tiles. Tiles are a sequential notation we're using to number each of the 14x14 pixel blocks we're iterating over. So in our 896 x 896 pixel image, we have 64 x 64 tiles. The first row of tiles are then indexed as t0 .. t63. The first tile in the second row is then t64, etc. -* C8 - 8 channels. This corresponds to output channels. We do techncially have 4 input channels but we're grouping all 4 of them into a single pixel in this notation +* C8 - 8 channels. This corresponds to output channels. We do technically have 4 input channels but we're grouping all 4 of them into a single pixel in this notation Then, within our sub-kernel ([conv2dk14.cc](../../../aie_kernels/aie2p/conv2dk14.cc)) we loop over the following inputs, weights and outputs: * Inputs - {T/8}{P/P2}{T8}{P2} - 12,544 bytes * Weights - {C/8}{P/P2}{P2}{C8} - 12,544 bytes * Outputs - {C/8}{T/8}{T8}{C8} - 256 bytes -Defintions -* P/P2 - Total pixels divided by 2 or the remaining pixels. In this design, we have a kernel size of 14x14 or 196 pixels. Since the lowest dimenioni is every 2 pixels, on the outer dimension, we iterate of 196/2 or 98 +Definitions +* P/P2 - Total pixels divided by 2 or the remaining pixels. In this design, we have a kernel size of 14x14 or 196 pixels. Since the lowest dimension is every 2 pixels, on the outer dimension, we iterate over 196/2 or 98 * T/8 - Remaining tiles. Our sub-kernel operates on 16 tiles so T/8 = 16/8 = 2. * C/8 - Remaining channels. Our sub-kernel operates on 16 channels so C/8 = 2.
### Kernel ([conv2dk14_placed.py](./conv2dk14_placed.py)) -Our kernel then loops over a set of inputs and weights while calling our sub-kernel to process 16 channels, 16 tiles and 196 pixels (14x14x4). We loop over the tile row of our image which has 64 tiles, giving us a loop size of 4 (x_blocks) since we process 16 tiles at a time. Then we loop over the tile rows of our image which is a loop size of 64. That in turn is inside a infinite loop which allows us to compute as many output channels as needed. Given that we compute 16 output channels each iteration of the kernel body, we would iterate 72 times (1152/ 16) to compute the results for all output chanenls. +Our kernel then loops over a set of inputs and weights while calling our sub-kernel to process 16 channels, 16 tiles and 196 pixels (14x14x4). We loop over the tile row of our image which has 64 tiles, giving us a loop size of 4 (x_blocks) since we process 16 tiles at a time. Then we loop over the tile rows of our image which is a loop size of 64. That in turn is inside an infinite loop which allows us to compute as many output channels as needed. Given that we compute 16 output channels each iteration of the kernel body, we would iterate 72 times (1152 / 16) to compute the results for all output channels. ### Memtiles and Top-level We use the memtile primarily to buffer DDR reads but also to leverage the layout transformation of the memtile DMA. * Inputs - an entire tile row or 4 x 12,544 bytes = 50,176 bytes. For the inputs, we assume the data is arranged as YXC where each pixel has 4 channels or rgba, then ordered by image width (896) and image height (896). We use the 2 levels of DMA layout transformation to arrange the data into the {T/8}{P/P2}{T8}{P2} format used by the kernel * Weights - Not stored in memtile as weights are assumed to be arranged in the correct layout format which is {C/8}{P/P2}{P2}{C8}
The output has a partial layout transformation in that it transforms it into {C/16}YX{C16}. This means the lowest dimenion of C16 is 16 channels. In the case of the memtile, that's all we store so it's only YX{C16}. But since we continue to push data via the inputs and weights, we continue to compute results for additional channels, the output buffer format in DDR is currently {C/16}YX{C16}. We do a transformation in our testbench ([test.py](./test.py)) to get this back to CYX to compare it to the pytorch golden data. +* Outputs - Full output size (64x64) for 16 channels or 65,536 bytes. The output undergoes a partial layout transformation into {C/16}YX{C16}, meaning the lowest dimension, C16, is 16 channels. In the case of the memtile, that's all we store, so it's only YX{C16}. But since we continue to push inputs and weights and compute results for additional channels, the output buffer format in DDR ends up as {C/16}YX{C16}. We do a transformation in our testbench ([test.py](./test.py)) to get this back to CYX to compare it to the PyTorch golden data. ## Source Files Overview diff --git a/programming_examples/ml/eltwise_add/README.md b/programming_examples/ml/eltwise_add/README.md index a7589e159a2..6b6f16ef7ed 100644 --- a/programming_examples/ml/eltwise_add/README.md +++ b/programming_examples/ml/eltwise_add/README.md @@ -10,16 +10,16 @@ # Eltwise Addition -This design implements a `bfloat16` based element-wise addtiplication between two vectors, performed in parallel on two cores in a single column. Element-wise addtiplication usually ends up being I/O bound due to the low compute intensity. In a practical ML implementation, it is an example of the type of kernel that is likely best fused onto another more compute-dense kernel (e.g., a convolution or GEMM). +This design implements a `bfloat16` based element-wise addition between two vectors, performed in parallel on two cores in a single column.
Element-wise addition usually ends up being I/O bound due to the low compute intensity. In a practical ML implementation, it is an example of the type of kernel that is likely best fused onto another more compute-dense kernel (e.g., a convolution or GEMM). ## Source Files Overview 1. `eltwise_add.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc` to produce design binaries (i.e., XCLBIN and inst.bin for the NPU in Ryzen™ AI). -1. `add.cc`: A C++ implementation of a vectorized vector addtiplication operation for AIE cores. The code uses the AIE API, which is a C++ header-only library providing types and operations that get translated into efficient low-level intrinsics, and whose documentation can be found [here](https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_api/aie_api/doc/index.html). The source can be found [here](../../../aie_kernels/aie2/add.cc). +1. `add.cc`: A C++ implementation of a vectorized vector addition operation for AIE cores. The code uses the AIE API, which is a C++ header-only library providing types and operations that get translated into efficient low-level intrinsics, and whose documentation can be found [here](https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_api/aie_api/doc/index.html). The source can be found [here](../../../aie_kernels/aie2/add.cc). -1. `test.cpp`: This C++ code is a testbench for the design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the script verifies the memcpy results and optionally outputs trace data. +1. `test.cpp`: This C++ code is a testbench for the design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU.
After executing, the testbench verifies the element-wise addition results against a CPU reference and optionally outputs trace data. ## Usage diff --git a/programming_examples/ml/gelu/README.md b/programming_examples/ml/gelu/README.md new file mode 100644 index 00000000000..60783c7b1dc --- /dev/null +++ b/programming_examples/ml/gelu/README.md @@ -0,0 +1,43 @@ + + +# GeLU + +GeLU (Gaussian Error Linear Unit) is an activation function widely used in transformer-based models such as BERT and GPT. It is defined as: + +$$\text{GeLU}(x) = x \cdot \Phi(x)$$ + +where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution. In practice a fast approximation is commonly used: + +$$\text{GeLU}(x) \approx 0.5 \cdot x \cdot \left(1 + \tanh\!\left(\sqrt{2/\pi}\,(x + 0.044715\,x^3)\right)\right)$$ + +This design implements a `bfloat16` based GeLU on a vector, distributed in parallel across multiple AIE cores and NPU columns. Like other element-wise activation functions, GeLU is I/O bound due to its low compute intensity relative to data movement. + +## Source Files Overview + +1. `gelu.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (i.e., XCLBIN and inst.bin for the NPU in Ryzen™ AI). + +1. `gelu_bf16` kernel: A vectorized C++ implementation of GeLU for AIE cores, compiled into `kernels.a`. The kernel operates on 1024-element `bfloat16` chunks per invocation. + +1. `test.cpp`: This C++ code is a testbench for the design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the testbench verifies the GeLU results against a CPU reference. 
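+As a quick CPU-side sanity check (plain NumPy, separate from the AIE kernel), the tanh approximation above can be compared against the exact erf-based definition:

```python
import math
import numpy as np

# Exact GeLU: x * Phi(x), with Phi the standard normal CDF (via erf).
def gelu_exact(x):
    return np.array([v * 0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])

# Fast tanh-based approximation, as given in the formula above.
def gelu_tanh(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4.0, 4.0, 1001)
err = np.max(np.abs(gelu_exact(x) - gelu_tanh(x)))
assert err < 1e-2  # the approximation stays close to exact GeLU on this range
```

The same kind of reference computation is what `test.cpp` uses (in `bfloat16` precision) to validate the NPU output.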
+ +## Usage + +### C++ Testbench + +To compile the design and C++ testbench: +```shell +make +``` + +To run the design: +```shell +make run +``` diff --git a/programming_examples/ml/relu/README.md b/programming_examples/ml/relu/README.md index 77b77521e79..b84acb82e70 100644 --- a/programming_examples/ml/relu/README.md +++ b/programming_examples/ml/relu/README.md @@ -28,7 +28,7 @@ This design implements a `bfloat16` based ReLU on a vector, performed in paralle 1. `relu.cc`: A C++ implementation of a vectorized ReLU operation for AIE cores, which is a 1:1 implementation of the inherent function using low-level intrinsics. The AIE2 allows an element-wise max of 32 `bfloat16` numbers against a second vector register containing all zeros, implementing the $ReLU(x) = max(0,x)$ function directly. The source can be found [here](../../../aie_kernels/aie2/relu.cc). -1. `test.cpp`: This C++ code is a testbench for the design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the script verifies the memcpy results and optionally outputs trace data. +1. `test.cpp`: This C++ code is a testbench for the design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the testbench verifies the ReLU results against a CPU reference and optionally outputs trace data. ## Usage diff --git a/programming_examples/ml/resnet/layers_conv2_x/resnet.py b/programming_examples/ml/resnet/layers_conv2_x/resnet.py index a2cfe658bc2..18f90865198 100644 --- a/programming_examples/ml/resnet/layers_conv2_x/resnet.py +++ b/programming_examples/ml/resnet/layers_conv2_x/resnet.py @@ -537,23 +537,23 @@ def conv1_skip_fn( # Set runtime parameters def set_rtps(rtp): + # Only set RTPs for tiles that actually read them (conv1_fn and conv1_skip_fn + # workers). 
conv2_fn workers use a hardcoded scale=1 and have no RTP arg, + # so their corresponding buffers are never placed/resolved. + # col 0: conv1_fn at Tile(0,2) → rtp[0][0]; conv1_skip_fn at Tile(0,4) → rtp[0][3] rtp[0][0][0] = 1 - rtp[0][1][0] = 1 - rtp[0][2][0] = 1 rtp[0][3][0] = 1 rtp[0][3][1] = 0 rtp[0][3][2] = 1 + # col 1: conv1_fn at Tile(1,5) → rtp[1][3]; conv1_skip_fn at Tile(1,3) → rtp[1][1] rtp[1][3][0] = 1 - rtp[1][2][0] = 1 - rtp[1][0][0] = 1 rtp[1][1][0] = 1 rtp[1][1][1] = 0 + # col 2: conv1_fn at Tile(2,2) → rtp[2][0]; conv1_skip_fn at Tile(2,4) → rtp[2][2] rtp[2][0][0] = 1 - rtp[2][1][0] = 1 - rtp[2][3][0] = 1 rtp[2][2][0] = 1 rtp[2][2][1] = 0 diff --git a/programming_examples/ml/scale_shift/README.md b/programming_examples/ml/scale_shift/README.md index 163c4659ac4..6b3c423441f 100644 --- a/programming_examples/ml/scale_shift/README.md +++ b/programming_examples/ml/scale_shift/README.md @@ -8,7 +8,7 @@ // //===----------------------------------------------------------------------===//--> -# Eltwise Multiplication +# Scale Shift This design implements a `bfloat16` based element-wise multiplication followed by an element-wise addition of three vectors, performed in parallel on two cores in a single column. Element-wise multiplication and addition usually is I/O bound due to the low compute intensity. In a practical ML implementation, this is an example of the type of simple kernel fusion passing intermediate results in DDR. @@ -19,7 +19,7 @@ This design implements a `bfloat16` based element-wise multiplication followed b 1. `scale_shift.cc`: A C++ implementation of a vectorized vector multiplication and addition operations for AIE cores. The code uses the AIE API, which is a C++ header-only library providing types and operations that get translated into efficient low-level intrinsics. The source can be found [here](../../../aie_kernels/aie2/scale_shift.cc). The parameter `is_mul` is used to switch between the two operations at runtime. -1. 
`test.cpp`: This C++ code is a testbench for the design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the script verifies the memcpy results and optionally outputs trace data. +1. `test.cpp`: This C++ code is a testbench for the design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the testbench verifies the scale-shift results against a CPU reference and optionally outputs trace data. ## Usage diff --git a/programming_examples/ml/silu/README.md b/programming_examples/ml/silu/README.md new file mode 100644 index 00000000000..fcd6e3874bd --- /dev/null +++ b/programming_examples/ml/silu/README.md @@ -0,0 +1,41 @@ + + +# SiLU + +SiLU (Sigmoid Linear Unit), also known as the Swish activation function, is defined as: + +$$\text{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$$ + +where $\sigma(x)$ is the sigmoid function. SiLU is used as an activation function in models such as EfficientNet and various vision transformers. It is smooth and non-monotonic, which can improve training dynamics compared to ReLU. + +This design implements a `bfloat16` based SiLU on a vector, distributed in parallel across multiple AIE cores and NPU columns. Like other element-wise activation functions, SiLU is I/O bound due to its low compute intensity relative to data movement. + +## Source Files Overview + +1. `silu.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (i.e., XCLBIN and inst.bin for the NPU in Ryzen™ AI). + +1. `silu_bf16` kernel: A vectorized C++ implementation of SiLU for AIE cores, compiled into `kernels.a`. The kernel operates on 1024-element `bfloat16` chunks per invocation. + +1. 
`test.cpp`: This C++ code is a testbench for the design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the testbench verifies the SiLU results against a CPU reference. + +## Usage + +### C++ Testbench + +To compile the design and C++ testbench: +```shell +make +``` + +To run the design: +```shell +make run +``` diff --git a/programming_examples/ml/swiglu/README.md b/programming_examples/ml/swiglu/README.md new file mode 100644 index 00000000000..0152b7ffbe1 --- /dev/null +++ b/programming_examples/ml/swiglu/README.md @@ -0,0 +1,41 @@ + + +# SwiGLU + +SwiGLU (Swish-Gated Linear Unit) is a gated activation function used in large language models such as LLaMA and PaLM. It is defined as: + +$$\text{SwiGLU}(x, W, V) = \text{SiLU}(xW) \otimes (xV)$$ + +where $\otimes$ denotes element-wise multiplication, $W$ and $V$ are two separate weight projections, and $\text{SiLU}(x) = x \cdot \sigma(x)$ is the Sigmoid Linear Unit. In practice the input gate and the linear projection are stored as two halves of a single weight matrix. + +This design implements a `bfloat16` based SwiGLU on a vector, distributed in parallel across multiple AIE cores and NPU columns. The design accepts two input vectors (the gated and linear projections) and produces one output vector. Unlike single-input activation functions such as ReLU or GeLU, SwiGLU requires two simultaneous input streams per core, reflected in the two-ObjectFIFO input structure of this design. + +## Source Files Overview + +1. `swiglu.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (i.e., XCLBIN and inst.bin for the NPU in Ryzen™ AI). + +1. `swiglu_bf16` kernel: A vectorized C++ implementation of SwiGLU for AIE cores, compiled into `kernels.a`. 
The kernel accepts two 1024-element `bfloat16` input chunks per invocation (the gated and linear projections). + +1. `test.cpp`: This C++ code is a testbench for the design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the testbench verifies the SwiGLU results against a CPU reference. + +## Usage + +### C++ Testbench + +To compile the design and C++ testbench: +```shell +make +``` + +To run the design: +```shell +make run +``` diff --git a/programming_examples/mlir/README.md b/programming_examples/mlir/README.md new file mode 100644 index 00000000000..06784cfc7cb --- /dev/null +++ b/programming_examples/mlir/README.md @@ -0,0 +1,25 @@ + + +# MLIR Examples + +These examples illustrate how AIE designs are expressed at the MLIR level, which is the intermediate representation that the Python IRON API compiles down to. Reading these files alongside the [programming guide](../../programming_guide/) can provide insight into what the higher-level abstractions generate. + +## Examples + +* [MM_2x2](./MM_2x2/) - Matrix multiplication mapped onto a 2×2 array of AIE cores, in circuit-switched, packet-switched, and ObjectFIFO variants. Targets Versal VCK5000. + +* [horizontal_diffusion](./horizontal_diffusion/) - Implementation of the horizontal diffusion stencil computation from the COSMO atmospheric model, demonstrating multi-core data streaming across AIE tiles. Published at ICS 2023. Targets Versal hardware. + +* [autocorrelation](./autocorrelation/) - Autocorrelation of a signal vector across AIE cores using explicit DMA programming. + +* [idct](./idct/) - Inverse Discrete Cosine Transform (IDCT) kernel for image/video processing. + +* [prime_sieve_large](./prime_sieve_large/) - Sieve of Eratosthenes for finding prime numbers, demonstrating a dataflow pipeline across multiple AIE tiles.
diff --git a/programming_examples/vision/color_threshold/README.md b/programming_examples/vision/color_threshold/README.md index ad8613544ab..da11e3a279e 100644 --- a/programming_examples/vision/color_threshold/README.md +++ b/programming_examples/vision/color_threshold/README.md @@ -20,11 +20,11 @@ The pipeline is mapped onto a single column of the npu device, with one Shim til width="750">

-The data movement of this pipeline is described using the OrderedObjectBuffer (OOB) primitive. The input image is brought into the array via Shim tile (0, 0) and first sent to Mem tile (0, 1). There it is split into smaller blocks of data and each block is distributed to one of the 4 AIE tiles (0, 2) to (0, 5). One OOB is used to express data movement from the Shim tile to the Mem tile. Four different OOBs express the one-to-one data movements between the Mem tile and each of the compute tiles. The input OOB is linked to the other four OOBs to express that data from the input OOB should be copied implicitly to the other OOBs via the DMA. Currently, the ordering of the four OOBs in the Link operation expresses what piece of input data should go to each compute tile. +The data movement of this pipeline is described using the ObjectFIFO primitive. The input image is brought into the array via Shim tile (0, 0) and first sent to Mem tile (0, 1). There it is split into smaller blocks of data and each block is distributed to one of the 4 AIE tiles (0, 2) to (0, 5). One ObjectFIFO is used to express data movement from the Shim tile to the Mem tile. Four different ObjectFIFOs express the one-to-one data movements between the Mem tile and each of the compute tiles. The input ObjectFIFO is linked to the other four ObjectFIFOs to express that data from the input ObjectFIFO should be copied implicitly to the other ObjectFIFOs via the DMA. Currently, the ordering of the four ObjectFIFOs in the Link operation expresses what piece of input data should go to each compute tile. -Each AIE tile applies a threshold kernel on its data and sends its result back to the Mem tile, this is represented by one OOB for each compute tile. The results are then joined back together in the Mem tile and sent back to the output through the Shim tile. 
This is again described using a Link operation in which the ordering of the input OOBs expresses how the different results should be joined together before being sent to the output OOB, to the Shim tile. +Each AIE tile applies a threshold kernel on its data and sends its result back to the Mem tile; this is represented by one ObjectFIFO for each compute tile. The results are then joined back together in the Mem tile and sent back to the output through the Shim tile. This is again described using a Link operation in which the ordering of the input ObjectFIFOs expresses how the different results should be joined together before being sent to the output ObjectFIFO, to the Shim tile. -To compile desing in Windows: +To compile design in Windows: ``` make make colorThreshold.exe diff --git a/programming_examples/vision/edge_detect/README.md index 016546ba9e8..470b21a1d11 100644 --- a/programming_examples/vision/edge_detect/README.md +++ b/programming_examples/vision/edge_detect/README.md @@ -22,9 +22,9 @@ The pipeline is mapped onto a single column of the npu device, with one Shim til width="1050">

-The data movement of this pipeline is described using the OrderedObjectBuffer (OOB) primitive. Input data is brought into the array via the Shim tile. The data then needs to be broadcasted both to AIE tile (0, 2) and AIE tile (0, 5). However, tile (0, 5) has to wait for additional data from the other kernels before it can proceed with its execution, so in order to avoid any stalls in the broadcast, data for tile (0, 5) is instead buffered in the Mem tile. Because of the size of the data, the buffering couldn't directly be done in the smaller L1 memory module of tile (0, 5). This is described using two OOBs, one for the broadcast to tile (0, 2) and the Mem tile, and one for the data movement between the Mem tile and tile (0, 5). The two OOBs are linked to express that data from the first OOB should be copied to the second OOB implicitly through the Mem tile's DMA. +The data movement of this pipeline is described using the ObjectFIFO primitive. Input data is brought into the array via the Shim tile. The data then needs to be broadcasted both to AIE tile (0, 2) and AIE tile (0, 5). However, tile (0, 5) has to wait for additional data from the other kernels before it can proceed with its execution, so in order to avoid any stalls in the broadcast, data for tile (0, 5) is instead buffered in the Mem tile. Because of the size of the data, the buffering couldn't directly be done in the smaller L1 memory module of tile (0, 5). This is described using two ObjectFIFOs, one for the broadcast to tile (0, 2) and the Mem tile, and one for the data movement between the Mem tile and tile (0, 5). The two ObjectFIFOs are linked to express that data from the first ObjectFIFO should be copied to the second ObjectFIFO implicitly through the Mem tile's DMA. -Starting from tile (0, 2) data is processed by each compute tile and the result is sent to the next tile. This is described by a series of one-to-one OOBs. 
As the two kernels `gray2rgba` and `addWeighted` are mapped together on AIE tile (0, 5), an OOB is also created with tile (0, 5) being both its source and destination to describe the data movement between the two kernels. Finally, the output is sent from tile (0, 5) to the Mem tile and finally back to the output through the Shim tile. +Starting from tile (0, 2), data is processed by each compute tile and the result is sent to the next tile. This is described by a series of one-to-one ObjectFIFOs. As the two kernels `gray2rgba` and `addWeighted` are mapped together on AIE tile (0, 5), an ObjectFIFO is also created with tile (0, 5) being both its source and destination to describe the data movement between the two kernels. Finally, the output is sent from tile (0, 5) to the Mem tile and then back to the output through the Shim tile. To compile the design: ```shell
Section 4 - Peformance Measurement & Vector Programming +
Section 4 - Performance Measurement & Vector Programming * Introduce performance measurement (timers, trace) * Discuss topic of vector programming at the kernel level diff --git a/programming_guide/mini_tutorial/README.md b/programming_guide/mini_tutorial/README.md index 0134d77ab37..dad421a33ab 100644 --- a/programming_guide/mini_tutorial/README.md +++ b/programming_guide/mini_tutorial/README.md @@ -1,5 +1,7 @@ # IRON Mini Tutorial +> **Prerequisites:** These exercises require a physical Ryzen AI NPU (Phoenix/npu1 or Strix/npu2) with XRT installed, and the mlir-aie environment set up (see [docs/Building.md](../../docs/Building.md)). The `@iron.jit` decorator automatically detects your hardware — no manual device selection is needed. Each exercise is run directly with `python3 .py`. + ## Key Components: Workers, ObjectFifos, Runtime IRON provides an unplaced (deferred placement) [API](../../python/iron/) for NPU programming. Below are examples describing AIE compute code and the Object FIFO data movement primitive: @@ -36,7 +38,7 @@ of_in = ObjectFifo(data_ty, name="in") # default depth is 2 ``` More on the Object FIFO in [Section 2a](../section-2/section-2a/README.md) of the programming guide and in the [objectfifo.py](../../python/iron/dataflow/objectfifo.py). -The IRON code [example](./aie2p.py) in this mini tutorial details the different parts of an IRON design. More information on the Runtime in particular can be found in [Section 2d](../section-2/section-2d/README.md) of the programming guide. +The IRON code [example](./aie2.py) in this mini tutorial details the different parts of an IRON design. More information on the Runtime in particular can be found in [Section 2d](../section-2/section-2d/README.md) of the programming guide. ## Exercises 1. Familiarize yourself with [exercise_1](./exercise_1/exercise_1.py). The code contains a single Worker which has an already instantiated local buffer that it sends out to external memory. 
Run `python3 exercise_1.py` to run the program and verify the output. diff --git a/programming_guide/mini_tutorial/aie2p.py b/programming_guide/mini_tutorial/aie2.py similarity index 99% rename from programming_guide/mini_tutorial/aie2p.py rename to programming_guide/mini_tutorial/aie2.py index d038edbbe17..3715bd0b214 100644 --- a/programming_guide/mini_tutorial/aie2p.py +++ b/programming_guide/mini_tutorial/aie2.py @@ -1,4 +1,4 @@ -# aie2p.py -*- Python -*- +# aie2.py -*- Python -*- # # This file is licensed under the Apache License v2.0 with LLVM Exceptions. # See https://llvm.org/LICENSE.txt for license information. diff --git a/programming_guide/mini_tutorial/run.lit b/programming_guide/mini_tutorial/run.lit index 228e7984aeb..beb1f63e4ea 100644 --- a/programming_guide/mini_tutorial/run.lit +++ b/programming_guide/mini_tutorial/run.lit @@ -3,5 +3,5 @@ // // REQUIRES: ryzen_ai, peano // -// RUN: %run_on_npu1% python3 %S/aie2p.py -// RUN: %run_on_npu2% python3 %S/aie2p.py +// RUN: %run_on_npu1% python3 %S/aie2.py +// RUN: %run_on_npu2% python3 %S/aie2.py diff --git a/programming_guide/section-1/README.md b/programming_guide/section-1/README.md index 56035c51e03..e4b605bd9e1 100644 --- a/programming_guide/section-1/README.md +++ b/programming_guide/section-1/README.md @@ -41,7 +41,7 @@ class Worker(ObjectFifoEndpoint): ``` In our simple design there is only one Worker which will perform the `core_fn` routine. The compute routine iterates over a data buffer and initializes each entry to zero. The compute routine in this case has no inputs other than a handle to the buffer. As we will see in the next section of the guide, computational tasks usually run on data that is brought into the AIE array from external memory and the output produced is sent back out. Note that in this example design the Worker is explicitly placed on a Compute tile with coordinates (0,2) in the AIE array. 
```python -buffer = LocalBuffer(data_ty, name="buff") +buffer = Buffer(data_ty, name="buff") # Task for the worker to perform def core_fn(buff): @@ -94,11 +94,11 @@ Then we declare a structural design function that will expand into MLIR code whe def mlir_aie_design(): <... AI Engine device, blocks, and connections ...> ``` -Let's look at how we declare the AI Engine device, blocks, and connections. We start off by declaring our AIE device via `@device(AIEDevice.npu1_1col)` or `@device(AIEDevice.npu2)`. The blocks and connections themselves will then be declared inside the `def device_body():`. Here, we instantiate our AI Engine blocks, which are AIE compute tiles in this first example. +Let's look at how we declare the AI Engine device, blocks, and connections. We start off by declaring our AIE device via `@device(AIEDevice.npu1)` or `@device(AIEDevice.npu2)`. The blocks and connections themselves will then be declared inside the `def device_body():`. Here, we instantiate our AI Engine blocks, which are AIE compute tiles in this first example. The arguments for the tile declaration are the tile coordinates (column, row). We assign each declared tile to a variable in our Python program. -> **NOTE:** The actual tile coordinates used on the device when the program is run may deviate from the ones declared here. For example, on the NPU on Ryzen™ AI (`@device(AIEDevice.npu)`), these coordinates tend to be relative coordinates as the runtime scheduler may assign it to a different available column during runtime. +> **NOTE:** The actual tile coordinates used on the device when the program is run may deviate from the ones declared here. For example, on the NPU on Ryzen™ AI (`@device(AIEDevice.npu1)`), these coordinates tend to be relative coordinates as the runtime scheduler may assign it to a different available column during runtime. 
```python # Device declaration - here using aie2 device NPU diff --git a/programming_guide/section-2/section-2a/README.md b/programming_guide/section-2/section-2a/README.md index 8d792bf5173..42c1860e50c 100644 --- a/programming_guide/section-2/section-2a/README.md +++ b/programming_guide/section-2/section-2a/README.md @@ -38,7 +38,7 @@ class ObjectFifo(Resolvable): ``` The Object FIFO functions as an ordered buffer that has a count of `depth` objects; by default it is set to `2` which represents double or ping-pong buffering. All objects in an Object FIFO have to be of the same `obj_type` datatype. The datatype is a tensor-like attribute where the size of the tensor and the type of the individual elements are specified at the same time (i.e. `np.ndarray[(16,), np.dtype[np.int32]]`). The `name` input must be unique and can either be given by the user or left empty for the compiler to complete. It is required for subsequent lowering steps in the compiler flow. -As it traverses the AIE array, data can be restructured using the capabilities of Direct Memory Access channels (DMAs). These components are explained in more detail [here](./README.md#advanced-topic-data-movement-accelerators), however as a quick introduction, DMAs exist at every tile in the array and they are responsible for taking data arriving on the AXI stream interconnect and writing it into the tile's local memory, and inversely. DMAs can be given access patterns to express the order in which data should be sent onto the AXI stream by the Object FIFO's producer (using the `dims_to_stream` input) or read from it by each consumer (using the `dims_from_stream_per_cons` input). These inputs have their own dedicated section (see Data Layout Transformations in [section-2c](../section-2c/README.md#data-layout-transformations)). The `plio` input can be used when one of the Object FIFO's endpoints is a Shim tile to indicate to the compiler that the communication should be wired through a dedicated `plio` port. 
+As it traverses the AIE array, data can be restructured using the capabilities of Direct Memory Access channels (DMAs). These components are explained in more detail [here](./README.md#advanced-topic-direct-memory-access-channels); as a quick introduction, DMAs exist at every tile in the array, and they are responsible for taking data arriving on the AXI stream interconnect and writing it into the tile's local memory, and vice versa. DMAs can be given access patterns to express the order in which data should be sent onto the AXI stream by the Object FIFO's producer (using the `dims_to_stream` input) or read from it by each consumer (using the `dims_from_stream_per_cons` input). These inputs have their own dedicated section (see Data Layout Transformations in [section-2c](../section-2c/README.md#data-layout-transformations)). The `plio` input can be used when one of the Object FIFO's endpoints is a Shim tile to indicate to the compiler that the communication should be wired through a dedicated `plio` port. Below is an example of how to initialize an Object FIFO named `in` of datatype `<256xi32>` with depth `3`: ```python @@ -50,7 +50,7 @@ line_type = np.ndarray[(line_size,), np.dtype[np.int32]] of_in = ObjectFifo(line_type, name="in", depth=3) ``` -Object FIFO endpoints are separated into producers and consumers, where an Object FIFO may only have one producer and one or multiple consumers. +Object FIFO endpoints are separated into producers and consumers, where an Object FIFO may only have one producer and one or multiple consumers.
These endpoints are also referred to as the "actors" of the Object FIFO, based on dataflow theory terminology. At this level of abstraction the endpoints are typically Workers that have access to `ObjectFifoHandle`s, with one other use case being when an Object FIFO is filled from or drained to external memory at runtime (as explained in the Runtime Data Movement [section](../section-2d/README.md)). The code snippet below shows two Workers running processes defined by `core_fn` and `core_fn2` which take as input a producer or a consumer handle for `of_in` respectively: ```python @@ -109,7 +109,7 @@ Some of the inputs are the same as they were at the higher level, while the othe Just like at the highest level of abstraction, the Object FIFO functions as an ordered buffer that has a count of `depth` objects of specified `datatype`. Currently, all objects in an Object FIFO have to be of the same datatype. The `datatype` is a tensor-like attribute where the size of the tensor and the type of the individual elements are specified at the same time (i.e. `<16xi32>`). Unlike before, the `depth` can be defined as either an integer or an array of integers. The latter is explained further down in this section. -An Object FIFO is created between a producer, or source tile, and a consumer, or destination tile. The tiles are where producer and consumer processes accessing the Object FIFO will be executed. These processes are also refered to as the "actors" of the Object FIFO, based on dataflow theory terminology. Below, you can see an example where `of_in` is created between producer tile A and consumer tile B with depth `3`: +An Object FIFO is created between a producer, or source tile, and a consumer, or destination tile. The tiles are where producer and consumer processes accessing the Object FIFO will be executed. These processes are also referred to as the "actors" of the Object FIFO, based on dataflow theory terminology. 
Below, you can see an example where `of_in` is created between producer tile A and consumer tile B with depth `3`: ```python A = tile(1, 3) B = tile(2, 4) @@ -229,7 +229,7 @@ def core_fn(of_in, of_out, test_func, test_func2): of_in.release(1) elemOut = of_out.acquire(1) - test_func2(elemIn, line_size) + test_func2(elemOut, line_size) of_out.release(1) # Create workers to perform the tasks diff --git a/programming_guide/section-2/section-2d/DMATasks.md b/programming_guide/section-2/section-2d/DMATasks.md index 695ad1739d5..1311fbff28b 100644 --- a/programming_guide/section-2/section-2d/DMATasks.md +++ b/programming_guide/section-2/section-2d/DMATasks.md @@ -8,16 +8,16 @@ // //===----------------------------------------------------------------------===//--> -# Section 2g - Runtime Data Movement +# Section 2d - Runtime Data Movement * [Section 2 - Data Movement (Object FIFOs)](../../section-2/) * [Section 2a - Introduction](../section-2a/) * [Section 2b - Key Object FIFO Patterns](../section-2b/) * [Section 2c - Data Layout Transformations](../section-2c/) - * [Section 2d - Programming for multiple cores](../section-2d/) - * [Section 2e - Practical Examples](../section-2e/) - * [Section 2f - Data Movement Without Object FIFOs](../section-2f/) - * Section 2g - Runtime Data Movement + * Section 2d - Runtime Data Movement + * [Section 2e - Programming for multiple cores](../section-2e/) + * [Section 2f - Practical Examples](../section-2f/) + * [Section 2g - Data Movement Without Object FIFOs](../section-2g/) ----- @@ -153,7 +153,7 @@ def shim_dma_single_bd_task( issue_token: bool = False, ) ``` -- **`alloc`**: The `alloc` argument associates the DMA task with an ObjectFIFO. This argument is called `alloc` becuase the shim-side end of a data transfer (specifically a channel on a shim tile) is referenced through a so-called "shim DMA allocation". 
When an ObjectFIFO is created with a Shim Tile endpoint, an allocation with the same name as the ObjectFIFO is automatically generated. +- **`alloc`**: The `alloc` argument associates the DMA task with an ObjectFIFO. This argument is called `alloc` because the shim-side end of a data transfer (specifically a channel on a shim tile) is referenced through a so-called "shim DMA allocation". When an ObjectFIFO is created with a Shim Tile endpoint, an allocation with the same name as the ObjectFIFO is automatically generated. - **`mem`**: Reference to a host buffer, given as an argument to the sequence function, that this transfer will read from or write to. - **`tap`** (optional): A `TensorAccessPattern` is an alternative method of specifying `offset`/`sizes`/`strides` for determining an access pattern over the `mem` buffer. - **`offset`** (optional): Starting point for the data transfer. Default value is `0`. @@ -162,7 +162,7 @@ def shim_dma_single_bd_task( - **`issue_token`** (optional): If a token is issued, one may call `dma_await_task` on the returned task. Default is `False`. - **`burst_length`** (optional): The configuration of the burst length for the DMA task. If `0`, defaults to the highest available value. -The strides and strides express data transformations analogously to those described in [Section 2C](../section-2c). +The strides and sizes express data transformations analogously to those described in [Section 2C](../section-2c). **Example Usage**: ```python @@ -227,7 +227,7 @@ dma_free_task(in_task, out_task) #### **Conclusion** -The `npu_dma_memcpy_nd` and `dma_wait` functions are powerful tools for managing data transfers and synchronization with AI Engines in the Ryzen™ AI NPU. By understanding and effectively implementing applications leveraging these functions, developers can enhance the performance, efficiency, and accuracy of their high-performance computing applications.
+Both the `npu_dma_memcpy_nd`/`dma_wait` interface and the `shim_dma_single_bd_task`/`dma_await_task`/`dma_free_task` interface are powerful tools for managing data transfers and synchronization with AI Engines in the Ryzen™ AI NPU. By understanding and effectively implementing applications leveraging these functions, developers can enhance the performance, efficiency, and accuracy of their high-performance computing applications. ----- [[Up](./README.md)] diff --git a/programming_guide/section-2/section-2d/RuntimeTasks.md b/programming_guide/section-2/section-2d/RuntimeTasks.md index ca33a10dbd4..196663562db 100644 --- a/programming_guide/section-2/section-2d/RuntimeTasks.md +++ b/programming_guide/section-2/section-2d/RuntimeTasks.md @@ -8,16 +8,16 @@ // //===----------------------------------------------------------------------===//--> -# Section 2g - Runtime Data Movement +# Section 2d - Runtime Data Movement * [Section 2 - Data Movement (Object FIFOs)](../../section-2/) * [Section 2a - Introduction](../section-2a/) * [Section 2b - Key Object FIFO Patterns](../section-2b/) * [Section 2c - Data Layout Transformations](../section-2c/) - * [Section 2d - Programming for multiple cores](../section-2d/) - * [Section 2e - Practical Examples](../section-2e/) - * [Section 2f - Data Movement Without Object FIFOs](../section-2f/) - * Section 2g - Runtime Data Movement + * Section 2d - Runtime Data Movement + * [Section 2e - Programming for multiple cores](../section-2e/) + * [Section 2f - Practical Examples](../section-2f/) + * [Section 2g - Data Movement Without Object FIFOs](../section-2g/) ----- @@ -196,7 +196,7 @@ with rt.sequence(data_ty, data_ty, data_ty) as (a_in, _, c_out): rt.start(*workers) tg = rt.task_group() # start first task group - for groups in [0, 1]: + for _ in [0, 1]: rt.fill(of_in.prod(), a_in, task_group=tg) rt.drain(of_out.cons(), c_out, task_group=tg, wait=True) rt.finish_task_group(tg) diff --git a/programming_guide/section-2/section-2e/README.md 
b/programming_guide/section-2/section-2e/README.md index 4a8c9de9332..83ca919c78c 100644 --- a/programming_guide/section-2/section-2e/README.md +++ b/programming_guide/section-2/section-2e/README.md @@ -42,7 +42,7 @@ of_in1 = of_in.cons().forward(obj_type=data_ty, name="in1") of_out1 = ObjectFifo(data_ty, name="out1") of_out = of_out1.cons().forward(obj_type=data_ty, name="out") ``` -For our scale out design we will keep using a single Mem tile, but we will increase the number of Workers to three. Now each Worker will receive objects of datatype `<16xi32>`. Data brought into the AIE array via `of_in` will be split into three Object FIFOs for each Worker. Similarly data produced by each Worker will be joined and sent to external memory through `of_out`. Please [see distribute and join patterns](../section-2b/03_Link_Distribute_Join/README.md) for more details. These changes result in the following code: +For our scale out design we will keep using a single Mem tile, but we will increase the number of Workers to three. Now each Worker will receive objects of datatype `<16xi32>`. Data brought into the AIE array via `of_in` will be split into three Object FIFOs for each Worker. Similarly data produced by each Worker will be joined and sent to external memory through `of_out`. Please [see distribute and join patterns](../section-2b/03_Implicit_Copy/README.md) for more details. 
These changes result in the following code: ```python n_workers = 3 data_size = 48 @@ -82,7 +82,7 @@ The Worker of this simple design acquires one object of each Object FIFO, adds ` def core_fn(of_in, of_out): elem_in = of_in.acquire(1) elem_out = of_out.acquire(1) - for _ in range_(data_size): + for i in range_(tile_size): elem_out[i] = elem_in[i] + 1 of_in.release(1) of_out.release(1) @@ -110,7 +110,7 @@ Finally, in our simple design we write a runtime sequence to bring data to/from ```python # Runtime operations to move data to/from the AIE-array rt = Runtime() -with rt.sequence(data_size, data_size, data_size) as (a_in, b_out, _): +with rt.sequence(data_ty, data_ty, data_ty) as (a_in, b_out, _): rt.start(my_worker) rt.fill(of_in.prod(), a_in) rt.drain(of_out.cons(), b_out, wait=True) @@ -119,14 +119,14 @@ The runtime sequence remains largely unchanged for the larger design except that ```python # Runtime operations to move data to/from the AIE-array rt = Runtime() -with rt.sequence(data_size, data_size, data_size) as (a_in, b_out, _): +with rt.sequence(data_ty, data_ty, data_ty) as (a_in, b_out, _): rt.start(*workers) rt.fill(of_in.prod(), a_in) rt.drain(of_out.cons(), b_out, wait=True) ``` To compile the designs: -```python +```bash make all ``` @@ -235,14 +235,14 @@ for i in range(n_cores): for _ in range_(0xFFFFFFFF): elem_in = inX_fifos[i].acquire(ObjectFifoPort.Consume, 1) elem_out = outX_fifos[i].acquire(ObjectFifoPort.Produce, 1) - for i in range_(tile_size): - elem_out[i] = elem_in[i] + 1 + for j in range_(tile_size): + elem_out[j] = elem_in[j] + 1 inX_fifos[i].release(ObjectFifoPort.Consume, 1) outX_fifos[i].release(ObjectFifoPort.Produce, 1) ``` To compile the designs: -```python +```bash make placed ``` diff --git a/programming_guide/section-2/section-2e/aie2.py b/programming_guide/section-2/section-2e/aie2.py index cc0c99050c0..83700bc8d17 100644 --- a/programming_guide/section-2/section-2e/aie2.py +++ 
b/programming_guide/section-2/section-2e/aie2.py @@ -1,4 +1,4 @@ -# section-2/section-2d/aie2.py -*- Python -*- +# section-2/section-2e/aie2.py -*- Python -*- # # This file is licensed under the Apache License v2.0 with LLVM Exceptions. # See https://llvm.org/LICENSE.txt for license information. diff --git a/programming_guide/section-2/section-2e/aie2_multi.py b/programming_guide/section-2/section-2e/aie2_multi.py index 58c0b302506..cb0f75a83d3 100644 --- a/programming_guide/section-2/section-2e/aie2_multi.py +++ b/programming_guide/section-2/section-2e/aie2_multi.py @@ -1,4 +1,4 @@ -# section-2/section-2d/aie2_multi.py -*- Python -*- +# section-2/section-2e/aie2_multi.py -*- Python -*- # # This file is licensed under the Apache License v2.0 with LLVM Exceptions. # See https://llvm.org/LICENSE.txt for license information. @@ -52,7 +52,7 @@ def core_fn(of_in, of_out): elem_in = of_in.acquire(1) elem_out = of_out.acquire(1) - for i in range_(data_size): + for i in range_(tile_size): elem_out[i] = elem_in[i] + 1 of_in.release(1) of_out.release(1) diff --git a/programming_guide/section-2/section-2f/02_external_mem_to_core/ext_to_core.py b/programming_guide/section-2/section-2f/02_external_mem_to_core/ext_to_core.py index 1f60e247dee..5b1a5edb060 100644 --- a/programming_guide/section-2/section-2f/02_external_mem_to_core/ext_to_core.py +++ b/programming_guide/section-2/section-2f/02_external_mem_to_core/ext_to_core.py @@ -1,4 +1,4 @@ -# single_buffer.py -*- Python -*- +# ext_to_core.py -*- Python -*- # # This file is licensed under the Apache License v2.0 with LLVM Exceptions. # See https://llvm.org/LICENSE.txt for license information. 
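The acquire/release discipline that the `core_fn` hunks above keep correcting (acquire from `of_in`, acquire from `of_out`, compute, release both) can be illustrated outside the IRON API. Below is a hedged, pure-Python sketch of Object FIFO semantics with `depth=2` (ping-pong buffering); `SimObjectFifo` and its method names are invented for illustration and are not the real `ObjectFifo` class:

```python
from collections import deque

class SimObjectFifo:
    """Toy model of an Object FIFO: `depth` reusable objects shared
    between one producer and one consumer (not the IRON API)."""

    def __init__(self, depth=2):
        self.free = deque(range(depth))   # objects the producer may acquire
        self.ready = deque()              # filled objects awaiting the consumer
        self.store = {i: None for i in range(depth)}

    def prod_acquire(self):
        if not self.free:
            raise RuntimeError("producer would block: no free object")
        return self.free[0]

    def prod_release(self, data):
        idx = self.free.popleft()         # hand the filled object downstream
        self.store[idx] = data
        self.ready.append(idx)

    def cons_acquire(self):
        if not self.ready:
            raise RuntimeError("consumer would block: no ready object")
        return self.store[self.ready[0]]

    def cons_release(self):
        self.free.append(self.ready.popleft())  # recycle the object

of = SimObjectFifo(depth=2)
out = []
for i in range(4):                # steady-state ping-pong, never blocks
    of.prod_acquire()
    of.prod_release([i] * 4)      # producer fills one object
    out.append(sum(of.cons_acquire()))
    of.cons_release()
```

With depth 2 the producer can fill one object while the consumer drains the other, which is exactly the double-buffering default mentioned in section-2a.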
diff --git a/programming_guide/section-2/section-2f/03_external_mem_to_core_L2/README.md b/programming_guide/section-2/section-2f/03_external_mem_to_core_L2/README.md index b2d284085e6..efe9693ead2 100644 --- a/programming_guide/section-2/section-2f/03_external_mem_to_core_L2/README.md +++ b/programming_guide/section-2/section-2f/03_external_mem_to_core_L2/README.md @@ -10,7 +10,7 @@ # External Memory to Core through L2 -The design in [ext_to_coreL2.py](./ext_to_core.py) is very similar to the one in the previous [example](../02_external_mem_to_core/) with the difference being that in this design we first bring the `24xi32` data from external memory to L2 memory (i.e., a Mem tile) with `of_in0`. We then use `of_in1` to bring smaller `8xi32` slices of the data from the `MemTile` to `my_worker`. Two FIFOs then bring the data first to L2 via `of_out1` as `8xi32` tensors, then to external memory via `of_out0` as `24xi32` ones. All FIFOs use double buffers. +The design in [ext_to_core_L2.py](./ext_to_core_L2.py) is very similar to the one in the previous [example](../02_external_mem_to_core/) with the difference being that in this design we first bring the `24xi32` data from external memory to L2 memory (i.e., a Mem tile) with `of_in0`. We then use `of_in1` to bring smaller `8xi32` slices of the data from the `MemTile` to `my_worker`. Two FIFOs then bring the data first to L2 via `of_out1` as `8xi32` tensors, then to external memory via `of_out0` as `24xi32` ones. All FIFOs use double buffers. diff --git a/programming_guide/section-2/section-2g/README.md b/programming_guide/section-2/section-2g/README.md index b853727e636..6579e145e2a 100644 --- a/programming_guide/section-2/section-2g/README.md +++ b/programming_guide/section-2/section-2g/README.md @@ -25,7 +25,7 @@ Not all data movement patterns can be described with Object FIFOs. 
This **advanc **Please note that this part of the guide is described at the explicitly placed IRON level.** -The AIE architecture currently has three different types of tiles: compute tiles, referred to as "tile", memory tiles referred to as "Mem tiles", and external memory interface tiles referred to as "Shim tiles". Each of these tiles has its own attributes regarding compute capabilities and memory capacity, but the base design of their DMAs is the same. The different types of DMAs can be intialized using the constructors in [aie.py](../../../python/dialects/aie.py): +The AIE architecture currently has three different types of tiles: compute tiles, referred to as "tile", memory tiles referred to as "Mem tiles", and external memory interface tiles referred to as "Shim tiles". Each of these tiles has its own attributes regarding compute capabilities and memory capacity, but the base design of their DMAs is the same. The different types of DMAs can be initialized using the constructors in [aie.py](../../../python/dialects/aie.py): ```python @mem(tile) # compute tile DMA @shim_dma(tile) # Shim tile DMA @@ -49,7 +49,7 @@ def dma( ) ``` -The data movement on each channel is described by a chain of Buffer Descriptors (or "BDs"), where each BD describes what data is being moved and configures its synchornization mechanism. The `dma` constructor already creates space for one such BD as can be seen by its `num_blocks=1` default valued input. +The data movement on each channel is described by a chain of Buffer Descriptors (or "BDs"), where each BD describes what data is being moved and configures its synchronization mechanism. The `dma` constructor already creates space for one such BD as can be seen by its `num_blocks=1` default valued input. 
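The lock pairs used in the DMA configurations of this file (a producer lock initialized to `1`, a consumer lock initialized to `0`) implement a classic handshake. As a hedged model of the semantics only (`SimLock` is an invented name, not AIE or IRON code), acquire-greater-equal decrements when the value is at least 1 and otherwise stalls, while release increments:

```python
class SimLock:
    """Toy AIE-style semaphore lock: acquire_ge succeeds and decrements
    only when value >= 1; release increments."""
    def __init__(self, init):
        self.value = init

    def acquire_ge(self):
        if self.value < 1:
            return False          # hardware would stall the DMA/core here
        self.value -= 1
        return True

    def release(self):
        self.value += 1

prod_lock = SimLock(init=1)       # buffer starts empty: writer may go first
cons_lock = SimLock(init=0)       # nothing to read yet

trace = []
def dma_write(word):              # models an input channel filling a buffer
    if prod_lock.acquire_ge():
        trace.append(("write", word))
        cons_lock.release()       # signal data is ready
        return True
    return False

def core_read():                  # models the core consuming the buffer
    if cons_lock.acquire_ge():
        trace.append(("read",))
        prod_lock.release()       # signal the buffer is free again
        return True
    return False

assert not core_read()            # reader stalls first: no data yet
assert dma_write(42)
assert not dma_write(43)          # writer stalls: single buffer still full
assert core_read()
assert dma_write(43)
```

The alternation enforced here is why a `num_blocks=1` BD with one lock pair serializes producer and consumer; adding buffers and lock counts recovers the ping-pong behavior Object FIFOs provide automatically.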
The code snippet below shows how to configure the DMA on `tile_a` such that data coming in on input channel 0 is written into `buff_in`: ```python @@ -132,11 +132,11 @@ tile_b = tile(1, 3) prod_lock_a = lock(tile_a, lock_id=0, init=1) cons_lock_a = lock(tile_a, lock_id=1, init=0) -buff_a = buffer(tile=tile_a, np.ndarray[(256,), np.dtype[np.int32]]) # 256xi32 +buff_a = buffer(tile=tile_a, datatype=np.ndarray[(256,), np.dtype[np.int32]]) # 256xi32 prod_lock_b = lock(tile_b, lock_id=0, init=1) cons_lock_b = lock(tile_b, lock_id=1, init=0) -buff_b = buffer(tile=tile_b, np.ndarray[(256,), np.dtype[np.int32]]) # 256xi32 +buff_b = buffer(tile=tile_b, datatype=np.ndarray[(256,), np.dtype[np.int32]]) # 256xi32 aie.flow(tile_a, WireBundle.DMA, 0, tile_b, WireBundle.DMA, 1) @@ -150,7 +150,7 @@ def mem_body(): @mem(tile_b) def mem_body(): - @dma(SS2M, 1) # input channel, port 1 + @dma(S2MM, 1) # input channel, port 1 def dma_in_0(): use_lock(prod_lock_b, AcquireGreaterEqual) dma_bd(buff_b) diff --git a/programming_guide/section-3/README.md b/programming_guide/section-3/README.md index 9b75973882d..6ff0f47568a 100644 --- a/programming_guide/section-3/README.md +++ b/programming_guide/section-3/README.md @@ -36,7 +36,7 @@ The compute core will run an external function: a kernel written in C++ that wil ```python tensor_size = 4096 -tile_size = data_size // 4 +tile_size = tensor_size // 4 # Define tensor types tensor_ty = np.ndarray[(tensor_size,), np.dtype[np.int32]] @@ -96,7 +96,7 @@ def core_fn(of_in, of_factor, of_out, scale_scalar): # Create a worker to perform the task -my_worker = Worker(core_fn, [of_in1.cons(), of_factor.cons() of_out1.prod(), scale_fn]) +my_worker = Worker(core_fn, [of_in.cons(), of_factor.cons(), of_out.prod(), scale_fn]) ``` ## Kernel Code @@ -104,7 +104,7 @@ my_worker = Worker(core_fn, [of_in1.cons(), of_factor.cons() of_out1.prod(), sca We can program the AIE compute core using C++ code and compile it with the selected single-core AIE compiler into a 
kernel object file. For our local version of vector scalar multiply, we will use a generic implementation of the `scale.cc` source (called [vector_scalar_mul.cc](./vector_scalar_mul.cc)) that can run on the scalar processor part of the AIE. The `vector_scalar_mul_aie_scalar` function processes one data element at a time, taking advantage of AIE scalar datapath to load, multiply and store data elements. ```c -void vector_scalar_mul_aie_scalar(int32_t *a_in, int32_t *c_out, +void vector_scalar_mul_aie_scalar(int32_t *a, int32_t *c, int32_t *factor, int32_t N) { for (int i = 0; i < N; i++) { c[i] = *factor * a[i]; @@ -187,7 +187,7 @@ The host code contains the following sections (with C/C++ code examples): 1. *Initialize and synchronize*: host to device XRT buffer objects. - Here, we iniitliaze the values of our host buffer objects (including output) and call `sync` to synchronize that data to the device buffer object accessed by the kernel. + Here, we initialize the values of our host buffer objects (including output) and call `sync` to synchronize that data to the device buffer object accessed by the kernel. ```c // Copy instruction stream to xrt buffer object @@ -281,7 +281,7 @@ Because our design is defined in several different files such as: * kernel source - vector_scalar_mul.cc * host code - test.cpp/test.py -ensuring that top level design parameters stay consistent is important so we don't, for example, get system hangs when buffer sizes in the host code don't match the buffer size in the top level design. To help with this, we will share example design templates in [section-4b](../section4/section-4b) which puts these top level parameters in the `Makefile` and passes them to the other design files. More details will be described in [section-4b](../section-4/section-4b) or can be directly seen in example designs like [vector_scalar_mul](../../programming_examples/basic/vector_scalar_mul). 
+ensuring that top level design parameters stay consistent is important so we don't, for example, get system hangs when buffer sizes in the host code don't match the buffer size in the top level design. To help with this, we will share example design templates in [section-4b](../section-4/section-4b) which put these top level parameters in the `Makefile` and pass them to the other design files. More details will be described in [section-4b](../section-4/section-4b) or can be directly seen in example designs like [vector_scalar_mul](../../programming_examples/basic/vector_scalar_mul). ----- [[Prev - Section 2](../section-2/)] [[Top](..)] [[Next - Section 4](../section-4/)] diff --git a/programming_guide/section-4/section-4a/README.md b/programming_guide/section-4/section-4a/README.md index cf39b376e78..5b375c78127 100644 --- a/programming_guide/section-4/section-4a/README.md +++ b/programming_guide/section-4/section-4a/README.md @@ -20,7 +20,7 @@ We begin by first looking at timers for measuring application performance and what that tells us. The performance of an accelerated AI Engine application involves a number of components on the software stack, from invoking the application at the OS level, to passing control on to the kernel drivers, moving and dispatching work to the AIE array, running the accelerated application on AIE cores, and finally returning the data to the application for next-step processing. The most straightforward way to capture the performance of this entire stack of communication and processing is with an application timer, also known as the "wall clock" time. This gives us the upper bounds for how long an AIE accelerated application takes but adds to it the OS and kernel driver overhead. This is something that can be minimized when running multiple iterations of an accelerated program or running a sufficiently compute intensive application. Let's take a look at how we add the "wall clock" timer to an example program.
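The amortization point above (run many iterations so OS and driver overhead shrinks relative to compute) is easy to sketch host-side. This is a hedged Python analogue of the wall-clock measurement; `run_kernel` is an invented stand-in for the real XRT kernel invocation and sync:

```python
import time

def run_kernel(data):
    # Stand-in for the real kernel launch + sync; here just scales a vector.
    return [3 * x for x in data]

def wall_clock_avg(fn, data, iters=100):
    """Average wall-clock seconds per call over `iters` runs, amortizing
    one-time launch and driver overhead across iterations."""
    fn(data)                                  # warm-up call, not timed
    start = time.perf_counter()
    for _ in range(iters):
        out = fn(data)
    elapsed = time.perf_counter() - start
    return elapsed / iters, out

avg_s, result = wall_clock_avg(run_kernel, list(range(1024)))
```

The warm-up call plays the same role as discarding the first kernel invocation in the C++ `std::chrono` examples: first-run setup cost would otherwise inflate the average.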
## Application timer - Modifying [test.cpp](./test.cpp) -Adding the application timer is as simple as noting a start and stop time surrounding the calling of the kernel function. We can use the clock timer from the chrono library which is imported via `import <chrono>` but this may already be imported by other libraries (this is the case in our `test.cpp`). Then we record the start and stop time of our chrono timer with timer function calls surrounding our kernel function as follows: +Adding the application timer is as simple as noting a start and stop time surrounding the calling of the kernel function. We can use the clock timer from the chrono library which is included via `#include <chrono>` but this may already be imported by other libraries (this is the case in our `test.cpp`). Then we record the start and stop time of our chrono timer with timer function calls surrounding our kernel function as follows: ```c++ auto start = std::chrono::high_resolution_clock::now(); diff --git a/programming_guide/section-4/section-4b/README-placed.md b/programming_guide/section-4/section-4b/README-placed.md index 90c5f509e54..7220b09580d 100644 --- a/programming_guide/section-4/section-4b/README-placed.md +++ b/programming_guide/section-4/section-4b/README-placed.md @@ -26,7 +26,7 @@ In [section-4b](../section-4b), we introduced how trace is enabled in our high-l ## 1. Enable and configure AIE trace units for close-to-metal IRON Python ([aie2_placed.py](./aie2_placed.py)) -Enabling tracing means (1a) configuring the trace units for a given tile and then (1b) routing the generated events packets through the stream switches to the shim DMA where we can write them to a buffer in DDR for post-runtime processing.
In close-to-metal IRON python, these steps require a more explicit declaration and descrbied below: +Enabling tracing means (1a) configuring the trace units for a given tile and then (1b) routing the generated event packets through the stream switches to the shim DMA where we can write them to a buffer in DDR for post-runtime processing. In close-to-metal IRON python, these steps require a more explicit declaration and are described below: ### (1a) Configure trace units for an AIE tile The first necessary component for trace configuration is setting the right values for the trace control registers for each tile that we want to enable tracing for. In addition, the generated trace packets will need to be routed to a shimDMA and then written to an inout buffer so the packet data can be written to DDR. We have abstracted these two steps with the python wrapper function `configure_packet_tracing_aie2` which is in [python/utils/trace.py](../../../python/utils/trace.py) and is described in more detail in the [README](../../../python/utils) under `python/utils`. An example of how this function is used is shown below for quick reference: @@ -39,7 +39,7 @@ The arguments for this example are: * *opts.trace_size* - the trace buffer size in bytes This block is defined within the sequence definition for `@runtime_sequence` where we define the shimDMA data movement to the inout buffers. -> **Note** This convenience python wrapper abtracts a number of sub-steps for configuring the trace unit in each tile and the shimDMA for writing to DDR.
This uses packet switched routing to move the trace packets as opposed to circuit switched routing. More details on these sub-steps can be found in the [README](../../../python/utils) under `python/utils`. Configuring the trace units with `configure_packet_tracing_aie2` should be declared at the beginning of the `@runtime_sequence` so the trace mechanisms are in place prior to any data being transferred from/to DDR. At the end of the `@runtime_sequence` we add the following convenience python function to end the trace collection. ```python @@ -60,14 +60,14 @@ The arguments for this example are: ## Exercises -1. We can try building our close-to-metal IRON python design now. Run `make clean; make use_placed=1 trace`. This compiles the placed design, generates a trace data file, and runs `prase_trace.py` to generate the `trace_4b.json` waveform file. +1. We can try building our close-to-metal IRON python design now. Run `make clean; make use_placed=1 trace`. This compiles the placed design, generates a trace data file, and runs `parse_trace.py` to generate the `trace_4b.json` waveform file. Note that many of the designs under [programming_examples](../../../programming_examples/) have both a high-level IRON python version and a close-to-metal IRON python version, otherwise known as the placed version. Invoking make with `use_placed=1` is a common way to build these versions of the design. ## 2. Customizing Trace Behavior The wrapper python function `configure_packet_tracing_flow` abstracts much of the configuration of trace by making some assumptions of the desired trace behavior. Some of those assumptions are listed below along with how to customize them further. -1.
Additional configuration arguments for `configure_packet_tracing_flow` as described in [utils/python](../../../python/utils) * `tiles to trace` - array of tiles to trace * `shim tile` - Single shim tile to configure for writing trace packets to DDR @@ -104,4 +104,4 @@ The wrapper python function `configure_packet_tracing_flow` abstracts much of th ``` ----- -[[Prev]](../section-4a) [[Up]](../../section-4b) [[Next]](../section-4c) +[[Prev]](../section-4a) [[Up]](../) [[Next]](../section-4c) diff --git a/programming_guide/section-4/section-4b/README.md b/programming_guide/section-4/section-4b/README.md index 3c42f8d00f9..596948bdb6b 100644 --- a/programming_guide/section-4/section-4b/README.md +++ b/programming_guide/section-4/section-4b/README.md @@ -54,15 +54,15 @@ An alternative to specifying the array of workers to trace would be to instead a rt.enable_trace(trace_size) ... ``` -Here, we addd `trace=1` to indicate that worker should be traced. And we can omit the `workers` argument from the `enable_trace` call in the runtime sequence. +Here, we add `trace=1` to indicate that worker should be traced. And we can omit the `workers` argument from the `enable_trace` call in the runtime sequence. ->**NOTE**: The `workers` argument in the runtime sequence `enable_trace` always takes precendence over the `trace=1` argument of the woker. So if you define both, we will go with the definition of the `enable_trace` argument. +>**NOTE**: The `workers` argument in the runtime sequence `enable_trace` always takes precedence over the `trace=1` argument of the worker. So if you define both, we will go with the definition of the `enable_trace` argument. Configuring the trace unit in each core tile and routing the trace packets to a valid shim tile is then done automatically. ### Customizing Trace Behavior -The trace configuration chooses helpful default settings so you can trace your design with little additional customization. 
However, if you have more control over some of these configuration, additional arguments are available in the runtime `enable_trace` function, such as customizing the trace buffer offset, which XRT buffer you want to use and the events you wish to trace for all core tiles, mem tiles and shim tiles. These are passed in as additionl arguments as descrbied belows: +The trace configuration chooses helpful default settings so you can trace your design with little additional customization. However, if you want more control over some of these configurations, additional arguments are available in the runtime `enable_trace` function, such as customizing the trace buffer offset, which XRT buffer you want to use and the events you wish to trace for all core tiles, mem tiles and shim tiles. These are passed in as additional arguments as described below: * `trace_offset` - offset (in bytes) where trace buffer data should begin. This is 0 by default but if you wish to share XRT buffer with an output buffer, you can use offsets to control where the trace data is written to. * `ddr_id` - XRT buffer we want to write to. See [below](#2-configure-host-code-to-read-trace-data-and-write-it-to-a-text-file) for more details on XRT buffers. * `coretile_events` - which 8 events do we use for all coretiles in array. Check [python/utils/trace_events_enum.py](../../../python/utils/trace_events_enum.py) for the full list. @@ -89,7 +89,7 @@ The trace configuration chooses helpful default settings so you can trace your d ) ``` -Additional customizations are available in the closer-to-metal IRON and is descrbied more in [README-placed](./README-placed.md). +Additional customizations are available in the closer-to-metal IRON and are described in more detail in [README-placed](./README-placed.md). ## 2.
Configure host code to read trace data and write it to a text file @@ -128,7 +128,7 @@ Once [aie2.py](./aie2.py) is configured to output trace data to the 5th inout bu > **NOTE** In our example design ([aie2.py](./aie2.py)), we provide a [Makefile](./Makefile) target `run` for standard build and `trace` for trace-enabled build. The trace-enabled build passes the trace buffer size as an argument to [aie2.py](./aie2.py) which is used under the hood to conditionally enable tracing as long as `trace_size` is > 0. This is also true for the [Vector Scalar Multiply example](../../../programming_examples/basic/vector_scalar_mul). -### (2a) C/C++ Host code ([test.cpp](./test.cpp), [../../../runtime_lib/test_lib/xrt_test_wrapper_.h](../../../runtime_lib/test_lib/xrt_test_wrapper.h)) +### (2a) C/C++ Host code ([test.cpp](./test.cpp), [../../../runtime_lib/test_lib/xrt_test_wrapper.h](../../../runtime_lib/test_lib/xrt_test_wrapper.h)) The main changes needed for the host code are to declare a buffer object for trace data and pass that buffer object to the XRT kernel function call. This looks like the following snippets of code: ```c @@ -152,7 +152,7 @@ Once the design has been executed. We can then use the convenience function `wri ``` #### Templated host code (test.cpp) -Because the code patterns for measuring host code timing and configuring trace are so often repeated, they have been further wrapped into the convenience function `setup_and_run_aie` in [xrt_test_wrapper_.h](../../../runtime_lib/test_lib/xrt_test_wrapper.h) which then allows us to create a simpler top level host code [test.cpp](./test.cpp). +Because the code patterns for measuring host code timing and configuring trace are so often repeated, they have been further wrapped into the convenience function `setup_and_run_aie` in [xrt_test_wrapper.h](../../../runtime_lib/test_lib/xrt_test_wrapper.h) which then allows us to create a simpler top level host code [test.cpp](./test.cpp).
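The motivation for `setup_and_run_aie` (factoring the repeated initialize/run/extract-trace/verify boilerplate into one templated helper) can be illustrated generically. The sketch below is hypothetical Python, not the actual wrapper in `xrt_test_wrapper.h` or `python/utils/xrt.py`; `setup_and_run`, `fake_kernel`, and their parameters are invented for illustration:

```python
def setup_and_run(kernel, init_bufs, verify, trace_size=0):
    """Generic host-test wrapper: initialize input buffers, run the
    kernel, carve any trailing trace region off the output, then verify."""
    bufs = {name: init() for name, init in init_bufs.items()}
    out = kernel(**bufs)
    trace_data = out[-trace_size:] if trace_size else []
    result = out[: len(out) - trace_size]
    return verify(result), trace_data

# Usage: a passthrough "kernel" whose output carries 4 trailing trace words,
# mimicking an output buffer shared with trace data at an offset.
def fake_kernel(a_in):
    return list(a_in) + [0xDEAD] * 4

ok, trace = setup_and_run(
    fake_kernel,
    init_bufs={"a_in": lambda: list(range(8))},
    verify=lambda out: out == list(range(8)),
    trace_size=4,
)
```

The design choice being sketched: once buffer setup, trace extraction, and verification live in one helper, each new design's `test.py`/`test.cpp` shrinks to supplying a kernel, initializers, and a verifier.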
In our template host code [test.cpp](./test.cpp) for 2 inputs and 1 output, we customize the following: * Input and output buffer size (in bytes) - Specified in the [Makefile](./Makefile) and [CMakeLists.txt](./CMakeLists.txt) and then passed into the [aie2.py](./aie2.py) and [test.cpp](./test.cpp) @@ -234,10 +234,10 @@ These convenience python wrappers perform the `sync` steps under the hood when t trace_buffer = trace_buffer.view(np.uint32) write_out_trace(trace_buffer, str(opts.trace_file)) ``` -Just like the C/C++ host code wrapper `setup_and_run_aie` found in [../../../runtime_lib/test_lib/xrt_test_wrapper_.h](../../../runtime_lib/test_lib/xrt_test_wrapper.h), for python, we have a similar wrapper `setup_and_run_aie` in [../../../python/utils/xrt.py](../../../python/utils/xrt.py). This likewise simplifies the `test.py` and can be used as a template for design patterns. +Just like the C/C++ host code wrapper `setup_and_run_aie` found in [../../../runtime_lib/test_lib/xrt_test_wrapper.h](../../../runtime_lib/test_lib/xrt_test_wrapper.h), for Python we have a similar wrapper `setup_and_run_aie` in [../../../python/utils/xrt.py](../../../python/utils/xrt.py). This likewise simplifies `test.py` and can be used as a template for design patterns. ## 3. Parse text file to generate a waveform json file -Once the packet trace text file is generated (`trace.txt`), we use a python-based trace parser ([parse_trace.py](../../../python/utils/trace/parse.py)) to interpret the trace values and generate a waveform json file for visualization (with Perfetto). This is a step in the [Makefile](./Makefile) but can be executed from the command line as well. +Once the packet trace text file is generated (`trace.txt`), we use a Python-based trace parser ([parse.py](../../../python/utils/trace/parse.py)) to interpret the trace values and generate a waveform json file for visualization (with Perfetto).
This is a step in the [Makefile](./Makefile) but can be executed from the command line as well. ```Makefile ../../../python/utils/trace/parse.py --input trace.txt --mlir build/aie_trace.mlir --output trace_4b.json ``` @@ -254,12 +254,12 @@ Open https://ui.perfetto.dev in your browser and then open up the waveform json * It's possible that a simple core may have too few events to create a valid trace packet. To work around this in closer-to-metal IRON python, you can either (1) add a ShimTile to the array of `[tiles_to_trace]` as well to add more trace data or (2) reduce the shim dma burst length by adding the parameter `shim_burst_length=64` to the call `configure_packet_tracing_aie2`. Valid shim burst lengths for aie2 are 64B, 128B, 256B, and 512B. The default burst length for regular data buffers is 256 bytes, but for the trace buffer it is 64 bytes instead, which means you only need to define it if it was overwritten elsewhere. This also means that if the trace data is less than 64B, it will not be written out to DDR. Another scenario is that some trace data packets can be missing at the end if the trace data is not a multiple of 64 bytes. * If you're sharing a buffer object for both output and trace, ensure the offset for the trace configuration is the right size (based on the output buffer size). Check both size and datatype. Offsets are usually in terms of bytes. * Check that the correct tile is being routed to the correct shim DMA. It's not uncommon in a multi-core design to route the wrong tile if you're routing these manually, especially if the tile names are very similar. Using the convenience python wrappers should automatically handle this correctly. - * You may get an invalid tile error if the `colshift` doesn't match the actually starting column of the design. This should automatically be set by the `parse_trace.py` script but can also be specified manually, and you can specify the `colshift` value in the evnet the automatic value is incorect.
Phoenix (npu) devices should have `colshift=1` while Strix (npu2) should have `colshift=0`when allocated to an unused NPU. + * You may get an invalid tile error if the `colshift` doesn't match the actual starting column of the design. This should be set automatically by the `parse.py` script, but you can also specify the `colshift` value manually in the event the automatic value is incorrect. Phoenix (npu) devices should have `colshift=1` while Strix (npu2) should have `colshift=0` when allocated to an unused NPU. * For designs with packet-routing flows, check for correctly matching packet flow IDs. The packet flow ID must match the configured ID value in the Trace Control 1 register or else the packets don't get routed. Using the convenience python wrappers should again automatically handle this correctly. However, if your design uses its own packet-routing flows, the default flow IDs may conflict with the trace ones (to be improved in a future release). * At the moment, there is an ongoing bug where you may see intermittent seg faults or functional errors for some designs when trace is enabled. The current workaround is to allocate an XRT buffer much larger than the trace size (currently 4x). This may need to be bigger still, as this size was experimentally determined. ## Exercises -1. Let's give tracing a try. In this directory, we will be examining a simplified version of the `vector scalar multiply` example. Run `make trace`. This compiles the design, generates a trace data file, and runs `prase_trace.py` to generate the `trace_4b.json` waveform file. +1. Let's give tracing a try. In this directory, we will be examining a simplified version of the `vector scalar multiply` example. Run `make trace`. This compiles the design, generates a trace data file, and runs `parse.py` to generate the `trace_4b.json` waveform file. Open this waveform json in http://ui.perfetto.dev.
Zoom into the region of interest using the keyboard shortcut keys W and S to zoom in and out, respectively, and A and D to pan left and right. You should see a waveform like the following: diff --git a/programming_guide/section-4/section-4c/README.md b/programming_guide/section-4/section-4c/README.md index 918ac89131c..5993f95d4e5 100644 --- a/programming_guide/section-4/section-4c/README.md +++ b/programming_guide/section-4/section-4c/README.md @@ -127,7 +127,7 @@ In this example, the vectorization strategy was relatively straight forward. Ins That's quite an improvement, ~8X reduction in compute latency. However, there's more optimization to be had with vector code, and that involves optimization pragmas. -1. Go back to [scale.cc](../../../aie_kernels/aie2/scale.cc) and uncomment the lines with `AIE_PREPARE_FOR_PIPELINING AIE_LOOP_MIN_ITERATION_COUNT(16)` to enable those pragmas. Then rerun the compilation (`make clean; int_bit_width=32 trace`). Measure the delta between `event 0` and `event 1` again. What value do you see now? +1. Go back to [scale.cc](../../../aie_kernels/aie2/scale.cc) and uncomment the lines with `AIE_PREPARE_FOR_PIPELINING AIE_LOOP_MIN_ITERATION_COUNT(16)` to enable those pragmas. Then rerun the compilation (`make clean; make int_bit_width=32 trace`). Measure the delta between `event 0` and `event 1` again. What value do you see now? Now, we're really seeing some savings (another factor ~4X savings or ~36X compared to the scalar version). The line we added helps guide the compiler to find optimal schedules. For kernel loops, `AIE_PREPARE_FOR_PIPELINING` and `AIE_LOOP_MIN_ITERATION_COUNT(16)` are particularly useful: * `AIE_PREPARE_FOR_PIPELINING` - Used in the innermost loop to tell the compiler to enable software pipelining. This is needed to enable subsequent loop optimization pragmas. @@ -215,7 +215,7 @@ Looking at this table, we quickly see that the data movement is the bottleneck f 1.
We can already see that our design is imbalanced between data movement and compute, with 72 cycles for compute and 512 cycles for data movement. Let's take a look at the [Matrix Multiply Example](../../../programming_examples/basic/matrix_multiplication/single_core) and see if we can do better. The description explains that each iteration of the kernel is by default configured for MxKxN values of 64x64x64, giving us 262,144 MACs. Given that we're working with the `int16_t` datatype, which has 64 MACs per clock, how many cycles will the ideal case take? Given that the A and B matrices are each 64x64 `int16_t` and our stream switch channels are 32 bits wide, how many cycles does it take to move data to the compute tile (bear in mind that A and B can be moved in parallel via separate channels)? -1. So this example should be perfectly balanced between compute and data movement! Navigate to the [Matrix Multiply Example](../../../programming_examples/basic/matrix_multiplication/single_core) and run the trace build (`make clean; make -f Makefile.chess use_placed=1 trace`). Then open the generated waveform json (`trace_mm.json`) and measure the delta between `event 0` and `event 1` in the first run. What value did you get and how close is it to ideal? You should now see that the compute cycles and the data movement cycles are much more closely matched! +1. So this example should be perfectly balanced between compute and data movement! Navigate to the [Matrix Multiply Example](../../../programming_examples/basic/matrix_multiplication/single_core) and run the trace build (`make clean; make use_placed=1 trace`). Then open the generated waveform json (`trace_mm.json`) and measure the delta between `event 0` and `event 1` in the first run. What value did you get and how close is it to ideal? You should now see that the compute cycles and the data movement cycles are much more closely matched!
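As a sanity check on the exercise above, the two quantities it asks for can be computed directly from the figures stated in the text (64 `int16_t` MACs per clock, 32-bit stream-switch channels moving 4 bytes per cycle). This is only back-of-the-envelope arithmetic, not a measured result:

```python
# Ideal compute time: MxKxN = 64x64x64 multiply-accumulates per kernel call.
macs_per_call = 64 * 64 * 64          # 262,144 MACs
macs_per_cycle = 64                   # int16_t MACs per clock (from the text)
compute_cycles = macs_per_call // macs_per_cycle

# Data movement: A and B are each 64x64 int16_t (2 bytes per element),
# streamed over separate 32-bit (4 bytes/cycle) channels in parallel,
# so the transfer time is that of a single matrix.
bytes_per_matrix = 64 * 64 * 2        # 8192 bytes
move_cycles = bytes_per_matrix // 4

print(compute_cycles, move_cycles)    # 4096 2048
```

Compare these ideal numbers against the `event 0` to `event 1` delta you measure in the trace; real runs include lock and pipeline overheads on top of the ideal counts.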
## Diving Deep - Examining the Microcode Let's take another look at the results of our [vector_scalar_mul design](../../../programming_examples/basic/vector_scalar_mul/). Let's also go back one step and comment out `AIE_PREPARE_FOR_PIPELINING AIE_LOOP_MIN_ITERATION_COUNT(16)` and rerun the compilation (`make clean; make trace`). diff --git a/programming_guide/section-5/README.md b/programming_guide/section-5/README.md index afeaf35d461..686eb706231 100644 --- a/programming_guide/section-5/README.md +++ b/programming_guide/section-5/README.md @@ -18,7 +18,7 @@ The [programming examples](../../programming_examples) are a number of sample de The [passthrough](../../programming_examples/basic/passthrough_kernel/) example is the simplest "getting started" example. It copies 4096 bytes from the input to output using vectorized loads and stores. The design example shows a typical project organization which is easy to reproduce with other examples. There are only really 4 important files here. 1. [`passthrough_kernel.py`](../../programming_examples/basic/passthrough_kernel/passthrough_kernel.py) The AIE structural design which includes the shim tile connected to the external memory, and a single AIE core for performing the copy. It also shows a simple use of the Object FIFOs described in [section 2](../section-2). -1. [`passthrough.cc`](../../aie_kernels/generic/passThrough.cc) This is a C++ file which performs the vectorized copy operation. +1. [`passThrough.cc`](../../aie_kernels/generic/passThrough.cc) This is a C++ file which performs the vectorized copy operation. 1. [`test.cpp`](../../programming_examples/basic/passthrough_kernel/test.cpp) or [`test.py`](../../programming_examples/basic/passthrough_kernel/test.py) A C++ or Python main application for exercising the design, and comparing against a CPU reference 1. 
[`Makefile`](../../programming_examples/basic/passthrough_kernel/Makefile) A Makefile documenting (and implementing) the build process for the various artifacts. @@ -60,7 +60,7 @@ The [passthrough DMAs](../../programming_examples/basic/passthrough_dmas/) examp 1. Take a look at the testbench in our [Vector Exp](../../programming_examples/basic/vector_exp/) example [test.cpp](../../programming_examples/basic/vector_exp/test.cpp). Take note of the data type and the size of the test vector. What do you notice? -1. What is the communication-to-computation ratio in [ReLU](../../programming_examples/ml/relu/)? +1. What is the communication-to-computation ratio in [ReLU](../../programming_examples/ml/relu/)? 1. **HARD** Which basic example is a component in [Softmax](../../programming_examples/ml/softmax/)? diff --git a/python/iron/__init__.py b/python/iron/__init__.py index 06ec1d34fe4..23f0280a82e 100644 --- a/python/iron/__init__.py +++ b/python/iron/__init__.py @@ -1,4 +1,18 @@ # (c) Copyright 2026 Advanced Micro Devices, Inc. +"""IRON: High-level Python API for programming AMD Ryzen AI NPUs. + +Provides the primary abstractions for describing NPU designs: + +- :class:`Buffer` -- named memory region shared between Workers and the Runtime +- :class:`ObjectFifo` -- synchronized dataflow channel between program components +- :class:`Worker` -- a task running on an AIE compute core +- :class:`Runtime` -- host-side orchestration of data movement and worker execution +- :class:`Program` -- top-level container that compiles a design to MLIR +- :class:`Kernel` / :class:`ExternalFunction` -- pre-compiled or C++ kernel functions +- :class:`WorkerRuntimeBarrier` -- synchronization primitive between workers and runtime +- Tensor utilities (:func:`arange`, :func:`zeros`, :func:`ones`, etc.) 
for NPU-accessible buffers +""" + from .buffer import Buffer from .kernel import ExternalFunction, Kernel from .program import Program diff --git a/python/iron/algorithms/__init__.py b/python/iron/algorithms/__init__.py index 183a2ff745e..9e90663c444 100644 --- a/python/iron/algorithms/__init__.py +++ b/python/iron/algorithms/__init__.py @@ -5,6 +5,8 @@ # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception # # (c) Copyright 2026 Advanced Micro Devices, Inc. +"""High-level algorithm templates built on IRON (transform, for_each, etc.).""" + from .for_each import for_each from .transform import ( transform, diff --git a/python/iron/algorithms/for_each.py b/python/iron/algorithms/for_each.py index 61ce4f0feab..68d703fafba 100644 --- a/python/iron/algorithms/for_each.py +++ b/python/iron/algorithms/for_each.py @@ -5,6 +5,8 @@ # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception # # (c) Copyright 2026 Advanced Micro Devices, Inc. +"""``for_each``: apply a function in-place over a tiled tensor on an AIE core.""" + import numpy as np from aie.iron import ObjectFifo, Program, Runtime, Worker @@ -27,10 +29,14 @@ def for_each(func, tensor, *params, tile_size=16): array types are transferred via ObjectFifos. tile_size: Size of each tile processed by a worker (default: 16) - Example: - # kernel has separate in/out tile buffers, but only pass one tensor in + Example:: + + # kernel has separate in/out tile buffers, but only one tensor is passed scale = ExternalFunction("scale", arg_types=[tile_ty, tile_ty, scalar_ty, np.int32], ...) - for_each(scale, tensor, factor, tile_size) + for_each(scale, tensor, factor, tile_size=16) + + Returns: + mlir.ir.Module: The compiled MLIR module ready for execution. 
""" is_external_func = isinstance(func, iron.ExternalFunction) num_elements = np.size(tensor) diff --git a/python/iron/algorithms/transform.py b/python/iron/algorithms/transform.py index e4442ecb0d4..2ed0cbf35ba 100644 --- a/python/iron/algorithms/transform.py +++ b/python/iron/algorithms/transform.py @@ -5,6 +5,8 @@ # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception # # (c) Copyright 2026 Advanced Micro Devices, Inc. +"""Tiled transform algorithms (unary/binary, single-core/parallel) built on IRON.""" + import numpy as np from aie.iron import ObjectFifo, Program, Runtime, Worker @@ -373,25 +375,74 @@ def core_body(*of_args): def transform(func, input, output, *params, tile_size=16): - """Transform input to output using tiled processing.""" + """Apply ``func`` to ``input`` and write results to ``output`` using tiled processing on a single AIE core. + + Args: + func: Function or :class:`~aie.iron.kernel.ExternalFunction` to apply. + input: Input tensor (NPU-accessible). + output: Output tensor (NPU-accessible, same shape and dtype as ``input``). + *params: Additional parameters forwarded to ``func``. + tile_size (int, optional): Number of elements per tile. Defaults to 16. + + Returns: + mlir.ir.Module: The compiled MLIR module. + """ return _transform_gen(func, [input], output, *params, tile_size=tile_size) def transform_binary(func, first, second, output, *params, tile_size=16): - """Transform binary inputs to output using tiled processing.""" + """Apply ``func`` to ``first`` and ``second`` and write results to ``output`` using tiled processing on a single AIE core. + + Args: + func: Function or :class:`~aie.iron.kernel.ExternalFunction` to apply. + first: First input tensor (NPU-accessible). + second: Second input tensor (NPU-accessible, same shape and dtype as ``first``). + output: Output tensor (NPU-accessible, same shape and dtype as inputs). + *params: Additional parameters forwarded to ``func``. 
+ tile_size (int, optional): Number of elements per tile. Defaults to 16. + + Returns: + mlir.ir.Module: The compiled MLIR module. + """ return _transform_gen(func, [first, second], output, *params, tile_size=tile_size) def transform_parallel(func, input, output, *params, tile_size=16): - """Parallel unary transform across multiple AIE tiles.""" + """Apply ``func`` to ``input`` in parallel across all available NPU columns. + + Distributes the input tensor evenly across columns; each column processes + ``tile_size`` elements per iteration. + + Args: + func: Function or :class:`~aie.iron.kernel.ExternalFunction` to apply. + input: Input tensor (NPU-accessible). + output: Output tensor (NPU-accessible, same shape and dtype as ``input``). + *params: Additional parameters forwarded to ``func``. + tile_size (int, optional): Number of elements per tile per column. Defaults to 16. + + Returns: + mlir.ir.Module: The compiled MLIR module. + """ return _transform_parallel_gen(func, [input], output, *params, tile_size=tile_size) def transform_parallel_binary(func, first, second, output, *params, tile_size=16): - """Parallel binary transform across multiple AIE tiles.""" + """Apply ``func`` to ``first`` and ``second`` in parallel across all available NPU columns. + + Args: + func: Function or :class:`~aie.iron.kernel.ExternalFunction` to apply. + first: First input tensor (NPU-accessible). + second: Second input tensor (NPU-accessible, same shape and dtype as ``first``). + output: Output tensor (NPU-accessible, same shape and dtype as inputs). + *params: Additional parameters forwarded to ``func``. + tile_size (int, optional): Number of elements per tile per column. Defaults to 16. + + Returns: + mlir.ir.Module: The compiled MLIR module. 
+ """ return _transform_parallel_gen( func, [first, second], output, *params, tile_size=tile_size diff --git a/python/iron/buffer.py b/python/iron/buffer.py index 0eb74ef4918..15ec7f287d8 100644 --- a/python/iron/buffer.py +++ b/python/iron/buffer.py @@ -1,10 +1,12 @@ -# globalbuffer.py -*- Python -*- +# buffer.py -*- Python -*- # # This file is licensed under the Apache License v2.0 with LLVM Exceptions. # See https://llvm.org/LICENSE.txt for license information. # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception # # (c) Copyright 2024 Advanced Micro Devices, Inc. +"""Named memory region accessible by both Workers and the Runtime.""" + import numpy as np from typing import Sequence @@ -24,7 +26,7 @@ class Buffer(Resolvable, Placeable): This is often used for Runtime Parameters. """ - """This is used to generate unique names if none is given during construction""" + # Used to generate unique names when none is provided during construction. __gbuf_index = 0 def __init__( @@ -46,7 +48,7 @@ def __init__( use_write_rtp (bool, optional): If use_write_rtp, write_rtp/read_rtp operations will be generated. Otherwise, traditional write/read operations will be used. Defaults to False. Raises: - ValueError: Arguments are validated. + ValueError: If neither ``type`` nor ``initial_value`` is provided. 
""" if type is None and initial_value is None: raise ValueError("Must provide either type, initial value, or both.") @@ -70,12 +72,12 @@ def __get_index(cls) -> int: @property def shape(self) -> Sequence[int]: """The shape of the buffer""" - return np_ndarray_type_get_shape(self._obj_type) + return np_ndarray_type_get_shape(self._arr_type) @property def dtype(self) -> np.dtype: """The per-element datatype of the buffer.""" - return np_ndarray_type_get_dtype(self._obj_type) + return np_ndarray_type_get_dtype(self._arr_type) @property def op(self): @@ -85,14 +87,14 @@ def op(self): def __getitem__(self, idx): if self._op is None: - return AttributeError( + raise AttributeError( "Cannot index into Buffer before it has been resolved." ) return self._op[idx] def __setitem__(self, idx, source): if self._op is None: - return AttributeError( + raise AttributeError( "Cannot index into Buffer before it has been resolved." ) else: diff --git a/python/iron/dataflow/__init__.py b/python/iron/dataflow/__init__.py index 5b0abe57c1c..de68ebb57c9 100644 --- a/python/iron/dataflow/__init__.py +++ b/python/iron/dataflow/__init__.py @@ -1 +1,10 @@ +# __init__.py -*- Python -*- +# +# This file is licensed under the Apache License v2.0 with LLVM Exceptions. +# See https://llvm.org/LICENSE.txt for license information. +# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +# +# (c) Copyright 2026 Advanced Micro Devices, Inc. +"""ObjectFIFO dataflow primitives for IRON designs.""" + from .objectfifo import ObjectFifo, ObjectFifoHandle, ObjectFifoLink, ObjectFifoEndpoint diff --git a/python/iron/dataflow/endpoint.py b/python/iron/dataflow/endpoint.py index 513b93de701..33128f31966 100644 --- a/python/iron/dataflow/endpoint.py +++ b/python/iron/dataflow/endpoint.py @@ -5,6 +5,8 @@ # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception # # (c) Copyright 2024 Advanced Micro Devices, Inc. 
+"""ObjectFifoEndpoint: base class for placeable endpoints of an ObjectFIFO.""" + from ..placeable import Placeable diff --git a/python/iron/dataflow/objectfifo.py b/python/iron/dataflow/objectfifo.py index a826c8f77bf..ef6bc9de614 100644 --- a/python/iron/dataflow/objectfifo.py +++ b/python/iron/dataflow/objectfifo.py @@ -34,7 +34,7 @@ class ObjectFifo(Resolvable): user has a Placeable endpoint. """ - """This is used to generate unique ObjectFifo names.""" + # Used to generate unique ObjectFifo names when none is provided. __of_index = 0 def __init__( @@ -55,10 +55,10 @@ def __init__( name (str | None, optional): The name of the ObjectFifo. If None is given, a unique name will be generated. Defaults to None. dims_to_stream (list[Sequence[int]] | None, optional): Data layout transformations applied when data is pushed onto the AXI stream, described as pairs of (size, stride) from highest to lowest dimension. Defaults to None. dims_from_stream_per_cons (list[Sequence[int]] | None, optional): List of data layout transformations applied by each consumer when data is read from the AXI stream, described as pairs of (size, stride) from highest to lowest dimension. Defaults to None. - plio (bool, optional): _description_. Defaults to False. + plio (bool, optional): Whether the ObjectFifo uses PLIO connections. Defaults to False. Raises: - ValueError: _description_ + ValueError: If ``depth`` is provided and is less than 1. """ self._depth = depth if isinstance(self._depth, int) and self._depth < 1: @@ -88,12 +88,12 @@ def __get_index(cls) -> int: @property def depth(self) -> int: - """The default depth of the ObjectFifo. This may be overriden by an ObjectFifoHandle upon construction.""" + """The default depth of the ObjectFifo. This may be overridden by an ObjectFifoHandle upon construction.""" return self._depth @property def dims_from_stream_per_cons(self) -> list[Sequence[int]]: - """The default dimensions from stream per consumer value. 
This may be overriden by an ObjectFifoHandle of type consumer.""" + """The default dimensions from stream per consumer value. This may be overridden by an ObjectFifoHandle of type consumer.""" return self._dims_from_stream_per_cons @property @@ -179,7 +179,7 @@ def cons( dims_from_stream: list[Sequence[int]] | None = None, ) -> ObjectFifoHandle: """Returns an ObjectFifoHandle of type consumer. Each ObjectFifo may have multiple consumers, so this - will return a new consumer handle every time is it callled. + will return a new consumer handle every time it is called. Args: depth (int | None, optional): The depth of the buffers at the endpoint corresponding to this consumer handle. Defaults to None. @@ -246,7 +246,7 @@ def _prod_tile_op(self) -> Tile: def _cons_tiles_ops(self) -> list[Tile]: if len(self._cons) < 1: raise ValueError( - f"Cannot return cons.tile.op for ObjectFifo {self.name} because no consumers were not created." + f"Cannot return cons.tile.op for ObjectFifo {self.name} because no consumers were created." ) return [cons.endpoint.tile.op for cons in self._cons] @@ -745,11 +745,11 @@ def __init__( ) if len(self._src_offsets) > 0 and len(self._src_offsets) != len(self._srcs): raise ValueError( - "Then number of source offsets does not match the number of sources" + "The number of source offsets does not match the number of sources" ) if len(self._dst_offsets) > 0 and len(self._dst_offsets) != len(self._dsts): raise ValueError( - "Then number of destination offsets does not match the number of destinations" + "The number of destination offsets does not match the number of destinations" ) self._op = None for s in self._srcs: diff --git a/python/iron/device/__init__.py b/python/iron/device/__init__.py index 9d4efe00600..1786507754d 100644 --- a/python/iron/device/__init__.py +++ b/python/iron/device/__init__.py @@ -1,3 +1,12 @@ +# __init__.py -*- Python -*- +# +# This file is licensed under the Apache License v2.0 with LLVM Exceptions. 
+# See https://llvm.org/LICENSE.txt for license information. +# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +# +# (c) Copyright 2026 Advanced Micro Devices, Inc. +"""Device representations for supported AMD Ryzen AI NPU targets.""" + from .device import ( Device, NPU1, diff --git a/python/iron/device/device.py b/python/iron/device/device.py index 1ccacfb3094..09e3b2e8369 100644 --- a/python/iron/device/device.py +++ b/python/iron/device/device.py @@ -88,10 +88,12 @@ def tile_iterator(self) -> Generator[Tile, None, None]: @property def rows(self) -> int: + """Number of rows in the device tile array.""" return self._tm.rows() @property def cols(self) -> int: + """Number of columns in the device tile array.""" return self._tm.columns() def get_shim_tiles(self) -> list[Tile]: @@ -133,6 +135,9 @@ def get_compute_tiles(self) -> list[Tile]: def get_num_source_switchbox_connections(self, t: Tile) -> int: """Returns number of DMA source ports in the switchbox for the given tile on the device. + Args: + t (Tile): The tile to query. + Returns: int: Number of DMA source ports. """ @@ -144,6 +149,9 @@ def get_num_source_switchbox_connections(self, t: Tile) -> int: def get_num_dest_switchbox_connections(self, t: Tile) -> int: """Returns number of DMA dest ports in the switchbox for the given tile on the device. + Args: + t (Tile): The tile to query. + Returns: int: Number of DMA dest ports. """ @@ -155,6 +163,9 @@ def get_num_dest_switchbox_connections(self, t: Tile) -> int: def get_num_source_shim_mux_connections(self, t: Tile) -> int: """Returns number of DMA source ports in the shim mux for the given tile on the device. + Args: + t (Tile): The tile to query. + Returns: int: Number of DMA source ports. """ @@ -166,6 +177,9 @@ def get_num_source_shim_mux_connections(self, t: Tile) -> int: def get_num_dest_shim_mux_connections(self, t: Tile) -> int: """Returns number of DMA dest ports in the shim mux for the given tile on the device. 
+ Args: + t (Tile): The tile to query. + Returns: int: Number of DMA dest ports. """ @@ -177,7 +191,7 @@ def get_num_dest_shim_mux_connections(self, t: Tile) -> int: def get_num_connections(self, tile: Tile, output: bool) -> int: """Returns number of DMA input or output "channels" available on the tile. Returns: - int: Number of connections (channels) available on the tile + int: Number of connections (channels) available on the tile. """ if tile.row == 0: if output: @@ -192,7 +206,7 @@ def get_num_connections(self, tile: Tile, output: bool) -> int: def is_mem_accessible(self, source_tile: Tile, tiles: list[Tile]) -> bool: """Returns whether there exists a memory region on source_tile which all destination tiles can access. Returns: - int: Number of connections (channels) available on the tile + bool: True if the given source tile has a memory region accessible by all destination tiles. """ if not isinstance(source_tile, Tile): raise ValueError(f"Expected a source Tile, but got {source_tile}") diff --git a/python/iron/dtype.py b/python/iron/dtype.py index e4d51a161d8..5ea76501e3b 100644 --- a/python/iron/dtype.py +++ b/python/iron/dtype.py @@ -1,14 +1,16 @@ -# config.py -*- Python -*- +# dtype.py -*- Python -*- # # This file is licensed under the Apache License v2.0 with LLVM Exceptions. # See https://llvm.org/LICENSE.txt for license information. # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception # # (c) Copyright 2025 Advanced Micro Devices, Inc. +"""Utilities for converting between short string names and numpy dtype objects.""" import numpy as np from ml_dtypes import bfloat16 +# Mapping from short string names (e.g. 'bf16', 'i32') to numpy/ml_dtypes dtype objects. 
dtype_map = { "bf16": bfloat16, "i8": np.int8, diff --git a/python/iron/kernel.py b/python/iron/kernel.py index ec7c8183d69..74540031b22 100644 --- a/python/iron/kernel.py +++ b/python/iron/kernel.py @@ -5,6 +5,7 @@ # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception # # (c) Copyright 2024-2026 Advanced Micro Devices, Inc. +"""Kernel and ExternalFunction: wrappers for pre-compiled and C++ AIE compute kernels.""" import hashlib import logging diff --git a/python/iron/placeable.py b/python/iron/placeable.py index 655eaa342f1..0ba7195dbb4 100644 --- a/python/iron/placeable.py +++ b/python/iron/placeable.py @@ -5,6 +5,8 @@ # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception # # (c) Copyright 2024 Advanced Micro Devices, Inc. +"""Base class for objects that can be placed on a specific device tile.""" + from .device import Tile, PlacementTile @@ -37,7 +39,7 @@ def tile(self) -> PlacementTile | None: """Return the tile of the placeable object. Returns: - PlacementTile: The current placement of the object. + PlacementTile | None: The current placement of the object, or None if unplaced. """ return self._tile @@ -49,8 +51,9 @@ def __init__(self, cls, current_tile: Tile, new_tile: Tile): """Create an AlreadyPlacedError Args: - current_tile (Tile): The current placement tile - new_tile (Tile): The placement tile given for the second attempt to place the object. + cls (type): The class of the object that is already placed. + current_tile (Tile): The current placement tile. + new_tile (Tile): The placement tile given for the second attempt. """ self.message = ( f"{cls} already placed at {current_tile}; cannot place at {new_tile}" diff --git a/python/iron/placers.py b/python/iron/placers.py index b4555431ac9..d45dc4e070c 100644 --- a/python/iron/placers.py +++ b/python/iron/placers.py @@ -5,6 +5,7 @@ # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception # # (c) Copyright 2024 Advanced Micro Devices, Inc. 
+"""Placement algorithms that assign IRON program components to physical device tiles.""" from abc import ABCMeta, abstractmethod from typing import Optional @@ -30,7 +31,7 @@ def make_placement( workers: list[Worker], object_fifos: list[ObjectFifoHandle], ): - """Assign placement informatio to a program. + """Assign placement information to a program. Args: device (Device): The device to use for placement. @@ -56,6 +57,12 @@ class SequentialPlacer(Placer): """ def __init__(self, cores_per_col: Optional[int] = None): + """Initialize a SequentialPlacer. + + Args: + cores_per_col (int | None, optional): Maximum number of workers to place per + column. If None, all available compute tiles are used. Defaults to None. + """ super().__init__() self.cores_per_col = cores_per_col @@ -66,6 +73,17 @@ def make_placement( workers: list[Worker], object_fifos: list[ObjectFifoHandle], ): + """Assign placement to all unplaced Workers and ObjectFIFO endpoints. + + Args: + device (Device): The device to use for placement. + rt (Runtime): The runtime information for the program. + workers (list[Worker]): The workers included in the program. + object_fifos (list[ObjectFifoHandle]): The object fifo handles used by the program. + + Raises: + ValueError: If there are not enough tiles available for placement. + """ # Keep track of tiles available for placement based # on number of available input / output DMA channels shims_in = device.get_shim_tiles() diff --git a/python/iron/program.py b/python/iron/program.py index 8e2bbe7ecc0..928d3d4b53c 100644 --- a/python/iron/program.py +++ b/python/iron/program.py @@ -29,8 +29,8 @@ def __init__( ): """A Program represents all design information needed to run the design on a device. - ctx.module.operation.verify() is called within this function to verify the correctness - of the MLIR module. + Note: MLIR verification (``ctx.module.operation.verify()``) is performed inside + :meth:`resolve_program`, not during construction. 

         Args:
             device (Device): The device used to generate the final MLIR for the design.
diff --git a/python/iron/resolvable.py b/python/iron/resolvable.py
index 04d94ac28a3..3b3c8e9bef0 100644
--- a/python/iron/resolvable.py
+++ b/python/iron/resolvable.py
@@ -5,7 +5,7 @@
 # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 #
 # (c) Copyright 2024 Advanced Micro Devices, Inc.
-
+"""Abstract base class for objects that lower to MLIR operations."""
 from abc import ABC, abstractmethod

@@ -15,7 +15,7 @@ class Resolvable(ABC):

     @abstractmethod
     def resolve(
-        cls,
+        self,
         loc: ir.Location | None = None,
         ip: ir.InsertionPoint | None = None,
     ) -> None:
@@ -30,11 +30,8 @@ def resolve(


 class NotResolvedError(Exception):
-    """If the current object is Resolvable but the resolve() method has not been called,
-    before resolution information is accessed, they should raise this error.
-
-    Args:
-        Exception (_type_): _description_
+    """Raised when a property or operation is accessed on a :class:`Resolvable` object
+    before :meth:`resolve` has been called.
     """

     def __init__(self, message="Cannot get operation; class not resolved."):
diff --git a/python/iron/runtime/__init__.py b/python/iron/runtime/__init__.py
index 529b7e656cc..0c4def5bb28 100644
--- a/python/iron/runtime/__init__.py
+++ b/python/iron/runtime/__init__.py
@@ -1 +1,10 @@
+# __init__.py -*- Python -*-
+#
+# This file is licensed under the Apache License v2.0 with LLVM Exceptions.
+# See https://llvm.org/LICENSE.txt for license information.
+# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+#
+# (c) Copyright 2026 Advanced Micro Devices, Inc.
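The resolve-before-access contract that the rewritten `NotResolvedError` docstring describes can be demonstrated with a small self-contained sketch. The `Op` class and its `"mlir.op"` placeholder are illustrative inventions, not IRON classes; only the `NotResolvedError` message is taken from the diff:

```python
# Sketch of the Resolvable contract: accessing the lowered operation
# before resolve() has been called raises NotResolvedError.
from abc import ABC, abstractmethod


class NotResolvedError(Exception):
    def __init__(self, message="Cannot get operation; class not resolved."):
        super().__init__(message)


class Resolvable(ABC):
    @abstractmethod
    def resolve(self) -> None: ...


class Op(Resolvable):
    """Hypothetical object that lowers to an MLIR operation on resolve()."""

    def __init__(self):
        self._op = None

    def resolve(self) -> None:
        self._op = "mlir.op"  # placeholder for real MLIR emission

    @property
    def op(self):
        if self._op is None:
            raise NotResolvedError()
        return self._op


o = Op()
try:
    o.op  # accessed before resolution: raises
except NotResolvedError as e:
    print(e)  # → Cannot get operation; class not resolved.
o.resolve()
print(o.op)  # → mlir.op
```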
+"""Runtime: host-side data movement and worker execution orchestration."""
+
 from .runtime import Runtime
diff --git a/python/iron/runtime/data.py b/python/iron/runtime/data.py
index 336a5433853..f64cf31775c 100644
--- a/python/iron/runtime/data.py
+++ b/python/iron/runtime/data.py
@@ -40,7 +40,7 @@ def dtype(self) -> np.dtype:
         return np_ndarray_type_get_dtype(self._arr_type)

     @property
-    def arr_type(self) -> np.ndarray:
+    def arr_type(self) -> type[np.ndarray]:
         """The tensor type of the buffer."""
         return self._arr_type

diff --git a/python/iron/runtime/dmatask.py b/python/iron/runtime/dmatask.py
index b6db21e4b81..958f6a04df2 100644
--- a/python/iron/runtime/dmatask.py
+++ b/python/iron/runtime/dmatask.py
@@ -5,6 +5,7 @@
 # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 #
 # (c) Copyright 2024 Advanced Micro Devices, Inc.
+"""DMATask: a RuntimeTask that generates a shim DMA transfer operation."""

 from ... import ir  # type: ignore

@@ -43,7 +44,7 @@ def __init__(
         RuntimeTask.__init__(self, task_group)

     def will_wait(self) -> bool:
-        """If this tasks shoudl conclude with an await operation."""
+        """Whether this task should conclude with an await operation."""
         return self._wait

     @property
diff --git a/python/iron/runtime/endpoint.py b/python/iron/runtime/endpoint.py
index b6daccc0932..6a929965abd 100644
--- a/python/iron/runtime/endpoint.py
+++ b/python/iron/runtime/endpoint.py
@@ -17,7 +17,7 @@ class RuntimeEndpoint(ObjectFifoEndpoint):

     The placement of this Endpoint should be a Shim Tile.
     """

-    def __init__(self, placement: PlacementTile) -> RuntimeEndpoint:
+    def __init__(self, placement: PlacementTile) -> None:
         super().__init__(placement)

     def __eq__(self, other: object) -> bool:
diff --git a/python/iron/runtime/runtime.py b/python/iron/runtime/runtime.py
index 3c52313a096..e510a22338c 100644
--- a/python/iron/runtime/runtime.py
+++ b/python/iron/runtime/runtime.py
@@ -5,6 +5,7 @@
 # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 #
 # (c) Copyright 2024-2026 Advanced Micro Devices, Inc.
+"""Runtime: orchestrates host-side data movement and worker execution for an IRON program."""

 from __future__ import annotations
 from collections import defaultdict
@@ -44,17 +45,17 @@ class Runtime(Resolvable):
     need to be taken care of by the host/runtime in order to run a program.
     """

-    """This is used to generate unique task group ids"""
+    # Used to generate unique task group IDs within this Runtime.
     __task_group_index = 0

     def __init__(
         self,
         strict_task_groups: bool = True,
-    ) -> Runtime:
+    ) -> None:
         """Initialize a runtime object.

         Args:
-            check_task_groups: Disallows mixing the default group and explicit task groups during resolution.
+            strict_task_groups (bool): Disallows mixing the default group and explicit task groups during resolution.
                 This can catch common errors, but can be set to False to disable the checks.
         """

@@ -79,7 +80,7 @@ def sequence(self, *input_types: type[np.ndarray]):
             ValueError: If task groups are not finished within the sequence() context, and error will be raised.

         Yields:
-            _type_: Handles to the buffers matching the input types.
+            RuntimeData | tuple[RuntimeData, ...]: Handles to the runtime buffers matching the declared input types.
         """
         try:
             self._rt_data = list(map(RuntimeData, input_types))
@@ -133,7 +134,7 @@ def finish_task_group(self, task_group: RuntimeTaskGroup):

         This should be called within a Runtime.sequence() context.

         Args:
-            task_group (RuntimeTaskGroup): _description_
+            task_group (RuntimeTaskGroup): The task group to close. All associated tasks will be awaited or freed.
         """
         self._open_task_groups.remove(task_group)
         self._tasks.append(FinishTaskGroupTask(task_group))
@@ -190,9 +191,9 @@ def drain(

         Args:
             out_fifo (ObjectFifoHandle): The consumer ObjectFifoHandle.
             dest (RuntimeData): The output Runtime data buffer.
-            tap (TensorAccessPattern | None, optional): A way of specifying how data in the buffer is accessed when sending it to the in_fifo.
-                If None is given, this will default to a linear transfer containing all data in the source buffer. Defaults to None.
-            task_group (RuntimeTaskGroup | None, optional): A TaskGroup to associate this task with. Defaults to None. Defaults to None.
+            tap (TensorAccessPattern | None, optional): A way of specifying how data in the buffer is accessed when reading from the out_fifo.
+                If None is given, this will default to a linear transfer containing all data in the destination buffer. Defaults to None.
+            task_group (RuntimeTaskGroup | None, optional): A TaskGroup to associate this task with. Defaults to None.
             wait (bool, optional): Whether this Task should be awaited on or not. If not, it will be freed
                 when the task group is finished. Defaults to False.
             placement (PlacementTile, optional): The Shim tile to associate the data transfer with. Defaults to AnyShimTile.
@@ -242,13 +243,33 @@ def enable_trace(
         self,
         trace_size: int = None,
         trace_offset: int = None,
-        workers: [] = None,
+        workers: list | None = None,
         ddr_id: int = None,
-        coretile_events: [] = None,
-        memtile_events: [] = None,
-        shimtile_events: [] = None,
+        coretile_events: list | None = None,
+        memtile_events: list | None = None,
+        shimtile_events: list | None = None,
     ):
-        """Enable trace."""
+        """Enable hardware tracing for this program.
+
+        Configures the AIE trace units and routes trace packets to DDR via the shim DMA.
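The task-group lifecycle documented above (tasks accumulate under an open group; closing the group appends a finish marker, after which the group may no longer be used) can be modeled with a self-contained sketch. `MiniRuntime` and `TaskGroup` here are toy stand-ins that mirror, but do not reproduce, the IRON `Runtime`:

```python
# Toy model of Runtime task-group bookkeeping: open groups accept tasks,
# finish_task_group() closes the group and records a finish marker.
import itertools


class TaskGroup:
    _ids = itertools.count()

    def __init__(self):
        self.id = next(TaskGroup._ids)


class MiniRuntime:
    def __init__(self):
        self._open_groups = []
        self._tasks = []

    def task_group(self) -> TaskGroup:
        tg = TaskGroup()
        self._open_groups.append(tg)
        return tg

    def add_task(self, name: str, group: TaskGroup):
        if group not in self._open_groups:
            raise ValueError(f"Task group {group.id} is not open")
        self._tasks.append((name, group.id))

    def finish_task_group(self, group: TaskGroup):
        # Associated tasks are awaited or freed when the marker is processed.
        self._open_groups.remove(group)
        self._tasks.append(("finish", group.id))


rt = MiniRuntime()
tg = rt.task_group()
rt.add_task("fill", tg)
rt.add_task("drain", tg)
rt.finish_task_group(tg)
print([name for name, _ in rt._tasks])  # → ['fill', 'drain', 'finish']
```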
+        Should be called within a :meth:`sequence` context before data movement operations.
+
+        Args:
+            trace_size (int | None, optional): Size of the trace buffer in bytes.
+            trace_offset (int | None, optional): Byte offset into the DDR buffer where trace
+                data should begin. Defaults to None (treated as 0).
+            workers (list[Worker] | None, optional): Specific workers to trace. If None,
+                all workers with ``trace`` set will be traced. Defaults to None.
+            ddr_id (int | None, optional): XRT inout buffer index to write trace data into.
+                Defaults to None (treated as 4, the conventional last buffer slot).
+            coretile_events (list | None, optional): List of up to 8 core tile trace events.
+                See ``python/utils/trace_events_enum.py`` for available events.
+                Defaults to None (uses hardware defaults).
+            memtile_events (list | None, optional): List of up to 8 mem tile trace events.
+                Defaults to None (uses hardware defaults).
+            shimtile_events (list | None, optional): List of up to 8 shim tile trace events.
+                Defaults to None (uses hardware defaults).
+        """
         self._trace_size = trace_size
         self._trace_offset = trace_offset
         self._trace_workers = workers
diff --git a/python/iron/runtime/taskgroup.py b/python/iron/runtime/taskgroup.py
index 587110e32cc..082a95f9390 100644
--- a/python/iron/runtime/taskgroup.py
+++ b/python/iron/runtime/taskgroup.py
@@ -5,10 +5,11 @@
 # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 #
 # (c) Copyright 2024 Advanced Micro Devices, Inc.
+"""RuntimeTaskGroup: a tag for grouping related RuntimeTasks for concurrent execution."""


 class RuntimeTaskGroup:
-    """A RuntimeTaskGroup is a structured tag to indicated groupings of RuntimeTasks."""
+    """A RuntimeTaskGroup is a structured tag to indicate groupings of RuntimeTasks."""

     def __init__(self, id: int):
         """Construct a RuntimeTaskGroup
diff --git a/python/iron/worker.py b/python/iron/worker.py
index 10f5225d056..3616492c386 100644
--- a/python/iron/worker.py
+++ b/python/iron/worker.py
@@ -5,6 +5,8 @@
 # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 #
 # (c) Copyright 2024 Advanced Micro Devices, Inc.
+"""Worker and WorkerRuntimeBarrier: compute-core tasks and runtime synchronization primitives."""
+
 import sys
 from typing import Callable

@@ -20,12 +22,12 @@


 class Worker(ObjectFifoEndpoint):
-    """_summary_
-    Worker is an object that takes a `core_fn` and a set of arguments.
-    A Worker must be placed on a Compute Core.
-    """
+    """A task to be run on an AIE compute core.

-    """This variable is the current core if resolving() within the Worker, or None otherwise."""
+    A Worker takes a ``core_fn`` callable and the arguments it needs (ObjectFIFO handles,
+    Buffers, Kernels, etc.). Each Worker must be placed on a single compute tile, either
+    explicitly via ``placement`` or automatically by a :class:`~aie.iron.placers.Placer`.
+    """

     def __init__(
         self,
@@ -49,6 +51,7 @@ def __init__(
             allocation_scheme (str, optional): The memory allocation scheme to use for the Worker,
                 either 'basic-sequential' or 'bank-aware'. If None, defaults to bank-aware.
                 Will override any allocation scheme set on the tile given as placement.
             trace (int, optional): If >0, enable tracing for this worker.
+            trace_events (list | None, optional): Custom list of trace events for this worker. Defaults to None.

         Raises:
             ValueError: Parameters are validated.
@@ -118,7 +121,7 @@ def fifos(self) -> list[ObjectFifoHandle]:

     @property
     def buffers(self) -> list[Buffer]:
-        """Returns a list of Buffer given to the Worker via fn_args.
+        """Returns a list of Buffers given to the Worker via fn_args.

         Returns:
             list[Buffer]: Buffer used by the Worker.
@@ -150,6 +153,11 @@ class WorkerRuntimeBarrier:
     """A barrier allowing individual workers to synchronize with the runtime sequence."""

     def __init__(self, initial_value: int = 0):
+        """Initialize a WorkerRuntimeBarrier.
+
+        Args:
+            initial_value (int, optional): The initial lock value. Defaults to 0.
+        """
         self.initial_value = initial_value
         self.worker_locks = []

@@ -197,6 +205,12 @@ class _BarrierSetOp(Resolvable):
     """A resolvable instance of a WorkerRuntimeBarrier. This should not be used directly."""

     def __init__(self, barrier: WorkerRuntimeBarrier, value: int):
+        """Construct a _BarrierSetOp.
+
+        Args:
+            barrier (WorkerRuntimeBarrier): The barrier whose value will be set.
+            value (int): The value to set.
+        """
         self.barrier: WorkerRuntimeBarrier = barrier
         self.value: int = value

diff --git a/python/utils/__init__.py b/python/utils/__init__.py
index 46770a5eb7d..96f83b58a5d 100644
--- a/python/utils/__init__.py
+++ b/python/utils/__init__.py
@@ -5,6 +5,8 @@
 # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 #
 # (c) Copyright 2025-2026 Advanced Micro Devices, Inc.
+"""Tensor factories, device helpers, and re-exports for the IRON runtime."""
+
 import logging
 import os

diff --git a/python/utils/compile/__init__.py b/python/utils/compile/__init__.py
index 9206688a30e..ad94cc321a4 100644
--- a/python/utils/compile/__init__.py
+++ b/python/utils/compile/__init__.py
@@ -5,6 +5,7 @@
 # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 #
 # (c) Copyright 2025-2026 Advanced Micro Devices, Inc.
+"""Compilation utilities: MLIR module compilation, kernel linking, and cache management."""

 import os
 from pathlib import Path
diff --git a/python/utils/compile/utils.py b/python/utils/compile/utils.py
index 113567a3666..f1ec0dd9632 100644
--- a/python/utils/compile/utils.py
+++ b/python/utils/compile/utils.py
@@ -5,10 +5,13 @@
 # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 #
 # (c) Copyright 2025-2026 Advanced Micro Devices, Inc.
+"""Low-level helpers for compiling MLIR modules and external C++ kernels to NPU artifacts."""
+
 import logging
 import os
 import shutil
 import subprocess
+from pathlib import Path

 import aie.compiler.aiecc.main as aiecc
 import aie.utils.config as config
@@ -77,33 +80,24 @@ def compile_cxx_core_function(


 def compile_mlir_module(
     mlir_module: str,
-    insts_path: str | None = None,
-    pdi_path: str | None = None,
-    xclbin_path: str | None = None,
+    insts_path: str | Path | None = None,
+    pdi_path: str | Path | None = None,
+    xclbin_path: str | Path | None = None,
     verbose=False,
-    work_dir: str | None = None,
+    work_dir: str | Path | None = None,
     options=None,
 ):
     """
-    Compile an MLIR module to instruction, PDI, and/or xclbin files using aiecc.
-
-    By default uses the Peano compiler backend (--no-xchesscc --no-xbridge).
-    Pass additional flags via ``options`` to override.
-
-    When ``work_dir`` is provided, the MLIR is written to a file inside that
-    directory so that the C++ aiecc binary resolves relative ``link_with``
-    paths on ``func.func`` declarations against the same directory where
-    ``compile_external_kernel`` placed the compiled object files.
-
-    Args:
-        mlir_module: MLIR module to compile.
-        insts_path: Output path for the NPU instruction binary.
-        pdi_path: Output path for the PDI file.
-        xclbin_path: Output path for the xclbin package.
-        verbose: If True, pass --verbose to aiecc.
-        work_dir: Compilation working directory; also determines where the
-            MLIR input file is written when invoking the C++ aiecc binary.
-        options: Additional aiecc command-line options.
+    Compile an MLIR module to instruction, PDI, and/or xclbin files using the aiecc module.
+    This function supports only the Peano compiler backend.
+
+    Args:
+        mlir_module (str): MLIR module to compile.
+        insts_path (str | Path | None): Output path for the NPU instruction binary.
+        pdi_path (str | Path | None): Output path for the PDI file.
+        xclbin_path (str | Path | None): Output path for the xclbin package.
+        verbose (bool): If True, enable verbose output.
+        work_dir (str | Path | None): Compilation working directory; also where the MLIR
+            input file is written when invoking the C++ aiecc binary.
+        options (list[str] | None): Additional aiecc command-line options.
     """

     args = [
diff --git a/python/utils/hostruntime/__init__.py b/python/utils/hostruntime/__init__.py
index 56e1acc3cfa..e541629a358 100644
--- a/python/utils/hostruntime/__init__.py
+++ b/python/utils/hostruntime/__init__.py
@@ -5,6 +5,8 @@
 # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 #
 # (c) Copyright 2026 Advanced Micro Devices, Inc.
+"""Host runtime utilities: device selection, tensor allocation, and numerical helpers."""
+
 from typing import TYPE_CHECKING
 from ml_dtypes import bfloat16
 import numpy as np
diff --git a/python/utils/jit.py b/python/utils/jit.py
index 5a90b0cb6c0..cf48651c050 100644
--- a/python/utils/jit.py
+++ b/python/utils/jit.py
@@ -5,6 +5,7 @@
 # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 #
 # (c) Copyright 2025-2026 Advanced Micro Devices, Inc.
+"""JIT decorator for compiling and running IRON-decorated functions on the NPU."""

 import os
 import functools
@@ -64,6 +65,10 @@ def decorator(*args, **kwargs):
         cache_key = _create_function_cache_key(function, args, kwargs)
         if effective_use_cache and cache_key in _compiled_kernels:
             cached_kernel = _compiled_kernels[cache_key]
+            if cached_kernel is None:
+                raise RuntimeError(
+                    f"Cached kernel for '{function.__name__}' is None; this is a bug."
+                )
             # Filter out non-tensor arguments (ExternalFunction, scalars)
             # Only tensor args should be passed to the kernel
             tensor_args = _filter_tensor_args(args)
diff --git a/python/utils/npukernel.py b/python/utils/npukernel.py
index dd99fe06839..d0388c9f029 100644
--- a/python/utils/npukernel.py
+++ b/python/utils/npukernel.py
@@ -36,6 +36,7 @@ def __init__(
         self._insts_path = insts_path
         self._kernel_name = kernel_name
         self._trace_config = trace_config
+        self._device_index = device_index

     @property
     def trace_config(self) -> TraceConfig | None:
@@ -88,10 +89,12 @@ def __call__(self, *args, **kwargs):
             **kwargs: Additional arguments passed to the runtime load_and_run method.

         Returns:
-            KernelResult: The result of the kernel execution.
+            The result returned by the runtime ``load_and_run`` call.
         """
         from . import DefaultNPURuntime

+        if DefaultNPURuntime is None:
+            raise RuntimeError("Cannot run kernel; DefaultNPURuntime is not set.")
         return DefaultNPURuntime.load_and_run(
             self,
             list(args),