Merged
18 commits
c368b98
Fix bugs, broken links, and typos in programming_guide and programmin…
hunhoffe Mar 11, 2026
501a37e
Improve getting_started examples: prereqs, bug fixes, clarity
hunhoffe Mar 11, 2026
95f4b45
Fix bugs and inconsistencies in programming_guide identified in audit
hunhoffe Mar 11, 2026
3acf9d9
section-4c: acknowledge hardware-dependent cycle count variance
hunhoffe Mar 11, 2026
1065149
section-2d DMATasks: remove unsourced API preference claim
hunhoffe Mar 11, 2026
e0532a0
section-4c: revert cycle count changes, keep valid fixes
hunhoffe Mar 11, 2026
39ebbc8
Fix accuracy, completeness, and consistency issues in programming_exa…
hunhoffe Mar 11, 2026
cbbbbeb
Format changed Python files with black
hunhoffe Mar 11, 2026
164ec68
Fix bugs, placeholder text, and docstring quality in python/iron and …
hunhoffe Mar 12, 2026
a38de02
remove old file that was previously renamed
hunhoffe Mar 12, 2026
5e83ba6
Fix swapped docstrings in device.py and inaccurate algorithms descrip…
hunhoffe Mar 12, 2026
04fe440
Fix pyright type errors in python/utils (jit, npukernel, compile/utils)
hunhoffe Mar 12, 2026
08939a8
Merge branch 'main' into tutorial-review
hunhoffe Mar 12, 2026
3d022ad
Merge main into tutorial-review + fix resnet.py RTP bug
hunhoffe Mar 12, 2026
ccf9b83
Fix remaining audit issues in tutorial-review branch
hunhoffe Mar 12, 2026
8e0bc80
Fix inaccurate documentation across programming_guide
hunhoffe Mar 12, 2026
84b5bf3
more fixes
hunhoffe Mar 12, 2026
182543e
Merge branch 'main' into tutorial-review
hunhoffe Mar 13, 2026
8 changes: 6 additions & 2 deletions programming_examples/README.md
Original file line number Diff line number Diff line change
@@ -18,11 +18,15 @@ Each IRON example has one or more implementations:

They are organized into the following directories:

## [getting_started](./getting_started)
## [getting_started](./getting_started)

Designs tailored to the new-user experience, spanning from basic applications such as SAXPY to more complex ones such as tiled matrix multiplication, for the NPU in Ryzen™ AI.

## [basic](./basic)
## [algorithms](./algorithms)

Higher-level algorithm templates (transform, for_each, and parallel variants) that handle Workers, ObjectFIFOs, and data movement automatically for common element-wise dataflow patterns on the NPU in Ryzen™ AI.

## [basic](./basic)

Basic building blocks to understand the NPU architecture and first steps towards building applications for the NPU in Ryzen™ AI.

35 changes: 24 additions & 11 deletions programming_examples/basic/README.md
@@ -12,17 +12,30 @@

These programming examples provide a good starting point to illustrate how to build commonly used compute kernels (both single-core and multi-core data processing pipelines). They serve to highlight how designs can be described in Python and lowered through the mlir-aie tool flow to an executable that runs on the NPU. [Passthrough Kernel](./passthrough_kernel) and [Vector Scalar Mul](./vector_scalar_mul) are good designs to get started with. Please see [section 3](../../programming_guide/section-3/) of the [programming guide](../../programming_guide/) for a more detailed guide on developing designs.

* [Passthrough DMAs](./passthrough_dmas) - This design demonstrates data movement to implement a memcpy operation using object FIFOs just using DMAs without involving the AIE core.
* [Passthrough Kernel](./passthrough_kernel) - This design demonstrates a simple AIE implementation for vectorized memcpy on a vector of integer involving AIE core kernel programming.
* [DMA Transpose](./dma_transpose) - Transposes a matrix with the Shim DMA using `npu_dma_memcpy_nd`
* [Vector Scalar Add](./vector_scalar_add) - Single tile performs a very simple `+` operation where the kernel loads data from local memory, increments the value by `1` and stores it back.
* [Vector Scalar Mul](./vector_scalar_mul) - Single tile performs `vector * scalar` of size `4096`. The kernel does a `1024` vector multiply and is invoked multiple times to complete the full `vector * scalar` compute.
* [Passthrough DMAs](./passthrough_dmas) - Data movement memcpy using object FIFOs via DMAs only, without involving the AIE core.
* [Passthrough Kernel](./passthrough_kernel) - Vectorized memcpy via a single AIE core kernel.
* [Passthrough PyKernel](./passthrough_pykernel) - Memcpy where the AIE kernel is written as an inline Python function rather than a C++ external function.
* [Passthrough DMAs PLIO](./passthrough_dmas_plio) - **Targets the Xilinx VCK5000, not Ryzen AI NPU.** Demonstrates PLIO-connected soft DMAs in programmable logic.
* [DMA Transpose](./dma_transpose) - Matrix transpose using the Shim DMA with `npu_dma_memcpy_nd`.
* [DMA Transpose Packet](./dma_transpose_packet) - Matrix transpose using packet-switched DMA flows.
* [Chaining Channels](./chaining_channels) - Demonstrates chaining multiple DMA buffer descriptors in sequence on a single channel.
* [Combined Transpose](./combined_transpose) - Matrix transpose combining Shim DMA strides with AIE core VSHUFFLE instructions.
* [Shuffle Transpose](./shuffle_transpose) - Matrix transpose using only AIE core VSHUFFLE instructions.
* [Vector Scalar Add](./vector_scalar_add) - Single tile increments every element of a vector by `1`.
* [Vector Scalar Mul](./vector_scalar_mul) - Single tile performs `vector * scalar` of size `4096` in `1024`-element chunks.
* [Vector Scalar Add Runlist](./vector_scalar_add_runlist) - Vector scalar add using the run-list execution model.
* [Vector Vector Add](./vector_vector_add) - Single tile performs `vector + vector` of size `1024`.
* [Vector Vector Add BDs Init Values](./vector_vector_add_BDs_init_values) - Vector addition with buffer descriptors pre-initialized with values.
* [Vector Vector Modulo](./vector_vector_modulo) - Single tile performs `vector % vector` of size `1024`.
* [Vector Vector Multiply](./vector_vector_mul) - Single tile performs `vector * vector` of size `1024`.
* [Vector Reduce Add](./vector_reduce_add) - Single tile performs a reduction of a vector to return the `sum` of the elements.
* [Vector Reduce Max](./vector_reduce_max) - Single tile performs a reduction of a vector to return the `max` of the elements.
* [Vector Reduce Min](./vector_reduce_min) - Single tile performs a reduction of a vector to return the `min` of the elements.
* [Vector Exp](./vector_exp) - A simple element-wise exponent function, using the look up table capabilities of the AI Engine.
* [Matrix Scalar Add](./matrix_scalar_add) - Single tile performs `matrix * vector` with matrix size of `16x8`.
* [Matrix Multiplication](./matrix_multiplication) - This directory contains multiple designs spanning: single core and multi-core (whole array) matrix-matrix multiplication, and matrix-vector multiplication designs. It also contains sweep infrastructure for benchmarking.
* [Vector Reduce Add](./vector_reduce_add) - Single tile reduction returning the `sum` of a vector.
* [Vector Reduce Max](./vector_reduce_max) - Single tile reduction returning the `max` of a vector.
* [Vector Reduce Min](./vector_reduce_min) - Single tile reduction returning the `min` of a vector.
* [Vector Exp](./vector_exp) - Element-wise $e^x$ using the AIE look-up table capability.
* [Matrix Scalar Add](./matrix_scalar_add) - Single tile adds a scalar constant to every element of a `16x8` matrix.
* [Matrix Multiplication](./matrix_multiplication) - Single-core, multi-core (whole array), and matrix-vector multiply designs, plus sweep benchmarking infrastructure.
* [Row Wise Bias Add](./row_wise_bias_add) - Adds a bias vector to each row of a matrix using DMA tiling.
* [Event Trace](./event_trace) - Demonstrates the AIE hardware trace unit for measuring kernel cycle counts and stall events. See also [Section 4b](../../programming_guide/section-4/section-4b/) of the programming guide.
* [Packet Switch](./packet_switch) - Demonstrates packet-switched routing for multiplexing multiple data streams over shared interconnect.
* [Tiling Exploration](./tiling_exploration) - Interactive exploration of `TensorAccessPattern` and `TensorTiler2D` for n-dimensional DMA tiling. Includes visualization tools.
* [Memcpy](./memcpy) - **Exercise design.** A parameterized multi-column memcpy with an intentionally unoptimized runtime sequence. The goal is to add task groups to achieve peak bandwidth. See [getting_started/00_memcpy](../getting_started/00_memcpy/) for the reference solution.
6 changes: 4 additions & 2 deletions programming_examples/basic/memcpy/README.md
@@ -1,8 +1,10 @@
---
---

# **Memcpy**

The `memcpy.py` design is a highly parallel, parameterized design that uses shim DMAs in every NPU column. It enables both compute and bypass modes to help you analyze performance charactaristics.
> **Exercise Design:** The runtime sequence in `memcpy.py` is intentionally left unoptimized — drain operations run serially rather than in parallel, which limits measured bandwidth. Your task is to restructure the runtime sequence using `task_group()` to achieve full concurrency across all columns and channels. See Step 4 below for guidance, and [getting_started/00_memcpy/memcpy.py](../../../getting_started/00_memcpy/memcpy.py) for the reference solution.

The `memcpy.py` design is a highly parallel, parameterized design that uses shim DMAs in every NPU column. It enables both compute and bypass modes to help you analyze performance characteristics.

---

2 changes: 2 additions & 0 deletions programming_examples/basic/passthrough_dmas_plio/README.md
@@ -10,6 +10,8 @@

# <ins>Passthrough DMAs with PLIO</ins>

> **Hardware Note:** This design targets the **Xilinx VCK5000 Versal evaluation board**, not a Ryzen AI NPU. It will not build or run on Ryzen AI hardware.

This reference design can be run on the VCK5000 Versal device. This design leverages the same data movement pattern as the [Passthrough DMAs](../passthrough-dmas) example design but it uses a soft DMA. Please see the [platforms repo](https://github.com/Xilinx/ROCm-air-platforms) for more information on how the programmable logic is integrated with the AIEs. This is meant to be an illustrative example to highlight how to integrate PL designs with AIE designs programmed using mlir-aie.

In the platform, tile (26, 0) has PLIO connected to a DMA implemented in the programmable logic. There are two designs: `aie2-input-plio.py` uses the soft DMA to push data from DRAM into the AIEs, whereas `aie2-output-plio.py` uses the soft DMA to receive data from the AIEs and push it to DRAM. The soft DMA is programmed using the same mechanism as the ShimDMAs.
2 changes: 1 addition & 1 deletion programming_examples/basic/passthrough_kernel/README.md
@@ -10,7 +10,7 @@

# Passthrough Kernel:

This IRON design flow example, called "Passthrough Kernel", demonstrates a simple AIE implementation for vectorized memcpy on a vector of integers. In this design, a single AIE core performs the memcpy operation on a vector with a default length `4096`. The kernel is configured to work on `1024` element-sized subvectors and is invoked multiple times to complete the full copy. The example consists of two primary design files: `aie2.py` and `passThrough.cc`, and a testbench `test.cpp` or `test.py`.
This IRON design flow example, called "Passthrough Kernel", demonstrates a simple AIE implementation for vectorized memcpy on a vector of integers. In this design, a single AIE core performs the memcpy operation on a vector with a default length `4096`. The kernel is configured to work on `1024` element-sized subvectors and is invoked multiple times to complete the full copy. The example consists of two primary design files: `passthrough_kernel.py` and `passThrough.cc`, and a testbench `test.cpp` or `test.py`.

## Source Files Overview

6 changes: 3 additions & 3 deletions programming_examples/basic/passthrough_pykernel/README.md
@@ -10,7 +10,7 @@

# Passthrough Kernel:

This IRON design flow example, called "Passthrough Kernel", demonstrates a simple AIE implementation for a non-vectorized (scalar) memcpy on a vector of integers. In this design, a single AIE core performs the memcpy operation on a vector with a default length `4096`. The kernel, defined in Python code as a function, is configured to work on `1024` element-sized subvectors and is invoked multiple times to complete the full copy. The example consists of two primary design files: `passthrough_pykernel.py` and `passThrough.cc`, and a testbench `test.cpp` or `test.py`.
This IRON design flow example, called "Passthrough Kernel", demonstrates a simple AIE implementation for a non-vectorized (scalar) memcpy on a vector of integers. In this design, a single AIE core performs the memcpy operation on a vector with a default length `4096`. The kernel, defined in Python code as a function, is configured to work on `1024` element-sized subvectors and is invoked multiple times to complete the full copy. The example consists of a primary design file `passthrough_pykernel.py` and a testbench `test.cpp` or `test.py`.

## Source Files Overview

@@ -31,7 +31,7 @@ This IRON design flow example, demonstrates a simpl
This simple example effectively passes data through a single compute tile in the NPU's AIE array. The design is described as shown in the figure to the right. The overall design flow is as follows:
1. An object FIFO called "of_in" connects a Shim Tile to a Compute Tile, and another called "of_out" connects the Compute Tile back to the Shim Tile.
1. The runtime data movement is expressed to read `4096` uint8_t data from host memory to the compute tile and write the `4096` data back to host memory.
1. The compute tile acquires this input data in "object" sized (`1024`) blocks from "of_in" and copies them to another output "object" it has acquired from "of_out". A scalar kernel defined via a Python fucntion is invoked on the Compute Tile's AIE core to copy the data from the input "object" to the output "object".
1. The compute tile acquires this input data in "object" sized (`1024`) blocks from "of_in" and copies them to another output "object" it has acquired from "of_out". A scalar kernel defined via a Python function is invoked on the Compute Tile's AIE core to copy the data from the input "object" to the output "object".
1. After the copy is performed, the Compute Tile releases the "objects", allowing the DMAs (abstracted by the object FIFO) to transfer the data back to host memory and copy additional blocks into the Compute Tile, "of_out" and "of_in" respectively.

It is important to note that the Shim Tile and Compute Tile DMAs move data concurrently, and the Compute Tile's AIE Core also processes data concurrently with the data movement. This is made possible by specifying a `depth` of `2` when constructing the `ObjectFifo`, for example `ObjectFifo(line_ty, depth=2)`, to denote ping-pong buffers. By default, the depth is `2` in recognition of this common pattern.
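The benefit of that ping-pong pattern can be sketched with a toy pure-Python model. This is an illustration only, not the IRON API: `total_steps` is a hypothetical helper that counts pipeline steps, assuming the producer (DMA) and consumer (AIE core) each take one step per block and run concurrently whenever buffer availability allows.

```python
def total_steps(num_blocks: int, depth: int) -> int:
    """Steps to push num_blocks through a produce->consume pipeline
    whose FIFO holds `depth` buffers (one step per stage per block)."""
    produced = consumed = in_fifo = steps = 0
    while consumed < num_blocks:
        # Availability is checked at the start of the step, so a buffer
        # freed this step cannot be refilled until the next step.
        can_produce = produced < num_blocks and in_fifo < depth
        can_consume = in_fifo > 0
        if can_consume:  # consumer drains one full buffer
            consumed += 1
            in_fifo -= 1
        if can_produce:  # producer fills one free buffer, concurrently
            produced += 1
            in_fifo += 1
        steps += 1
    return steps

print(total_steps(8, depth=1))  # -> 16: single buffer, stages serialize
print(total_steps(8, depth=2))  # -> 9: ping-pong, produce/consume overlap
```

With `depth=1` each block costs two steps (fill, then drain); with `depth=2` the fills and drains overlap and the pipeline approaches one block per step.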
@@ -64,7 +64,7 @@ This design performs a memcpy operation on a vector of input data. The AIE desig

## Usage

### Compile the desing:
### Compile the design:

To compile the design:

2 changes: 1 addition & 1 deletion programming_examples/basic/vector_exp/README.md
@@ -46,7 +46,7 @@ env use_placed=1 make

To compile the C++ testbench:
```shell
make text_exp.exe
make vector_exp.exe
```

To run the design:
4 changes: 2 additions & 2 deletions programming_examples/basic/vector_scalar_mul/README.md
@@ -22,9 +22,9 @@ This IRON design flow example, called "Vector Scalar Multiplication", demonstrat

1. `vector_scalar_mul_jit.py`: A JIT version that passes `scale.cc` to the transform algorithm. JIT compilation allows combining the host code with AIE design into one file.

1. `test.cpp`: This C++ code is a testbench for the Vector Scalar Multiplication design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the script verifies the memcpy results and optionally outputs trace data.
1. `test.cpp`: This C++ code is a testbench for the Vector Scalar Multiplication design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the testbench verifies the vector scalar multiply results against a CPU reference and optionally outputs trace data.

1. `test.py`: This Python code is a testbench for the Vector Scalar Multiplication design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the script verifies the memcpy results and optionally outputs trace data.
1. `test.py`: This Python code is a testbench for the Vector Scalar Multiplication design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the testbench verifies the vector scalar multiply results against a CPU reference and optionally outputs trace data.

## Design Overview

10 changes: 10 additions & 0 deletions programming_examples/getting_started/00_memcpy/memcpy.py
@@ -122,6 +122,16 @@ def core_fn(of_in, of_out, passThroughLine):
# Create a TensorAccessPattern for each channel to describe the data movement.
# The pattern chops the data in equal chunks and moves them in parallel across
# the columns and channels.
#
# TensorAccessPattern arguments (see programming_guide/section-2/section-2c/
# for a full explanation of data layout transformations):
# tensor_dims : logical shape of the full transfer buffer — (1, size)
# offset : starting element index into that buffer for this chunk
# sizes : [dim3, dim2, dim1, dim0] — number of elements in each
# dimension. [1, 1, 1, chunk] means a single 1-D transfer
# of `chunk` elements (the higher dimensions are unused).
# strides : [dim3, dim2, dim1, dim0] — step between elements in each
# dimension. [0, 0, 0, 1] means contiguous (stride-1) access.
taps = [
TensorAccessPattern(
(1, size),
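The `sizes`/`strides` scheme described in the comment above can be modeled in a few lines of plain Python. `access_offsets` is a hypothetical helper (not the mlir-aie `taplib` API) that expands a `[dim3, dim2, dim1, dim0]` sizes/strides pair into the flat element offsets a DMA buffer descriptor would visit:

```python
from itertools import product

def access_offsets(offset, sizes, strides):
    """Flat element offsets visited by a 4-D access pattern,
    iterating dim3 outermost and dim0 innermost."""
    return [
        offset + sum(i * s for i, s in zip(idx, strides))
        for idx in product(*(range(n) for n in sizes))
    ]

# sizes [1, 1, 1, chunk] with strides [0, 0, 0, 1]: a contiguous 1-D copy
print(access_offsets(8, [1, 1, 1, 4], [0, 0, 0, 1]))  # -> [8, 9, 10, 11]
# a higher dimension with a larger stride skips between rows
print(access_offsets(0, [1, 1, 2, 3], [0, 0, 8, 1]))  # -> [0, 1, 2, 8, 9, 10]
```

The second call shows how a non-trivial stride in `dim1` reads two 3-element rows that are 8 elements apart, the building block of the strided and transpose patterns used elsewhere in these examples.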
10 changes: 7 additions & 3 deletions programming_examples/getting_started/01_SAXPY/saxpy.cc
@@ -15,11 +15,11 @@
#include <stdio.h>
#include <stdlib.h>

#define REL_WRITE 0
#define REL_READ 1

#include <aie_api/aie.hpp>

// NOTE: Both kernels below are hardcoded for N=4096 elements. The Python
// design file (saxpy.py) must be called with a tensor of exactly this size.
// Calling with any other size will produce silently incorrect results.
extern "C" {
void saxpy(bfloat16 *restrict x, bfloat16 *restrict y, bfloat16 *restrict z) {
event0();
Expand All @@ -39,6 +39,10 @@ void saxpy(bfloat16 *restrict x, bfloat16 *restrict y, bfloat16 *restrict z) {
event1();
}

// saxpy_scalar: a non-vectorized reference implementation of SAXPY.
// Useful for verifying correctness and understanding the algorithm before
// examining the vectorized version above. Can be selected from Python by
// changing the ExternalFunction name from "saxpy" to "saxpy_scalar".
void saxpy_scalar(bfloat16 *x, bfloat16 *y, bfloat16 *z) {
event0();
float a = 3.f;
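Both kernels above compute `z[i] = a * x[i] + y[i]` with the constant `a = 3` hardcoded in `saxpy.cc`. A minimal pure-Python sketch of that reference computation (an illustrative model, not repo code; it uses host floats and ignores bfloat16 rounding):

```python
def saxpy_ref(x, y, a=3.0):
    """CPU reference for SAXPY: z = a*x + y, elementwise."""
    return [a * xi + yi for xi, yi in zip(x, y)]

print(saxpy_ref([1.0, 2.0], [0.5, -1.0]))  # -> [3.5, 5.0]
```

This mirrors the `ref_vec` check in `saxpy.py` below and is what `saxpy_scalar` computes one element at a time on the AIE core.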
23 changes: 8 additions & 15 deletions programming_examples/getting_started/01_SAXPY/saxpy.py
@@ -10,11 +10,9 @@
import os

import aie.iron as iron
from aie.iron import ExternalFunction, jit
from aie.iron import Kernel, ObjectFifo, Program, Runtime, Worker
from aie.iron import ExternalFunction
from aie.iron import ObjectFifo, Program, Runtime, Worker
from aie.iron.placers import SequentialPlacer
from aie.iron.controlflow import range_
from aie.helpers.taplib import TensorAccessPattern, TensorTiler2D
from aie.utils.config import cxx_header_path


@@ -86,8 +84,10 @@ def core_body(of_x, of_y, of_z, saxpy_kernel):


def main():
# Define tensor shapes and data types
data_size = 2048
# Define tensor shapes and data types.
# NOTE: saxpy.cc hardcodes the loop bound to 4096 elements. This value
# must match data_size or the kernel will produce silently wrong results.
data_size = 4096
element_type = bfloat16

# Construct an input tensor and an output zeroed tensor
@@ -100,21 +100,14 @@
# to the kernel will use the same compiled kernel and loaded code objects
saxpy(input0, input1, output)

# Check the correctness of the result and print
# Check the correctness of the result and print any mismatches
ref_vec = [3 * input0[i] + input1[i] for i in range(data_size)]

errors = 0
for index, (actual, ref) in enumerate(
zip(
output,
ref_vec,
)
):
for index, (actual, ref) in enumerate(zip(output, ref_vec)):
if actual != ref:
print(f"Error at {index}: {actual} != {ref}")
errors += 1
else:
print(f"Correct output at {index}: {actual} == {ref}")

# If the result is correct, exit with a success code
# Otherwise, exit with a failure code