Merged
18 commits
c368b98
Fix bugs, broken links, and typos in programming_guide and programmin…
hunhoffe Mar 11, 2026
501a37e
Improve getting_started examples: prereqs, bug fixes, clarity
hunhoffe Mar 11, 2026
95f4b45
Fix bugs and inconsistencies in programming_guide identified in audit
hunhoffe Mar 11, 2026
3acf9d9
section-4c: acknowledge hardware-dependent cycle count variance
hunhoffe Mar 11, 2026
1065149
section-2d DMATasks: remove unsourced API preference claim
hunhoffe Mar 11, 2026
e0532a0
section-4c: revert cycle count changes, keep valid fixes
hunhoffe Mar 11, 2026
39ebbc8
Fix accuracy, completeness, and consistency issues in programming_exa…
hunhoffe Mar 11, 2026
cbbbbeb
Format changed Python files with black
hunhoffe Mar 11, 2026
164ec68
Fix bugs, placeholder text, and docstring quality in python/iron and …
hunhoffe Mar 12, 2026
a38de02
remove old file that was previously renamed
hunhoffe Mar 12, 2026
5e83ba6
Fix swapped docstrings in device.py and inaccurate algorithms descrip…
hunhoffe Mar 12, 2026
04fe440
Fix pyright type errors in python/utils (jit, npukernel, compile/utils)
hunhoffe Mar 12, 2026
08939a8
Merge branch 'main' into tutorial-review
hunhoffe Mar 12, 2026
3d022ad
Merge main into tutorial-review + fix resnet.py RTP bug
hunhoffe Mar 12, 2026
ccf9b83
Fix remaining audit issues in tutorial-review branch
hunhoffe Mar 12, 2026
8e0bc80
Fix inaccurate documentation across programming_guide
hunhoffe Mar 12, 2026
84b5bf3
more fixes
hunhoffe Mar 12, 2026
182543e
Merge branch 'main' into tutorial-review
hunhoffe Mar 13, 2026
8 changes: 6 additions & 2 deletions programming_examples/README.md
Original file line number Diff line number Diff line change
@@ -18,11 +18,15 @@ Each IRON example has one or more implementations:

They are organized into the following directories:

## [getting_started](./getting_started)
## [getting_started](./getting_started)

Designs tailored to the new-user experience, spanning from basic applications such as SAXPY to more complex ones such as tiled matrix multiplication, for the NPU in Ryzen™ AI.

## [basic](./basic)
## [algorithms](./algorithms)

Higher-level algorithm templates (transform, for_each, and parallel variants) that handle Workers, ObjectFIFOs, and data movement automatically for common element-wise dataflow patterns on the NPU in Ryzen™ AI.

## [basic](./basic)

Basic building blocks to understand the NPU architecture and first steps towards building applications for the NPU in Ryzen™ AI.

35 changes: 24 additions & 11 deletions programming_examples/basic/README.md
@@ -12,17 +12,30 @@

These programming examples provide a good starting point to illustrate how to build commonly used compute kernels (both single-core and multi-core data processing pipelines). They serve to highlight how designs can be described in Python and lowered through the mlir-aie tool flow to an executable that runs on the NPU. [Passthrough Kernel](./passthrough_kernel) and [Vector Scalar Mul](./vector_scalar_mul) are good designs to get started with. Please see [section 3](../../programming_guide/section-3/) of the [programming guide](../../programming_guide/) for a more detailed guide on developing designs.

* [Passthrough DMAs](./passthrough_dmas) - This design demonstrates data movement to implement a memcpy operation using object FIFOs just using DMAs without involving the AIE core.
* [Passthrough Kernel](./passthrough_kernel) - This design demonstrates a simple AIE implementation for vectorized memcpy on a vector of integer involving AIE core kernel programming.
* [DMA Transpose](./dma_transpose) - Transposes a matrix with the Shim DMA using `npu_dma_memcpy_nd`
* [Vector Scalar Add](./vector_scalar_add) - Single tile performs a very simple `+` operation where the kernel loads data from local memory, increments the value by `1` and stores it back.
* [Vector Scalar Mul](./vector_scalar_mul) - Single tile performs `vector * scalar` of size `4096`. The kernel does a `1024` vector multiply and is invoked multiple times to complete the full `vector * scalar` compute.
* [Passthrough DMAs](./passthrough_dmas) - Data movement memcpy using object FIFOs via DMAs only, without involving the AIE core.
* [Passthrough Kernel](./passthrough_kernel) - Vectorized memcpy via a single AIE core kernel.
* [Passthrough PyKernel](./passthrough_pykernel) - Memcpy where the AIE kernel is written as an inline Python function rather than a C++ external function.
* [Passthrough DMAs PLIO](./passthrough_dmas_plio) - **Targets the Xilinx VCK5000, not Ryzen AI NPU.** Demonstrates PLIO-connected soft DMAs in programmable logic.
* [DMA Transpose](./dma_transpose) - Matrix transpose using the Shim DMA with `npu_dma_memcpy_nd`.
* [DMA Transpose Packet](./dma_transpose_packet) - Matrix transpose using packet-switched DMA flows.
* [Chaining Channels](./chaining_channels) - Demonstrates chaining multiple DMA buffer descriptors in sequence on a single channel.
* [Combined Transpose](./combined_transpose) - Matrix transpose combining Shim DMA strides with AIE core VSHUFFLE instructions.
* [Shuffle Transpose](./shuffle_transpose) - Matrix transpose using only AIE core VSHUFFLE instructions.
* [Vector Scalar Add](./vector_scalar_add) - Single tile increments every element of a vector by `1`.
* [Vector Scalar Mul](./vector_scalar_mul) - Single tile performs `vector * scalar` of size `4096` in `1024`-element chunks.
* [Vector Scalar Add Runlist](./vector_scalar_add_runlist) - Vector scalar add using the run-list execution model.
* [Vector Vector Add](./vector_vector_add) - Single tile performs `vector + vector` of size `1024`.
* [Vector Vector Add BDs Init Values](./vector_vector_add_BDs_init_values) - Vector addition with buffer descriptors pre-initialized with values.
* [Vector Vector Modulo](./vector_vector_modulo) - Single tile performs `vector % vector` of size `1024`.
* [Vector Vector Multiply](./vector_vector_mul) - Single tile performs `vector * vector` of size `1024`.
* [Vector Reduce Add](./vector_reduce_add) - Single tile performs a reduction of a vector to return the `sum` of the elements.
* [Vector Reduce Max](./vector_reduce_max) - Single tile performs a reduction of a vector to return the `max` of the elements.
* [Vector Reduce Min](./vector_reduce_min) - Single tile performs a reduction of a vector to return the `min` of the elements.
* [Vector Exp](./vector_exp) - A simple element-wise exponent function, using the look up table capabilities of the AI Engine.
* [Matrix Scalar Add](./matrix_scalar_add) - Single tile performs `matrix * vector` with matrix size of `16x8`.
* [Matrix Multiplication](./matrix_multiplication) - This directory contains multiple designs spanning: single core and multi-core (whole array) matrix-matrix multiplication, and matrix-vector multiplication designs. It also contains sweep infrastructure for benchmarking.
* [Vector Reduce Add](./vector_reduce_add) - Single tile reduction returning the `sum` of a vector.
* [Vector Reduce Max](./vector_reduce_max) - Single tile reduction returning the `max` of a vector.
* [Vector Reduce Min](./vector_reduce_min) - Single tile reduction returning the `min` of a vector.
* [Vector Exp](./vector_exp) - Element-wise $e^x$ using the AIE look-up table capability.
* [Matrix Scalar Add](./matrix_scalar_add) - Single tile adds a scalar constant to every element of a `16x8` matrix.
* [Matrix Multiplication](./matrix_multiplication) - Single-core, multi-core (whole array), and matrix-vector multiply designs, plus sweep benchmarking infrastructure.
* [Row Wise Bias Add](./row_wise_bias_add) - Adds a bias vector to each row of a matrix using DMA tiling.
* [Event Trace](./event_trace) - Demonstrates the AIE hardware trace unit for measuring kernel cycle counts and stall events. See also [Section 4b](../../programming_guide/section-4/section-4b/) of the programming guide.
* [Packet Switch](./packet_switch) - Demonstrates packet-switched routing for multiplexing multiple data streams over shared interconnect.
* [Tiling Exploration](./tiling_exploration) - Interactive exploration of `TensorAccessPattern` and `TensorTiler2D` for n-dimensional DMA tiling. Includes visualization tools.
* [Memcpy](./memcpy) - **Exercise design.** A parameterized multi-column memcpy with an intentionally unoptimized runtime sequence. The goal is to add task groups to achieve peak bandwidth. See [getting_started/00_memcpy](../getting_started/00_memcpy/) for the reference solution.
6 changes: 4 additions & 2 deletions programming_examples/basic/memcpy/README.md
@@ -1,8 +1,10 @@
---
---

# **Memcpy**

The `memcpy.py` design is a highly parallel, parameterized design that uses shim DMAs in every NPU column. It enables both compute and bypass modes to help you analyze performance charactaristics.
> **Exercise Design:** The runtime sequence in `memcpy.py` is intentionally left unoptimized — drain operations run serially rather than in parallel, which limits measured bandwidth. Your task is to restructure the runtime sequence using `task_group()` to achieve full concurrency across all columns and channels. See Step 4 below for guidance, and [getting_started/00_memcpy/memcpy.py](../../../getting_started/00_memcpy/memcpy.py) for the reference solution.

The `memcpy.py` design is a highly parallel, parameterized design that uses shim DMAs in every NPU column. It enables both compute and bypass modes to help you analyze performance characteristics.

---

2 changes: 2 additions & 0 deletions programming_examples/basic/passthrough_dmas_plio/README.md
@@ -10,6 +10,8 @@

# <ins>Passthrough DMAs with PLIO</ins>

> **Hardware Note:** This design targets the **Xilinx VCK5000 Versal evaluation board**, not a Ryzen AI NPU. It will not build or run on Ryzen AI hardware.

This reference design can be run on the VCK5000 Versal device. This design leverages the same data movement pattern as the [Passthrough DMAs](../passthrough-dmas) example design but it uses a soft DMA. Please see the [platforms repo](https://github.com/Xilinx/ROCm-air-platforms) for more information on how the programmable logic is integrated with the AIEs. This is meant to be an illustrative example to highlight how to integrate PL designs with AIE designs programmed using mlir-aie.

In the platform, tile (26, 0) has PLIO connected to a DMA implemented in the programmable logic. There are two designs: `aie2-input-plio.py` uses the soft DMA to push data from DRAM into the AIEs, whereas `aie2-output-plio.py` uses the soft DMA to receive data from the AIEs and push it to DRAM. The soft DMA is programmed using the same mechanism as the ShimDMAs.
2 changes: 1 addition & 1 deletion programming_examples/basic/passthrough_kernel/README.md
@@ -10,7 +10,7 @@

# Passthrough Kernel:

This IRON design flow example, called "Passthrough Kernel", demonstrates a simple AIE implementation for vectorized memcpy on a vector of integers. In this design, a single AIE core performs the memcpy operation on a vector with a default length `4096`. The kernel is configured to work on `1024` element-sized subvectors and is invoked multiple times to complete the full copy. The example consists of two primary design files: `aie2.py` and `passThrough.cc`, and a testbench `test.cpp` or `test.py`.
This IRON design flow example, called "Passthrough Kernel", demonstrates a simple AIE implementation for vectorized memcpy on a vector of integers. In this design, a single AIE core performs the memcpy operation on a vector with a default length `4096`. The kernel is configured to work on `1024` element-sized subvectors and is invoked multiple times to complete the full copy. The example consists of two primary design files: `passthrough_kernel.py` and `passThrough.cc`, and a testbench `test.cpp` or `test.py`.

## Source Files Overview

6 changes: 3 additions & 3 deletions programming_examples/basic/passthrough_pykernel/README.md
@@ -10,7 +10,7 @@

# Passthrough Kernel:

This IRON design flow example, called "Passthrough Kernel", demonstrates a simple AIE implementation for a non-vectorized (scalar) memcpy on a vector of integers. In this design, a single AIE core performs the memcpy operation on a vector with a default length `4096`. The kernel, defined in Python code as a function, is configured to work on `1024` element-sized subvectors and is invoked multiple times to complete the full copy. The example consists of two primary design files: `passthrough_pykernel.py` and `passThrough.cc`, and a testbench `test.cpp` or `test.py`.
This IRON design flow example, called "Passthrough Kernel", demonstrates a simple AIE implementation for a non-vectorized (scalar) memcpy on a vector of integers. In this design, a single AIE core performs the memcpy operation on a vector with a default length `4096`. The kernel, defined in Python code as a function, is configured to work on `1024` element-sized subvectors and is invoked multiple times to complete the full copy. The example consists of a primary design file `passthrough_pykernel.py` and a testbench `test.cpp` or `test.py`.

## Source Files Overview

@@ -31,7 +31,7 @@ This IRON design flow example, demonstrates a simpl
This simple example effectively passes data through a single compute tile in the NPU's AIE array. The design is described as shown in the figure to the right. The overall design flow is as follows:
1. An object FIFO called "of_in" connects a Shim Tile to a Compute Tile, and another called "of_out" connects the Compute Tile back to the Shim Tile.
1. The runtime data movement is expressed to read `4096` uint8_t data from host memory to the compute tile and write the `4096` data back to host memory.
1. The compute tile acquires this input data in "object" sized (`1024`) blocks from "of_in" and copies them to another output "object" it has acquired from "of_out". A scalar kernel defined via a Python fucntion is invoked on the Compute Tile's AIE core to copy the data from the input "object" to the output "object".
1. The compute tile acquires this input data in "object" sized (`1024`) blocks from "of_in" and copies them to another output "object" it has acquired from "of_out". A scalar kernel defined via a Python function is invoked on the Compute Tile's AIE core to copy the data from the input "object" to the output "object".
1. After the copy is performed, the Compute Tile releases the "objects", allowing the DMAs (abstracted by the object FIFO) to transfer the data back to host memory and copy additional blocks into the Compute Tile, "of_out" and "of_in" respectively.

It is important to note that the Shim Tile and Compute Tile DMAs move data concurrently, and the Compute Tile's AIE Core also processes data concurrently with the data movement. This is made possible by specifying a `depth` of `2` when constructing the `ObjectFifo`, for example `ObjectFifo(line_ty, depth=2)`, to denote ping-pong buffers. By default, the depth is `2` in recognition of this common pattern.
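The benefit of that ping-pong pattern can be sketched with a toy pure-Python model. This is an illustration only, not the IRON API: `total_steps` is a hypothetical helper that counts pipeline steps, assuming the producer (DMA) and consumer (AIE core) each take one step per block and run concurrently whenever buffer availability allows.

```python
def total_steps(num_blocks: int, depth: int) -> int:
    """Steps to push num_blocks through a produce->consume pipeline
    whose FIFO holds `depth` buffers (one step per stage per block)."""
    produced = consumed = in_fifo = steps = 0
    while consumed < num_blocks:
        # Availability is checked at the start of the step, so a buffer
        # freed this step cannot be refilled until the next step.
        can_produce = produced < num_blocks and in_fifo < depth
        can_consume = in_fifo > 0
        if can_consume:  # consumer drains one full buffer
            consumed += 1
            in_fifo -= 1
        if can_produce:  # producer fills one free buffer, concurrently
            produced += 1
            in_fifo += 1
        steps += 1
    return steps

print(total_steps(8, depth=1))  # -> 16: single buffer, stages serialize
print(total_steps(8, depth=2))  # -> 9: ping-pong, produce/consume overlap
```

With `depth=1` each block costs two steps (fill, then drain); with `depth=2` the fills and drains overlap and the pipeline approaches one block per step.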
@@ -64,7 +64,7 @@ This design performs a memcpy operation on a vector of input data. The AIE desig

## Usage

### Compile the desing:
### Compile the design:

To compile the design:

2 changes: 1 addition & 1 deletion programming_examples/basic/vector_exp/README.md
@@ -46,7 +46,7 @@ env use_placed=1 make

To compile the C++ testbench:
```shell
make text_exp.exe
make vector_exp.exe
```

To run the design:
4 changes: 2 additions & 2 deletions programming_examples/basic/vector_scalar_mul/README.md
@@ -22,9 +22,9 @@ This IRON design flow example, called "Vector Scalar Multiplication", demonstrat

1. `vector_scalar_mul_jit.py`: A JIT version that passes `scale.cc` to the transform algorithm. JIT compilation allows combining the host code with AIE design into one file.

1. `test.cpp`: This C++ code is a testbench for the Vector Scalar Multiplication design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the script verifies the memcpy results and optionally outputs trace data.
1. `test.cpp`: This C++ code is a testbench for the Vector Scalar Multiplication design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the testbench verifies the vector scalar multiply results against a CPU reference and optionally outputs trace data.

1. `test.py`: This Python code is a testbench for the Vector Scalar Multiplication design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the script verifies the memcpy results and optionally outputs trace data.
1. `test.py`: This Python code is a testbench for the Vector Scalar Multiplication design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the testbench verifies the vector scalar multiply results against a CPU reference and optionally outputs trace data.

## Design Overview

10 changes: 10 additions & 0 deletions programming_examples/getting_started/00_memcpy/memcpy.py
@@ -122,6 +122,16 @@ def core_fn(of_in, of_out, passThroughLine):
# Create a TensorAccessPattern for each channel to describe the data movement.
# The pattern chops the data in equal chunks and moves them in parallel across
# the columns and channels.
#
# TensorAccessPattern arguments (see programming_guide/section-2/section-2c/
# for a full explanation of data layout transformations):
# tensor_dims : logical shape of the full transfer buffer — (1, size)
# offset : starting element index into that buffer for this chunk
# sizes : [dim3, dim2, dim1, dim0] — number of elements in each
# dimension. [1, 1, 1, chunk] means a single 1-D transfer
# of `chunk` elements (the higher dimensions are unused).
# strides : [dim3, dim2, dim1, dim0] — step between elements in each
# dimension. [0, 0, 0, 1] means contiguous (stride-1) access.
taps = [
TensorAccessPattern(
(1, size),
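The `sizes`/`strides` scheme described in the comment above can be modeled in a few lines of plain Python. `access_offsets` is a hypothetical helper (not the mlir-aie `taplib` API) that expands a `[dim3, dim2, dim1, dim0]` sizes/strides pair into the flat element offsets a DMA buffer descriptor would visit:

```python
from itertools import product

def access_offsets(offset, sizes, strides):
    """Flat element offsets visited by a 4-D access pattern,
    iterating dim3 outermost and dim0 innermost."""
    return [
        offset + sum(i * s for i, s in zip(idx, strides))
        for idx in product(*(range(n) for n in sizes))
    ]

# sizes [1, 1, 1, chunk] with strides [0, 0, 0, 1]: a contiguous 1-D copy
print(access_offsets(8, [1, 1, 1, 4], [0, 0, 0, 1]))  # -> [8, 9, 10, 11]
# a higher dimension with a larger stride skips between rows
print(access_offsets(0, [1, 1, 2, 3], [0, 0, 8, 1]))  # -> [0, 1, 2, 8, 9, 10]
```

The second call shows how a non-trivial stride in `dim1` reads two 3-element rows that are 8 elements apart, the building block of the strided and transpose patterns used elsewhere in these examples.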
10 changes: 7 additions & 3 deletions programming_examples/getting_started/01_SAXPY/saxpy.cc
@@ -15,11 +15,11 @@
#include <stdio.h>
#include <stdlib.h>

#define REL_WRITE 0
#define REL_READ 1

#include <aie_api/aie.hpp>

// NOTE: Both kernels below are hardcoded for N=4096 elements. The Python
// design file (saxpy.py) must be called with a tensor of exactly this size.
// Calling with any other size will produce silently incorrect results.
extern "C" {
void saxpy(bfloat16 *restrict x, bfloat16 *restrict y, bfloat16 *restrict z) {
event0();
Expand All @@ -39,6 +39,10 @@ void saxpy(bfloat16 *restrict x, bfloat16 *restrict y, bfloat16 *restrict z) {
event1();
}

// saxpy_scalar: a non-vectorized reference implementation of SAXPY.
// Useful for verifying correctness and understanding the algorithm before
// examining the vectorized version above. Can be selected from Python by
// changing the ExternalFunction name from "saxpy" to "saxpy_scalar".
void saxpy_scalar(bfloat16 *x, bfloat16 *y, bfloat16 *z) {
event0();
float a = 3.f;
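Both kernels above compute `z[i] = a * x[i] + y[i]` with the constant `a = 3` hardcoded in `saxpy.cc`. A minimal pure-Python sketch of that reference computation (an illustrative model, not repo code; it uses host floats and ignores bfloat16 rounding):

```python
def saxpy_ref(x, y, a=3.0):
    """CPU reference for SAXPY: z = a*x + y, elementwise."""
    return [a * xi + yi for xi, yi in zip(x, y)]

print(saxpy_ref([1.0, 2.0], [0.5, -1.0]))  # -> [3.5, 5.0]
```

This mirrors the `ref_vec` check in `saxpy.py` below and is what `saxpy_scalar` computes one element at a time on the AIE core.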
23 changes: 8 additions & 15 deletions programming_examples/getting_started/01_SAXPY/saxpy.py
@@ -10,11 +10,9 @@
import os

import aie.iron as iron
from aie.iron import ExternalFunction, jit
from aie.iron import Kernel, ObjectFifo, Program, Runtime, Worker
from aie.iron import ExternalFunction
from aie.iron import ObjectFifo, Program, Runtime, Worker
from aie.iron.placers import SequentialPlacer
from aie.iron.controlflow import range_
from aie.helpers.taplib import TensorAccessPattern, TensorTiler2D
from aie.utils.config import cxx_header_path


@@ -86,8 +84,10 @@ def core_body(of_x, of_y, of_z, saxpy_kernel):


def main():
# Define tensor shapes and data types
data_size = 2048
# Define tensor shapes and data types.
# NOTE: saxpy.cc hardcodes the loop bound to 4096 elements. This value
# must match data_size or the kernel will produce silently wrong results.
data_size = 4096
element_type = bfloat16

# Construct an input tensor and an output zeroed tensor
@@ -100,21 +100,14 @@
# to the kernel will use the same compiled kernel and loaded code objects
saxpy(input0, input1, output)

# Check the correctness of the result and print
# Check the correctness of the result and print any mismatches
ref_vec = [3 * input0[i] + input1[i] for i in range(data_size)]

errors = 0
for index, (actual, ref) in enumerate(
zip(
output,
ref_vec,
)
):
for index, (actual, ref) in enumerate(zip(output, ref_vec)):
if actual != ref:
print(f"Error at {index}: {actual} != {ref}")
errors += 1
else:
print(f"Correct output at {index}: {actual} == {ref}")

# If the result is correct, exit with a success code
# Otherwise, exit with a failure code