Skip to content

Tutorial review#2954

Open
hunhoffe wants to merge 18 commits intomainfrom
tutorial-review
Open

Tutorial review#2954
hunhoffe wants to merge 18 commits intomainfrom
tutorial-review

Conversation

@hunhoffe
Copy link
Collaborator

@hunhoffe hunhoffe commented Mar 11, 2026

In preparation for future tutorials on mlir-aie/IRON, Claude helped me audit the programming_guide, programming_examples, and python/iron code + documentation for correctness and clarity.

hunhoffe and others added 16 commits March 11, 2026 15:04
…g_examples

Code bugs (would cause failures during tutorial):
- Fix SAXPY data_size 2048->4096 to match C++ kernel and README
- Fix undefined variable: section-2e loop `for _ in range_` -> `for i in range_`
- Fix section-3 Worker construction: add missing comma, correct variable names
  (of_in1/of_out1 -> of_in/of_out to match aie2.py)
- Fix section-2g invalid DMA direction: SS2M -> S2MM
- Fix section-2g buffer() calls: add missing datatype= keyword
- Fix section-4c make command: `int_bit_width=32 trace` -> `make int_bit_width=32 trace`
- Fix vector_exp README: `make text_exp.exe` -> `make vector_exp.exe`

Navigation bugs:
- Fix 34+ broken image links: mlir_tutorials -> mlir_exercises (9 files)
- Fix section-2d/DMATasks.md and RuntimeTasks.md: retitle from Section 2g to
  Section 2d and correct TOC ordering
- Fix section-2e broken link: 03_Link_Distribute_Join -> 03_Implicit_Copy
- Fix section-3 broken nav link: ../section4/section-4b -> ../section-4/section-4b
- Fix section-2f L2 README: correct filename ext_to_coreL2.py/ext_to_core.py
  -> ext_to_core_L2.py
- Fix section-4b README-placed.md self-referential Up link: ../../section-4b -> ../

Wrong/misleading content:
- Remove reference to non-existent passThrough.cc from passthrough_pykernel README
- Fix scale_shift README title: "Eltwise Multiplication" -> "Scale Shift"
- Fix eltwise_add README: 3x "addtiplication" -> "addition"
- Update ml/README.md to include all 20 ML examples (was only listing 8)
- Add algorithms/ directory to programming_examples/README.md
- Fix vector_reduce_max_1col.py assert comment: "4 bytes = 1 integer" ->
  "4 bytes = 2 bfloat16 elements"

Typos: Peformance, becuase, strides/strides, intialized, synchornization,
iniitliaze, addd, additionl/descrbied/belows, evnet, incorect, colshift spacing,
prase_trace.py (2 files), abtracts, candiate, fucntion, desing (3 files),
multiplcation, Defintions, techncially, dimenioni/dimenion, chanenls

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Add Prerequisites section to getting_started/README.md listing required
  hardware, software, packages, and links to setup docs (rec 2)
- Fix vector_reduce_max reference: initialize ref_max to bfloat16('-inf')
  instead of 0 so all-negative inputs are handled correctly (rec 4)
- Fix saxpy.py validation: only print on error, not on every correct element;
  suppress 4096 lines of noise on a passing run (rec 5)
- Remove dead imports from saxpy.py (range_, TensorTiler2D, TensorAccessPattern,
  Kernel, jit — none used in this file) (rec 6)
- Remove dead REL_WRITE/REL_READ macro definitions from saxpy.cc; add
  explanatory comment to saxpy_scalar marking it as a readable reference
  implementation (rec 6)
- Add inline comments to TensorAccessPattern calls in memcpy.py and
  matrix_multiplication_single_core.py explaining tiler_shape/offset/sizes/
  strides dimension ordering (rec 7)
- Rename start_core_body -> final_core_body in vector_reduce_max_1col.py;
  the old name implied it ran first when it is actually the terminal node
  in the cascade that produces the final output (rec 9)
- Add links to programming_guide sections (section-2b, section-2c, section-2f)
  from files that use ObjectFIFO split(), data layout transforms, and
  multi-level memory hierarchy patterns (rec 11)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
mini_tutorial:
- Rename aie2p.py -> aie2.py; the file auto-detects hardware via
  iron.get_current_device() and runs on both npu1 and npu2 — the
  old name misleadingly implied AIE2P/npu2-only targeting
- Update run.lit and README.md references accordingly
- Add hardware prerequisites note to README.md: requires Ryzen AI NPU
  (any generation), XRT, and mlir-aie environment; note that @iron.jit
  handles device detection automatically

section-1/README.md:
- Fix LocalBuffer -> Buffer; LocalBuffer does not exist in the IRON API
  (readers copying the snippet would get a NameError)

section-2a/README.md:
- Fix 2x "refered" -> "referred"

section-2d/RuntimeTasks.md:
- Fix unused loop variable: `for groups in [0, 1]` -> `for _ in [0, 1]`

section-2d/DMATasks.md:
- Update Conclusion to acknowledge both npu_dma_memcpy_nd/dma_wait and
  the newer shim_dma_single_bd_task/dma_await_task/dma_free_task APIs;
  note that dma_task operations are preferred for new designs

section-2e/aie2.py:
- Fix file header comment: said section-2d/aie2.py, is in section-2e

section-2f/02_external_mem_to_core/ext_to_core.py:
- Fix file header comment: said single_buffer.py, is ext_to_core.py

section-3/README.md:
- Fix undefined variable: tile_size = data_size // 4 -> tensor_size // 4
- Fix kernel snippet parameter names: a_in/c_out -> a/c to match the
  actual vector_scalar_mul.cc source (body used c[i] and a[i] which
  were undeclared under the old parameter names)

section-4a/README.md:
- Fix C++ syntax: "imported via `import <chrono>`" -> "included via
  `#include <chrono>`"

section-4c/README.md:
- Reconcile cycle count: update prose, efficiency table, and load/store
  analysis from 72 -> 78 cycles to match the exercise answer tooltip,
  which reflects a more recent hardware measurement; update derived
  efficiency percentages accordingly

quick_reference.md:
- Remove duplicate "AIE2 - Table of supported data types and vector
  sizes" link that appeared in both "Helpful References" and
  "AIE Detailed References" sections

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The prose said 72 cycles while the exercise answer tooltip said 78.
Rather than picking one without running the design, acknowledge that
results vary by hardware revision and compiler version (~72-78 cycles).
Update all derived efficiency values and prose accordingly.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The conclusion now acknowledges both the npu_dma_memcpy_nd/dma_wait
and dma_task interfaces without asserting one is preferred — that
claim was not grounded in any project documentation.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Restore original 72-cycle / 44% / 22% values throughout — these are
the numbers the original author measured and the ones consistent with
the rest of the document's worked examples. The discrepancy with the
exercise tooltip (78 cycles) should be resolved by someone running the
design, not by editing the prose to hedge.

Retain the two valid fixes from the earlier pass:
- mlir_tutorials -> mlir_exercises image links
- missing `make` in `make clean; int_bit_width=32 trace` command

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…mples

Fix 1 - basic/passthrough_kernel/README.md:
  Correct stale filename: 'aie2.py' -> 'passthrough_kernel.py'

Fix 2 - getting_started/01_SAXPY/saxpy.cc + saxpy.py:
  Document the hardcoded N=4096 loop bound in both saxpy() and
  saxpy_scalar(); add matching comment in saxpy.py main() warning
  that changing data_size without updating the C++ kernel will
  produce silently incorrect results

Fix 3 - ml/gelu/README.md, ml/silu/README.md, ml/swiglu/README.md:
  Write READMEs for all three missing ML activation function examples;
  all three are listed in ml/README.md's table but had no documentation

Fix 5 - basic/README.md:
  Correct 'matrix * vector' -> 'adds a scalar constant' for
  matrix_scalar_add; expand example list from 14 to all 29 subdirectories,
  including chaining_channels, combined_transpose, dma_transpose_packet,
  event_trace, packet_switch, passthrough_pykernel, row_wise_bias_add,
  shuffle_transpose, tiling_exploration, and others not previously listed;
  add VCK5000 hardware note for passthrough_dmas_plio

Fix 6 - basic/passthrough_dmas_plio/README.md:
  Add prominent hardware note at top: targets VCK5000, not Ryzen AI NPU

Fix 7 - ml/relu, ml/eltwise_add, ml/scale_shift, basic/vector_scalar_mul:
  Replace boilerplate 'verifies the memcpy results' with accurate
  per-design descriptions of what each testbench actually verifies

Fix 8 - vision/edge_detect/README.md, vision/color_threshold/README.md:
  Replace deprecated 'OrderedObjectBuffer (OOB)' terminology with
  current 'ObjectFIFO' throughout both READMEs

Fix 9 - mlir/README.md:
  Write top-level README for the mlir/ category; explains that these
  examples target Versal hardware (not Ryzen AI NPU), use raw MLIR
  rather than the Python IRON API, and describes each of the 5 examples

Fix 10 - basic/memcpy/README.md:
  Add prominent exercise note at top making explicit that the runtime
  sequence is intentionally unoptimized; link to the reference solution
  in getting_started/00_memcpy/

Co-Authored-By: Claude Opus 4.6 <[email protected]>
black reformatted vector_reduce_max_1col.py: wrapped a long assert
statement to fit within the line length limit. All other changed Python
files (memcpy.py, saxpy.py, matrix_multiplication_single_core.py,
aie2.py x3) were already compliant. saxpy.cc was already compliant
with clang-format.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…python/utils

Code bugs fixed:
- buffer.py: Buffer.shape and Buffer.dtype referenced self._obj_type which
  does not exist; fixed to self._arr_type
- buffer.py: __getitem__ and __setitem__ used `return AttributeError(...)`
  instead of `raise AttributeError(...)`; now raises correctly
- worker.py: error message said "per works" instead of "per worker"
- npukernel.py: __init__ received device_index but never stored it;
  added self._device_index = device_index
- objectfifo.py: error message said "no consumers were not created"
  (double negative); fixed to "no consumers were created"
- objectfifo.py: ObjectFifoLink.__init__ error messages said "Then number
  of" (two occurrences); fixed to "The number of"

Placeholder text removed (would render literally on GitHub Pages):
- worker.py: Worker class docstring had _summary_ placeholder; replaced
  with proper description
- worker.py: orphaned floating string literal before __init__ removed
- resolvable.py: NotResolvedError docstring had _type_ and _description_
  placeholders; replaced with accurate description
- objectfifo.py: ObjectFifo.__init__ plio arg and Raises both had
  _description_ placeholders; replaced with real text
- objectfifo.py: floating string literal converted to comment
- runtime/runtime.py: sequence() Yields had _type_ placeholder; fixed
  runtime/runtime.py: finish_task_group() had _description_ placeholder; fixed
- buffer.py: floating string literal converted to comment
- runtime/runtime.py: floating string literal converted to comment

Inaccurate docstrings corrected:
- runtime/runtime.py: __init__ documented parameter as check_task_groups
  but actual name is strict_task_groups; also fixed wrong -> Runtime
  return annotation to -> None
- runtime/runtime.py: drain() tap arg said "sending it to the in_fifo";
  fixed to "reading from the out_fifo"; removed duplicate "Defaults to None"
- runtime/runtime.py: enable_trace() had "Enable trace." as its entire
  docstring with invalid [] type annotations; now fully documented with
  all 7 parameters and correct list | None annotations
- runtime/endpoint.py: __init__ had wrong -> RuntimeEndpoint return
  annotation; fixed to -> None
- runtime/data.py: arr_type property had -> np.ndarray return annotation;
  fixed to -> type[np.ndarray]
- program.py: __init__ docstring claimed verify() is called there;
  it is called in resolve_program(); corrected
- device/device.py: is_mem_accessible Returns said "int: Number of
  connections"; it returns bool; corrected
- resolvable.py: Resolvable.resolve first param named cls but is an
  instance method; renamed to self
- placeable.py: tile property Returns omitted | None; corrected
- objectfifo.py: "overriden" -> "overridden" (2x), "callled" -> "called"

Documentation added:
- Module docstrings added to all 30 changed files (previously absent from all)
- Worker.__init__: added trace_events parameter to Args section
- WorkerRuntimeBarrier.__init__: added docstring
- _BarrierSetOp.__init__: added docstring
- FinishTaskGroupTask.__init__, RuntimeStartTask.__init__: added docstrings
- DMATask: added class docstring
- Program: added class docstring
- SequentialPlacer.__init__: added docstring
- SequentialPlacer.make_placement: added docstring (was undocumented override)
- Device.rows, Device.cols: added property docstrings
- Device.get_num_*_connections (x4): added Args section for t parameter
- Tile.__init__: added docstring
- transform(), transform_binary(), transform_parallel(),
  transform_parallel_binary(): expanded from one-liners to full Args/Returns
- for_each(): fixed example (tile_size was positional, should be keyword);
  added Returns section
- controlflow.py: documented range_ semantics (NPU loop vs Python unroll)
- dtype.py: fixed stale filename comment (config.py -> dtype.py); added
  comment explaining dtype_map entries
- buffer.py: fixed stale filename comment (globalbuffer.py -> buffer.py)
- npukernel.py: fixed __call__ Returns which referenced undefined KernelResult
- compile/utils.py: fixed "xbclbin" typo to "xclbin"
- runtime/taskgroup.py: fixed "to indicated" -> "to indicate"
- runtime/dmatask.py: fixed "shoudl" -> "should"
- placers.py: fixed "informatio" -> "information"

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…tion

- get_num_connections and is_mem_accessible had their Returns: sections
  swapped; restore each to match its actual return type and semantics
- programming_examples/README.md described the algorithms/ directory as
  containing "sorting, searching" patterns; replace with an accurate
  description of the element-wise transform/for_each templates it actually
  contains

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- jit.py: add `assert cached_kernel is not None` to narrow type before
  calling the cached kernel, resolving reportOptionalCall
- compile/utils.py: widen path parameters from `str | None` to
  `str | Path | None` to match actual call sites in jit.py
- npukernel.py: add None guard before DefaultNPURuntime.load_and_run()
  to resolve reportOptionalMemberAccess; mirrors the guard already
  present in jit.py

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Merge origin/main ('Replace mlir_tutorials with mlir_exercises')
- Resolve conflicts in section-4c (keep correct 'make' command) and
  section-5 (keep fixed 'candidate' spelling)
- Fix resnet.py: remove dead RTP writes in set_rtps() for conv2_fn
  tiles (rtp[0][1], rtp[0][2], rtp[1][0], rtp[1][2], rtp[2][1],
  rtp[2][3]) that are never placed/resolved because conv2_fn workers
  use a hardcoded scale=1 and have no RTP argument. This was a latent
  bug exposed by the buffer.py fix (return->raise AttributeError).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- aie2.py: correct stale header comment (aie2p.py -> aie2.py)
- quick_reference.md: remove empty '## AIE Detailed References' section
- objectfifo.py: fix word order in cons() docstring ('is it' -> 'it is')
- section-2e/aie2_multi.py: fix header path (section-2d -> section-2e)
  and fix out-of-bounds loop (range_(data_size) -> range_(tile_size));
  each worker tile holds tile_size elements, not data_size
- section-2e/README.md: sync core_fn loop bound to match aie2_multi.py
  (range_(data_size) -> range_(tile_size))
- section-4b/README.md: fix typos 'precendence' -> 'precedence' and
  'woker' -> 'worker'

Co-Authored-By: Claude Opus 4.6 <[email protected]>
section-1/README.md:
- Prose said AIEDevice.npu1_1col but code examples use AIEDevice.npu1
- NOTE said AIEDevice.npu (nonexistent enum value), corrected to npu1

section-2a/README.md:
- Fix broken internal anchor (#advanced-topic-data-movement-accelerators
  -> #advanced-topic-direct-memory-access-channels)
- Fix core_fn example: test_func2(elemIn,...) -> test_func2(elemOut,...)
  (elemIn was already released at that point)

section-2e/README.md:
- rt.sequence(data_size,...) -> rt.sequence(data_ty,...) (×2): sequence()
  takes numpy array types, not integer sizes
- Fix inner loop variable shadowing: 'for i in range_(tile_size)' -> 'for j'
  to avoid shadowing the outer 'for i in range(n_cores)' loop
- Fix shell commands in python code fences -> bash fences (×2)

section-4b/README.md:
- xrt_test_wrapper_.h -> xrt_test_wrapper.h (spurious underscore, ×3)
- parse_trace.py -> parse.py (actual script name, ×3)

section-4c/README.md:
- 'make -f Makefile.chess use_placed=1 trace' -> 'make use_placed=1 trace'
  (Makefile.chess does not exist in that directory)

section-5/README.md:
- passthrough.cc -> passThrough.cc (match actual filename case)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@hunhoffe hunhoffe marked this pull request as ready for review March 12, 2026 20:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant