Insert trace flows pass #2962

Open
yenjames wants to merge 3 commits into main from insert-trace-flows
@yenjames
Collaborator

@yenjames yenjames commented Mar 12, 2026

Not ready for review: still working on Copilot suggestions and increasing test coverage.

Summary

This PR implements a new MLIR pass to automatically insert trace packet flows and runtime sequence configuration for AIE trace operations. Following #2705, this should eliminate the need for manual trace npu.write32 configurations.

Changes

  • Automatically inserts packet flows from trace sources to the shim DMA and generates all runtime sequence operations required for trace collection. These operations are inserted at the beginning of the runtime sequence, after any aie.trace.start_config ops.
  • Changed the tests to be a fuller example, with tracing on all 4 trace types.
  • Updated the test in test/npu-xrt/vec_mul_event_trace to be the full end-to-end trace test; it checks the trace output after the hardware run.
  • When the aie.trace.packet type is not specified, the pass auto-detects it from the tile type. For core tiles, it defaults to core instead of mem.
  • Fixed AIETraceToConfig.cpp, which incorrectly allowed MemTiles and ShimTiles to be configured with a Mode field in their trace. Based on my understanding, only the core module has the Mode field in its trace control register.
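
The packet-type auto-detection rule from the list above can be sketched in isolation. This is a minimal illustration rather than the pass's actual code: `TileKind`, `PacketType`, and `resolvePacketType` are hypothetical stand-ins for the dialect's tile queries and `TracePacketType`, and an unspecified type is modelled as a null pointer.

```cpp
#include <cassert>

// Hypothetical stand-ins for the dialect's tile kinds and trace packet types.
enum class TileKind { Shim, MemTile, Core };
enum class PacketType { Core, Mem, ShimTile, MemTile };

// Return the explicit type when one was given (non-null); otherwise derive
// it from the tile kind. Core tiles default to Core, not Mem.
inline PacketType resolvePacketType(TileKind tile,
                                    const PacketType *explicitType) {
  if (explicitType)
    return *explicitType;
  switch (tile) {
  case TileKind::Shim:
    return PacketType::ShimTile;
  case TileKind::MemTile:
    return PacketType::MemTile;
  case TileKind::Core:
    return PacketType::Core;
  }
  assert(false && "unknown tile kind");
  return PacketType::Core;
}
```

Under this sketch, a memory-module trace on a core tile still has to say `type=mem` explicitly, because the tile kind alone cannot distinguish it from a core trace.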

Example

What the user writes:

aie.device(npu1_1col) {
  %tile_0_2 = aie.tile(0, 2)
  
    // Trace configuration for compute tile (0,2) - core events
    aie.trace @core_trace(%tile_0_2) {
      // Set trace mode (Event-Time captures timestamps)
      aie.trace.mode "Event-Time"

      // Configure packet routing (ID and type for packet-switched routing)
      aie.trace.packet id=1 type=core

      // Specify which events to capture (up to 8 events)
      aie.trace.event<"INSTR_EVENT_0">        // User event 0 (start marker)
      aie.trace.event<"INSTR_EVENT_1">        // User event 1 (end marker)
      aie.trace.event<"INSTR_VECTOR">         // Vector instructions
      aie.trace.event<"MEMORY_STALL">         // Memory access stalls
      aie.trace.event<"STREAM_STALL">         // Stream buffer stalls
      aie.trace.event<"LOCK_STALL">           // Lock acquisition stalls
      aie.trace.event<"PORT_RUNNING_0">       // Port 0 (DMA:0 S2MM) running
      aie.trace.event<"PORT_IDLE_1">          // Port 1 (DMA:0 MM2S) idle
      aie.trace.port<0> port=DMA channel=0 direction=S2MM
      aie.trace.port<1> port=DMA channel=0 direction=MM2S

      // Specify start/stop control (broadcast events)
      aie.trace.start event=<"BROADCAST_15">
      aie.trace.stop event=<"BROADCAST_14">
    }
  
  // Runtime sequence with trace invocation
  aiex.runtime_sequence @seq(%arg0: memref<32xi32>) {
    aie.trace.start_config @core_trace
    // ... other runtime operations
  }
}

--aie-insert-trace-flows generates:

// Stays the same
aie.trace @core_trace(%tile_0_2) { ... }

// Packet route tile trace port → shim DMA
aie.packet_flow(1) {
  aie.packet_source<%tile_0_2, Trace : 0>
  aie.packet_dest<%tile_0_0, DMA : 1>
} {keep_pkt_header = true}

aiex.runtime_sequence(%arg0: memref<32xi32>) {
  aie.trace.start_config @core_trace

  // Tile timer control (reset on broadcast 15)
  aiex.npu.write32 {column = 0, row = 2, address = 0x34000, value = 31232}

  // Shim DMA setup (buffer descriptor + address patch)
  aiex.npu.writebd {column = 0, bd_id = 15, buffer_length = 1048576, ...}
  aiex.npu.address_patch {addr = 0x1D1E4, arg_idx = 4}

  // Shim DMA channel enable + start task queue
  aiex.npu.maskwrite32 {column = 0, row = 0, address = 0x1D208, ...}
  aiex.npu.write32 {column = 0, row = 0, address = 0x1D20C, value = 0x8000000F}

  // Trigger broadcast 15 (trace start)
  aiex.npu.write32 {column = 0, row = 0, address = 0x3404C, value = 127}

  // ... user DMA operations ...

  // Trigger broadcast 14 (trace stop)
  aiex.npu.write32 {column = 0, row = 0, address = 0x34048, value = 126}
}
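
Two of the constants in the generated sequence above can be reproduced by hand. The sketch below is hedged: the timer-control layout follows the "(122 << 8 = 31232)" comment in the pass source quoted later in this thread, while the meaning of bit 31 in the task-queue word is an assumption inferred from `0x8000000F` paired with `bd_id = 15`. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Timer control: the reset event ID sits at bit 8.
// BROADCAST_15 is event 122 in the core module, per the PR.
inline uint32_t timerControlValue(uint32_t resetEvent) {
  assert(resetEvent < 128 && "event IDs are assumed to fit in 7 bits");
  return resetEvent << 8;
}

// Task-queue word: BD ID in the low 4 bits. Bit 31 is ASSUMED to be an
// "issue token" flag; this is inferred from the value, not from a spec.
inline uint32_t taskQueueValue(uint32_t bdId, bool issueToken) {
  assert(bdId < 16 && "BD ID is assumed to fit in 4 bits");
  return bdId | (issueToken ? (1u << 31) : 0u);
}
```

With these helpers, `timerControlValue(122)` gives 31232 and `taskQueueValue(15, true)` gives `0x8000000F`, matching the two `aiex.npu.write32` values shown above.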

aie.trace is lowered to an aie.trace.config sequence of register writes, then to npu.write32 (Implemented in #2705):

  // Intermediate representation (after -aie-trace-to-config)
  aie.trace.config @core_trace_config(%tile_0_2) packet_type = core {
    aie.trace.reg register = "Trace_Control0" field = "Mode" value = 0 : i32 comment = "trace mode"
    aie.trace.reg register = "Trace_Control1" field = "ID" value = 1 : i32 comment = "packet ID"
    aie.trace.reg register = "Trace_Control1" field = "Packet_Type" value = 0 : i32 comment = "packet type"
    aie.trace.reg register = "Trace_Control0" field = "Trace_Start_Event" value = 122 : i32 comment = "start event"
    aie.trace.reg register = "Trace_Control0" field = "Trace_Stop_Event" value = 121 : i32 comment = "stop event"
    aie.trace.reg register = "Stream_Switch_Event_Port_Selection_0" field = "Port_0_ID" value = "DMA:0" comment = "port 0 ID"
    aie.trace.reg register = "Stream_Switch_Event_Port_Selection_0" field = "Port_0_Master_Slave" value = 1 : i32 comment = "port 0 master/slave"
    aie.trace.reg register = "Stream_Switch_Event_Port_Selection_0" field = "Port_1_ID" value = "DMA:0" comment = "port 1 ID"
    aie.trace.reg register = "Stream_Switch_Event_Port_Selection_0" field = "Port_1_Master_Slave" value = 0 : i32 comment = "port 1 master/slave"
    aie.trace.reg register = "Trace_Event0" field = "Trace_Event0" value = 33 : i32 comment = "event slot 0"
    aie.trace.reg register = "Trace_Event0" field = "Trace_Event1" value = 34 : i32 comment = "event slot 1"
    aie.trace.reg register = "Trace_Event0" field = "Trace_Event2" value = 37 : i32 comment = "event slot 2"
    aie.trace.reg register = "Trace_Event0" field = "Trace_Event3" value = 23 : i32 comment = "event slot 3"
    aie.trace.reg register = "Trace_Event1" field = "Trace_Event4" value = 24 : i32 comment = "event slot 4"
    aie.trace.reg register = "Trace_Event1" field = "Trace_Event5" value = 26 : i32 comment = "event slot 5"
    aie.trace.reg register = "Trace_Event1" field = "Trace_Event6" value = 79 : i32 comment = "event slot 6"
    aie.trace.reg register = "Trace_Event1" field = "Trace_Event7" value = 78 : i32 comment = "event slot 7"
  }

  // Intermediate representation (after -aie-trace-pack-reg-writes)
  aie.trace.config @core_trace_config(%tile_0_2) packet_type = core {
    aie.trace.reg register = "Trace_Control0" value = 2038038528 : i32 mask = 2139029507 comment = "trace mode + start event + stop event"
    aie.trace.reg register = "Trace_Control1" value = 1 : i32 mask = 28703 comment = "packet ID + packet type"
    aie.trace.reg register = "Stream_Switch_Event_Port_Selection_0" value = 289 : i32 mask = 16191 comment = "port 0 ID + port 0 master/slave + port 1 ID + port 1 master/slave"
    aie.trace.reg register = "Trace_Event0" value = 388309537 : i32 mask = 2139062143 comment = "event slot 0 + event slot 1 + event slot 2 + event slot 3"
    aie.trace.reg register = "Trace_Event1" value = 1313806872 : i32 mask = 2139062143 comment = "event slot 4 + event slot 5 + event slot 6 + event slot 7"
  }

  // Final output (after -aiex-inline-trace-config)
  aiex.runtime_sequence @seq(%arg0: memref<32xi32>) {
    aiex.npu.write32 {address = 213200 : ui32, column = 0 : i32, row = 2 : i32, value = 2038038528 : ui32}
    aiex.npu.write32 {address = 213204 : ui32, column = 0 : i32, row = 2 : i32, value = 1 : ui32}
    aiex.npu.write32 {address = 261888 : ui32, column = 0 : i32, row = 2 : i32, value = 289 : ui32}
    aiex.npu.write32 {address = 213216 : ui32, column = 0 : i32, row = 2 : i32, value = 388309537 : ui32}
    aiex.npu.write32 {address = 213220 : ui32, column = 0 : i32, row = 2 : i32, value = 1313806872 : ui32}
    // Additional npu.write32 for other registers...
  }
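
The packing step can be checked by hand against the numbers above. The following is a sketch under stated assumptions: the field offsets and widths are inferred from the masks printed after -aie-trace-pack-reg-writes (for example `2139029507 = 0x7F7F0003`), not taken from a register specification, and `Field`, `Packed`, and the `pack*` helpers are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// One register field: bit offset, width in bits, and the value to store.
struct Field { unsigned shift; unsigned width; uint32_t value; };
struct Packed { uint32_t value; uint32_t mask; };

// OR every field into a single word and accumulate the write mask.
inline Packed packFields(const Field *fields, unsigned n) {
  Packed out = {0, 0};
  for (unsigned i = 0; i < n; ++i) {
    uint32_t fieldMask = (1u << fields[i].width) - 1u;
    assert((fields[i].value & ~fieldMask) == 0 && "value does not fit");
    assert((out.mask & (fieldMask << fields[i].shift)) == 0 && "overlap");
    out.value |= fields[i].value << fields[i].shift;
    out.mask |= fieldMask << fields[i].shift;
  }
  return out;
}

// Trace_Control0 (assumed layout): Mode [1:0], start [22:16], stop [30:24].
inline Packed packTraceControl0(uint32_t mode, uint32_t startEvent,
                                uint32_t stopEvent) {
  const Field f[] = {{0, 2, mode}, {16, 7, startEvent}, {24, 7, stopEvent}};
  return packFields(f, 3);
}

// Trace_Event0/1 (assumed layout): four 7-bit event slots, one per byte.
inline Packed packTraceEvents(const uint32_t ev[4]) {
  const Field f[] = {{0, 7, ev[0]}, {8, 7, ev[1]},
                     {16, 7, ev[2]}, {24, 7, ev[3]}};
  return packFields(f, 4);
}
```

Feeding in the per-field values from the -aie-trace-to-config listing (mode 0, start 122, stop 121; event slots 33, 34, 37, 23 and 24, 26, 79, 78) reproduces the packed values and masks shown for `Trace_Control0`, `Trace_Event0`, and `Trace_Event1`.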

Follow-up PR

  • Update placed/unplaced IRON to use this new trace lowering infrastructure.
  • Convert python/utils/trace/setup.py to emit aie.trace ops instead of direct npu_write32 calls.

@yenjames yenjames changed the title from "Insert trace flows" to "Insert trace flows pass" Mar 12, 2026
@github-actions
Contributor

github-actions bot commented Mar 12, 2026

Coverage Report

Created: 2026-03-13 16:27

| Filename | Function Coverage | Line Coverage | Region Coverage | Branch Coverage |
|---|---|---|---|---|
| include/aie/Dialect/AIE/Transforms/AIEPasses.h | 100.00% | 100.00% | 100.00% | - |
| lib/Dialect/AIE/Transforms/AIEInsertTraceFlows.cpp | 37.50% | 33.22% | 27.45% | 26.09% |
| lib/Dialect/AIE/Transforms/AIETraceToConfig.cpp | 100.00% | 88.81% | 83.53% | 67.79% |
| Totals | 68.75% | 72.08% | 67.60% | 55.00% |
Generated by llvm-cov -- llvm version 18.1.3

@FIM43-Redeye

FIM43-Redeye commented Mar 13, 2026

Really great to see true declarative tracing! I ended up building my own entire trace injection toolkit for running tests with arbitrary tracing (tools/trace-*.py in xdna-emu -- fair warning, the parent project is a messy work-in-progress), so it's lovely to have the capability in mlir-aie proper. Would love to contribute here.

Some things I can help with:

  • EVENT_PC validation on real hardware -- I have a Phoenix and a (finicky due to being brand new) Strix Halo NPU, with a dual-compiler testing architecture around the silicon in question to make sure both Chess and Peano play nice. I've been working towards using EVENT_PC mode specifically (for cycle-by-cycle PC capture as part of emulator development work), so I can validate any given test case on hardware. Also happy to contribute tests for vec_mul_event_trace's path and other configs.

  • Multi-column trace routing -- the minimizeShims / preferSameColumn options are a great start! I've been working on my end on a way to distribute trace load across all available S2MM channels with pathfinder validation while staying out of the way of all existing streams.

  • Minimizing trace-induced timing perturbation -- ties in very nicely to multi-column routing. Trace packet flows fight with data flows for stream switch ports, so adding tracing can change the timing of the design under test. I care a lot about getting as close as possible to cycle-identical behavior (I'm using my hardware to validate an intended-to-be-cycle-accurate emulator), but I imagine it's valuable for many others too. The current way I like to do it is to just go sideways with the trace data, expand by one column and move everything laterally where possible so it never fights with the vertical data flow.

  • setup.py migration -- saw this is listed as a follow-up, happy to help clear it up! I've gotten experienced with the configure_coretile_tracing_aie2() flow and event ID mapping, so I'd be glad to help the Python-side migration along.

  • Test coverage -- coverage report shows 33% line coverage in AIEInsertTraceFlows.cpp. I can help with FileCheck tests on multi-tile scenarios, MemTile tracing, and edge cases like occupied S2MM channels.

Let me know if contributions are welcome, and if so, what scope. Just don't want to step on anyone's toes -- this project has been evolving incredibly rapidly recently.

@yenjames
Collaborator Author

yenjames commented Mar 13, 2026

Hi @FIM43-Redeye,

Thank you for your interest in the repo! I think the enhancements you're proposing here are very interesting and useful.
To address your points:

  • Hardware validation and test coverage: Any FileCheck tests you want to contribute for multi-tile scenarios and improving line coverage would be very helpful (the coverage I have right now is definitely too minimal). Similarly, I am very interested to see how you will utilize EVENT_PC to validate any tests.
  • Multi-column trace routing and timing perturbation: minimizeShims / preferSameColumn are definitely basic options that could use enhancements. I can see higher-traffic designs having trouble with port contention, so your approach of distributing trace load sounds like a good direction.
  • Setup.py migration: I will need to hold onto this one for now. My plan is to proceed with the Python migration immediately following this task (needing to unblock that IRON work is what diverted me to this tracing task in the first place!)

Contributions on trace enhancements are more than welcome, so you are definitely not stepping on anyone's toes! Please feel free to open PRs whenever you're ready. I will also let others with deeper trace expertise weigh in. @jackl-xilinx @fifield

James

…e lowered through trace passes.

- Added auto-detection of packet type based on tile when not defined in `aie.trace`.
- Fixed field mode emission. Only core trace control registers have a mode field.
- Updated expected broadcast event values to actual hardware event codes
- Update manual `aiex.npu.write32` trace configuration in test/npu-xrt/vec_mul_event_trace/aie.mlir and
  programming_examples/basic/event_trace/aie_trace.mlir to declarative `aie.trace` ops
- Updated examples to feature trace config for all 4 options: coretile, core_mem, memtile, shimtile.
- Register -aie-insert-trace-flows to aiecc.
- TODO: Future PR to update python to use `aie.trace` bindings in python/utils/trace.
@yenjames yenjames force-pushed the insert-trace-flows branch from d70bdb5 to e9c2a06 Compare March 13, 2026 15:57
@yenjames yenjames marked this pull request as ready for review March 13, 2026 15:59
Copilot AI review requested due to automatic review settings March 13, 2026 15:59
Contributor

Copilot AI left a comment

Pull request overview

This PR adds a new AIE MLIR transform pass to automatically insert trace packet flows (trace source → shim DMA) and the required runtime-sequence NPU operations for trace capture, so users no longer need to manually author low-level npu.write32/DMA setup for tracing. It also updates the trace lowering pipeline, examples, and tests to use aie.trace + aie.trace.start_config, and tightens trace-to-config lowering for broadcast events and mode-field emission.

Changes:

  • Add -aie-insert-trace-flows pass and wire it into the aiecc trace-lowering pipeline before -aie-trace-to-config.
  • Update AIETraceToConfig broadcast lowering (broadcast channel → HW event ID) and prevent emitting Mode for non-core trace modules.
  • Refresh tests/examples to use declarative aie.trace and validate end-to-end trace output.
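
The broadcast lowering mentioned above (broadcast channel to hardware event ID) can be illustrated for the core module. This is a sketch under a loud assumption: the PR only states that BROADCAST_15 is event 122 and BROADCAST_14 is event 121, so the contiguous numbering and the base value 107 are inferred from those two data points. Other trace modules (mem, memtile, shim) likely use different event numbering. `coreBroadcastEventId` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Map a broadcast channel (0..15) to a core-module event ID.
// ASSUMPTION: event IDs are contiguous starting at BROADCAST_0 = 107,
// which is consistent with BROADCAST_14 = 121 and BROADCAST_15 = 122.
inline uint32_t coreBroadcastEventId(unsigned channel) {
  assert(channel <= 15 && "16 broadcast channels assumed");
  const uint32_t kCoreBroadcast0 = 107; // inferred, not stated in the PR
  return kCoreBroadcast0 + channel;
}
```

Under this assumption, `aie.trace.start broadcast=15` would resolve to the `Trace_Start_Event` value 122 seen in the PR description.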

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 12 comments.

| File | Description |
|---|---|
| tools/aiecc/aiecc.cpp | Adds the new trace-flow insertion pass into the existing trace lowering pipeline. |
| lib/Dialect/AIE/Transforms/AIEInsertTraceFlows.cpp | Implements the new pass that inserts aie.packet_flow ops and runtime-sequence NPU trace setup/teardown. |
| lib/Dialect/AIE/Transforms/AIETraceToConfig.cpp | Resolves broadcast start/stop to hardware event IDs and restricts Mode field emission to core traces. |
| lib/Dialect/AIE/Transforms/CMakeLists.txt | Adds the new pass source file to the build. |
| include/aie/Dialect/AIE/Transforms/AIEPasses.td | Declares the new pass, its options, and documentation. |
| include/aie/Dialect/AIE/Transforms/AIEPasses.h | Exposes createAIEInsertTraceFlowsPass() factory. |
| test/dialect/AIE/trace/test_insert_trace_flows_simple.mlir | New lit test for basic packet-flow insertion. |
| test/dialect/AIE/trace/test_insert_trace_flows_multiple.mlir | New lit test covering multiple trace sources/IDs. |
| test/dialect/AIE/trace/test_trace_to_config.mlir | Updates FileCheck expectations for resolved broadcast event IDs. |
| test/Dialect/AIE/trace/test_trace_port_to_config.mlir | Updates FileCheck expectations for resolved broadcast event IDs. |
| test/Dialect/AIE/combo_edge/test_combo_edge_full.mlir | Updates FileCheck expectations for resolved broadcast event IDs. |
| test/npu-xrt/vec_mul_event_trace/aie.mlir | Converts the end-to-end XRT test to declarative aie.trace across core/mem/memtile/shim. |
| test/npu-xrt/vec_mul_event_trace/test.py | Updates to an NPU2 end-to-end run and validates events from multiple trace types. |
| test/npu-xrt/vec_mul_event_trace/vector_scalar_mul.cc | Updates copyright header. |
| programming_examples/basic/event_trace/aie_trace.py | Updates example generator to use memtile forwarding and expanded trace coverage. |
| programming_examples/basic/event_trace/aie_trace.mlir | Updates MLIR example similarly (memtile forwarding + multiple trace types). |
| programming_examples/basic/event_trace/README.md | Documents the updated trace lowering pipeline including the new pass. |

let constructor = "xilinx::AIE::createAIEInsertTraceFlowsPass()";

let dependentDialects = [
"xilinx::AIE::AIEDialect",
Comment on lines +465 to +467
"Minimize number of shim tiles used (prefer one shim for all traces)">,
Option<"preferSameColumn", "prefer-same-column", "bool", "true",
"When choosing shim, prefer same column as trace sources">
// CHECK: aie.trace @core_trace
aie.trace @core_trace(%tile02) {
aie.trace.mode "Event-Time"
aie.trace.packet id=1 type="core"
Comment on lines +257 to +268
std::set<std::pair<int, int>> processedTiles; // (col, row)
for (auto &info : traceInfos) {
  int col = info.tile.getCol();
  int row = info.tile.getRow();

  if (processedTiles.find({col, row}) != processedTiles.end())
    continue;
  processedTiles.insert({col, row});

  // Compute timer control address
  uint32_t timerCtrlAddr = computeTimerCtrlAddress(
      info.tile, targetModel, info.packetType == TracePacketType::Mem);
Comment on lines +266 to +276
  // Compute timer control address
  uint32_t timerCtrlAddr = computeTimerCtrlAddress(
      info.tile, targetModel, info.packetType == TracePacketType::Mem);

  // Timer control value: BROADCAST_15 event (122 << 8 = 31232)
  uint32_t timerCtrlValue = 31232; // Event 122 (BROADCAST_15) << 8

  builder.create<xilinx::AIEX::NpuWrite32Op>(
      runtimeSeq.getLoc(), timerCtrlAddr, timerCtrlValue, nullptr,
      builder.getI32IntegerAttr(col), builder.getI32IntegerAttr(row));
}
Comment on lines +24 to +38
  aie.trace.packet id=1 type="core"
  aie.trace.event<"INSTR_VECTOR">
  aie.trace.start broadcast=15
}

// Mem trace on tile (0,2)
aie.trace @mem_trace_02(%tile02) {
  aie.trace.packet id=2 type="mem"
  aie.trace.event<"DMA_S2MM_0_START_TASK">
  aie.trace.start broadcast=15
}

// Core trace on tile (0,3)
aie.trace @core_trace_03(%tile03) {
  aie.trace.packet id=3 type="core"
Comment on lines +159 to +193
} else {
  // Fallback: Use one shim for all traces (column 0)
  int targetCol = 0;
  TileOp shimTile = nullptr;
  for (auto tile : device.getOps<TileOp>()) {
    if (tile.getCol() == targetCol && tile.getRow() == 0) {
      shimTile = tile;
      break;
    }
  }

  if (!shimTile) {
    builder.setInsertionPointToStart(&device.getRegion().front());
    shimTile = builder.create<TileOp>(device.getLoc(), targetCol, 0);
  }

  ShimInfo shimInfo;
  shimInfo.shimTile = shimTile;
  shimInfo.channel = shimChannel;
  shimInfo.bdId = defaultBdId;
  shimInfo.argIdx = traceArgIdx;
  shimInfo.traceSources = traceInfos;
  shimInfos[targetCol] = shimInfo;
}

// Phase 3: Insert packet flows
// Insert before the device terminator
Block &deviceBlock = device.getRegion().front();
builder.setInsertionPoint(deviceBlock.getTerminator());

for (auto &info : traceInfos) {
  // Find target shim for this trace
  int col = info.tile.getCol();
  ShimInfo &shimInfo = shimInfos[col];

Comment on lines +75 to +105
// Find packet ID and type from trace body
std::optional<int> packetId;
std::optional<TracePacketType> packetType;
for (auto &op : trace.getBody().getOps()) {
  if (auto packetOp = dyn_cast<TracePacketOp>(op)) {
    packetId = packetOp.getId();
    packetType = packetOp.getType();
    break;
  }
}

// Determine packet type from tile type if not specified
if (!packetType) {
  if (tile.isShimTile()) {
    packetType = TracePacketType::ShimTile;
  } else if (tile.isMemTile()) {
    packetType = TracePacketType::MemTile;
  } else {
    // Core tile defaults to core type
    packetType = TracePacketType::Core;
  }
}

// Allocate packet ID if not specified
if (!packetId) {
  if (usedPacketIds.find(col) == usedPacketIds.end()) {
    usedPacketIds[col] = nextPacketId;
  }
  packetId = usedPacketIds[col]++;
}

Comment on lines +270 to +276
  // Timer control value: BROADCAST_15 event (122 << 8 = 31232)
  uint32_t timerCtrlValue = 31232; // Event 122 (BROADCAST_15) << 8

  builder.create<xilinx::AIEX::NpuWrite32Op>(
      runtimeSeq.getLoc(), timerCtrlAddr, timerCtrlValue, nullptr,
      builder.getI32IntegerAttr(col), builder.getI32IntegerAttr(row));
}
Comment on lines +282 to +305
// 4c. Write buffer descriptor
builder.create<xilinx::AIEX::NpuWriteBdOp>(
    runtimeSeq.getLoc(),
    shimCol,         // column
    shimInfo.bdId,   // bd_id
    traceBufferSize, // buffer_length
    0,               // buffer_offset
    1,               // enable_packet
    0,               // out_of_order_id
    0,               // packet_id (not used for reception)
    0,               // packet_type (not used for reception)
    0, 0, 0, 0, 0,
    0,       // d0_size, d0_stride, d1_size, d1_stride, d2_size, d2_stride
    0, 0, 0, // iteration_current, iteration_size, iteration_stride
    0,       // next_bd
    0,       // row
    0,       // use_next_bd
    1,       // valid_bd
    0, 0, 0, 0, 0,    // lock_rel_val, lock_rel_id, lock_acq_enable,
                      // lock_acq_val, lock_acq_id
    0, 0, 0, 0, 0, 0, // d0_zero_before, d1_zero_before, d2_zero_before,
                      // d0_zero_after, d1_zero_after, d2_zero_after
    traceBurstLength  // burst_length
);
@fifield
Collaborator

fifield commented Mar 13, 2026

> Some things I can help with:
>
>   • EVENT_PC validation on real hardware -- I have a Phoenix and a (finicky due to being brand new) Strix Halo NPU, with a dual-compiler testing architecture around the silicon in question to make sure both Chess and Peano play nice. I've been working towards using EVENT_PC mode specifically (for cycle-by-cycle PC capture as part of emulator development work), so I can validate any given test case on hardware. Also happy to contribute tests for vec_mul_event_trace's path and other configs.

I have some examples + draft code for parsing the other two trace modes (event-pc and execution-trace) which I used as a quick proof of life check when developing the declarative trace ops. I'll try to dig it out and stick it in a branch if other folks want to take a look.

@FIM43-Redeye

@yenjames Thanks for the warm welcome! I'll start with FileCheck tests for multi-tile scenarios -- that's low-friction and immediately useful for your coverage numbers.

@fifield That would be incredibly valuable! I've been reverse-engineering the event-pc and execution-trace formats from raw hardware captures on my Phoenix NPU and have gotten reasonably far -- event-pc cross-references against the ELF, execution mode appears to be ARM CoreSight ETM-style atom encoding -- but I've been working without any format specification, so I'd love to compare notes against your parser to see where I've gotten things right vs wrong.

For broader context: I'm building an open-source cycle-accurate NPU emulator (xdna-emu dev branch) intended as both a learning tool and a verification aid -- so documentation on any aspect of the NPU is hugely useful to me, not just trace formats. I've been deriving behavior from the open-source toolchain (aie-rt, mlir-aie, llvm-aie) and hardware observation, which has gotten me surprisingly far, but I'll be honest -- I'd much rather work from open documentation than reverse-engineer proprietary tooling. It's not just inefficient, it makes me genuinely uncomfortable. AM020/AM025 cover register layouts but not things like trace wire formats or detailed DMA pipeline behavior.

If there's any documentation that can be shared, even informally, it would save a lot of effort. And if not, I completely understand -- I'll keep working from hardware observations and the toolchain source.
