
feat(rocpd): add AI-powered GPU trace analysis module (rocpd analyze)#4030

Open
ammarwa wants to merge 161 commits into develop from aelwazir/rocpd-ai-analysis

Conversation

ammarwa (Collaborator) commented Mar 13, 2026

Motivation

AMD ROCm users need actionable guidance from their GPU profiling data, but interpreting raw rocprofv3 output requires deep GPU architecture knowledge. This PR introduces an AI-powered analysis module (rocpd analyze) that reads .rpd trace databases and produces human-readable performance insights, bottleneck detection, and optimization recommendations — without requiring an internet connection or LLM API key.

The module follows a tiered progressive analysis strategy (Tier 0–4), giving users immediately useful output from any profiling run while allowing deeper analysis as more data is collected.

Technical Details

New subcommand: rocpd analyze [-i trace.db] [--source-dir ./src] [--interactive "<app>"]

Core analysis (analyze.py, ai_analysis/)

  • Tier 0 — Static source analysis (--source-dir): Scans .hip/.cpp/.cu/.py files for GPU programming patterns (kernels, memcpy, sync, ROCTx, frameworks). Produces a profiling plan with a suggested first rocprofv3 command and recommended PMC counters. Works without a .db file.
  • Tier 1 — Trace analysis: Time breakdown (kernel/memcpy/API/idle), hotspot identification, memory transfer analysis, 8 rule-based recommendations (high/medium/low/info priority).
  • Tier 2 — Hardware counter analysis: Roofline model, Speed-of-Light, GPU utilization (GRBM), wave occupancy (SQ_WAVES). Auto-activates when pmc_events table is present.
  • Output formats: text, json (schema v0.1.x/v0.2.0), markdown, webview (self-contained AMD-themed HTML with SVG gauges, collapsible recommendation cards, sortable hotspot table, hover tooltips, light/dark toggle).
  • LLM enhancement (--llm anthropic|openai|private): Optional natural-language explanations via Anthropic Claude, OpenAI, or any OpenAI-compatible private/enterprise endpoint. Kernel names and paths are sanitized before transmission. Falls back gracefully when unavailable.
  • Custom prompts (--prompt): Target the analysis at a specific question (e.g. --prompt "Why is my matmul kernel slow?").
  • PMC pass splitting: _split_pmc_into_passes() automatically separates TCC-derived counters (FETCH_SIZE, WRITE_SIZE) into dedicated passes to avoid hardware block limit errors (rocprofv3 error code 38).
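The pass-splitting idea can be sketched as follows. The helper name comes from the PR, but this body is a minimal illustration under stated assumptions (TCC-derived counters identified by name prefix), not the actual implementation:

```python
# Minimal sketch of PMC pass splitting (illustrative, not the PR's real code).
# Assumption: counters whose names start with these prefixes are TCC-derived
# and must go in their own pass to avoid hardware-block limit errors.
TCC_DERIVED = ("FETCH_SIZE", "WRITE_SIZE")

def split_pmc_into_passes(counters):
    """Separate TCC-derived counters into a dedicated pass."""
    tcc = [c for c in counters if c.startswith(TCC_DERIVED)]
    other = [c for c in counters if not c.startswith(TCC_DERIVED)]
    passes = []
    if other:
        passes.append(other)
    if tcc:
        passes.append(tcc)
    return passes
```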

Interactive workflow (ai_analysis/interactive.py)

Two session classes:

InteractiveSession — menu-driven [p]/[a]/[o]/[s]/[q] loop launched after standard analysis:

  • Persistent LLMConversation shared across all [a]/[o] calls in a session; history survives --resume-session
  • LLMConversation auto-compacts every N turns (configurable via --llm-compact-every) using an LLM-generated summary to stay within context limits
  • AI-suggested rocprofv3 commands extracted from LLM responses and offered as a numbered run menu
  • Session saved to ~/.rocpd/sessions/ on [s], [q], and Ctrl+C

WorkflowSession — 7-phase automated profiling + optimization loop triggered by --interactive "<app>":

  • Phase 1b: Classifies the app command and scans source (if provided) to pick the optimal starter rocprofv3 flags
  • Multi-process support: Detects MPI, torchrun, DDP, and other fork-based workloads; automatically adds --process-sync and -o results_%nid% so each process writes its own DB. After profiling, per-process databases are merged via rocpd.merge.merge_sqlite_dbs() before analysis.
  • Phase 6 AI code editing: LLM rewrites source files based on recommendations; diff shown for approval; .bak backup created before any edit
  • AI-edit revert: [v]/r reverts the last edit, prompts for error context, calls LLM to analyze the failure and propose an alternative, then shows a what-next menu ([f] retry fix / [p] re-profile / [q] exit)
  • Cycle-break detection: prevents infinite [r] re-profile → same INFO → [r] loops by fingerprinting collected counters and flags across all prior runs
  • Session checkpoints: each AI edit batch creates a git commit + GC-pinned ref + browsable worktree; [b] rollback menu in Phase 5 lets users revert to any prior state
  • Session state persisted to ~/.rocpd/sessions/workflow_<ts>_<slug>.json
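The cycle-break fingerprinting above can be sketched with a hypothetical helper; the PR's actual fingerprint inputs and storage may differ, but the idea is an order-insensitive hash over each run's flags and counters:

```python
import hashlib
import json

def run_fingerprint(flags, counters):
    """Order-insensitive fingerprint of one profiling run's configuration.

    Hypothetical helper illustrating the cycle-break check: identical flag
    and counter sets hash identically regardless of ordering.
    """
    payload = json.dumps(
        {"flags": sorted(flags), "counters": sorted(counters)}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def is_repeat_run(prior_fingerprints, flags, counters):
    """True when an identical run configuration was already profiled."""
    return run_fingerprint(flags, counters) in prior_fingerprints
```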

WorkflowSession — Session Checkpoints

Each AI source-file edit creates a git-worktree checkpoint so the user can roll back to any prior state and blacklist approaches that caused regressions.

Phase 6 AI edit
  └─► git commit all modified files
  └─► git update-ref refs/rocpd/<session_id>/cp-N  (GC-pinned, not a branch)
  └─► git worktree add --detach ~/.rocpd/sessions/<session_id>/cp-N
  └─► CheckpointRecord stored in WorkflowState
        ├── file_snapshots (full contents — offline restore when git unavailable)
        ├── run_index + performance_delta_pct (filled in after Phase 3/4)
        └── blacklisted flag + description
  • [b] rollback menu in Phase 5: shows checkpoint table with performance deltas; regression checkpoints flagged; user prompted to blacklist before rollback
  • Blacklist injection: blacklisted approach descriptions are prepended to Phase 6 LLM prompt so the same pattern is not repeated
  • Blacklist persistence: stored in WorkflowState.blacklisted_approaches (never truncated by rollback)
  • Two-strategy restore: git checkout <hash> -- <file> (fast path) or file-snapshot write (fallback when git unavailable)
  • Dirty working tree OK: commit_files stages only the AI-modified files (git add -- <file>), so in-progress user changes are never touched or included in checkpoint commits
  • Session lifecycle: _init_checkpoints + _prune_stale_worktrees at start; _teardown_checkpoints in finally (removes worktrees; refs kept for GC protection)
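The two-strategy restore can be sketched like this; the function name and signature are assumptions, and the real code operates on CheckpointRecord snapshots rather than a single string:

```python
import pathlib
import subprocess

def restore_file(repo_root, commit, rel_path, snapshot_text):
    """Restore one file: git fast path, file-snapshot fallback.

    Illustrative sketch of the two-strategy restore described above. Returns
    which strategy was used.
    """
    try:
        # Fast path: check out the file from the checkpoint commit.
        subprocess.run(
            ["git", "-C", str(repo_root), "checkout", commit, "--", rel_path],
            check=True, capture_output=True,
        )
        return "git"
    except (OSError, subprocess.CalledProcessError):
        # git unavailable or commit unreachable: write the stored snapshot.
        target = pathlib.Path(repo_root) / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(snapshot_text)
        return "snapshot"
```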

LLM conversation (ai_analysis/llm_conversation.py)

New LLMConversation class replacing the previous SessionContext dict approach:

  • Streaming responses via Anthropic, OpenAI, and private/enterprise OpenAI-compatible APIs
  • Response chunks accumulated with list.append + "".join() (O(n)) instead of string concatenation (O(n²)) to avoid quadratic allocation on long responses
  • Automatic context compaction: keeps keep_recent_turns verbatim, summarizes older turns with a non-streaming LLM call
  • Conversation history archived to ~/.rocpd/sessions/<id>_history.jsonl
  • to_dict()/from_dict() for full session persistence and resume
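The O(n) accumulation above is the standard Python pattern; a minimal sketch of the idea (not the PR's actual streaming code):

```python
def accumulate_stream(chunks):
    """Accumulate streamed response chunks in O(n) total time.

    Appending to a list and joining once avoids the O(n^2) behavior of
    repeated `text += chunk` on long responses.
    """
    parts = []
    for chunk in chunks:
        parts.append(chunk)   # amortized O(1) per chunk
    return "".join(parts)     # single O(n) join at the end
```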

LLM hardening

  • ROCPD_LLM_PRIVATE_HEADERS dict validation: After json.loads() the result is validated to be a dict; a non-dict JSON value (e.g. an array) raises a ValueError with a clear message showing the expected format, rather than an opaque TypeError from headers.update()
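A minimal sketch of that validation (the function name and exact message wording are assumptions):

```python
import json

def parse_private_headers(raw):
    """Parse ROCPD_LLM_PRIVATE_HEADERS and insist on a JSON object.

    A non-dict JSON value (e.g. an array) raises a clear ValueError up front
    instead of surfacing later as an opaque TypeError from headers.update().
    """
    value = json.loads(raw)
    if not isinstance(value, dict):
        raise ValueError(
            'ROCPD_LLM_PRIVATE_HEADERS must be a JSON object, e.g. '
            '{"X-Api-Key": "..."}; got JSON %s' % type(value).__name__
        )
    return value
```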

Build & packaging (utilities.cmake)

  • file(COPY ... DESTINATION ...) replaces configure_file ... COPYONLY for AI analysis assets — fixes EPERM on binary files (e.g. PNG) during CMake configure
  • *.png added to rocpd_AI_ANALYSIS_FILES glob so ai_analysis/share/amd_rocm_logo.png (used by the interactive session banner) is installed alongside .py/.md/.json files
  • tracelens_port.py added to rocpd_PYTHON_SOURCES
  • GPU-less CMake build fix: guards list(GET ...) calls in rocprofiler-sdk-utilities.cmake with an early-return when rocminfo returns an empty GPU list. Note: GPU is required to run the integration tests; builds on GPU-less machines configure cleanly but the test suite is not expected to pass without hardware.

Python 3.6 compatibility (RHEL 8.8 / SLES 15.6)

  • tracelens_port.py: Changed _CATEGORY_PATTERNS: List[Tuple[str, re.Pattern]] annotation to List[Tuple[str, Any]]. re.Pattern was introduced in Python 3.7; Python 3.6 evaluates module-level annotations eagerly, causing an AttributeError at import time that cascaded into all tests importing analyze.py or llm_analyzer.py.
  • test_analyze_schema.py: Added try/except ImportError shim for importlib.resources (Python 3.7+), falling back to pkgutil.get_data() on Python 3.6.
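The shim follows the usual compatibility pattern; a self-contained sketch (the helper name is an assumption):

```python
# Sketch of the Python 3.6 compatibility shim described above: prefer
# importlib.resources (3.7+), fall back to pkgutil on older interpreters.
try:
    import importlib.resources as _resources  # Python 3.7+

    def read_package_text(package, name):
        return _resources.read_text(package, name)
except ImportError:  # Python 3.6
    import pkgutil

    def read_package_text(package, name):
        data = pkgutil.get_data(package, name)
        return data.decode("utf-8") if data is not None else None
```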

Schema file corrections

The analysis-output.schema.json file was corrected to match the already-documented v0.2.0 specification. The emitted JSON format was never wrong; only the validator was:

Bug → Fix:

  • profiling_mode enum missing "source_only" → "source_only" added
  • analysis_tier minimum was 1 → lowered to 0
  • execution_breakdown type "object" only → changed to ["object", "null"]
  • tier0 property undeclared → full property definition added with 14 sub-fields
  • $id embedded a version string → changed to stable "rocpd-ai-analysis-output"

Tier 0 source-only JSON output (schema_version: "0.2.0") now passes jsonschema.validate().
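The corrected validator entries would look roughly like the fragment below. Only the fields named in the table above are from the PR; the "full" enum member and the surrounding layout are placeholders:

```json
{
  "$id": "rocpd-ai-analysis-output",
  "properties": {
    "profiling_mode": {
      "enum": ["full", "source_only"]
    },
    "analysis_tier": {
      "type": "integer",
      "minimum": 0
    },
    "execution_breakdown": {
      "type": ["object", "null"]
    }
  }
}
```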

Tests

  • tests/rocprofv3/rocpd/test_analyze.py — 76 unit tests covering all recommendation rules, helper functions, PMC filter, and output formatters
  • tests/rocprofv3/rocpd/test_analyze_schema.py — 28 JSON schema conformance tests (v0.1.x, v0.2.0 source-only, and combined Tier 0+Tier 1/2; was 17)
  • tests/rocprofv3/rocpd/test_ai_analysis_standalone.py — 23 Python API unit tests (analyze_database, analyze_source, AnalysisResult)
  • tests/rocprofv3/rocpd/test_guide_filter_standalone.py — LLM reference guide section filter tests
  • ai_analysis/tests/test_interactive.py — 22 interactive session unit tests
  • ai_analysis/tests/test_llm_conversation.py — LLMConversation streaming/compaction/persistence tests
  • ai_analysis/tests/test_workflow.py — 52 WorkflowSession phase tests including full checkpoint system coverage (CheckpointRecord, GitCheckpointManager, rollback, blacklist, teardown, stale pruning)

JIRA ID

N/A

Test Plan

  • Unit tests run with pytest --noconftest from the build output directory
  • Integration tests run via ctest -R rocpd-analyze after a full build (requires AMD GPU)
  • Manual end-to-end testing with merged_db.db (2000 kernel dispatches + 64000 PMC samples) for Tier 1/2 analysis and all four output formats
  • Interactive workflow tested against a HIP demo app with intentional performance issues for Phase 1b workload classification, Phase 6 AI code editing, and the revert flow
  • Checkpoint system tested with mock git operations: rollback (git fast path + snapshot fallback), blacklist persistence across rollbacks, worktree teardown/prune
  • CMake configure verified on a system without AMD GPUs (configure succeeds; tests require GPU)

Test Result

  • All 76 test_analyze.py unit tests pass
  • All 28 schema conformance tests pass (including 11 new Tier 0 / combined tests)
  • All 23 AI analysis API tests pass
  • All 52 test_workflow.py tests pass (checkpoint system coverage)
  • All 22 test_interactive.py tests pass (no regressions)
  • All 51 test_llm_conversation.py tests pass (no regressions)
  • All output formats (text/json/markdown/webview) verified with real trace data
  • jsonschema.validate() passes for Tier 0, Tier 1/2, and combined JSON output
  • CMake configures cleanly on both GPU and GPU-less systems

Submission Checklist

ammarwa and others added 30 commits March 12, 2026 19:25
analyze.py:
  - Bug: execute() passed CLI key 'format' to analyze_performance() which
    expects 'output_format', so --format json/markdown was silently ignored
    and text was always written.  Fix by mapping the key before the call.
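The fix amounts to renaming the CLI key before the downstream call; a minimal sketch (returning the mapped kwargs for illustration; the real execute() forwards them to analyze_performance()):

```python
def map_cli_keys(cli_args):
    """Rename the CLI's 'format' key to the 'output_format' parameter name
    expected by the analysis entry point (sketch of the bug fix above)."""
    kwargs = dict(cli_args)
    if "format" in kwargs:
        # Before the fix, 'format' was passed through unchanged and silently
        # ignored, so text output was always written.
        kwargs["output_format"] = kwargs.pop("format")
    return kwargs
```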

cmake/Modules/rocprofiler-sdk-utilities.cmake:
  - rocprofiler_sdk_pc_sampling_disabled and
    rocprofiler_sdk_pc_sampling_stochastic_disabled called list(GET ...)
    on the result of rocprofiler_sdk_get_gfx_architectures without
    guarding against an empty list.  On build machines without GPUs
    (CI containers, cross-compile hosts) CMake configure failed with
    "list GET given empty list".  Add length check and early-return with
    PC sampling disabled when no GPUs are present.

tests/CMakeLists.txt:
  - rocprofiler-sdk-tests-gfx-info was left empty on no-GPU hosts,
    causing all sub-CMakeLists that do list(GET rocprofiler-sdk-tests-gfx-info 0 ...)
    to fail at configure time.  Populate the variable with placeholder
    "gfx000" when no hardware is detected; this matches none of the
    known GPU patterns so all hardware-dependent tests are correctly
    disabled while configure completes without errors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add _format_as_webview() function: self-contained HTML report with AMD
  dark theme, interactive sortable tables, SVG donut gauges for GPU util
  and wave occupancy, collapsible recommendation cards with priority
  color-coding, stacked execution breakdown bar, and copy-to-clipboard
  profiling commands. No external CDN dependencies.

- Wire 'webview' format into format_analysis_output() dispatch

- Add 'webview' to --format CLI choices (text/json/markdown/webview)

- Fix output file extension: execute() now appends .txt/.json/.md/.html
  automatically based on the selected format, so output files always
  have the correct extension
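The auto-extension behavior can be sketched as a small mapping (names assumed; the real code lives in execute()):

```python
# Sketch of the format -> extension mapping described above.
FORMAT_EXTENSIONS = {"text": ".txt", "json": ".json", "markdown": ".md", "webview": ".html"}

def with_format_extension(path, output_format):
    """Append the format's extension unless the path already ends with it."""
    ext = FORMAT_EXTENSIONS[output_format]
    return path if path.endswith(ext) else path + ext
```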

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- README.md: add webview to feature list, CLI examples, data flow
  diagram, AnalysisResult method list, and a new Example 4 section
- AI_ANALYSIS_API.md: add webview to feature list and AnalysisResult
  methods; document each format's output file extension (.txt/.json/
  .md/.html); add full Webview section under Output Formats covering
  features, CLI usage, and Python API usage
- SCHEMA_CHANGELOG.md: add v0.1.1 entry noting webview format addition
  and auto-extension behavior (no JSON schema changes)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a pure CSS+JS floating tooltip system to the webview HTML report
so every visual element explains itself on hover. No external deps.

Tooltips added to:
- Gauge widgets (GPU Utilization, Wave Occupancy): explain the
  underlying hardware counter formula (GRBM_GUI_ACTIVE/GRBM_COUNT,
  SQ_WAVES), target thresholds, and current status
- Execution breakdown: stacked bar segments and individual bars for
  Kernel Execution, Memory Copies, API Overhead, and GPU Idle — each
  explains what the metric means, good/bad thresholds, and how to fix
- Overview stat cards: Primary Bottleneck (per-type explanation of
  what it means and how to address it), Total Runtime, Kernel Time,
  Analysis Tier (explains Tier 1 vs Tier 2 and how to upgrade)
- Hotspot table column headers: Calls, Total/Avg/Min Time, % Total
- Memory transfer table: direction cells (H2D, D2H, D2D, P2P with
  PCIe/HBM bandwidth context) and all column headers
- Hardware counter table rows (via COUNTER_TIPS JS lookup):
  GRBM_COUNT, GRBM_GUI_ACTIVE, SQ_WAVES, SQ_WAVE_CYCLES,
  SQ_INSTS_VALU/SALU/VMEM_RD/VMEM_WR/LDS/SMEM, FETCH_SIZE,
  WRITE_SIZE, TCP/TCC cache counters, TA_TA_BUSY, and more.
  Unknown counters get a generic fallback message.

Implementation details:
- #tt floating div follows mouse cursor, repositions at viewport edges
- [data-tip] elements use single-quoted HTML attributes; tip content
  can include <strong>, <em>, <code>, .tok/.twarn colored spans
- Counter tips use data-ctr attribute + JS COUNTER_TIPS object lookup
  to decouple tip content from Python string generation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- AI_ANALYSIS_API.md: expand Webview features list with full tooltip
  coverage details — gauges (counter formula, thresholds), breakdown
  bars, overview stats (per-bottleneck guidance), hotspot columns,
  memory direction cells, and 20+ AMD GPU hardware counter definitions
- README.md: add tooltip note to Example 4 (Interactive HTML Webview)
  explaining that every visual element is self-documenting on hover
- SCHEMA_CHANGELOG.md: add v0.1.2 entry — no schema changes; notes
  the COUNTER_TIPS JS lookup, tooltip coverage, and fallback behavior
  for unknown counters

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Overhaul the --format webview HTML report inspired by AMD dashboard
design patterns for a cleaner, more scannable interface:

- Light/Dark theme toggle with localStorage persistence (defaults dark)
- Sticky header with AMD gradient, status summary badges (Critical/
  Warning/Low/Info counts from recommendations), and metric pills row
  (runtime, kernel count, analysis tier, timestamp, DB path)
- Status-colored KPI cards in overview: kernel %, bottleneck type,
  total runtime, and tier each have a colored top border (ok/warn/crit)
  reflecting health status at a glance
- Section card pattern (.scard) with icon+title+badge headers throughout
- Priority icons on recommendation cards: 🔴 HIGH 🟠 MEDIUM 🟡 LOW ℹ INFO
- Gradient execution breakdown bars and grid-aligned legend rows
- FAB scroll-to-top button (appears after 250px scroll)
- Staggered @keyframes fadeInUp entrance animations on section cards
- Improved typography (system font stack; works fully offline)
- Gauge cards: background fill + hover border effect (Tier 2)
- Improved table headers: uppercase + 2px bottom border

Also updates SCHEMA_CHANGELOG.md (v0.1.3), README.md, and
AI_ANALYSIS_API.md to document all new webview UI features.
No changes to JSON output schema or analysis logic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CSS `content` property does not process HTML entities. Replace
`content:'&#8594;'` with `content:'→'` (U+2192) in the .findings
li::before rule so the right-arrow bullet renders correctly instead
of displaying as literal text '&#8594;'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents the root cause and fix for the key findings bullet icons
rendering as literal HTML entity text in the webview report.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The #tt floating tooltip used color:var(--text) which in light mode
resolves to ~#181828 (near-black) — invisible against the always-dark
#0e0e1c tooltip background. Replace with a fixed light color (#dde0f2)
so the tooltip remains readable regardless of the active theme.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents the root cause and fix for tooltip text being invisible in
light theme (color:var(--text) resolving to near-black against an
always-dark tooltip background).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The recommendation engine was suggesting rocprofv3 flags (e.g.
--hip-api-trace, --hsa-trace) that were already covered by the
user's original --sys-trace run, creating confusing advice.

Fix: inspect the database before generating recommendations to
infer which collection flags were already used:
- kernels rows       → --kernel-trace covered
- regions rows       → --hip-trace / --hsa-trace covered (API spans)
- memory_copies rows → --memory-copy-trace covered
- kernels + regions  → full --sys-trace implied (subsumes all trace flags)

Redundant flags are stripped from recommended rocprofv3 commands.
Commands whose stripped flags leave nothing new to collect are
dropped entirely. rocprof-sys and rocprof-compute commands are
always preserved (different tool, always a new perspective).

New helpers: _detect_already_collected(), _filter_rec_commands(),
_SYS_TRACE_IMPLIED constant. generate_recommendations() gains an
already_collected parameter; analyze_performance() calls the
detector and threads the result through.
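The stripping logic can be sketched as follows. The constant name mirrors the one mentioned above; the body is a simplified illustration that takes a precomputed covered set, whereas the real helpers inspect the database tables:

```python
# Flags whose data --sys-trace already subsumes (per the PR's description).
SYS_TRACE_IMPLIED = frozenset({
    "--hip-trace", "--hsa-trace", "--hip-api-trace", "--kernel-trace",
    "--memory-copy-trace", "--marker-trace", "--roctx-trace",
})

def filter_command_flags(flags, already_collected):
    """Drop flags whose data was already collected; return None when the
    remaining command would collect nothing new (sketch, not the real helper)."""
    covered = set(already_collected)
    if "--sys-trace" in covered:
        covered |= SYS_TRACE_IMPLIED
    remaining = [f for f in flags if f not in covered]
    return remaining or None
```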

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…esent

rocprof-sys --trace collects the same HIP/HSA API call data as
rocprofv3 --sys-trace (just in Perfetto format instead of rocpd).
Treat it as equivalent and drop it when sys-trace data is already
in the database.

Rules in _filter_rec_commands() are now per-tool:
- rocprofv3: strip covered flags; drop if nothing meaningful remains
- rocprof-sys: drop if only --trace (≡ sys-trace); keep when it
  carries extra flags like --trace-gpu-memory that rocprofv3 can't
- rocprof-compute: always keep (deep hardware counter analysis)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The LLM was recommending flags already covered by the user's original
--sys-trace run (e.g. --hip-api-trace, --hsa-trace, rocprof-sys --trace).

Add a new "Context-Aware Profiling Recommendations" section to the LLM
reference guide (the "fence") that explicitly instructs the model to:
1. Read profiling_info.profiling_mode to identify what was already collected
2. Know that --sys-trace subsumes --hip-trace, --hsa-trace, --hip-api-trace,
   --kernel-trace, --memory-copy-trace, --marker-trace, --roctx-trace
3. Know that rocprof-sys --trace is equivalent to --sys-trace (same API data,
   different format) and must not be recommended when sys-trace exists
4. Only recommend the INCREMENTAL next step (--pmc, rocprof-compute, etc.)
5. State "no additional run needed" when all required data is present

Also add an explicit prohibition in the "What NOT to Do" section.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Instead of enumerating every flag equivalence (--sys-trace subsumes
--hip-trace, --hsa-trace, etc.), instruct the LLM to reason from the
tool documentation already present in the guide to determine flag
overlap and tool equivalence itself.

The "Context-Aware Profiling Recommendations" section is now concise:
tell the model what to do (read profiling_mode, use the docs to reason
about equivalence, recommend only the incremental next step) without
hardcoding every combination that should be in the model's reasoning.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Suppresses .claude/, __pycache__/, *.pyc, and rocpd-output-data/
from appearing as untracked files in git status.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes all 13 issues from the deep-research-report audit:

Critical:
- AIA-001: fix analyze_database() — call individual analysis functions
  (compute_time_breakdown, identify_hotspots, analyze_memory_copies,
  analyze_hardware_counters, generate_recommendations) instead of the
  broken analyze_performance() wrapper that returns str not dict

High:
- AIA-002: fix _build_analysis_result() key mapping (issue/suggestion/
  estimated_impact/actions, uppercase priority comparison)
- AIA-003: add WEBVIEW to OutputFormat enum
- AIA-004: fix to_json() to return schema-conformant output via
  format_analysis_output(); add to_webview() method; store raw payloads
  as result._raw for schema-conformant serialization
- AIA-012: create ai_analysis/tests/test_api_standalone.py (23 tests)
  and tests/rocprofv3/rocpd/test_ai_analysis_standalone.py; update docs

Medium:
- AIA-005: re-raise LLMAuthenticationError/LLMRateLimitError instead of
  silently downgrading to warnings
- AIA-006: fix _convert_result_to_llm_format() to use real hotspot/
  memory/counter data from result._raw instead of empty placeholders
- AIA-007: implement file path redaction in _sanitize_data() using regex
- AIA-008: ReferenceGuideNotFoundError now lists all attempted paths;
  get_reference_guide_path() collects all paths before raising
- AIA-009: add DEFAULT_ANTHROPIC_MODEL/DEFAULT_OPENAI_MODEL constants;
  model names configurable via ROCPD_LLM_MODEL env var and new
  --llm-model CLI flag
- AIA-013: fix validate_database() to query type IN ('table','view')

Low:
- AIA-010: fix Optional type hints in exceptions.py
- AIA-011: export ReferenceGuideNotFoundError from __init__.py

Additional:
- Add --llm-model CLI flag to rocpd analyze (passes model to LLMAnalyzer
  via ROCPD_LLM_MODEL env var with proper save/restore)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sanitize_input_list() iterates over its argument, so passing a plain str
causes it to iterate over individual characters (e.g. 'p', 'r', 'o', ...).
Wrap the single path string in a list in both analyze_database() and
validate_database() so the path is treated as one item.

Fixes: analyze_database() returning 0 kernels when called via the Python
API even though the CLI works correctly.
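The underlying gotcha is plain Python string iteration; a toy sketch of the bug and the wrapping fix (the sanitizer body here is a stand-in, not the real one):

```python
def sanitize_input_list(paths):
    """Toy stand-in for the real sanitizer: it iterates over its argument,
    which is the crux of the bug described above."""
    return [str(p) for p in paths]

# Passing a bare string iterates over its characters:
#   sanitize_input_list("prof.db") -> ['p', 'r', 'o', 'f', ...]
# The fix is to wrap single path strings in a list at the call sites:
def analyze_database_paths(db_path):
    return sanitize_input_list([db_path])  # wrapped, so one item comes back
```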

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add if __name__ == '__main__' entry point to test_ai_analysis_standalone.py
  so it can be invoked directly by Python (required for CTest integration)
- Add configure_file() to copy test file to build directory at cmake time
- Add rocprofiler_add_integration_execute_test() registering
  rocprofv3-test-rocpd-ai-analysis-unit-tests (test #597) with labels
  integration-tests;rocpd;pytest and 120s timeout
- 23 tests pass via: ctest -R rocprofv3-test-rocpd-ai-analysis-unit-tests -V

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…fy output format

- llm_analyzer.py: try max_completion_tokens first (required by gpt-5, o1, o3,
  and newer gpt-4o variants); fall back to legacy max_tokens transparently if
  the model reports max_completion_tokens as unsupported (old models)
- analyze.py: print a format hint when output defaults to text (.txt), so users
  know to add --format webview / --format json / --format markdown
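The token-parameter fallback can be sketched with an injected call function; `call_api` stands in for the real OpenAI client call, and a TypeError stands in for the API's "unsupported parameter" error that the real code keys off:

```python
def complete_with_token_param(call_api, model, limit):
    """Try max_completion_tokens first (gpt-5/o1/o3 style); retry with the
    legacy max_tokens when the model rejects it. Sketch of the fallback
    described above, not the PR's actual client code."""
    try:
        return call_api(model=model, max_completion_tokens=limit)
    except TypeError:
        # Older models/endpoints only accept max_tokens.
        return call_api(model=model, max_tokens=limit)
```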

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ters

The recommendation engine was suggesting commands like:
  rocprofv3 --pmc GRBM_COUNT GRBM_GUI_ACTIVE SQ_WAVES ...
even when those exact counters were already present in pmc_events.

Root cause: _detect_already_collected() tracked trace flags (--sys-trace,
--kernel-trace, etc.) but never inspected pmc_events for counter names.
_filter_rec_commands() only checked command flags, not --pmc arg values.

Fixes:
- _detect_already_collected(): query pmc_events for DISTINCT counter_name;
  add "pmc:<NAME>" entries to the covered frozenset for each counter found
- _filter_rec_commands(): for rocprofv3 commands, strip already-collected
  counters from the --pmc arg value; drop --pmc entirely if all counters
  are covered; treat --kernel-names as a scope filter (not data collection)
  so a command reduced to only scope+output args is dropped cleanly;
  append note listing removed counters to recommendation description
- Add 7 unit tests covering full/partial/zero PMC stripping, full_command
  update, description note, kernel-names-only drop, and rocprof-compute
  always-kept behavior

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… hint

SCHEMA_CHANGELOG.md — add v0.1.8 entry covering:
- PMC counter deduplication: _detect_already_collected() now inspects
  pmc_events; _filter_rec_commands() strips already-collected counters
  from --pmc args and drops fully-redundant commands
- OpenAI max_completion_tokens compatibility for gpt-5/o1/o3
- Output format hint when text is the default
- CTest registration of 23 AI analysis API unit tests

AI_ANALYSIS_API.md:
- Add "Recommendation Deduplication" section explaining the PMC and
  trace-flag deduplication table and behavior
- Note OpenAI model compatibility (max_completion_tokens auto-fallback)

CLAUDE.md:
- Bump schema version reference: v0.1.1 → v0.1.8
- Update test count: 69 → 76 (7 new PMC filter tests)
- Add PMC deduplication and OpenAI compat notes to Python API section

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nter knowledge

Incorporate knowledge from four AMD ROCm profiling blog articles to improve
LLM-guided analysis quality and progressive recommendation accuracy.

Key additions:
- Recommended AMD 3-step profiling workflow: rocprof-sys (system timeline)
  → rocprofv3 (hardware counters on hot kernels) → rocprof-compute (deep
  analysis); guide LLM to recommend only the incremental next step
- Amdahl's Law as the core prioritization principle (focus on kernels >10%
  of total time only)
- VGPR→Occupancy table for all CDNA architectures (32/64/96/128/168/256
  VGPRs mapped to occupancy %)
- Hardware Counter Reference table with 10+ counters and derived metric
  formulas (GPU utilization, BW, L2 hit rate, VALU util, LDS util)
- Bandwidth formula: (FETCH_SIZE + WRITE_SIZE) * 64 bytes / duration_ns
- Memory Hierarchy section: VGPR→LDS→L1→L2→HBM with per-GPU cache sizes
  and hit-rate thresholds that indicate problems
- LDS bank conflicts: 32 banks, detection and avoidance patterns
- API/Launch Overhead as a new explicit bottleneck type
- ILP and HIP Streams as new optimization techniques
- Multi-GPU/MPI profiling guidance in the rocprof-sys section
- Ridge points per GPU: MI300X ~31, MI250X ~15, MI100 ~19 FLOP/Byte
- Confidence level examples with concrete counter-based phrasing
- Expanded GPU specs: SIMDs per CU (4), max waves per SIMD (8), L1 sizes
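The bandwidth formula above reduces to a one-liner; since 1 byte/ns equals 1 GB/s, dividing bytes moved by the duration in nanoseconds yields GB/s directly (counter units follow the guide's formula):

```python
def achieved_bandwidth_gbs(fetch_size, write_size, duration_ns):
    """Achieved bandwidth per the guide's formula:
    (FETCH_SIZE + WRITE_SIZE) * 64 bytes / duration_ns, returned in GB/s."""
    bytes_moved = (fetch_size + write_size) * 64
    return bytes_moved / duration_ns  # bytes per ns == GB/s
```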

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements interactive.py with SessionData, PersistentMenuItem, HistoryEntry
dataclasses and SessionStore (save/load/find_by_source_dir) for --interactive
session file I/O under ~/.rocpd/sessions/. All 5 unit tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Wrap load() body in try/except; on failure emit warnings.warn and return None
- Replace lambda sort key in find_by_source_dir with _safe_dt() using datetime.fromisoformat + fallback to datetime.min
- Remove redundant 'import dataclasses' inside to_dict() (already at module level)
- Widen SessionStore.__init__ type hint to Union[str, pathlib.Path]; add Union to imports
- Add 5 new tests: malformed JSON skipped, make_session_id slug/spaces/fallback, newest-first ordering
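The _safe_dt sort key follows the standard parse-or-sentinel pattern; a sketch under the assumption that sessions carry an ISO "created_at" field:

```python
import datetime

def _safe_dt(session):
    """Sort key: parse the ISO timestamp, fall back to datetime.min on
    malformed data so one bad session file cannot break ordering."""
    try:
        return datetime.datetime.fromisoformat(session.get("created_at", ""))
    except (TypeError, ValueError):
        return datetime.datetime.min

def newest_first(sessions):
    return sorted(sessions, key=_safe_dt, reverse=True)
```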

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… resume prompt

Implements Task 2 of the interactive session feature:
- Add rendering helpers (_print, _input, _PRI_STYLE) with optional rich console support
- Add InteractiveSession class with main event loop, session init/resume logic, and save-on-quit
- Add _prompt_resume() for auto-detecting and offering to resume prior sessions
- Add _render_main_menu() showing persistent menu items from previous analyses
- Add stubs for _path_profiling(), _path_optimize(), _pursue_recommendation()
- Add TestInteractiveSessionMenu with 3 tests (new session, quit-saves, resume-loads)
- All 13 tests pass (10 existing TestSessionStore + 3 new TestInteractiveSessionMenu)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…remove duplicate import

- Wrap `_input()` in `run()` with try/except EOFError to call save-and-quit gracefully
- Print feedback message in `_prompt_resume()` when selection is out of range or unrecognized
- Remove duplicate `from rich.panel import Panel` inside `_render_main_menu()` (module-level import already covers it)
- Add 4 new tests: [s] save without quit, EOF exits cleanly, numeric entry pursues recommendation, invalid resume choice starts new session (17 tests total, all passing)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…k 3)

Replace the _path_profiling stub with a full implementation that displays
profiling commands from tier0 and existing recommendations, optionally
annotates them via LLM (metadata only, no source text), prompts for a .db
file path, runs Tier 1/2 analysis, and promotes resulting recommendations
to the persistent menu. Add _collect_profiling_commands,
_llm_annotate_profiling_plan, and _run_tier1_analysis helpers. Add
TestPathProfiling with 2 tests; all 19 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ammarwa and others added 14 commits March 13, 2026 00:45
Add _update_checkpoint_with_run() to WorkflowSession that finds the most
recent CheckpointRecord without a run attached, sets its run_index to the
latest trace_history index, and computes performance_delta_pct from
total_runtime_ns when two or more analysis snapshots are available. Hook
the method into _phase3_run_profiler after both successful trace-run save
sites (trace-files-found path and manual-DB-entry path).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…parate methods

_update_checkpoint_with_run() was computing performance_delta_pct from
analysis_history before Phase 4 had appended the current run's analysis,
causing delta to always read stale data. Refactor: Phase 3 only sets
run_index via the existing method; new _update_checkpoint_delta() is called
from Phase 4 after _record_analysis() so analysis_history[-1] is always the
current run. Add test_update_checkpoint_delta_noop_when_insufficient_history.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add _rollback_to_checkpoint, _blacklist_checkpoint, and _build_blacklist_block
to WorkflowSession, plus _restore_from_snapshots helper. Rollback uses git fast
path when commit is reachable, falls back to file_snapshots otherwise. 9 new
tests added; all 45 workflow tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
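The two restore paths (git fast path, file-snapshot fallback) can be sketched as below. This is a hedged sketch only: the `commit` and `file_snapshots` field names and the `_commit_reachable` helper are illustrative assumptions based on the commit message, not the PR's actual signatures.

```python
import subprocess

def _commit_reachable(sha, repo_root):
    # git cat-file -e exits 0 only if the object exists as a commit
    res = subprocess.run(
        ["git", "-C", repo_root, "cat-file", "-e", sha + "^{commit}"],
        capture_output=True,
    )
    return res.returncode == 0

def rollback_to_checkpoint(cp, repo_root):
    """Restore source files for a checkpoint: use git when the checkpoint
    commit is still reachable, else fall back to the in-session
    file_snapshots mapping (path -> saved file text)."""
    if cp.commit and _commit_reachable(cp.commit, repo_root):
        subprocess.run(
            ["git", "-C", repo_root, "checkout", cp.commit, "--", "."],
            check=True,
        )
        return "git"
    for path, text in cp.file_snapshots.items():
        with open(path, "w") as fh:
            fh.write(text)
    return "snapshots"
```

The snapshot fallback keeps rollback working even when the repository refs have been garbage-collected or git is unavailable.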
…ves session

- Remove early `return` in `_rollback_to_checkpoint` when target_cp_id==-1
  and git is unavailable: execution now falls through to the cleanup section
  so checkpoints, trace_history, analysis_history, and iteration_count are
  always cleared even when file restore is impossible.
- Add `self._save_session()` at the end of `_blacklist_checkpoint` so the
  blacklisted flag is persisted to disk immediately after mutation.
- Add test `test_rollback_baseline_no_git_still_clears_state` to verify the
  baseline-no-git path clears all state (46 tests total, all passing).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add _show_checkpoint_picker() to WorkflowSession displaying a checkpoint
table with performance deltas and prompting for optional blacklisting of
regression checkpoints before restoring. Wire [b] into _phase5_rec_menu
across all three menu paths (already_reprofiled, all_info, HIGH/MEDIUM).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Validate cp_id against actual cp_id set (not list length) to handle non-contiguous ids
- Show blacklist prompt for baseline rollback (not just partial rollbacks)
- Replace raw input() calls with _input() wrapper for EOFError safety
- Strengthen test assertion to verify _blacklist_checkpoint called with correct cp_id

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
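The non-contiguous-id fix above amounts to resolving checkpoints by their `cp_id` field rather than by list position; a minimal illustration (names assumed from the commit message):

```python
def find_checkpoint(checkpoints, cp_id):
    """Look a checkpoint up by its cp_id field, not by list index, so ids
    stay valid even after earlier records are removed by a rollback."""
    return next((cp for cp in checkpoints if cp.cp_id == cp_id), None)
```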
When _build_blacklist_block() returns a non-empty string, prepend it to
the suggestions passed to _llm_rewrite_file so the LLM avoids previously
failed approaches when rewriting source files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
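The injection described above reduces to prepending the blacklist text only when it is non-empty; a sketch (the function name and string-based plumbing are assumptions from the commit message):

```python
def effective_suggestions(blacklist_block: str, suggestions: str) -> str:
    """Prepend previously-failed approaches so the LLM rewrite avoids
    repeating them; pass suggestions through unchanged when the block
    is empty (also prevents the prefix accumulating across retries)."""
    if blacklist_block:
        return blacklist_block.rstrip() + "\n\n" + suggestions
    return suggestions
```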
Add _teardown_checkpoints() to remove all checkpoint worktrees when a
WorkflowSession exits (refs are preserved for GC protection). Add
_prune_stale_worktrees() to clean up orphaned worktrees from crashed
sessions at startup. Both are hooked into run(): pruning after
_init_checkpoints(), teardown in the finally block. Current session
worktrees are never pruned.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e path

Wrap remove_worktree calls in _teardown_checkpoints with try/except so
any exception (e.g. FileNotFoundError when git is missing) cannot
propagate out of the finally block and suppress _save_session.

Also add an early-return guard in GitCheckpointManager.remove_worktree
for empty worktree_path strings, preventing a spurious git error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
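The hardened teardown path can be sketched as follows. This is a simplified illustration; the `worktree_path` field and the manager interface are assumptions taken from the commit messages, not the PR's exact code.

```python
def teardown_checkpoints(checkpoints, gcm, log=print):
    """Remove all checkpoint worktrees on session exit. Refs are left in
    place so git GC cannot collect the checkpoint commits. Per-worktree
    errors are swallowed so the enclosing finally block can still save
    the session."""
    for cp in checkpoints:
        if not cp.worktree_path:  # guard against empty paths
            continue
        try:
            gcm.remove_worktree(cp.worktree_path)
        except Exception as exc:  # e.g. FileNotFoundError when git is missing
            log("warning: could not remove worktree %s: %s" % (cp.worktree_path, exc))
```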
- CheckpointRecord dataclass with file_snapshots for offline restore
- GitCheckpointManager: git commit + update-ref + worktree add --detach per edit
- WorkflowState: repo_root, baseline_commit, checkpoints, active_checkpoint
- Phase 6: creates checkpoint after each AI edit batch
- Phase 3: records run_index and performance_delta_pct per checkpoint
- Phase 5: [b] rollback menu with checkpoint picker and blacklist prompt
- Blacklist: uses edit_summary directly; deduplicates; injects into Phase 6 LLM prompt
- Session exit: removes worktrees (refs stay for GC protection)
- Session start: dirty-tree abort; stale worktree pruning
- Fix: remove spurious _conv attribute from WorkflowSession (test_workflow_session_has_no_conv)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix blacklist lost after rollback: persist blacklisted_approaches on WorkflowState
- Fix suggestions accumulating blacklist prefix on each retry: use effective_suggestions
- Fix cp_id lookup: use search-by-id instead of list index in rollback and blacklist
- Fix _gcm left set after dirty-tree abort in _init_checkpoints
- Fix pathlib.Path.exists mock in test to use return_value=False

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ammarwa ammarwa requested a review from Copilot March 13, 2026 06:48
Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

ammarwa and others added 2 commits March 13, 2026 01:50
…ck formatting

- Remove is_dirty() from GitCheckpointManager — dirty working tree is not an
  obstacle because commit_files uses git add -- <specific_file> which only
  stages the exact files modified by each AI edit, leaving other in-progress
  changes untouched
- Remove the dirty-tree guard from _init_checkpoints() so sessions continue
  normally even when the repo has uncommitted changes
- Fix flake8 F841 in remove_worktree: drop unused result = assignment
- Apply black formatting to interactive.py and test_workflow.py
- Update tests: replace test_session_start_aborts_when_dirty with
  test_checkpoints_work_with_dirty_tree confirming checkpoints initialise
  successfully despite a dirty tree; remove two now-deleted is_dirty tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR adds a new rocpd analyze module to generate offline, human-readable GPU trace insights (with optional LLM enhancement), plus supporting packaging/build integration and a substantial unit/integration test suite.

Changes:

  • Adds AI analysis Python package (rocpd.ai_analysis), including persistent LLM conversation support and TraceLens-derived analysis utilities.
  • Integrates the new analyze subcommand into the rocpd CLI and CMake test/packaging flows.
  • Introduces extensive standalone/unit/integration tests for schema conformance, guide filtering, interactive workflow/checkpoints, and TraceLens port logic.

Reviewed changes

Copilot reviewed 29 out of 36 changed files in this pull request and generated 3 comments.

Show a summary per file
  • projects/rocprofiler-sdk/tests/rocprofv3/rocpd/test_guide_filter_standalone.py: Adds standalone tests for guide section tag selection and filtering logic.
  • projects/rocprofiler-sdk/tests/rocprofv3/rocpd/test_analyze_schema.py: Adds schema structure + conformance tests (incl. Tier 0 and combined outputs) with Py3.6 shim.
  • projects/rocprofiler-sdk/tests/rocprofv3/rocpd/test_ai_analysis_standalone.py: Adds standalone API + regression tests for AI analysis behaviors and security/correctness fixes.
  • projects/rocprofiler-sdk/tests/rocprofv3/rocpd/CMakeLists.txt: Wires rocpd analyze into integration tests and runs standalone pytest-based test scripts.
  • projects/rocprofiler-sdk/tests/pytest-packages/pytest_utils/perfetto_reader.py: Minor SQL formatting cleanup in trace reader query.
  • projects/rocprofiler-sdk/source/scripts/format-deps.py: Removes unused import and reformats argparse definition.
  • projects/rocprofiler-sdk/source/lib/python/utilities.cmake: Installs analyze.py, tracelens_port.py, and copies ai_analysis runtime assets (excluding tests).
  • projects/rocprofiler-sdk/source/lib/python/rocpd/tracelens_port.py: Adds TraceLens-derived interval/categorization/short-kernel analysis utilities.
  • projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/tests/test_workflow.py: Adds mock-based tests for workflow session phases and git checkpoint manager behavior.
  • projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/tests/test_tracelens_port.py: Adds unit + optional integration tests for tracelens_port functions.
  • projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/tests/test_local_llm.py: Adds tests for local OpenAI-compatible endpoint provider behavior.
  • projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/tests/test_llm_conversation.py: Adds tests for streaming, compaction, persistence, and interactive integration for LLMConversation.
  • projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/tests/test_interactive.py: Adds tests for session storage/menu behavior and profiling/optimize flows.
  • projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/tests/test_api_standalone.py: Adds standalone tests for public API, exceptions, serialization, and recommendation bucketing.
  • projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/tests/__init__.py: Marks ai_analysis tests package.
  • projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/share/amd_rocm_logo.png: Adds branding asset used by interactive UI.
  • projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/llm_conversation.py: Introduces persistent multi-turn LLM session with streaming + compaction + archive.
  • projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/exceptions.py: Adds typed exception hierarchy for AI analysis module.
  • projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/docs/LLM_GUIDE_SECTIONS.md: Documents context-tagged guide section filtering system.
  • projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/__init__.py: Exposes public AI analysis API surface + lazy interactive imports.
  • projects/rocprofiler-sdk/source/lib/python/rocpd/main.py: Adds rocpd analyze CLI subcommand and argument validation.
  • projects/rocprofiler-sdk/source/bin/rocprofv3.py: Formatting-only change to env var update call.
  • projects/rocprofiler-sdk/cmake/Modules/rocprofiler-sdk-utilities.cmake: Avoids list(GET ...) errors when no GPUs are detected at configure time.
  • .gitignore: Ignores Claude session data, Python bytecode, and generated analysis output directory.


ammarwa and others added 3 commits March 13, 2026 02:03
…compat in CMake schema test

- llm_conversation.py: after parsing ROCPD_LLM_PRIVATE_HEADERS, validate
  the result is a dict and raise a clear ValueError if it is not (e.g. if
  the env var was set to a JSON array or string instead of an object)
- tests/rocprofv3/rocpd/CMakeLists.txt: replace importlib.resources.files()
  with pkgutil.get_data() in the inline schema-validate test so it works on
  Python 3.6 where importlib.resources.files() is not available; also
  replace f-strings with str concatenation for broad Python compatibility

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…own tag

- test_ai_analysis_standalone.py: test_kernel_name_shell_quoted_in_full_command
  was filtering for rocprofv3 commands but the kernel name only appears in the
  rocprof-compute command (rocprofv3 collects general PMC counters without
  kernel-name scoping). Switch filter to rocprof-compute where shlex.quote()
  is correctly applied.

- test_guide_filter_standalone.py: add tracelens_metrics to KNOWN_TAGS —
  this tag is used in llm_analyzer.py (_select_tags adds it when TraceLens
  data is present) and tagged in llm-reference-guide.md, but was missing
  from the vocabulary guard set causing test_all_tags_are_from_known_vocabulary
  to fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…or test scripts

configure_file(COPYONLY) runs only at cmake configure time, leaving stale
copies in the build directory when test files are edited during development.

Introduce rocpd_stage_test_script() helper function that uses:
  add_custom_command(OUTPUT ... DEPENDS <src>) + add_custom_target(ALL ...)

This means cmake --build re-copies any test file whose source has changed,
without requiring the developer to re-run cmake configure.

Also adds set_property(CMAKE_CONFIGURE_DEPENDS) so cmake does re-configure
automatically when a CI system or fresh checkout triggers it.

Replace all configure_file COPYONLY calls for Python test scripts (both the
tests/rocprofv3/rocpd/ originals and the ai_analysis/tests/ sub-package
copies) with rocpd_stage_test_script().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ammarwa ammarwa marked this pull request as ready for review March 13, 2026 07:25
@ammarwa ammarwa requested review from a team as code owners March 13, 2026 07:25
@ammarwa ammarwa requested a review from bwelton March 13, 2026 07:25
@ammarwa ammarwa requested a review from bgopesh March 13, 2026 07:28