Skip to content

Conversation

@liqiangxl
Copy link
Collaborator

@github-actions
Copy link

github-actions bot commented Nov 15, 2025

Review updated until commit 5868dd7

Description

  • Add comprehensive debug logging to TMA analysis code for better troubleshooting

  • Implement 256-element hardware limitation check in MergeTileGroupsByRotation

  • Add size validation in DomainMerger to prevent exceeding hardware limits

  • Add three new test cases: two illegal size tests and one legal size test

Changes walkthrough

Relevant files
Enhancement
tma.cpp
Add TMA debug logging and 256-element limit checks             

csrc/device_lower/analysis/tma.cpp

  • Added extensive debug output throughout TMA analysis passes for better
    troubleshooting
  • Implemented 256-element hardware limitation check in
    MergeTileGroupsByRotation::condition to prevent invalid merges
  • Enhanced DomainMerger::shouldMerge with size validation against
    256-element limit
  • Added comprehensive logging for TMA domain merging phases and final
    dimension states
  • +265/-6 
    Tests
    test_memory.cpp
    Add TMA illegal size test cases                                                   

    tests/cpp/test_memory.cpp

  • Added OneDimTensor2dTile2x256Illegal test for 1D tensor with invalid
    2x256 tiling
  • Added TwoDimTensor2dTile2x256Illegal test for 2D tensor with invalid
    2x256 tiling
  • Added TwoDimTensor2dTile2x256Legal test for 2D tensor with valid 2x256
    tiling
  • +133/-0 

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review
    Debug Output Removal

    Extensive debug output has been added throughout the TMA analysis code (std::cout statements). This debug output should be removed or made conditional (guarded by a debug flag) before merging to production code.

    std::cout << "[MergeTileGroupsByRotation::condition] Checking expr: "
              << expr->front()->toString() << std::endl;
    std::cout << "  from[0] (partitioned): " << from_.at(0)->toString()
              << " extent="
              << from_.at(0)->front()->as<IterDomain>()->extent()->toString()
              << std::endl;
    std::cout << "  from[1]: " << from_.at(1)->toString() << " extent="
              << from_.at(1)->front()->as<IterDomain>()->extent()->toString()
              << std::endl;
    std::cout << "  to[0]: " << to_.at(0)->toString() << " extent="
              << to_.at(0)->front()->as<IterDomain>()->extent()->toString()
              << std::endl;
    std::cout
        << "  partitioned_dim->box extent: "
        << partitioned_dim->box->front()->as<IterDomain>()->extent()->toString()
        << std::endl;
    std::cout << "  partitioned_dim->box is bulk: " << box_is_bulk << std::endl;
    std::cout << "  from[1] is bulk: " << from1_is_bulk << std::endl;
    
    if (bulk_groups_.count(partitioned_dim->box) == 0 ||
        bulk_groups_.count(from_.at(1)) == 0) {
      std::cout << "  Result: FALSE (not both bulk)" << std::endl;
      return false;
    }
    NVF_ERROR(
        partitioned_dim->tile == partitioned_dim->box &&
        partitioned_dim->stride == nullptr);
    
    // Check if merging would exceed the 256-element hardware limit
    auto box_extent = partitioned_dim->box->front()->as<IterDomain>()->extent();
    auto from1_extent = from_.at(1)->front()->as<IterDomain>()->extent();
    Val* merged_extent =
        SimplifyingIrBuilder::mulExpr(box_extent, from1_extent);
    
    constexpr int64_t largest_dim_size = 256; // Hardware limitation
    Val* too_large_after_merge = SimplifyingIrBuilder::gtExpr(
        merged_extent, IrBuilder::create<Val>(largest_dim_size));
    
    std::cout << "  Merged box size would be: " << box_extent->toString()
              << " * " << from1_extent->toString() << " = "
              << merged_extent->toString() << std::endl;
    
    if (simplifyExpr(too_large_after_merge)->isTrue()) {
      std::cout << "  Result: FALSE (would exceed 256 element limit)"
                << std::endl;
      return false;
    }
    
    std::cout
        << "  Result: TRUE - WILL MERGE (pattern matches and within limits!)"
        << std::endl;
    Performance Impact

    The debug output in loops (lines 661-703) could significantly impact performance if left in production code, especially for complex TMA operations with many iterations.

    std::cout << "\n[INFER_ROLES DEBUG] Starting infer_roles passes" << std::endl;
    int iteration = 0;
    bool changed = true;
    while (changed) {
      changed = false;
      std::cout << "\n[INFER_ROLES DEBUG] === Iteration " << iteration++
                << " ===" << std::endl;
      std::cout << "  Remaining exprs: " << exprs.size() << std::endl;
      std::cout << "  Current inferred_dims: " << inferred_dims.size()
                << std::endl;
      for (const auto& dim : inferred_dims) {
        std::cout << "    " << dim << std::endl;
      }
    
      bool b = bulk_pass.run(exprs);
      if (b)
        std::cout << "  bulk_pass made changes" << std::endl;
      changed = changed || b;
    
      b = nonbulk_pass.run(exprs);
      if (b)
        std::cout << "  nonbulk_pass made changes" << std::endl;
      changed = changed || b;
    
      b = striding_split_pass.run(exprs);
      if (b)
        std::cout << "  striding_split_pass made changes" << std::endl;
      changed = changed || b;
    
      b = boxing_split_pass.run(exprs);
      if (b)
        std::cout << "  boxing_split_pass made changes" << std::endl;
      changed = changed || b;
    
      b = move_partitioned_pass.run(exprs);
      if (b)
        std::cout << "  move_partitioned_pass made changes" << std::endl;
      changed = changed || b;
    
      b = merge_tile_pass.run(exprs);
      if (b)
        std::cout << "  merge_tile_pass made changes" << std::endl;
      changed = changed || b;

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Labels

    None yet

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    2 participants