
Feature: Provide network hardware counter summary at completion time #13525

@sadlochr


HPE's CrayMPI has a feature where, during the call to MPI_Finalize, it can print a collection of network hardware performance counters or write them to NIC-specific files, at various granularities and verbosities: anything from a condensed global summary of network timeouts to a thousand NIC-specific hardware counters for every NIC. It also accepts a user-supplied list of specific hardware performance counters to track in place of the defaults. I think it would be nice if Open MPI could do the same.

I have used this feature before to quickly characterize performance behavior, or at least to form and test an informed set of theories.

A basic implementation would have a local root rank on each compute node sample the network hardware counters during MPI_Init(), provided the relevant environment variables are set; if they are not set, no sampling is performed. During MPI_Finalize(), each local root rank would take a final sample, compute delta values, and send the resulting data and metrics to the global root rank for collation. The global root rank would output the global summary, and the local root rank on each compute node would write the NIC-specific files.
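To make the flow above concrete, here is a rough C sketch of the bookkeeping only: the environment-variable gate, the per-node delta, and the collation step. Everything named here is hypothetical (the `OMPI_NIC_COUNTERS_SKETCH` variable, the `nic_snapshot` type, the fixed `NUM_COUNTERS`); none of it is existing Open MPI or HPE API. It also assumes the counters are full 64-bit values, so narrower hardware counters would need masking. Reading the actual NIC counters and the MPI communication are left out on purpose.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical: number of counters tracked per node in this sketch. */
#define NUM_COUNTERS 3

/* Hypothetical snapshot of the tracked counters on one node. */
typedef struct {
    uint64_t value[NUM_COUNTERS];
} nic_snapshot;

/* Gate: sample only if the (hypothetical) environment variable is set
 * to something other than an empty string or a string starting with '0'. */
static int counters_enabled(void)
{
    const char *v = getenv("OMPI_NIC_COUNTERS_SKETCH");
    return v != NULL && v[0] != '\0' && v[0] != '0';
}

/* Delta between the MPI_Init-time and MPI_Finalize-time samples.
 * Unsigned subtraction is modulo 2^64, so a single wraparound of a
 * full 64-bit counter still yields the right difference. */
static void snapshot_delta(const nic_snapshot *start,
                           const nic_snapshot *end,
                           nic_snapshot *delta)
{
    for (size_t i = 0; i < NUM_COUNTERS; i++) {
        delta->value[i] = end->value[i] - start->value[i];
    }
}

/* Collation on the global root: sum the per-node deltas.
 * In a real implementation the per_node array would arrive via
 * something like MPI_Gather, or the sum could be done directly
 * with MPI_Reduce and MPI_SUM. */
static void collate_sum(const nic_snapshot *per_node, size_t num_nodes,
                        nic_snapshot *total)
{
    for (size_t i = 0; i < NUM_COUNTERS; i++) {
        total->value[i] = 0;
    }
    for (size_t n = 0; n < num_nodes; n++) {
        for (size_t i = 0; i < NUM_COUNTERS; i++) {
            total->value[i] += per_node[n].value[i];
        }
    }
}
```

One possible way to pick the local root rank per node would be MPI_Comm_split_type with MPI_COMM_TYPE_SHARED and then taking rank 0 of the resulting communicator, though Open MPI may well have a better internal mechanism for this.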

Here is HPE's user guide for the feature I'm referring to:
https://cpe.ext.hpe.com/docs/latest/getting_started/HPE-Cassini-Performance-Counters.html

Thoughts?
