Skip to content

Conversation

@pgrete
Copy link
Collaborator

@pgrete pgrete commented Oct 24, 2025

PR Summary

Again hotfix because of machine issues.
I refactored the AthenaK one (which itself is ported from the ETK).

While this works as is it still needs

  • docs (also for downstream use wrt to poking the dog)
  • think about other places (on top of the execute) where we might want to call the dog
  • potentially modernize the implementation

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary API changes at the top of the PR (plus optionally with an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files


extern int my_rank, nranks, nghost;
extern bool is_restart;
extern bool watchdog_enabled;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super happy about adding a global variable, but given that the watchdog itself may or may not be tied to the driver, this reduces the amount of "threading through" and also should allow downstream to flexibly call the dog from any place.

Copy link
Collaborator

@Yurlungur Yurlungur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this won't be problematic if we run with kokkos openmp backend?

}
}
} // namespace WatchDog
} // namespace parthenon No newline at end of file
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing newline

@pgrete
Copy link
Collaborator Author

pgrete commented Oct 25, 2025

I assume this won't be problematic if we run with kokkos openmp backend?

I don't think so because, we're already pulling the standard Threads lib in by default (for the task mgmt IIRC).

Copy link
Collaborator

@lroberts36 lroberts36 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me, although I think the interface is a little confusing. Since it is more work I am not sure it is worth it, but I would use a singleton pattern for the watchdog. This removes the explicit global variables and allows for a cleaner interface. A chatgpt written implementation that also uses std library threads (untested) is:

watch_dog.hpp:

#pragma once
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

class Watchdog {
public:
  // Start the watchdog once; subsequent calls are no-ops.
  static void init(int timeout_sec, int mpi_rank);

  // Increment progress. Does nothing if init() has not been called.
  static void tick();

  // Graceful shutdown (useful for tests/tools). Safe to call even if not running.
  static void stop();

  // Whether the watchdog thread is running.
  static bool is_running();

private:
  Watchdog() = default;
  Watchdog(const Watchdog&) = delete;
  Watchdog& operator=(const Watchdog&) = delete;

  static Watchdog& instance();
  void loop();

  static const char* ts();
  void log_plain(const char* msg);
  void log_rank(const char* msg);

  // state
  std::once_flag once_;
  std::thread th_;
  std::mutex m_, io_;
  std::condition_variable cv_;

  bool running_ = false;                    // protected by m_
  unsigned long long progress_ = 0;         // protected by m_
  std::chrono::seconds timeout_{0};         // set at init, then read-only
  int rank_ = 0;                             // set at init, then read-only
};

watch_dog.cpp:

#include "watchdog.hpp"
#include <cstdio>
#include <cstdlib>
#include <ctime>

Watchdog& Watchdog::instance() {
  static Watchdog w;
  return w;
}

void Watchdog::init(int timeout_sec, int mpi_rank) {
  auto& w = instance();
  std::call_once(w.once_, [&]{
    w.timeout_ = std::chrono::seconds(timeout_sec);
    w.rank_ = mpi_rank;
    {
      std::lock_guard<std::mutex> lk(w.m_);
      w.running_ = true;
    }
    w.th_ = std::thread(&Watchdog::loop, &w);
    if (w.rank_ == 0) w.log_plain("Starting.");
  });
}

void Watchdog::stop() {
  auto& w = instance();
  {
    std::lock_guard<std::mutex> lk(w.m_);
    if (!w.running_) return;
    w.running_ = false;
  }
  w.cv_.notify_all();
  if (w.th_.joinable()) w.th_.join();
  w.log_plain("Stopping.");
}

bool Watchdog::is_running() {
  auto& w = instance();
  std::lock_guard<std::mutex> lk(w.m_);
  return w.running_;
}

void Watchdog::tick() {
  auto& w = instance();
  std::unique_lock<std::mutex> lk(w.m_);
  if (!w.running_) return;            // no-op before init()
  ++w.progress_;
  lk.unlock();
  w.cv_.notify_all();
}

void Watchdog::loop() {
  unsigned long long last;
  {
    std::lock_guard<std::mutex> lk(m_);
    last = progress_;
  }

  while (true) {
    std::unique_lock<std::mutex> lk(m_);
    const bool woke = cv_.wait_for(lk, timeout_, [&]{ return !running_ || progress_ != last; });

    if (!running_) break;

    if (woke && progress_ != last) {
      last = progress_;
      lk.unlock();
      if (rank_ == 0) log_plain("Everything is fine.");
    } else {
      lk.unlock();
      log_rank("is not progressing.");
      log_plain("Terminating...");
      std::abort();
    }
  }
}

const char* Watchdog::ts() {
  static thread_local char buf[32];
  std::time_t t = std::time(nullptr);
#if defined(_WIN32)
  std::tm tm{}; localtime_s(&tm, &t);
#else
  std::tm tm{}; localtime_r(&t, &tm);
#endif
  std::strftime(buf, sizeof(buf), "%a %b %d %H:%M:%S %Y", &tm);
  return buf;
}

void Watchdog::log_plain(const char* msg) {
  std::lock_guard<std::mutex> lk(io_);
  std::fprintf(stderr, "[WATCHDOG (%s)] %s\n", ts(), msg);
  std::fflush(stderr);
}

void Watchdog::log_rank(const char* msg) {
  std::lock_guard<std::mutex> lk(io_);
  std::fprintf(stderr, "[WATCHDOG (%s)] Rank %d %s\n", ts(), rank_, msg);
  std::fflush(stderr);
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants