-
Notifications
You must be signed in to change notification settings - Fork 40
Add global WatchDog #1333
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Add global WatchDog #1333
Conversation
|
|
||
| extern int my_rank, nranks, nghost; | ||
| extern bool is_restart; | ||
| extern bool watchdog_enabled; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super happy about adding a global variable, but given that the watchdog itself may or may not be tied to the driver, this reduces the amount of "threading through" and also should allow downstream to flexibly call the dog from any place.
Yurlungur
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this won't be problematic if we run with kokkos openmp backend?
| } | ||
| } | ||
| } // namespace WatchDog | ||
| } // namespace parthenon No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing newline
I don't think so because, we're already pulling the standard |
lroberts36
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks good to me, although I think the interface is a little confusing. Since it is more work I am not sure it is worth it, but I would use a singleton pattern for the watchdog. This removes the explicit global variables and allows for a cleaner interface. A chatgpt written implementation that also uses std library threads (untested) is:
watch_dog.hpp:
#pragma once
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
class Watchdog {
public:
// Start the watchdog once; subsequent calls are no-ops.
static void init(int timeout_sec, int mpi_rank);
// Increment progress. Does nothing if init() has not been called.
static void tick();
// Graceful shutdown (useful for tests/tools). Safe to call even if not running.
static void stop();
// Whether the watchdog thread is running.
static bool is_running();
private:
Watchdog() = default;
Watchdog(const Watchdog&) = delete;
Watchdog& operator=(const Watchdog&) = delete;
static Watchdog& instance();
void loop();
static const char* ts();
void log_plain(const char* msg);
void log_rank(const char* msg);
// state
std::once_flag once_;
std::thread th_;
std::mutex m_, io_;
std::condition_variable cv_;
bool running_ = false; // protected by m_
unsigned long long progress_ = 0; // protected by m_
std::chrono::seconds timeout_{0}; // set at init, then read-only
int rank_ = 0; // set at init, then read-only
};watch_dog.cpp:
#include "watchdog.hpp"
#include <cstdio>
#include <cstdlib>
#include <ctime>
Watchdog& Watchdog::instance() {
static Watchdog w;
return w;
}
void Watchdog::init(int timeout_sec, int mpi_rank) {
auto& w = instance();
std::call_once(w.once_, [&]{
w.timeout_ = std::chrono::seconds(timeout_sec);
w.rank_ = mpi_rank;
{
std::lock_guard<std::mutex> lk(w.m_);
w.running_ = true;
}
w.th_ = std::thread(&Watchdog::loop, &w);
if (w.rank_ == 0) w.log_plain("Starting.");
});
}
void Watchdog::stop() {
auto& w = instance();
{
std::lock_guard<std::mutex> lk(w.m_);
if (!w.running_) return;
w.running_ = false;
}
w.cv_.notify_all();
if (w.th_.joinable()) w.th_.join();
w.log_plain("Stopping.");
}
bool Watchdog::is_running() {
auto& w = instance();
std::lock_guard<std::mutex> lk(w.m_);
return w.running_;
}
void Watchdog::tick() {
auto& w = instance();
std::unique_lock<std::mutex> lk(w.m_);
if (!w.running_) return; // no-op before init()
++w.progress_;
lk.unlock();
w.cv_.notify_all();
}
void Watchdog::loop() {
unsigned long long last;
{
std::lock_guard<std::mutex> lk(m_);
last = progress_;
}
while (true) {
std::unique_lock<std::mutex> lk(m_);
const bool woke = cv_.wait_for(lk, timeout_, [&]{ return !running_ || progress_ != last; });
if (!running_) break;
if (woke && progress_ != last) {
last = progress_;
lk.unlock();
if (rank_ == 0) log_plain("Everything is fine.");
} else {
lk.unlock();
log_rank("is not progressing.");
log_plain("Terminating...");
std::abort();
}
}
}
const char* Watchdog::ts() {
static thread_local char buf[32];
std::time_t t = std::time(nullptr);
#if defined(_WIN32)
std::tm tm{}; localtime_s(&tm, &t);
#else
std::tm tm{}; localtime_r(&t, &tm);
#endif
std::strftime(buf, sizeof(buf), "%a %b %d %H:%M:%S %Y", &tm);
return buf;
}
void Watchdog::log_plain(const char* msg) {
std::lock_guard<std::mutex> lk(io_);
std::fprintf(stderr, "[WATCHDOG (%s)] %s\n", ts(), msg);
std::fflush(stderr);
}
void Watchdog::log_rank(const char* msg) {
std::lock_guard<std::mutex> lk(io_);
std::fprintf(stderr, "[WATCHDOG (%s)] Rank %d %s\n", ts(), rank_, msg);
std::fflush(stderr);
}
PR Summary
Again hotfix because of machine issues.
I refactored the AthenaK one (which itself is ported from the ETK).
While this works as is it still needs
PR Checklist