Bump to v2.0.12: VMWatch Integration #60
Merged
frank-pang-msft merged 71 commits into master on Jun 11, 2024
…33)

## Overview

This PR contains changes to support running VMWatch (amd64 and arm64) as an executable via goroutines and channels.

> VMWatch is a standardized, lightweight, and open-sourced testing framework designed to enhance the monitoring and management of guest VMs on the Azure platform, including both 1P and 3P instances. VMWatch is engineered to collect vital health signals across multiple dimensions, which will be seamlessly integrated into Azure's quality systems. By leveraging these signals, VMWatch will enable Azure to swiftly detect and prevent regressions induced by platform updates or configuration changes, identify gaps in platform telemetry, and ultimately improve the guest experience for all Azure customers.

## Behavior

VMWatch will run asynchronously as a separate process from ApplicationHealth, so the probing of application health will not be affected by the state of VMWatch. Depending on extension settings, VMWatch can be enabled or disabled, and the settings can also specify the test names and parameter overrides passed to the VMWatch binary. The status of VMWatch will be displayed in the extension x.status files and also in GET VM Instance View. The main process will attempt to start the VMWatch binary up to 3 times, after which VMWatch status will be set to failed.

## Process Leaks

To ensure that VMWatch processes do not accumulate, applicationhealth-shim will be responsible for killing existing VMWatch processes by looking for processes running with the VMWatch binary names according to the architecture type. To handle unexpected termination of the main applicationhealth-extension process, we also ensure that the VMWatch process is killed by subscribing to shutdown/termination signals in the main process and killing VMWatch by process ID.

## Example Binary Execution

Example execution from integration testing:

` SIGNAL_FOLDER=/var/log/azure/Microsoft.ManagedServices.ApplicationHealthLinux/events VERBOSE_LOG_FILE_FULL_PATH=/var/log/azure/Microsoft.ManagedServices.ApplicationHealthLinux/VE.RS.ION/vmwatch.log ./var/lib/waagent/Extension/bin/VMWatch/vmwatch_linux_amd64 --config /var/lib/waagent/Extension/bin/VMWatch/vmwatch.conf --input-filter disk_io:outbound_connectivity `

## Release/Packaging

In addition to the arm64 or amd64 VMWatch binaries, `vmwatch.conf` is expected to be present in the bin/VMWatch directory for the VMWatch process to read. VMWatch will also populate and share eventsFolder with ApplicationHealth, so events can be viewed in Kusto. The verbose logs of VMWatch will be written to `vmwatch.log`.

---------

Co-authored-by: klugorosado <[email protected]>
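The retry behavior described under Behavior (up to 3 start attempts, then a failed status) can be sketched roughly as follows. This is a minimal illustration, not the extension's actual code: `runWithRetries`, `errSimulated`, and the status strings are hypothetical names, and the `start` callback stands in for launching the VMWatch binary and waiting on it.

```go
package main

import (
	"errors"
	"fmt"
)

// maxAttempts mirrors the "up to 3 times" limit from the description.
const maxAttempts = 3

// errSimulated is a hypothetical error used only to drive the example.
var errSimulated = errors.New("simulated start failure")

// runWithRetries calls start until it succeeds or maxAttempts is reached.
// It returns the number of attempts made and the last error, if any.
func runWithRetries(start func() error) (int, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = start(); lastErr == nil {
			return attempt, nil
		}
	}
	return maxAttempts, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	// Supervise in a goroutine so the caller (health probing, in the
	// real extension) is never blocked; report the outcome on a channel.
	status := make(chan string, 1)
	go func() {
		if _, err := runWithRetries(func() error { return errSimulated }); err != nil {
			status <- "Failed" // hypothetical status value
			return
		}
		status <- "Running" // hypothetical status value
	}()
	fmt.Println("VMWatch status:", <-status)
}
```

Keeping the supervisor in its own goroutine is what lets application health probing continue regardless of what VMWatch does.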
* Bootstrapping has no integration test regressions
* Add cleanup of VMWatch process during shutdown signals and upon other commands, plus integration test template
* Added integration tests for VMWatch
* Linting
* Fix file vet issues
* attempt to fix handler command: install - creates the data dir
* nit integration tests
* Use handlerenvironment to dictate vmwatch signal folder and verbose log file paths
* Include missing changes in previous commit
* Remove unnecessary changes
* Try to fix docker installation error in go workflow
* Fix integration tests
* Update HandlerManifest with process names for guest agent to monitor cpu/memory usage
* Run linting
* Remove cpu/memory limits in HandlerManifest + update VMWatch binary directory to bin/VMWatch/ + implement VMWatch process retries + update integration tests
* Update test.Dockerfile
* Rename workflow
* Add formatting & linting
* Add logic to do retries on failed tests + don't fail fast
* Minor nits
* Update integration tests + code changes to resolve comments regarding execution of process
* Formatting + Linting + Vet
* Add logic for recover and defer for executing VMWatch. Proper close and read of channel. Also only every 60 seconds
* fix integration tests
* Bump to v2.0.7
* revert unnecessary changes to schema.go
* Small fix to killVMWatch
* Fix logic for killing VMWatch
* v2.0.8 Added Support for dynamic EventsFolder directory from extension Handler Environment (#39)
* - moved handlerenv.go and seqno.go from "github.com/Azure/azure-docker-extension/pkg/vmextension"
  - Added EventsFolder with other missing parameters.
* - removed vmextension lib dependency from VMwatch and other Files.
  - Updates HandlerEnviroment.json test file.
  - Updated VMwatch Integration Tests.
* Bump to v2.0.8
* initial devcontainer changes

  changes:
  1. add devcontainer config
  2. add vscode build config
  3. add makefile target to set up the appropriate stuff in the container
  4. update some line endings and add gitattributes so scripts run
  5. fix what seems to be a bug in fake-waagent script as it doesn't work without this fix for me
* update binaries and config to latest
* Resource governance, heartbeat and dev container changes

  The main feature change here is the addition of resource governance for linux via cgroups. We discover the current cgroup and add a sub cgroup for our purposes (limiting cpu to 1% and memory to 40MB).

  I also added support for detecting a stuck vmwatch using the heartbeat file and implemented the same logic for restarts from the windows version (3 restarts per 3 hours).

  As part of the development of this, I added support for devcontainer execution so we can step through the code from a dev machine into either a WSL session or a linux vm with tools installed.

  I added integration tests to check process exit, OOM and cpu throttling.

  These changes required a few changes to the makefile and scripts.

  I also updated the vmwatch binaries and added a script to download the latest ones as well.

  I updated the govendor files using the tool it told me to run. I hope I did this right.
* feedback
* feedback
* Run 'go mod edit -go=1.18' to be consistent with linux extensions repo
* Run linting/formatting
* Fix merge nits to merge conflicts
* Fix app health handler.log directory path
* Change to applicationhealth-extension
* Mistakenly added two VMWatch substatus items
* Adding filtering for tests which can only run on a real linux host (not WSL or docker) continuing investigation...
* fix time from minutes to hours plus add makefile target to create zip file (for use in testing)
* feedback
* feedback
* add readme
* updated vmwatch version, config schema and commandline
* typo
* test fixes
* test fixes
* add helper script to upload binaries to storage
* change container name
* feedback
* feedback
* typo

---------

Co-authored-by: Frank Pang <[email protected]>
Co-authored-by: frank-pang-msft <[email protected]>
Co-authored-by: klugorosado <[email protected]>
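The restart rule mentioned in the resource governance commit (3 restarts per 3 hours) could be modeled as a sliding-window budget, as in the sketch below. The commit does not say whether the real implementation uses a sliding or a fixed window, and it does not give the heartbeat staleness threshold, so `restartBudget`, `heartbeatStale`, `epoch`, and the `maxAge` parameter are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// window and maxRestarts follow the "3 restarts per 3 hours" rule.
const (
	window      = 3 * time.Hour
	maxRestarts = 3
)

// epoch is an arbitrary reference time used only by the example.
var epoch = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

// restartBudget remembers recent restart times (hypothetical type).
type restartBudget struct {
	restarts []time.Time
}

func newRestartBudget() *restartBudget { return &restartBudget{} }

// allow reports whether another restart fits in the budget at time now,
// and records the restart when it does.
func (b *restartBudget) allow(now time.Time) bool {
	cutoff := now.Add(-window)
	kept := b.restarts[:0]
	for _, t := range b.restarts {
		if t.After(cutoff) { // drop restarts that fell out of the window
			kept = append(kept, t)
		}
	}
	b.restarts = kept
	if len(b.restarts) >= maxRestarts {
		return false
	}
	b.restarts = append(b.restarts, now)
	return true
}

// heartbeatStale treats the process as stuck when the heartbeat file has
// not been modified within maxAge (the threshold is an assumption).
func heartbeatStale(lastModified, now time.Time, maxAge time.Duration) bool {
	return now.Sub(lastModified) > maxAge
}

func main() {
	b := newRestartBudget()
	for i := 0; i < 4; i++ {
		fmt.Printf("restart %d allowed: %v\n", i+1, b.allow(epoch))
	}
	fmt.Println("stale:", heartbeatStale(epoch, epoch.Add(10*time.Minute), 5*time.Minute))
}
```

Passing `now` in explicitly, rather than calling `time.Now()` inside, keeps the budget easy to exercise in tests without waiting hours.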
* Initial checkpoint
* tweak tests
* tweak the scripts
  1. use nc for a tcp server instead of web server for simplicity
  2. add the variables to control tolerating the failure assignment to cgroup to allow tests to run
  3. add new test for the case where it fails
* feedback
* feedback
* feedback
* feedback
Add debug flag when running vmwatch
adding more properties to pass down to vmwatch
removing data-type key from json tag for url type, unmarshal is failing otherwise
adding a way to provide custom container to upload artifact
* Added --apphealth-version flag to VMWatch with AppHealth version from manifest.xml
* - Validated Extension Version on existing VMWatch.
  - Created bash function to extract Version from manifest.xml.
  - GetExtensionManifestVersion now first attempts to get Extension Version from Version passed at build time and uses manifest.xml file as fallback.
updating vmwatch version to 1.0.8
tweaking settings based on findings in sql vms
Dev/dpoole/tweak resource governance
bump version to 2.0.10
* Adding internal/manifest package from Cross-Platform AppHealth Feature Branch
* Running go mod tidy and go mod vendor
* - Add manifest.xml to Extension folder
  - Changed Github workflow go version to Go 1.18
  - Small refactor in setup function for bats tests.
* Update Go version to 1.18 in Dockerfile
* Add logging package with NopLogger implementation
* Add telemetry package for logging events
* Add telemetry event Logging to main.go
* - Add new String() methods to vmWatchSignalFilters and vmWatchSettings structs
  - Add telemetry event Logging to handlersettings.go
* Add telemetry event Logging to reportstatus.go
* Add telemetry event Logging to health.go
* Refactor install handler in main/cmds.go to use telemetry event logging
* Refactor uninstall handler in main/cmds.go to use telemetry event logging
* Refactor enable handler function in main/cmds.go to use telemetry event logging
* Refactor vmWatch.go to use telemetry event logging
* Fix requestPath in extension-settings.json and updated 2 integration tests, one in 2_handler-commands.bats and another in 7_vmwatch.bats
* ran go mod tidy && go mod vendor
* Update ExtensionManifest version to 2.0.9 on UT
* Refactor telemetry event sender to use EventLevel constants in main/telemetry.go
* Refactor telemetry event sender to use EventTasks constants that match with existing Windows Telemetry
* Update logging messages in 7_vmwatch.bats
* Moved telemetry.go to its package in internal/telemetry
* Update Go version to 1.22 in Dockerfile, go.yml, go.mod, and go.sum
* Update ExtensionManifest version to 2.0.9 on UT
* Add NopLogger documentation to pkg/logging/logging.go
* Added Documentation to Telemetry Pkg
* - Added a Wrapper to HandlerEnviroment to add Additional functionality like the String() func
  - Added String() func to handlersettings struct, publicSettings struct, vmWatchSettings struct and vmWatchSignalFilters struct
  - Added Telemetry Event for HandlerSettings, and for HandlerEnviroment
* - Updated HandlerEnviroment String to use MarshalIndent Function.
  - Updated HandlerSettings struct String() func to use MarshalIndent
  - Fixed Failing UTs due to nil pointer in Embedded Struct inside HandlerEnviroment.
* Updated vmWatchSetting String Func to use MarshalIndent
* Update ExtensionManifest version to 2.0.10 on Failing UT
* removed duplicated UT
* Removed String() func from VMWatchSignalFilters, publicSettings and protectedSettings
chore: update the latest vmwatch binaries (1.1.1)
…nly (#68)
* Removed Noise Telemetry Events, and more details on error log.
* - Created new CustomMetricsStatusType
  - CustomMetrics will now be reported only when there is a Change in the CustomMetric Field.
  - Added commitedCustomMetricsState variable to keep track of the last CustomMetric Value.
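The change-only reporting of custom metrics described in this commit amounts to remembering the last committed value and comparing against it. The sketch below illustrates that idea only; `customMetricsReporter` and `shouldReport` are hypothetical names, and the real code tracks state in the `commitedCustomMetricsState` variable in a way not shown in this PR text.

```go
package main

import "fmt"

// customMetricsReporter remembers the last custom metrics value that was
// reported (hypothetical type; stands in for commitedCustomMetricsState).
type customMetricsReporter struct {
	committed    string
	hasCommitted bool
}

// shouldReport returns true only when current differs from the last
// committed value, and records current as committed when it does.
func (r *customMetricsReporter) shouldReport(current string) bool {
	if r.hasCommitted && r.committed == current {
		return false
	}
	r.committed, r.hasCommitted = current, true
	return true
}

func main() {
	r := &customMetricsReporter{}
	for _, m := range []string{`{"rps":10}`, `{"rps":10}`, `{"rps":12}`} {
		fmt.Printf("%s report=%v\n", m, r.shouldReport(m))
	}
}
```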
… version

We found when testing on some distros that they had older versions of systemd installed. Versions before 246 use `MemoryLimit` and later versions use `MemoryMax`, so we need to know which version is installed when constructing the command line. Older versions also did not support the `-E` flag for environment variables and instead use the longer form `--setenv`, which is supported in both old and new versions.
Change the command line used for systemd-run depending on the installed version
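One way to picture the version-dependent command line is the sketch below. It is an illustration under assumptions, not the extension's code: the function names are hypothetical, the version is parsed from a sample first line in the form `systemd <number> (...)` as printed by `systemctl --version`, and version 246 itself is treated as using `MemoryMax`, which the commit message leaves ambiguous.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseSystemdVersion extracts the major version from output such as
// "systemd 245 (245.4-4ubuntu3)".
func parseSystemdVersion(out string) (int, error) {
	fields := strings.Fields(out)
	if len(fields) < 2 || fields[0] != "systemd" {
		return 0, fmt.Errorf("unexpected version output: %q", out)
	}
	return strconv.Atoi(fields[1])
}

// memoryProperty picks the property name per the rule in the commit:
// MemoryLimit before 246, MemoryMax otherwise (246 itself is assumed).
func memoryProperty(version int) string {
	if version < 246 {
		return "MemoryLimit"
	}
	return "MemoryMax"
}

// systemdRunArgs builds arguments for systemd-run. It always uses the
// long --setenv form because that works on old and new versions alike.
func systemdRunArgs(version int, memoryBytes int64, env map[string]string, command []string) []string {
	args := []string{"-p", fmt.Sprintf("%s=%d", memoryProperty(version), memoryBytes)}
	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order for logging and tests
	for _, k := range keys {
		args = append(args, "--setenv="+k+"="+env[k])
	}
	return append(args, command...)
}

func main() {
	version, err := parseSystemdVersion("systemd 245 (245.4-4ubuntu3)")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	args := systemdRunArgs(version, 40*1024*1024,
		map[string]string{"SIGNAL_FOLDER": "/tmp/events"}, // example values
		[]string{"./vmwatch_linux_amd64"})
	fmt.Println("systemd-run", strings.Join(args, " "))
}
```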
klugorosado
approved these changes
May 6, 2024
Although the tests have been passing on the latest changes, there was a failure in testing last night.
When investigating I found the cause of the problem. When you call cmd.Execute("systemd-run") golang will (sometimes) replace it with the full path (in this case /usr/bin/systemd-run) and so our check for systemd-run mode was not working and it was going down the old code path of direct cgroup assignment.
Fixing by being explicit about it and returning a boolean indicating whether resource governance is required after the process is launched. This brings it back to the way it was in the previous PR iterations but avoids the objections raised there due to linux only concepts. When we converge the windows code here, the implementation of applyResourceGovernance will use Job objects on windows and the code flow will be the same.
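The path-rewriting behavior is visible with Go's standard `os/exec` package: `exec.Command` resolves a bare name through `LookPath` and stores the result in `cmd.Path`, while `cmd.Args[0]` keeps the name as given. The sketch below shows the general shape of returning an explicit boolean instead of inspecting the command afterwards. `buildVMWatchCommand` is a hypothetical helper, and the assumption that systemd-run mode needs no post-launch governance (because the limits are passed to systemd) is an inference from the comment, not confirmed code.

```go
package main

import (
	"fmt"
	"os/exec"
	"path/filepath"
)

// buildVMWatchCommand returns the command to launch plus an explicit flag
// saying whether resource governance must be applied after launch.
// Callers never need to inspect cmd.Path to infer the mode.
func buildVMWatchCommand(useSystemdRun bool, binary string, args ...string) (*exec.Cmd, bool) {
	if useSystemdRun {
		wrapped := append([]string{binary}, args...)
		// Assumed: systemd enforces the limits, so nothing to do later.
		return exec.Command("systemd-run", wrapped...), false
	}
	// Direct launch: limits (for example cgroup assignment) come afterwards.
	return exec.Command(binary, args...), true
}

func main() {
	cmd, governAfterLaunch := buildVMWatchCommand(true, "/path/to/vmwatch", "--config", "vmwatch.conf")
	// cmd.Path may be a full path when systemd-run is found on PATH,
	// which is why comparing it to the bare name is fragile.
	fmt.Println("cmd.Path:", cmd.Path)
	fmt.Println("cmd.Args[0]:", cmd.Args[0])
	fmt.Println("base name:", filepath.Base(cmd.Path))
	fmt.Println("govern after launch:", governAfterLaunch)
}
```

Nothing is executed here; `exec.Command` only builds the command, so the example is safe to run on machines without systemd.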
I have been unable to run the integration tests locally since upgrading my laptop. I worked with Kevin to figure out the issues and the tests are working now.

1. changing to build the test container using no-cache mode, since if you have an old bad version it would not get rebuilt.
1. changing the devcontainer config to force running amd64 rather than arm64
1. tweaking the scripts to handle the slightly different process names and ps output when running in this way.

Now the tests pass on mac.
* Adding codeql code scanning to repo
* Update .github/workflows/codeql.yml to use only ubuntu-latest for Go language build mode
* chore: Update GOPATH on codeql.yml
* Attempt to fix GOPATH
* debug
* debug
* chore: Update GO111MODULE
* chore: Update GOPATH and repo root path in codeql.yml
* revert
* adding more codeql queries
Some information present in the local logs is missing from the Kusto logs, which makes debugging difficult.

- Log specific command being executed at startup (install/enable/update/etc)
- Include extension sequence number and pid at startup for debugging from GuestAgent logs when extension logs are missing or seqNum.status file is missing
- Log overall status file so we have better debugging when VMExtensionProvisioning fails. This status is only sent when extension transitions between Transitioning -> Success/Error or whenever extension starts up.
- Update azure-extension-platform package to pull in change to increase precision of event timestamp to include milliseconds/nanoseconds. Previously it was RFC3339, which is in format yyyy-mm-ddThh:mm:ssZ and therefore cannot order events that fall within the same second. Azure/azure-extension-platform#34
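The sorting problem with second-precision timestamps is easy to reproduce with Go's standard `time` layouts. In this small sketch, two events half a second apart format to the same string under `time.RFC3339` but stay distinct under `time.RFC3339Nano`. The helper names and the sample times are made up for the example, and which exact layout the azure-extension-platform change adopted is not stated here.

```go
package main

import (
	"fmt"
	"time"
)

// Two hypothetical events that occur within the same second.
var (
	first  = time.Date(2024, 6, 11, 12, 0, 0, 1000000, time.UTC) // 12:00:00.001
	second = first.Add(500 * time.Millisecond)                    // 12:00:00.501
)

// secondPrecision formats with whole-second resolution.
func secondPrecision(t time.Time) string { return t.Format(time.RFC3339) }

// nanoPrecision keeps fractional seconds.
func nanoPrecision(t time.Time) string { return t.Format(time.RFC3339Nano) }

func main() {
	fmt.Println(secondPrecision(first), secondPrecision(second))
	fmt.Println(nanoPrecision(first), nanoPrecision(second))
}
```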
dpoole73
approved these changes
Jun 11, 2024
Overview
This PR combines changes from multiple pull requests into a feature branch that supports running VMWatch (amd64 and arm64) as an executable via goroutines and channels. In addition, a number of dev/debugging tools were included to improve developer productivity.
Behavior
VMWatch will run asynchronously as a separate process from ApplicationHealth, so the probing of application health will not be affected by the state of VMWatch. Depending on extension settings, VMWatch can be enabled or disabled, and the settings can also specify the test names and parameter overrides passed to the VMWatch binary. The status of VMWatch will be displayed in the extension status file and also in GET VM Instance View. The main process will manage the VMWatch process and communicate VMWatch status via the extension status file.
Process Leaks & Resource Governance
The main process enforces resource utilization limits for CPU and memory on VMWatch, and avoids process leaks by subscribing to shutdown/termination signals and killing VMWatch when they arrive.