Sync WindowsMigration Feature branch#90
Open
klugorosado wants to merge 51 commits into feature/v2/WindowsMigration from
Conversation
Changes by @dpoole73:
- Fix a bug where we were not using the global `vmWatchCommand` variable, so the SIGTERM handler was not killing anything.
- Set the `Pdeathsig` property on the command so SIGTERM is sent to the subprocess on parent process termination.

This fixes both issues. Before the fix, if we killed the app health process, the vmwatch process was always leaked. After the fix: `kill <pid>` -> log message "Received shutdown request" and vmwatch is killed; `kill -9 <pid>` -> no log message, vmwatch is killed.

Changes by @klugorosado:
- Added integration tests that kill AppHealth gracefully with SIGTERM and SIGINT, and validated VMWatch shutdown.
- Added integration tests that kill AppHealth forcibly with SIGKILL, and validated VMWatch shutdown.
- Added the capability for dev containers to run integration tests inside.
Background: Our tests had been running fine for a long time but suddenly started failing on specific OS versions. This was because the process (although initially associated with the correct cgroup that we created) gets moved back to the parent cgroup, which removes the limits. I did some research and reached out to various people and found that this has been seen before. When a process is started with systemd, you are not supposed to manage cgroups directly; systemd owns its own hierarchy and can manipulate things within it. The documentation says that you should not modify the cgroups within that slice hierarchy directly; instead, you should use `systemd-run` to launch processes. The GuestAgent folks saw very similar behavior, and switching to systemd-run resolved all their issues.

Changes: Changed the code to launch the vmwatch process using `systemd-run`. Using the `--scope` parameter makes the call wait until the vmwatch process completes, and the process id returned from the call is the actual process id of vmwatch. I have confirmed that killing vmwatch and killing the app health extension still has the same behavior (the PDeathSig integration is working fine), and the aurora tests are working fine with these changes.

NOTE: Because systemd-run is not available in docker containers, the code falls back to running the process directly and continues to use the old code path in that case. This should also cover any Linux distros that don't use systemd, where direct cgroup assignment should work fine.
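A rough sketch of the launch-path selection, assuming the approach described above (the unit property values and binary paths are illustrative, not the extension's actual configuration):

```go
package main

import (
	"fmt"
	"os/exec"
)

// vmwatchArgs builds the argv used to launch VMWatch. When systemd-run is
// available, the process is wrapped in a transient scope so that systemd
// owns the cgroup; --scope also makes systemd-run wait for the child, so
// the returned pid is vmwatch's actual pid. When systemd-run is absent
// (docker containers, non-systemd distros), VMWatch is run directly and
// the caller applies cgroup limits itself.
func vmwatchArgs(systemdRunPath, vmwatchPath string, vmwatchArgv []string) []string {
	if systemdRunPath == "" {
		// Fallback: old code path, direct launch + direct cgroup assignment.
		return append([]string{vmwatchPath}, vmwatchArgv...)
	}
	argv := []string{systemdRunPath, "--scope", "-p", "CPUQuota=5%", vmwatchPath}
	return append(argv, vmwatchArgv...)
}

func main() {
	path, err := exec.LookPath("systemd-run") // err means not installed
	if err != nil {
		path = ""
	}
	fmt.Println(vmwatchArgs(path, "/usr/bin/vmwatch", []string{"--memory-limit-bytes", "80000000"}))
}
```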
…message can get logged differently
I don't know why this passed before; clearly we kill the process when we fail to assign a cgroup, and I don't know why it would ever return a different message. With this fix the tests pass locally.
fix(systemd-run): Switch to use systemd-run instead of direct process and cgroup manipulation
Updating vmwatch binaries to 1.1.0 package
bump version to 2.0.10
* Adding internal/manifest package from Cross-Platform AppHealth Feature Branch
* Running go mod tidy and go mod vendor
* Add manifest.xml to Extension folder; changed GitHub workflow Go version to 1.18; small refactor in setup function for bats tests
* Update Go version to 1.18 in Dockerfile
* Add logging package with NopLogger implementation
* Add telemetry package for logging events
* Add telemetry event logging to main.go
* Add new String() methods to vmWatchSignalFilters and vmWatchSettings structs; add telemetry event logging to handlersettings.go
* Add telemetry event logging to reportstatus.go
* Add telemetry event logging to health.go
* Refactor install handler in main/cmds.go to use telemetry event logging
* Refactor uninstall handler in main/cmds.go to use telemetry event logging
* Refactor enable handler function in main/cmds.go to use telemetry event logging
* Refactor vmWatch.go to use telemetry event logging
* Fix requestPath in extension-settings.json and update two integration tests, one in 2_handler-commands.bats and another in 7_vmwatch.bats
* Ran go mod tidy && go mod vendor
* Update ExtensionManifest version to 2.0.9 on UT
* Refactor telemetry event sender to use EventLevel constants in main/telemetry.go
* Refactor telemetry event sender to use EventTasks constants that match the existing Windows telemetry
* Update logging messages in 7_vmwatch.bats
* Moved telemetry.go to its own package in internal/telemetry
* Update Go version to 1.22 in Dockerfile, go.yml, go.mod, and go.sum
* Update ExtensionManifest version to 2.0.9 on UT
* Add NopLogger documentation to pkg/logging/logging.go
* Added documentation to the telemetry package
* Added a wrapper to HandlerEnviroment to add additional functionality such as the String() func; added String() funcs to the handlersettings, publicSettings, vmWatchSettings, and vmWatchSignalFilters structs; added telemetry events for HandlerSettings and for HandlerEnviroment
* Updated HandlerEnviroment String() to use the MarshalIndent function; updated HandlerSettings struct String() func to use MarshalIndent; fixed failing UTs due to a nil pointer in the embedded struct inside HandlerEnviroment
* Updated vmWatchSettings String() func to use MarshalIndent
* Update ExtensionManifest version to 2.0.10 on failing UT
* Removed duplicated UT
* Removed String() func from VMWatchSignalFilters, publicSettings, and protectedSettings
chore: update the latest vmwatch binaries (1.1.1)
…nly (#68)

* Removed noisy telemetry events, and added more detail to the error log.
* Created a new CustomMetricsStatusType; CustomMetrics will now be reported only when there is a change in the CustomMetric field; added a commitedCustomMetricsState variable to keep track of the last CustomMetric value.
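The change-only reporting described above can be sketched like this — a committed-state field remembers the last reported value and suppresses events when nothing changed (type and field names here are illustrative, not the PR's exact code):

```go
package main

import "fmt"

// customMetricsReporter tracks the last CustomMetrics value that was
// actually reported, so repeated identical values do not generate
// redundant telemetry events.
type customMetricsReporter struct {
	committed string // last reported CustomMetrics value
}

// report sends the current value only when it differs from the committed
// state; it returns whether an event was emitted.
func (r *customMetricsReporter) report(current string, send func(string)) bool {
	if current == r.committed {
		return false // unchanged: suppress the noisy event
	}
	r.committed = current
	send(current)
	return true
}

func main() {
	r := &customMetricsReporter{}
	send := func(v string) { fmt.Println("reporting custom metrics:", v) }
	r.report(`{"status":"ok"}`, send)       // first value: reported
	r.report(`{"status":"ok"}`, send)       // unchanged: suppressed
	r.report(`{"status":"degraded"}`, send) // changed: reported
}
```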
… version

We found when testing on some distros that they had older versions of systemd installed. Versions before 246 use `MemoryLimit`, and later versions use `MemoryMax`, so we need to know which version we have when constructing the command line. Also, older versions didn't support the `-E` flag for environment variables and instead use the longer form `--setenv`; that flag is supported in both old and new versions.
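The version-dependent flag choice above amounts to a small helper like this (a sketch; the environment-variable name in the demo is illustrative):

```go
package main

import "fmt"

// memoryProperty returns the systemd resource property name for the memory
// limit: systemd versions before 246 spell it MemoryLimit, 246 and later
// spell it MemoryMax.
func memoryProperty(systemdVersion int) string {
	if systemdVersion < 246 {
		return "MemoryLimit"
	}
	return "MemoryMax"
}

// envFlag builds an environment-variable argument using --setenv, the long
// form supported by both old and new systemd versions (unlike -E).
func envFlag(name, value string) string {
	return fmt.Sprintf("--setenv=%s=%s", name, value)
}

func main() {
	fmt.Println(memoryProperty(245), memoryProperty(246)) // MemoryLimit MemoryMax
	fmt.Println(envFlag("SIGNAL_FOLDER", "/var/log/azure")) // name/path illustrative
}
```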
Change the command line used for systemd-run depending on the installed version
Although the tests have been passing on the latest changes, there was a failure in testing last night.
When investigating I found the cause of the problem. When you call cmd.Execute("systemd-run"), Go will (sometimes) replace it with the full path (in this case /usr/bin/systemd-run), so our check for systemd-run mode was not working and it was going down the old code path of direct cgroup assignment.
Fixing by being explicit about it and returning a boolean indicating whether resource governance is required after the process is launched. This brings it back to the way it was in the previous PR iterations but avoids the objections raised there due to Linux-only concepts. When we converge the Windows code here, the implementation of applyResourceGovernance will use Job objects on Windows and the code flow will be the same.
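The explicit-boolean approach can be sketched as below. The function and parameter names are illustrative; the point is that the decision is carried as a flag rather than re-derived by inspecting the command path, which Go may rewrite to an absolute path:

```go
package main

import "fmt"

// launch starts the child via the supplied start function and returns
// whether the caller must still apply resource governance afterwards
// (direct cgroup assignment on Linux today; Job objects on Windows once
// the code converges). When systemd-run is used, systemd owns the cgroup,
// so no further governance is needed.
func launch(useSystemdRun bool, start func() (int, error)) (pid int, applyGovernance bool, err error) {
	pid, err = start()
	if err != nil {
		return 0, false, err
	}
	// Explicit flag instead of string-matching on cmd.Path, which Go may
	// have rewritten to e.g. /usr/bin/systemd-run.
	return pid, !useSystemdRun, nil
}

func main() {
	pid, govern, _ := launch(true, func() (int, error) { return 4242, nil })
	fmt.Println(pid, govern) // systemd-run path: no direct governance required
}
```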
* Adding CodeQL code scanning to repo
* Update .github/workflows/codeql.yml to use only ubuntu-latest for Go language build mode
* chore: Update GOPATH on codeql.yml
* Attempt to fix GOPATH
* debug
* debug
* chore: Update GO111MODULE
* chore: Update GOPATH and repo root path in codeql.yml
* revert
* adding more CodeQL queries
I have been unable to run the integration tests locally since upgrading my laptop. I worked with Kevin to figure out the issues, and the tests are working now.

1. Changed to build the test container in no-cache mode, since an old bad version would otherwise not get rebuilt.
2. Changed the devcontainer config to force running amd64 rather than arm64.
3. Tweaked the scripts to handle the slightly different process names and ps output when running this way.

Now the tests pass on Mac.
Some info that is present in the local logs is missing from the Kusto logs, which makes debugging difficult.

- Log the specific command being executed at startup (install/enable/update/etc).
- Include the extension sequence number and pid at startup, for debugging from GuestAgent logs when extension logs are missing or the seqNum.status file is missing.
- Log the overall status file so we have better debugging when VMExtensionProvisioning fails. This status is only sent when the extension transitions between Transitioning -> Success/Error, or whenever the extension starts up.
- Update the azure-extension-platform package to pull in a change that increases the precision of event timestamps to include milliseconds/nanoseconds. Previously it was RFC3339, which is in the format yyyy-mm-ddThh:mm:ssZ and causes issues when sorting timestamps. Azure/azure-extension-platform#34
## Overview

This PR merges changes from multiple pull requests into a feature branch that will support running VMWatch (amd64 and arm64) as an executable, managed via goroutines and channels. In addition, a number of dev/debugging tools were included to improve developer productivity.

> VMWatch is a standardized, lightweight, and open-sourced testing framework designed to enhance the monitoring and management of guest VMs on the Azure platform, including both 1P and 3P instances. VMWatch is engineered to collect vital health signals across multiple dimensions, which will be seamlessly integrated into Azure's quality systems. By leveraging these signals, VMWatch will enable Azure to swiftly detect and prevent regressions induced by platform updates or configuration changes, identify gaps in platform telemetry, and ultimately improve the guest experience for all Azure customers.

## Behavior

VMWatch runs asynchronously as a separate process from ApplicationHealth, so the probing of application health is not affected by the state of VMWatch. Depending on extension settings, VMWatch can be enabled/disabled, and the settings can also specify test names and parameter overrides for the VMWatch binary. The status of VMWatch is displayed in the extension status file and in GET VM Instance View. The main process manages the VMWatch process and communicates VMWatch status via the extension status file.

## Process Leaks & Resource Governance

The main process enforces resource utilization limits for CPU and memory, and avoids process leaks by subscribing to shutdown/termination signals.
Bumps google.golang.org/protobuf from 1.27.1 to 1.33.0. --- updated-dependencies: - dependency-name: google.golang.org/protobuf dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]>
We discovered a timing issue in the script which can cause it to fail, resulting in settings-update timeouts. Explanation:

1. `kill_existing_apphealth_processes` checks for app health running and sees that it is running.
2. It kills the process using `pkill -f`; this succeeds.
3. Killing app health causes vmwatch to be sent a kill signal.
4. `kill_existing_vmwatch_processes` checks for vmwatch running and sees it is there, because it hasn't quite reacted to the kill signal yet.
5. It tries to kill vmwatch using `pkill -f`, but the process has already gone, so the command fails; because the script runs with `set -e`, the script fails immediately.

The fix: add `|| true` to the command so that failures are ignored. If it actually failed to kill the process for some reason, the script will still poll and fall back to `pkill -9`, so there is no change in behavior in the case of a real issue killing the process; this just fixes a timing issue.
…he new account (#86)

We migrated to a new account, so updating the script.
We investigated this at length and cannot figure out what could possibly be happening based on the logs. Logging the fact that kill failed should help us at least narrow it down further. Also adding a sleep between killing app health and vmwatch to reduce the chance of any races (vmwatch should die naturally when app health is killed, but it may not happen immediately, so giving it a second to respond should reduce the number of times it needs to be killed independently).
… extension Sequence Number (#83)

This pull request includes changes to sequence number management and testing in the `main` and `internal/seqno` packages. The most important changes are the creation of a new `SequenceNumberManager` interface and `SeqNumManager` struct, the addition of a function that checks whether a sequence number has already been processed before enabling it, and tests for the new function.

New sequence number management:

* [`internal/seqno/seqno.go`](diffhunk://#diff-f671e4abbca7ae7b738bc8ef287fcbf3995062b2cc5e54ad666e3fa6f1b674dcR1-R101): Created a new `SequenceNumberManager` interface and `SeqNumManager` struct to manage sequence numbers. The `SeqNumManager` struct includes functions to get and set sequence numbers, and to find a sequence number from either the environment variable or the most recently used file under the config folder.

Changes to the `main` package:

* [`main/cmds.go`](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373L32-R32): A new function `enablePre` has been added. This function, acting as the PreFunc for the enable command, verifies that the sequence number is ready for processing by comparing it with the last executed number from the `mrseq` file. This ensures orderly processing of sequence numbers.
* [`main/main.go`](diffhunk://#diff-327181d0a8c5e6b164561d7910f4eeffd41442d55b2a2788fda2aa2692f17ec0L64-R68): Replaced the `FindSeqNum` function with `seqnoManager.FindSeqNum` to find the sequence number.
* [`main/seqnum.go`](diffhunk://#diff-171d8d31093fac5a89b9bbe034fe628faf47dd12fad91b3205433ca95c56be52L1-L32): Removed the `FindSeqNum` function, as it has been replaced by `seqnoManager.FindSeqNum`.

New tests:

* [`main/cmds_test.go`](diffhunk://#diff-bdb35e68cc43b04f7c8b572233a1472169052b84e0b471c6fe578fe049784223R36-R133): Added tests for `enablePre`.
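The `enablePre` guard described above boils down to a comparison against the last executed sequence number persisted in the `mrseq` file. A minimal sketch, with method names that are illustrative rather than copied from the PR:

```go
package main

import "fmt"

// SequenceNumberManager follows the interface named in the PR description;
// the method set shown here is an assumption for illustration.
type SequenceNumberManager interface {
	GetSequenceNumber() (uint, error)
	SetSequenceNumber(seqNum uint) error
}

// alreadyProcessed sketches the enablePre check: a sequence number is
// skipped when it is not newer than the last executed number (e.g. the
// value read from the mrseq file), ensuring orderly processing.
func alreadyProcessed(seqNum, lastExecuted uint) bool {
	return seqNum <= lastExecuted
}

func main() {
	fmt.Println(alreadyProcessed(3, 5)) // stale settings: skip
	fmt.Println(alreadyProcessed(6, 5)) // new settings: process
}
```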
--------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…gger (#88)

This pull request refactors the usage of our telemetry package and changes the logger from the go-kit logger. These changes close the gap between v2 in the master branch and the consolidated code base in feature/v2/WindowsMigration.

Important changes:

- Created a new slog handler that changes the msg tag to event and moves it to the end of the log message.
- The new Telemetry struct is now a singleton instead of a global function.

### AI Updates

Updates to telemetry:

* [`internal/telemetry/telemetry.go`](diffhunk://#diff-03a209e90e40142dadb464e0a169c11dae5605db9eaf53106cdac0c56c235b38L4-R21): The telemetry package was updated with several changes. The `EventLevel` and `EventTask` constants were renamed for clarity, and the `TelemetryEventSender` struct was replaced with a `Telemetry` struct. Several methods were also changed, and new error variables were introduced for better error handling. [[1]](diffhunk://#diff-03a209e90e40142dadb464e0a169c11dae5605db9eaf53106cdac0c56c235b38L4-R21) [[2]](diffhunk://#diff-03a209e90e40142dadb464e0a169c11dae5605db9eaf53106cdac0c56c235b38L34-R121)
* [`main/cmds.go`](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373R5-R16): Logging in `main/cmds.go` was updated to use the `slog` package instead of the `log` package. This included changing the type of the `lg` variable in several functions and updating the telemetry calls to use the new `SendEvent` method from the updated `telemetry` package. [[1]](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373R5-R16) [[2]](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373L44-R68) [[3]](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373L80-R88) [[4]](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373L108-R115) [[5]](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373L137-R142) [[6]](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373L173-R173) [[7]](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373L182-R202) [[8]](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373L225-R225) [[9]](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373L238-R238)

Updates to logging:

* [`main/handlersettings.go`](diffhunk://#diff-f8ae33e4c69620dbc2523794f5240aa34ad618e11e155fec37c03a0c2e8b2b8cR6-L11): Logging in `handlersettings.go` was updated to use the `slog` package instead of the `log` package. This included updating the telemetry calls to use the new `SendEvent` method from the updated `telemetry` package. [[1]](diffhunk://#diff-f8ae33e4c69620dbc2523794f5240aa34ad618e11e155fec37c03a0c2e8b2b8cR6-L11) [[2]](diffhunk://#diff-f8ae33e4c69620dbc2523794f5240aa34ad618e11e155fec37c03a0c2e8b2b8cL139-R162)
* [`main/health.go`](diffhunk://#diff-2422cb6e5f570a2a3eeb5388f7e0fcc644727dfd0d34911de35e83a268f1d2efR8): Logging in `health.go` was updated to use the `slog` package instead of the `log` package. This included updating the `evaluate` method in the `HealthProbe` interface. [[1]](diffhunk://#diff-2422cb6e5f570a2a3eeb5388f7e0fcc644727dfd0d34911de35e83a268f1d2efR8) [[2]](diffhunk://#diff-2422cb6e5f570a2a3eeb5388f7e0fcc644727dfd0d34911de35e83a268f1d2efL16) [[3]](diffhunk://#diff-2422cb6e5f570a2a3eeb5388f7e0fcc644727dfd0d34911de35e83a268f1d2efL63-R63)

Addition of a new devcontainer run configuration:

* [`.vscode/launch.json`](diffhunk://#diff-bd5430ee7c51dc892a67b3f2829d1f5b6d223f0fd48b82322cfd45baf9f5e945R18-R29): A new devcontainer run configuration named "devcontainer run - uninstall" was added. This configuration runs the "uninstall" command in the `applicationhealth-extension` program.
This pull request includes several updates to the development environment, dependency management, and testing setup. The most significant changes involve updating the Dockerfile, enhancing the devcontainer configuration, and improving the integration tests.
Development Environment Updates:

- Updated `postCreateCommand` to include `go mod download`. [1] [2]

Dependency Management:

- Dependency changes involving `github.com/go-kit/log` and `go.uber.org/mock`.

Testing Enhancements:

Minor Changes:

- Ensured `.bash` files use Unix line endings.

These changes collectively improve the development workflow, enhance the testing infrastructure, and ensure compatibility with the latest tool versions.