From 2854b2ff633ea06e6f7ccb45482f536f187b29a7 Mon Sep 17 00:00:00 2001 From: Yutong Zhang Date: Thu, 26 Sep 2024 15:42:14 +0800 Subject: [PATCH 1/5] HLD of simplified PR test --- .azure-pipelines/simplified_PR_test/README.md | 66 +++++++++++++++++++ 1 file changed, 66 insertions(+) create mode 100644 .azure-pipelines/simplified_PR_test/README.md diff --git a/.azure-pipelines/simplified_PR_test/README.md b/.azure-pipelines/simplified_PR_test/README.md new file mode 100644 index 00000000000..1d9095e6808 --- /dev/null +++ b/.azure-pipelines/simplified_PR_test/README.md @@ -0,0 +1,66 @@ +## Background +In the current PR test process, we run a fixed set of test scripts regardless of the scope of changes, +leading to unnecessary resource consumption. +However, in the sonic-mgmt repository, +it's sufficient to run only the relevant test scripts to validate the changes. + +To optimize this, we propose a simplified PR test +that runs only the necessary test scripts located in the same folder as the modified files, +which reduces both time and cost efficiently. + + +## Design +Our new simplified PR test will follow below principles: +- If changes are made only to the scripts within the features folder, +we will run only the specific scripts in those feature folders. +- If our change related to the common folder, we will run all test scripts, +which is same as our previous PR test. + +In our new PR test, we will have multiple PR checkers classified by topology type. +To collect all required scripts for each PR checker, which means, +these scripts should not only within the scope that we changed, but also meet the requirement of topology. + +Because the number of scripts per test is variable, +the instances used by Elastictest will also be automatically scheduled concurrently, + + +### To meet the requirement of topology +One approach to achieve this is by using the `--topology` parameter supported by pytest. +It compares against the topology marked with `pytest.mark.topology` in script, +and if the mark matches, the script is deemed necessary. +However, this method triggers pytest's collection process for each script, +leading to unnecessary time consumption. + +Another approach is to collect and analyze all scripts before execution. +Each script includes the `pytest.mark.topology` marker to indicate the applicable topology it can run on. +We will perform a global scan of all test scripts to identify this marker and extract its value, +which represents the topology type compatible with the script. +After determining the valid topology for each script, +we will group them accordingly and maintain a set of test scripts for each topology. +Then, in each PR checker, we will select relevant scripts in the set within the change scope. +This method eliminates unnecessary processes by executing only the on-demand scripts, +resulting in reduced running time. + + +### To schedule instances automatically +Our goal is to complete the entire PR test within 2 hours. +In the current PR test, each checker uses a fixed number of instances. +However, in the new simplified test, the number of scripts executed varies, +so the number of instances should be dynamically adjusted for cost efficiency. +The number of instances we allocate will be determined by the total estimated execution time of the scripts that need to be run. +We can leverage historical data to obtain the average running time of each script from previous test executions. + +We now have a Kusto table that logs details about the execution of test cases, +including the running time, date, results, and more. +To determine the preset running time for each test script, +we will calculate the average running time of successful runs over the past three days. +If no relevant records are found in Kusto, a default value will be used for the preset running time. +This approach allows us to estimate the total execution time for our scripts accurately. + +Using this information, we will evenly distribute the scripts across instances, +ensuring that the workload is balanced of each instance. +Ideally, each instance will run its assigned scripts in approximately 1.5 hours, +leaving additional time for tasks such as testbed preparation and keeping the total runtime within 2 hours. + +## Benefits +This new simplified PR test will run on demand, reducing both time and cost efficiently. From af1ec65b8030630ab7ca5a0e7fe529dbfd9854a4 Mon Sep 17 00:00:00 2001 From: Yutong Zhang Date: Tue, 15 Oct 2024 14:18:00 +0800 Subject: [PATCH 2/5] pre-commit fix --- .azure-pipelines/simplified_PR_test/README.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/.azure-pipelines/simplified_PR_test/README.md b/.azure-pipelines/simplified_PR_test/README.md index 1d9095e6808..8d33e2b8ec5 100644 --- a/.azure-pipelines/simplified_PR_test/README.md +++ b/.azure-pipelines/simplified_PR_test/README.md @@ -50,11 +50,11 @@ so the number of instances should be dynamically adjusted for cost efficiency. The number of instances we allocate will be determined by the total estimated execution time of the scripts that need to be run. We can leverage historical data to obtain the average running time of each script from previous test executions. -We now have a Kusto table that logs details about the execution of test cases, -including the running time, date, results, and more. -To determine the preset running time for each test script, -we will calculate the average running time of successful runs over the past three days. -If no relevant records are found in Kusto, a default value will be used for the preset running time. +We now have a Kusto table that logs details about the execution of test cases, +including the running time, date, results, and more. +To determine the preset running time for each test script, +we will calculate the average running time of successful runs over the past three days. +If no relevant records are found in Kusto, a default value will be used for the preset running time. This approach allows us to estimate the total execution time for our scripts accurately. Using this information, we will evenly distribute the scripts across instances, From 05c7a35313b9fd62de3bcd36b45b4e0a38251c60 Mon Sep 17 00:00:00 2001 From: Yutong Zhang Date: Fri, 18 Oct 2024 15:13:29 +0800 Subject: [PATCH 3/5] Update design doc --- .azure-pipelines/simplified_PR_test/README.md | 95 ++++++++++++------- 1 file changed, 61 insertions(+), 34 deletions(-) diff --git a/.azure-pipelines/simplified_PR_test/README.md b/.azure-pipelines/simplified_PR_test/README.md index 8d33e2b8ec5..13e8da71ee6 100644 --- a/.azure-pipelines/simplified_PR_test/README.md +++ b/.azure-pipelines/simplified_PR_test/README.md @@ -1,48 +1,69 @@ +## Terminology +- Instance: A KVM running on an Azure VMSS used for executing PR tests. + Each instance corresponds to a specific topology and is utilized to perform the tests. +- Scope: The set of test scripts that are executed during the PR test. + ## Background -In the current PR test process, we run a fixed set of test scripts regardless of the scope of changes, -leading to unnecessary resource consumption. -However, in the sonic-mgmt repository, -it's sufficient to run only the relevant test scripts to validate the changes. +In current PR testing process, a fixed set of test scripts is executed regardless of the change scope. +With approximately 440 test scripts running, the process has become excessively large. +Due to the maximum execution time limit, more instances are needed to run the tests in parallel. +For example, to meet this requirement, we need 20 instances for t0 and 25 instances for t1. +The cost per PR has reached $35, which is considerably high. +To address these issues, we propose a new PR testing model called 'Impacted Area-Based PR Testing. + +## Preparation +We can organization the test scripts in this way: +```buildoutcfg +sonic-mgmgt + | + | - tests + | + | - common ---------- shared + | - arp -----| + | - ecmp | --- features + | - vlan | + | - ...... -----| +``` +Within the tests directory in sonic-mgmt, we categorize scripts into two sections: shared and features. +Scripts in the common folder fall under the shared section and can be utilized across different folders. +In contrast, scripts in other folders belong to the features section, representing specific functionalities such as arp, ecmp, and vlan, +and are intended for use within their respective folders. +This hierarchy helps us more effectively identify the impacted areas for the new PR testing process. -To optimize this, we propose a simplified PR test -that runs only the necessary test scripts located in the same folder as the modified files, -which reduces both time and cost efficiently. +However, the previous code had numerous cross-feature dependencies. +To achieve our goal, we carried out some preparatory work by eliminating these cross-feature dependencies. ## Design -Our new simplified PR test will follow below principles: -- If changes are made only to the scripts within the features folder, -we will run only the specific scripts in those feature folders. -- If our change related to the common folder, we will run all test scripts, -which is same as our previous PR test. +### Impcated Area +We introduce a new term called `impacted area`, which represents the scope of PR testing. +This term can be defined as follows: +- If changes are made solely to the scripts within the feature folders, + the impacted area is considered to be those specific feature folders. +- If changes occur in the common folder, + the impacted area encompasses both common and all feature folders. +We can determine the impcated area using command `git diff`. +### Distribute scripts to PR checkers In our new PR test, we will have multiple PR checkers classified by topology type. -To collect all required scripts for each PR checker, which means, +To distribute all required scripts for each PR checker, which means, these scripts should not only within the scope that we changed, but also meet the requirement of topology. -Because the number of scripts per test is variable, -the instances used by Elastictest will also be automatically scheduled concurrently, - - -### To meet the requirement of topology -One approach to achieve this is by using the `--topology` parameter supported by pytest. +We can suggest two approaches to achieve this: +- One approach is by using the `--topology` parameter supported by pytest. It compares against the topology marked with `pytest.mark.topology` in script, and if the mark matches, the script is deemed necessary. However, this method triggers pytest's collection process for each script, -leading to unnecessary time consumption. +leading to unnecessary time consumption, which is not expected. -Another approach is to collect and analyze all scripts before execution. +- Another approach is to collect and analyze all scripts before execution. Each script includes the `pytest.mark.topology` marker to indicate the applicable topology it can run on. -We will perform a global scan of all test scripts to identify this marker and extract its value, +We will perform a global scan of all test scripts in the impacted area to identify this marker and extract its value, which represents the topology type compatible with the script. -After determining the valid topology for each script, -we will group them accordingly and maintain a set of test scripts for each topology. -Then, in each PR checker, we will select relevant scripts in the set within the change scope. -This method eliminates unnecessary processes by executing only the on-demand scripts, -resulting in reduced running time. - +After determining the valid topology for each script, we can distribute the script to corresponding PR checkers. +This method eliminates unnecessary processes by executing only the on-demand scripts, resulting in reduced running time. -### To schedule instances automatically +### Allocate instances dynamically Our goal is to complete the entire PR test within 2 hours. In the current PR test, each checker uses a fixed number of instances. However, in the new simplified test, the number of scripts executed varies, @@ -53,14 +74,20 @@ We can leverage historical data to obtain the average running time of each scrip We now have a Kusto table that logs details about the execution of test cases, including the running time, date, results, and more. To determine the preset running time for each test script, -we will calculate the average running time of successful runs over the past three days. -If no relevant records are found in Kusto, a default value will be used for the preset running time. +we will calculate the average running time of the latest five run times. +If no relevant records are found in Kusto, a default value(1800s per script) will be used for the preset running time. This approach allows us to estimate the total execution time for our scripts accurately. Using this information, we will evenly distribute the scripts across instances, ensuring that the workload is balanced of each instance. Ideally, each instance will run its assigned scripts in approximately 1.5 hours, -leaving additional time for tasks such as testbed preparation and keeping the total runtime within 2 hours. +leaving additional time for tasks such as testbed preparation and clean-up and keeping the total runtime within 2 hours. + +## Advantages +Impacted area based PR testing runs test scripts on demand, reducing the overall scale of the PR test and saving execution time. +Additionally, instances will be allocated as needed, resulting in more cost-efficient resource usage. -## Benefits -This new simplified PR test will run on demand, reducing both time and cost efficiently. +## Safeguard +As impacted area based PR testing would not cover all test scripts, +we need a safeguard to run all test scripts daily to prevent any unforeseen issues. +Fortunately, we have Baseline test to do so. \ No newline at end of file From e165d7deea59c8efd00b7e6b3a67512ac97adc61 Mon Sep 17 00:00:00 2001 From: Yutong Zhang Date: Fri, 18 Oct 2024 15:58:38 +0800 Subject: [PATCH 4/5] modify --- .azure-pipelines/simplified_PR_test/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.azure-pipelines/simplified_PR_test/README.md b/.azure-pipelines/simplified_PR_test/README.md index 13e8da71ee6..3c5b22ea75a 100644 --- a/.azure-pipelines/simplified_PR_test/README.md +++ b/.azure-pipelines/simplified_PR_test/README.md @@ -13,7 +13,7 @@ To address these issues, we propose a new PR testing model called 'Impacted Area ## Preparation We can organization the test scripts in this way: -```buildoutcfg +``` sonic-mgmgt | | - tests From 5c8b745b4126f7334b98532cc96a5cbea029b9e8 Mon Sep 17 00:00:00 2001 From: Yutong Zhang Date: Tue, 10 Dec 2024 14:35:24 +0800 Subject: [PATCH 5/5] Update design doc --- .azure-pipelines/simplified_PR_test/README.md | 83 +++++++++++-------- 1 file changed, 48 insertions(+), 35 deletions(-) diff --git a/.azure-pipelines/simplified_PR_test/README.md b/.azure-pipelines/simplified_PR_test/README.md index 3c5b22ea75a..02c737d89b0 100644 --- a/.azure-pipelines/simplified_PR_test/README.md +++ b/.azure-pipelines/simplified_PR_test/README.md @@ -1,47 +1,58 @@ -## Terminology -- Instance: A KVM running on an Azure VMSS used for executing PR tests. - Each instance corresponds to a specific topology and is utilized to perform the tests. -- Scope: The set of test scripts that are executed during the PR test. - ## Background In current PR testing process, a fixed set of test scripts is executed regardless of the change scope. -With approximately 440 test scripts running, the process has become excessively large. +This approach lacks flexibility. On the one hand, if changes are only related to a few lines of codebase, +we may don't need to run the whole scope. On the other hand, if there are new added test scripts, +we need to add them manually. + +With approximately 570 test scripts running, the process has become excessively large and the runtime increased significantly. Due to the maximum execution time limit, more instances are needed to run the tests in parallel. For example, to meet this requirement, we need 20 instances for t0 and 25 instances for t1. -The cost per PR has reached $35, which is considerably high. +The cost per PR has reached $35, and we will use $23,000 per month to run PR testing, which is considerably high. + To address these issues, we propose a new PR testing model called 'Impacted Area-Based PR Testing. -## Preparation -We can organization the test scripts in this way: +## Preparation +We can organize the codebase in this way: ``` sonic-mgmgt - | + | - .azure-pipelines + | - ansible + | - docs + | - ...... | - tests - | - | - common ---------- shared + | + | - common ---------- shared | - arp -----| | - ecmp | --- features | - vlan | | - ...... -----| ``` -Within the tests directory in sonic-mgmt, we categorize scripts into two sections: shared and features. -Scripts in the common folder fall under the shared section and can be utilized across different folders. -In contrast, scripts in other folders belong to the features section, representing specific functionalities such as arp, ecmp, and vlan, -and are intended for use within their respective folders. +Under sonic-mgmt, there are several top-level folders such as `.azure-pipelines`, `ansible`, `docs`, `tests`, and more. +Except for the `tests` folder, we classify all other folders as part of the shared section of the repo. + +Within the `tests` folder, there are multiple second-level directories. +Among them, the common folder is also considered part of the shared section. +Other folders, such as `arp`, `ecmp`, and similar directories, are classified as feature-specific parts. + +Scripts in the common folder fall under the shared section and can be utilized across different folders. +In contrast, scripts in other folders belong to the features section, representing specific functionalities such as arp, ecmp, and vlan, +and are intended for use within their respective folders. This hierarchy helps us more effectively identify the impacted areas for the new PR testing process. -However, the previous code had numerous cross-feature dependencies. +However, the previous code had numerous cross-feature dependencies. To achieve our goal, we carried out some preparatory work by eliminating these cross-feature dependencies. ## Design ### Impcated Area -We introduce a new term called `impacted area`, which represents the scope of PR testing. -This term can be defined as follows: -- If changes are made solely to the scripts within the feature folders, - the impacted area is considered to be those specific feature folders. -- If changes occur in the common folder, - the impacted area encompasses both common and all feature folders. +To take advantage of such code structure, we introduce a new term called `impacted area`, which represents the scope of PR testing. +The `impacted area` can be defined by specific features, so that we can narrow down the scope into folders. + +This term can be elaborated as follows: +- If the changes are confined to a specific feature folder, we can narrow the scope of testing to only include files within that folder. +As files in other feature folders remain unaffected and do not require testing. +- If the changes affect the common components, we cannot narrow the testing scope and must run all test scripts to ensure comprehensive coverage, as they are commonly used by other features. + We can determine the impcated area using command `git diff`. ### Distribute scripts to PR checkers @@ -63,13 +74,13 @@ which represents the topology type compatible with the script. After determining the valid topology for each script, we can distribute the script to corresponding PR checkers. This method eliminates unnecessary processes by executing only the on-demand scripts, resulting in reduced running time. -### Allocate instances dynamically -Our goal is to complete the entire PR test within 2 hours. -In the current PR test, each checker uses a fixed number of instances. -However, in the new simplified test, the number of scripts executed varies, -so the number of instances should be dynamically adjusted for cost efficiency. -The number of instances we allocate will be determined by the total estimated execution time of the scripts that need to be run. -We can leverage historical data to obtain the average running time of each script from previous test executions. +### Implement dynamic instances +Since the scope of PR testing is dynamic and determined by the impacted area, +the number of instances required also needs to be dynamic to ensure cost efficiency. +To achieve this, we must accurately estimate the total execution time in advance, +allowing us to allocate the appropriate number of instances. +This estimation can be achieved by analyzing historical data, +which provides insights into execution times for similar scenarios. We now have a Kusto table that logs details about the execution of test cases, including the running time, date, results, and more. @@ -84,10 +95,12 @@ Ideally, each instance will run its assigned scripts in approximately 1.5 hours, leaving additional time for tasks such as testbed preparation and clean-up and keeping the total runtime within 2 hours. ## Advantages -Impacted area based PR testing runs test scripts on demand, reducing the overall scale of the PR test and saving execution time. -Additionally, instances will be allocated as needed, resulting in more cost-efficient resource usage. +Impacted area based PR testing runs test scripts on demand, reducing the overall scale of the PR test and saving execution time. +And instances will be allocated as needed, resulting in more cost-efficient resource usage. +Additionally, the PR testing will be more flexible as we can collect test scripts automatically rather than hard code. ## Safeguard -As impacted area based PR testing would not cover all test scripts, -we need a safeguard to run all test scripts daily to prevent any unforeseen issues. -Fortunately, we have Baseline test to do so. \ No newline at end of file +As impacted area based PR testing would not cover all test scripts, we need a safeguard to run all test scripts daily to prevent any unforeseen issues. +Fortunately, we have Baseline testing to do so. +Baseline testing involves running all test scripts in the test plan daily to ensure the overall stability of the system and identify potential issues. +We conduct five rounds of baseline testing each day, and if any issues are detected, an ADO is automatically created, and email alerts are sent to notify the relevant teams.