This repository was archived by the owner on Nov 17, 2023. It is now read-only.
[MXNET-651] MXNet Model Backwards Compatibility Checker #11626
Merged
Changes from 60 commits
4ee8b21
Added MNIST-MLP-Module-API models to check model save and load_checkp…
piyushghai 118850f
Added LENET with Conv2D operator training file
piyushghai 27863fd
Added LENET with Conv2d operator inference file
piyushghai b3e9774
Added LanguageModelling with RNN training file
piyushghai c141701
Added LamguageModelling with RNN inference file
piyushghai 35cbefb
Added hybridized LENET Gluon Model training file
piyushghai 418f805
Added hybridized LENET gluon model inference file
piyushghai 600efaf
Added license headers
piyushghai d73b9e2
Refactored the model and inference files and extracted out duplicate …
piyushghai 3eeba08
Added runtime function for executing the MBCC files
piyushghai 9c0157c
Added JenkinsFile for MBCC to be run as a nightly job
piyushghai 3d43bcd
Added boto3 install for s3 uploads
piyushghai 4b70e4a
Added README for MBCC
piyushghai 08ad342
Added license header
piyushghai 5d1c3fc
Added more common functions from lm_rnn_gluon_train and inference fil…
piyushghai cfe8dfc
Added scripts for training models on older versions of MXNet
piyushghai 7c41488
Added check for preventing inference script from crashing in case no …
piyushghai 50be5d8
Fixed indentation issue
piyushghai c3c9129
Replaced Penn Tree Bank Dataset with Sherlock Holmes Dataset
piyushghai 3485352
Fixed indentation issue
piyushghai af9b86d
Removed training in models and added smaller models. Now we are simpl…
piyushghai 79cfa46
Updated README
piyushghai 4df779b
Fixed indentation error
piyushghai 04465b0
Fixed indentation error
piyushghai 2d5cf09
Removed code duplication in the training file
piyushghai 7bfdf87
Added comments for runtime_functions script for training files
piyushghai c80ee31
Merged S3 Buckets for storing data and models into one
piyushghai e764d5a
Automated the process to fetch MXNet versions from git tags
piyushghai 05ded05
Added defensive checks for the case where the data might not be found
piyushghai 60c7be0
Fixed issue where we were performing inference on state model files
piyushghai 9d4d099
Replaced print statements with logging ones
piyushghai d08ba5a
Merge branch 'master' into mbcc
piyushghai cebfb26
Removed boto install statements and move them into ubuntu_python docker
piyushghai f7a36eb
Separated training and uploading of models into separate files so tha…
piyushghai 1f63941
Updated comments and README
piyushghai fbaf3e0
Fixed pylint warnings
piyushghai edd6816
Removed the venv for training process
piyushghai 87103d4
Fixed indentation in the MBCC Jenkins file and also separated out tra…
piyushghai eb24e8e
Fixed indendation
piyushghai 3525656
Fixed erroneous single quote
piyushghai 25e7ec7
Added --user flag to check for Jenkins error
piyushghai 00ee6e7
Removed unused methods
piyushghai a3a72b8
Added force flag in the pip command to install mxnet
piyushghai 86e8882
Removed the force-re-install flag
piyushghai ddb672a
Changed exit 1 to exit 0
piyushghai 9e77064
Added quotes around the shell command
piyushghai 69843fb
added packlibs and unpack libs for MXNet builds
piyushghai fae44fe
Changed PythonPath from relative to absolute
piyushghai c099979
Created dedicated bucket with correct permission
marcoabreu ffcc637
Fix for python path in training
piyushghai 7f7f6e3
Merge branch 'mbcc' of https://github.com/piyushghai/incubator-mxnet …
piyushghai 33096c0
Changed bucket name to CI bucket
piyushghai 8a085b5
Added set -ex to the upload shell script
piyushghai 5207ab1
Now raising an exception if no models are found in the S3 bucket
piyushghai 5e30f7a
Added regex to train models script
piyushghai e079d3c
Added check for performing inference only on models trained on same m…
piyushghai ceac705
Added set -ex flags to shell scripts
piyushghai 16d320a
Added multi-version regex checks in training
piyushghai 19495d6
Fixed typo in regex
piyushghai d8fa75d
Now we will train models for all the minor versions for a given major…
piyushghai ca01aa2
Added check for validating current_version
piyushghai
120 changes: 120 additions & 0 deletions
tests/nightly/model_backwards_compatibility_check/JenkinsfileForMBCC
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,120 @@ | ||
| // -*- mode: groovy -*- | ||
| // Licensed to the Apache Software Foundation (ASF) under one | ||
| // or more contributor license agreements. See the NOTICE file | ||
| // distributed with this work for additional information | ||
| // regarding copyright ownership. The ASF licenses this file | ||
| // to you under the Apache License, Version 2.0 (the | ||
| // "License"); you may not use this file except in compliance | ||
| // with the License. You may obtain a copy of the License at | ||
| // | ||
| // http://www.apache.org/licenses/LICENSE-2.0 | ||
| // | ||
| // Unless required by applicable law or agreed to in writing, | ||
| // software distributed under the License is distributed on an | ||
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| // KIND, either express or implied. See the License for the | ||
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
|
|
||
| //This is a Jenkinsfile for the model backwards compatibility checker. The format and some functions have been picked up from the top-level Jenkinsfile. | ||
|
|
||
| err = null | ||
| mx_lib = 'lib/libmxnet.so, lib/libmxnet.a, 3rdparty/dmlc-core/libdmlc.a, 3rdparty/tvm/nnvm/lib/libnnvm.a' | ||
|
|
||
| def init_git() { | ||
| deleteDir() | ||
| retry(5) { | ||
| try { | ||
| timeout(time: 15, unit: 'MINUTES') { | ||
| checkout scm | ||
| sh 'git submodule update --init --recursive' | ||
| sh 'git clean -d -f' | ||
| } | ||
| } catch (exc) { | ||
| deleteDir() | ||
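| // Pause before failing this attempt; `error` throws, so anything placed after it never runs | ||
| sleep 2 | ||
| error "Failed to fetch source codes with ${exc}" | ||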
| } | ||
| } | ||
| } | ||
|
|
||
| // pack libraries for later use | ||
| def pack_lib(name, libs=mx_lib) { | ||
| sh """ | ||
| echo "Packing ${libs} into ${name}" | ||
| echo ${libs} | sed -e 's/,/ /g' | xargs md5sum | ||
| """ | ||
| stash includes: libs, name: name | ||
| } | ||
|
|
||
| // unpack libraries saved before | ||
| def unpack_lib(name, libs=mx_lib) { | ||
| unstash name | ||
| sh """ | ||
| echo "Unpacked ${libs} from ${name}" | ||
| echo ${libs} | sed -e 's/,/ /g' | xargs md5sum | ||
| """ | ||
| } | ||
|
|
||
| def docker_run(platform, function_name, use_nvidia, shared_mem = '500m') { | ||
| def command = "ci/build.py --docker-registry ${env.DOCKER_CACHE_REGISTRY} %USE_NVIDIA% --platform %PLATFORM% --shm-size %SHARED_MEM% /work/runtime_functions.sh %FUNCTION_NAME%" | ||
| command = command.replaceAll('%USE_NVIDIA%', use_nvidia ? '--nvidiadocker' : '') | ||
| command = command.replaceAll('%PLATFORM%', platform) | ||
| command = command.replaceAll('%FUNCTION_NAME%', function_name) | ||
| command = command.replaceAll('%SHARED_MEM%', shared_mem) | ||
|
|
||
| sh command | ||
| } | ||
|
|
||
| try { | ||
| stage('MBCC Train'){ | ||
| node('restricted-mxnetlinux-cpu') { | ||
| ws('workspace/modelBackwardsCompat') { | ||
| init_git() | ||
| // Train models on older versions | ||
| docker_run('ubuntu_nightly_cpu', 'nightly_model_backwards_compat_train', false) | ||
| // upload files to S3 here outside of the docker environment | ||
| sh "./tests/nightly/model_backwards_compatibility_check/upload_models_to_s3.sh" | ||
| } | ||
| } | ||
| } | ||
|
|
||
| stage('MXNet Build'){ | ||
| node('restricted-mxnetlinux-cpu') { | ||
| ws('workspace/build-cpu') { | ||
| init_git() | ||
| docker_run('ubuntu_cpu','build_ubuntu_cpu', false) | ||
| pack_lib('cpu', mx_lib) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| stage('MBCC Inference'){ | ||
| node('restricted-mxnetlinux-cpu') { | ||
| ws('workspace/modelBackwardsCompat') { | ||
| init_git() | ||
| unpack_lib('cpu', mx_lib) | ||
| // Perform inference on the latest version of MXNet | ||
| docker_run('ubuntu_nightly_cpu', 'nightly_model_backwards_compat_test', false) | ||
| } | ||
| } | ||
| } | ||
| } catch (caughtError) { | ||
| node("restricted-mxnetlinux-cpu") { | ||
| sh "echo caught ${caughtError}" | ||
| err = caughtError | ||
| currentBuild.result = "FAILURE" | ||
| } | ||
| } finally { | ||
| node("restricted-mxnetlinux-cpu") { | ||
| // Only send email if model backwards compat test failed | ||
| if (currentBuild.result == "FAILURE") { | ||
| emailext body: 'Nightly tests for model backwards compatibility on MXNet branch : ${BRANCH_NAME} failed. Please view the build at ${BUILD_URL}', replyTo: '${EMAIL}', subject: '[MODEL BACKWARDS COMPATIBILITY TEST FAILED] build ${BUILD_NUMBER}', to: '${EMAIL}' | ||
| } | ||
| // Remember to rethrow so the build is marked as failing | ||
| if (err) { | ||
| throw err | ||
| } | ||
| } | ||
| } | ||
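
The `docker_run` helper above assembles its shell command by substituting `%PLACEHOLDER%` tokens in a template string. As a rough illustration of what the resolved command looks like, here is a minimal Python sketch of the same substitution. The function name `build_docker_command` and the `registry` default are hypothetical stand-ins (the Jenkinsfile reads the registry from `env.DOCKER_CACHE_REGISTRY`), and nothing here is executed by the CI job.

```python
def build_docker_command(platform, function_name, use_nvidia,
                         shared_mem="500m", registry="example-registry"):
    """Mirror the placeholder substitution done by docker_run in the Jenkinsfile.

    `registry` is a hypothetical stand-in for env.DOCKER_CACHE_REGISTRY.
    """
    command = (
        "ci/build.py --docker-registry " + registry + " %USE_NVIDIA% "
        "--platform %PLATFORM% --shm-size %SHARED_MEM% "
        "/work/runtime_functions.sh %FUNCTION_NAME%"
    )
    # Each placeholder is replaced literally, one after another.
    command = command.replace("%USE_NVIDIA%", "--nvidiadocker" if use_nvidia else "")
    command = command.replace("%PLATFORM%", platform)
    command = command.replace("%FUNCTION_NAME%", function_name)
    command = command.replace("%SHARED_MEM%", shared_mem)
    return command


if __name__ == "__main__":
    # Arguments taken from the 'MBCC Train' stage above.
    print(build_docker_command("ubuntu_nightly_cpu",
                               "nightly_model_backwards_compat_train",
                               use_nvidia=False))
```

With `use_nvidia` set to false the `%USE_NVIDIA%` token collapses to an empty string, which leaves a harmless double space in the command.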
25 changes: 25 additions & 0 deletions
tests/nightly/model_backwards_compatibility_check/README.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,25 @@ | ||
| # Model Backwards Compatibility Tests | ||
|
|
||
| This folder contains the scripts required to run the nightly job that verifies the compatibility and inference results of models (trained on earlier versions of MXNet) when they are loaded on the latest release candidate. The tests flag a failure if: | ||
| - The models fail to load on the latest version of MXNet. | ||
| - The inference results are different. | ||
|
|
||
|
|
||
| ## JenkinsfileForMBCC | ||
| This is the configuration file for the Jenkins job. | ||
|
|
||
| ## Details | ||
| - Currently, the APIs covered for model saving/loading are: do_checkpoint/load_checkpoint, save_params/load_params, save_parameters/load_parameters (added from v1.2.1 onwards), and export/gluon.SymbolBlock.imports. | ||
| - These APIs are exercised on models with architectures such as MLP, RNNs, LeNet, and LSTMs, covering the four save/load scenarios listed above. | ||
| - More operators/models will be added in the future to extend the operator coverage. | ||
| - The model train file is suffixed by `_train.py` and the trained models are hosted in AWS S3. | ||
| - For now, the trained models are backfilled into S3 for every MXNet release starting from v1.1.0 via the `train_mxnet_legacy_models.sh` script. | ||
| - The `train_mxnet_legacy_models.sh` script checks out the previous two releases using the `git tag` command, trains the models on those MXNet versions, and uploads them to S3. | ||
| - The S3 bucket's folder structure looks like this : | ||
| * 1.1.0/<model-1-files> 1.1.0/<model-2-files> | ||
| * 1.2.0/<model-1-files> 1.2.0/<model-2-files> | ||
| - `<model-1-files>` is itself a folder that contains the trained model's symbol definitions, the toy datasets it was trained on, the weights and parameters of the model, and any other files required to reload the model. | ||
| - Over time, the training script will accumulate a repository of models trained on several versions of MXNet (both major and minor releases). | ||
| - The inference part is checked via the script `model_backwards_compat_inference.sh`. | ||
| - The inference script scans the S3 bucket for MXNet version folders as described above and runs the inference code for each model folder found. | ||
|
|
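
The bucket layout and folder scan described in this README can be sketched without touching S3. The snippet below is a minimal illustration, assuming the `<version>/<model>/<file>` key layout shown above and the same-major-version rule used by `common.py`. The function names `select_version_folders` and `split_model_key` and the sample model name are hypothetical and not part of the PR.

```python
DATA_FOLDER = "mxnet-model-backwards-compatibility-data"  # data folder name used in common.py


def select_version_folders(prefixes, current_version, data_folder=DATA_FOLDER):
    """Keep only version folders that share the major version of current_version."""
    current_major = current_version.split(".")[0]
    folders = []
    for prefix in prefixes:
        name = prefix.strip("/")
        if name == data_folder:
            # The shared data folder is not a version folder.
            continue
        if name.split(".")[0] != current_major:
            # Models are only compared within the same major version.
            continue
        folders.append(name)
    return folders


def split_model_key(key):
    """Split an object key of the form <version>/<model>/<file> into its parts."""
    version, model, file_name = key.split("/", 2)
    return version, model, file_name


if __name__ == "__main__":
    listing = ["0.12.1/", "1.1.0/", "1.2.0/", DATA_FOLDER + "/"]
    print(select_version_folders(listing, "1.3.0"))
    # "example_model" is a made-up model folder name.
    print(split_model_key("1.1.0/example_model/example_model-0001.params"))
```

Running the inference check against MXNet 1.3.0 with this listing would therefore visit only the `1.1.0` and `1.2.0` folders.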
214 changes: 214 additions & 0 deletions
tests/nightly/model_backwards_compatibility_check/common.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,214 @@ | ||
| #!/usr/bin/env python | ||
|
|
||
| # Licensed to the Apache Software Foundation (ASF) under one | ||
| # or more contributor license agreements. See the NOTICE file | ||
| # distributed with this work for additional information | ||
| # regarding copyright ownership. The ASF licenses this file | ||
| # to you under the Apache License, Version 2.0 (the | ||
| # "License"); you may not use this file except in compliance | ||
| # with the License. You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, | ||
| # software distributed under the License is distributed on an | ||
| # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| # KIND, either express or implied. See the License for the | ||
| # specific language governing permissions and limitations | ||
| # under the License. | ||
|
|
||
|
|
||
| import boto3 | ||
| import mxnet as mx | ||
| import os | ||
| import numpy as np | ||
| import logging | ||
| from mxnet import gluon | ||
| import mxnet.ndarray as F | ||
| from mxnet.gluon import nn | ||
| import re | ||
| from mxnet.test_utils import assert_almost_equal | ||
|
|
||
| # Set fixed random seeds. | ||
| mx.random.seed(7) | ||
| np.random.seed(7) | ||
| logging.basicConfig(level=logging.INFO) | ||
|
|
||
| # get the current mxnet version we are running on | ||
| mxnet_version = mx.__version__ | ||
| model_bucket_name = 'mxnet-ci-prod-backwards-compatibility-models' | ||
|
||
| data_folder = 'mxnet-model-backwards-compatibility-data' | ||
| backslash = '/'  # NOTE: despite the name, this is a forward slash, used as the S3 key delimiter | ||
| s3 = boto3.resource('s3') | ||
| ctx = mx.cpu(0) | ||
|
|
||
|
|
||
| def get_model_path(model_name): | ||
| return os.path.join(os.getcwd(), 'models', str(mxnet_version), model_name) | ||
|
|
||
|
|
||
| def get_module_api_model_definition(): | ||
| input = mx.symbol.Variable('data') | ||
| input = mx.symbol.Flatten(data=input) | ||
|
|
||
| fc1 = mx.symbol.FullyConnected(data=input, name='fc1', num_hidden=128) | ||
| act1 = mx.sym.Activation(data=fc1, name='relu1', act_type="relu") | ||
| fc2 = mx.symbol.FullyConnected(data=act1, name='fc2', num_hidden=2) | ||
| op = mx.symbol.SoftmaxOutput(data=fc2, name='softmax') | ||
| model = mx.mod.Module(symbol=op, context=ctx, data_names=['data'], label_names=['softmax_label']) | ||
| return model | ||
|
|
||
|
|
||
| def save_inference_results(inference_results, model_name): | ||
| assert (isinstance(inference_results, mx.ndarray.ndarray.NDArray)) | ||
| save_path = os.path.join(get_model_path(model_name), ''.join([model_name, '-inference'])) | ||
|
|
||
| mx.nd.save(save_path, {'inference': inference_results}) | ||
|
|
||
|
|
||
| def load_inference_results(model_name): | ||
| inf_dict = mx.nd.load(model_name+'-inference') | ||
| return inf_dict['inference'] | ||
|
|
||
|
|
||
| def save_data_and_labels(test_data, test_labels, model_name): | ||
| assert (isinstance(test_data, mx.ndarray.ndarray.NDArray)) | ||
| assert (isinstance(test_labels, mx.ndarray.ndarray.NDArray)) | ||
|
|
||
| save_path = os.path.join(get_model_path(model_name), ''.join([model_name, '-data'])) | ||
| mx.nd.save(save_path, {'data': test_data, 'labels': test_labels}) | ||
|
|
||
|
|
||
| def clean_model_files(files, model_name): | ||
| files.append(model_name + '-inference') | ||
| files.append(model_name + '-data') | ||
|
|
||
| for file in files: | ||
| if os.path.isfile(file): | ||
| os.remove(file) | ||
|
|
||
|
|
||
| def download_model_files_from_s3(model_name, folder_name): | ||
| model_files = list() | ||
| bucket = s3.Bucket(model_bucket_name) | ||
| prefix = folder_name + backslash + model_name | ||
| model_files_meta = list(bucket.objects.filter(Prefix=prefix)) | ||
| if len(model_files_meta) == 0: | ||
| logging.error('No trained models found under path : %s', prefix) | ||
| return model_files | ||
| for obj in model_files_meta: | ||
| file_name = obj.key.split('/')[2] | ||
| model_files.append(file_name) | ||
| # Download this file | ||
| bucket.download_file(obj.key, file_name) | ||
|
|
||
| return model_files | ||
|
|
||
|
|
||
| def get_top_level_folders_in_bucket(s3client, bucket_name): | ||
| # This function returns the top level folders in the S3Bucket. | ||
| # These folders help us to navigate to the trained model files stored for different MXNet versions. | ||
| bucket = s3client.Bucket(bucket_name) | ||
| result = bucket.meta.client.list_objects(Bucket=bucket.name, Delimiter=backslash) | ||
| folder_list = list() | ||
| if 'CommonPrefixes' not in result: | ||
| logging.error('No trained models found in S3 bucket : %s for this file. ' | ||
| 'Please train the models and run inference again' % bucket_name) | ||
| raise Exception("No trained models found in S3 bucket : %s for this file. " | ||
| "Please train the models and run inference again" % bucket_name) | ||
| for obj in result['CommonPrefixes']: | ||
| folder_name = obj['Prefix'].strip(backslash) | ||
| # We only compare models from the same major versions. i.e. 1.x.x compared with latest 1.y.y etc | ||
| if str(folder_name).split('.')[0] != str(mxnet_version).split('.')[0]: | ||
| continue | ||
| # The top level folders contain MXNet Version # for trained models. Skipping the data folder here | ||
| if folder_name == data_folder: | ||
| continue | ||
| folder_list.append(obj['Prefix'].strip(backslash)) | ||
|
|
||
| if len(folder_list) == 0: | ||
| logging.error('No trained models found in S3 bucket : %s for this file. ' | ||
| 'Please train the models and run inference again' % bucket_name) | ||
| raise Exception("No trained models found in S3 bucket : %s for this file. " | ||
| "Please train the models and run inference again" % bucket_name) | ||
| return folder_list | ||
|
|
||
|
|
||
| def create_model_folder(model_name): | ||
| path = get_model_path(model_name) | ||
| if not os.path.exists(path): | ||
| os.makedirs(path) | ||
|
|
||
|
|
||
| class Net(gluon.Block): | ||
| def __init__(self, **kwargs): | ||
| super(Net, self).__init__(**kwargs) | ||
| with self.name_scope(): | ||
| # layers created in name_scope will inherit name space | ||
| # from parent layer. | ||
| self.conv1 = nn.Conv2D(20, kernel_size=(5, 5)) | ||
| self.pool1 = nn.MaxPool2D(pool_size=(2, 2), strides=(2, 2)) | ||
| self.conv2 = nn.Conv2D(50, kernel_size=(5, 5)) | ||
| self.pool2 = nn.MaxPool2D(pool_size=(2, 2), strides=(2, 2)) | ||
| self.fc1 = nn.Dense(500) | ||
| self.fc2 = nn.Dense(2) | ||
|
|
||
| def forward(self, x): | ||
| x = self.pool1(F.tanh(self.conv1(x))) | ||
| x = self.pool2(F.tanh(self.conv2(x))) | ||
| # 0 means copy over size from corresponding dimension. | ||
| # -1 means infer size from the rest of dimensions. | ||
| x = x.reshape((0, -1)) | ||
| x = F.tanh(self.fc1(x)) | ||
| x = F.tanh(self.fc2(x)) | ||
| return x | ||
|
|
||
|
|
||
| class HybridNet(gluon.HybridBlock): | ||
| def __init__(self, **kwargs): | ||
| super(HybridNet, self).__init__(**kwargs) | ||
| with self.name_scope(): | ||
| # layers created in name_scope will inherit name space | ||
| # from parent layer. | ||
| self.conv1 = nn.Conv2D(20, kernel_size=(5, 5)) | ||
| self.pool1 = nn.MaxPool2D(pool_size=(2, 2), strides=(2, 2)) | ||
| self.conv2 = nn.Conv2D(50, kernel_size=(5, 5)) | ||
| self.pool2 = nn.MaxPool2D(pool_size=(2, 2), strides=(2, 2)) | ||
| self.fc1 = nn.Dense(500) | ||
| self.fc2 = nn.Dense(2) | ||
|
|
||
| def hybrid_forward(self, F, x): | ||
| x = self.pool1(F.tanh(self.conv1(x))) | ||
| x = self.pool2(F.tanh(self.conv2(x))) | ||
| # 0 means copy over size from corresponding dimension. | ||
| # -1 means infer size from the rest of dimensions. | ||
| x = x.reshape((0, -1)) | ||
| x = F.tanh(self.fc1(x)) | ||
| x = F.tanh(self.fc2(x)) | ||
| return x | ||
|
|
||
|
|
||
| class SimpleLSTMModel(gluon.Block): | ||
| def __init__(self, **kwargs): | ||
| super(SimpleLSTMModel, self).__init__(**kwargs) | ||
| with self.name_scope(): | ||
| self.model = mx.gluon.nn.Sequential(prefix='') | ||
| with self.model.name_scope(): | ||
| self.model.add(mx.gluon.nn.Embedding(30, 10)) | ||
| self.model.add(mx.gluon.rnn.LSTM(20)) | ||
| self.model.add(mx.gluon.nn.Dense(100)) | ||
| self.model.add(mx.gluon.nn.Dropout(0.5)) | ||
| self.model.add(mx.gluon.nn.Dense(2, flatten=True, activation='tanh')) | ||
|
|
||
| def forward(self, x): | ||
| return self.model(x) | ||
|
|
||
|
|
||
| def compare_versions(version1, version2): | ||
| ''' | ||
| https://stackoverflow.com/questions/1714027/version-number-comparison-in-python | ||
| ''' | ||
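| def normalize(v): | ||
| return [int(x) for x in re.sub(r'(\.0+)*$','', v).split(".")] | ||
| # cmp() does not exist in Python 3; this expression yields -1, 0 or 1 on both Python 2 and 3 | ||
| v1, v2 = normalize(version1), normalize(version2) | ||
| return (v1 > v2) - (v1 < v2) | ||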
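
As a standalone illustration of the version comparison in `compare_versions`, the sketch below strips trailing `.0` components and then compares the remaining integer lists. It is written without the `cmp` builtin so that it also runs on Python 3, and it assumes purely numeric version strings (a suffix such as `rc0` would make `int()` fail).

```python
import re


def compare_versions(version1, version2):
    """Return -1, 0 or 1 depending on how version1 orders against version2."""
    def normalize(v):
        # Drop trailing ".0" groups so that "1.2.0" and "1.2" compare equal.
        return [int(x) for x in re.sub(r'(\.0+)*$', '', v).split(".")]

    v1, v2 = normalize(version1), normalize(version2)
    # Equivalent to Python 2's cmp(v1, v2).
    return (v1 > v2) - (v1 < v2)


if __name__ == "__main__":
    print(compare_versions("1.2.0", "1.2"))     # 0
    print(compare_versions("1.1.0", "1.2.1"))   # -1
    print(compare_versions("1.10.0", "1.9.0"))  # 1
```

The last call shows why the components are compared as integers rather than as strings: `"1.10.0"` sorts after `"1.9.0"` numerically even though it sorts before it lexically.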