This repository was archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-374] handle row_sparse weight in parameter and trainer#11001

Merged
piiswrong merged 23 commits into apache:master from eric-haibin-lin:sparse-block
May 29, 2018

Conversation

@eric-haibin-lin
Member

@eric-haibin-lin eric-haibin-lin commented May 19, 2018

Description

@piiswrong @szha @ZiyueHuang @haojin2 @safrooze please review.

  • added row_sparse stype to parameter
  • added trainer reference in parameter
  • added API to fetch row-sparse-data from parameter
  • In trainer, separated kvstore creation and parameter initialization in kvstore into two functions: _init_kv and _init_params
  • added check for loading parameters when trainer's kvstore is present.
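The pieces above fit together roughly as follows. This is a minimal plain-Python mock (no MXNet) of the association between a row_sparse parameter and its trainer; `Parameter`, `Trainer`, `row_sparse_data`, and `pull_rows` here are illustrative stand-ins sketched from the description, not the real gluon API.

```python
# Minimal mock of the parameter/trainer association added in this PR.
# All names are illustrative stand-ins, not the real gluon classes.

class Parameter:
    def __init__(self, name, stype='default'):
        self.name = name
        self._stype = stype
        self._trainer = None

    def row_sparse_data(self, row_id):
        # Fetching row-sparse data goes through the trainer, which owns
        # the kvstore holding the full weight.
        if self._stype != 'row_sparse':
            raise RuntimeError("Parameter '%s' is not row_sparse." % self.name)
        if self._trainer is None:
            raise RuntimeError(
                "Cannot get row_sparse data for Parameter '%s' when no "
                "Trainer is created with it." % self.name)
        return self._trainer.pull_rows(self.name, row_id)


class Trainer:
    def __init__(self, params):
        self._params = list(params)
        for p in self._params:
            p._trainer = self      # stand-in for param._set_trainer(self)

    def pull_rows(self, name, row_id):
        # Stand-in for a sparse kvstore row pull.
        return [(name, r) for r in row_id]


w = Parameter('weight', stype='row_sparse')
trainer = Trainer([w])
rows = w.row_sparse_data([0, 2])   # only the requested rows are pulled
```

A parameter with no trainer attached raises instead of silently returning stale data, which is the check discussed in the review below.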

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with it

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@eric-haibin-lin eric-haibin-lin requested a review from szha as a code owner May 19, 2018 00:10
Comment thread python/mxnet/gluon/block.py Outdated
if stype != 'default':
raise ValueError("Cannot create a HybridBlock with Parameter '%s' " \
"because its storage type is %s. Please consider " \
"using a SparseBlock instead."%(param.name, stype))
Member Author

PR for sparse block will be created separately after this one is merged.

Member

"please consider using" -> "please use"

def test_sparse_parameter():
p = gluon.Parameter('weight', shape=(10, 10), grad_stype='row_sparse')
p = gluon.Parameter('weight', shape=(10, 10), stype='row_sparse', grad_stype='row_sparse')
p.initialize(init='xavier', ctx=[mx.cpu(0), mx.cpu(1)])
Contributor

Seems like constraining the contexts to CPU is causing test failures on GPU; is this necessary?

Member Author

updated

"grad_stype for Parameter '%s' must be one of 'default', 'row_sparse', or 'csr'," \
" but got '%s'" % (name, grad_stype)
# sparse related storage type information
valid_stypes = ['default', 'row_sparse', 'csr']
Member

might as well make it a set.

Member Author

@eric-haibin-lin eric-haibin-lin May 21, 2018

It only has 3 elements, so I don't think this makes any real difference.

Comment thread python/mxnet/gluon/parameter.py Outdated
""" Set the trainer this parameter is associated with. """
if self._trainer and self._trainer is not trainer:
raise RuntimeError(
"Failed to set the trainer for Parameter '%s' to %s because it was set to %s. " \
Member

How can a user detach a parameter's association with a trainer without exiting Python?

Member Author

Updated. Users can just call _set_trainer(None). I don't think this will be used by most users, hence it remains private.

@eric-haibin-lin eric-haibin-lin requested a review from piiswrong May 22, 2018 03:41
Comment thread python/mxnet/gluon/block.py Outdated
"""
def __init__(self, prefix=None, params=None):
# check if any parameter is row_sparse
if isinstance(params, ParameterDict):
Contributor

This check shouldn't be done here.
Parameters are only added to the current block when self.params.get is called.

Member Author

Removed. Will the checks in param.list_data() and param.data() be sufficient?

Comment thread python/mxnet/gluon/parameter.py Outdated
raise RuntimeError(
"Failed to set the trainer for Parameter '%s' to %s because it was set to %s. " \
"More than one trainers for a single Parameter is not supported." %(
self.name, str(trainer), str(self._trainer)))
Contributor

what does str(trainer) show? It's likely not meaningful to users

Contributor

This is a breaking change.
Suppose users want to use SGD to train for 10 epochs and then switch to Adam; this would prevent that.

Member Author

Now it only throws an exception for row_sparse params.
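The revised behavior can be sketched as follows: only a row_sparse parameter rejects a second trainer, so dense parameters can still switch optimizers mid-training. This is a plain-Python illustration; `set_trainer` and the dict-based params are hypothetical stand-ins for the gluon internals.

```python
# Illustrative sketch: only row_sparse parameters reject a second trainer;
# dense parameters may switch (e.g. SGD for 10 epochs, then Adam).
def set_trainer(param, trainer):
    if (param['stype'] == 'row_sparse'
            and param['trainer'] is not None
            and param['trainer'] is not trainer):
        raise RuntimeError(
            "row_sparse Parameter '%s' already has a trainer." % param['name'])
    param['trainer'] = trainer

dense = {'name': 'w_dense', 'stype': 'default', 'trainer': None}
sparse = {'name': 'w_sparse', 'stype': 'row_sparse', 'trainer': None}
sgd, adam = object(), object()

set_trainer(dense, sgd)
set_trainer(dense, adam)     # fine: dense params may change trainers
set_trainer(sparse, sgd)
# set_trainer(sparse, adam)  # would raise for the row_sparse param
```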

Comment thread python/mxnet/gluon/parameter.py Outdated
""" Get row_sparse data from row_sparse parameters based on row_id. """
# get row sparse params based on row ids
if not isinstance(row_id, ndarray.NDArray):
raise TypeError("Cannot get 'row_sparse' Parameter %s with %s type. "
Contributor

"row_id must have NDArray type, but %s is given"

"NDArray type is expected." % (self.name, type(row_id)))
if not self._trainer:
raise RuntimeError("Cannot get row_sparse data for Parameter '%s' when no " \
"Trainer is created with it."%self.name)
Contributor

What if the user wants to train with a single device?

Member Author

For a single device, we encourage the user to use normal hybrid blocks with sparse_grad=True; there's no need to use a row_sparse weight.
Even if the user chooses to use a row_sparse weight, a kvstore is created for the row_sparse param and the code still works.

Comment thread python/mxnet/gluon/parameter.py Outdated
"""(Re)initializes by loading from data."""
if self._trainer and self._trainer._kv_initialized and self._trainer._update_on_kvstore:
raise RuntimeError("Cannot (Re)initialize Parameter '%s' when its Trainer " \
"already initialized the parameter on KVStore."%(self.name))
Contributor

The message is cryptic. The actual reason is multi-device training with update_on_kvstore=True.
The error message should describe the reason and suggest a solution.

Member Author

Updated message.
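The guard under discussion can be sketched like this: once the trainer has initialized a parameter on a kvstore that performs the updates, loading new values into the parameter would silently diverge from the kvstore copy, so it is rejected. This is a plain-Python mock; the class and attribute names mirror the snippet above but are otherwise illustrative.

```python
# Illustrative sketch of the (re)initialization guard. Loading is rejected
# once the trainer has pushed the parameter to a kvstore that performs
# updates, because the kvstore copy would diverge from the loaded values.
class Trainer:
    def __init__(self):
        self._kv_initialized = False
        self._update_on_kvstore = False


class Parameter:
    def __init__(self, name, trainer=None):
        self.name = name
        self._trainer = trainer
        self.data = None

    def _load_init(self, data):
        t = self._trainer
        if t and t._kv_initialized and t._update_on_kvstore:
            raise RuntimeError(
                "Cannot (re)initialize Parameter '%s': its Trainer already "
                "initialized it on KVStore with update_on_kvstore=True. "
                "Reset the trainer's kvstore, or load parameters before "
                "creating the Trainer." % self.name)
        self.data = data


t = Trainer()
p = Parameter('weight', trainer=t)
p._load_init([1.0])            # fine: kvstore not initialized yet
t._kv_initialized = True
t._update_on_kvstore = True    # from here on, loading would raise
```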

Comment thread python/mxnet/gluon/parameter.py Outdated
NDArray on ctx
"""
if self._stype != 'default':
raise ValueError("Cannot return a copy of Parameter '%s' on ctx %s via data() " \
Contributor

These should be UserError?

Member Author

Maybe I should change it to RuntimeError? There's UserWarning, but I am not aware of a UserError.

self._param2idx[param.name] = i
self._params.append(param)
self._params_to_init.append(param)
param._set_trainer(self)
Contributor

do we need to set_trainer when stype='default' and update_on_kvstore=False?

Comment thread python/mxnet/gluon/trainer.py Outdated
for _ in self._contexts]

def _init_params(self):
""" Initialize parameters in the KVStore. Parameters whose
Contributor

Wrong format

Comment thread python/mxnet/gluon/trainer.py Outdated
"when KVStore is not initialized."
params_to_init = []
if self._kvstore:
params = [param for param in self._params_to_init \
Contributor

Better to use a for loop with if/else here.
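The suggested refactor can be sketched as a single-pass partition: one for loop that either initializes a ready parameter on the kvstore or keeps it pending. `init_ready_params`, `is_deferred`, and the dict used as a kvstore are hypothetical stand-ins for the gluon trainer internals, not the actual code.

```python
# Illustrative sketch of the reviewer's suggestion: replace the list
# comprehension with one for loop that uses if/else to either initialize
# a ready parameter on the kvstore or keep it in the pending list.
def init_ready_params(pending, kvstore, is_deferred):
    """Initialize ready params on the kvstore; return those still deferred."""
    still_pending = []
    for param in pending:
        if is_deferred(param):
            still_pending.append(param)   # not initialized yet; retry later
        else:
            kvstore[param] = 0.0          # stand-in for kv.init(param)
    return still_pending


kv = {}
remaining = init_ready_params(['w', 'b', 'emb'], kv,
                              is_deferred=lambda p: p == 'emb')
```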

"""
if not self._kv_initialized:
self._init_kvstore()
if self._params_to_init:
Contributor

I don't quite understand this. If there are uninitialized parameters, wouldn't step fail?

Member Author

I moved the logic of kv.init(param) from _init_kvstore to _init_params. _params_to_init refers to params that are not yet initialized on the kvstore.
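The two-phase lazy initialization described here can be sketched as follows: the kvstore is created once, and any still-pending parameters are pushed to it at the start of each step. This is a minimal mock; the `Trainer` class and its dict kvstore are illustrative stand-ins for the gluon implementation.

```python
# Illustrative sketch of the split between _init_kvstore (create the
# kvstore once) and _init_params (push any not-yet-initialized params),
# both invoked lazily from step().
class Trainer:
    def __init__(self, params):
        self._params_to_init = list(params)  # pending kvstore initialization
        self._kv_initialized = False
        self._kvstore = None

    def _init_kvstore(self):
        self._kvstore = {}                   # stand-in for a real kvstore
        self._kv_initialized = True

    def _init_params(self):
        for name in self._params_to_init:
            self._kvstore[name] = 0.0        # stand-in for kv.init(param)
        self._params_to_init = []

    def step(self):
        if not self._kv_initialized:
            self._init_kvstore()
        if self._params_to_init:
            self._init_params()
        # ... perform the actual parameter update here ...


t = Trainer(['weight', 'bias'])
t.step()
```

Because both checks live in step(), a parameter added or deferred after the trainer is constructed still gets initialized before the first update that touches it.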

@piiswrong piiswrong merged commit 482e50b into apache:master May 29, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
…#11001)

* + rsp parameter

* draft

* Fix optimizer pickle

* refactor and document

* add test for save load with cast_stype

* refactor trainer tests

* add test

* add back test

* raise error for load params

* add comment

* remove print

* fix doc

* CR comments

* CR comments

* change error

* remove cast stype

* fix test

* add reset kvstore to trainer

* lint

* add test to CI

* add more checks
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
@eric-haibin-lin eric-haibin-lin deleted the sparse-block branch September 18, 2018 23:33