The current pickling implementation was only supposed to be an exploratory kludge. However, I didn't leave a TODO and the status of it got lost.
`Vocab.__reduce__` currently writes state to temp files, which are then never cleaned up. Pickling therefore fills the disk, and really only pretends to work.
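To make the failure mode concrete, the pattern is roughly the following (a simplified sketch, not the actual implementation; `dump`, `load` and `unpickle_vocab` are stand-ins):

```python
import tempfile

class Vocab(object):
    def dump(self, path):
        # Stand-in for writing the large binary state (lexemes,
        # vectors, strings) out to `path`.
        pass

    def load(self, path):
        # Stand-in for reading that state back in.
        pass

    def __reduce__(self):
        # Write the state to a temp file so the pickle payload only
        # has to carry a path...
        fd, path = tempfile.mkstemp()
        self.dump(path)
        # ...but the temp file is never removed, so every pickle call
        # leaves the dumped state behind on disk.
        return (unpickle_vocab, (path,))

def unpickle_vocab(path):
    vocab = Vocab()
    vocab.load(path)
    return vocab
```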
The root of the problem is that a number of spaCy classes carry large binary data structures. Common usage is to load this data and treat it as immutable; however, you can write to these models, e.g. to change the word vectors, and pickle should not silently discard those changes. On the other hand, it's harsh to assume we always need to write out the state: users who follow the pattern of keeping the data immutable would have to write out ~1GB of data just to pickle the models. This makes ordinary usage with Spark etc. really problematic.
We could do this implicitly with copy-on-write semantics, but I don't think it's great to call a method that may or may not write out 1GB of data to disk, depending on the entire execution history of the program.
We could have a more explicit version of copy-on-write, where all the classes track whether they've been changed, and the models refuse to be pickled if the state is unclean. Users would then explicitly save the state after they change it. I think this is a recipe for having long-running processes suddenly die, though. Python is mostly designed around the assumption that a thing can either be pickled or it can't. It's surprising to find that your pickle works sometimes, depending on state, and then your long-running process dies because you didn't meet the assumed invariant. And the next time you run, you get an error in the other place in your code where the classes get pickled.
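Concretely, the explicit version would amount to something like this (again a sketch; the `_dirty` flag and the method names are invented for illustration):

```python
import pickle

class Vocab(object):
    def __init__(self):
        self._dirty = False

    def set_vector(self, word, vector):
        # Any mutation marks the state as unclean.
        self._dirty = True
        # ... update the underlying vector table ...

    def save(self, path):
        # ... write the state out ...
        self._dirty = False

    def __reduce__(self):
        if self._dirty:
            # Refuse to pickle unsaved changes. This is the error that
            # shows up at some arbitrary later point in a long-running
            # process, rather than where the mutation happened.
            raise pickle.PicklingError(
                "Vocab has unsaved changes; call save() before pickling")
        return (Vocab, ())
```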
I've been thinking for a while that context managers are the idiomatic standard for dealing with this problem. The idea would be that if you want to write to any of this loaded data, you have to open it within a context manager, so that the changes are explicitly scoped, and you explicitly decide whether you want to save the changes or dump them.
Ignoring the naming of everything, this might look like:
```python
from spacy.en import English

nlp = English()

# Open a pre-trained model, do some more training, and save the changes
with nlp.entity.update_model(file_or_path_or_etc):
    for doc, labels in my_training_data:
        nlp.entity.train(doc, labels)

# Change the vector of 'submarines' to be the vector formed
# by "spaceships - space + ocean".
# When the context manager exits, revert the changes
with nlp.vocab.update_lexicon(revert=True):
    submarines = nlp.vocab[u'submarines']
    spaceships = nlp.vocab[u'spaceships']
    space = nlp.vocab[u'space']
    ocean = nlp.vocab[u'ocean']
    submarines.vector = (spaceships.vector - space.vector) + ocean.vector
```
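For what it's worth, the context manager itself could be a thin wrapper along these lines (just a sketch; `copy_state`, `restore_state` and `save` are placeholders for whatever the real serialization ends up being):

```python
from contextlib import contextmanager

@contextmanager
def update_lexicon(vocab, file_or_path=None, revert=False):
    # Snapshot the mutable state so it can be rolled back on exit.
    snapshot = vocab.copy_state()
    try:
        yield vocab
    finally:
        if revert:
            # Dump the changes: put the original state back.
            vocab.restore_state(snapshot)
        elif file_or_path is not None:
            # Keep the changes and write them out explicitly, so that
            # pickling never has to guess about unsaved state.
            vocab.save(file_or_path)
```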