Description
I have been using spacy on streaming data (mostly Twitter and news stories), and I believe the fundamental design of the Vocab/StringStore in spacy is problematic for streaming processing. For batch jobs, the memory overhead of storing a new lexeme struct for each new word form encountered during parsing is negligible compared to the speed gains, and because most text follows the assumption that vocabulary size grows only logarithmically while the total number of tokens grows linearly, this is usually a safe bet. But for streaming text, especially social media where new terms are invented by the minute (hashtags and URLs in particular), this assumption no longer holds, and the spacy vocabulary storage becomes a dynamic element in what should be a completely static production deployment.
In order to test this assumption, I took one million tweets and performed a rudimentary analysis using the resource module in Python to sample the program's maximum memory usage at regular intervals during processing. I first did some minor preprocessing to strip newlines from the data so that it could be read line by line rather than held in memory all at once, then ran spacy with all models set to false, i.e. only the tokenizer loaded. I then repeated the run after removing all URLs, hashtags, and Twitter mentions from the data and filtering out the resulting empty strings (this caused a 1.4% loss in total tweets processed, which is fairly minor).
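For reference, a minimal sketch of this kind of measurement (not the exact script from the repo linked below) might look like the following. It assumes a file named `tweets.txt` with one tweet per line, and uses `spacy.blank("en")` as a tokenizer-only pipeline, which may differ from the loading call used in the original run:

```python
import resource
import spacy

def peak_memory_mb():
    # ru_maxrss is the peak resident set size; on Linux it is reported in KB
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

# Tokenizer-only pipeline: no tagger, parser, or NER components are loaded
nlp = spacy.blank("en")

baseline = peak_memory_mb()
with open("tweets.txt") as f:              # one tweet per line, newlines pre-stripped
    for i, line in enumerate(f, 1):
        doc = nlp.tokenizer(line)          # tokenize only; the vocab grows as new forms appear
        if i % 100_000 == 0:
            print(f"{i} tweets: +{peak_memory_mb() - baseline:.1f} MB over baseline")
```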
The final result was that spacy used an additional 278.6 MB after tokenizing the raw tweets, versus 60.99 MB of additional memory after tokenizing the pre-processed tweets. This confirms my hypothesis, though it also shows that the memory increase isn't all that significant (especially at the relatively low volume I'm currently processing). Still, it points to a potential flaw in the design of the library.
My near-term suggestion/request would be an option to make the vocabulary read-only, so that users who want to leave spacy running unattended on streaming data don't have to worry about changing memory requirements. In the long term, I think an optimal solution would be a timeout on vocabulary entries that weren't loaded at initialization: e.g. if a lexeme hasn't been accessed in the last n seconds, delete it from the StringStore, with n user-configurable.
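Until something like that exists, the only user-side workaround I can think of is to watch vocabulary growth and periodically rebuild the pipeline. A rough sketch, where `MAX_NEW_LEXEMES` is an arbitrary illustrative threshold (not a spacy setting) and `tweets.txt` is the same hypothetical input as above:

```python
import spacy

MAX_NEW_LEXEMES = 200_000        # arbitrary illustrative threshold, not a spaCy setting

nlp = spacy.blank("en")          # tokenizer-only pipeline with a fresh Vocab
base_size = len(nlp.vocab)

with open("tweets.txt") as f:
    for line in f:
        doc = nlp.tokenizer(line)
        # ... process doc, without keeping references to it ...

        # When the vocabulary has grown past the threshold, rebuild the
        # pipeline so the old Vocab/StringStore can be garbage collected.
        # Any Doc objects still referencing the old vocab would keep it alive.
        if len(nlp.vocab) - base_size > MAX_NEW_LEXEMES:
            nlp = spacy.blank("en")
            base_size = len(nlp.vocab)
```

This obviously throws away everything rather than just stale entries, which is exactly why a read-only or timeout option inside the library would be much cleaner.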
My code and results are available here: https://github.com/ELind77/spacy_memory_growth
Thanks again for continuing to develop such a great library!
-- Eric