"Property couldn't be hashed properly" even though fully picklable #3178

@BramVanroy

Description

Describe the bug

I am trying to tokenize a dataset with spaCy. I found that no matter what I do, the spaCy language object (nlp) prevents datasets from pickling correctly - or so the warning says - even though pickling it manually works fine. It should not be an issue either way, since spaCy objects are picklable.

Steps to reproduce the bug

Here is a Colab, but for some reason I cannot reproduce the issue there. That may have to do with logging/tqdm on Colab, or with running things in notebooks. I ran the code below as a Python script on both Windows and Ubuntu and got the same issue (warning below).

import pickle

from datasets import load_dataset
import spacy


class Processor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner", "lemmatizer"])

    @staticmethod
    def collate(batch):
        return [d["en"] for d in batch]

    def parse(self, batch):
        batch = batch["translation"]
        return {"translation_tok": [{"en_tok": " ".join([t.text for t in doc])} for doc in self.nlp.pipe(self.collate(batch))]}

    def process(self):
        ds = load_dataset("wmt16", "de-en", split="train[:10%]")
        ds = ds.map(self.parse, batched=True, num_proc=6)


if __name__ == '__main__':
    pr = Processor()

    # succeeds
    with open("temp.pkl", "wb") as f:
        pickle.dump(pr, f)
    print("Successfully pickled!")

    pr.process()

Here is a small variation that calls Hasher.hash directly, showing that the hasher cannot successfully pickle parts of the nlp object.

from datasets.fingerprint import Hasher
import pickle

from datasets import load_dataset
import spacy


class Processor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner", "lemmatizer"])

    @staticmethod
    def collate(batch):
        return [d["en"] for d in batch]

    def parse(self, batch):
        batch = batch["translation"]
        return {"translation_tok": [{"en_tok": " ".join([t.text for t in doc])} for doc in self.nlp.pipe(self.collate(batch))]}

    def process(self):
        ds = load_dataset("wmt16", "de-en", split="train[:10]")
        return ds.map(self.parse, batched=True)


if __name__ == '__main__':
    pr = Processor()

    # succeeds
    with open("temp.pkl", "wb") as f:
        pickle.dump(pr, f)
    print("Successfully pickled class instance!")

    # succeeds
    with open("temp.pkl", "wb") as f:
        pickle.dump(pr.nlp, f)
    print("Successfully pickled nlp!")

    # fails
    print(Hasher.hash(pr.nlp))
    pr.process()

Expected results

I expect the object to be picklable, the map to work with proper fingerprinting, and no warning to be emitted.

Actual results

In the first snippet, I get this warning:
Parameter 'function'=<function Processor.parse at 0x7f44982247a0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.

In the second, I get this traceback, which points to the Hasher.hash line.

Traceback (most recent call last):
  File " \Python\Python36\lib\pickle.py", line 918, in save_global
    obj2, parent = _getattribute(module, name)
  File " \Python\Python36\lib\pickle.py", line 266, in _getattribute
    .format(name, obj))
AttributeError: Can't get local attribute 'add_codes.<locals>.ErrorsWithCodes' on <function add_codes at 0x00000296FF606EA0>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File " scratch_4.py", line 40, in <module>
    print(Hasher.hash(pr.nlp))
  File " \lib\site-packages\datasets\fingerprint.py", line 191, in hash
    return cls.hash_default(value)
  File " \lib\site-packages\datasets\fingerprint.py", line 184, in hash_default
    return cls.hash_bytes(dumps(value))
  File " \lib\site-packages\datasets\utils\py_utils.py", line 345, in dumps
    dump(obj, file)
  File " \lib\site-packages\datasets\utils\py_utils.py", line 320, in dump
    Pickler(file, recurse=True).dump(obj)
  File " \lib\site-packages\dill\_dill.py", line 498, in dump
    StockPickler.dump(self, obj)
  File " \Python\Python36\lib\pickle.py", line 409, in dump
    self.save(obj)
  File " \Python\Python36\lib\pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File " \Python\Python36\lib\pickle.py", line 634, in save_reduce
    save(state)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \lib\site-packages\dill\_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File " \Python\Python36\lib\pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File " \Python\Python36\lib\pickle.py", line 847, in _batch_setitems
    save(v)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \Python\Python36\lib\pickle.py", line 781, in save_list
    self._batch_appends(obj)
  File " \Python\Python36\lib\pickle.py", line 805, in _batch_appends
    save(x)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \Python\Python36\lib\pickle.py", line 736, in save_tuple
    save(element)
  File " \Python\Python36\lib\pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File " \Python\Python36\lib\pickle.py", line 634, in save_reduce
    save(state)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \Python\Python36\lib\pickle.py", line 736, in save_tuple
    save(element)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \lib\site-packages\dill\_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File " \Python\Python36\lib\pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File " \Python\Python36\lib\pickle.py", line 847, in _batch_setitems
    save(v)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \lib\site-packages\dill\_dill.py", line 1176, in save_instancemethod0
    pickler.save_reduce(MethodType, (obj.__func__, obj.__self__), obj=obj)
  File " \Python\Python36\lib\pickle.py", line 610, in save_reduce
    save(args)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \Python\Python36\lib\pickle.py", line 736, in save_tuple
    save(element)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \lib\site-packages\datasets\utils\py_utils.py", line 523, in save_function
    obj=obj,
  File " \Python\Python36\lib\pickle.py", line 610, in save_reduce
    save(args)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \Python\Python36\lib\pickle.py", line 751, in save_tuple
    save(element)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \lib\site-packages\dill\_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File " \Python\Python36\lib\pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File " \Python\Python36\lib\pickle.py", line 847, in _batch_setitems
    save(v)
  File " \Python\Python36\lib\pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File " \Python\Python36\lib\pickle.py", line 605, in save_reduce
    save(cls)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \lib\site-packages\dill\_dill.py", line 1439, in save_type
    StockPickler.save_global(pickler, obj, name=name)
  File " \Python\Python36\lib\pickle.py", line 922, in save_global
    (obj, module_name, name))
_pickle.PicklingError: Can't pickle <class 'spacy.errors.add_codes.<locals>.ErrorsWithCodes'>: it's not found as spacy.errors.add_codes.<locals>.ErrorsWithCodes

Environment info

Tried on both Linux and Windows

  • datasets version: 1.14.0
  • Platform: Windows-10-10.0.19041-SP0 + Python 3.7.9; Linux-5.11.0-38-generic-x86_64-with-Ubuntu-20.04-focal + Python 3.7.12
  • PyArrow version: 6.0.0
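For anyone hitting the same warning: a sketch of one possible workaround (my own, not from the datasets docs) is to load the pipeline lazily and drop it from the pickled state, so that hashing/pickling the instance (or its bound methods) only ever sees cheap, picklable attributes. The FakePipeline below is a stand-in for an unpicklable object like the spaCy Language; in real code the property body would call spacy.load(...) instead.

```python
import pickle


class Processor:
    """Lazily create the heavy resource and exclude it from pickling."""

    def __init__(self):
        self._nlp = None  # nothing loaded yet; trivially picklable

    @property
    def nlp(self):
        if self._nlp is None:
            # Stand-in for spacy.load("en_core_web_sm"). A class defined
            # inside a function is unpicklable, much like parts of the
            # real nlp object - which is exactly why we exclude it below.
            class FakePipeline:
                def pipe(self, texts):
                    return (text.split() for text in texts)
            self._nlp = FakePipeline()
        return self._nlp

    def __getstate__(self):
        # Drop the unpicklable pipeline; it is re-created on first use
        # after unpickling (e.g. inside each num_proc worker).
        state = self.__dict__.copy()
        state["_nlp"] = None
        return state


pr = Processor()
tokens = list(pr.nlp.pipe(["hello world"]))  # pipeline now loaded
data = pickle.dumps(pr)                      # still pickles fine
restored = pickle.loads(data)
```

The trade-off is that each worker process reloads the model once, but the fingerprint then depends only on the picklable state, so the caching warning goes away.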
