.map not hashing under python 3.9 #6440

@changyeli

Describe the bug

The function passed to .map cannot be hashed under Python 3.9. I tried the solution here, but still get the same warning:

Parameter 'function'=<function map_to_pred at 0x7fa0b49ead30> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.

Steps to reproduce the bug

def map_to_pred(batch):
    """
    Perform inference on an audio batch

    Parameters:
        batch (dict): A dictionary containing audio data and other related information.

    Returns:
        dict: The input batch dictionary with added prediction and transcription fields.
    """
    audio = batch['audio']
    input_features = processor(
        audio['array'], sampling_rate=audio['sampling_rate'], return_tensors="pt").input_features
    input_features = input_features.to('cuda')
    with torch.no_grad():
        predicted_ids = model.generate(input_features)
    preds = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    batch['prediction'] = processor.tokenizer._normalize(preds)
    batch["transcription"] = processor.tokenizer._normalize(batch['transcription'])
    return batch

import torch
from datasets import Audio, load_dataset
from transformers import AutoProcessor, WhisperForConditionalGeneration

MODEL_CARD = "openai/whisper-small"
MODEL_NAME = MODEL_CARD.rsplit('/', maxsplit=1)[-1]
model = WhisperForConditionalGeneration.from_pretrained(MODEL_CARD)
processor = AutoProcessor.from_pretrained(
    MODEL_CARD, language="english", task="transcribe")
model = torch.compile(model)
# `config` is defined elsewhere in the script (not shown)
dt = load_dataset("audiofolder", data_dir=config['DATA']['dataset'], split="test")
dt = dt.cast_column("audio", Audio(sampling_rate=16000))
result = dt.map(map_to_pred, num_proc=16)
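
As a diagnostic aside: the warning says the transform and everything it references must be serializable with pickle or dill. A rough way to narrow down which object is the problem is to try pickling each module-level global the function uses. The sketch below uses only the standard library `pickle` as a stand-in, so it is not the hasher that datasets uses and may disagree with it in edge cases. `unpicklable_globals`, `fake_model`, `fake_processor`, and `fake_map_to_pred` are hypothetical names for illustration, with a thread lock standing in for an object that cannot be pickled.

```python
import pickle
import threading


def unpicklable_globals(func):
    """Return {name: error} for globals referenced by `func` that pickle rejects."""
    failures = {}
    for name in func.__code__.co_names:
        if name not in func.__globals__:
            continue  # attribute name or builtin, not a module-level global
        try:
            pickle.dumps(func.__globals__[name])
        except Exception as exc:  # pickle raises several exception types
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


# Hypothetical stand-ins for `model` and `processor` in the report above.
fake_model = threading.Lock()  # a lock cannot be pickled
fake_processor = {"language": "english"}  # a plain dict can


def fake_map_to_pred(batch):
    batch["prediction"] = (fake_model, fake_processor)
    return batch


print(unpicklable_globals(fake_map_to_pred))
```

Running the same helper on the real `map_to_pred` would show whether `model` (after `torch.compile`) or `processor` fails under plain pickle, which is a hint rather than proof of what the dataset fingerprinting trips on.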

Expected behavior

The function is hashed, the mapped dataset is cached, and inference starts.

Environment info

  • transformers version: 4.35.0
  • Platform: Linux-5.14.0-284.30.1.el9_2.x86_64-x86_64-with-glibc2.34
  • Python version: 3.9.18
  • Huggingface_hub version: 0.17.3
  • Safetensors version: 0.4.0
  • Accelerate version: 0.24.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no
