
DeBERTaV2 / DeBERTaV3 on TPU: socket closed #18276

@Shiro-LK

Description


System Info

  • transformers version: 4.20.1
  • Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.13
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.12.0+cu113 (False)
  • Tensorflow version (GPU?): 2.8.2 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?: yes, TPU (TF 2.8.2 on Colab / TF 2.4 on Kaggle; TPU v2 and v3)

Who can help?

@Rocketknight1

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I tried to launch a script for a simple classification problem but got the error "socket closed". I tried with DeBERTa small and base, so I doubt it is a memory error. Moreover, I tried both Kaggle (TPU v3) and Colab (TPU v2). The same script with a RoBERTa base model works perfectly fine. The sequence length I used was 128.

I created the model with this function:

import tensorflow as tf
from transformers import TFAutoModel

def get_model() -> tf.keras.Model:
    # cfg.model_name holds the checkpoint name (here a DeBERTaV2/V3 model)
    backbone = TFAutoModel.from_pretrained(cfg.model_name)
    input_ids = tf.keras.layers.Input(
        shape=(cfg.max_length,),
        dtype=tf.int32,
        name="input_ids",
    )
    attention_mask = tf.keras.layers.Input(
        shape=(cfg.max_length,),
        dtype=tf.int32,
        name="attention_mask",
    )

    # last_hidden_state, then the first ([CLS]) token representation
    x = backbone({"input_ids": input_ids, "attention_mask": attention_mask})[0]
    x = x[:, 0, :]
    outputs = tf.keras.layers.Dense(1, activation="sigmoid", dtype="float32")(x)
    return tf.keras.Model(
        inputs=[input_ids, attention_mask],
        outputs=outputs,
    )
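For context, on Colab/Kaggle the model is built inside a TPU strategy scope. A minimal sketch of that setup (the fallback branch and the tiny stand-in model are my additions for illustration; the actual script builds `get_model()` inside the scope):

```python
import tensorflow as tf

# Sketch of the usual Colab/Kaggle TPU setup (assumption: the standard
# TPUClusterResolver flow). Off-TPU, resolver creation fails and we fall
# back to the default strategy so the sketch still runs anywhere.
try:
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)
except (ValueError, tf.errors.NotFoundError):
    strategy = tf.distribute.get_strategy()

with strategy.scope():
    # Stand-in for get_model(); the real script builds the backbone here.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(4,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
```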

It also seems that the embedding layer is not compatible with bfloat16:

InvalidArgumentError: Exception encountered when calling layer "embeddings" (type TFDebertaV2Embeddings).

cannot compute Mul as input #1(zero-based) was expected to be a bfloat16 tensor but is a float tensor

Reproduction notebook: https://colab.research.google.com/drive/1T4GGCfYy7lAFrgapOtY0KBXPcnEPeTQz?usp=sharing
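The Mul error above can be reproduced outside the model; a minimal sketch (my assumption: under the "mixed_bfloat16" policy the embedding inputs arrive as bfloat16 while an internal constant stays float32, and TensorFlow does not auto-promote dtypes):

```python
import tensorflow as tf

# bfloat16 input multiplied by an uncast float32 constant, mirroring the
# mismatch inside TFDebertaV2Embeddings under a bfloat16 policy.
x = tf.constant([1.0, 2.0], dtype=tf.bfloat16)
scale = tf.constant(0.5, dtype=tf.float32)  # not cast to the compute dtype

mismatch = False
try:
    _ = x * scale  # cannot compute Mul: expected bfloat16, got float
except tf.errors.InvalidArgumentError:
    mismatch = True

# Casting the constant to the input's dtype is the usual fix:
y = x * tf.cast(scale, x.dtype)
```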

Expected behavior

Training should run normally, as it does with RoBERTa. On GPU, the same script works and uses 3 to 4 GB of memory.
