This repository is the official implementation of the following paper.
Paper Title: A Signer-Invariant Conformer and Multi-Scale Fusion Transformer for Continuous Sign Language Recognition
Md Rezwanul Haque, Md. Milon Islam, S M Taslim Uddin Raju, Fakhri Karray
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, Hawaii, USA. 1st MSLR Workshop 2025. Copyright 2025 by the author(s).
Continuous Sign Language Recognition (CSLR) faces multiple challenges, including significant inter-signer variability and poor generalization to novel sentence structures. Traditional solutions frequently fail to handle these issues efficiently. To overcome these constraints, we propose a dual-architecture framework. For the Signer-Independent (SI) challenge, we propose a Signer-Invariant Conformer that combines convolutions with multi-head self-attention to learn robust, signer-agnostic representations from pose-based skeletal keypoints. For the Unseen-Sentences (US) task, we designed a Multi-Scale Fusion Transformer with a novel dual-path temporal encoder that captures both fine-grained posture dynamics and downsampled contextual representations, enhancing the model's ability to comprehend novel grammatical compositions. Experiments on the challenging Isharah-1000 dataset establish a new standard for both CSLR benchmarks. The proposed conformer architecture achieves a Word Error Rate (WER) of 13.07% on the SI challenge, a reduction of 13.53% from the state of the art. On the US task, the transformer model scores a WER of 47.78%, surpassing previous work. In the SignEval 2025 CSLR challenge, our team placed 2nd in the US task and 4th in the SI task, demonstrating the effectiveness of these models. The findings validate our key hypothesis: developing task-specific networks designed for the particular challenges of CSLR leads to considerable performance improvements and establishes a new baseline for further research. The source code is available at: https://github.com/rezwanh001/MSLR-Pose86K-CSLR-Isharah.
- Team Name: CPAMI (UW)
- Final Standing: 4th in Task-1 (SI) & 2nd in Task-2 (US)
If you find this project useful for your research, please cite this paper:

```bibtex
@article{haque2025signer,
  title={A Signer-Invariant Conformer and Multi-Scale Fusion Transformer for Continuous Sign Language Recognition},
  author={Haque, Md Rezwanul and Islam, Md Milon and Raju, S M Taslim Uddin and Karray, Fakhri},
  journal={arXiv preprint arXiv:2508.09372},
  year={2025}
}
```

Welcome to the Pose Estimation repository! This repository contains the starter kit for the MSLR CSLR Track and provides a simple baseline for two important tasks in Continuous Sign Language Recognition (CSLR).
The tasks include:
- Signer Independent View Competition
- Unseen Sentences View Competition
Figure 1: Signer-Invariant Conformer. Our proposed architecture for signer-independent CSLR begins by extracting pose keypoints from video frames. An initial temporal encoder, composed of convolutional layers, learns local features from this pose sequence. The core of the model consists of conformer blocks that capture global context with multi-head self-attention and extract local patterns using convolution. Positional encodings are exploited to provide the model with sequence order information. Finally, a linear classifier head analyzes the sequence representation to generate sign gloss predictions.
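The positional encodings mentioned in the caption can be sketched with the standard sinusoidal scheme from the transformer literature (an assumption for illustration — the model may instead use a learned variant):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Standard sinusoidal positional encoding (Vaswani et al., 2017).

    Returns an array of shape (seq_len, d_model) that is added to the
    pose-feature embeddings so the conformer blocks see sequence order.
    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]             # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]            # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even channels: sine
    pe[:, 1::2] = np.cos(angles)   # odd channels: cosine
    return pe

# Example: 128 frames of pose features embedded into 256 dimensions.
pe = sinusoidal_positional_encoding(128, 256)
```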
Figure 2: Multi-Scale Fusion Transformer. An overview of the proposed architecture for the unseen-sentences CSLR task. The network first uses a pose estimator to retrieve keypoint data. The features are then processed by a temporal encoder with a dual-path design: a main block records fine-grained temporal dynamics, and an auxiliary block uses max-pooling to learn downsampled representations. The outputs of both blocks are combined to provide a comprehensive feature set. This is subsequently analyzed by a transformer encoder, which models the sequence's long-range relationships. A joint attention mechanism reweights feature values before they are passed to the classification phase. The output sequence is fed into a classifier head, which generates the US gloss predictions.
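The dual-path idea in the caption can be illustrated with a minimal NumPy sketch (an illustration of the concept, not the authors' implementation — the nearest-neighbor upsampling used to fuse the two paths is an assumption):

```python
import numpy as np

def dual_path_fuse(features: np.ndarray, pool: int = 2) -> np.ndarray:
    """Illustrative dual-path temporal fusion.

    features: (T, D) pose-feature sequence.
    Main path keeps the full-resolution sequence; the auxiliary path
    max-pools over non-overlapping windows of `pool` frames, is then
    broadcast back to length T, and is concatenated channel-wise.
    """
    T, D = features.shape
    usable = (T // pool) * pool
    # Auxiliary path: max-pool over time to get a downsampled summary.
    pooled = features[:usable].reshape(T // pool, pool, D).max(axis=1)
    # Upsample by repetition so both paths share the time axis.
    upsampled = np.repeat(pooled, pool, axis=0)
    if usable < T:  # pad the tail by repeating the last pooled frame
        upsampled = np.concatenate(
            [upsampled, np.repeat(pooled[-1:], T - usable, axis=0)])
    return np.concatenate([features, upsampled], axis=1)  # (T, 2*D)

fused = dual_path_fuse(np.random.randn(100, 64))  # -> shape (100, 128)
```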
```shell
cd Update_MSLR-2025/

# mode = SI (train)
python run.py --train --mode SI --model SOTA_CSLR

# mode = SI (infer)
python run.py --infer --mode SI --model SOTA_CSLR

# mode = US (train)
python run.py --train --mode US --model AdvancedSignLanguageRecognizer

# mode = US (infer)
python run.py --infer --mode US --model AdvancedSignLanguageRecognizer
```

US task results:

| Model | Mode | Dev (WER) | Test (WER) |
|---|---|---|---|
| llm_advslowfast | US | 93.0663 | ... |
| gcn_transformer | US | 91.7951 | ... |
| mixllama | US | 86.9029 | ... |
| LLM Backbone (DistilBERT) | US | 81.7026 | ... |
| slowfast | US | 81.3174 | ... |
| LSTM | US | 79.9307 | ... |
| SignLanguageConformer | US | 77.5039 | ... |
| SignLanguageRecognizer | US | 74.9614 | ... |
| SOTA_CSLR | US | 64.4838 | ... |
| MambaSignLanguageRecognizer | US | 59.5140 | ... |
| AdvancedSignLanguageRecognizer | US | 55.0847 | 47.7756 |
SI task results:

| Model | Mode | Dev (WER) | Test (WER) |
|---|---|---|---|
| llm_advslowfast | SI | 43.8955 | 72.2365 |
| MambaSignLanguageRecognizer | SI | 29.3149 | 37.2774 |
| AdvancedSignLanguageRecognizer | SI | 27.5362 | 33.9069 |
| mixllama + slowfastllm | SI | 30.1274 | 46.9831 |
| mixllama | SI | 21.8270 | 51.2139 |
| LSTM | SI | 17.0180 | 26.0755 |
| slowfastllm | SI | 16.7106 | 42.5878 |
| SignLanguageConformer | SI | 16.2495 | 26.6290 |
| SignLanguageRecognizer | SI | 14.5367 | 22.6229 |
| SOTA_CSLR | SI | 7.3123 | 13.0652 |
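Both tables report Word Error Rate (WER), the standard CSLR metric: the Levenshtein edit distance between the predicted and reference gloss sequences, normalized by the reference length. A minimal reference implementation:

```python
def wer(reference: list[str], hypothesis: list[str]) -> float:
    """WER in percent: (substitutions + deletions + insertions) / len(reference)."""
    # Classic dynamic-programming edit distance over gloss tokens,
    # keeping only the previous row to stay O(len(hypothesis)) in memory.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_tok != hyp_tok),   # substitution / match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(reference)

# Example: one substitution in a four-gloss reference -> 25% WER.
print(wer(["HELLO", "MY", "NAME", "SIGN"], ["HELLO", "MY", "NAMES", "SIGN"]))  # 25.0
```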
To evaluate the performance of our proposed models, we compare them against baseline architectures implemented in this research. These models include established and recent approaches in sequence modeling, from classic recurrent networks to hybrid architectures incorporating Large Language Models (LLMs). The performance of each baseline on the Isharah-1000 Signer-Independent (SI) and Unseen-Sentences (US) tasks is reported:
- **LLM-SlowFast** (`llm_advslowfast`): This model applies the SlowFast [1] concept to pose data, with parallel transformer pathways processing the sequence at different temporal resolutions. It further injects linguistic knowledge by concatenating features from a pretrained XLM-RoBERTa model [2] before the final classifier.
- **LLaMA-Former** (`mixllama`): This baseline uses a standard transformer encoder to process pose features, which are then fed into a frozen LLaMA-2 [3] model acting as a sequential processor. This approach explores leveraging the advanced sequence-modeling capabilities of a large generative LLM.
- **LLaMA-SlowFast** (`mixllama + slowfastllm`): This model fuses LLaMA-2 and a SlowFast architecture to extract multi-rate temporal features from pose data. The fused visual features are then processed by an AraBERT model [4].
- **ST-GCN-Conformer** (`gcn_transformer`): This model first employs a Spatial-Temporal Graph Convolutional Network (ST-GCN) to learn features directly on the skeletal graph [5]. The output of the ST-GCN is then processed by a conformer encoder to capture the long-range sequential relationships among these learned spatio-temporal features.
- **DistilBERT-Former** (`LLM Backbone (DistilBERT)`): This model initially processes the pose sequence using a standard transformer encoder to capture visual-temporal dependencies. The resulting feature embeddings are then fed into a pretrained DistilBERT model [6]. This approach aims to leverage the linguistic and contextual knowledge inherent in the LLM backbone.
- **Mamba-Sign** (`MambaSignLanguageRecognizer`): A hybrid Mamba-transformer block is utilized in this architecture, replacing traditional attention-based backbones. This design leverages the linear-time sequence-modeling strengths of Mamba and the global context capabilities of self-attention [7]. It represents an exploration of recent state-space models for their effectiveness in handling long sequences.
- **BiLSTM** (`LSTM`): This is a classic CSLR baseline consisting of a simple Bi-directional Long Short-Term Memory (BiLSTM) network [8]. It processes the pose features directly to capture temporal dependencies.
- **Sign-Conformer** (`SignLanguageConformer`): This network adapts the conformer architecture, which has shown great success in the sign language domain. It combines convolutions and self-attention to capture both local and global dependencies in the pose sequence [9].
- **CNN-BiLSTM** (`SignLanguageRecognizer`): This architecture combines a Temporal Convolutional Network (TCN) [10] with a BiLSTM backbone. The convolutional layers extract and downsample local spatio-temporal features, which are then modeled by the BiLSTM to capture long-range dependencies.
References

[1] Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). SlowFast networks for video recognition. ICCV.

[2] Conneau, A., et al. (2020). Unsupervised cross-lingual representation learning at scale. ACL.

[3] Touvron, H., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[4] Antoun, W., Baly, F., & Hajj, H. (2020). AraBERT: Transformer-based model for Arabic language understanding. arXiv preprint arXiv:2003.00104.

[5] Yan, S., Xiong, Y., & Lin, D. (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. AAAI.

[6] Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

[7] Gu, A., & Dao, T. (2023). Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.

[8] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.

[9] Gulati, A., et al. (2020). Conformer: Convolution-augmented transformer for speech recognition. Interspeech.

[10] Bai, S., Kolter, J. Z., & Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
Follow these steps to set up the environment and get started:

1. Clone the repository:

   ```shell
   git clone https://github.com/gufranSabri/Pose86K-CSLR-Isharah.git
   cd Pose86K-CSLR-Isharah
   ```

2. Download the dataset from here. Place the dataset in the `./data` folder.

3. Set up the Python environment:

   - Install `virtualenv`:

     ```shell
     pip install virtualenv
     ```

   - Create a virtual environment and activate it:

     ```shell
     python<version> -m venv pose
     source pose/bin/activate  # On Windows: pose\Scripts\activate
     ```

4. Install the required dependencies:

   ```shell
   pip install torch==1.13 torchvision==0.14 tqdm numpy==1.23.5 pandas opencv-python
   git clone --recursive https://github.com/parlance/ctcdecode.git
   cd ctcdecode && pip install .
   ```
-
Once your environment is ready and the data is in place, you can run the main script using the following format:
```shell
python main.py \
  --work_dir ./work_dir/test \
  --data_dir ./data \
  --mode SI \
  --model base \
  --device 0 \
  --lr 0.0001 \
  --num_epochs 300
```

- `--work_dir`: Path to store logs and model checkpoints (default: `./work_dir/test`)
- `--data_dir`: Path to the dataset directory (default: `/data/sharedData/Smartphone/`)
- `--mode`: Task mode, either SI (Signer Independent) or US (Unseen Sentences)
- `--model`: Model variant to use (`base`, or any other available variant)
- `--device`: GPU device index (default: 0)
- `--lr`: Learning rate (default: 0.0001)
- `--num_epochs`: Number of training epochs (default: 300)
You can modify these arguments as needed for your experiments.
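The command-line interface documented above might be defined roughly as follows (a sketch using the flags and defaults listed; see `main.py` for the actual definition):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI mirroring the documented flags (illustrative, not the repo's exact code)."""
    p = argparse.ArgumentParser(description="MSLR CSLR training/validation")
    p.add_argument("--work_dir", default="./work_dir/test", help="logs and checkpoints")
    p.add_argument("--data_dir", default="./data", help="dataset directory")
    p.add_argument("--mode", choices=["SI", "US"], default="SI", help="task mode")
    p.add_argument("--model", default="base", help="model variant")
    p.add_argument("--device", type=int, default=0, help="GPU index")
    p.add_argument("--lr", type=float, default=0.0001, help="learning rate")
    p.add_argument("--num_epochs", type=int, default=300, help="training epochs")
    return p

# Example: override only the task mode; everything else keeps its default.
args = build_parser().parse_args(["--mode", "US", "--model", "base"])
```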
- Task-1:

  ```shell
  python main.py --work_dir ./work_dir/base_SI --model base --mode SI
  ```

- Task-2:

  ```shell
  python main.py --work_dir ./work_dir/base_US --model base --mode US
  ```
Once the environment is set up, you can train or test the model on the available tasks. Follow the instructions in the individual task directories for specific commands.
```shell
python inference_submission.py \
  --work_dir ./work_dir/base_SI \
  --mode SI \
  --model base \
  --device 0 \
  --output_dir ./submission/task-1
```

```shell
python inference_submission.py \
  --work_dir ./work_dir/base_US \
  --mode US \
  --model base \
  --device 0 \
  --output_dir ./submission/task-2
```

Replace `--model base` with `llm`, `slowfast`, `stgcn_conformer`, or `st_transformer` depending on the model used. Update `work_dir` to match the training directory containing `best_model.pt`.
```shell
python test_evaluate.py --work_dir ./work_dir/llm_advslowfast_SI --model llm_advslowfast --mode SI --device 0
python test_evaluate.py --work_dir ./work_dir/llm_advslowfast_US --model llm_advslowfast --mode US --device 1
```

```
MSLR-Pose86K-CSLR-Isharah/
│
├── main.py                      # Main training/validation script
├── inference_submission.py      # Inference and submission script
├── data_loader_test.py          # Dataloader for the test set only
├── test_script.py               # Testing-dataset script
│
├── models/
│   ├── transformer.py           # Transformer-based CSLR model(s)
│   └── gcn_transformer.py       # GCN-based transformer (other model)
│
├── utils/
│   ├── datasetv2.py             # Dataset processing
│   ├── decode.py                # Decoding utilities (CTC, beam search, etc.)
│   ├── evaluation_script.py     # Sample evaluation
│   ├── metrics.py               # Various MT evaluation metrics
│   └── text_ctc_utils.py        # Converts CTC predictions into gloss sequences
│
├── data/
│   ├── public_si_dat/           # Dataset for Task-1
│   │   ├── train.csv            # train.csv with Arabic text in the gloss column (2 cols: id, gloss)
│   │   ├── dev.csv              # dev.csv, same format
│   │   ├── pose_data_isharah1000_hands_lips_body_May12.pkl  # Pose data for training and validation
│   │   └── pose_data_isharah1000_SI_test.pkl                # Pose data for testing
│   └── public_us_dat/           # Dataset for Task-2
│       ├── train.csv            # train.csv with Arabic text in the gloss column (2 cols: id, gloss)
│       ├── dev.csv              # dev.csv, same format
│       ├── pose_data_isharah1000_hands_lips_body_May12.pkl  # Pose data for training and validation
│       └── pose_data_isharah1000_SI_test.pkl                # Pose data for testing
│
├── work_dir/                    # Training logs, checkpoints, outputs
│   └── ...                      # (Organized by experiment/run)
│
├── requirements.txt             # Python dependencies
└── README.md                    # Project description and instructions
```
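The decoding utilities (`utils/decode.py`, `utils/text_ctc_utils.py`) convert frame-level CTC outputs into gloss sequences. The simplest variant, greedy (best-path) decoding, can be sketched as follows (an illustration of the technique, not the repository's exact code):

```python
import numpy as np

def ctc_greedy_decode(log_probs: np.ndarray, blank: int = 0) -> list[int]:
    """Best-path CTC decoding: argmax per frame, collapse repeats, drop blanks.

    log_probs: (T, num_classes) per-frame class scores.
    Returns the predicted sequence of gloss IDs.
    """
    best_path = log_probs.argmax(axis=1)
    decoded, prev = [], None
    for cls in best_path:
        if cls != blank and cls != prev:
            decoded.append(int(cls))
        prev = cls
    return decoded

# Toy example: per-frame argmaxes [blank, 2, 2, blank, 3, 3] -> glosses [2, 3].
scores = np.full((6, 4), -5.0)
for t, c in enumerate([0, 2, 2, 0, 3, 3]):
    scores[t, c] = 0.0
print(ctc_greedy_decode(scores))  # [2, 3]
```

Beam-search decoding (as provided by the `ctcdecode` dependency installed above) typically lowers WER further by keeping multiple candidate paths instead of only the per-frame argmax.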
This project is licensed under the MIT License.

