Robust Bengali Speech Recognition
Under Domain Shift
Benchmarking Whisper and IndicWav2Vec on OOD-Speech Corpus
Project Overview
This research project focuses on Bengali speech recognition using the OOD-Speech dataset from the Bengali.AI Speech Recognition Kaggle competition. It is the largest publicly available Bengali speech recognition dataset, comprising 1,177.94 hours of recordings from 22,645 native Bengali speakers.
Dataset Citation: This research uses datasets from Bengali.AI Open Source Datasets maintained by Bengali.AI for advancing Bengali language AI research.
📊 Dataset Scale
- Training Data: ~1,178 hours from 22,645 speakers
- Test Data: 17 distinct OOD domains
- Speakers: native speakers from Bangladesh and India
- Focus: Out-of-distribution generalization
🌍 Language Context
- Speakers: ~340 million globally
- Diversity: Multiple dialects and prosodic features
- Challenge: Domain shift (e.g., religious sermons)
- Goal: Robust open-source ASR models
Research Objectives
- ✓ Develop robust Bengali speech recognition models
- ✓ Characterize out-of-distribution generalization challenges
- ✓ Investigate methods for handling diverse speech samples
- ✓ Provide a comparative analysis of state-of-the-art ASR approaches
Data Source: This project utilizes the OOD-Speech corpus, part of the curated collection of open-source datasets maintained by Bengali.AI for the research community.
Research Contributions
Comprehensive Benchmark
A comprehensive benchmark of state-of-the-art ASR models for Bengali on the OOD-Speech corpus, including multilingual Whisper-small, IndicWav2Vec, and a task-specific BengaliAI regional Whisper model.
Superior Performance
The BengaliAI regional Whisper model substantially outperforms both Whisper-small and IndicWav2Vec, achieving the best WER and CER and producing qualitatively fluent Bengali transcripts.
Domain Generalization
Domain-generalization fine-tuning on top of the BengaliAI Whisper baseline, using length-based grouping as a simple proxy for domain and a GroupDRO-style objective to improve robustness to difficult examples.
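A minimal sketch of this objective in PyTorch is shown below, assuming unreduced per-example losses from the ASR model; the length cut points, step size eta, and function names are illustrative, not the project's exact implementation.

```python
import torch

def assign_length_groups(transcript_lengths, boundaries=(10, 20, 40)):
    """Bucket examples into groups by transcript length (proxy for domain).
    `boundaries` are illustrative cut points, not values from this project."""
    groups = torch.zeros_like(transcript_lengths)
    for b in boundaries:
        groups += (transcript_lengths > b).long()
    return groups  # group ids in [0, len(boundaries)]

def group_dro_loss(per_example_loss, group_ids, group_weights, eta=0.01):
    """GroupDRO-style objective: upweight the worst-performing groups.

    per_example_loss: (B,) unreduced losses from the ASR model
    group_ids:        (B,) integer group assignments
    group_weights:    (G,) running weights over groups (updated in place)
    """
    num_groups = group_weights.numel()
    group_losses = torch.zeros(num_groups, device=per_example_loss.device)
    for g in range(num_groups):
        mask = group_ids == g
        if mask.any():
            group_losses[g] = per_example_loss[mask].mean()
    # Exponentiated-gradient ascent on the group weights, in the style of
    # GroupDRO (Sagawa et al., 2020): high-loss groups gain weight.
    with torch.no_grad():
        group_weights *= torch.exp(eta * group_losses)
        group_weights /= group_weights.sum()
    # The weighted sum of group losses is what gets backpropagated.
    return (group_weights * group_losses).sum()
```

Here group_weights would be initialized uniformly (e.g., torch.ones(G) / G) and carried across training steps, so groups that stay hard accumulate weight over time.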
Key Results
- WER Reduction: 76% relative reduction vs. multilingual Whisper-small
- Additional Improvement: 40% further relative WER reduction from DG fine-tuning
Technical Achievements
Models Implemented
- ✓ Multilingual Whisper-small (baseline; loading sketch after this list)
- ✓ IndicWav2Vec (CTC-based)
- ✓ BengaliAI Regional Whisper (task-specific)
- ✓ Domain-Generalization Fine-tuned Whisper
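The snippet below sketches how the pretrained baselines might be loaded with Hugging Face transformers. The openai/whisper-small hub ID is the official multilingual checkpoint; the IndicWav2Vec ID is a placeholder, since the exact checkpoints used here are not listed in this section.

```python
from transformers import (
    WhisperForConditionalGeneration,
    WhisperProcessor,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

# Multilingual Whisper-small baseline (official OpenAI checkpoint).
whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-small")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# CTC-based IndicWav2Vec baseline; the hub ID below is a placeholder
# for the Bengali IndicWav2Vec checkpoint actually used in the benchmark.
w2v_processor = Wav2Vec2Processor.from_pretrained("<indicwav2vec-bengali-checkpoint>")
w2v_model = Wav2Vec2ForCTC.from_pretrained("<indicwav2vec-bengali-checkpoint>")
```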
Training & Optimization
- ✓ GroupDRO-style loss for domain generalization (see the sketch above)
- ✓ Length-based domain grouping strategy
- ✓ Hyperparameter optimization with Optuna
- ✓ Decoder-only fine-tuning approach (sketched after this list)
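A minimal sketch of the decoder-only fine-tuning setup with an Optuna search over the learning rate; train_and_eval_wer is a hypothetical helper standing in for the project's training loop, and the hyperparameter ranges are illustrative.

```python
import optuna
import torch
from transformers import WhisperForConditionalGeneration

def build_model():
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
    # Freeze the encoder: only the decoder (and output head) adapts to
    # Bengali transcripts while the acoustic representation stays fixed.
    for param in model.model.encoder.parameters():
        param.requires_grad = False
    return model

def objective(trial):
    model = build_model()
    lr = trial.suggest_float("lr", 1e-6, 1e-4, log=True)
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr
    )
    # train_and_eval_wer is a hypothetical helper: it would run one
    # training configuration and return the dev-set WER for this trial.
    return train_and_eval_wer(model, optimizer)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
```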
Evaluation & Analysis
- ✓ Word Error Rate (WER) computation (sketched after this list)
- ✓ Character Error Rate (CER) metrics
- ✓ Length-wise error analysis
- ✓ Error-type breakdown visualization
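A minimal sketch of the WER/CER computation using the jiwer library, one common choice for these metrics; the repository's own evaluation utilities may differ.

```python
import jiwer

references = ["আমি বাংলায় গান গাই"]   # ground-truth transcripts
hypotheses = ["আমি বাংলাই গান গাই"]   # model outputs

# WER: word-level edit distance normalized by reference length;
# CER: the same quantity computed over characters.
print(f"WER: {jiwer.wer(references, hypotheses):.3f}")
print(f"CER: {jiwer.cer(references, hypotheses):.3f}")
```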
Tools & Demos
- ✓ Gradio demo for local testing (sketched after this list)
- ✓ FastAPI web server for browser-based ASR
- ✓ Analysis and plotting utilities
- ✓ Comprehensive evaluation pipeline
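A minimal sketch of a Gradio demo built on a transformers ASR pipeline; openai/whisper-small stands in for the fine-tuned checkpoint actually served.

```python
import gradio as gr
from transformers import pipeline

# Placeholder checkpoint; swap in the fine-tuned Bengali model for real use.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def transcribe(audio_path: str) -> str:
    return asr(audio_path)["text"]

demo = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),   # record or upload a clip
    outputs="text",
    title="Bengali ASR Demo",
)

if __name__ == "__main__":
    demo.launch()
```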
Models & Results
- BengaliAI Regional Whisper (task-specific): fine-tuned on Bengali speech data; best WER/CER
- Whisper-small (multilingual baseline): generic multilingual model evaluated zero-shot; baseline comparison
- IndicWav2Vec (CTC-based): Indic-specific Wav2Vec2 model for Bengali; competitive baseline
- DG-Finetuned Whisper (domain-generalized): GroupDRO fine-tuning for improved robustness; 40% further relative WER reduction
Evaluation Metrics
- WER (Word Error Rate): primary evaluation metric
- CER (Character Error Rate): orthographic accuracy
- OOD (Out-of-Distribution): 17 test domains
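For reference, WER is the length-normalized word-level edit distance; CER is the same quantity computed over characters:

$$\mathrm{WER} = \frac{S + D + I}{N}$$

where S, D, and I count word substitutions, deletions, and insertions against the reference, and N is the number of reference words.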
Code Structure
Repository Organization
- src/models/: model architectures (Whisper, Wav2Vec2 variants)
- src/train/: training scripts and hyperparameter optimization
- src/evaluation/: evaluation metrics and submission utilities
- src/dataset/: dataset loading and preprocessing
- tools/: analysis and visualization utilities
- demo/ & server/: interactive demos (Gradio & FastAPI)
- Experiments: comprehensive experiment logs and results in CSV format
- Figures: generated visualizations for paper publication
- Documentation: detailed guides and setup instructions