Robust Bengali Speech Recognition
Under Domain Shift
Benchmarking Whisper and IndicWav2Vec on OOD-Speech Corpus
Project Overview
This research project focuses on Bengali speech recognition using the OOD-Speech dataset from the Bengali.AI Speech Recognition Kaggle competition. It is the largest publicly available Bengali speech recognition dataset, comprising 1,177.94 hours of recordings from 22,645 native Bengali speakers.
Dataset Citation: This research uses datasets from Bengali.AI Open Source Datasets maintained by Bengali.AI for advancing Bengali language AI research.
📊 Dataset Scale
- Training Data: ~1,178 hours from 22,645 speakers
- Test Data: 17 distinct OOD domains
- Speakers: native speakers from Bangladesh and India
- Focus: Out-of-distribution generalization
🌍 Language Context
- Speakers: ~340 million globally
- Diversity: Multiple dialects and prosodic features
- Challenge: Domain shift (e.g., religious sermons)
- Goal: Robust open-source ASR models
Research Objectives
- ✓ Develop robust Bengali speech recognition models
- ✓ Characterize out-of-distribution generalization challenges
- ✓ Investigate methods for handling diverse speech samples
- ✓ Provide a comparative analysis of state-of-the-art ASR approaches
Data Source: This project utilizes the OOD-Speech corpus, part of the curated collection of open-source datasets maintained by Bengali.AI for the research community.
Research Contributions
Comprehensive Benchmark
A comprehensive benchmark of state-of-the-art ASR models for Bengali on the OOD-Speech corpus, including multilingual Whisper-small, IndicWav2Vec, and a task-specific BengaliAI regional Whisper model.
Superior Performance
The BengaliAI regional Whisper model substantially outperforms both Whisper-small and IndicWav2Vec, achieving the best WER and CER and producing qualitatively fluent Bengali transcripts.
Domain Generalization
Domain-generalization fine-tuning on top of the BengaliAI Whisper baseline, using length-based grouping as a simple proxy for domain and a GroupDRO-style objective to improve robustness to difficult examples.
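A minimal sketch of this objective in PyTorch is shown below, assuming unreduced per-example losses from the ASR model; the length cut points, step size eta, and function names are illustrative, not the project's exact implementation.

```python
import torch

def assign_length_groups(transcript_lengths, boundaries=(10, 20, 40)):
    """Bucket examples into groups by transcript length (proxy for domain).
    `boundaries` are illustrative cut points, not values from this project."""
    groups = torch.zeros_like(transcript_lengths)
    for b in boundaries:
        groups += (transcript_lengths > b).long()
    return groups  # group ids in [0, len(boundaries)]

def group_dro_loss(per_example_loss, group_ids, group_weights, eta=0.01):
    """GroupDRO-style objective: upweight the worst-performing groups.

    per_example_loss: (B,) unreduced losses from the ASR model
    group_ids:        (B,) integer group assignments
    group_weights:    (G,) running weights over groups (updated in place)
    """
    num_groups = group_weights.numel()
    group_losses = torch.zeros(num_groups, device=per_example_loss.device)
    for g in range(num_groups):
        mask = group_ids == g
        if mask.any():
            group_losses[g] = per_example_loss[mask].mean()
    # Exponentiated-gradient ascent on the group weights, in the style of
    # GroupDRO (Sagawa et al., 2020): high-loss groups gain weight.
    with torch.no_grad():
        group_weights *= torch.exp(eta * group_losses)
        group_weights /= group_weights.sum()
    # The weighted sum of group losses is what gets backpropagated.
    return (group_weights * group_losses).sum()
```

Here group_weights would be initialized uniformly (e.g., torch.ones(G) / G) and carried across training steps, so groups that stay hard accumulate weight over time.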
Key Results
- WER Reduction: 76% relative reduction vs. multilingual Whisper-small
- Additional Improvement: 40% further relative WER reduction from DG fine-tuning
Technical Achievements
Models Implemented
- ✓ Multilingual Whisper-small (baseline; loading sketch after this list)
- ✓ IndicWav2Vec (CTC-based)
- ✓ BengaliAI Regional Whisper (task-specific)
- ✓ Domain-Generalization Fine-tuned Whisper
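The snippet below sketches how the pretrained baselines might be loaded with Hugging Face transformers. The openai/whisper-small hub ID is the official multilingual checkpoint; the IndicWav2Vec ID is a placeholder, since the exact checkpoints used here are not listed in this section.

```python
from transformers import (
    WhisperForConditionalGeneration,
    WhisperProcessor,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

# Multilingual Whisper-small baseline (official OpenAI checkpoint).
whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-small")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# CTC-based IndicWav2Vec baseline; the hub ID below is a placeholder
# for the Bengali IndicWav2Vec checkpoint actually used in the benchmark.
w2v_processor = Wav2Vec2Processor.from_pretrained("<indicwav2vec-bengali-checkpoint>")
w2v_model = Wav2Vec2ForCTC.from_pretrained("<indicwav2vec-bengali-checkpoint>")
```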
Training & Optimization
- ✓ GroupDRO-style loss for domain generalization (see the sketch above)
- ✓ Length-based domain grouping strategy
- ✓ Hyperparameter optimization with Optuna
- ✓ Decoder-only fine-tuning approach (sketched after this list)
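A minimal sketch of the decoder-only fine-tuning setup with an Optuna search over the learning rate; train_and_eval_wer is a hypothetical helper standing in for the project's training loop, and the hyperparameter ranges are illustrative.

```python
import optuna
import torch
from transformers import WhisperForConditionalGeneration

def build_model():
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
    # Freeze the encoder: only the decoder (and output head) adapts to
    # Bengali transcripts while the acoustic representation stays fixed.
    for param in model.model.encoder.parameters():
        param.requires_grad = False
    return model

def objective(trial):
    model = build_model()
    lr = trial.suggest_float("lr", 1e-6, 1e-4, log=True)
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr
    )
    # train_and_eval_wer is a hypothetical helper: it would run one
    # training configuration and return the dev-set WER for this trial.
    return train_and_eval_wer(model, optimizer)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
```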
Evaluation & Analysis
- ✓ Word Error Rate (WER) computation (sketched after this list)
- ✓ Character Error Rate (CER) metrics
- ✓ Length-wise error analysis
- ✓ Error-type breakdown visualization
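A minimal sketch of the WER/CER computation using the jiwer library, one common choice for these metrics; the repository's own evaluation utilities may differ.

```python
import jiwer

references = ["আমি বাংলায় গান গাই"]   # ground-truth transcripts
hypotheses = ["আমি বাংলাই গান গাই"]   # model outputs

# WER: word-level edit distance normalized by reference length;
# CER: the same quantity computed over characters.
print(f"WER: {jiwer.wer(references, hypotheses):.3f}")
print(f"CER: {jiwer.cer(references, hypotheses):.3f}")
```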
Tools & Demos
- ✓ Gradio demo for local testing (sketched after this list)
- ✓ FastAPI web server for browser-based ASR
- ✓ Analysis and plotting utilities
- ✓ Comprehensive evaluation pipeline
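A minimal sketch of a Gradio demo built on a transformers ASR pipeline; openai/whisper-small stands in for the fine-tuned checkpoint actually served.

```python
import gradio as gr
from transformers import pipeline

# Placeholder checkpoint; swap in the fine-tuned Bengali model for real use.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def transcribe(audio_path: str) -> str:
    return asr(audio_path)["text"]

demo = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),   # record or upload a clip
    outputs="text",
    title="Bengali ASR Demo",
)

if __name__ == "__main__":
    demo.launch()
```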
Models & Results
- BengaliAI Regional Whisper (task-specific): fine-tuned on Bengali speech data; best WER/CER
- Whisper-small (multilingual baseline): generic multilingual model evaluated zero-shot; baseline comparison
- IndicWav2Vec (CTC-based): Indic-specific Wav2Vec2 model for Bengali; competitive baseline
- DG-Finetuned Whisper (domain-generalized): GroupDRO fine-tuning for improved robustness; 40% further relative WER reduction
Evaluation Metrics
- WER (Word Error Rate): primary evaluation metric
- CER (Character Error Rate): orthographic accuracy
- OOD (Out-of-Distribution): 17 test domains
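For reference, WER is the length-normalized word-level edit distance; CER is the same quantity computed over characters:

$$\mathrm{WER} = \frac{S + D + I}{N}$$

where S, D, and I count word substitutions, deletions, and insertions against the reference, and N is the number of reference words.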
Code Structure
Repository Organization
- src/models/: model architectures (Whisper, Wav2Vec2 variants)
- src/train/: training scripts and hyperparameter optimization
- src/evaluation/: evaluation metrics and submission utilities
- src/dataset/: dataset loading and preprocessing
- tools/: analysis and visualization utilities
- demo/ & server/: interactive demos (Gradio & FastAPI)
- Experiments: comprehensive experiment logs and results in CSV format
- Figures: generated visualizations for paper publication
- Documentation: detailed guides and setup instructions