Logo of Ibrahim Sadik Tamim
Wav2Vec2 Emotion Recognition

A fine-tuning pipeline for Facebook's Wav2Vec2 model that performs emotion recognition from audio data. The model classifies audio into 21 emotion categories with varying intensity levels.

Type: Deep Learning

Role: AI Engineer

Timeline: 2025

Link: X-Ai

Wav2Vec2 Emotion Recognition Model Analysis

Overview

This notebook implements a fine-tuning pipeline for Facebook’s Wav2Vec2 model to perform emotion recognition from audio data. The model classifies audio into 21 different emotion categories with varying intensity levels.

Dataset Structure

Emotion Labels (21 categories)

The model predicts emotions with intensity levels:

Technical Implementation

1. Data Loading and Preprocessing

2. Data Augmentation (Training Only)

Uses audiomentations library with:

3. Model Architecture

4. Training Configuration

Training Results Analysis

Performance Progression

Step   Training Loss   Validation Loss   Accuracy
500    2.539           2.360             26.6%
1000   2.103           1.936             38.5%
3000   1.071           1.454             54.9%
4500   0.731           1.608             62.4%
6000   0.513           1.799             61.0%
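
The relationship between the two loss columns is easier to read as a per-checkpoint gap. The following is a minimal sketch, assuming the reported values are copied by hand into a list; the `Checkpoint` type and `gap` helper are illustrative names and do not come from the notebook.

```python
from typing import NamedTuple


class Checkpoint(NamedTuple):
    step: int
    train_loss: float
    val_loss: float
    accuracy: float  # validation accuracy, in percent


# Values copied from the reported results; only these five checkpoints are known.
HISTORY = [
    Checkpoint(500, 2.539, 2.360, 26.6),
    Checkpoint(1000, 2.103, 1.936, 38.5),
    Checkpoint(3000, 1.071, 1.454, 54.9),
    Checkpoint(4500, 0.731, 1.608, 62.4),
    Checkpoint(6000, 0.513, 1.799, 61.0),
]


def gap(c: Checkpoint) -> float:
    """Validation loss minus training loss, rounded to three decimals."""
    return round(c.val_loss - c.train_loss, 3)


# Map each logged step to its generalization gap.
gaps = {c.step: gap(c) for c in HISTORY}

for step, g in gaps.items():
    print(f"step {step:>4}: val - train = {g:+.3f}")
```

The gap is slightly negative at steps 500 and 1000, turns positive by step 3000, and keeps growing through step 6000.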

Key Observations

✅ Positive Indicators

  1. Strong Learning: Training loss decreased from 2.54 to 0.51 (an 80% reduction)
  2. Good Initial Progress: Validation accuracy improved from 26.6% at step 500 to a peak of 62.4% at step 4500
  3. Reasonable Baseline: Roughly 60% accuracy on a 21-class problem (vs. 4.8% for uniform random guessing)
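
The figures quoted in this list can be checked with a few lines of arithmetic. This is a small sketch, not code from the notebook; it assumes the chance baseline means uniform guessing over 21 classes, which only matches the true baseline if the classes are balanced (the class distribution is not stated here).

```python
NUM_CLASSES = 21

# Accuracy of guessing uniformly at random among the classes.
chance_accuracy = 1 / NUM_CLASSES

# Relative drop in training loss between step 500 and step 6000.
loss_start, loss_end = 2.539, 0.513
relative_reduction = (loss_start - loss_end) / loss_start

# How many times better than chance the peak validation accuracy is.
peak_accuracy = 0.624
lift_over_chance = peak_accuracy / chance_accuracy

print(f"chance accuracy:    {chance_accuracy:.1%}")
print(f"loss reduction:     {relative_reduction:.1%}")
print(f"lift over chance:   {lift_over_chance:.1f}x")
```

The quoted 80% is the unrounded 79.8% reduction, and the peak accuracy is about 13 times the uniform-guessing rate.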

⚠️ Concerning Patterns

  1. Overfitting Signs:

    • Training loss continues decreasing while validation loss increases after step 3000
    • Gap between training and validation loss widens from 0.38 at step 3000 to 1.29 at step 6000
    • Validation accuracy plateaus at roughly 61 to 62%
  2. Model Convergence Issues:

    • Validation loss rises steadily after its minimum (from 1.45 at step 3000 to 1.80 at step 6000)
    • Accuracy slips from 62.4% at step 4500 to 61.0% at step 6000, which suggests overtraining
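
The onset of this pattern can be located programmatically. The sketch below is an illustration rather than part of the notebook: `first_divergence` is a hypothetical helper that scans the logged checkpoints for the first one where training loss fell while validation loss rose.

```python
# (step, training loss, validation loss) for the reported checkpoints.
history = [
    (500, 2.539, 2.360),
    (1000, 2.103, 1.936),
    (3000, 1.071, 1.454),
    (4500, 0.731, 1.608),
    (6000, 0.513, 1.799),
]


def first_divergence(history):
    """Return the first step at which training loss fell while validation
    loss rose relative to the previous checkpoint, or None if that never
    happens."""
    for (_, prev_train, prev_val), (step, train, val) in zip(history, history[1:]):
        if train < prev_train and val > prev_val:
            return step
    return None


print(first_divergence(history))
```

On these values the helper flags step 4500, which means the losses began to diverge somewhere between the logged checkpoints at steps 3000 and 4500. Finer-grained evaluation logs would be needed to pin the turning point down more precisely.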

Final Test Results

Recommendations for Improvement

1. Address Overfitting

2. Hyperparameter Tuning

3. Model Architecture

4. Data Analysis

5. Advanced Techniques

Code Quality Assessment

Strengths

Areas for Improvement

Conclusion

This is a solid implementation that reaches roughly 60% accuracy on a challenging 21-class emotion recognition task. The main issue is overfitting, visible after step 3000, which could be addressed with early stopping and regularization techniques. The code demonstrates a good understanding of modern NLP/audio processing pipelines.
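
To make the early stopping suggestion concrete, here is a minimal, framework-independent sketch. The `EarlyStopper` class and its `patience` and `min_delta` parameters are illustrative assumptions, not code from the notebook; it is replayed against the validation losses reported in the results.

```python
class EarlyStopper:
    """Stop training once validation loss has failed to improve for
    `patience` consecutive evaluations."""

    def __init__(self, patience: int = 1, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_step = None
        self.bad_evals = 0

    def update(self, step: int, val_loss: float) -> bool:
        """Record one evaluation; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # New best: remember it and reset the counter.
            self.best_loss = val_loss
            self.best_step = step
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# (step, validation loss) pairs from the reported results.
val_losses = [(500, 2.360), (1000, 1.936), (3000, 1.454), (4500, 1.608), (6000, 1.799)]

stopper = EarlyStopper(patience=1)
stopped_at = None
for step, loss in val_losses:
    if stopper.update(step, loss):
        stopped_at = step
        break

print(f"stopped at step {stopped_at}, best checkpoint at step {stopper.best_step}")
```

With a patience of one evaluation, this replay stops at step 4500 and keeps the step 3000 checkpoint. Note the trade-off: the lowest validation loss is at step 3000 (54.9% accuracy), while the highest accuracy is at step 4500 (62.4%), so the choice of monitored metric changes which checkpoint is kept. If the notebook trains with the Hugging Face `Trainer` (an assumption), `EarlyStoppingCallback` together with `load_best_model_at_end=True` and `metric_for_best_model` offers comparable behavior.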
