A deep learning project that uses self-supervised contrastive learning on audio spectrograms to classify music and recommend similar songs based on learned embeddings.
Detailed documentation and guides have been moved to the docs/ directory:
- Quickstart Guide: Get up and running quickly.
- Audio Augmentation Guide: Specific guide for the audio augmentation pipeline.
- Augmentation Summary: Technical details on the augmentation strategies.
This project implements a self-supervised learning approach to learn meaningful audio representations without requiring labeled data. By training a CNN encoder with contrastive learning on augmented spectrograms, the model learns to identify similar songs and can be used for:
- Music Classification: Categorize songs by genre, mood, or style
- Song Recommendation: Find similar songs based on audio features
- Audio Similarity Search: Retrieve songs that sound alike
The system leverages the power of contrastive learning to create robust audio embeddings that capture the essence of musical content.
Music Classification by spectogram/
│
├── README.md                # This file
├── docs/                    # Documentation and guides
├── configs/                 # Configuration files (YAML)
│   ├── model_config.yaml    # CNN architecture settings
│   ├── training_config.yaml # Training hyperparameters
│   └── data_config.yaml     # Data processing settings
│
├── notebooks/               # Jupyter Notebooks
│   └── Music_Classification_Training_Colab.ipynb # Main training pipeline
│
├── CNN/                     # Core source code
│   ├── models/              # Encoder & Projection Head
│   ├── augmentation/        # Audio augmentations
│   ├── data/                # Dataset & Dataloaders
│   ├── training/            # Training logic
│   └── recommendation/      # Recommendation engine
│
├── AudioToSpectogram/       # Preprocessing scripts
│   ├── output/              # Generated spectrograms (gitignored)
│   ├── output_mel/          # Generated Mel-spectrograms (gitignored)
│   └── fma_small_dataset/   # Dataset directory (gitignored)
│
└── requirements.txt         # Project dependencies
# Clone the repository
git clone <repository-url>
cd "Music Classification by spectogram"
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
# Install dependencies
pip install -r requirements.txt

The project is fully configurable via YAML files in the configs/ directory. Key settings include:
- configs/data_config.yaml: Controls audio processing.
  - Default: 3.0 s duration, 128 mel bands, 22050 Hz sample rate.
  - Dataset: Points to AudioToSpectogram/fma_small_dataset.
- configs/model_config.yaml: Defines the CNN architecture.
  - Default: 4-block CNN encoder (64, 128, 256, 512 filters).
- configs/training_config.yaml: Sets training hyperparameters.
  - Default: 200 epochs, batch size 64, Adam optimizer (lr=0.001), NT-Xent loss (temp=0.5).
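As a quick illustration, the configs can be read with PyYAML. This is a minimal sketch; the key names in the last line are hypothetical, so check the actual files for the real schema:

```python
# Minimal sketch of loading the YAML configs (assumes PyYAML is installed).
import yaml

def load_config(path: str) -> dict:
    """Read a YAML config file into a plain dict."""
    with open(path, "r") as f:
        return yaml.safe_load(f)

data_cfg = load_config("configs/data_config.yaml")
model_cfg = load_config("configs/model_config.yaml")
train_cfg = load_config("configs/training_config.yaml")

# Hypothetical key names -- consult the actual files for the real schema.
print(data_cfg.get("sample_rate", 22050), train_cfg.get("batch_size", 64))
```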
The primary entry point for training and experimentation is the Jupyter Notebook:
notebooks/Music_Classification_Training_Colab.ipynb
This notebook covers the entire pipeline:
- Data Loading: Loads audio/spectrograms using settings from data_config.yaml.
- Augmentation: Visualizes and applies audio augmentations.
- Model Initialization: Builds the CNN encoder defined in model_config.yaml.
- Training: Runs the contrastive learning loop using training_config.yaml.
- Evaluation: Visualizes training curves and embeddings.
To run it:
jupyter notebook notebooks/Music_Classification_Training_Colab.ipynb

The end-to-end pipeline consists of the following stages:

- Waveform Augmentation: Apply pitch, tempo, noise, etc. to raw audio.
- Spectrogram Conversion: Convert augmented audio to mel-spectrograms (see the sketch after this list).
- Spectrogram Augmentation: Apply masking and warping to the spectrogram.
- CNN Encoder: Extract high-level audio features.
- Contrastive Learning: Train with self-supervised contrastive loss.
- Embeddings & Search: Generate embeddings and find similar songs.
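To make the Spectrogram Conversion stage concrete, here is a minimal sketch using torchaudio with the data_config.yaml defaults (22050 Hz, 128 mel bands, 3.0 s clips). The n_fft and hop_length values are assumptions, not taken from the configs:

```python
# Waveform -> log-mel-spectrogram, matching the documented defaults.
import torch
import torchaudio

SAMPLE_RATE = 22050
CLIP_SECONDS = 3.0

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=2048,       # assumed FFT size
    hop_length=512,   # assumed hop length
    n_mels=128,
)
to_db = torchaudio.transforms.AmplitudeToDB()

waveform = torch.randn(1, int(SAMPLE_RATE * CLIP_SECONDS))  # stand-in clip
spec = to_db(mel(waveform))  # shape: (1, 128, ~130 frames)
```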
- No labels required: Learn from raw audio data
- Contrastive learning: SimCLR-based approach with an NT-Xent loss (sketched below)
- Data efficiency: Learn robust representations with limited data
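For reference, here is a minimal sketch of the NT-Xent loss used in SimCLR-style training, with the temperature 0.5 default from training_config.yaml. This is illustrative, not the project's own implementation:

```python
# NT-Xent (normalized temperature-scaled cross-entropy) over two views.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """z1, z2: (N, D) projections of two augmented views of the same batch."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D), unit norm
    sim = z @ z.t() / temperature                       # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-pairs
    # The positive for row i is its counterpart in the other view.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```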
The system implements a comprehensive augmentation pipeline to ensure robust feature learning:
Stage 1: Waveform Augmentations (Raw Audio). Three are randomly selected per sample; a gain/noise sketch follows this list:
- Pitch Shift: ±1-3 semitones (pitch invariance)
- Tempo Stretch: ±5-12% speed change (tempo invariance)
- Gain Adjustment: ±3-6 dB volume change
- Parametric EQ: Low-pass, high-pass, or bandpass filtering
- Dynamic Range Compression: Reduces dynamic range
- Environmental Noise: Adds background noise (SNR 10-30 dB)
- Convolutional Reverb: Simulates room acoustics
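As an illustration, here is a sketch of two of these augmentations, gain adjustment and noise mixing at a target SNR, in plain NumPy. The project's actual augmentation code may differ:

```python
import numpy as np

def apply_gain(x: np.ndarray, gain_db: float) -> np.ndarray:
    """Scale the waveform by gain_db decibels (doc default: ±3-6 dB)."""
    return x * 10.0 ** (gain_db / 20.0)

def add_noise_at_snr(x: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into x so the result has the requested SNR (doc: 10-30 dB)."""
    p_signal = np.mean(x ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return x + scale * noise

rng = np.random.default_rng(0)
clip = rng.standard_normal(22050 * 3)  # stand-in 3 s clip at 22050 Hz
noisy = add_noise_at_snr(apply_gain(clip, -3.0), rng.standard_normal(clip.size), snr_db=20.0)
```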
Stage 2: Spectrogram Augmentations (Mel-Spectrogram). Two are randomly selected per sample; a masking sketch follows below:
- Time Masking: Masks random time steps (SpecAugment)
- Frequency Masking: Masks random frequency bands (SpecAugment)
- Time Warping: Deforms the time axis for robustness
Note: No axis flips or color jittering are used to preserve musical structure.
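Here is a minimal sketch of the two masking augmentations in the spirit of SpecAugment. torchaudio ships equivalent TimeMasking/FrequencyMasking transforms; this version just spells out the idea:

```python
import torch

def mask_axis(spec: torch.Tensor, axis: int, max_width: int) -> torch.Tensor:
    """Zero out one random band along `axis` (1 = frequency, 2 = time)."""
    spec = spec.clone()
    size = spec.size(axis)
    width = int(torch.randint(1, max_width + 1, (1,)))
    start = int(torch.randint(0, max(size - width, 1), (1,)))
    if axis == 1:
        spec[:, start:start + width, :] = 0.0
    else:
        spec[:, :, start:start + width] = 0.0
    return spec

spec = torch.randn(1, 128, 130)               # (channels, mel bands, frames)
spec = mask_axis(spec, axis=1, max_width=16)  # frequency masking
spec = mask_axis(spec, axis=2, max_width=24)  # time masking
```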
- Deep convolutional architecture optimized for spectrograms (a minimal sketch follows this list)
- Learns hierarchical audio features
- Produces compact, discriminative embeddings
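A minimal encoder sketch matching the model_config.yaml defaults (64, 128, 256, 512 filters). Kernel sizes, pooling, and the embedding dimension are assumptions, not the project's exact architecture:

```python
import torch
import torch.nn as nn

class SpectrogramEncoder(nn.Module):
    """Four conv blocks followed by global pooling and a linear embedding head."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        blocks, in_ch = [], 1
        for out_ch in (64, 128, 256, 512):
            blocks += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(512, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.pool(self.features(x)).flatten(1)
        return self.head(h)

encoder = SpectrogramEncoder()
emb = encoder(torch.randn(8, 1, 128, 130))  # -> (8, 128) embeddings
```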
- Fast cosine similarity search (sketched below)
- Scalable to large music databases
- Real-time song recommendations
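A minimal sketch of cosine-similarity retrieval over a bank of song embeddings; the project's recommendation module may expose a different interface:

```python
import torch
import torch.nn.functional as F

def top_k_similar(query: torch.Tensor, bank: torch.Tensor, k: int = 5):
    """query: (D,), bank: (N, D). Returns (scores, indices) of the k nearest songs."""
    q = F.normalize(query.unsqueeze(0), dim=1)
    b = F.normalize(bank, dim=1)
    scores = (q @ b.t()).squeeze(0)  # cosine similarity per song
    return torch.topk(scores, k)

bank = torch.randn(1000, 128)                    # embeddings for 1000 songs
scores, idx = top_k_similar(bank[0], bank, k=5)  # the query itself ranks first
```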
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Audio conversion code adapted from AudioToSpectogram
- SimCLR paper: A Simple Framework for Contrastive Learning of Visual Representations
- SpecAugment: SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
For questions or feedback, please open an issue on GitHub.
Happy Music Coding! 🎵🎶
