sslsv is a PyTorch-based deep learning toolkit providing a collection of Self-Supervised Learning (SSL) frameworks for learning speaker representations, applicable to various speaker-related downstream tasks, notably Speaker Verification (SV).
Its main objectives are to: (1) provide implementations of state-of-the-art SSL frameworks by adapting algorithms from the computer vision domain; and (2) evaluate them within a consistent and comparable environment.
An overview of the general training and evaluation framework is provided in the figure below.
- June 2025 – Release of results and checkpoints (v2.0).
- June 2025 – Support for Python 3.13 and PyTorch 2.7.
- December 2024 – Implementation of SimCLR MultiViews and MoCo Margins.
- November 2024 – Implementation of Self-Supervised Positive Sampling (SSPS).
- July 2024 – Implementation of more losses for SimCLR Margins (SphereFace, CurricularFace, MagFace, AdaFace).
- May 2024 – Documentation of the complete codebase.
- April 2024 – Complete refactoring, including typing, tests, and coding style (v2.0).
- January 2024 – Implementation of the W-MSE framework.
- July 2023 – Support for PyTorch Distributed Data Parallel (DDP).
- June 2023 – Evaluation on language, emotion, age, and gender recognition tasks.
- April 2023 – Additional benchmarks (SITW, VOiCES) and metrics (CLLR, ActDCF, AvgRPrec).
- March 2023 – Support for cosine scoring normalizations and PLDA evaluations.
- January 2023 – Implementation of SimCLR Margins (CosFace and ArcFace).
- December 2022 – Implementation of SSL frameworks: LIM, CPC, SimCLR, MoCo, Barlow Twins, VICReg, VIbCReg, DeepCluster, SwAV, SimSiam, BYOL, and DINO.
- June 2022 – First release of sslsv (v1.0).
General

- Data:
  - Supervised and self-supervised datasets (siamese and DINO sampling)
  - Audio augmentation (noise and reverberation; see the sketch after this list)
- Training:
  - CPU, GPU, and multi-GPU (DataParallel and DistributedDataParallel)
  - Checkpointing, resuming, early stopping, and logging
  - Tensorboard and wandb
- Evaluation:
  - Speaker verification
    - Backend: cosine scoring and PLDA
    - Metrics: EER, MinDCF, ActDCF, CLLR, AvgRPrec
  - Classification (emotion, language, ...)
- Notebooks (speaker verification): DET curve, scores distribution, t-SNE on embeddings, ...
- Misc: scalable config, typing, documentation, and tests
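For context, here is a minimal sketch of the two augmentation operations (additive noise at a target SNR, and reverberation by convolution with a room impulse response) in plain PyTorch. Tensor shapes and the SNR value are assumptions for illustration; this is not the sslsv implementation.

```python
import torch
import torch.nn.functional as F

def add_noise(wav: torch.Tensor, noise: torch.Tensor, snr_db: float = 10.0) -> torch.Tensor:
    """Mix a noise signal into the waveform at a target signal-to-noise ratio."""
    noise = noise[: wav.numel()]
    wav_power = wav.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-8)
    scale = torch.sqrt(wav_power / (noise_power * 10 ** (snr_db / 10)))
    return wav + scale * noise

def add_reverb(wav: torch.Tensor, rir: torch.Tensor) -> torch.Tensor:
    """Convolve the waveform with an energy-normalized room impulse response."""
    rir = rir / rir.norm(p=2)
    out = F.conv1d(wav.view(1, 1, -1), rir.flip(-1).view(1, 1, -1), padding=rir.numel() - 1)
    return out.view(-1)[: wav.numel()]  # truncate back to the original length

# Toy usage with random tensors standing in for real audio signals.
wav, noise, rir = torch.randn(16000), torch.randn(16000), torch.randn(512)
augmented = add_reverb(add_noise(wav, noise, snr_db=10.0), rir)
```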
Encoders

- TDNN (`sslsv.encoders.TDNN`)
  X-Vectors: Robust DNN Embeddings for Speaker Recognition (PDF)
  David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, Sanjeev Khudanpur
- Simple Audio CNN (`sslsv.encoders.SimpleAudioCNN`)
  Representation Learning with Contrastive Predictive Coding (arXiv)
  Aaron van den Oord, Yazhe Li, Oriol Vinyals
- ResNet-34 (`sslsv.encoders.ResNet34`)
  VoxCeleb2: Deep Speaker Recognition (arXiv)
  Joon Son Chung, Arsha Nagrani, Andrew Zisserman
- ECAPA-TDNN (`sslsv.encoders.ECAPATDNN`)
  ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification (arXiv)
  Brecht Desplanques, Jenthe Thienpondt, Kris Demuynck
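Each encoder maps a variable-length utterance to a fixed-size speaker embedding. In the TDNN/x-vector family this is done with a statistics pooling layer that aggregates frame-level features over time. A minimal sketch of that pooling step, assuming `(batch, channels, frames)` inputs (illustrative, not the `sslsv.encoders` implementation):

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """X-vector-style pooling: concatenate per-channel mean and std over time."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames) -> (batch, 2 * channels)
        return torch.cat([x.mean(dim=-1), x.std(dim=-1)], dim=-1)

# Frame-level features from any backbone, e.g. batch of 8, 512 channels, 200 frames.
pooled = StatsPooling()(torch.randn(8, 512, 200))  # -> (8, 1024)
```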
Frameworks

- LIM (`sslsv.methods.LIM`)
  Learning Speaker Representations with Mutual Information (arXiv)
  Mirco Ravanelli, Yoshua Bengio
- CPC (`sslsv.methods.CPC`)
  Representation Learning with Contrastive Predictive Coding (arXiv)
  Aaron van den Oord, Yazhe Li, Oriol Vinyals
- SimCLR (`sslsv.methods.SimCLR`)
  A Simple Framework for Contrastive Learning of Visual Representations (arXiv)
  Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton
- MoCo v2+ (`sslsv.methods.MoCo`)
  Improved Baselines with Momentum Contrastive Learning (arXiv)
  Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He
- DeepCluster v2 (`sslsv.methods.DeepCluster`)
  Deep Clustering for Unsupervised Learning of Visual Features (arXiv)
  Mathilde Caron, Piotr Bojanowski, Armand Joulin, Matthijs Douze
- SwAV (`sslsv.methods.SwAV`)
  Unsupervised Learning of Visual Features by Contrasting Cluster Assignments (arXiv)
  Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin
- W-MSE (`sslsv.methods.WMSE`)
  Whitening for Self-Supervised Representation Learning (arXiv)
  Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, Nicu Sebe
- Barlow Twins (`sslsv.methods.BarlowTwins`)
  Barlow Twins: Self-Supervised Learning via Redundancy Reduction (arXiv)
  Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stéphane Deny
- VICReg (`sslsv.methods.VICReg`)
  VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning (arXiv)
  Adrien Bardes, Jean Ponce, Yann LeCun
- VIbCReg (`sslsv.methods.VIbCReg`)
  Computer Vision Self-supervised Learning Methods on Time Series (arXiv)
  Daesoo Lee, Erlend Aune
- BYOL (`sslsv.methods.BYOL`)
  Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (arXiv)
  Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko
- SimSiam (`sslsv.methods.SimSiam`)
  Exploring Simple Siamese Representation Learning (arXiv)
  Xinlei Chen, Kaiming He
- DINO (`sslsv.methods.DINO`)
  Emerging Properties in Self-Supervised Vision Transformers (arXiv)
  Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin
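Many of these frameworks build on the same contrastive idea: embeddings of two augmented views of the same utterance are pulled together while the other utterances in the batch act as negatives. As a reference point, here is a self-contained sketch of SimCLR's NT-Xent loss in plain PyTorch; the default temperature mirrors the `t-0.03` configs in the results below, and this is not the `sslsv.methods.SimCLR` code.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.03) -> torch.Tensor:
    """NT-Xent (SimCLR) loss over two batches of view embeddings of shape (N, D)."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=-1)   # (2N, D), unit-norm rows
    sim = z @ z.t() / temperature                  # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))              # exclude self-similarities
    # The positive of view i is the other view of the same utterance.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(256, 512), torch.randn(256, 512))
```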
Methods (contributions)

- Combiner (`sslsv.methods.Combiner`)
  Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning (arXiv)
  Theo Lepage, Reda Dehak
- SimCLR Margins (`sslsv.methods.SimCLRMargins`)
  Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations (arXiv)
  Theo Lepage, Reda Dehak
- MoCo Margins (`sslsv.methods.MoCoMargins`)
  Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations (arXiv)
  Theo Lepage, Reda Dehak
- SSPS (`sslsv.methods._SSPS`)
  Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling (arXiv)
  Theo Lepage, Reda Dehak
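SimCLR Margins and MoCo Margins transplant margin-based softmax losses (CosFace, ArcFace, etc.) into the contrastive objective by penalizing the positive-pair similarity before the softmax. Below is a CosFace-style (additive margin) variant of the NT-Xent sketch above; the margin value is an assumption, and this is not the repository code.

```python
import torch
import torch.nn.functional as F

def nt_xent_margin(
    z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.03, margin: float = 0.1
) -> torch.Tensor:
    """NT-Xent with an additive (CosFace-style) margin on positive similarities."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=-1)
    cos = z @ z.t()                                # raw cosine similarities
    cos.fill_diagonal_(float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    cos[torch.arange(2 * n), targets] -= margin    # make positives harder to classify
    return F.cross_entropy(cos / temperature, targets)
```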
sslsv runs on Python 3.13.3 with the following dependencies.
| Module | Versions | 
|---|---|
| torch | 2.7.1 | 
| torchaudio | 2.7.1 | 
| numpy | * | 
| pandas | * | 
| soundfile | * | 
| scikit-learn | * | 
| speechbrain | * | 
| tensorboard | * | 
| wandb | * | 
| ruamel.yaml | * | 
| dacite | * | 
| prettyprinter | * | 
| tqdm | * | 
Note: developers will also need `pytest`, `pre-commit`, and `twine` to work on this project.
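The `ruamel.yaml` and `dacite` dependencies point at the pattern behind the scalable configs: YAML files parsed into typed dataclasses. A generic sketch of that pattern; the field names here are hypothetical, not sslsv's actual config schema.

```python
from dataclasses import dataclass

import dacite
from ruamel.yaml import YAML

@dataclass
class TrainingConfig:
    epochs: int = 100
    batch_size: int = 256

# Parse YAML into a plain dict, then validate it against the dataclass.
raw = YAML(typ="safe").load("epochs: 50\nbatch_size: 128\n")
config = dacite.from_dict(data_class=TrainingConfig, data=raw)
```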
- Speaker recognition: VoxCeleb1, VoxCeleb2, SITW, VOiCES
- Language recognition:
- Emotion recognition:
- Data-augmentation: MUSAN, simulated RIRs
Data used for the main experiments (conducted on VoxCeleb1 and VoxCeleb2 with data-augmentation) can be automatically downloaded, extracted, and prepared using the following scripts.

```bash
python tools/prepare_data/prepare_voxceleb.py data/
python tools/prepare_data/prepare_augmentation.py data/
```

The resulting data folder should have the structure presented below.
```
data
├── musan_split/
├── simulated_rirs/
├── voxceleb1/
├── voxceleb2/
├── voxceleb1_test_O
├── voxceleb1_test_H
├── voxceleb1_test_E
├── voxsrc2021_val
├── voxceleb1_train.csv
└── voxceleb2_train.csv
```
Other datasets must be downloaded and extracted manually, but their train and trials files can be created using the corresponding scripts from the `tools/prepare_data/` folder.
- Example format of a train file (`voxceleb1_train.csv`):

  ```
  File,Speaker
  voxceleb1/id10001/1zcIwhmdeo4/00001.wav,id10001
  ...
  voxceleb1/id11251/s4R4hvqrhFw/00009.wav,id11251
  ```

- Example format of a trials file (`voxceleb1_test_O`):

  ```
  1 voxceleb1/id10270/x6uYqmx31kE/00001.wav voxceleb1/id10270/8jEAjG6SegY/00008.wav
  ...
  0 voxceleb1/id10309/0cYFdtyWVds/00005.wav voxceleb1/id10296/Y-qKARMSO7k/00001.wav
  ```
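Given this format, a train file can be sanity-checked directly with pandas (a quick illustration, not part of the toolkit):

```python
import pandas as pd

train = pd.read_csv("data/voxceleb1_train.csv")  # columns: File, Speaker
print(f"{train['Speaker'].nunique()} speakers, {len(train)} utterances")
```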
- Clone this repository: `git clone https://github.com/theolepage/sslsv.git`.
- Install dependencies: `pip install -r requirements.txt`.

Note: sslsv can also be installed as a standalone package via pip, either with `pip install sslsv` or with `pip install .` (in the project root folder) to get the latest version.
- Start a training (2 GPUs): `./train_ddp.sh 2 <config_path>`.
- Evaluate your model (2 GPUs): `./evaluate_ddp.sh 2 <config_path>`.

Note: use `sslsv/bin/train.py` and `sslsv/bin/evaluate.py` to run in non-distributed mode on a CPU, a single GPU, or multiple GPUs (DataParallel).
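Evaluation scores each trial (typically by cosine similarity between the two embeddings) and reports metrics such as EER. For reference, a minimal sketch of cosine scoring and EER computation with scikit-learn; both helpers are assumptions for illustration, not the repository's metric code.

```python
import numpy as np
from sklearn.metrics import roc_curve

def cosine_score(enroll: np.ndarray, test: np.ndarray) -> float:
    """Cosine similarity between an enrollment and a test embedding."""
    return float(np.dot(enroll, test) / (np.linalg.norm(enroll) * np.linalg.norm(test)))

def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """Equal Error Rate: the point where false acceptance equals false rejection."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2)

# Toy usage: labels are 1 for target trials, 0 for non-target trials.
print(compute_eer(np.array([1, 1, 0, 0]), np.array([0.9, 0.6, 0.4, 0.2])))
```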
You can visualize your experiments with `tensorboard --logdir models/your_model/`.

Use `wandb online` and `wandb offline` to toggle wandb. To log your experiments, you first need to provide your API key with `wandb login API_KEY`.
Documentation is currently being developed...
- Configs: `models/ssl/voxceleb2/`
- Train set: VoxCeleb2
- Evaluation: VoxCeleb1-O (Original)
- Encoder: Fast ResNet-34 and ECAPA-TDNN
Fast ResNet-34:

| Method | Model | EER (%) | minDCF (p=0.01) | Checkpoint |
|---|---|---|---|---|
| LIM | lim/lim_loss-NCE_proj-2048-BN-R-2048-BN-R-512 | 16.13 | 0.9015 | |
| CPC | cpc/cpc_t-4_agg-GRU-1-256 | 12.77 | 0.8033 | |
| SimCLR | simclr/simclr_proj-none_t-0.03 | 9.05 | 0.6364 | ✓ |
| MoCo | moco/moco_proj-none_Q-32768_t-0.03_m-0.999 | 8.49 | 0.5990 | ✓ |
| DeepCluster | deepcluster/deepcluster_proj-2048-BN-R-2048-BN-R-512_K-3000-3000-3000_t-0.1 | 15.16 | 0.8193 | |
| SwAV | swav/swav_proj-2048-BN-R-2048-BN-R-512_K-6000_t-0.1 | 11.82 | 0.7177 | ✓ |
| W-MSE | wmse/wmse_proj-1024-BN-R-64_ws-128 | 14.62 | 0.8506 | |
| Barlow Twins | barlowtwins/barlowtwins_proj-2048-BN-R-2048-BN-R-512_lambda-0.005 | 13.22 | 0.7658 | |
| VICReg | vicreg/vicreg_proj-2048-BN-R-2048-BN-R-512_inv-1.0_var-1.0_cov-0.1 | 11.33 | 0.6658 | ✓ |
| BYOL | byol/byol_proj-2048-BN-R-2048-BN-R-512_pred-4096-BN-R-256_m-0.996-sched | 13.99 | 0.7509 | |
| SimSiam | simsiam/simsiam_proj-2048-BN-R-2048-BN-R-512-BN_pred-512-BN-R-2048 | 28.94 | 0.9984 | |
| DINO | dino/dino_proj-2048-BN-G-2048-BN-G-256-L2-65536_G-2x4_L-4x2_t-0.04 | 6.04 | 0.4526 | ✓ |
| Supervised | supervised/supervised_loss-AAM_s-30_m-0.2 | 2.95 | 0.3122 | ✓ |
ECAPA-TDNN:

| Method | Model | EER (%) | minDCF (p=0.01) | Checkpoint |
|---|---|---|---|---|
| SimCLR | simclr/simclr_enc-ECAPATDNN-1024_proj-none_t-0.03 | 6.41 | 0.5160 | ✓ |
| MoCo | moco/moco_enc-ECAPATDNN-1024_proj-none_Q-32768_t-0.03_m-0.999 | 6.48 | 0.5372 | ✓ |
| SwAV | swav/swav_enc-ECAPATDNN-1024_proj-2048-BN-R-2048-BN-R-512_K-6000_t-0.1 | 8.12 | 0.6148 | ✓ |
| VICReg | vicreg/vicreg_enc-ECAPATDNN-1024_proj-2048-BN-R-2048-BN-R-512_inv-1.0_var-1.0_cov-0.1 | 7.42 | 0.5659 | ✓ |
| DINO | dino/dino_enc-ECAPATDNN-1024_proj-2048-BN-G-2048-BN-G-256-L2-65536_G-2x4_L-4x2_t-0.04 | 2.82 | 0.3463 | ✓ |
| Supervised | supervised/supervised_enc-ECAPATDNN-1024_loss-AAM_s-30_m-0.2 | 1.34 | 0.1521 | ✓ |
- Configs: `models/ssps/voxceleb2/`
- Train set: VoxCeleb2
- Evaluation: VoxCeleb1-O (Original)
- Encoder: ECAPA-TDNN
| Method | Model | EER (%) | minDCF (p=0.01) | Checkpoint | 
|---|---|---|---|---|
| SimCLR | simclr_e-ecapa/ssps_kmeans_25k_uni-1 | 2.57 | 0.3033 | ✓ |
| DINO | dino_e-ecapa/ssps_kmeans_25k_uni-1 | 2.53 | 0.2843 | ✓ |
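As the `ssps_kmeans_25k` config names suggest, SSPS clusters the training embeddings (e.g. with k-means) and draws positives from the anchor's cluster rather than relying only on augmented views of the same utterance. A toy sketch of that sampling step with scikit-learn; sizes and names are illustrative, and the actual method lives in `sslsv.methods._SSPS`.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 128)).astype(np.float32)  # one per utterance

# Assign every training utterance to a cluster of the embedding space.
assignments = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(embeddings)

def sample_positive(anchor_idx: int) -> int:
    """Pick another utterance from the anchor's cluster to serve as the positive."""
    candidates = np.flatnonzero(assignments == assignments[anchor_idx])
    candidates = candidates[candidates != anchor_idx]
    return int(rng.choice(candidates))
```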
sslsv contains third-party components and code adapted from other open-source projects, including voxceleb_trainer, voxceleb_unsupervised, and solo-learn.

If you use sslsv, please consider starring this repository on GitHub and citing one of the following papers.
```bibtex
@InProceedings{lepage2025SSPS,
  title     = {SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification},
  author    = {Lepage, Theo and Dehak, Reda},
  year      = {2025},
  booktitle = {Interspeech 2025},
  url       = {https://arxiv.org/abs/2505.14561},
}

@Article{lepage2025SSLSVBootstrappedPositiveSampling,
  title     = {Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling},
  author    = {Lepage, Theo and Dehak, Reda},
  year      = {2025},
  journal   = {arXiv preprint arXiv:2501.17772},
  url       = {https://arxiv.org/abs/2501.17772},
}

@InProceedings{lepage2024AdditiveMarginSSLSV,
  title     = {Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations},
  author    = {Lepage, Theo and Dehak, Reda},
  year      = {2024},
  booktitle = {The Speaker and Language Recognition Workshop (Odyssey 2024)},
  pages     = {38--42},
  doi       = {10.21437/odyssey.2024-6},
  url       = {https://www.isca-archive.org/odyssey_2024/lepage24_odyssey.html},
}

@InProceedings{lepage2023ExperimentingAdditiveMarginsSSLSV,
  title     = {Experimenting with Additive Margins for Contrastive Self-Supervised Speaker Verification},
  author    = {Lepage, Theo and Dehak, Reda},
  year      = {2023},
  booktitle = {Interspeech 2023},
  pages     = {4708--4712},
  doi       = {10.21437/Interspeech.2023-1479},
  url       = {https://www.isca-archive.org/interspeech_2023/lepage23_interspeech.html},
}

@InProceedings{lepage2022LabelEfficientSSLSV,
  title     = {Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning},
  author    = {Lepage, Theo and Dehak, Reda},
  year      = {2022},
  booktitle = {Interspeech 2022},
  pages     = {4018--4022},
  doi       = {10.21437/Interspeech.2022-802},
  url       = {https://www.isca-archive.org/interspeech_2022/lepage22_interspeech.html},
}
```

This project is released under the MIT License.
