Fast, memory-efficient extraction of variable sites from FASTA alignments.
Quick Start · Features · Usage · Benchmarks · Citation
Paula Ruiz-Rodriguez<sup>1</sup> and Mireia Coscolla<sup>1</sup>
<sup>1</sup> Institute for Integrative Systems Biology, I2SysBio, University of Valencia-CSIC, Valencia, Spain
SNPick extracts variable (SNP) sites from whole-genome FASTA alignments. It produces reduced alignments ready for phylogenetic inference with ascertainment bias correction (ASC) in IQ-TREE and RAxML, and optionally generates VCF files.
Why not snp-sites? snp-sites works well for small datasets but struggles with large alignments because it loads the full alignment matrix into memory and scales poorly. SNPick uses a zero-copy memory-mapped architecture that scales to thousands of genomes: in the benchmark below, 1000 genomes of 4.4 Mbp take about 3 s and about 140 MB of RAM.
| | SNPick | snp-sites |
|---|---|---|
| Architecture | Zero-copy mmap, parallel scan | Full matrix in memory |
| 250 seqs × 4.4 Mbp | 0.9 s, 105 MB | 9.5 s, 520 MB |
| 1000 seqs × 4.4 Mbp | ~3 s, ~140 MB | >26 min (killed), 3+ GB |
| ASC fconst output | Built-in | Not supported |
| VCF output | Optional | Default |
| Gap handling | Optional (`-g`) | Default |
| IUPAC ambiguous | Tracked as ambiguous | |
# Install
conda install -c bioconda snpick
# Extract variable sites
snpick -f alignment.fasta -o snps.fasta
# With VCF output
snpick -f alignment.fasta -o snps.fasta --vcf
# Include gaps as informative
snpick -f alignment.fasta -o snps.fasta -g

SNPick identifies positions with more than one observed nucleotide across all sequences. Constant and ambiguous-only positions are excluded from the output.
Reports constant site counts (fconst) directly, formatted for IQ-TREE's +ASC models:
[snpick] ASC fconst: 744123,1382922,1382180,743556
Use in IQ-TREE:
iqtree2 -s snps.fasta -m GTR+ASC -fconst 744123,1382922,1382180,743556

Optional VCF v4.2 output with per-sample genotypes. The reference allele is taken from the first sequence. Ambiguous bases are reported as missing (.).
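The VCF conventions described above can be illustrated with a minimal sketch. It is not SNPick's writer: the function name `vcf_record`, the haploid `GT` encoding, the placeholder `.` values for ID, QUAL, FILTER and INFO, and the chromosome name are all assumptions made for illustration. Only the two rules stated above (reference allele from the first sequence, ambiguous bases as `.`) come from this README.

```python
def vcf_record(chrom, pos, column):
    """Build one tab-separated VCF data line for a single alignment column.

    `column` holds one base per sample, in input order (hypothetical helper).
    """
    standard = set("ACGT")
    ref = column[0]  # reference allele: base of the first sequence
    alts = []
    for base in column:
        # Collect each distinct standard base that differs from REF, in order seen.
        if base in standard and base != ref and base not in alts:
            alts.append(base)
    alleles = [ref] + alts
    # Haploid genotype: allele index, or "." when the base is ambiguous (assumption).
    genotypes = [str(alleles.index(b)) if b in standard else "." for b in column]
    fields = [chrom, str(pos), ".", ref, ",".join(alts), ".", ".", ".", "GT"]
    return "\t".join(fields + genotypes)


line = vcf_record("1", 10, "AGAN")
print(line)
```

For the column `AGAN`, the sketch yields REF `A`, ALT `G`, and genotypes `0`, `1`, `0`, `.`. What happens when the first sequence itself carries an ambiguous base is not stated in this README, so the sketch does not try to model it.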
- Ambiguous bases (N, R, Y, etc.): not counted as alleles; a position is variable only if it has ≥2 distinct standard bases (A, C, G, T)
- Gaps (`-`): ignored by default; included as a 5th character with `-g`
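The two rules above amount to a per-column test, sketched here in Python. The function name `is_variable` and its `include_gaps` parameter are hypothetical, and the reading that a gap counts as an ordinary fifth state under `-g` (so a column with one base plus gaps becomes variable) is an assumption drawn from the wording above, not from the source code.

```python
STANDARD_BASES = frozenset("ACGT")


def is_variable(column: str, include_gaps: bool = False) -> bool:
    """Return True if an alignment column counts as a variable site."""
    allowed = (STANDARD_BASES | {"-"}) if include_gaps else STANDARD_BASES
    # Uppercase first so lowercase bases are classified like uppercase ones.
    observed = {ch for ch in column.upper() if ch in allowed}
    # Ambiguity codes never enter `observed`, so they cannot make a site variable.
    return len(observed) >= 2


print(is_variable("AAG"))                     # True: two standard bases
print(is_variable("ARY"))                     # False: only one standard base
print(is_variable("A-A"))                     # False: gaps ignored by default
print(is_variable("A-A", include_gaps=True))  # True: gap counted as a 5th character
```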
Automatic multi-threaded scanning via Rayon when the dataset is large enough. Falls back to single-threaded for small inputs to avoid overhead.
conda install -c bioconda snpick
# or
mamba install -c bioconda snpick

# or build from source
git clone https://github.com/PathoGenOmics-Lab/snpick.git
cd snpick
cargo build --release
# Binary at target/release/snpick

# or download the release binary
wget https://github.com/PathoGenOmics-Lab/snpick/releases/latest/download/snpick
chmod +x snpick

snpick [OPTIONS] --fasta <FASTA> --output <OUTPUT>
| Argument | Required | Description |
|---|---|---|
| `-f, --fasta <FILE>` | Yes | Input FASTA alignment |
| `-o, --output <FILE>` | Yes | Output FASTA (variable sites only) |
| `-g, --include-gaps` | No | Treat gaps (`-`) as a 5th character |
| `--vcf` | No | Generate VCF file (derived from output name) |
| `--vcf-output <FILE>` | No | Custom VCF output path |
Input (alignment.fasta):
>sequence1
ATGCTAGCTAGCTAGCTA
>sequence2
ATGCTAGCTGGCTAGCTA
>sequence3
ATGCTAGCTAGCTAGCTA
Command:
snpick -f alignment.fasta -o snps.fasta
Output (snps.fasta):
>sequence1
A
>sequence2
G
>sequence3
A
stderr:
[snpick] Mapped 63 bytes. 3 sequences × 18 positions.
[snpick] 1 variable, 17 constant (A:4 C:4 G:4 T:5), 0 ambiguous-only, 18 total.
[snpick] ASC fconst: 4,4,4,5
[snpick] Done in 0.00s. 1 vars from 3 seqs × 18 pos.
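The counts in this example can be checked by hand with a short Python sketch. It is an independent re-computation, not SNPick's code: the function name `summarize` is hypothetical, and it assumes that a column with exactly one standard base (possibly alongside ambiguity codes) is counted as constant for that base, with `fconst` reported in A, C, G, T order as in the log line above.

```python
from collections import Counter


def summarize(seqs):
    """Count variable, constant (per base) and ambiguous-only columns."""
    standard = set("ACGT")
    variable = 0
    ambiguous_only = 0
    constant = Counter()
    for column in zip(*seqs):  # iterate over alignment columns
        observed = {base for base in column if base in standard}
        if len(observed) >= 2:
            variable += 1
        elif len(observed) == 1:
            constant[observed.pop()] += 1
        else:
            ambiguous_only += 1
    fconst = [constant[base] for base in "ACGT"]
    return variable, fconst, ambiguous_only


seqs = [
    "ATGCTAGCTAGCTAGCTA",
    "ATGCTAGCTGGCTAGCTA",
    "ATGCTAGCTAGCTAGCTA",
]
variable, fconst, ambiguous_only = summarize(seqs)
print(variable, fconst, ambiguous_only)  # 1 [4, 4, 4, 5] 0
```

The single variable column is position 10 (A/G), and the remaining 17 constant columns split into 4 A, 4 C, 4 G and 5 T, matching the `fconst` line.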
Simulated M. tuberculosis-like genomes (4.4 Mbp, ~65% GC, 3.6% variable sites).
SNPick maintains O(L) memory regardless of sequence count, while snp-sites requires O(N×L).
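A back-of-the-envelope calculation shows what the two growth rates mean for the largest benchmark. The sketch assumes one byte per base and one byte of per-site state; both function names are hypothetical, and real processes also hold buffers and indexes, so measured figures differ from these estimates.

```python
def bytes_full_matrix(n_seqs: int, length: int) -> int:
    """O(N×L): every base of every sequence held in memory, 1 byte each."""
    return n_seqs * length


def bytes_per_site_state(length: int) -> int:
    """O(L): one byte of state per alignment column, independent of N."""
    return length


n_seqs, length = 1000, 4_400_000  # 1000 genomes of 4.4 Mbp
print(bytes_full_matrix(n_seqs, length) / 1e9)  # 4.4 (GB)
print(bytes_per_site_state(length) / 1e6)       # 4.4 (MB)
```

Going from 250 to 1000 sequences multiplies the first figure by four and leaves the second unchanged.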
Input FASTA ──mmap──▶ Index records ──▶ Pass 1: bitmask scan ──▶ Analyze
              │                              (parallel)             │
              │                                                     ▼
              └──────────▶ Pass 2: extract sites ──▶ FASTA + VCF
                           (sparse random access)
- Single memory-mapped file shared across both passes: zero copies
- Pass 1: OR-based bitmask over all sequences (parallel with Rayon)
- Pass 2: only reads variable positions (sparse access via mmap)
- Lookup tables: 256-byte arrays for O(1) nucleotide classification and case conversion
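The two-pass design listed above can be sketched in Python with the standard `mmap` module. This is an illustration of the technique, not SNPick's Rust implementation: all function names are hypothetical, records are assumed to be well-formed with one sequence line each, the one-hot bit values are an arbitrary choice, and the scan is sequential rather than parallel.

```python
import mmap
import os
import tempfile

BIT = {ord("A"): 1, ord("C"): 2, ord("G"): 4, ord("T"): 8}
# 256-entry lookup table: byte -> one-hot bit (lowercase folded, 0 for anything else).
LOOKUP = bytes(BIT.get(b, BIT.get(b - 32, 0)) for b in range(256))


def index_records(mm):
    """Return (name, seq_start, seq_end) offsets for each single-line record."""
    records, pos, size = [], 0, len(mm)
    while pos < size:
        header_end = mm.find(b"\n", pos)
        seq_start = header_end + 1
        seq_end = mm.find(b"\n", seq_start)
        if seq_end == -1:  # last line without a trailing newline
            seq_end = size
        records.append((mm[pos + 1:header_end].decode(), seq_start, seq_end))
        pos = seq_end + 1
    return records


def variable_sites(mm, records):
    """Pass 1: OR one bit per observed base into one mask byte per column."""
    length = records[0][2] - records[0][1]
    mask = bytearray(length)
    for _, start, _ in records:
        for i in range(length):
            mask[i] |= LOOKUP[mm[start + i]]
    # More than one bit set means at least two distinct standard bases.
    return [i for i, m in enumerate(mask) if m & (m - 1)]


def extract(mm, records, sites):
    """Pass 2: read only the variable columns of each record from the map."""
    return {name: bytes(mm[start + i] for i in sites).decode()
            for name, start, _ in records}


fasta = (b">sequence1\nATGCTAGCTAGCTAGCTA\n"
         b">sequence2\nATGCTAGCTGGCTAGCTA\n"
         b">sequence3\nATGCTAGCTAGCTAGCTA\n")
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(fasta)
with open(tmp.name, "rb") as fh, \
        mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    records = index_records(mm)
    sites = variable_sites(mm, records)
    snps = extract(mm, records, sites)
os.remove(tmp.name)
print(sites, snps)  # [9] {'sequence1': 'A', 'sequence2': 'G', 'sequence3': 'A'}
```

The only per-site state is the `mask` array, which is why memory depends on alignment length and not on the number of sequences. Both passes read from the same mapping, so no sequence is copied into a separate buffer before pass 2.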
If you use SNPick in your research, please cite:
@software{snpick,
author = {Ruiz-Rodriguez, Paula and Coscolla, Mireia},
title = {SNPick: Fast extraction of variable sites from FASTA alignments},
url = {https://github.com/PathoGenOmics-Lab/snpick},
doi = {10.5281/zenodo.14191809},
license = {GPL-3.0}
}

Contributors:
- Paula Ruiz-Rodriguez
- Mireia Coscolla
This project follows the all-contributors specification (emoji key).


