138M ChatML training stack for Apple Silicon using MLX.
Canonical remote branch: main. Historical legacy-default state is preserved at tag archive/origin-master-2026-03-20.
The repo now treats one path as first-class:
clean Quality2K continuation -> explicitly approved pinned checkpoint -> v20 align/full/repair SFT recovery curriculum -> broad raw chat gate
The v19 repair passed a narrow gate but fails broad chat. Keep v19 checkpoints, loaders, tokenizer, model topology, ChatML rendering, and assistant-only loss masking compatible; do not overwrite or repoint v19 artifacts while v20 is being trained. The latest v19 canonical run state is incomplete/stale for coherent multi-turn chat decisions, so v20 is now the active recovery path.
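For orientation, ChatML wraps each turn in `<|im_start|>`/`<|im_end|>` markers, and assistant-only loss masking zeroes the loss on every token outside assistant turns. A minimal sketch of both, with `encode` as a hypothetical stand-in for the repo's tokenizer; the actual renderer and masking live in the repo's own modules:

```python
# Minimal ChatML rendering and assistant-only loss masking sketch.
# `encode` is a hypothetical callable mapping text -> list of token ids.

def render_chatml(messages):
    """Render [{'role': ..., 'content': ...}, ...] as ChatML text."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

def assistant_only_mask(messages, encode):
    """Return (token_ids, loss_mask); loss_mask is 1 only on assistant turns."""
    ids, mask = [], []
    for m in messages:
        turn_ids = encode(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
        ids.extend(turn_ids)
        mask.extend([1 if m["role"] == "assistant" else 0] * len(turn_ids))
    return ids, mask
```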
Active entrypoints:
- `scripts/build_pretrain_quality2k.py`
- `scripts/run_pretrain_quality2k_terminal.sh`
- `scripts/audit_dense_mainline.py`
- `scripts/review_plain_generation.py`
- `scripts/select_quality2k_checkpoint.py`
- `scripts/pin_quality2k_checkpoint.py`
- `scripts/build_sft_release.py`
- `scripts/run_sft_release.py`
- `scripts/run_sft_release_v20.py`
- `scripts/run_multiturn_coherence_eval.py` (scored raw/guarded broad chat suite; see the SFT Runbook)
Research branch entrypoints:
- `scripts/extend_tokenizer_with_vm_tokens.py`
- `scripts/build_vm_pilot_dataset.py`
- `scripts/init_vm_from_dense.py`
- `scripts/extend_tokenizer_with_wasm_tokens.py`
- `scripts/normalize_local_docs.py`
- `scripts/build_wasm_subset_corpus.py`
- `scripts/build_wasm80m_pretrain_corpus.py`
- `scripts/build_wasm80m_sft_corpora.py`
- `scripts/run_wasm80m_pretrain.py`
- `scripts/run_wasm80m_sft.py`
- `scripts/eval_wasm80m.py`
Historical probe-era and experimental material is retained only as archived reference. See Archive Notes.
Historical dense shims:
- `scripts/build_sft_v19_release.py`
- `scripts/run_sft_release_v19.py`
- `scripts/build_sft_v18_release.py`
- `scripts/run_sft_release_v18.py`
- `scripts/run_sft_release_v18_terminal.sh`

These remain compatibility shims only and are non-authoritative for release decisions.
The WASM80m scripts listed under “Research branch entrypoints” are a parallel tokenizer/model line (docs/wasm80m_runbook.md); they are not part of finishing dense 138M v20 chat.
The only architecture on the release path is the dense 138M line. Experimental dense_vm and dense_wasm80m work are isolated to separate branch/config families and do not share checkpoint compatibility with the dense mainline.
- Preserved raw pretrain base: `checkpoints/pretrain_mlx_138m_chatml/mlx_step_130000.pkl`
- Active continuation config: `configs/pretrain_mlx_138m_quality2k.yaml`
- Active continuation outputs: `checkpoints/pretrain_mlx_138m_quality2k`
- Canonical SFT handoff: `checkpoints/pretrain_mlx_138m_quality2k/selected_for_sft.pkl`
- Active v20 SFT configs: `configs/sft_release_v20_align.yaml`, `configs/sft_release_v20_full.yaml`, `configs/sft_release_v20_repair.yaml`
- Active v20 SFT corpora: `data/sft_chatml_v20_align.jsonl`, `data/sft_chatml_v20_release.jsonl`, `data/sft_chatml_v20_eval.jsonl`, `data/sft_chatml_v20_repair.jsonl`
- Active v20 shard directories: `data/sft_chatml_shards_v20_align`, `data/sft_chatml_shards_v20_release`, `data/sft_chatml_eval_shards_v20_release`, `data/sft_chatml_shards_v20_repair`
- v20 run outputs: `checkpoints/sft_release_v20_*` and `reports/sft_release_v20_runs/*`
- Best observed v20 repair probe, not promotable: `checkpoints/sft_release_v20_repair_gatebridge/sft_step_1000.pkl`
  - broad raw: 106/120 scored checks, 68/80 scenarios, rewrite_rate=0.0
  - broad guarded: 107/120 scored checks, 69/80 scenarios, rewrite_rate=0.1333
  - blockers: raw arithmetic follow-up misses plus practical/factual lexical misses; the guarded rewrite rate is above the 0.10 cap
- v19 compatibility pin: `checkpoints/sft_release_v19_repair/selected_for_future_work.pkl` remains loadable evidence only until a v20 checkpoint passes broad raw/guarded eval, lineage checks, manifest hashes, and manual smoke prompts.
- Eval commands, gate CLI, release bundle, and optional MLX smoke tests: docs/eval.md. Pin promotion, `raw_reply` vs `reply`, and `gate_report.json` retention: docs/sft_runbook.md (sections after Candidate Eval).
- Mainline pin metadata for approved selections includes lineage fields: `run_id`, `source_checkpoint`, `selected_step`, `gate_report_path`, `manifest_hash`, and `mainline_valid` (illustrated below).
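Purely as an illustration of those lineage fields, a pin metadata record might look like the following; every value here is a placeholder, and the actual file layout is whatever `scripts/pin_quality2k_checkpoint.py` writes:

```python
# Hypothetical pin metadata record; values are placeholders, not real run data.
pin_metadata = {
    "run_id": "<run id>",
    "source_checkpoint": "checkpoints/pretrain_mlx_138m_quality2k/mlx_step_11000.pkl",
    "selected_step": 11000,
    "gate_report_path": "reports/sft_release_v20_runs/<run_id>/gate_report.json",
    "manifest_hash": "<hash of the build manifest>",
    "mainline_valid": True,
}
```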
```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
PYTHONPATH=src python scripts/setup_verification.py
```

Build the curated continuation corpus:
```bash
source .venv/bin/activate
PYTHONPATH=src python scripts/build_pretrain_quality2k.py
```

The active 138M continuation runtime contract is:

- context: 2048 tokens
- dropout: 0.0
- compile: true
- compile_granularity: microbatch
- precision: bfloat16
- micro_batch_size: 1
- grad_accum_steps: 16
- gradient_checkpointing: false
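A minimal guard against config drift, assuming these keys appear at the top level of the YAML (the real config may nest them differently; PyYAML assumed available):

```python
# Sketch: assert configs/pretrain_mlx_138m_quality2k.yaml still matches the
# runtime contract above. Assumes flat top-level keys.
import yaml

CONTRACT = {
    "context": 2048,
    "dropout": 0.0,
    "compile": True,
    "compile_granularity": "microbatch",
    "precision": "bfloat16",
    "micro_batch_size": 1,
    "grad_accum_steps": 16,
    "gradient_checkpointing": False,
}

with open("configs/pretrain_mlx_138m_quality2k.yaml") as f:
    cfg = yaml.safe_load(f)

for key, expected in CONTRACT.items():
    actual = cfg.get(key)
    assert actual == expected, f"{key}: expected {expected!r}, got {actual!r}"
```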
Run the continuation from Terminal:
```bash
cd /Users/admin/Downloads/VSCode/AnarchoBot
./scripts/run_pretrain_quality2k_terminal.sh
```

Start a fresh continuation explicitly:
```bash
cd /Users/admin/Downloads/VSCode/AnarchoBot
./scripts/run_pretrain_quality2k_terminal.sh --clean-run
```

Monitor the run:
```bash
source .venv/bin/activate
PYTHONPATH=src python scripts/metrics_window.py \
  --log-dir checkpoints/pretrain_mlx_138m_quality2k/logs \
  --config configs/pretrain_mlx_138m_quality2k.yaml
```

Validate the staged continuation checkpoints before extending the run:
```bash
source .venv/bin/activate
PYTHONPATH=src python scripts/validate_mainline_training.py grad-coverage \
  --config configs/pretrain_mlx_138m_quality2k.yaml \
  --checkpoint checkpoints/pretrain_mlx_138m_chatml/mlx_step_130000.pkl
PYTHONPATH=src python scripts/validate_mainline_training.py checkpoint-diff \
  --config configs/pretrain_mlx_138m_quality2k.yaml \
  --start-checkpoint checkpoints/pretrain_mlx_138m_chatml/mlx_step_130000.pkl \
  --end-checkpoint checkpoints/pretrain_mlx_138m_quality2k/mlx_step_11000.pkl
```

For the completed 12000-step continuation run, the preserved candidate pool is steps 8000, 9000, 10000, 11000, and 12000; earlier checkpoints rotated out under `ckpt_keep: 5`.
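The rotation rule itself is just keep-newest-N; a sketch under the `mlx_step_<N>.pkl` naming shown above:

```python
# Sketch of ckpt_keep-style rotation: keep the newest N step checkpoints.
from pathlib import Path

def surviving_checkpoints(ckpt_dir, keep=5):
    """Return the step checkpoints that survive rotation, newest first."""
    ckpts = sorted(
        Path(ckpt_dir).glob("mlx_step_*.pkl"),
        key=lambda p: int(p.stem.split("_")[-1]),
        reverse=True,
    )
    return ckpts[:keep]

# e.g. a 12000-step run saving every 1000 steps keeps 8000..12000 under keep=5
```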
Select the checkpoint with the deterministic continuation handoff rule:
```bash
source .venv/bin/activate
PYTHONPATH=src python scripts/select_quality2k_checkpoint.py \
  --manifest examples/quality2k_selection_manifest.json \
  --print-pin-command
```

The selector uses held-out perplexity with an earliest-step tie-break, and blocks candidates only for checkpoint-diff failure, non-finite or missing perplexity, or catastrophic plain-generation regression versus the base review.
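A sketch of that selection rule over hypothetical candidate records (the real selector reads the manifest and review artifacts):

```python
# Sketch of the deterministic handoff rule: held-out perplexity wins,
# earliest step breaks ties, and only the three listed failures block.
import math

def select_candidate(candidates):
    """Each candidate is a hypothetical dict:
    {"step": int, "heldout_ppl": float,
     "checkpoint_diff_ok": bool, "catastrophic_regression": bool}"""
    eligible = [
        c for c in candidates
        if c["checkpoint_diff_ok"]
        and not c["catastrophic_regression"]
        and c.get("heldout_ppl") is not None
        and math.isfinite(c["heldout_ppl"])
    ]
    if not eligible:
        raise RuntimeError("no eligible candidates")
    # min() returns the first minimum, so sorting by step first yields the
    # earliest-step tie-break on equal perplexity
    return min(sorted(eligible, key=lambda c: c["step"]),
               key=lambda c: c["heldout_ppl"])
```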
Pin the chosen continuation checkpoint only after the clean rerun validations pass:
```bash
source .venv/bin/activate
PYTHONPATH=src python scripts/pin_quality2k_checkpoint.py \
  --checkpoint checkpoints/pretrain_mlx_138m_quality2k/mlx_step_11000.pkl \
  --mainline-valid \
  --artifact-role mainline_candidate \
  --validation-basis "base grad coverage + compile parity passed; checkpoint diff passed; held-out perplexity won preserved 8000-12000 pool; no catastrophic plain-generation regression vs base"
```

Export a Hugging Face token at runtime before rebuilding the canonical natural-chat slice:
```bash
export HF_TOKEN=...
```

Build the v20 SFT corpora:

```bash
source .venv/bin/activate
PYTHONPATH=src python scripts/build_sft_release.py --version v20 --clean-output
```

The standalone builder writes `reports/sft_v20_release_build/build_summary.json`. The shared runner writes per-run build reports under `reports/sft_v20_release_builds/<run_id>/build_summary.json`.
The current v20 build reports these manifest counts:

- align: 6000 examples
- release: 35000 examples
- eval: 2000 examples
- repair: 4000 examples
The shared runner now validates `manifest_examples` against these bands:

- align: 5000-8000
- release: 30000-45000
- eval: >=1800
- repair: 3500-5000
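A sketch of that band check, assuming `build_summary.json` maps each split to a `manifest_examples` count; the actual report layout may differ:

```python
# Sketch: validate per-split manifest_examples counts against the bands above.
import json

BANDS = {
    "align": (5000, 8000),
    "release": (30000, 45000),
    "eval": (1800, None),  # lower bound only
    "repair": (3500, 5000),
}

with open("reports/sft_v20_release_build/build_summary.json") as f:
    summary = json.load(f)

for split, (lo, hi) in BANDS.items():
    n = summary[split]["manifest_examples"]
    assert n >= lo and (hi is None or n <= hi), f"{split}: {n} outside band"
```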
Run the v20 curriculum:
```bash
cd /Users/admin/Downloads/VSCode/AnarchoBot
PYTHONPATH=src .venv/bin/python scripts/run_sft_release_v20.py
```

Default v20 release controls include:
- v20 corpus rebuild and manifest validation before training
- align 4000, full 16000, repair 1000
- repair shards use numeric-token loss weighting (6.0) for digit-containing assistant tokens, to give arithmetic/structured utility errors enough gradient without changing checkpoint format (see the sketch after this list)
- exact broad arithmetic gate prompts are filtered from all train/eval/repair splits; repair may use near-holdout arithmetic and gate-bridge chat examples for failure classes, not exact broad-gate prompts
- dual-track raw/guarded gating, with broad raw multi-turn chat required for promotion
- rewrite-rate cap (<=0.10 by default); policy rewrites are a secondary safety net, not release proof
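A sketch of the numeric-token weighting named above, with `decode_token` as a hypothetical per-token decoder; the real trainer applies weights inside the loss computation without changing checkpoint format:

```python
# Sketch: upweight digit-containing assistant tokens in the SFT loss.
NUMERIC_WEIGHT = 6.0

def token_loss_weights(token_ids, assistant_mask, decode_token):
    """Per-token weights: 0 off-assistant, 6.0 for digit tokens, else 1.0."""
    weights = []
    for tok, is_assistant in zip(token_ids, assistant_mask):
        if not is_assistant:
            weights.append(0.0)  # assistant-only loss masking
        elif any(ch.isdigit() for ch in decode_token(tok)):
            weights.append(NUMERIC_WEIGHT)  # arithmetic/structured utility tokens
        else:
            weights.append(1.0)
    return weights
```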
Current gate evidence says to rebuild the repair mix around the remaining raw failures rather than extend the same repair run blindly. The gatebridge probe improved over the failed v19 repair baseline, but it is still not a selected or release-ready checkpoint.
The later `sft_release_v20_repair_gatebridge_chatfix` restart regressed to 102/120 raw scored checks and 65/80 raw scenarios, so it is also diagnostic-only.
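For orientation only, the dual-track promotion decision combines those numbers roughly as below; the scored-check and scenario thresholds are illustrative assumptions, since only the 0.10 rewrite cap is pinned in this README:

```python
# Sketch of the dual-track gate: broad raw chat must clear its thresholds and
# the guarded rewrite rate must stay under the cap. min_checks/min_scenarios
# are illustrative assumptions, not the repo's actual pass marks.
REWRITE_CAP = 0.10

def promotable(raw, guarded, min_checks=118, min_scenarios=78):
    """raw/guarded: {"scored_checks": int, "scenarios": int, "rewrite_rate": float}"""
    return (
        raw["scored_checks"] >= min_checks
        and raw["scenarios"] >= min_scenarios
        and guarded["rewrite_rate"] <= REWRITE_CAP
    )

# The gatebridge probe (raw 106/120, guarded rewrite_rate=0.1333) fails on both counts.
```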
Before the full run, use `scripts/benchmark_sft_throughput.py` for short checkpoint-compatible probes of micro-batch, compile, and prefetch settings.
`selected_for_sft.pkl` is now blocked from the canonical SFT path unless its sibling metadata file marks it `mainline_valid: true`.
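A sketch of that guard, assuming the sibling metadata is JSON next to the checkpoint; the actual file name is whatever the pin script writes:

```python
# Sketch: refuse the canonical SFT handoff unless sibling metadata marks it
# mainline_valid: true. The metadata file name here is an assumption.
import json
from pathlib import Path

def assert_mainline_valid(ckpt_path):
    meta_path = Path(ckpt_path).with_suffix(".json")  # hypothetical sibling name
    if not meta_path.exists():
        raise RuntimeError(f"missing pin metadata for {ckpt_path}")
    meta = json.loads(meta_path.read_text())
    if not meta.get("mainline_valid", False):
        raise RuntimeError(f"{ckpt_path} is not marked mainline_valid")

assert_mainline_valid("checkpoints/pretrain_mlx_138m_quality2k/selected_for_sft.pkl")
```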
Run the static dense-mainline audit at any time without touching training:
```bash
source .venv/bin/activate
PYTHONPATH=src python scripts/audit_dense_mainline.py \
  --json-output reports/pretrain_quality2k_review/static_dense_audit.json
```

Run the tests:

```bash
source .venv/bin/activate
pip install pytest
PYTHONPATH=src pytest
```

Optional MLX checkpoint smoke tests (loads weights on GPU; set `ANARCHOBOT_CANONICAL_CKPT` to a v20 candidate once one exists; otherwise the legacy v19 repair pin remains the compatibility smoke default):

```bash
ANARCHOBOT_RUN_MLX_TESTS=1 PYTHONPATH=src pytest -m mlx_checkpoint tests/test_canonical_checkpoint.py
```

Repo-tracked content is source, prompts, configs, tests, docs, and curated evidence.
Runtime artifacts are intentionally untracked:
- continuation checkpoints
- generated shard directories
- runtime reports
- transient build JSONL/message dumps
Preserved historical evidence lives under `legacy_evidence/`.