A no-code desktop app for generating high-quality synthetic datasets to fine-tune LLMs.
Pick categories, set proportions, click Generate — the app handles topic planning, example generation, quality scoring, and export to a ready-to-train JSONL file.
Dataset Generator is a desktop app that automates the full dataset generation pipeline — topic planning, multi-turn conversation generation, quality validation via LLM Judge, deduplication, and HuggingFace Hub upload. No scripts to write, no ML infra to configure.
Under the hood it runs a three-stage engine: instead of a single "generate 100 examples" prompt, the app first decomposes the job into unique topics and outlines, only then generating the actual examples. The result: diverse, coherent data without the repetitive patterns of naive generation.
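A deliberately simplified sketch of that flow, assuming a `call_model(stage, prompt)` helper that stands in for whichever provider and model you configure per stage (this is an illustration of the idea, not the app's actual code):

```python
from typing import Callable

def plan_then_execute(
    category: str,
    n_examples: int,
    call_model: Callable[[str, str], str],  # (stage, prompt) -> model output text
    judge_threshold: int = 70,
) -> list[str]:
    """Stage 1 plans topics, Stage 2 outlines each one, Stage 3 generates and judges."""
    topics = call_model("planner", f"List {n_examples} distinct topics for: {category}").splitlines()
    accepted: list[str] = []
    for topic in topics[:n_examples]:
        outline = call_model("planner", f"Outline a multi-turn conversation about: {topic}")
        example = call_model("generator", f"Write the conversation following this outline:\n{outline}")
        score = int(call_model("judge", f"Score this example from 0 to 100:\n{example}").strip() or "0")
        if score >= judge_threshold:
            accepted.append(example)
    return accepted
```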
Everything stays local. API keys live in SQLite on your device, datasets land in ~/.datasetgenerator/. Talk to OpenRouter for ~300 cloud models, or point the app at a local Ollama / LM Studio / llama.cpp server for fully offline generation — both modes share the same pipeline.
Note on provider terms. Users are responsible for complying with the terms of service of the LLM providers they use through OpenRouter. Some providers restrict using model outputs for training competitive models — check the ToS of your chosen model before generating datasets for fine-tuning.
I recently started fine-tuning open-source LLMs as a hobby, and I build software with AI coding agents. I wanted a simple way to generate training datasets without writing custom scripts every time — pick categories, configure the pipeline, click Generate, get a JSONL ready for training. There are plenty of datasets on HuggingFace, but sometimes you want one tailored to your specific categories and proportions. So I built the tool I wanted to use.
Datasets generated by this app were used to fine-tune Qwen2.5-Coder-7B-Instruct and evaluated against the base model on HumanEval / HumanEval+ (pass@1, average across 5 runs). Every model in the pipeline — topic planner, example generator, and LLM Judge — was open-source (Llama, Qwen, DeepSeek, Mistral via OpenRouter). No proprietary APIs.
| Model | HumanEval | HumanEval+ |
|---|---|---|
| Base Qwen2.5-Coder-7B-Instruct | 55.5% (±2.1) | 49.0% (±1.9) |
| FT V1 (750 samples) | 57.2% (±1.0) | 51.0% (±0.5) |
| FT V2 (this pipeline, 1135 samples) | 60.0% (±0.9) | 54.0% (±1.8) |
+4.5 pts on HumanEval and +5.0 pts on HumanEval+ vs base. The error bars don't overlap, so the improvement is unlikely to be run-to-run noise.
🤗 Artifacts: fine-tuned model · V1 dataset (750 samples) · V2 dataset (1135 samples)
This benchmark validates the pipeline on a coding-focused dataset (multi-turn coding assistance with explanations). The tool itself is domain-agnostic — define any categories (writing, Q&A, math, customer support, etc.) and the same workflow applies. Results depend on category configuration, judge criteria, and model selection — your mileage may vary.
demo-dataset.mp4
Generating 10 examples across 2 categories in ShareGPT format with the LLM Judge enabled.
Actively developed — bug reports and feature requests welcome via Issues. General questions and ideas → Discussions.
- Plan-then-Execute pipeline — three stages (topics → outlines → examples), each can use a different model
- Tests — 460-test suite (unit + integration + E2E), maintained internally
- Cloud + local providers — OpenRouter for ~300 cloud models, plus Ollama / LM Studio / any OpenAI-compatible endpoint for fully offline generation. Mix and match per category (e.g. local generator + cloud judge).
- Per-category configuration — any number of categories with custom proportions, descriptions, and dedicated models
- LLM Judge — a second model scores every example 0–100 against editable criteria; rejected examples are regenerated
- Real-time SSE dashboard — global and per-category progress, live example feed, running cost
- Three export formats — ShareGPT, Alpaca, ChatML (see the examples after this list)
- Multi-turn conversations — 1–5 turns generated coherently in one LLM call
- Actual cost tracking — pulls real `usage` tokens from every response and multiplies by live pricing
- Embedding-based deduplication — cosine similarity over OpenRouter embeddings (sketch after this list)
- Quality Report — judge histogram, token stats, efficiency, export to JSON/CSV
- Dataset history + in-app preview — turn-by-turn rendering, code highlighting, dataset merging
- Reasoning post-process — add first-person `<think>…</think>` rationales to any completed dataset; pick a different model (e.g. cheaper / neutral) than the one that generated the answers
- HuggingFace Hub upload — one-click push to your repo
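For reference, here is the same single-turn example rendered in each of the three dataset export formats. This is a sketch of the common conventions for these formats; the app's exact field layout may differ slightly.

```python
import json

example = {"question": "Reverse a string in Python.",
           "answer": "Use s[::-1], which returns a reversed copy of s."}

# ShareGPT: a "conversations" list of {"from", "value"} turns
sharegpt = {"conversations": [{"from": "human", "value": example["question"]},
                              {"from": "gpt", "value": example["answer"]}]}

# Alpaca: flat instruction / input / output fields
alpaca = {"instruction": example["question"], "input": "", "output": example["answer"]}

# ChatML: a "messages" list of {"role", "content"} turns
chatml = {"messages": [{"role": "user", "content": example["question"]},
                       {"role": "assistant", "content": example["answer"]}]}

# Each record becomes one line of the exported JSONL file
for record in (sharegpt, alpaca, chatml):
    print(json.dumps(record, ensure_ascii=False))
```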
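And a minimal sketch of the deduplication step, assuming each accepted example has already been embedded. The greedy loop and the 0.95 threshold are purely illustrative, not the app's exact algorithm or default.

```python
import numpy as np

def keep_unique(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy filter: keep an example only if its cosine similarity to every
    already-kept example stays below the threshold."""
    # Normalize rows so a dot product equals cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if all(float(vec @ normed[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Rows are embedding vectors of generated examples (toy 2-D values)
vectors = np.array([[0.10, 0.90], [0.11, 0.89], [0.90, 0.10]])
print(keep_unique(vectors))  # [0, 2] -- the second row is a near-duplicate of the first
```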
- Fine-tuning a domain-specific assistant — coding, legal, medical, customer support. The benchmark above is exactly this flow.
- Instruction datasets at any scale — SFT-ready JSONL for models from 7B edge deployments up to 70B+; merge multiple jobs to grow the corpus.
- Experimenting with fine-tuning — quickly test how different category compositions affect model behavior without weeks of data curation.
- Multi-turn conversation datasets — generate 3–5 turn dialogues for training agentic behaviors.
Beyond OpenRouter cloud models, the app talks to any OpenAI-compatible endpoint — Ollama, LM Studio, llama.cpp, vLLM, TGI, or your own server. Run the entire pipeline offline, or mix freely: e.g. local generator + cloud judge, or different models per category.
Setup: start your local server (ollama serve on port 11434, LM Studio's server tab on 1234, etc.), then in the app open Settings → Providers → Auto-detect local. Endpoints are discovered automatically; any custom base URL of the form http://host:port/v1 also works. For fully offline runs, pick a local embedding model (e.g. nomic-embed-text) in Settings → Dedup.
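If you want to sanity-check a local endpoint outside the app, any OpenAI-compatible server should answer a plain GET `/v1/models` request. A small sketch (the ports are the usual Ollama and LM Studio defaults; adjust for your setup):

```python
import json
from urllib.request import urlopen

# Default local ports: Ollama 11434, LM Studio 1234
for base_url in ("http://localhost:11434/v1", "http://localhost:1234/v1"):
    try:
        with urlopen(f"{base_url}/models", timeout=2) as resp:
            data = json.load(resp)
        names = [m["id"] for m in data.get("data", [])]
        print(f"{base_url}: {len(names)} model(s) available {names}")
    except OSError as err:
        print(f"{base_url}: not reachable ({err})")
```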
Dataset generation is more demanding than general chat — the model has to produce strict JSON, follow multi-turn structure, and stay coherent across many examples. A model that's perfectly fine for chat may fail validation here.
| Size | Recommendation | Notes |
|---|---|---|
| <7B (Llama 3.2:3B, etc.) | Not recommended | Frequent JSON validation failures, repetitive content, schema drift |
| 7B–13B (Mistral 7B, Llama 3.1:8B) | Casual use only | Works for experimentation, but expect noticeable skip rate and lower diversity |
| 14B (Qwen2.5-Coder:14B, Qwen3:14B) | Pragmatic minimum | Stable generation, clean output, low skip rate |
| 32B+ (Qwen2.5-Coder-32B, DeepSeek-V3, GLM-4-32B) | Recommended target | Quality approaches cloud providers |
If you don't have the GPU for 14B+, OpenRouter is the better path — same pipeline, no hardware constraint, and open-source models cost anywhere from cents to a few dollars per 1000 examples.
Turn any completed dataset into a reasoning-style training set — every assistant turn gets a first-person internal monologue (<think>…</think>) explaining the thought process before the actual answer. Matches the DeepSeek-R1 / Qwen3-thinking convention out of the box.
Hit Add Reasoning on any completed dataset in /history, pick a model per category (independent from the gen model; a common pattern is a cloud generator plus a local 14B for the reasoning pass, which gives free, neutral rationales), choose a format, and a new reasoning job is created — the source dataset is never modified.
Two export formats:
| Format | What lands in JSONL | When to pick |
|---|---|---|
| Inline | `<think>…</think>` injected at the start of every assistant turn | Broadest compatibility — Axolotl, Unsloth, Llama-Factory, TRL all read it out of the box. Standard since DeepSeek-R1. |
| Separate | Top-level `reasoning: [...]` array (one entry per assistant turn) | Cleaner schema, needs a matching trainer template. Useful for ablation studies or custom training pipelines. |
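For illustration, here is the same assistant turn under each option. The field names follow the descriptions above and the ShareGPT convention; treat them as a sketch, not the app's exact schema.

```python
import json

# Inline: the rationale is prepended to the assistant message itself
inline = {"conversations": [
    {"from": "human", "value": "Reverse a string in Python."},
    {"from": "gpt", "value": "<think>Slice notation is the simplest approach.</think>Use s[::-1], which returns a reversed copy of s."},
]}

# Separate: messages stay clean, rationales live in a top-level array
separate = {"conversations": [
    {"from": "human", "value": "Reverse a string in Python."},
    {"from": "gpt", "value": "Use s[::-1], which returns a reversed copy of s."},
], "reasoning": ["Slice notation is the simplest approach."]}

print(json.dumps(inline, ensure_ascii=False))
print(json.dumps(separate, ensure_ascii=False))
```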
Each format is its own job — re-run on the same source with a different format and you get an independent dataset, no conflict. You can publish both side-by-side on HuggingFace and cross-link them in the model card.
Pitch for fine-tuners: the reasoning pass model is independent from the gen model, so the rationales aren't post-hoc justifications written by the same model that produced the answer — they're a separate analysis, which tends to expose blind spots and reduce style-leak from a single source.
Pre-built binaries are on the Releases page — no Python, no Node.js required.
| Platform | File | Size | Usage |
|---|---|---|---|
| Windows 10/11 (x64) | `DatasetGenerator-windows-x64.zip` | ~100 MB | Extract → double-click `DatasetGenerator.exe` |
| Linux (AppImage) | `DatasetGenerator-x86_64.AppImage` | ~140 MB | `chmod +x` → double-click |
| Linux (tar.gz) | `DatasetGenerator-linux-x64.tar.gz` | ~140 MB | Extract → run `./DatasetGenerator` |
Windows — SmartScreen warning
Unsigned executable: on first run click More info → Run anyway. App data is stored in %APPDATA%\DatasetGenerator\.
Linux AppImage — FUSE on Ubuntu 24.04
```
chmod +x DatasetGenerator-x86_64.AppImage
./DatasetGenerator-x86_64.AppImage
```
If `dlopen(): error loading libfuse.so.2` appears:
```
sudo apt install libfuse2t64   # Ubuntu 24.04+
sudo apt install libfuse2      # Ubuntu 22.04 and older
```
Linux tar.gz — GTK/WebKit requirements
```
tar -xzf DatasetGenerator-linux-x64.tar.gz
cd DatasetGenerator
./DatasetGenerator
```
Requires GTK 3 and WebKit2GTK 4.1 (pre-installed on Ubuntu 24.04+, Fedora 38+). On older systems:
```
sudo apt install libgtk-3-0 libwebkit2gtk-4.1-0
```

To run from source instead of the pre-built binaries:
```
git clone https://github.com/AronDaron/dataset-generator.git
cd dataset-generator

# Backend
cd backend
python3 -m venv venv
./venv/bin/pip install -r requirements.txt
./venv/bin/uvicorn app.main:app --reload --port 8000

# Frontend (new terminal — start from the cloned repo)
cd dataset-generator/frontend
npm install
npm run dev
```
Backend on http://localhost:8000, frontend on http://localhost:3000. Open Settings → enter your OpenRouter API key → pick a category → click Generate.
Windows: replace `./venv/bin/pip` and `./venv/bin/uvicorn` with `venv\Scripts\pip.exe` and `venv\Scripts\uvicorn.exe`.
Requirements: Python 3.10+ (python3-venv, python3-pip), Node.js 20+ (includes npm), git, curl, and an OpenRouter API key (optional if you only use local models via Ollama / LM Studio).
Stack: Next.js 16 + React 19, FastAPI + Pydantic v2, SQLite (aiosqlite), SSE for progress, Pywebview + PyInstaller for packaging.
Is Linux fully supported? Yes — the app ships AppImage and tar.gz builds and all features work cross-platform. That said, day-to-day development and manual testing happen on Windows; Linux builds are verified with automated smoke tests but don't get the same amount of hands-on time. If something feels off on Linux, please open an Issue — I'll take a look.
How much does it cost to generate 1000 examples? Depends on model choice, turn count, and judge strictness. With open-source models available on OpenRouter (Llama 3.x, Qwen 2.5, DeepSeek, Mistral) expect single-digit dollars per 1000 multi-turn examples. Note that the UI shows the cost of accepted examples only — real spend includes rejected and skipped examples plus retries, typically 1.5-2x the displayed cost depending on judge threshold.
Is my API key safe?
Keys are stored locally in SQLite (~/.datasetgenerator/database.sqlite). No telemetry, no remote calls except to OpenRouter and (optionally) HuggingFace Hub. Nothing leaves your machine unless you push a dataset.
Why AGPL-3.0 and not MIT? To prevent closed-source SaaS forks. You're free to use, modify, and self-host — but if you deploy a derivative as a hosted service, your users have the right to receive your source code. Commercial licensing is negotiable — open an Issue or contact me directly.
GNU Affero General Public License v3.0 — see LICENSE.
Strong copyleft: you're free to use, modify, and redistribute, but any derivative work — including SaaS / network-deployed versions — must release its full source under the same license. For proprietary commercial use, open an issue or contact me directly.
