A no-code desktop app for generating high-quality synthetic datasets to fine-tune LLMs.
Pick categories, set proportions, click Generate — the app handles topic planning, example generation, quality scoring, and export to a ready-to-train JSONL file.
Dataset Generator is a desktop app that automates the full dataset generation pipeline — topic planning, multi-turn conversation generation, quality validation via LLM Judge, deduplication, and HuggingFace Hub upload. No scripts to write, no ML infra to configure.
Under the hood it runs a three-stage engine: instead of a single "generate 100 examples" prompt, the app first decomposes the job into unique topics and outlines, only then generating the actual examples. The result: diverse, coherent data without the repetitive patterns of naive generation.
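A deliberately simplified sketch of that flow, assuming a `call_model(stage, prompt)` helper that stands in for whichever provider and model you configure per stage (this is an illustration of the idea, not the app's actual code):

```python
from typing import Callable

def plan_then_execute(
    category: str,
    n_examples: int,
    call_model: Callable[[str, str], str],  # (stage, prompt) -> model output text
    judge_threshold: int = 70,
) -> list[str]:
    """Stage 1 plans topics, Stage 2 outlines each one, Stage 3 generates and judges."""
    topics = call_model("planner", f"List {n_examples} distinct topics for: {category}").splitlines()
    accepted: list[str] = []
    for topic in topics[:n_examples]:
        outline = call_model("planner", f"Outline a multi-turn conversation about: {topic}")
        example = call_model("generator", f"Write the conversation following this outline:\n{outline}")
        score = int(call_model("judge", f"Score this example from 0 to 100:\n{example}").strip() or "0")
        if score >= judge_threshold:
            accepted.append(example)
    return accepted
```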
Everything stays local. API keys live in SQLite on your device, datasets land in ~/.datasetgenerator/. Talk to OpenRouter for ~300 cloud models, or point the app at a local Ollama / LM Studio / llama.cpp server for fully offline generation — both modes share the same pipeline.
Note on provider terms. Users are responsible for complying with the terms of service of the LLM providers they use through OpenRouter. Some providers restrict using model outputs for training competitive models — check the ToS of your chosen model before generating datasets for fine-tuning.
I recently started fine-tuning open-source LLMs as a hobby, and I build software with AI coding agents. I wanted a simple way to generate training datasets without writing custom scripts every time — pick categories, configure the pipeline, click Generate, get a JSONL ready for training. There are plenty of datasets on HuggingFace, but sometimes you want one tailored to your specific categories and proportions. So I built the tool I wanted to use.
Datasets generated by this app were used to fine-tune Qwen2.5-Coder-7B-Instruct and evaluated against the base model on HumanEval / HumanEval+ (pass@1, average across 5 runs). Every model in the pipeline — topic planner, example generator, and LLM Judge — was open-source (Llama, Qwen, DeepSeek, Mistral via OpenRouter). No proprietary APIs.
| Model | HumanEval | HumanEval+ |
|---|---|---|
| Base Qwen2.5-Coder-7B-Instruct | 55.5% (±2.1) | 49.0% (±1.9) |
| FT V1 (750 samples) | 57.2% (±1.0) | 51.0% (±0.5) |
| FT V2 (this pipeline, 1135 samples) | 60.0% (±0.9) | 54.0% (±1.8) |
+4.5 pts on HumanEval and +5.0 pts on HumanEval+ vs base. The error bars don't overlap, so the improvement is unlikely to be run-to-run noise.
🤗 Artifacts: fine-tuned model · V1 dataset (750 samples) · V2 dataset (1135 samples)
This benchmark validates the pipeline on a coding-focused dataset (multi-turn coding assistance with explanations). The tool itself is domain-agnostic — define any categories (writing, Q&A, math, customer support, etc.) and the same workflow applies. Results depend on category configuration, judge criteria, and model selection — your mileage may vary.
demo-dataset.mp4
Generating 10 examples across 2 categories in ShareGPT format with the LLM Judge enabled.
Actively developed — bug reports and feature requests welcome via Issues. General questions and ideas → Discussions.
- Plan-then-Execute pipeline — three stages (topics → outlines → examples), each can use a different model
- Tests — 460-test suite (unit + integration + E2E), maintained internally
- Cloud + local providers — OpenRouter for ~300 cloud models, plus Ollama / LM Studio / any OpenAI-compatible endpoint for fully offline generation. Mix and match per category (e.g. local generator + cloud judge).
- Per-category configuration — any number of categories with custom proportions, descriptions, and dedicated models
- LLM Judge — a second model scores every example 0–100 against editable criteria; rejected examples are regenerated
- Real-time SSE dashboard — global and per-category progress, live example feed, running cost
- Three export formats — ShareGPT, Alpaca, ChatML (see the examples after this list)
- Multi-turn conversations — 1–5 turns generated coherently in one LLM call
- Actual cost tracking — pulls real `usage` tokens from every response and multiplies by live pricing
- Embedding-based deduplication — cosine similarity over OpenRouter embeddings (sketch after this list)
- Quality Report — judge histogram, token stats, efficiency, export to JSON/CSV
- Dataset history + in-app preview — turn-by-turn rendering, code highlighting, dataset merging
- Reasoning post-process — add first-person `<think>…</think>` rationales to any completed dataset; pick a different model (e.g. cheaper / neutral) than the one that generated the answers
- HuggingFace Hub upload — one-click push to your repo
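For reference, here is the same single-turn example rendered in each of the three dataset export formats. This is a sketch of the common conventions for these formats; the app's exact field layout may differ slightly.

```python
import json

example = {"question": "Reverse a string in Python.",
           "answer": "Use s[::-1], which returns a reversed copy of s."}

# ShareGPT: a "conversations" list of {"from", "value"} turns
sharegpt = {"conversations": [{"from": "human", "value": example["question"]},
                              {"from": "gpt", "value": example["answer"]}]}

# Alpaca: flat instruction / input / output fields
alpaca = {"instruction": example["question"], "input": "", "output": example["answer"]}

# ChatML: a "messages" list of {"role", "content"} turns
chatml = {"messages": [{"role": "user", "content": example["question"]},
                       {"role": "assistant", "content": example["answer"]}]}

# Each record becomes one line of the exported JSONL file
for record in (sharegpt, alpaca, chatml):
    print(json.dumps(record, ensure_ascii=False))
```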
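And a minimal sketch of the deduplication step, assuming each accepted example has already been embedded. The greedy loop and the 0.95 threshold are purely illustrative, not the app's exact algorithm or default.

```python
import numpy as np

def keep_unique(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy filter: keep an example only if its cosine similarity to every
    already-kept example stays below the threshold."""
    # Normalize rows so a dot product equals cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if all(float(vec @ normed[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Rows are embedding vectors of generated examples (toy 2-D values)
vectors = np.array([[0.10, 0.90], [0.11, 0.89], [0.90, 0.10]])
print(keep_unique(vectors))  # [0, 2] -- the second row is a near-duplicate of the first
```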
- Fine-tuning a domain-specific assistant — coding, legal, medical, customer support. The benchmark above is exactly this flow.
- Instruction datasets at any scale — SFT-ready JSONL for models from 7B edge deployments up to 70B+; merge multiple jobs to grow the corpus.
- Experimenting with fine-tuning — quickly test how different category compositions affect model behavior without weeks of data curation.
- Multi-turn conversation datasets — generate 3–5 turn dialogues for training agentic behaviors.
Beyond OpenRouter cloud models, the app talks to any OpenAI-compatible endpoint — Ollama, LM Studio, llama.cpp, vLLM, TGI, or your own server. Run the entire pipeline offline, or mix freely: e.g. local generator + cloud judge, or different models per category.
Setup: start your local server (ollama serve on port 11434, LM Studio's server tab on 1234, etc.), then in the app open Settings → Providers → Auto-detect local. Endpoints are discovered automatically; any custom base URL of the form http://host:port/v1 also works. For fully offline runs, pick a local embedding model (e.g. nomic-embed-text) in Settings → Dedup.
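If you want to sanity-check a local endpoint outside the app, any OpenAI-compatible server should answer a plain GET `/v1/models` request. A small sketch (the ports are the usual Ollama and LM Studio defaults; adjust for your setup):

```python
import json
from urllib.request import urlopen

# Default local ports: Ollama 11434, LM Studio 1234
for base_url in ("http://localhost:11434/v1", "http://localhost:1234/v1"):
    try:
        with urlopen(f"{base_url}/models", timeout=2) as resp:
            data = json.load(resp)
        names = [m["id"] for m in data.get("data", [])]
        print(f"{base_url}: {len(names)} model(s) available {names}")
    except OSError as err:
        print(f"{base_url}: not reachable ({err})")
```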
Dataset generation is more demanding than general chat — the model has to produce strict JSON, follow multi-turn structure, and stay coherent across many examples. A model that's perfectly fine for chat may fail validation here.
| Size | Recommendation | Notes |
|---|---|---|
| <7B (Llama 3.2:3B, etc.) | Not recommended | Frequent JSON validation failures, repetitive content, schema drift |
| 7B–13B (Mistral 7B, Llama 3.1:8B) | Casual use only | Works for experimentation, but expect noticeable skip rate and lower diversity |
| 14B (Qwen2.5-Coder:14B, Qwen3:14B) | Pragmatic minimum | Stable generation, clean output, low skip rate |
| 32B+ (Qwen2.5-Coder-32B, DeepSeek-V3, GLM-4-32B) | Recommended target | Quality approaches cloud providers |
If you don't have the GPU for 14B+, OpenRouter is the better path — same pipeline, no hardware constraint, and open-source models cost anywhere from cents to a few dollars per 1000 examples.
Turn any completed dataset into a reasoning-style training set — every assistant turn gets a first-person internal monologue (<think>…</think>) explaining the thought process before the actual answer. Matches the DeepSeek-R1 / Qwen3-thinking convention out of the box.
Hit Add Reasoning on any completed dataset in /history, pick a model per category (independent from the gen model; a common pattern is a cloud generator plus a local 14B for the reasoning pass, which gives free, neutral rationales), choose a format, and a new reasoning job is created — the source dataset is never modified.
Two export formats:
| Format | What lands in JSONL | When to pick |
|---|---|---|
| Inline | `<think>…</think>` injected at the start of every assistant turn | Broadest compatibility — Axolotl, Unsloth, Llama-Factory, TRL all read it out of the box. Standard since DeepSeek-R1. |
| Separate | Top-level `reasoning: [...]` array (one entry per assistant turn) | Cleaner schema, needs a matching trainer template. Useful for ablation studies or custom training pipelines. |
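For illustration, here is the same assistant turn under each option. The field names follow the descriptions above and the ShareGPT convention; treat them as a sketch, not the app's exact schema.

```python
import json

# Inline: the rationale is prepended to the assistant message itself
inline = {"conversations": [
    {"from": "human", "value": "Reverse a string in Python."},
    {"from": "gpt", "value": "<think>Slice notation is the simplest approach.</think>Use s[::-1], which returns a reversed copy of s."},
]}

# Separate: messages stay clean, rationales live in a top-level array
separate = {"conversations": [
    {"from": "human", "value": "Reverse a string in Python."},
    {"from": "gpt", "value": "Use s[::-1], which returns a reversed copy of s."},
], "reasoning": ["Slice notation is the simplest approach."]}

print(json.dumps(inline, ensure_ascii=False))
print(json.dumps(separate, ensure_ascii=False))
```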
Each format is its own job — re-run on the same source with a different format and you get an independent dataset, no conflict. You can publish both side-by-side on HuggingFace and cross-link them in the model card.
Pitch for fine-tuners: the reasoning pass model is independent from the gen model, so the rationales aren't post-hoc justifications written by the same model that produced the answer — they're a separate analysis, which tends to expose blind spots and reduce style-leak from a single source.
Pre-built binaries are on the Releases page — no Python, no Node.js required.
| Platform | File | Size | Usage |
|---|---|---|---|
| Windows 10/11 (x64) | `DatasetGenerator-windows-x64.zip` | ~100 MB | Extract → double-click `DatasetGenerator.exe` |
| Linux (AppImage) | `DatasetGenerator-x86_64.AppImage` | ~140 MB | `chmod +x` → double-click |
| Linux (tar.gz) | `DatasetGenerator-linux-x64.tar.gz` | ~140 MB | Extract → run `./DatasetGenerator` |
Windows — SmartScreen warning
Unsigned executable: on first run click More info → Run anyway. App data is stored in %APPDATA%\DatasetGenerator\.
Linux AppImage — FUSE on Ubuntu 24.04
```
chmod +x DatasetGenerator-x86_64.AppImage
./DatasetGenerator-x86_64.AppImage
```
If `dlopen(): error loading libfuse.so.2` appears:
```
sudo apt install libfuse2t64   # Ubuntu 24.04+
sudo apt install libfuse2      # Ubuntu 22.04 and older
```
Linux tar.gz — GTK/WebKit requirements
```
tar -xzf DatasetGenerator-linux-x64.tar.gz
cd DatasetGenerator
./DatasetGenerator
```
Requires GTK 3 and WebKit2GTK 4.1 (pre-installed on Ubuntu 24.04+, Fedora 38+). On older systems:
```
sudo apt install libgtk-3-0 libwebkit2gtk-4.1-0
```

To run from source instead of the pre-built binaries:
```
git clone https://github.com/AronDaron/dataset-generator.git
cd dataset-generator

# Backend
cd backend
python3 -m venv venv
./venv/bin/pip install -r requirements.txt
./venv/bin/uvicorn app.main:app --reload --port 8000

# Frontend (new terminal — start from the cloned repo)
cd dataset-generator/frontend
npm install
npm run dev
```
Backend on http://localhost:8000, frontend on http://localhost:3000. Open Settings → enter your OpenRouter API key → pick a category → click Generate.
Windows: replace `./venv/bin/pip` and `./venv/bin/uvicorn` with `venv\Scripts\pip.exe` and `venv\Scripts\uvicorn.exe`.
Requirements: Python 3.10+ (python3-venv, python3-pip), Node.js 20+ (includes npm), git, curl, and an OpenRouter API key (optional if you only use local models via Ollama / LM Studio).
Stack: Next.js 16 + React 19, FastAPI + Pydantic v2, SQLite (aiosqlite), SSE for progress, Pywebview + PyInstaller for packaging.
Is Linux fully supported? Yes — the app ships AppImage and tar.gz builds and all features work cross-platform. That said, day-to-day development and manual testing happen on Windows; Linux builds are verified with automated smoke tests but don't get the same amount of hands-on time. If something feels off on Linux, please open an Issue — I'll take a look.
How much does it cost to generate 1000 examples? Depends on model choice, turn count, and judge strictness. With open-source models available on OpenRouter (Llama 3.x, Qwen 2.5, DeepSeek, Mistral) expect single-digit dollars per 1000 multi-turn examples. Note that the UI shows the cost of accepted examples only — real spend includes rejected and skipped examples plus retries, typically 1.5-2x the displayed cost depending on judge threshold.
Is my API key safe?
Keys are stored locally in SQLite (~/.datasetgenerator/database.sqlite). No telemetry, no remote calls except to OpenRouter and (optionally) HuggingFace Hub. Nothing leaves your machine unless you push a dataset.
Why AGPL-3.0 and not MIT? To prevent closed-source SaaS forks. You're free to use, modify, and self-host — but if you deploy a derivative as a hosted service, your users have the right to receive your source code. Commercial licensing is negotiable — open an Issue or contact me directly.
GNU Affero General Public License v3.0 — see LICENSE.
Strong copyleft: you're free to use, modify, and redistribute, but any derivative work — including SaaS / network-deployed versions — must release its full source under the same license. For proprietary commercial use, open an issue or contact me directly.
