ConvoKit is a TypeScript-based framework for ingesting, normalizing, filtering, sampling, formatting, and exporting chat or conversation data for LLM training and analysis. It provides:
- A provider registry to plug in new data sources (Discord, Slack, custom exports, etc.).
- A plugin registry for formatters, converters, and filters to transform and export data to ChatML, Gemini JSONL, custom context formats, and more.
- A fully configurable, extensible pipeline: ingest → normalize → filter → importance-score → sample → format → export.
ConvoKit saves you time building data preprocessing pipelines and lets you focus on models and prompts.
- Key Features
- What It Can & Cannot Do
- Who Should Use It
- Installation
- Quick Start
- Configuration
- CLI Usage
- Provider Registry
- Built-in Providers
- Writing Your Own Provider
- Plugin Registry
- Formatters
- Converters
- Filters
- Writing Your Own Plugin
- Contributing
- License
- Dynamic Provider Loading: Automatically discover and load data providers from your project's providers folder.
- Normalized Conversation Format: All data converges to a ConvoKitConversation interface: metadata plus message arrays.
- Context Formatting: Generate a single, line-delimited training string (CKContext) with options for time gaps, new-conversation markers, and importance scoring.
- Turn-List Conversion: Break context into turn lists (CKTurnListConversation) for sampling or LLM-specific export.
- Weighted Sampling: Sample by conversation importance to focus on high-value exchanges.
- Export Plugins: Export to ChatML JSONL or Gemini JSONL, or add your own converter for other LLM formats.
- Filter Plugins: Drop unwanted messages (e.g. links-only, emoji-only, code-only) via a simple plugin API.
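To make the context-formatting idea concrete, here is a minimal sketch of flattening normalized conversations into one line-delimited string with new-conversation and time-gap markers. The `SketchMessage` and `SketchConversation` shapes, the marker text, and the one-hour threshold are all assumptions for illustration; ConvoKit's actual ConvoKitConversation fields and CKContext output may differ.

```typescript
// Hypothetical shapes; ConvoKit's real ConvoKitConversation may use other fields.
interface SketchMessage {
  author: string;
  content: string;
  timestamp: number; // milliseconds since the Unix epoch
}

interface SketchConversation {
  id: string;
  provider: string;
  messages: SketchMessage[];
}

const HOUR_MS = 60 * 60 * 1000;
const NEW_CONVERSATION_MARKER = "--- new conversation ---"; // assumed marker text

// Emits one line per message, a marker before each conversation, and a
// "[Nh gap]" line whenever at least one hour passed between two messages.
function toContextLines(conversations: SketchConversation[]): string {
  const lines: string[] = [];
  for (const conversation of conversations) {
    lines.push(NEW_CONVERSATION_MARKER);
    let previous: SketchMessage | undefined;
    for (const message of conversation.messages) {
      if (previous !== undefined) {
        const gap = message.timestamp - previous.timestamp;
        if (gap >= HOUR_MS) {
          lines.push(`[${Math.floor(gap / HOUR_MS)}h gap]`);
        }
      }
      lines.push(`${message.author}: ${message.content}`);
      previous = message;
    }
  }
  return lines.join("\n");
}
```

Keeping the markers as plain lines means the result can be written straight to a text file or split again on newlines.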
Can:
- Ingest JSON exports from Discord (via DiscordChatExporter), or any custom source you add via the Provider Registry.
- Normalize and filter conversations by message content, length, or custom rules.
- Score message & conversation importance automatically based on time, length, and frequency.
- Sample highly important conversations to fit a training budget.
- Export to popular LLM chat formats (ChatML, Gemini), or extend it with your own converters.
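As an illustration of importance-weighted sampling, the sketch below draws conversations without replacement, with each draw proportional to an importance score. It is not ConvoKit's implementation: the `Scored` interface, the `weightedSample` name, and the injectable `random` parameter are assumptions made so the example is self-contained and testable.

```typescript
interface Scored {
  id: string;
  importance: number; // assumed to be a non-negative score
}

// Draws up to `size` items without replacement. On every draw, the chance of
// picking an item is proportional to its importance among the remaining items.
function weightedSample<T extends Scored>(
  items: T[],
  size: number,
  random: () => number = Math.random
): T[] {
  const pool = items.filter((item) => item.importance > 0);
  const picked: T[] = [];
  while (picked.length < size && pool.length > 0) {
    const total = pool.reduce((sum, item) => sum + item.importance, 0);
    let remaining = random() * total;
    let index = pool.length - 1; // fallback guards against floating-point drift
    let position = 0;
    for (const item of pool) {
      remaining -= item.importance;
      if (remaining < 0) {
        index = position;
        break;
      }
      position += 1;
    }
    picked.push(...pool.splice(index, 1));
  }
  return picked;
}
```

Passing a seeded generator as `random` makes a sample reproducible, which helps when comparing training runs.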
Cannot:
- Perform LLM inference or model training directly (yet ;)).
- Resolve references across conversations (thread linking across channels).
- Guarantee a perfect import schema for every data source; you may need to write a provider to handle custom formats.
- Handle binary or non-JSON data without extending a provider to preprocess it.
- NLP / ML Engineers preparing chat-based LLM fine-tuning or analysis datasets.
- Bot / Chat Service Developers needing to transform raw chat logs into structured training data.
- Researchers studying conversation dynamics or designing importance-based sampling strategies.
- Community Contributors eager to add support for new platforms or export formats.
- Personality: Generate a deep and comprehensive personality prompt based on your ck_context output.
- Fine-tuning: Fine-tune models with exported training data (currently focused mainly on Gemini; contributions welcome!).
- Model Testing: Test your fine-tuned model from the terminal (currently focused mainly on Gemini; contributions welcome!).
- Unit Tests: Adding unit tests would help keep everything maintainable and stable (or so I've heard).
# Install globally (recommended for CLI use)
npm install -g convokit
# Or install locally in your project
npm install convokit

import { ConvoKit, loadConfig, getConfig } from 'convokit';
import { config } from 'dotenv';

config();
await loadConfig();

async function run() {
  const ck = new ConvoKit();
  // Loads all included providers, plus the providers in LocalProvidersDir if set in the config.
  // Included plugins, plus the plugins in LocalPluginsDir if set in the config, are loaded automatically.
  await ck.loadProviders();
  const convoData = await ck.processDataFromProviders();
  const context = await ck.parseToContext({ targetUsers: getConfig().targetUsers });
  await ck.convertToCKTurnList();
  await ck.getWeightedSample(getConfig().sampleSize);
  const chatml = await ck.exportToChatML(getConfig().systemPrompt);
  const gemini = await ck.exportToGemini(getConfig().systemPrompt);
  // Do whatever you want with the outputs
}

run();

Make sure you have set up your providers and directory structure first.
By default, ConvoKit reads convokit.config.json or environment variables. The supported keys are listed below; an example config file appears at the end of this document.
| Key | Description |
|---|---|
| inputDataDirName | Directory containing raw chat exports (relative to project root). |
| outputDataDirName | Directory to write formatted outputs. |
| targetUsers | JSON array mapping each provider to a target user ID for context generation. |
| sampleSize | Number of conversations to sample by importance. |
| systemPrompt | System prompt used in ChatML/Gemini exports. |
| minImportanceChat (optional) | Minimum average importance score for a conversation (default: 120). |
| minImportanceMessage (optional) | Minimum importance score for a single message (default: 100). |
| enableDebugging (optional) | Enable or disable debug-level logs. |
| enablePerformanceStats (optional) | Enable or disable performance stats (timers). |
| shouldMergeConsecutiveMessages (optional) | Merge consecutive messages when converting to CKTurnList. |
| enableWarnings (optional) | Toggle the display of warning messages. |
| anonymizeProviderConversationIds (optional) | Anonymize provider conversation IDs to protect sensitive data. |
| localProviderDirectory (optional) | Directory name to load custom providers from. |
| localPluginDirectory (optional) | Directory name to load custom plugins from. It contains a folder for each plugin type (formatters, filters, converters). |
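The defaults in the table can be pictured as a small resolution step applied to whatever was read from convokit.config.json. The sketch below covers only a subset of the keys; `SketchConfig`, `requireKey`, and `resolveConfig` are hypothetical names, the TypeScript types are assumptions, and ConvoKit's own loadConfig() may validate differently.

```typescript
// Subset of the documented keys; the TypeScript types are assumptions.
interface SketchConfig {
  inputDataDirName: string;
  outputDataDirName: string;
  sampleSize: number;
  systemPrompt: string;
  minImportanceChat: number;
  minImportanceMessage: number;
}

// Fails loudly when a key without a documented default is absent.
function requireKey<T>(value: T | undefined, key: string): T {
  if (value === undefined) {
    throw new Error(`Missing required config key: ${key}`);
  }
  return value;
}

// Fills in the documented defaults (120 and 100) for the optional thresholds.
function resolveConfig(raw: Partial<SketchConfig>): SketchConfig {
  return {
    inputDataDirName: requireKey(raw.inputDataDirName, "inputDataDirName"),
    outputDataDirName: requireKey(raw.outputDataDirName, "outputDataDirName"),
    sampleSize: requireKey(raw.sampleSize, "sampleSize"),
    systemPrompt: requireKey(raw.systemPrompt, "systemPrompt"),
    minImportanceChat: raw.minImportanceChat ?? 120,
    minImportanceMessage: raw.minImportanceMessage ?? 100,
  };
}
```

Using `??` rather than `||` keeps an explicit threshold of 0 instead of silently replacing it with the default.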
In your convokit.config.json file you set an inputDataDirName. That directory needs one subdirectory per provider, and each subdirectory holds that provider's exported data.
Example for use with the Discord provider, with inputDataDirName set to InputData:
convokit/
├── index.ts
├── convokit.config.json
├── ... other files and folders
└── InputData
    └── discord
        └── Direct Messages - fishylunar [000000000000000].json
Note: the filenames of the exported data don't matter, but the file extension does.
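One way to picture the "extension matters, filename does not" rule is a selection step over the files found in the input directory. This is a sketch under assumptions: `selectInputFiles` is a hypothetical helper, paths are taken to use forward slashes, and the `InputDataInfo` shape is modeled on the directoryName and fileExtension fields of a provider's ProviderInfo.

```typescript
// Modeled on the InputDataInfo fields of a provider's ProviderInfo object.
interface InputDataInfo {
  directoryName: string;
  fileExtension: string;
}

// Keeps only the paths inside `<inputDir>/<directoryName>/` whose extension
// matches the provider's expected extension (case-insensitive).
function selectInputFiles(paths: string[], inputDir: string, info: InputDataInfo): string[] {
  const prefix = `${inputDir}/${info.directoryName}/`;
  const extension = info.fileExtension.toLowerCase();
  return paths.filter(
    (path) => path.startsWith(prefix) && path.toLowerCase().endsWith(extension)
  );
}
```

Any filename is accepted as long as the file sits in the provider's directory and carries the expected extension.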
ConvoKit provides a command-line interface (CLI) for running the processing pipeline without writing TypeScript code. Ensure you have a valid convokit.config.json file in your project root or have set the corresponding environment variables.
Running Commands:
# If installed globally
convokit <command> [options]
# If installed locally, using npx
npx convokit <command> [options]
# Or via package.json script
# "scripts": { "ck": "convokit" }
# npm run ck -- <command> [options]

Common Options:
- -p, --providers <ids>: Specify a comma-separated list of provider IDs (e.g., discord,telegram) to process data from. If omitted, ConvoKit attempts to use data from all registered providers found in your inputDataDirName.
- -o, --output <file>: Specify an output file path to save the results of commands like context or export. If omitted, results are generated but not saved to a file (stats/logs are still shown).
Commands:
- create-config (alias: cfg): Creates an example convokit.config.json file in the current directory. Run this first if you don't have a config file.
  convokit create-config
- providers: Lists all registered providers (built-in and local) found by ConvoKit, including their ID, name, version, and expected input directory/extension. Useful for verifying provider setup and getting IDs for the --providers option.
  convokit providers
- plugins: Lists all registered plugins (formatters, converters, filters), including built-in and local ones. Shows plugin ID, name, and version. Useful for finding the <converter_id> for the export command.
  convokit plugins
- context: Processes data from specified (or all) providers and generates the CKContext output based on your configuration (targetUsers, importance scores, etc.).
  # Generate context from all providers and save to context.txt
  convokit context -o context.txt
  # Generate context using only 'discord' provider data and save
  convokit context --providers discord -o discord_context.txt
  # Generate context from all providers and save to context.json including stats
  convokit context -o context.json --stats
- export <converter_id>: Runs the full pipeline: loads data, processes it, generates context, converts to turn list, performs weighted sampling (using sampleSize from config), and finally exports the data using the specified <converter_id>.
  # Export data using the 'chatml' converter, save to chatml_export.jsonl
  convokit export chatml -o chatml_export.jsonl
  # Export using 'gemini' converter from 'telegram' provider only, save output
  convokit export gemini --providers telegram -o telegram_gemini.jsonl
Example Workflow:
# 1. Create a config file if you don't have one
convokit create-config
# (Edit convokit.config.json with your settings: input dir, target users, etc.)
# 2. Check which providers are available
convokit providers
# Output might show: ID: discord, ID: telegram
# 3. Check available export formats (converters)
convokit plugins
# Output might show Converters: ID: chatml, ID: gemini
# 4. Run the full export pipeline for ChatML using all providers
convokit export chatml -o training_data.jsonl
# 5. (Alternative) Generate only the CKContext for analysis
convokit context -o analysis_context.json

ConvoKit discovers providers in the providers folder via ProviderRegistry. Each provider must:
- Implement ConvoKitProvider with Test() and Convert().
- Export a static ProviderInfo object.
- Register itself via ProviderRegistry.register(id, ProviderClass, ProviderInfo).
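The three requirements above can be sketched as a tiny registry that stores provider classes by ID and only calls Convert() when Test() accepts the raw data. Everything here is a stand-in for illustration: `SketchRegistry`, `SketchProvider`, `SketchConversation`, and `EchoProvider` are hypothetical and do not reflect ConvoKit's actual ProviderRegistry internals.

```typescript
// Hypothetical stand-ins for ConvoKit's real types.
interface SketchConversation {
  id: string;
  messages: { author: string; content: string }[];
}

interface SketchProvider {
  Test(): boolean;
  Convert(): SketchConversation;
}

interface SketchProviderInfo {
  name: string;
  version: string;
}

type ProviderConstructor = new (raw: unknown) => SketchProvider;

class SketchRegistry {
  private entries = new Map<string, { ctor: ProviderConstructor; info: SketchProviderInfo }>();

  register(id: string, ctor: ProviderConstructor, info: SketchProviderInfo): void {
    if (this.entries.has(id)) {
      throw new Error(`Provider "${id}" is already registered`);
    }
    this.entries.set(id, { ctor, info });
  }

  // Returns the converted conversation, or null when Test() rejects the data.
  convert(id: string, raw: unknown): SketchConversation | null {
    const entry = this.entries.get(id);
    if (entry === undefined) {
      throw new Error(`Unknown provider "${id}"`);
    }
    const provider = new entry.ctor(raw);
    return provider.Test() ? provider.Convert() : null;
  }
}

// A toy provider that accepts plain strings only.
class EchoProvider implements SketchProvider {
  private raw: unknown;
  constructor(raw: unknown) {
    this.raw = raw;
  }
  Test(): boolean {
    return typeof this.raw === "string";
  }
  Convert(): SketchConversation {
    return { id: "echo", messages: [{ author: "me", content: String(this.raw) }] };
  }
}
```

Running Test() before Convert() lets a registry skip files that do not match a provider's schema instead of failing halfway through a conversion.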
- Discord (providers/discord.ts): Reads JSON exports from DiscordChatExporter.
- Telegram (providers/telegram.ts): Reads JSON exports from the Telegram Desktop app.
Contributions are more than welcome! <3
- Create /providers/MyPlatform.ts.
  To make a local provider, put the MyPlatform.ts file in the LocalProvidersDir you specified in your config. If you are contributing a provider to be included in ConvoKit, put it in /providers/MyPlatform.ts.
- Define your data schema, compatibility check, and conversion:
export const ProviderInfo = {
  name: "MyPlatform Exporter",
  description: "Imports MyPlatform chat JSON.",
  version: "1.0.0",
  author: "You",
  InputDataInfo: { directoryName: "MyPlatform", fileExtension: ".json" }
};

export class Provider implements ConvoKitProvider {
  constructor(private raw: any) {}

  Test(): boolean {
    // Return true if raw matches your schema.
    // Placeholder check: replace it with a real schema test.
    return typeof this.raw === "object" && this.raw !== null;
  }

  Convert(): ConvoKitConversation {
    // Transform raw → ConvoKitConversation.
    throw new Error("Convert() is not implemented yet");
  }
}

// Self-register
ProviderRegistry.register("myplatform", Provider, ProviderInfo);

- Place your exports in InputData/MyPlatform/*.json.
- Run ck.loadProviders() and ck.processDataFromProviders() to include your data.
Plugins extend ConvoKit's pipeline at three points:
- Formatters (formatters)
- Converters (converters)
- Filters (filters)
They self-register via PluginRegistry.registerFormatter/Converter/Filter().
- Context Formatter (id: context): Builds the CKContext string with importance and markers.
- ChatML Converter (id: chatml): Exports LLM ChatML JSONL.
- Gemini Converter (id: gemini): Exports Gemini-style JSONL.
- LinkOnlyFilter (id: link-only): Excludes messages that are URLs only.
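A link-only check like the one described for LinkOnlyFilter could look roughly like the sketch below. The regular expression, the `isLinkOnly` and `applyMustNot` names, and the reading of MUST_NOT as "drop the message when the predicate matches" are assumptions, not the built-in filter's actual code.

```typescript
// Matches content made up solely of one or more http(s) URLs separated by whitespace.
const LINKS_ONLY = /^\s*(?:https?:\/\/\S+\s*)+$/i;

function isLinkOnly(content: string): boolean {
  return LINKS_ONLY.test(content);
}

// Assumed MUST_NOT semantics: a message is dropped when the predicate matches it.
function applyMustNot(messages: string[], predicate: (content: string) => boolean): string[] {
  return messages.filter((message) => !predicate(message));
}
```

A message that mixes text with a link, such as "look: https://example.com", survives the filter because it carries content beyond the URL.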
- Formatters
  export class MyFormatter implements FormatterPluginClass {
    PluginInfo = { id: "myfmt", name: "...", type: "formatter", version: "1.0.0" };
    apply(data, options) { /* return CKContextResult */ }
  }
  PluginRegistry.registerFormatter(MyFormatter);
- Converters
  export class MyConverter implements ConverterPluginClass {
    PluginInfo = { id: "myconv", name: "...", type: "converter", version: "1.0.0" };
    async apply(convs, prompt) { /* return string[] */ }
  }
  PluginRegistry.registerConverter(MyConverter);
- Filters
  export class MyFilter implements FilterPluginClass {
    PluginInfo = { id: "myfilter", name: "...", type: "filter", version: "1.0.0" };
    filterType: 'MUST' | 'MUST_NOT' = 'MUST_NOT';
    apply(content) { /* return boolean */ }
  }
  PluginRegistry.registerFilter(MyFilter);
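For the converter template above, the body of apply() might resemble the following sketch, which merges consecutive turns from the same role and emits one JSON line per conversation. The `Turn` shape, the function names, and the output layout (a `messages` array with a leading system entry, as commonly used for chat fine-tuning JSONL) are assumptions; the built-in chatml converter may produce something different.

```typescript
type Role = "user" | "assistant";

interface Turn {
  role: Role;
  content: string;
}

// Joins back-to-back turns from the same role so that roles alternate.
// The input array and its objects are left untouched.
function mergeConsecutive(turns: Turn[]): Turn[] {
  const merged: Turn[] = [];
  for (const turn of turns) {
    const last = merged[merged.length - 1];
    if (last !== undefined && last.role === turn.role) {
      last.content += "\n" + turn.content;
    } else {
      merged.push({ role: turn.role, content: turn.content });
    }
  }
  return merged;
}

// Returns one JSON string per conversation, ready to be written line by line.
function toChatJsonl(conversations: Turn[][], systemPrompt: string): string[] {
  return conversations.map((turns) =>
    JSON.stringify({
      messages: [{ role: "system", content: systemPrompt }, ...mergeConsecutive(turns)],
    })
  );
}
```

Returning string[] mirrors the template's return comment, so the lines can be joined with "\n" and written to a .jsonl file.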
Contributions are very welcome!
- Suggest a feature via GitHub Issues.
- Report bugs or raise PRs to fix them.
- Add new providers (Slack, Teams, custom exports).
- Write plugins for new formats or filters.
This project is licensed under the MIT License.
Feel free to use, modify, and distribute as you see fit!
Example convokit.config.json:
{
  "inputDataDirName": "InputData",
  "outputDataDirName": "OutputData",
  "targetUsers": [
    { "providerId": "discord", "id": "YOUR_DISCORD_USER_ID" }
  ],
  "sampleSize": 5000,
  "systemPrompt": "You are a helpful assistant.",
  "minImportanceChat": 120,
  "minImportanceMessage": 100,
  "enableDebugging": false,
  "enablePerformanceStats": false,
  "shouldMergeConsecutiveMessages": true,
  "enableWarnings": true,
  "anonymizeProviderConversationIds": false,
  "localProvidersDir": "LocalProviders",
  "localPluginsDir": "LocalPlugins"
}