Looking for the JS/TS version? Check out LangChain.js.
pip install langchain-text-splitters
LangChain Text Splitters contains utilities for splitting a wide variety of text documents into chunks.
For full documentation, see the API reference.
See our Releases and Versioning policies.
We encourage pinning to a specific version so that your CI does not break when we publish new tests. We recommend upgrading to the latest version periodically to make sure you have the latest tests.
Not pinning your version will ensure you always have the latest tests, but it may also break your CI if we introduce tests that your integration doesn't pass.
As an open-source project in a rapidly developing field, we are extremely open to contributions, whether it be in the form of a new feature, improved infrastructure, or better documentation.
For detailed information on how to contribute, see the Contributing Guide.
Splitting text by looking at characters.
Splitting text by recursively looking at characters.
Splitting text into tokens using a sentence model tokenizer.
Attempts to split the text along Python syntax.
Attempts to split the text along Markdown-formatted headings.
Splitting markdown files based on specified headers.
Line type as TypedDict.
Header type as TypedDict.
An experimental text splitter for handling Markdown syntax.
Splitting text using the KoNLPy package.
Splits JSON data into smaller, structured chunks while preserving hierarchy.
Attempts to split the text along LaTeX-formatted layout elements.
Splitting text using the spaCy package.
Element type as TypedDict.
Split HTML content into structured Documents based on specified headers.
Splitting HTML files based on specified tags and font sizes.
Split HTML content preserving semantic structure.
Splitting text using the NLTK package.
Text splitter that handles React (JSX), Vue, and Svelte code.
Interface for splitting text into chunks.
Splitting text into tokens using a model tokenizer.
Enum of the programming languages.
Tokenizer data class.
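To make the idea of "splitting text by recursively looking at characters" more concrete, here is a minimal standard-library sketch of that technique. It is not the package's implementation: the function name `recursive_split`, its parameters, and the default separator order are assumptions chosen for illustration, and features such as chunk overlap are left out. The sketch tries the coarsest separator first and only falls back to finer ones for pieces that are still too long.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Hypothetical helper for illustration only; it does not reproduce
    the behavior of any class in langchain-text-splitters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(text) <= chunk_size:
        return [text] if text else []

    # No usable separator left: fall back to fixed-width slices.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(separator) if p]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        # Greedily merge neighboring pieces while they still fit.
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph is short.\n\n"
    "Second paragraph is a good deal longer and must be broken on spaces.\n\n"
    "Third."
)
for chunk in recursive_split(sample, chunk_size=40):
    print(repr(chunk))
```

Paragraph breaks are preferred as split points, so the short paragraphs survive intact, and only the long middle paragraph is broken on spaces. A production splitter would typically also support overlapping chunks and configurable length functions.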