tokenizer
stable
Text tokenization utilities for NLP and ML preprocessing: split text into tokens, build vocabularies, encode/decode token ID sequences, generate n-grams, pad or truncate sequences, and more.
use plugin tokenizer::{tokenize_whitespace, tokenize_words, char_tokenize, …}
Functions (17)
- tokenize_whitespace Split text on whitespace into tokens
- tokenize_words Split text into alphanumeric word tokens
- char_tokenize Split text into individual character tokens
- ngrams Generate n-gram groups from a token list
- vocabulary Build a frequency-sorted vocab from tokens
- pad_sequence Pad or truncate a token list to a fixed length
- truncate Truncate a token list to a max length
- detokenize Join tokens back into a string
- word_count Count whitespace-separated words in text
- unique_tokens Remove duplicate tokens preserving order
- token_count Count the number of tokens in a list
- lowercase Lowercase all tokens in a list
- sentence_tokenize Split text into sentences
- encode Map tokens to IDs using a vocabulary table
- decode Map IDs back to tokens using a vocabulary table
- build_vocab Build a token-to-ID mapping from a token list
- split_on Split text by an arbitrary delimiter
Overview
tokenizer is a dependency-free toolkit for the text-preprocessing stages of an
NLP or ML pipeline. It has no opaque tokenizer object or hidden state: every
function takes plain values — a string, or a token list represented as an ordinary
array of strings — and returns a fresh value, so tokens flow through a chain of
transforms exactly like any other data. Use it when you need to split raw text
into tokens, normalize them, build a vocabulary, turn tokens into integer ID
sequences (and back), or shape those sequences into the fixed-length inputs a
model expects.
The mental model is a pipeline: start from text with tokenize_whitespace,
tokenize_words, char_tokenize, sentence_tokenize, or split_on; clean the
tokens with lowercase, unique_tokens, truncate, or pad_sequence; derive a
vocabulary with build_vocab or vocabulary; convert with encode / decode;
and join back to text with detokenize.
Common patterns
Normalize text into a clean, deduplicated token list:
use plugin tokenizer::{tokenize_words, lowercase, unique_tokens, token_count}
let tokens = tokenize_words("The cat sat. The CAT ran!")
let vocab = unique_tokens(lowercase(tokens))
print("distinct words: {token_count(vocab)}")
print(vocab[1])
Encode text into a fixed-length ID sequence for a model:
use plugin tokenizer::{tokenize_whitespace, build_vocab, encode, pad_sequence}
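let tokens = tokenize_whitespace("the quick brown fox")
// Pad the tokens first so the pad token gets its own ID in the vocabulary
let fixed = pad_sequence(tokens, 8, "<PAD>")
let vocab = build_vocab(fixed)
// Encoding the padded list yields an ID sequence of exactly 8 entries
let ids = encode(fixed, vocab, -1)
print(ids[1])
print(ids[8])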
Round-trip tokens through IDs and back to text:
use plugin tokenizer::{tokenize_whitespace, build_vocab, encode, decode, detokenize}
let tokens = tokenize_whitespace("hello world hello")
let vocab = build_vocab(tokens)
let ids = encode(tokens, vocab, -1)
// decode needs an ID-to-token table (integer keys), the inverse of vocab
let back = decode(ids, #{0: "hello", 1: "world"}, "<UNK>")
print(detokenize(back, " "))
Split text on whitespace into tokens
Splits the input text on any whitespace (spaces, tabs, newlines), returning a list of non-empty token strings. This is the fastest general-purpose tokenizer for English prose.
use plugin tokenizer::{tokenize_whitespace}
let tokens = tokenize_whitespace("hello world foo")
print(tokens[1])
print(tokens[2])
It collapses runs of mixed whitespace, so tabs and newlines split just like spaces:
use plugin tokenizer::{tokenize_whitespace, token_count}
let tokens = tokenize_whitespace("one\ttwo\nthree four")
print(token_count(tokens))
Split text into alphanumeric word tokens
Splits text into word tokens by keeping only alphanumeric characters and underscores, discarding punctuation and symbols. Suitable for NLP preprocessing where punctuation should be ignored.
use plugin tokenizer::{tokenize_words, token_count}
let tokens = tokenize_words("Hello, world! It's great.")
// produces: ["Hello", "world", "It", "s", "great"]
print(token_count(tokens))
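As a rough reference for the splitting rule described above, the Python sketch below treats a word as a maximal run of letters, digits, and underscores. The helper name `tokenize_words_sketch` and the ASCII-only character class are assumptions made for illustration; how the plugin treats non-ASCII letters is not stated here.

```python
import re

def tokenize_words_sketch(text: str) -> list[str]:
    # Keep maximal runs of ASCII letters, digits, and underscores; everything
    # else (punctuation, symbols, whitespace) acts as a separator.
    return re.findall(r"[A-Za-z0-9_]+", text)

tokens = tokenize_words_sketch("Hello, world! It's great.")
print(tokens)       # ['Hello', 'world', 'It', 's', 'great']
print(len(tokens))  # 5
```

Note that the apostrophe splits "It's" into two tokens, which matches the output shown in the example above.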
Split text into individual character tokens
Splits the text into a list of individual Unicode characters. Useful for character-level language models or when operating on scripts that don't use spaces.
use plugin tokenizer::{char_tokenize}
let chars = char_tokenize("abc")
print(chars[1])
print(chars[2])
print(chars[3])
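For comparison, character-level splitting can be sketched in Python as below. The helper name `char_tokenize_sketch` is hypothetical, and the sketch assumes that "character" means a single Unicode code point; whether the plugin groups combining sequences into one token is not specified here. Python lists are 0-indexed, unlike the 1-indexed lists in the examples above.

```python
def char_tokenize_sketch(text: str) -> list[str]:
    # Iterating over a str yields one element per Unicode code point
    # (assumption: the plugin also splits per code point).
    return list(text)

print(char_tokenize_sketch("abc"))     # ['a', 'b', 'c']
print(char_tokenize_sketch("日本語"))  # ['日', '本', '語']
```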
Generate n-gram groups from a token list
Generates all contiguous n-grams from a token list. Each n-gram is itself a list of n consecutive tokens. Returns an empty list if the token list is shorter than n.
use plugin tokenizer::{tokenize_whitespace, ngrams}
let tokens = tokenize_whitespace("the quick brown fox")
let bigrams = ngrams(tokens, 2)
let first = bigrams[1]
print("{first[1]} {first[2]}")
Raise n to build trigrams (or any window size) over the same tokens:
use plugin tokenizer::{tokenize_whitespace, ngrams}
let tokens = tokenize_whitespace("a b c d e")
let trigrams = ngrams(tokens, 3)
let g = trigrams[1]
print("{g[1]} {g[2]} {g[3]}")
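The sliding-window behavior described above can be sketched in Python as follows. The helper name `ngrams_sketch` is hypothetical, and returning an empty list for a non-positive `n` is an assumption; only the "shorter than n" case is documented.

```python
def ngrams_sketch(tokens: list[str], n: int) -> list[list[str]]:
    if n <= 0:
        return []  # assumption: a non-positive n yields no n-grams
    # Slide a window of width n across the list. range() is empty when the
    # list is shorter than n, so the result is then an empty list.
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

tokens = "a b c d e".split()
print(ngrams_sketch(tokens, 3))  # [['a', 'b', 'c'], ['b', 'c', 'd'], ['c', 'd', 'e']]
print(ngrams_sketch(tokens, 6))  # []
```

A list of k tokens produces k - n + 1 n-grams, so five tokens give four bigrams or three trigrams.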
Build a frequency-sorted vocab from tokens
Counts the frequency of each token and returns a frequency-sorted list of {token, count} tables (most frequent first). Useful for inspecting corpus statistics.
use plugin tokenizer::{tokenize_whitespace, vocabulary}
let tokens = tokenize_whitespace("the cat sat on the mat the cat")
let vocab = vocabulary(tokens)
let top = vocab[1]
print("{top["token"]}: {top["count"]}")
Because the result is ordered by frequency, you can walk the top entries to report the most common tokens in a corpus:
use plugin tokenizer::{tokenize_words, lowercase, vocabulary}
let tokens = lowercase(tokenize_words("Go go GO stop go stop"))
let stats = vocabulary(tokens)
print("{stats[1]["token"]} x{stats[1]["count"]}")
print("{stats[2]["token"]} x{stats[2]["count"]}")
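To make the shape of the result concrete, here is a minimal Python sketch of frequency counting and sorting. The helper name `vocabulary_sketch` is hypothetical, and the tie-breaking rule (equal counts keep first-seen order) is an assumption, since the ordering of ties is not documented above.

```python
from collections import Counter

def vocabulary_sketch(tokens: list[str]) -> list[dict]:
    # Counter.most_common() sorts by descending count; tokens with equal
    # counts keep first-seen order (assumed tie-breaking, not documented).
    return [{"token": t, "count": c} for t, c in Counter(tokens).most_common()]

stats = vocabulary_sketch("go go go stop go stop".split())
print(stats[0])  # {'token': 'go', 'count': 4}
print(stats[1])  # {'token': 'stop', 'count': 2}
```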
Pad or truncate a token list to a fixed length
Pads the token list with pad_value until it reaches max_len, or truncates it if it is already longer. Used to create fixed-length inputs for neural network models.
use plugin tokenizer::{tokenize_whitespace, pad_sequence}
let tokens = tokenize_whitespace("hello world")
let padded = pad_sequence(tokens, 5, "<PAD>")
print(padded[3])
print(padded[5])
When the list is already longer than max_len, pad_sequence truncates it
instead, guaranteeing the output is always exactly max_len tokens:
use plugin tokenizer::{tokenize_whitespace, pad_sequence, token_count}
let tokens = tokenize_whitespace("one two three four five")
let fixed = pad_sequence(tokens, 3, "<PAD>")
print(token_count(fixed))
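Both branches (pad when short, truncate when long) can be expressed as one Python sketch. The helper name `pad_sequence_sketch` is hypothetical, and the sketch assumes padding is appended on the right and that `max_len` is non-negative.

```python
def pad_sequence_sketch(tokens: list[str], max_len: int, pad_value: str) -> list[str]:
    # Truncate first, then append as many pad values as are still missing.
    # Assumption: padding goes at the end of the list.
    head = tokens[:max_len]
    return head + [pad_value] * (max_len - len(head))

print(pad_sequence_sketch(["hello", "world"], 5, "<PAD>"))
# ['hello', 'world', '<PAD>', '<PAD>', '<PAD>']
print(pad_sequence_sketch(["one", "two", "three", "four", "five"], 3, "<PAD>"))
# ['one', 'two', 'three']
```

Either way the output has exactly `max_len` entries, which is what a fixed-size model input needs.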
Truncate a token list to a max length
Returns the first max_len tokens from the list, dropping the rest. If the list already has max_len tokens or fewer, all of its tokens are returned unchanged.
use plugin tokenizer::{tokenize_whitespace, truncate, token_count}
let tokens = tokenize_whitespace("one two three four five six")
let short = truncate(tokens, 3)
print(token_count(short))
Join tokens back into a string
Joins a token list back into a single string using separator between tokens. Defaults to a single space if separator is omitted.
use plugin tokenizer::{tokenize_whitespace, lowercase, detokenize}
let tokens = tokenize_whitespace("Hello World")
let lower = lowercase(tokens)
let text = detokenize(lower, " ")
print(text)
Count whitespace-separated words in text
Counts whitespace-separated words in a raw string without producing a token list. Faster than tokenize_whitespace followed by token_count when you only need the count.
use plugin tokenizer::{word_count}
print(word_count("the quick brown fox"))
Remove duplicate tokens preserving order
Removes duplicate tokens from the list while preserving the original insertion order. The first occurrence of each token is kept.
use plugin tokenizer::{tokenize_whitespace, unique_tokens, token_count}
let tokens = tokenize_whitespace("a b a c b d")
let uniq = unique_tokens(tokens)
print(token_count(uniq))
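Order-preserving deduplication, as described above, can be sketched in Python like this. The helper name `unique_tokens_sketch` is hypothetical and only illustrates the "first occurrence wins" rule.

```python
def unique_tokens_sketch(tokens: list[str]) -> list[str]:
    # dict preserves insertion order, so fromkeys() keeps the first
    # occurrence of each token and ignores later repeats.
    return list(dict.fromkeys(tokens))

print(unique_tokens_sketch("a b a c b d".split()))  # ['a', 'b', 'c', 'd']
```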
Count the number of tokens in a list
Returns the number of tokens in a token list. Equivalent to checking the list length.
use plugin tokenizer::{tokenize_whitespace, token_count}
let tokens = tokenize_whitespace("one two three")
print(token_count(tokens))
Lowercase all tokens in a list
Returns a new token list with every token converted to lowercase. Does not modify the original list.
use plugin tokenizer::{tokenize_whitespace, lowercase}
let tokens = tokenize_whitespace("Hello World FOO")
let lower = lowercase(tokens)
print(lower[1])
Split text into sentences
Splits text into sentences by breaking on ., !, and ?. Ellipses (...) are treated as continuations and not split. Returns a list of sentence strings.
use plugin tokenizer::{sentence_tokenize}
let sentences = sentence_tokenize("Hello world. How are you? Fine!")
print(sentences[1])
print(sentences[2])
Map tokens to IDs using a vocabulary table
Maps each token to an integer ID using a vocab table (string keys, integer values). Tokens not found in the vocabulary are mapped to unk_id (default -1).
use plugin tokenizer::{tokenize_whitespace, build_vocab, encode}
let tokens = tokenize_whitespace("the cat sat")
let vocab = build_vocab(tokens)
let ids = encode(tokens, vocab, -1)
print(ids[1])
Unknown tokens fall back to unk_id. Here a hand-built vocabulary leaves
"fox" out, so it encodes to the chosen out-of-vocabulary id:
use plugin tokenizer::{tokenize_whitespace, encode}
let tokens = tokenize_whitespace("the fox")
let ids = encode(tokens, #{"the": 1}, 99)
print(ids[1])
print(ids[2])
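The lookup-with-fallback rule can be sketched in Python as a dictionary lookup with a default. The helper name `encode_sketch` is hypothetical; the default of -1 mirrors the documented default for unk_id.

```python
def encode_sketch(tokens: list[str], vocab: dict[str, int], unk_id: int = -1) -> list[int]:
    # dict.get() returns unk_id for any token missing from the table.
    return [vocab.get(tok, unk_id) for tok in tokens]

print(encode_sketch(["the", "fox"], {"the": 1}, 99))  # [1, 99]
print(encode_sketch(["the", "fox"], {"the": 1}))      # [1, -1]
```

Choosing an unk_id that no real token uses keeps out-of-vocabulary tokens distinguishable from known ones after encoding.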
Map IDs back to tokens using a vocabulary table
Maps a list of integer IDs back to tokens using a vocab table (integer keys, string values). IDs not found in the vocabulary are replaced with unk_token (default "<UNK>").
use plugin tokenizer::{tokenize_whitespace, build_vocab, encode, decode}
let tokens = tokenize_whitespace("hello world")
let vocab = build_vocab(tokens)
let ids = encode(tokens, vocab, -1)
// decode needs an ID-to-token table (integer keys), the inverse of vocab
let back = decode(ids, #{0: "hello", 1: "world"}, "<UNK>")
print(back[1])
Build a token-to-ID mapping from a token list
Assigns a unique integer ID to each distinct token in the list, in order of first appearance. Returns a table mapping token strings to integer IDs, suitable for use with encode.
use plugin tokenizer::{tokenize_whitespace, build_vocab, encode}
let tokens = tokenize_whitespace("the quick brown fox the quick")
let vocab = build_vocab(tokens)
let ids = encode(tokens, vocab, -1)
print(ids[1])
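The first-appearance ID assignment can be sketched in Python as below, together with the inverted table that decoding requires. The helper name `build_vocab_sketch` is hypothetical, and starting IDs at 0 is an assumption inferred from the hand-written decode tables in this document.

```python
def build_vocab_sketch(tokens: list[str]) -> dict[str, int]:
    vocab: dict[str, int] = {}
    for tok in tokens:
        # Assign the next free ID the first time a token is seen.
        # Assumption: IDs start at 0.
        vocab.setdefault(tok, len(vocab))
    return vocab

tokens = "the quick brown fox the quick".split()
vocab = build_vocab_sketch(tokens)
inverse = {idx: tok for tok, idx in vocab.items()}  # ID -> token, for decoding
ids = [vocab[t] for t in tokens]

print(vocab)  # {'the': 0, 'quick': 1, 'brown': 2, 'fox': 3}
print(ids)    # [0, 1, 2, 3, 0, 1]
print([inverse.get(i, "<UNK>") for i in ids] == tokens)  # True
```

Repeated tokens reuse their existing ID, so the mapping has one entry per distinct token, and inverting it gives the integer-keyed table a decoder needs.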
Split text by an arbitrary delimiter
Splits the text at every occurrence of delimiter (which can be multiple characters) and returns the parts as a token list. Unlike tokenize_whitespace, empty parts are preserved.
use plugin tokenizer::{split_on}
let parts = split_on("a,b,,c", ",")
print(parts[1])
print(parts[3])
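The difference between delimiter splitting and whitespace splitting can be illustrated with Python's built-in `str.split`, which behaves the same way in both modes. The helper name `split_on_sketch` is hypothetical and is not the plugin's implementation.

```python
def split_on_sketch(text: str, delimiter: str) -> list[str]:
    # str.split() with an explicit delimiter keeps empty parts, including
    # those produced by adjacent or trailing delimiters.
    return text.split(delimiter)

print(split_on_sketch("a,b,,c", ","))     # ['a', 'b', '', 'c']
print(split_on_sketch("k1::v1::", "::"))  # ['k1', 'v1', '']
print("a  b".split())                     # ['a', 'b'] (whitespace mode drops empties)
```

Preserving empty parts matters for delimited records such as CSV rows, where an empty field still occupies a column position.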