tokenizer

stable

Text tokenization utilities for NLP and ML preprocessing: split text into tokens, build vocabularies, encode/decode token ID sequences, generate n-grams, pad or truncate sequences, and more.

use plugin tokenizer::{tokenize_whitespace, tokenize_words, char_tokenize, …}

17 functions AI & ML

Functions (17)

tokenize_whitespace Split text on whitespace into tokens
tokenize_words Split text into alphanumeric word tokens
char_tokenize Split text into individual character tokens
ngrams Generate n-gram groups from a token list
vocabulary Build a frequency-sorted vocab from tokens
pad_sequence Pad or truncate a token list to a fixed length
truncate Truncate a token list to a max length
detokenize Join tokens back into a string
word_count Count whitespace-separated words in text
unique_tokens Remove duplicate tokens preserving order
token_count Count the number of tokens in a list
lowercase Lowercase all tokens in a list
sentence_tokenize Split text into sentences
encode Map tokens to IDs using a vocabulary table
decode Map IDs back to tokens using a vocabulary table
build_vocab Build a token-to-ID mapping from a token list
split_on Split text by an arbitrary delimiter

Overview

tokenizer is a dependency-free toolkit for the text-preprocessing stages of an NLP or ML pipeline. It has no opaque tokenizer object or hidden state: every function takes plain values — a string, or a token list represented as an ordinary array of strings — and returns a fresh value, so tokens flow through a chain of transforms exactly like any other data. Use it when you need to split raw text into tokens, normalize them, build a vocabulary, turn tokens into integer ID sequences (and back), or shape those sequences into the fixed-length inputs a model expects.

The mental model is a pipeline: start from text with tokenize_whitespace, tokenize_words, char_tokenize, sentence_tokenize, or split_on; clean the tokens with lowercase, unique_tokens, truncate, or pad_sequence; derive a vocabulary with build_vocab or vocabulary; convert with encode / decode; and join back to text with detokenize.

Common patterns

Normalize text into a clean, deduplicated token list:

use plugin tokenizer::{tokenize_words, lowercase, unique_tokens, token_count}

let tokens = tokenize_words("The cat sat. The CAT ran!")
let vocab = unique_tokens(lowercase(tokens))
print("distinct words: {token_count(vocab)}")
print(vocab[1])

Encode text into a fixed-length ID sequence for a model:

use plugin tokenizer::{tokenize_whitespace, build_vocab, encode, pad_sequence}

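let tokens = tokenize_whitespace("the quick brown fox")
let vocab = build_vocab(tokens)
// pad first, then encode, so the result is a fixed-length list of IDs
let fixed = pad_sequence(tokens, 8, "<PAD>")
let ids = encode(fixed, vocab, -1)
print(ids[1])
// "<PAD>" is not in vocab, so padded positions encode to the unknown ID -1
print(ids[8])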

Round-trip tokens through IDs and back to text:

use plugin tokenizer::{tokenize_whitespace, build_vocab, encode, decode, detokenize}

let tokens = tokenize_whitespace("hello world hello")
let vocab = build_vocab(tokens)
let ids = encode(tokens, vocab, -1)
// decode takes an ID-to-token table, the inverse of the table build_vocab returns
let back = decode(ids, #{0: "hello", 1: "world"}, "<UNK>")
print(detokenize(back, " "))

tokenize_whitespace(text) → table

Split text on whitespace into tokens

Splits the input text on any whitespace (spaces, tabs, newlines), returning a list of non-empty token strings. This is the fastest general-purpose tokenizer for English prose.

use plugin tokenizer::{tokenize_whitespace}

let tokens = tokenize_whitespace("hello world  foo")
print(tokens[1])
print(tokens[2])

It collapses runs of mixed whitespace, so tabs and newlines split just like spaces:

use plugin tokenizer::{tokenize_whitespace, token_count}

let tokens = tokenize_whitespace("one\ttwo\nthree   four")
print(token_count(tokens))
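For readers coming from Python, the behavior described above resembles `str.split()` called with no arguments. The sketch below is only a model of the documented behavior, not the plugin's implementation; the function name is reused purely for readability.

```python
def tokenize_whitespace(text: str) -> list[str]:
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and never yields empty strings.
    return text.split()


print(tokenize_whitespace("one\ttwo\nthree   four"))
# ['one', 'two', 'three', 'four']
```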

tokenize_words(text) → table

Split text into alphanumeric word tokens

Splits text into word tokens by keeping only alphanumeric characters and underscores, discarding punctuation and symbols. Suitable for NLP preprocessing where punctuation should be ignored.

use plugin tokenizer::{tokenize_words, token_count}

let tokens = tokenize_words("Hello, world! It's great.")
// produces: ["Hello", "world", "It", "s", "great"]
print(token_count(tokens))
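One way to picture the rule "keep alphanumerics and underscores, drop everything else" is a regular expression over word characters. The Python sketch below is an approximation under that assumption; the plugin's exact character classes (for example, how it treats non-ASCII letters) are not documented here.

```python
import re


def tokenize_words(text: str) -> list[str]:
    # \w+ matches runs of letters, digits, and underscores; punctuation,
    # symbols, and whitespace all act as separators and are discarded.
    return re.findall(r"\w+", text)


print(tokenize_words("Hello, world! It's great."))
# ['Hello', 'world', 'It', 's', 'great']
```

The apostrophe in "It's" is a separator like any other punctuation, which is why the contraction splits into two tokens.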

char_tokenize(text) → table

Split text into individual character tokens

Splits the text into a list of individual Unicode characters. Useful for character-level language models or when operating on scripts that don't use spaces.

use plugin tokenizer::{char_tokenize}

let chars = char_tokenize("abc")
print(chars[1])
print(chars[2])
print(chars[3])
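"Individual Unicode characters" can mean code points or user-perceived characters, and the description does not say which. The Python sketch below assumes code points, which is what iterating a Python string yields; treat it as an illustration of that caveat rather than a statement about the plugin.

```python
def char_tokenize(text: str) -> list[str]:
    # Iterating a Python str yields one Unicode code point at a time.
    return list(text)


print(char_tokenize("abc"))           # ['a', 'b', 'c']
print(len(char_tokenize("e\u0301")))  # 2: base letter plus combining accent
```

If the accented letter must stay one token, normalizing the text to a composed form before tokenizing is one option.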

ngrams(tokens, n) → table

Generate n-gram groups from a token list

Generates all contiguous n-grams from a token list. Each n-gram is itself a list of n consecutive tokens. Returns an empty list if the token list is shorter than n.

use plugin tokenizer::{tokenize_whitespace, ngrams}

let tokens = tokenize_whitespace("the quick brown fox")
let bigrams = ngrams(tokens, 2)
let first = bigrams[1]
print("{first[1]} {first[2]}")

Raise n to build trigrams (or any window size) over the same tokens:

use plugin tokenizer::{tokenize_whitespace, ngrams}

let tokens = tokenize_whitespace("a b c d e")
let trigrams = ngrams(tokens, 3)
let g = trigrams[1]
print("{g[1]} {g[2]} {g[3]}")
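The sliding-window idea behind n-grams can be sketched in a few lines of Python. This is a model of the documented behavior, not the plugin's code; the guard against non-positive n is an added assumption, since the description does not cover that case.

```python
def ngrams(tokens: list[str], n: int) -> list[list[str]]:
    if n <= 0:
        # Assumption: a non-positive window size is treated as an error.
        raise ValueError("n must be positive")
    # Slide a window of width n over the list. range() is empty when
    # len(tokens) < n, which gives the documented empty result.
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


print(ngrams(["the", "quick", "brown", "fox"], 2))
# [['the', 'quick'], ['quick', 'brown'], ['brown', 'fox']]
print(ngrams(["a"], 2))
# []
```

A list of k tokens yields k - n + 1 n-grams, so five tokens produce three trigrams.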

vocabulary(tokens) → table

Build a frequency-sorted vocab from tokens

Counts the frequency of each token and returns a frequency-sorted list of {token, count} tables (most frequent first). Useful for inspecting corpus statistics.

use plugin tokenizer::{tokenize_whitespace, vocabulary}

let tokens = tokenize_whitespace("the cat sat on the mat the cat")
let vocab = vocabulary(tokens)
let top = vocab[1]
print("{top["token"]}: {top["count"]}")

Because the result is ordered by frequency, you can walk the top entries to report the most common tokens in a corpus:

use plugin tokenizer::{tokenize_words, lowercase, vocabulary}

let tokens = lowercase(tokenize_words("Go go GO stop go stop"))
let stats = vocabulary(tokens)
print("{stats[1]["token"]} x{stats[1]["count"]}")
print("{stats[2]["token"]} x{stats[2]["count"]}")
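The count-then-sort behavior can be modeled in Python with `collections.Counter`. This is a sketch of the documented result shape only. How the plugin orders tokens with equal counts is not stated, so the first-seen tie order below is an assumption. Note also that Python lists are 0-indexed, while the examples above index from 1.

```python
from collections import Counter


def vocabulary(tokens: list[str]) -> list[dict]:
    # most_common() sorts by count, highest first; tokens with equal
    # counts stay in first-seen order (assumed, not documented).
    return [{"token": t, "count": c} for t, c in Counter(tokens).most_common()]


stats = vocabulary("the cat sat on the mat the cat".split())
print(stats[0])  # {'token': 'the', 'count': 3}
print(stats[1])  # {'token': 'cat', 'count': 2}
```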

pad_sequence(tokens, max_len, pad_value) → table

Pad or truncate a token list to a fixed length

Pads the token list with pad_value until it reaches max_len, or truncates it if it is already longer. Used to create fixed-length inputs for neural network models.

use plugin tokenizer::{tokenize_whitespace, pad_sequence}

let tokens = tokenize_whitespace("hello world")
let padded = pad_sequence(tokens, 5, "<PAD>")
print(padded[3])
print(padded[5])

When the list is already longer than max_len, pad_sequence truncates it instead, guaranteeing the output is always exactly max_len tokens:

use plugin tokenizer::{tokenize_whitespace, pad_sequence, token_count}

let tokens = tokenize_whitespace("one two three four five")
let fixed = pad_sequence(tokens, 3, "<PAD>")
print(token_count(fixed))
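Both branches, padding and truncating, reduce to "keep at most max_len items, then fill the shortfall". The Python sketch below models that; padding on the right is an assumption consistent with the examples above, not something the description states explicitly.

```python
def pad_sequence(tokens: list, max_len: int, pad_value) -> list:
    # Keep at most max_len items, then append padding to fill any shortfall.
    kept = tokens[:max_len]
    return kept + [pad_value] * (max_len - len(kept))


print(pad_sequence(["hello", "world"], 5, "<PAD>"))
# ['hello', 'world', '<PAD>', '<PAD>', '<PAD>']
print(pad_sequence(["one", "two", "three", "four", "five"], 3, "<PAD>"))
# ['one', 'two', 'three']
```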

truncate(tokens, max_len) → table

Truncate a token list to a max length

Returns the first max_len tokens from the list, dropping the rest. If the list already has max_len or fewer tokens, it is returned unchanged.

use plugin tokenizer::{tokenize_whitespace, truncate, token_count}

let tokens = tokenize_whitespace("one two three four five six")
let short = truncate(tokens, 3)
print(token_count(short))

detokenize(tokens, separator) → string

Join tokens back into a string

Joins a token list back into a single string using separator between tokens. Defaults to a single space if separator is omitted.

use plugin tokenizer::{tokenize_whitespace, lowercase, detokenize}

let tokens = tokenize_whitespace("Hello World")
let lower = lowercase(tokens)
let text = detokenize(lower, " ")
print(text)

word_count(text) → int

Count whitespace-separated words in text

Counts whitespace-separated words in a raw string without producing a token list. Faster than tokenize_whitespace followed by token_count when you only need the count.

use plugin tokenizer::{word_count}

print(word_count("the quick brown fox"))

unique_tokens(tokens) → table

Remove duplicate tokens preserving order

Removes duplicate tokens from the list while preserving the original insertion order. The first occurrence of each token is kept.

use plugin tokenizer::{tokenize_whitespace, unique_tokens, token_count}

let tokens = tokenize_whitespace("a b a c b d")
let uniq = unique_tokens(tokens)
print(token_count(uniq))
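Order-preserving deduplication is the property that distinguishes this from converting to a set. A Python model of the documented behavior (not the plugin's implementation) looks like this:

```python
def unique_tokens(tokens: list[str]) -> list[str]:
    # dict preserves insertion order, so fromkeys() keeps the first
    # occurrence of each token and drops later repeats.
    return list(dict.fromkeys(tokens))


print(unique_tokens("a b a c b d".split()))
# ['a', 'b', 'c', 'd']
```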

token_count(tokens) → int

Count the number of tokens in a list

Returns the number of tokens in a token list. Equivalent to checking the list length.

use plugin tokenizer::{tokenize_whitespace, token_count}

let tokens = tokenize_whitespace("one two three")
print(token_count(tokens))

lowercase(tokens) → table

Lowercase all tokens in a list

Returns a new token list with every token converted to lowercase. Does not modify the original list.

use plugin tokenizer::{tokenize_whitespace, lowercase}

let tokens = tokenize_whitespace("Hello World FOO")
let lower = lowercase(tokens)
print(lower[1])

sentence_tokenize(text) → table

Split text into sentences

Splits text into sentences by breaking on ., !, and ?. Ellipses (...) are treated as continuations and not split. Returns a list of sentence strings.

use plugin tokenizer::{sentence_tokenize}

let sentences = sentence_tokenize("Hello world. How are you? Fine!")
print(sentences[1])
print(sentences[2])
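The splitting rule can be approximated with a regular expression. The Python sketch below rests on several assumptions the description does not confirm: terminators stay attached to their sentence, a boundary requires whitespace after the terminator, and only a three-dot ellipsis is exempt. The plugin may differ on any of these.

```python
import re

# Split at whitespace that follows ., !, or ?, unless the three characters
# immediately before that whitespace are an ellipsis.
_BOUNDARY = re.compile(r"(?<=[.!?])(?<!\.\.\.)\s+")


def sentence_tokenize(text: str) -> list[str]:
    return [s for s in _BOUNDARY.split(text.strip()) if s]


print(sentence_tokenize("Hello world. How are you? Fine!"))
# ['Hello world.', 'How are you?', 'Fine!']
print(sentence_tokenize("Wait... what? Yes."))
# ['Wait... what?', 'Yes.']
```

Rule-based splitting like this will also break on abbreviations such as "e.g. this", which is worth checking against real input before relying on it.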

encode(tokens, vocab, unk_id) → table

Map tokens to IDs using a vocabulary table

Maps each token to an integer ID using a vocab table (string keys, integer values). Tokens not found in the vocabulary are mapped to unk_id (default -1).

use plugin tokenizer::{tokenize_whitespace, build_vocab, encode}

let tokens = tokenize_whitespace("the cat sat")
let vocab = build_vocab(tokens)
let ids = encode(tokens, vocab, -1)
print(ids[1])

Unknown tokens fall back to unk_id. Here a hand-built vocabulary leaves "fox" out, so it encodes to the chosen out-of-vocabulary id:

use plugin tokenizer::{tokenize_whitespace, encode}

let tokens = tokenize_whitespace("the fox")
let ids = encode(tokens, #{"the": 1}, 99)
print(ids[1])
print(ids[2])
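The lookup-with-fallback rule maps directly onto a dictionary lookup with a default. A Python model of the documented behavior, with a plain dict standing in for the vocab table:

```python
def encode(tokens: list[str], vocab: dict[str, int], unk_id: int = -1) -> list[int]:
    # dict.get returns unk_id for any token missing from the vocabulary.
    return [vocab.get(tok, unk_id) for tok in tokens]


print(encode(["the", "fox"], {"the": 1}, 99))
# [1, 99]
```

Choose an unk_id that no real token uses; otherwise unknown tokens become indistinguishable from a vocabulary entry after encoding.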

decode(ids, vocab, unk_token) → table

Map IDs back to tokens using a vocabulary table

Maps a list of integer IDs back to tokens using a vocab table (integer keys, string values). IDs not found in the vocabulary are replaced with unk_token (default "<UNK>").

use plugin tokenizer::{tokenize_whitespace, build_vocab, encode, decode}

let tokens = tokenize_whitespace("hello world")
let vocab = build_vocab(tokens)
let ids = encode(tokens, vocab, -1)
// the table passed to decode is keyed by ID, so it is the inverse of vocab
let back = decode(ids, #{0: "hello", 1: "world"}, "<UNK>")
print(back[1])
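Because encode expects token-to-ID and decode expects ID-to-token, a round trip needs the table flipped. The Python sketch below models that; `invert_vocab` is a hypothetical helper introduced for illustration and is not one of the plugin's functions.

```python
def invert_vocab(vocab: dict[str, int]) -> dict[int, str]:
    # Flip token -> ID into ID -> token (hypothetical helper, not in the plugin).
    return {idx: tok for tok, idx in vocab.items()}


def decode(ids: list[int], id_to_token: dict[int, str], unk_token: str = "<UNK>") -> list[str]:
    # IDs missing from the table fall back to unk_token.
    return [id_to_token.get(i, unk_token) for i in ids]


vocab = {"hello": 0, "world": 1}
print(decode([0, 1, 0, 7], invert_vocab(vocab)))
# ['hello', 'world', 'hello', '<UNK>']
```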

build_vocab(tokens) → table

Build a token-to-ID mapping from a token list

Assigns a unique integer ID to each distinct token in the list, in order of first appearance. Returns a table mapping token strings to integer IDs, suitable for use with encode.

use plugin tokenizer::{tokenize_whitespace, build_vocab, encode}

let tokens = tokenize_whitespace("the quick brown fox the quick")
let vocab = build_vocab(tokens)
let ids = encode(tokens, vocab, -1)
print(ids[1])
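First-appearance numbering can be modeled in Python as below. Starting the IDs at 0 is an assumption, inferred from the decode examples that use 0 and 1 as keys; the description itself does not state the starting value.

```python
def build_vocab(tokens: list[str]) -> dict[str, int]:
    vocab: dict[str, int] = {}
    for tok in tokens:
        # setdefault assigns an ID only the first time a token is seen;
        # len(vocab) is the next unused ID, so numbering starts at 0 (assumed).
        vocab.setdefault(tok, len(vocab))
    return vocab


print(build_vocab("the quick brown fox the quick".split()))
# {'the': 0, 'quick': 1, 'brown': 2, 'fox': 3}
```

Repeated tokens keep the ID from their first occurrence, so the mapping is stable regardless of how often a token recurs.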

split_on(text, delimiter) → table

Split text by an arbitrary delimiter

Splits the text at every occurrence of delimiter (which can be multiple characters) and returns the parts as a token list. Unlike tokenize_whitespace, empty parts are preserved.

use plugin tokenizer::{split_on}

let parts = split_on("a,b,,c", ",")
print(parts[1])
print(parts[3])
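The difference from whitespace tokenization, preserved empty parts, matches Python's `str.split` when it is given an explicit separator. A small model of the documented behavior, not the plugin's code:

```python
def split_on(text: str, delimiter: str) -> list[str]:
    # With an explicit separator, str.split keeps the empty strings that
    # appear between adjacent delimiters or after a trailing one.
    return text.split(delimiter)


print(split_on("a,b,,c", ","))      # ['a', 'b', '', 'c']
print(split_on("k1::v1::", "::"))   # ['k1', 'v1', '']
```

Preserving empty parts keeps field positions stable, which matters when splitting delimited records where a missing value is meaningful.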
