I had a realization today about a common flaw in how concepts are taught, a flaw that often leaves me feeling unsatisfied. It’s the tendency to present knowledge as a menu of options rather than a story of cause and effect.
My case in point was learning about tokenization in large language models (LLMs). Most explanations follow a standard script:
- The Menu: There are three main types: character-level, word-level, and subword-level tokenization.
- The Comparison: Each has a list of pros and cons.
- The Verdict: Subword tokenization is presented as the “best” option, a clever compromise between vocabulary size and the ability to handle new words.
This framing is neat but hollow. It portrays tokenization as a set of arbitrary choices, making “subword” seem like a convenient middle ground rather than an inevitable solution. What’s missing is the generative path—the fundamental reason why subwords must exist.
From First Principles: Tokenization as Compression
To truly understand tokenization, we must first see LLMs for what they are: statistical compressors of text. Their goal is to minimize prediction error, and in doing so, they build a compressed model of language’s structure—its syntax, morphology, and semantics. Tokenization is simply the first step: finding efficient units for this compression.
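To make the “compressor” framing concrete, here is a minimal sketch (the probabilities below are invented purely for illustration) of why prediction and compression are the same objective: a model that assigns probability p to the token that actually occurs needs about -log2(p) bits to encode it.

```python
import math

# Toy probabilities a model might assign to each token that actually occurred
# in a short sequence (made-up numbers, for illustration only).
predicted_probs = [0.50, 0.25, 0.80, 0.10, 0.60]

# Average code length in bits if an ideal coder used these probabilities.
bits_per_token = -sum(math.log2(p) for p in predicted_probs) / len(predicted_probs)

print(f"average bits per token: {bits_per_token:.2f}")
# Sharper predictions (higher probability on what actually occurs) mean fewer
# bits: minimizing prediction error and compressing the text are one goal.
```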
- Primitives: We could, in theory, use characters. A character-level model can learn everything, since all text is ultimately built from characters.
- Pressure: However, this is incredibly inefficient. A character-level model must learn from scratch that “play” and “played” are related, or that “-ing” is a common verb ending. The learning burden is enormous, and the required sequence lengths become unmanageable.
- Mechanism: Subword tokenization (like BPE or SentencePiece) solves this with a data-driven compression algorithm. It starts with characters and iteratively merges the most frequent adjacent pairs. Patterns like “ing,” “est,” “play,” and even “New_York” naturally emerge from the statistical properties of the text itself (see the sketch after this list).
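Here is a minimal sketch of that merge loop in plain Python (the function and toy corpus are my own; real BPE and SentencePiece implementations add many details such as byte fallback, pre-tokenization, and special tokens):

```python
from collections import Counter

def learn_merges(words, num_merges):
    """Greedy BPE-style learning: start from characters and repeatedly
    merge the most frequent adjacent pair of symbols."""
    # Each word is a tuple of symbols, starting at the character level.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

# Toy corpus: related word forms share structure for the merges to find.
words = ["play", "played", "playing", "playing", "jump", "jumped", "jumping"]
print(learn_merges(words, 8))
```

On this toy corpus the first merges stitch together units like “play,” “ing,” and “jump”: nobody listed those as tokens; they simply fall out of counting and merging frequent pairs.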
This reveals the real picture: subword tokenization isn’t just a “third option.” It’s the emergent result of applying compression to characters to overcome the inefficiency of pure character-level modeling and the rigidity of word-level tokenization (which fails on unknown words). The resulting units—characters, morphemes, words, or phrases—are not arbitrary; they are the natural consequence of the compression process.
The Deeper Problem: Teaching the “What,” Not the “Why”
This brings us to the broader pedagogical issue. We are too often shown the final products—the categories, formulas, and neatly organized “three options”—instead of the process that makes the result inevitable. We memorize lists of pros and cons but aren’t guided along the path from primitive → generative → emergent.
The result is a polished surface with a conceptual gap hidden underneath. You can repeat the argument and solve routine problems, but you can’t rebuild the idea from scratch when the context changes, because you were never shown how it was built in the first place.
A Better Way to Teach 💡
A more robust teaching method would follow the path of discovery itself.
- Start with Primitives: Identify the simplest building blocks and constraints. For tokenization, it’s characters and the computational cost of long sequences.
- Expose the Driving Mechanism: Reveal the core process that generates complexity. For tokenization, it’s compressing text by merging frequent pairs.
- Let the Concept Emerge: Show that the formal concept is the only stable outcome. Subwords emerge because compression and language statistics naturally form reusable units.
- Introduce Categories and Shortcuts: Only now, introduce the formal names and formulas. They no longer feel like items on a menu but like concise summaries of a path you can retrace at any time.
- Test for Reconstructability: The ultimate test of understanding is not recitation but reconstruction. Can the learner apply the logic, not just the label, to a new problem? For instance, how would the tokenization logic apply to a new language or script? (A small sketch of this follows the list.)
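One way to run that reconstruction test is to point the same pair-counting logic at an unfamiliar script and check whether the reasoning still holds. The Greek words below are just an arbitrary toy corpus; nothing in the logic is specific to English.

```python
from collections import Counter

# Toy corpus in another script (inflected forms of the Greek verb "to play"),
# chosen only to show the logic is script-agnostic; any Unicode text works.
words = ["παίζω", "παίζεις", "παίζει", "παίζουμε"]

# Count adjacent character pairs, exactly as in the earlier sketch.
pairs = Counter()
for word in words:
    for a, b in zip(word, word[1:]):
        pairs[(a, b)] += 1

# The most frequent pairs sit inside the shared stem "παίζ-", so the first
# merges would again carve out a reusable, morpheme-like unit.
print(pairs.most_common(5))
```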
A Template for Deeper Teaching
Here’s a compact model for putting this into practice:
- Primitives: Name the raw materials and their limitations.
- Goal/Pressure: State the core challenge that needs to be solved (e.g., long sequences, out-of-vocabulary words).
- Mechanism: Describe the key operation that resolves the pressure (e.g., merge frequent pairs).
- Emergence: Show how the recognizable concept (subwords) naturally falls out from this process.
- Compression: Present the final formula or category as a shorthand for the path just walked.
- Rebuild: Pose an unfamiliar problem and ask the learner to regenerate the idea from first principles.
Good teaching doesn’t start with a menu of answers. It starts with the inevitability of the answer, showing how a few primitive facts and one core mechanism, when set in motion, make the concept fall out on its own. Once you reveal that path, the formalism stops being arbitrary. It becomes something the learner can reconstruct.