0.5 Tokenization

project-notes 2026-06-13

Tokenization

Tokenization is a term for how a computer assigns a numerical value to an associated word. Because a deep learning model is unable to comprehend text directly, it must convert the text into a numerical approximation. Each word is assigned to a unique numerical ID from a vocabulary.

But this is often not enough. If a user inputs a misspelt word - or heavens forbid, a jumble of misspelt unstructured sentences - the Model will not be able to understand the intended instruction from a sentence. Therefore, a technique called Subword tokenization is used to breakdown the words into smaller “Tokens”.