Classical Text Representation
Bag of Words, N-grams, and TF-IDF — the frequency-based family of text representation methods. Results benchmarked on 50,000 IMDb reviews. These methods require no GPU and no training, making them the essential fast baseline before reaching for neural embeddings.
Bag of Words (BoW) is the simplest text representation. Each document becomes a count vector: how many times does each vocabulary word appear? Word order is completely ignored; the document is treated as a "bag" of words with no sequence.
Build a global vocabulary from all documents. Each document is then a vector of length = |vocabulary|, where each entry is the word count in that document.
| Review | film | brilliant | terrible | not | good |
|---|---|---|---|---|---|
| "This film was brilliant" | 1 | 1 | 0 | 0 | 0 |
| "This film was not good" | 1 | 0 | 0 | 1 | 1 |
| "Terrible film, not good" | 1 | 0 | 1 | 1 | 1 |
| Review | True Label | BoW Predicts | Analysis |
|---|---|---|---|
| "This film was absolutely brilliant and I loved it" | Positive | ✓ Positive | Keywords 'brilliant', 'loved' → strong positive signal |
| "Terrible film, a complete waste of time and money" | Negative | ✓ Negative | Keywords 'terrible', 'waste' → strong negative signal |
| "The film was not good at all" | Negative | ✗ Positive | 'not', 'good' seen separately — ambiguous signal |
| "Not bad, actually quite enjoyable" | Positive | ✓ Positive | 'enjoyable' compensates — correct by chance! |
Strengths:
• Quick to implement and use
• Easy to interpret: you can see exactly which words drive the score
• No training required
• The ~99% sparse matrix is handled efficiently by any standard linear classifier
Weaknesses:
• Negation invisible: "not good" ≈ "good" for BoW
• Synonyms unlinked: "great" and "excellent" are separate features
• Order lost: "John beats Paul" = "Paul beats John"
• High dimensionality: vocabulary-sized vectors are memory-hungry if stored densely
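The order-blindness is easy to verify; a minimal sketch with `CountVectorizer` showing that the two "John beats Paul" sentences get identical vectors:

```python
from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer()
X = v.fit_transform(["John beats Paul", "Paul beats John"]).toarray()

# Same words, same counts: BoW cannot tell the two sentences apart
assert (X[0] == X[1]).all()
```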
| Use Case | Example | Why BoW Works |
|---|---|---|
| Anti-spam filter | Gmail, Outlook | "lottery", "click here", "win" → spam. Keywords are enough. |
| Article categorisation | BBC News, Reuters | "football", "goal" → Sport. Domain-specific vocabulary. |
| Document search | Elasticsearch, Solr | Query = doc, TF-IDF similarity. Fast on millions of docs. |
An N-gram is a sequence of N consecutive words. By including multi-word sequences (bigrams, trigrams), we capture some local context that pure BoW misses — especially negation patterns.
| N-gram Type | IMDb Accuracy (50k) | Notes |
|---|---|---|
| Unigram (1) | 86.18% | Same as basic BoW |
| Bigram (2) | ~88–89% | Captures 2-word phrases and simple negation |
| Trigram (3) | 90.11% | Best for sentence-level patterns |
Strengths:
• Fast and interpretable
• Handles negation: "not good", "really bad" captured as a unit
• Easy to implement on top of a BoW pipeline
Weaknesses:
• Vocabulary explosion: up to V² bigrams and V³ trigrams, memory intensive
• Still blind to synonyms and polysemy
• Long-range dependencies still missed
TF-IDF (Term Frequency – Inverse Document Frequency) improves BoW by weighting words: a word is important if it is frequent in this document but rare in the rest of the collection. This naturally downweights stop words and upweights distinctive vocabulary.
The weight combines term frequency with inverse document frequency; in the smoothed form used by scikit-learn (which keeps ubiquitous words near 1 rather than 0, matching the ~1.1 stop-word values measured below):

$$\text{tfidf}(t, d) = tf(t, d) \times \text{idf}(t), \qquad \text{idf}(t) = \ln\frac{1 + N}{1 + df(t)} + 1$$

$N$: total documents · $df(t)$: documents containing term $t$ · High IDF = rare = informative
| Word Type | Example | Real IDF (50k corpus) | Role |
|---|---|---|---|
| Stop word | 'is', 'it', 'in' | ~1.1 | Nearly ignored — noise |
| Neutral word | 'film', 'movie' | ~3–4 | Low weight: common across almost all reviews |
| Discriminatory (+) | 'brilliant', 'masterpiece' | High IDF | Strong positive signal |
| Discriminatory (−) | 'terrible', 'awful' | High IDF | Strong negative signal |
- IDF reduces stop-word noise ("is", "the" get near-zero weight automatically)
- Adding n-grams lifts accuracy by +3.94% over unigram BoW (86.18% → 90.11%) by capturing phrases like "not good" and "absolutely brilliant"
- Distinctive words like 'brilliant', 'masterpiece', 'terrible', 'awful' receive high weight
- 5-fold cross-validation confirms stability (0.9026 ± 0.0044)
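A minimal sketch of the benchmarked setup: TF-IDF with bigrams feeding a linear classifier. The IMDb loading step is omitted and a tiny toy corpus stands in, so no accuracy numbers are claimed here:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the 50k IMDb reviews (1 = positive, 0 = negative)
texts = ["brilliant film", "terrible film", "loved every minute", "a waste of time"] * 10
labels = [1, 0, 1, 0] * 10

pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # unigrams + bigrams
    LogisticRegression(max_iter=1000),
)
pipe.fit(texts, labels)

print(pipe.predict(["an absolutely brilliant film"]))
```

On the real dataset, swap the toy lists for the 50k reviews and wrap the pipeline in `cross_val_score(pipe, texts, labels, cv=5)` to reproduce the 5-fold estimate.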
Strengths:
• 140× faster than BERT: processes ~1M documents/hour on a CPU
• Fully interpretable: you can see exactly which words drive the outcome
• Automatic stop-word downweighting via IDF
• No GPU, no neural training required
Weaknesses:
• Synonyms ignored: "great" and "excellent" are two separate features
• Word order lost (same as BoW)
• Polysemy blind: financial "bank" and river "bank" share a single feature and weight
• No semantic understanding at all
| Domain | Example | Why TF-IDF Works |
|---|---|---|
| Document search | Elasticsearch · Wikipedia · internal search | Query-document similarity in sparse vector space |
| Press classification | BBC News → 96% accuracy with TF-IDF alone | Domain vocabulary is stable and distinctive |
| Legal analysis | Contract review, clause identification | Precise, standardised vocabulary |
| Medical records | ICD codes, diagnostic classification | Rare medical terms have very high IDF → informative |
BoW, N-grams, and TF-IDF all belong to the frequency-based family. They share the same fundamental ceiling:
| Limitation | What It Means | Example |
|---|---|---|
| No semantic meaning | "film" and "movie" are completely unrelated features | Synonym vocabulary explosion without semantic clustering |
| No context | Each word is independent — word order ignored | "John beats Paul" = "Paul beats John" in all classical methods |
| Polysemy ignored | "bank" gets the same representation in all contexts | Financial bank = river bank → no disambiguation |
| OOV (Out of Vocabulary) | Unknown words = ignored entirely | Typos, slang, neologisms invisible to the model |
| Sparse representation | Vocabulary can be 50k+ dimensions; each document has mostly zeros | Memory-intensive; distances in high-dimensional sparse space are noisy |
Despite their limitations, classical methods remain the correct first choice in many real-world scenarios:
• No GPU available (CPU-only inference)
• Speed is critical (real-time, high-volume pipelines)
• Interpretability is required (legal, medical, compliance)
• Domain vocabulary is stable and distinctive
• Data is domain-specific with little polysemy
Switch to neural methods when:
• Context and meaning matter (negation, irony, polysemy)
• OOV words appear frequently (social media, medical)
• Semantic similarity is needed beyond exact keywords
• GPU is available and inference latency is acceptable
• Accuracy beyond the ~90% classical ceiling is required for your task
Step 1: TF-IDF with bigrams — fast baseline, no GPU
Step 2: FastText — if OOV/noisy text is a problem
Step 3: BERT fine-tuned — if polysemy, negation, or nuance is critical