Deep Learning & NLP — ENSAM Casablanca

Classical Text
Representation

Bag of Words, N-grams, and TF-IDF — the frequency-based family of text representation methods. Results benchmarked on 50,000 IMDb reviews. These methods require no GPU and no training, making them the essential fast baseline before reaching for neural embeddings.

Module05 of 07

DatasetIMDb 50k reviews (25k/25k)

Best resultTF-IDF 1+2gram: 90.13%

Part A

Classical Methods — Bag of Words · N-grams · TF-IDF

01Bag of Words (BoW)

The simplest text representation. Each document becomes a counting vector: how many times does each vocabulary word appear? Word order is completely ignored — the document is treated as a "bag" with no sequence.

Principle

Build a global vocabulary from all documents. Each document is then a vector of length = |vocabulary|, where each entry is the word count in that document.

Review	movie	brilliant	terrible	not	good
"This film was brilliant"	1	1	0	0	0
"This film was not good"	1	0	0	1	1
"Terrible film, not good"	1	0	1	1	1

from sklearn.feature_extraction.text import CountVectorizer vectorizer = CountVectorizer(max_features=20000, min_df=2) X_train = vectorizer.fit_transform(X_train_txt) X_test = vectorizer.transform(X_test_txt) # IMDb 50k Result: 86.18% accuracy with Logistic Regression

IMDb Predictions — Where BoW Fails

Review	True Label	BoW Predicts	Analysis
"This film was absolutely brilliant and I loved it"	Positive	✓ Positive	Keywords 'brilliant', 'loved' → strong positive signal
"Terrible film, a complete waste of time and money"	Negative	✓ Negative	Keywords 'terrible', 'waste' → strong negative signal
"The film was not good at all"	Negative	✗ Positive	'not', 'good' seen separately — ambiguous signal
"Not bad, actually quite enjoyable"	Positive	✓ Positive	'enjoyable' compensates — correct by chance!

Advantages

• Quick to implement and use
• Easy to interpret — see exactly which word scores
• No training required
• Matrix 99% sparse → fast with any standard ML classifier

Disadvantages

• Negation invisible: "not good" ≈ "good" for BoW
• Synonyms unlinked: "great" and "excellent" are separate features
• Order lost: "John beats Paul" = "Paul beats John"
• Matrix 99% sparse/hollow (high memory)

Real-World Use Cases Where BoW Is Optimal

Use Case	Example	Why BoW Works
Anti-spam filter	Gmail, Outlook	"lottery", "click here", "win" → spam. Keywords are enough.
Article categorisation	BBC News, Reuters	"football", "goal" → Sport. Domain-specific vocabulary.
Document search	Elasticsearch, Solr	Query = doc, TF-IDF similarity. Fast on millions of docs.

02N-grams — Capturing Sequences

An N-gram is a sequence of N consecutive words. By including multi-word sequences (bigrams, trigrams), we capture some local context that pure BoW misses — especially negation patterns.

N-gram Type	IMDb Accuracy (50k)	Notes
Unigram (1)	86.18%	Same as basic BoW
Bigram (2)	~88–89%	Captures 2-word phrases and simple negation
Trigram (3)	90.11%	Best for sentence-level patterns

from sklearn.feature_extraction.text import TfidfVectorizer vect = TfidfVectorizer(ngram_range=(1, 3), max_features=50000) X = vect.fit_transform(corpus) # ngram_range=(1,3) includes unigrams, bigrams, and trigrams

Advantages

• Fast and interpretable
• Handles negation: "not good", "really bad" captured as a unit
• Easy to implement on top of BoW pipeline

Disadvantages

• Vocabulary explosion: V² to V³ dimensions — memory intensive
• Still blind to synonyms and polysemy
• Long-range dependencies still missed

03TF-IDF — Weighting Words by Their Rarity

TF-IDF (Term Frequency – Inverse Document Frequency) improves BoW by weighting words: a word is important if it is frequent in this document but rare in the rest of the collection. This naturally downweights stop words and upweights distinctive vocabulary.

TF-IDF Formula $$\text{TF}(t,d) = \frac{\text{count}(t,d)}{|d|} \qquad \text{IDF}(t) = \log\!\left(\frac{N}{df(t)}\right) + 1$$ $$\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \text{IDF}(t)$$

$N$: total documents · $df(t)$: documents containing term $t$ · High IDF = rare = informative

Real IDF Values on 50k IMDb

Word Type	Example	Real IDF (50k corpus)	Role
Stop word	'is', 'it', 'in'	~1.1	Nearly ignored — noise
Neutral word	'film', 'movie'	~3–4	Lightweight — too common in ALL reviews
Discriminatory (+)	'brilliant', 'masterpiece'	High IDF	Strong positive signal
Discriminatory (−)	'terrible', 'awful'	High IDF	Strong negative signal

from sklearn.feature_extraction.text import TfidfVectorizer vect = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), sublinear_tf=True) X = vect.fit_transform(corpus) # IMDb Results (50k, 25k/25k split): # BoW unigram: 86.18% # TF-IDF unigram: 88.89% (+2.71% improvement) # TF-IDF 1+2 gram: 90.13% ← Best classical method # 5-Fold CV: 0.9026 ± 0.0044 (stable)

Interpretation — Why 90.13%?

IDF reduces stop-word noise ("is", "the" get near-zero weight automatically)
Bigrams add +3.94% vs simple BoW by capturing "not good", "absolutely brilliant"
Distinctive words like 'brilliant', 'masterpiece', 'terrible', 'awful' receive high weight
5-fold cross-validation confirms stability (0.9026 ± 0.0044)

Advantages

• 140× faster than BERT — processes 1M documents/hour on a CPU
• Total interpretability: see exactly which word drives the outcome
• Clever automatic stop-word filtering via IDF
• No GPU, no neural training required

Limitations

• Synonyms ignored: "great" and "excellent" are two separate features
• Word order lost (same as BoW)
• Polysemy blind: "bank" = bank AND river — same weight
• No semantic understanding at all

Production Use Cases for TF-IDF

Domain	Example	Why TF-IDF Works
Document search	Elasticsearch · Wikipedia · internal search	Query-document similarity in sparse vector space
Press classification	BBC News → 96% accuracy with TF-IDF alone	Domain vocabulary is stable and distinctive
Legal analysis	Contract review, clause identification	Precise, standardised vocabulary
Medical records	ICD codes, diagnostic classification	Rare medical terms have very high IDF → informative

04Shared Limitations of All Classical Methods

BoW, N-grams, and TF-IDF all belong to the frequency-based family. They share the same fundamental ceiling:

Limitation	What It Means	Example
No semantic meaning	"film" and "movie" are completely unrelated features	Synonym vocabulary explosion without semantic clustering
No context	Each word is independent — word order ignored	"John beats Paul" = "Paul beats John" in all classical methods
Polysemy ignored	"bank" gets the same representation in all contexts	Financial bank = river bank → no disambiguation
OOV (Out of Vocabulary)	Unknown words = ignored entirely	Typos, slang, neologisms invisible to the model
Sparse representation	Vocabulary can be 50k+ dimensions; each document has mostly zeros	Memory-intensive; distances in high-dimensional sparse space are noisy

Why these limitations matter: "The film was not good at all" — classical methods (except TF-IDF with bigrams) predict Positive because "good" appears and the negation "not" is treated as a separate, low-signal token. Moving to neural embeddings (Module 06) addresses polysemy and context; moving to BERT addresses negation and nuance.

05When to Stop at Classical Methods

Despite their limitations, classical methods remain the correct first choice in many real-world scenarios:

Use Classical When

• No GPU available (CPU-only inference)
• Speed is critical (real-time, high-volume pipelines)
• Interpretability is required (legal, medical, compliance)
• Domain vocabulary is stable and distinctive
• Data is domain-specific with little polysemy

Move to Neural When

• Context and meaning matter (negation, irony, polysemy)
• OOV words appear frequently (social media, medical)
• Semantic similarity is needed beyond exact keywords
• GPU is available and inference latency is acceptable
• 90%+ is not good enough for your task

Decision Ladder (from Module 06)

Step 1: TF-IDF with bigrams — fast baseline, no GPU
Step 2: FastText — if OOV/noisy text is a problem
Step 3: BERT fine-tuned — if polysemy, negation, or nuance is critical

Classical TextRepresentation

Classical Text
Representation