MOD 06 Word Embeddings — Static & Contextual
ENSAM Casablanca · 2025/2026
Deep Learning & NLP — ENSAM Casablanca

Word Embeddings
& Neural Representations

Static embeddings (Word2Vec, GloVe, FastText) and contextual embeddings (BERT). From dense vector spaces and vector arithmetic to bidirectional context and pre-training. Benchmarked on 50,000 IMDb reviews. Based on ENSAM 2025/2026 lecture PDFs.

Module: 06 of 07
Dataset: IMDb 50k (25k train / 25k test)
Best result: BERT fine-tuned, 93.9% accuracy
Part A
Static Embeddings — Word2Vec · GloVe · FastText
01 · Why Dense Vectors? — From Counting to Meaning

Classical methods (BoW, TF-IDF) represent each word as a unique dimension in a sparse vocabulary-sized vector. No two words are ever "close" to each other — "film" and "movie" are as distant as "film" and "elephant".

The key insight of neural word embeddings:

The Distributional Hypothesis (Firth, 1957)

"A word is characterized by the company it keeps."

"film" and "movie" appear in nearly the same contexts → they should have similar vectors.

| Property | Classical (TF-IDF) | Static Embeddings (Word2Vec) |
|---|---|---|
| Vector size | \|Vocabulary\| (50,000+) | 100–300 dimensions |
| Sparsity | 99%+ zeros | Dense — all dimensions meaningful |
| Synonyms | "great" ≠ "excellent" (different dims) | "great" ≈ "excellent" (cosine ~0.85) |
| Semantics | No semantic meaning | Semantic clusters in vector space |
| Training required | No (counting) | Yes (neural network on large corpus) |
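To make the contrast concrete, here is a minimal sketch (the dense vectors are toy values chosen for illustration, not real Word2Vec output): one-hot vectors give every word pair a cosine similarity of 0, while dense vectors let related words end up close.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sparse one-hot: every word gets its own dimension, so all pairs are orthogonal
vocab = {"film": 0, "movie": 1, "elephant": 2}
one_hot = np.eye(len(vocab))
print(cosine(one_hot[vocab["film"]], one_hot[vocab["movie"]]))     # 0.0
print(cosine(one_hot[vocab["film"]], one_hot[vocab["elephant"]]))  # 0.0, same distance

# Dense toy vectors: related words can share directions in the space
film     = np.array([0.8, 0.1, 0.3, 0.5])
movie    = np.array([0.7, 0.2, 0.3, 0.6])
elephant = np.array([-0.4, 0.9, -0.2, 0.1])
print(cosine(film, movie))     # ~0.98, close
print(cosine(film, elephant))  # negative, far apart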
30 Years of Evolution

| Era | Methods | Key Innovation |
|---|---|---|
| 1990s–2000s | BoW, TF-IDF | Frequency counting — fast, interpretable |
| 2013 | Word2Vec (Google) | Dense vectors, semantic similarity, vector arithmetic |
| 2014 | GloVe (Stanford) | Global co-occurrence statistics |
| 2016 | FastText (Facebook) | Subword decomposition — handles OOV |
| 2018 | ELMo (AI2), BERT (Google) | Contextual representations — different vector per occurrence |
02 · Word2Vec — The Vector Space of Meaning

Word2Vec (Mikolov et al., Google, 2013) trains a shallow neural network to predict words from their context. The actual task (prediction) is discarded — the learned weight matrix becomes the embedding lookup table.

Two Training Objectives
Skip-gram

Given the center word, predict the surrounding context words. Better for rare words. Works well on smaller datasets.

CBOW (Continuous Bag of Words)

Given the context words, predict the center word. Faster to train. Better for frequent words and larger datasets.
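Both objectives are available in gensim via the sg flag; here is a minimal training sketch on a made-up toy corpus (real training needs millions of sentences):

from gensim.models import Word2Vec

corpus = [
    ["this", "film", "was", "brilliant"],
    ["this", "movie", "was", "excellent"],
    ["the", "movie", "was", "terrible"],
]

# sg=1 → Skip-gram (center word predicts context); sg=0 → CBOW (context predicts center)
skipgram = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)
cbow     = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)

print(skipgram.wv["film"].shape)            # (100,) dense vector
print(cbow.wv.similarity("film", "movie"))  # unreliable on a corpus this small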

Vector Arithmetic — Captured Analogies
Famous Analogies

$$\vec{\text{King}} - \vec{\text{Man}} + \vec{\text{Woman}} \approx \vec{\text{Queen}}$$

$$\vec{\text{Paris}} - \vec{\text{France}} + \vec{\text{Italy}} \approx \vec{\text{Rome}}$$
import gensim.downloader

model = gensim.downloader.load('word2vec-google-news-300')

# Vector arithmetic
model.most_similar(positive=['king', 'woman'], negative=['man'])
# → [('queen', 0.7118), ('princess', 0.6517), ...]

# Semantic similarity on IMDb vocabulary:
#   film ↔ movie:       cosine 0.887 (equivalent — same contexts)
#   terrible ↔ awful:   cosine 0.880 (synonyms — negative cluster)
#   brilliant ↔ super:  cosine 0.820 (synonyms — positive cluster)

vec = model['film']  # 300-dimensional dense vector
Use Cases Where Word2Vec Excels
| Use Case | Example |
|---|---|
| Film recommendations | "Inception" → vector → find similar films in embedding space |
| Semantic search | Query "voiture" finds results mentioning "auto", "automobile" |
| Chatbots | "problem" cluster: "bug", "error", "issue" — all mapped nearby |
| Entity disambiguation | Names of similar people/places cluster together |
Strengths

• Captures semantic similarity and synonyms
• Vector analogies (king − man + woman ≈ queen)
• Dense vectors: 100–300 dimensions
• Pre-trained models available (Google News 300d)

Limitations

• One vector per word → "bank" (financial) = "bank" (river): polysemy ignored
• OOV: unknown words get no representation
• Local context only (fixed window)
• Needs billions of words for quality vectors

03 · GloVe & FastText
GloVe — Global Vectors (Stanford, 2014)

GloVe factorizes a global co-occurrence matrix, fitting word vectors by weighted least squares so that their dot products match the logarithm of co-occurrence counts. Instead of predicting local context (like Word2Vec), GloVe leverages overall corpus statistics — how often every word pair co-occurs across the entire corpus.
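For reference, the objective GloVe minimises (Pennington et al., 2014) fits each word pair's vector dot product to the log of its co-occurrence count, with a weighting function $f$ that damps very frequent pairs:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$.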

  • Analogy and similarity strength: strong performance on word-analogy tasks and on similarity benchmarks such as WordSim-353 and SimLex-999.
  • Limitation: the V×V co-occurrence matrix becomes prohibitively large in memory for big vocabularies.
import gensim.downloader

model = gensim.downloader.load('glove-wiki-gigaword-100')  # 100d GloVe vectors
# Trained on Wikipedia + Gigaword (6B tokens)
FastText — Subword Embeddings (Facebook/Meta, 2016)

FastText extends Word2Vec by breaking words into character n-grams (subwords). The embedding of a word is the sum of its subword embeddings. This solves the OOV (Out-of-Vocabulary) problem — any word, even one unseen during training, can be represented using its subword components.

"acting" → subwords: <ac · act · cti · tin · ing · ng> # Each subword has its own embedding — sum them for the word vector # OOV performance on IMDb (words not in training vocabulary): # Word Word2Vec FastText # "amazing" OOV ❌ similarity 0.919 ✅ # "terrible" OOV ❌ similarity 0.907 ✅ # "unbelievable" OOV ❌ similarity 0.951 ✅ # Typo robustness: # "terrible" (typo) → subword overlap with "terrible" (correct) → similar vector
FastText Use Cases
| Domain | Why FastText |
|---|---|
| Social media ("amazingg", "luv", "gr8") | Constant OOV from slang, abbreviations, typos |
| Medical NLP ("hypertension", "cardiomyopathy") | Rare medical terms → subword decomposition guarantees a vector |
| Morphologically rich languages (Arabic, Turkish, Finnish) | Complex suffixes handled via subwords — FastText is the global standard |
| Noisy OCR / SMS / forums | "necesary" → subword overlap with "necessary" → correct cluster |
Word2Vec vs GloVe vs FastText — When to Choose
| Criteria | Word2Vec | GloVe | FastText |
|---|---|---|---|
| Handles OOV words | No | No | Yes |
| Analogy performance | Good | Excellent | Good |
| Training speed | Fast | Slower | Fast |
| Morphological richness | No | No | Yes |
| Noisy text robustness | Poor | Poor | Strong |
| Best use case | Semantic similarity, clean text | Benchmarks, analogies | Social media, medical, OOV |
Key shared limitation of all three: Static — one vector per word regardless of context. "I went to the bank to deposit money" and "I sat on the river bank" → same vector for "bank". This is the polysemy problem.
Part B
Contextual Embeddings — ELMo · BERT · GPT
04 · The Fundamental Problem — Polysemy
"I'm going to the bank to withdraw some money."
BANK = Financial Institution

"He's fishing on the bank's shore."
BANK = Riverbank

Word2Vec · GloVe · FastText → "bank" has the SAME vector in both sentences. This is fundamentally wrong for any task requiring understanding of meaning in context.

Polysemy is the phenomenon where a single word has multiple meanings. All static embeddings collapse all senses into one average vector. The solution requires reading the entire sentence and computing a different vector for each occurrence of the word — a vector that reflects the specific meaning in that context.

This is exactly what contextual embeddings (ELMo, BERT, GPT) do. The transformer attention mechanism allows every word to "look at" every other word in the sentence before its representation is computed.
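A minimal sketch of this effect with the Hugging Face transformers library: the same surface word "bank" receives two clearly different vectors (the helper function bank_vector is ours, not a library API):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Return the contextual vector of the token "bank" in this sentence
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v1 = bank_vector("i went to the bank to deposit money")
v2 = bank_vector("i sat on the river bank watching the water")
print(torch.cosine_similarity(v1, v2, dim=0))  # well below 1.0: two senses, two vectors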

05 · BERT — Bidirectional Encoder Representations from Transformers

BERT (Devlin et al., Google, 2018) is based on the Transformer encoder architecture. Instead of a fixed vector per word, each word receives a different vector depending on its full bidirectional context. Through self-attention, every token sees the entire sentence at once — left and right context simultaneously, rather than sequentially as in RNNs.

Pre-Training Tasks
Masked Language Model (MLM)

"The [MASK] was brilliant" → predicts "film". 15% of tokens are masked randomly. Forces the model to use bidirectional context to fill in blanks — cannot cheat by looking only left or right.

Next Sentence Prediction (NSP)

Given sentences A and B, predict: does B logically follow A? Teaches the model inter-sentence relationships — useful for QA, inference, summarisation tasks.

BERT Architecture Facts
| Specification | BERT-Base | BERT-Large |
|---|---|---|
| Transformer encoder layers | 12 | 24 |
| Attention heads | 12 | 16 |
| Hidden dimension | 768 | 1024 |
| Total parameters | ~110M | ~340M |
| Pre-training data | Wikipedia + BooksCorpus (3.3 billion words) | same |
| Tokenisation | WordPiece — handles OOV via subword splitting | same |
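A quick WordPiece demo (the exact split depends on the checkpoint's vocabulary, but any rare word is decomposed into known subwords rather than mapped to an unknown token):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("embeddings"))      # ['em', '##bed', '##ding', '##s']
print(tokenizer.tokenize("cardiomyopathy"))  # rare word, split into subword pieces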
Fine-Tuning Strategy — Complete Fine-Tuning
# Strategy B: complete fine-tuning (all layers trained on IMDb) ← best strategy
from transformers import BertForSequenceClassification, BertTokenizer
from torch.optim import AdamW

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                      num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)  # typical BERT fine-tuning learning rate

# All 12 layers + classification head trained on 25k IMDb reviews
# Training: 3 epochs, GPU required, ~30–120 min

# IMDb results (50k, 25k train / 25k test):
#   Accuracy: 93.9%
#   F1 score: 0.939
#   3 epochs, GPU (Colab/Kaggle)
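Once fine-tuned, inference is a forward pass plus a softmax; a minimal sketch reusing the model and tokenizer above (label index 1 = positive is an assumption about how the IMDb labels were encoded):

import torch

model.eval()
review = "This movie was not good at all"
inputs = tokenizer(review, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits           # shape (1, 2)
probs = torch.softmax(logits, dim=-1)[0]
print("positive" if probs[1] > probs[0] else "negative", probs.tolist())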
Why 93.9%? — BERT's Advantages
Unique Advantages

• Bidirectional context: "not good" (negative) vs "not bad" (positive) — BERT distinguishes them; Word2Vec cannot
• Polysemy resolved: "bank" financial ≠ "bank" geographical → different vectors
• Massive pre-training: 3.3B words → rich linguistic knowledge via transfer learning
• Handles irony, negation, nuance naturally

Costs & Limitations

• 140× slower than FastText — impractical for real-time applications without a GPU
• GPU almost mandatory (fine-tuning: 30–120 min; production: GPU inference)
• Large labeled dataset needed for fine-tuning
• Black box: attention patterns are hard to interpret

BERT Use Cases
| Domain | Why BERT Is Essential |
|---|---|
| Chatbots & question answering | Nuanced understanding of interconnected, multi-sentence context |
| Legal analysis | Critical semantic nuances in contracts and clauses |
| Medical NLP | Irony and negation in clinical notes ("patient shows no signs of...") |
| Hate speech / sentiment detection | Sarcasm and irony cannot be caught without bidirectional context |
| Machine translation | BERT Multilingual: 104 languages in one model |
Part C
Comparison, Demo & Decision Guide
06 · IMDb 50k Sentiment — Full Results Comparison

All methods tested on the same 50,000 IMDb reviews (25k train / 25k test), sentiment classification task.

| Method | Type | Accuracy | F1-macro | Latency | GPU? | Interpretability | OOV |
|---|---|---|---|---|---|---|---|
| BoW | Classical | 86.2% | 0.86 | ~1 ms | No | Excellent | No |
| N-gram (1,2) | Classical | ~89% | ~0.89 | ~3 ms | No | Very good | No |
| TF-IDF | Classical | 90.1% | 0.90 | ~3 ms | No | Excellent | No |
| Word2Vec | Static | 85.7% | 0.86 | ~12 ms | No | Weak | No |
| GloVe | Static | 76.8% | 0.77 | ~10 ms | No | Weak | No |
| FastText | Static | 86.0% | 0.86 | ~15 ms | No | Weak | Yes |
| BERT fine-tuned | Contextual | 93.9% | 0.94 | ~370 ms (CPU) | Yes | Very weak | Yes |
Counter-intuitive results: GloVe (76.8%) performs worse than BoW (86.2%) on sentiment — global co-occurrence statistics are poorly suited for sentiment. Word2Vec (85.7%) and FastText (86.0%) also underperform TF-IDF (90.1%). Why? Mean-pooling destroys word order — "not good" becomes indistinguishable from "good not". TF-IDF bigrams capture this negation exactly. BERT's 370ms CPU latency makes it impractical without a GPU for high-volume applications.
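A minimal sketch of the mean-pooling problem, with toy 3-d vectors standing in for Word2Vec lookups: averaging is order-invariant, so "not good" and "good not" produce the identical document vector.

import numpy as np

embeddings = {
    "not":  np.array([0.1, -0.7, 0.2]),
    "good": np.array([0.8,  0.3, 0.1]),
}

def doc_vector(tokens):
    # Mean pooling: average the word vectors, discarding all order information
    return np.mean([embeddings[t] for t in tokens], axis=0)

print(doc_vector(["not", "good"]))  # identical to the next line:
print(doc_vector(["good", "not"]))  # no classifier can tell them apart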
07 · 6-Review Demo — Where Each Method Fails
| Review | True | BoW | TF-IDF | W2V | FastText | BERT |
|---|---|---|---|---|---|---|
| "This film was absolutely brilliant" | Pos | ✓ | ✓ | ✓ | ✓ | ✓ |
| "This film was terrible and boring" | Neg | ✓ | ✓ | ✓ | ✓ | ✓ |
| "This movie was not good at all" | Neg | ✗ | ✓ | ✗ | ✗ | ✓ |
| "Not bad, quite enjoyable" | Pos | ✓ | ✓ | ✗ | ✗ | ✓ |
| "I expected something better" | Neg | ✓ | ✓ | ✓ | ✓ | ✓ |
| "A film that was surprisingly good" | Pos | ✓ | ✓ | ✓ | ✓ | ✓ |
  • TF-IDF (with bigrams): 6/6 correct — bigram "not good" = strong negative signal. Best classical method.
  • BERT: 6/6 correct — bidirectional context resolves "not good" (negative) vs "not bad" (positive).
  • BoW: 5/6 — fails on "not good" (treats "not" and "good" separately).
  • W2V / FastText: 4/6 — fail on negation ("not good", "not bad") because mean pooling loses word order.
08 · Decision Guide — Which Model to Choose?
Golden Rule

Step 1: TF-IDF (fast baseline, no GPU, fully interpretable)
Step 2: FastText (if OOV words are a problem — noisy text, medical, social media)
Step 3: BERT fine-tuned (if polysemy, negation, or nuance is critical + GPU available)

| Task Requirement | Best Choice | Reason |
|---|---|---|
| Need context / polysemy / negation | BERT fine-tuned | Only method that gives a different vector per occurrence |
| Noisy data / OOV words (typos, slang) | FastText | Subword decomposition handles any word — never OOV |
| Need similarity / analogies | Word2Vec or GloVe | Semantic vector space captures word relationships |
| Fast, interpretable, no GPU, no training data | TF-IDF | 140× faster than BERT, fully explainable, CPU-only |
Don't jump to BERT first. TF-IDF with bigrams achieves 90.13% on IMDb — only 3.77 percentage points below BERT (93.9%). For many production tasks, that gap does not justify the GPU cost and inference latency. Always start at the cheapest method that meets your accuracy threshold.
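A minimal sketch of the Step 1 baseline with scikit-learn, assuming train_texts/train_labels and test_texts/test_labels hold the IMDb split (the ~90% figure is the lecture's result, not a guarantee of this exact configuration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=50_000),  # unigrams + bigrams
    LogisticRegression(max_iter=1000),
)
baseline.fit(train_texts, train_labels)
print(accuracy_score(test_labels, baseline.predict(test_texts)))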
09 · Three Families — Conceptual Differences

Each family solves the fundamental limitations of the previous one. Increasing complexity → better meaning capture → more resources required.

| Family | Methods | Representation | Semantics | Context | OOV | Key Strength |
|---|---|---|---|---|---|---|
| Classical | BoW · N-grams · TF-IDF | Frequency vector / weight — each word = one dimension | None — "film" and "movie" are unrelated | None — word order ignored | Word ignored entirely | Speed · interpretability · no GPU · no training |
| Static embeddings | Word2Vec · GloVe · FastText | Dense 100–300d vector learned from co-occurrence | Semantic similarity captured ("film" ≈ "movie", cosine 0.89) | One vector per word — "bank" is the same in all sentences | FastText ✅ · W2V/GloVe ❌ | Semantic similarity · vector analogies · compact representation |
| Contextual | ELMo · BERT · GPT | Dynamic 768–1024d vector — recomputed per context | Deep semantic meaning: nuance, irony, polysemy | Bidirectional — "bank" gets different vectors in different sentences | WordPiece subwords ✅ | Negation · irony · polysemy · bidirectional context |