MOD 06 Word Embeddings — Static & Contextual
ENSAM Casablanca · 2025/2026
Deep Learning & NLP — ENSAM Casablanca

Word Embeddings
& Neural Representations

Static embeddings (Word2Vec, GloVe, FastText) and contextual embeddings (BERT). From dense vector spaces and vector arithmetic to bidirectional context and pre-training. Benchmarked on 50,000 IMDb reviews. Based on ENSAM 2025/2026 lecture PDFs.

Module: 06 of 07
Dataset: IMDb 50k (25k train / 25k test)
Best result: BERT fine-tuned, 93.9% accuracy
Part A
Static Embeddings — Word2Vec · GloVe · FastText
01 · Why Dense Vectors? — From Counting to Meaning

Classical methods (BoW, TF-IDF) represent each word as a unique dimension in a sparse vocabulary-sized vector. No two words are ever "close" to each other — "film" and "movie" are as distant as "film" and "elephant".

The key insight of neural word embeddings:

The Distributional Hypothesis (Firth, 1957)

"A word is characterized by the company it keeps."

"film" and "movie" appear in nearly the same contexts → they should have similar vectors.

| Property | Classical (TF-IDF) | Static Embeddings (Word2Vec) |
|---|---|---|
| Vector size | \|Vocabulary\| (50,000+) | 100–300 dimensions |
| Sparsity | 99%+ zeros | Dense — all dimensions meaningful |
| Synonyms | "great" ≠ "excellent" (different dims) | "great" ≈ "excellent" (cosine ~0.85) |
| Semantics | No semantic meaning | Semantic clusters in vector space |
| Training required | No (counting) | Yes (neural network on large corpus) |
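To make the contrast concrete, here is a minimal sketch (the dense vectors are toy values chosen for illustration, not real Word2Vec output): one-hot vectors give every word pair a cosine similarity of 0, while dense vectors let related words end up close.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sparse one-hot: every word gets its own dimension, so all pairs are orthogonal
vocab = {"film": 0, "movie": 1, "elephant": 2}
one_hot = np.eye(len(vocab))
print(cosine(one_hot[vocab["film"]], one_hot[vocab["movie"]]))     # 0.0
print(cosine(one_hot[vocab["film"]], one_hot[vocab["elephant"]]))  # 0.0, same distance

# Dense toy vectors: related words can share directions in the space
film     = np.array([0.8, 0.1, 0.3, 0.5])
movie    = np.array([0.7, 0.2, 0.3, 0.6])
elephant = np.array([-0.4, 0.9, -0.2, 0.1])
print(cosine(film, movie))     # ~0.98, close
print(cosine(film, elephant))  # negative, far apart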
30 Years of Evolution

| Era | Methods | Key Innovation |
|---|---|---|
| 1990s–2000s | BoW, TF-IDF | Frequency counting — fast, interpretable |
| 2013 | Word2Vec (Google) | Dense vectors, semantic similarity, vector arithmetic |
| 2014 | GloVe (Stanford) | Global co-occurrence statistics |
| 2016 | FastText (Facebook) | Subword decomposition — handles OOV |
| 2018 | ELMo (AI2), BERT (Google) | Contextual representations — different vector per occurrence |
02 · Word2Vec — The Vector Space of Meaning

Word2Vec (Mikolov et al., Google, 2013) trains a shallow neural network to predict words from their context. The actual task (prediction) is discarded — the learned weight matrix becomes the embedding lookup table.

Two Training Objectives
Skip-gram

Given the center word, predict the surrounding context words. Better for rare words. Works well on smaller datasets.

CBOW (Continuous Bag of Words)

Given the context words, predict the center word. Faster to train. Better for frequent words and larger datasets.
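Both objectives are available in gensim via the sg flag; here is a minimal training sketch on a made-up toy corpus (real training needs millions of sentences):

from gensim.models import Word2Vec

corpus = [
    ["this", "film", "was", "brilliant"],
    ["this", "movie", "was", "excellent"],
    ["the", "movie", "was", "terrible"],
]

# sg=1 → Skip-gram (center word predicts context); sg=0 → CBOW (context predicts center)
skipgram = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)
cbow     = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)

print(skipgram.wv["film"].shape)            # (100,) dense vector
print(cbow.wv.similarity("film", "movie"))  # unreliable on a corpus this small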

Vector Arithmetic — Captured Analogies
Famous Analogies

$$\vec{\text{King}} - \vec{\text{Man}} + \vec{\text{Woman}} \approx \vec{\text{Queen}}$$

$$\vec{\text{Paris}} - \vec{\text{France}} + \vec{\text{Italy}} \approx \vec{\text{Rome}}$$
import gensim.downloader

model = gensim.downloader.load('word2vec-google-news-300')

# Vector arithmetic
model.most_similar(positive=['king', 'woman'], negative=['man'])
# → [('queen', 0.7118), ('princess', 0.6517), ...]

# Semantic similarity on IMDb vocabulary:
#   film ↔ movie:       cosine 0.887 (equivalent — same contexts)
#   terrible ↔ awful:   cosine 0.880 (synonyms — negative cluster)
#   brilliant ↔ super:  cosine 0.820 (synonyms — positive cluster)

vec = model['film']  # 300-dimensional dense vector
Use Cases Where Word2Vec Excels
| Use Case | Example |
|---|---|
| Film recommendations | "Inception" → vector → find similar films in embedding space |
| Semantic search | Query "voiture" finds results mentioning "auto", "automobile" |
| Chatbots | "problem" cluster: "bug", "error", "issue" — all mapped nearby |
| Entity disambiguation | Names of similar people/places cluster together |
Strengths

• Captures semantic similarity and synonyms
• Vector analogies (king − man + woman ≈ queen)
• Dense vectors: 100–300 dimensions
• Pre-trained models available (Google News 300d)

Limitations

• One vector per word → "bank" (financial) = "bank" (river): polysemy ignored
• OOV: unknown words get no representation
• Local context only (fixed window)
• Needs billions of words for quality vectors

03 · GloVe & FastText
GloVe — Global Vectors (Stanford, 2014)

GloVe factorizes a global co-occurrence matrix, fitting word vectors by weighted least squares so that their dot products match the logarithm of co-occurrence counts. Instead of predicting local context (like Word2Vec), GloVe leverages overall corpus statistics — how often every word pair co-occurs across the entire corpus.
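For reference, the objective GloVe minimises (Pennington et al., 2014) fits each word pair's vector dot product to the log of its co-occurrence count, with a weighting function $f$ that damps very frequent pairs:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$.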

  • Analogy and similarity strength: strong performance on word-analogy tasks and on similarity benchmarks such as WordSim-353 and SimLex-999.
  • Limitation: the V×V co-occurrence matrix becomes prohibitively large in memory for big vocabularies.
import gensim.downloader

model = gensim.downloader.load('glove-wiki-gigaword-100')  # 100d GloVe vectors
# Trained on Wikipedia + Gigaword (6B tokens)
FastText — Subword Embeddings (Facebook/Meta, 2016)

FastText extends Word2Vec by breaking words into character n-grams (subwords). The embedding of a word is the sum of its subword embeddings. This solves the OOV (Out-of-Vocabulary) problem — any word, even one unseen during training, can be represented using its subword components.

"acting" → subwords: <ac · act · cti · tin · ing · ng> # Each subword has its own embedding — sum them for the word vector # OOV performance on IMDb (words not in training vocabulary): # Word Word2Vec FastText # "amazing" OOV ❌ similarity 0.919 ✅ # "terrible" OOV ❌ similarity 0.907 ✅ # "unbelievable" OOV ❌ similarity 0.951 ✅ # Typo robustness: # "terrible" (typo) → subword overlap with "terrible" (correct) → similar vector
FastText Use Cases
| Domain | Why FastText |
|---|---|
| Social media ("amazingg", "luv", "gr8") | Constant OOV from slang, abbreviations, typos |
| Medical NLP ("hypertension", "cardiomyopathy") | Rare medical terms → subword decomposition guarantees a vector |
| Morphologically rich languages (Arabic, Turkish, Finnish) | Complex suffixes handled via subwords — FastText is the global standard |
| Noisy OCR / SMS / forums | "necesary" → subword overlap with "necessary" → correct cluster |
Word2Vec vs GloVe vs FastText — When to Choose
| Criteria | Word2Vec | GloVe | FastText |
|---|---|---|---|
| Handles OOV words | No | No | Yes |
| Analogy performance | Good | Excellent | Good |
| Training speed | Fast | Slower | Fast |
| Morphological richness | No | No | Yes |
| Noisy text robustness | Poor | Poor | Strong |
| Best use case | Semantic similarity, clean text | Benchmarks, analogies | Social media, medical, OOV |
Key shared limitation of all three: Static — one vector per word regardless of context. "I went to the bank to deposit money" and "I sat on the river bank" → same vector for "bank". This is the polysemy problem.
Part B
Contextual Embeddings — ELMo · BERT · GPT
04 · The Fundamental Problem — Polysemy
"I'm going to the bank to withdraw some money."
BANK = Financial Institution

"He's fishing on the bank's shore."
BANK = Riverbank

Word2Vec · GloVe · FastText → "bank" has the SAME vector in both sentences. This is fundamentally wrong for any task requiring understanding of meaning in context.

Polysemy is the phenomenon where a single word has multiple meanings. All static embeddings collapse all senses into one average vector. The solution requires reading the entire sentence and computing a different vector for each occurrence of the word — a vector that reflects the specific meaning in that context.

This is exactly what contextual embeddings (ELMo, BERT, GPT) do. The transformer attention mechanism allows every word to "look at" every other word in the sentence before its representation is computed.
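A minimal sketch of this effect with the Hugging Face transformers library: the same surface word "bank" receives two clearly different vectors (the helper function bank_vector is ours, not a library API):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Return the contextual vector of the token "bank" in this sentence
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v1 = bank_vector("i went to the bank to deposit money")
v2 = bank_vector("i sat on the river bank watching the water")
print(torch.cosine_similarity(v1, v2, dim=0))  # well below 1.0: two senses, two vectors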

05 · BERT — Bidirectional Encoder Representations from Transformers

BERT (Devlin et al., Google, 2018) is based on the Transformer encoder architecture. Instead of a fixed vector per word, each word receives a different vector depending on its full bidirectional context. Through self-attention, every token sees the entire sentence at once — left and right context simultaneously, rather than sequentially as in RNNs.

Pre-Training Tasks
Masked Language Model (MLM)

"The [MASK] was brilliant" → predicts "film". 15% of tokens are masked randomly. Forces the model to use bidirectional context to fill in blanks — cannot cheat by looking only left or right.

Next Sentence Prediction (NSP)

Given sentences A and B, predict: does B logically follow A? Teaches the model inter-sentence relationships — useful for QA, inference, summarisation tasks.

BERT Architecture Facts
| Specification | BERT-Base | BERT-Large |
|---|---|---|
| Transformer encoder layers | 12 | 24 |
| Attention heads | 12 | 16 |
| Hidden dimension | 768 | 1024 |
| Total parameters | ~110M | ~340M |
| Pre-training data | Wikipedia + BooksCorpus (3.3 billion words) | same |
| Tokenisation | WordPiece — handles OOV via subword splitting | same |
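A quick WordPiece demo (the exact split depends on the checkpoint's vocabulary, but any rare word is decomposed into known subwords rather than mapped to an unknown token):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("embeddings"))      # ['em', '##bed', '##ding', '##s']
print(tokenizer.tokenize("cardiomyopathy"))  # rare word, split into subword pieces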
Fine-Tuning Strategy — Complete Fine-Tuning
# Strategy B: complete fine-tuning (all layers trained on IMDb) ← best strategy
from transformers import BertForSequenceClassification, BertTokenizer
from torch.optim import AdamW

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                      num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)  # typical BERT fine-tuning learning rate

# All 12 layers + classification head trained on 25k IMDb reviews
# Training: 3 epochs, GPU required, ~30–120 min

# IMDb results (50k, 25k train / 25k test):
#   Accuracy: 93.9%
#   F1 score: 0.939
#   3 epochs, GPU (Colab/Kaggle)
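Once fine-tuned, inference is a forward pass plus a softmax; a minimal sketch reusing the model and tokenizer above (label index 1 = positive is an assumption about how the IMDb labels were encoded):

import torch

model.eval()
review = "This movie was not good at all"
inputs = tokenizer(review, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits           # shape (1, 2)
probs = torch.softmax(logits, dim=-1)[0]
print("positive" if probs[1] > probs[0] else "negative", probs.tolist())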
Why 93.9%? — BERT's Advantages
Unique Advantages

• Bidirectional context: "not good" (negative) vs "not bad" (positive) — BERT distinguishes them; Word2Vec cannot
• Polysemy resolved: "bank" financial ≠ "bank" geographical → different vectors
• Massive pre-training: 3.3B words → rich linguistic knowledge via transfer learning
• Handles irony, negation, nuance naturally

Costs & Limitations

• 140× slower than FastText — impractical for real-time applications without a GPU
• GPU almost mandatory (fine-tuning: 30–120 min; production: GPU inference)
• Large labeled dataset needed for fine-tuning
• Black box: attention patterns are hard to interpret

BERT Use Cases
| Domain | Why BERT Is Essential |
|---|---|
| Chatbots & question answering | Nuanced understanding of interconnected, multi-sentence context |
| Legal analysis | Critical semantic nuances in contracts and clauses |
| Medical NLP | Irony and negation in clinical notes ("patient shows no signs of...") |
| Hate speech / sentiment detection | Sarcasm and irony cannot be caught without bidirectional context |
| Machine translation | BERT Multilingual: 104 languages in one model |
Part C
Comparison, Demo & Decision Guide
06 · IMDb 50k Sentiment — Full Results Comparison

All methods tested on the same 50,000 IMDb reviews (25k train / 25k test), sentiment classification task.

| Method | Type | Accuracy | F1-macro | Latency | GPU? | Interpretability | OOV |
|---|---|---|---|---|---|---|---|
| BoW | Classical | 86.2% | 0.86 | ~1 ms | No | Excellent | No |
| N-gram (1,2) | Classical | ~89% | ~0.89 | ~3 ms | No | Very good | No |
| TF-IDF | Classical | 90.1% | 0.90 | ~3 ms | No | Excellent | No |
| Word2Vec | Static | 85.7% | 0.86 | ~12 ms | No | Weak | No |
| GloVe | Static | 76.8% | 0.77 | ~10 ms | No | Weak | No |
| FastText | Static | 86.0% | 0.86 | ~15 ms | No | Weak | Yes |
| BERT fine-tuned | Contextual | 93.9% | 0.94 | ~370 ms (CPU) | Yes | Very weak | Yes |
Counter-intuitive results: GloVe (76.8%) performs worse than BoW (86.2%) on sentiment — global co-occurrence statistics are poorly suited for sentiment. Word2Vec (85.7%) and FastText (86.0%) also underperform TF-IDF (90.1%). Why? Mean-pooling destroys word order — "not good" becomes indistinguishable from "good not". TF-IDF bigrams capture this negation exactly. BERT's 370ms CPU latency makes it impractical without a GPU for high-volume applications.
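A minimal sketch of the mean-pooling problem, with toy 3-d vectors standing in for Word2Vec lookups: averaging is order-invariant, so "not good" and "good not" produce the identical document vector.

import numpy as np

embeddings = {
    "not":  np.array([0.1, -0.7, 0.2]),
    "good": np.array([0.8,  0.3, 0.1]),
}

def doc_vector(tokens):
    # Mean pooling: average the word vectors, discarding all order information
    return np.mean([embeddings[t] for t in tokens], axis=0)

print(doc_vector(["not", "good"]))  # identical to the next line:
print(doc_vector(["good", "not"]))  # no classifier can tell them apart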
07 · 6-Review Demo — Where Each Method Fails
| Review | True | BoW | TF-IDF | W2V | FastText | BERT |
|---|---|---|---|---|---|---|
| "This film was absolutely brilliant" | Pos | ✓ | ✓ | ✓ | ✓ | ✓ |
| "This film was terrible and boring" | Neg | ✓ | ✓ | ✓ | ✓ | ✓ |
| "This movie was not good at all" | Neg | ✗ | ✓ | ✗ | ✗ | ✓ |
| "Not bad, quite enjoyable" | Pos | ✓ | ✓ | ✗ | ✗ | ✓ |
| "I expected something better" | Neg | ✓ | ✓ | ✓ | ✓ | ✓ |
| "A film that was surprisingly good" | Pos | ✓ | ✓ | ✓ | ✓ | ✓ |
  • TF-IDF (with bigrams): 6/6 correct — bigram "not good" = strong negative signal. Best classical method.
  • BERT: 6/6 correct — bidirectional context resolves "not good" (negative) vs "not bad" (positive).
  • BoW: 5/6 — fails on "not good" (treats "not" and "good" separately).
  • W2V / FastText: 4/6 — fail on negation ("not good", "not bad") because mean pooling loses word order.
08 · Decision Guide — Which Model to Choose?
Golden Rule

Step 1: TF-IDF (fast baseline, no GPU, fully interpretable)
Step 2: FastText (if OOV words are a problem — noisy text, medical, social media)
Step 3: BERT fine-tuned (if polysemy, negation, or nuance is critical + GPU available)

| Task Requirement | Best Choice | Reason |
|---|---|---|
| Need context / polysemy / negation | BERT fine-tuned | Only method that gives a different vector per occurrence |
| Noisy data / OOV words (typos, slang) | FastText | Subword decomposition handles any word — never OOV |
| Need similarity / analogies | Word2Vec or GloVe | Semantic vector space captures word relationships |
| Fast, interpretable, no GPU, no training data | TF-IDF | 140× faster than BERT, fully explainable, CPU-only |
Don't jump to BERT first. TF-IDF with bigrams achieves 90.13% on IMDb — only 3.77 percentage points below BERT (93.9%). For many production tasks, that gap does not justify the GPU cost and inference latency. Always start at the cheapest method that meets your accuracy threshold.
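A minimal sketch of the Step 1 baseline with scikit-learn, assuming train_texts/train_labels and test_texts/test_labels hold the IMDb split (the ~90% figure is the lecture's result, not a guarantee of this exact configuration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=50_000),  # unigrams + bigrams
    LogisticRegression(max_iter=1000),
)
baseline.fit(train_texts, train_labels)
print(accuracy_score(test_labels, baseline.predict(test_texts)))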
09 · Three Families — Conceptual Differences

Each family solves the fundamental limitations of the previous one. Increasing complexity → better meaning capture → more resources required.

| Family | Methods | Representation | Semantics | Context | OOV | Key Strength |
|---|---|---|---|---|---|---|
| Classical | BoW · N-grams · TF-IDF | Frequency vector / weight — each word = one dimension | None — "film" and "movie" are unrelated | None — word order ignored | Word ignored entirely | Speed · interpretability · no GPU · no training |
| Static embeddings | Word2Vec · GloVe · FastText | Dense 100–300d vector learned from co-occurrence | Semantic similarity captured ("film" ≈ "movie", cosine 0.89) | One vector per word — "bank" is the same in all sentences | FastText ✅ · W2V/GloVe ❌ | Semantic similarity · vector analogies · compact representation |
| Contextual | ELMo · BERT · GPT | Dynamic 768–1024d vector — recomputed per context | Deep semantic meaning: nuance, irony, polysemy | Bidirectional — "bank" gets different vectors in different sentences | WordPiece subwords ✅ | Negation · irony · polysemy · bidirectional context |