Word token
An occurrence of a word in running text.
Word type
A unique word treated as a dictionary entry (i.e., a distinct token).
Vocabulary
The set of word types.
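A minimal sketch of the token/type/vocabulary distinction (the example sentence is illustrative, not from the source):

```python
text = "the cat sat on the mat"
tokens = text.split()        # every occurrence is a word token
vocabulary = set(tokens)     # the word types: unique tokens

print(len(tokens))      # 6 word tokens
print(len(vocabulary))  # 5 word types ("the" occurs twice)
```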
Zipf’s law
Given a word's frequency f and its frequency rank r, f * r ≈ constant.
Bag of Words
Let m denote the size of the vocabulary. Given a document d, let c(w,d) denote the number of occurrences of word w in d. The Bag-of-Words representation of the document is v_d = [c(w_1,d), c(w_2,d), …, c(w_m,d)] / Z_d, where Z_d = ∑_w c(w,d).
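A sketch of the Bag-of-Words formula above, assuming a small hypothetical vocabulary; words outside the vocabulary are ignored:

```python
from collections import Counter

def bag_of_words(doc_tokens, vocab):
    # c(w, d): raw count of each word in the document
    counts = Counter(doc_tokens)
    # Z_d = sum over the vocabulary of c(w, d)
    z = sum(counts[w] for w in vocab)
    return [counts[w] / z for w in vocab]

vocab = ["the", "cat", "sat", "mat"]           # w_1 .. w_m, m = 4
doc = "the cat sat on the mat".split()         # "on" is out of vocabulary
v = bag_of_words(doc, vocab)                   # [0.4, 0.2, 0.2, 0.2]
```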
tf
Normalized term frequency tf_w = c(w,d) / max_v c(v,d)
idf
Inverse document frequency idf_w = log (total #documents / #documents containing w)
tf-idf
tf_w * idf_w
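The three definitions above (tf, idf, and their product) can be sketched directly; the toy corpus here is illustrative:

```python
import math
from collections import Counter

def tf(w, doc_counts):
    # normalized term frequency: c(w,d) / max_v c(v,d)
    return doc_counts[w] / max(doc_counts.values())

def idf(w, docs):
    # log(total #documents / #documents containing w)
    df = sum(1 for d in docs if w in d)
    return math.log(len(docs) / df)

docs = ["the cat sat".split(),
        "the dog ran".split(),
        "a cat ran".split()]

counts = Counter(docs[0])                 # term counts for the first document
score = tf("cat", counts) * idf("cat", docs)   # tf = 1, idf = log(3/2)
```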
Cosine similarity
Cosine of the angle between two vectors sim(x,y) = x^T y / [sqrt(x^T x) * sqrt(y^T y)]
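A direct translation of the cosine-similarity formula, shown on two hypothetical vectors:

```python
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))       # x^T y
    norm_x = math.sqrt(sum(a * a for a in x))    # sqrt(x^T x)
    norm_y = math.sqrt(sum(b * b for b in y))    # sqrt(y^T y)
    return dot / (norm_x * norm_y)

print(cosine_similarity([1, 0], [0, 1]))  # 0.0 for orthogonal vectors
print(cosine_similarity([1, 2], [2, 4]))  # close to 1.0 for parallel vectors
```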
Unigram model
Defines the probability of a sequence as the product of the probabilities of its tokens, treating tokens as independent.
n-gram model
Defines the conditional probability of the n-th token given the preceding n-1 tokens.
Smoothing
Reassigning probability mass so that unseen (zero-count) events receive non-zero probability.
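The n-gram and smoothing entries above can be combined in one sketch: a bigram (n = 2) model with add-one (Laplace) smoothing, one common smoothing choice; the training sentence is illustrative:

```python
from collections import Counter

def train_bigram(tokens):
    # count adjacent pairs (bigrams) and single tokens (unigrams)
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    return bigrams, unigrams

def bigram_prob(w_prev, w, bigrams, unigrams, vocab_size):
    # Add-one smoothing: every bigram count is incremented by 1,
    # so unseen pairs still receive non-zero probability mass.
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

tokens = "the cat sat on the mat".split()
bigrams, unigrams = train_bigram(tokens)
V = len(set(tokens))                                        # vocabulary size = 5

p_seen = bigram_prob("the", "cat", bigrams, unigrams, V)    # (1+1)/(2+5)
p_unseen = bigram_prob("cat", "mat", bigrams, unigrams, V)  # (0+1)/(1+5)
```

Without smoothing, p_unseen would be exactly zero, which would zero out the probability of any sequence containing that pair.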