Natural Language Processing

Word token: Occurrence of a word.
Word type: Unique word as a dictionary entry (i.e., unique tokens).
Vocabulary: The set of word types.
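
The distinction between tokens, types, and the vocabulary can be illustrated with a minimal sketch. The example sentence and the whitespace tokenizer are assumptions made for illustration only; real tokenizers also handle punctuation and casing.

```python
# Toy text; splitting on whitespace is an assumed, simplistic tokenizer.
text = "the cat sat on the mat the end"

tokens = text.split()      # word tokens: every occurrence is counted
vocabulary = set(tokens)   # word types: each distinct word appears once

num_tokens = len(tokens)
num_types = len(vocabulary)
print(num_tokens, num_types)
```

Here "the" contributes three tokens but only one type.
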
Zipf’s law: Given a word's count f and its rank r (words sorted by descending count), f * r ≈ constant.
Bag of Words: Let m denote the size of the vocabulary. Given a document d, let c(w,d) denote the number of occurrences of word w in d. Then the Bag-of-Words representation of the document is v_d = [c(w_1,d), c(w_2,d), …, c(w_m,d)] / Z_d, where Z_d = ∑_w c(w,d).
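
A minimal sketch of the Bag-of-Words vector defined above. The function name `bag_of_words`, the fixed vocabulary order, and the toy document are hypothetical choices for illustration; returning a zero vector when no vocabulary word occurs is likewise an assumption.

```python
from collections import Counter

def bag_of_words(document_tokens, vocabulary):
    """Return v_d: counts c(w, d) in vocabulary order, divided by Z_d."""
    counts = Counter(document_tokens)
    z_d = sum(counts[w] for w in vocabulary)  # Z_d = sum over w of c(w, d)
    if z_d == 0:
        # Assumed convention: an all-zero vector if no vocabulary word occurs.
        return [0.0] * len(vocabulary)
    return [counts[w] / z_d for w in vocabulary]

vocabulary = ["cat", "mat", "on", "sat", "the"]  # w_1 ... w_m, with m = 5
doc = "the cat sat on the mat".split()
v_d = bag_of_words(doc, vocabulary)
print(v_d)
```

Because of the division by Z_d, the entries of v_d sum to 1, so documents of different lengths become comparable.
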
tf: Normalized term frequency tf_w = c(w,d) / max_v c(v,d)
idf: Inverse document frequency idf_w = log (total #documents / #documents containing w)
tf-idf: tf_w * idf_w
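
The tf, idf, and tf-idf definitions above can be sketched as follows. The function name `tf_idf`, the natural logarithm, and the three toy documents are assumptions for illustration; library implementations often use different normalizations.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: tf-idf weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        if not counts:            # empty document: no weights to compute
            weights.append({})
            continue
        max_count = max(counts.values())  # max_v c(v, d)
        weights.append({
            w: (c / max_count) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the bird flew".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

A word such as "the" that occurs in every document gets idf = log(1) = 0, so its tf-idf weight is zero regardless of how often it appears.
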
Cosine similarity: Cosine of the angle between two vectors sim(x,y) = x^T y / [sqrt(x^T x) * sqrt(y^T y)]
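
A short sketch of the cosine similarity formula above, using plain Python lists. The function name is hypothetical, and returning 0.0 for a zero vector is an assumed convention, since the cosine is mathematically undefined in that case.

```python
import math

def cosine_similarity(x, y):
    """sim(x, y) = x^T y / (sqrt(x^T x) * sqrt(y^T y)) for equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_x * norm_y)

same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])
orthogonal = cosine_similarity([1, 0], [0, 1])
print(same_direction, orthogonal)
```

The similarity depends only on direction, not length, which is why it pairs naturally with Bag-of-Words or tf-idf vectors of documents of different sizes.
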
Unigram model: Define the probability of a sequence as the product of the probabilities of its tokens, treating each token as independent of the others.
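
As a sketch of the unigram model, the snippet below scores a sequence by multiplying token probabilities (summing logs to avoid underflow). Estimating the token probabilities by relative frequency in a toy corpus is an assumption here; the definition above does not say how the probabilities are obtained.

```python
import math
from collections import Counter

def unigram_log_prob(sequence, token_probs):
    """Log-probability of a sequence: sum of the log-probabilities of its tokens."""
    return sum(math.log(token_probs[t]) for t in sequence)

corpus = "the cat sat on the mat".split()
counts = Counter(corpus)
total = sum(counts.values())
# Assumed estimator: relative frequency of each token in the toy corpus.
token_probs = {w: c / total for w, c in counts.items()}

p = math.exp(unigram_log_prob(["the", "cat"], token_probs))
print(p)
```

Note that a token missing from `token_probs` raises a `KeyError` in this sketch; handling such unseen tokens is what smoothing addresses.
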
n-gram model: Define the conditional probability of the n-th token given the preceding n-1 tokens.
Smoothing: Reassigning probability mass so that zero-count entries (unseen events) receive non-zero probability.
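
To make the n-gram and smoothing entries concrete, here is a minimal bigram (n = 2) sketch with add-one (Laplace) smoothing. Add-one is only one of several smoothing schemes and is chosen here as an assumption; the function names and toy corpus are likewise hypothetical.

```python
from collections import Counter

def train_bigram(tokens):
    """Count bigrams and the contexts (first elements of bigrams) they start from."""
    context_counts = Counter(tokens[:-1])            # c(prev)
    bigram_counts = Counter(zip(tokens, tokens[1:])) # c(prev, word)
    return context_counts, bigram_counts

def bigram_prob(prev, word, context_counts, bigram_counts, vocab_size):
    """Add-one estimate: P(word | prev) = (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
context_counts, bigram_counts = train_bigram(tokens)

p_seen = bigram_prob("the", "cat", context_counts, bigram_counts, len(vocab))
p_unseen = bigram_prob("the", "sat", context_counts, bigram_counts, len(vocab))
print(p_seen, p_unseen)
```

Without the added 1, the unseen bigram ("the", "sat") would get probability zero; with it, every word in the vocabulary receives some mass while the conditional distribution still sums to 1.
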