A unique word, treated as a single dictionary entry (i.e., a distinct token).
Vocabulary
The set of word types.
Zipf’s law
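Given a word's count f in a corpus and its rank r when words are sorted by decreasing count (r = 1 for the most frequent word), f * r ≈ constant. In other words, a word's frequency is roughly inversely proportional to its rank.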
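As a rough illustration of how the rank-frequency product can be tabulated, here is a minimal sketch. The function name rank_frequency and the toy sentence are assumptions made for this example only. A text this small will not show the near-constant product; the pattern is only expected to emerge on a large corpus.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, count, rank * count) tuples, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending count; break ties alphabetically so the order is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ordered, start=1)]

# Toy input (hypothetical); replace with the tokens of a real corpus.
tokens = "the cat and the dog and the bird".split()
table = rank_frequency(tokens)
for rank, word, count, product in table:
    print(rank, word, count, product)
```

On a real corpus, the last column should stay within the same order of magnitude across a wide range of ranks if the law holds.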
Bag of Words
Let m denote the size of the vocabulary. Given a document d, let c(w,d) denote the number of occurrences of word w in d. Then the Bag-of-Words representation of the document is v_d = [c(w_1,d), c(w_2,d), …, c(w_m,d)] / Z_d, where Z_d = ∑_w c(w,d) is the total word count of d, so the entries of v_d sum to 1.
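The definition above can be sketched in a few lines. The function name bag_of_words, the four-word vocabulary, and the sample document are hypothetical, and the sketch assumes that tokens outside the vocabulary are simply ignored, which the definition does not specify.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Return v_d: counts c(w, d) for each vocabulary word, divided by Z_d."""
    counts = Counter(tokens)
    # c(w_i, d) for each vocabulary word, in vocabulary order.
    raw = [counts[w] for w in vocabulary]
    # Z_d: total count over vocabulary words (out-of-vocabulary tokens are skipped).
    z = sum(raw)
    if z == 0:
        # Assumption: a document with no vocabulary words maps to the zero vector.
        return [0.0] * len(vocabulary)
    return [c / z for c in raw]

# Hypothetical vocabulary (m = 4) and document.
vocabulary = ["bird", "cat", "dog", "the"]
doc = "the cat the dog".split()
v_d = bag_of_words(doc, vocabulary)
print(v_d)
```

Word order is discarded: any permutation of the document's tokens yields the same vector.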
tf
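Normalized term frequency: tf_w = c(w,d) / max_v c(v,d), where the maximum is taken over all words v in document d, so the most frequent word in d has tf = 1.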
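A minimal sketch of this max-normalized term frequency follows. The function name normalized_tf and the sample document are assumptions for illustration, and returning an empty mapping for an empty document is a choice made here rather than part of the definition.

```python
from collections import Counter

def normalized_tf(tokens):
    """Map each word w in the document to c(w, d) / max_v c(v, d)."""
    counts = Counter(tokens)
    if not counts:
        # Assumption: an empty document has no term frequencies.
        return {}
    max_count = max(counts.values())
    return {w: c / max_count for w, c in counts.items()}

# Hypothetical document.
tf = normalized_tf("the cat the dog".split())
print(tf)
```

Dividing by the maximum count rather than the total count keeps values comparable across documents of different lengths, with the most frequent word always at 1.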