Maximum Likelihood Estimation
Suppose we have data $D = \{x^{(i)}\}_{i=1}^N$. The maximum likelihood estimate is
$$\theta_{\text{MLE}} = \operatorname*{argmax}_\theta \prod_{i=1}^N p(x^{(i)} \mid \theta).$$
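As a concrete illustration (not from the notes), the sketch below estimates a Bernoulli parameter by maximizing the log-likelihood over a grid of candidate values for $\theta$; the data and grid are hypothetical, and the grid argmax lands next to the closed-form MLE, the sample mean.

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 0])        # hypothetical Bernoulli data D = {x^(i)}

# log prod_i p(x^(i) | theta) = (#ones) log theta + (#zeros) log (1 - theta)
thetas = np.linspace(0.01, 0.99, 99)
loglik = x.sum() * np.log(thetas) + (len(x) - x.sum()) * np.log(1 - thetas)

theta_mle = thetas[loglik.argmax()]
print(theta_mle, x.mean())                     # grid argmax is close to the sample mean 0.625
```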
Maximum A Posteriori
Suppose we have data $D = \{x^{(i)}\}_{i=1}^N$. The maximum a posteriori estimate is
$$\theta_{\text{MAP}} = \operatorname*{argmax}_\theta \left( \prod_{i=1}^N p(x^{(i)} \mid \theta) \right) p(\theta).$$
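Continuing the same hypothetical example, the sketch below adds a $\text{Beta}(a, b)$ prior on $\theta$ (an assumption made only for illustration) and maximizes the log-posterior instead; with $a = b = 2$ the grid argmax matches the closed-form MAP estimate $(\#\text{ones} + a - 1)/(N + a + b - 2)$.

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 0])        # same hypothetical data
a, b = 2.0, 2.0                                # Beta(a, b) prior on theta (assumed)

thetas = np.linspace(0.01, 0.99, 99)
log_post = (x.sum() * np.log(thetas) + (len(x) - x.sum()) * np.log(1 - thetas)
            + (a - 1) * np.log(thetas) + (b - 1) * np.log(1 - thetas))

theta_map = thetas[log_post.argmax()]
print(theta_map)                               # 0.6 = (5 + 2 - 1) / (8 + 2 + 2 - 2)
```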
Generative approaches
The hypothesis $h(x, y) = p(x, y)$ specifies a generative story for how the data were created; we then pick a hypothesis by maximum likelihood estimation (MLE) or maximum a posteriori (MAP) estimation.
Discriminative approaches
The hypothesis $h$ directly predicts the label given the features, $y = h(x)$, or more generally the conditional distribution $p(y \mid x) = h(x)$; we then define a loss function and find the hypothesis with minimum loss.
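As a hedged sketch of this recipe (the model, data, and learning rate are all illustrative choices, not from the notes), the code below models $p(y \mid x)$ with a logistic function of a linear score and minimizes the average log loss by gradient descent on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype(float)      # synthetic labels

w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))     # h(x) = predicted P(y=1 | x)
    w -= lr * X.T @ (p - y) / len(y)           # gradient step on the average log loss
    b -= lr * (p - y).mean()

print(((p > 0.5) == y).mean())                 # training accuracy on the toy data
```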
Naïve Bayes
$$P(x_1, x_2, \ldots, x_K, y) = P(x_1, x_2, \ldots, x_K \mid y)\, P(y) = \left( \prod_{k=1}^K P(x_k \mid y) \right) P(y)$$
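A small numeric illustration of this factorization, with made-up parameter tables for $K = 3$ binary features and binary $y$:

```python
import numpy as np

p_y = np.array([0.6, 0.4])                     # hypothetical P(y=0), P(y=1)
p_x_given_y = np.array([[0.2, 0.7],            # p_x_given_y[k, y] = P(x_k = 1 | y)
                        [0.5, 0.9],
                        [0.1, 0.3]])

def joint(x, y):
    """P(x_1, ..., x_K, y) = P(y) * prod_k P(x_k | y)."""
    x = np.asarray(x)
    px = np.where(x == 1, p_x_given_y[:, y], 1 - p_x_given_y[:, y])
    return p_y[y] * px.prod()

print(joint([1, 0, 1], y=1))                   # 0.4 * 0.7 * 0.1 * 0.3 = 0.0084
```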
Bernoulli Naïve Bayes
$y \sim \text{Bernoulli}(\phi)$, and $x_k \mid y \sim \text{Bernoulli}(\theta_{k,y})$ for $k = 1, 2, \ldots, K$, so that
$$p_{\phi,\theta}(x, y) = p_\phi(y) \prod_{k=1}^K p_{\theta_{k,y}}(x_k \mid y).$$
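A minimal training-and-prediction sketch for Bernoulli Naïve Bayes on made-up binary data; the add-one (Laplace) smoothing in the parameter estimates is an extra assumption on top of the plain MLE, used here only to avoid $\log 0$.

```python
import numpy as np

X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]])                      # hypothetical data, shape (N, K)
y = np.array([1, 1, 0, 0])

phi = y.mean()                                 # estimate of P(y = 1)
theta = np.stack([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2)
                  for c in (0, 1)])            # theta[c, k] ~ P(x_k = 1 | y = c), add-one smoothed

def log_joint(x, c):
    """log p_{phi,theta}(x, y=c) = log p(y=c) + sum_k log p(x_k | y=c)."""
    log_py = np.log(phi if c == 1 else 1 - phi)
    log_px = np.where(x == 1, np.log(theta[c]), np.log(1 - theta[c])).sum()
    return log_py + log_px

x_new = np.array([1, 0, 1])
print(max((0, 1), key=lambda c: log_joint(x_new, c)))   # most probable class for x_new
```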
Multinomial Naïve Bayes
For $i \in \{1, 2, \ldots, N\}$: $y^{(i)} \sim \text{Bernoulli}(\phi)$, and for $j \in \{1, \ldots, M_i\}$: $x_j^{(i)} \sim \text{Multinomial}(\theta_{y^{(i)}}, 1)$.
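A sketch of the multinomial variant on made-up word-count vectors (vocabulary size $V = 4$); again the add-one smoothing is an assumption added for numerical convenience, not part of the notes.

```python
import numpy as np

X = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 2, 3],
              [0, 1, 1, 2]])                   # hypothetical word counts, shape (N, V)
y = np.array([1, 1, 0, 0])

phi = y.mean()                                 # estimate of P(y = 1)
theta = np.stack([(X[y == c].sum(axis=0) + 1) / (X[y == c].sum() + X.shape[1])
                  for c in (0, 1)])            # theta[c, v] ~ P(word v | y = c), smoothed

def log_joint(counts, c):
    """log p(x, y=c), up to the count-dependent multinomial coefficient shared by both classes."""
    log_py = np.log(phi if c == 1 else 1 - phi)
    return log_py + (counts * np.log(theta[c])).sum()

print(max((0, 1), key=lambda c: log_joint(np.array([1, 0, 1, 0]), c)))
```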
Gaussian Naïve Bayes
Gaussian Naïve Bayes assumes that each class-conditional density $p(x_k \mid y)$ is given by a normal (Gaussian) distribution.
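A sketch with per-class, per-feature means and variances estimated from synthetic two-dimensional data (all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])   # synthetic features
y = np.array([0] * 50 + [1] * 50)

prior = np.array([(y == c).mean() for c in (0, 1)])
mu = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
var = np.stack([X[y == c].var(axis=0) for c in (0, 1)])

def log_joint(x, c):
    """log p(y=c) + sum_k log N(x_k; mu_{c,k}, var_{c,k})."""
    log_px = -0.5 * (np.log(2 * np.pi * var[c]) + (x - mu[c]) ** 2 / var[c])
    return np.log(prior[c]) + log_px.sum()

x_new = np.array([1.5, 1.8])
print(max((0, 1), key=lambda c: log_joint(x_new, c)))  # most probable class for x_new
```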
Multiclass Naïve Bayes
The only change is that we permit $y$ to range over the $C$ classes, and we have a separate distribution $p(x_k \mid y)$ for each of the $C$ classes.
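Extending the Bernoulli sketch above to $C$ classes only requires a categorical prior over $y$ and one parameter vector per class; the data here are again made up.

```python
import numpy as np

X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [0, 1], [1, 1]])   # hypothetical binary features
y = np.array([0, 0, 1, 1, 2, 2])
C = 3

prior = np.array([(y == c).mean() for c in range(C)])             # categorical P(y = c)
theta = np.stack([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2)
                  for c in range(C)])                              # add-one smoothing (assumption)

def log_joint(x, c):
    log_px = np.where(x == 1, np.log(theta[c]), np.log(1 - theta[c])).sum()
    return np.log(prior[c]) + log_px

print(max(range(C), key=lambda c: log_joint(np.array([1, 0]), c)))  # argmax over the C classes
```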