In Depth: Naive Bayes Classification
The previous four sections have given a general overview of the concepts of machine learning. In this section and the ones that follow, we will be taking a closer look at several specific algorithms for supervised and unsupervised learning, starting here with naive Bayes classification.
Naive Bayes models are a group of extremely fast and simple classification algorithms that are often suitable for very high-dimensional datasets. Because they are so fast and have so few tunable parameters, they end up being very useful as a quick-and-dirty baseline for a classification problem. This section will focus on an intuitive explanation of how naive Bayes classifiers work, followed by a couple of examples of them in action on some datasets.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

Gaussian Naive Bayes
Perhaps the easiest naive Bayes classifier to understand is Gaussian naive Bayes. In this classifier, the assumption is that data from each label is drawn from a simple Gaussian distribution. Imagine that you have the following data:
from sklearn.datasets import make_blobs
X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu');

One extremely fast way to create a simple model is to assume that the data is described by a Gaussian distribution with no covariance between dimensions. This model can be fit by simply finding the mean and standard deviation of the points within each label, which is all you need to define such a distribution. The result of this naive Gaussian assumption is shown in the following figure:
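Before turning to scikit-learn's implementation, the per-label fit described above can be sketched directly in NumPy: for each label, estimate the mean and standard deviation of each feature. (The `params` dictionary here is purely illustrative, not part of scikit-learn's API.)

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same data as above
X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)

# For each label, the "naive Gaussian" fit is just the per-feature
# mean and standard deviation of the points carrying that label.
params = {}
for label in np.unique(y):
    X_label = X[y == label]
    params[label] = (X_label.mean(axis=0), X_label.std(axis=0))

for label, (mu, sigma) in params.items():
    print(label, mu.round(2), sigma.round(2))
```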
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y);

Now let's generate some new data and predict the label:
rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)
ynew = model.predict(Xnew)

Now we can plot this new data to get an idea of where the decision boundary is:
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
lim = plt.axis()
plt.scatter(Xnew[:, 0], Xnew[:, 1], c=ynew, s=20, cmap='RdBu', alpha=0.1)
plt.axis(lim);

We see a slightly curved boundary in the classifications; in general, the boundary in Gaussian naive Bayes is quadratic.
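One quick way to check this claim numerically (a sketch, not part of the original text): the log-odds between the two classes is a quadratic function of the input for Gaussian naive Bayes, so sampling it on an evenly spaced line should give constant second differences.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)
model = GaussianNB().fit(X, y)

# Evaluate log P(y=1|x) - log P(y=0|x) along a line in feature space
# (vary feature 0, hold feature 1 fixed). For Gaussian naive Bayes
# this log-odds is quadratic in x, so its second differences on an
# evenly spaced grid should be (numerically) constant.
t = np.linspace(-5, 5, 9)
pts = np.column_stack([t, np.zeros_like(t)])
log_odds = np.diff(model.predict_log_proba(pts), axis=1).ravel()
second_diff = np.diff(log_odds, 2)
print(second_diff.round(6))
```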
A nice piece of this Bayesian formalism is that it naturally allows for probabilistic classification, which we can compute using the predict_proba method:
yprob = model.predict_proba(Xnew)
yprob[-8:].round(2)

The columns give the posterior probabilities of the first and second label, respectively. If you are looking for estimates of uncertainty in your classification, Bayesian approaches like this one can be useful.
Of course, the final classification will only be as good as the model assumptions that lead to it, which is why Gaussian naive Bayes often does not produce very good results. Still, in many cases, especially as the number of features becomes large, this assumption is not detrimental enough to prevent Gaussian naive Bayes from being a useful method.