On the Linearity of Bayesian Classifiers
In his book Neural Networks: A Comprehensive Foundation, Simon Haykin devotes an entire section (3.10) to how perceptrons and Bayesian classifiers are closely related when operating in a Gaussian environment. However, it is not until the end of the section that Haykin mentions that the relationship is limited to linearity. What is interesting about this is that a perceptron can produce the same classification “model” as a Bayesian classifier, provided that the underlying data is drawn from a Gaussian distribution. This post is an experimental verification of that claim.
All linear classifiers, either implicitly or explicitly, take the following form:
$$ y = \sum_{i=1}^{m} w_{i}x_{i} + b = \mathbf{w^{T}x} + b $$
where $\mathbf{w}$ is the weight vector (or coefficients) of the classifier, $b$ is the bias (or intercept), $\mathbf{x}$ is the input vector and $y$ is the scalar output. The sign of $y$ denotes the predicted class for a given input $\mathbf{x}$.
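In code, this decision rule is just a dot product followed by a sign check. Here is a minimal sketch (the function name and variable names are mine, not Haykin's):
import numpy as np

def linear_predict(x, w, b):
    # class 1 if the discriminant is positive, class 2 otherwise
    return 1 if np.dot(w, x) + b > 0 else 2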
A perceptron can easily be characterized by such an expression, but a Bayesian classifier has no concept of weights and biases or slopes and intercepts. It makes decisions by computing the log-likelihood ratio $\log\Lambda(\mathbf{x})$ of the input vector $\mathbf{x}$ and comparing it to a threshold $\xi$; the outcome of that comparison decides the predicted class for $\mathbf{x}$.
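To make that rule concrete, here is a small sketch of a two-class Bayesian decision function, assuming both class-conditional densities are Gaussian with known means and a shared covariance matrix (the function name and the use of scipy.stats are my own choices, not from the book):
from scipy.stats import multivariate_normal

def bayes_predict(x, mu1, mu2, C, xi=0.0):
    # log-likelihood ratio of class 1 versus class 2
    log_lambda = (multivariate_normal.logpdf(x, mean=mu1, cov=C)
                  - multivariate_normal.logpdf(x, mean=mu2, cov=C))
    # classify by comparing the ratio to the threshold
    return 1 if log_lambda > xi else 2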
After some straightforward but somewhat lengthy algebra, Haykin is able to show that the slope-and-intercept form and the log-likelihood-and-threshold form are related in a simple manner. Suppose we define the following equations,
$$ y = \log\Lambda(\mathbf{x}) $$
$$ \mathbf{w} = \mathbf{C}^{-1}(\mathbf{\mu_{1}} - \mathbf{\mu_{2}}) $$
$$ b = \frac{1}{2}(\mathbf{\mu_{2}^{T}}\mathbf{C}^{-1}\mathbf{\mu_{2}} - \mathbf{\mu_{1}^{T}}\mathbf{C}^{-1}\mathbf{\mu_{1}}) $$
where $\mathbf{C}$ is the covariance matrix of the dataset $X = [\mathbf{x_{1}}, \mathbf{x_{2}}, \dots, \mathbf{x_{n}}]$, and $\mathbf{\mu_{i}}$ is the mean of the input vectors belonging to the $i$th class, for $i \in [1, 2]$, then the decision function can be rewritten as
$$ y = \mathbf{w^{T}x} + b $$
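The algebra is short enough to sketch here (assuming equal priors, so that the threshold $\xi$ vanishes, and a covariance matrix $\mathbf{C}$ shared by both classes). The log-likelihood ratio of two Gaussians with means $\mathbf{\mu_{1}}$ and $\mathbf{\mu_{2}}$ is
$$ \log\Lambda(\mathbf{x}) = -\frac{1}{2}(\mathbf{x} - \mathbf{\mu_{1}})^{T}\mathbf{C}^{-1}(\mathbf{x} - \mathbf{\mu_{1}}) + \frac{1}{2}(\mathbf{x} - \mathbf{\mu_{2}})^{T}\mathbf{C}^{-1}(\mathbf{x} - \mathbf{\mu_{2}}) $$
Expanding both quadratic forms, the $\mathbf{x}^{T}\mathbf{C}^{-1}\mathbf{x}$ terms cancel, leaving
$$ \log\Lambda(\mathbf{x}) = (\mathbf{\mu_{1}} - \mathbf{\mu_{2}})^{T}\mathbf{C}^{-1}\mathbf{x} + \frac{1}{2}(\mathbf{\mu_{2}^{T}}\mathbf{C}^{-1}\mathbf{\mu_{2}} - \mathbf{\mu_{1}^{T}}\mathbf{C}^{-1}\mathbf{\mu_{1}}) $$
which is exactly $\mathbf{w^{T}x} + b$ with the definitions above.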
To verify this, let's make up a two-class classification problem and train a Gaussian naive Bayes classifier (scikit-learn's GaussianNB) on it.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
# make some data
x1 = np.random.multivariate_normal([0, 0], [[0.05, 0], [0, 0.05]], size=(1000,))
x2 = np.random.multivariate_normal([-1, -1], [[0.05, 0], [0, 0.05]], size=(1000,))
plt.figure(figsize=(8, 6))
plt.scatter(x1[:, 0], x1[:, 1], marker="o", c="g")
plt.scatter(x2[:, 0], x2[:, 1], marker="x", c="r")
# add labels, shuffle, and standardize the features (the labels stay out of the scaler)
x1 = np.c_[x1, np.ones((1000, 1))]
x2 = np.c_[x2, np.zeros((1000, 1))]
X = np.r_[x1, x2]
np.random.shuffle(X)
y = X[:, -1]
X = StandardScaler().fit_transform(X[:, :-1])
# define a function to help us draw a decision boundary
def draw_decision_boundary(clf, X, y):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                         np.arange(y_min, y_max, 0.1))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, alpha=0.4)
    plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.8)
# train a Gaussian Naive Bayes classifier on the dataset
clf = GaussianNB()
clf.fit(X, y)
draw_decision_boundary(clf, X, y)
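As a quick numeric sanity check in addition to the plot (my own addition, not part of the original walkthrough), the training accuracy should be high given how well separated the two clusters are:
print("training accuracy:", clf.score(X, y))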
Estimation of the weights and the bias
With a little bit of numpy.ndarray hacking, we can produce the $\mathbf{w}$ and $b$ from the equations above using the covariance matrix of $\mathbf{X}$ and its class-wise means.
C = np.cov(X.T)
mu1 = X[y == 1, :].mean(0)
mu2 = X[y == 0, :].mean(0)
c_inverse = np.linalg.inv(C)
weights = np.dot(c_inverse, mu1 - mu2).reshape(-1, 1)
intercept = 0.5 * (np.dot(np.dot(mu2.reshape(1, -1), c_inverse).ravel(), mu2) -
                   np.dot(np.dot(mu1.reshape(1, -1), c_inverse).ravel(), mu1))
Now that we have the estimated weights and the intercept, let’s create a Perceptron classifier from these and see its performance.
from sklearn.linear_model import Perceptron
clf = Perceptron()
clf.fit(X, y)  # the fit is only needed so that sklearn sets up its internal state
               # (classes_, etc.); the learned weights are overwritten below
clf.coef_ = weights.T
clf.intercept_ = np.array([intercept])
draw_decision_boundary(clf, X, y)
Observe that both classifiers draw almost the same decision line.
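We can also quantify the agreement directly, for example by comparing the two classifiers' label assignments over the training set (a small check of my own, not part of Haykin's derivation):
nb = GaussianNB().fit(X, y)
print("agreement:", np.mean(nb.predict(X) == clf.predict(X)))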
I think it needs to be emphasized that the relationship between a Bayesian classifier and a perceptron ends at linearity. They are not equivalent, or even complementary, beyond the assumptions of linearity and a normal distribution. A Bayesian classifier is parametric; a perceptron is not. A Bayesian classifier is generative; a perceptron is discriminative. So nothing practical may come of this exercise of deriving one from the other - it just goes to show that, under the right assumptions, a Bayesian classifier can be expressed as a dot product.