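How to Build a Spam Detector with Machine Learning and Python

Modern spam detection systems rely on machine learning, which has proven superior for many classification tasks when enough training data is available.

This tutorial shows you how to build a spam detector with supervised learning. More specifically, you will use Python to train a logistic regression model that classifies emails as spam or non-spam.

Prerequisites

You will use NumPy, SciPy, scikit-learn, and Matplotlib: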

import numpy as np
import scipy.io
import sklearn.metrics
import matplotlib.pyplot as plt

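Download the spam dataset, which contains 4,601 emails. The dataset is split 3:1 into a training set and a test set. Each email is labeled 0 (non-spam) or 1 (spam) and is described by 57 features: 3 length statistics on runs of consecutive capital letters, plus the percentage frequencies of 48 words and 6 characters: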

features = np.array(
    [
        "word_freq_make",
        "word_freq_address",
        "word_freq_all",
        "word_freq_3d",
        "word_freq_our",
        "word_freq_over",
        "word_freq_remove",
        "word_freq_internet",
        "word_freq_order",
        "word_freq_mail",
        "word_freq_receive",
        "word_freq_will",
        "word_freq_people",
        "word_freq_report",
        "word_freq_addresses",
        "word_freq_free",
        "word_freq_business",
        "word_freq_email",
        "word_freq_you",
        "word_freq_credit",
        "word_freq_your",
        "word_freq_font",
        "word_freq_000",
        "word_freq_money",
        "word_freq_hp",
        "word_freq_hpl",
        "word_freq_george",
        "word_freq_650",
        "word_freq_lab",
        "word_freq_labs",
        "word_freq_telnet",
        "word_freq_857",
        "word_freq_data",
        "word_freq_415",
        "word_freq_85",
        "word_freq_technology",
        "word_freq_1999",
        "word_freq_parts",
        "word_freq_pm",
        "word_freq_direct",
        "word_freq_cs",
        "word_freq_meeting",
        "word_freq_original",
        "word_freq_project",
        "word_freq_re",
        "word_freq_edu",
        "word_freq_table",
        "word_freq_conference",
        "char_freq_;",
        "char_freq_(",
        "char_freq_[",
        "char_freq_!",
        "char_freq_$",
        "char_freq_#",
        "capital_run_length_average",
        "capital_run_length_longest",
        "capital_run_length_total",
    ]
)

Load the data

First, load the data into the appropriate training and test variables:

data = scipy.io.loadmat("spamData.mat")
X = data["Xtrain"]
N = X.shape[0]
D = X.shape[1]
Xtest = data["Xtest"]
Ntest = Xtest.shape[0]
y = data["ytrain"].squeeze().astype(int)
ytest = data["ytest"].squeeze().astype(int)

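Next, normalize the scale of each feature by computing z-scores. The test set is standardized with the mean and standard deviation of the training set: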

Xz = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
Xtestz = (Xtest - np.mean(X, axis=0)) / np.std(X, axis=0)
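
The same normalization can be tried on a tiny example. The sketch below is illustrative only: the helper names zscore_fit and zscore_apply and the toy arrays are assumptions, not part of the tutorial's code. It also guards against constant features, whose standard deviation of 0 would otherwise cause a division by zero.

```python
import numpy as np

def zscore_fit(X_train):
    """Return per-feature mean and standard deviation of the training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Assumption: constant features get std 1 so the division is safe.
    std = np.where(std == 0, 1.0, std)
    return mean, std

def zscore_apply(X, mean, std):
    """Standardize X with statistics computed on the training set."""
    return (X - mean) / std

# Toy data, made up for illustration only.
X_toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_toy_test = np.array([[2.0, 40.0]])

mean, std = zscore_fit(X_toy_train)
Z_train = zscore_apply(X_toy_train, mean, std)
Z_test = zscore_apply(X_toy_test, mean, std)

print(Z_train.mean(axis=0))  # close to 0 for every feature
print(Z_train.std(axis=0))   # close to 1 for every feature
print(Z_test)                # test rows are scaled with training statistics
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.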

Define the logistic regression model

Define the helper functions and the log-likelihood:

def logsumexp(x):
    offset = np.max(x, axis=0)
    return offset + np.log(np.sum(np.exp(x - offset), axis=0))

def logsigma(x):
    if not isinstance(x, np.ndarray):
        return -logsumexp(np.array([0, -x]))
    else:
        return -logsumexp(np.vstack((np.zeros(x.shape[0]),-x)))

def l(y, X, w):
    return np.sum(y*logsigma(X.dot(w)) + (1-y)*logsigma(-(X.dot(w))))
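
The helpers above subtract the maximum before exponentiating, which keeps the logarithm of the sigmoid finite for extreme inputs. The following sketch illustrates why this matters. It uses NumPy's built-in np.logaddexp in place of the tutorial's logsumexp, and the function names naive_logsigma and stable_logsigma are made up for this comparison.

```python
import numpy as np

def naive_logsigma(x):
    # Direct translation of log(1 / (1 + exp(-x))); exp overflows for very negative x.
    return np.log(1.0 / (1.0 + np.exp(-x)))

def stable_logsigma(x):
    # log sigma(x) = -log(1 + exp(-x)) = -logaddexp(0, -x)
    return -np.logaddexp(0.0, -x)

x = np.array([-1000.0, 0.0, 1000.0])

# Silence the overflow and divide-by-zero warnings the naive version triggers.
with np.errstate(over="ignore", divide="ignore"):
    naive = naive_logsigma(x)
stable = stable_logsigma(x)

print(naive)   # the first entry degenerates to -inf
print(stable)  # the first entry stays finite
```

A single -inf term would make the whole log-likelihood -inf, so the optimizer could no longer compare objective values between epochs.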

 

Define the gradient of the log-likelihood:

def sigma(x):
    # Computed as exp(log sigma(x)) so that np.exp cannot overflow for large x
    return np.exp(logsigma(x))

def dl(y, X, w):
    return (y - sigma(X.dot(w))).dot(X)
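
A quick way to gain confidence in a hand-derived gradient is to compare it against central finite differences. The sketch below does this on synthetic data. The functions log_likelihood, gradient, and numerical_gradient are self-contained stand-ins written for this check (they mirror l and dl but are not the tutorial's code), and the random data is an assumption.

```python
import numpy as np

def log_likelihood(y, X, w):
    # Same quantity as l(y, X, w), written with np.logaddexp for stability.
    z = X.dot(w)
    return np.sum(-y * np.logaddexp(0.0, -z) - (1 - y) * np.logaddexp(0.0, z))

def gradient(y, X, w):
    # Same formula as dl(y, X, w): sum_i (y_i - sigma(x_i . w)) * x_i
    p = 1.0 / (1.0 + np.exp(-X.dot(w)))
    return (y - p).dot(X)

def numerical_gradient(f, w, h=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = h
        grad[j] = (f(w + step) - f(w - step)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 3))      # synthetic features
y_toy = rng.integers(0, 2, size=20)   # synthetic 0/1 labels
w_toy = rng.normal(size=3)

analytic = gradient(y_toy, X_toy, w_toy)
numeric = numerical_gradient(lambda w: log_likelihood(y_toy, X_toy, w), w_toy)
max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)  # should be tiny if the gradient formula is right
```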

 

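Determine the parameters using maximum likelihood estimation (MLE) and gradient descent (GD)

Below is a Python framework that implements gradient descent. After each epoch it halves the step size if the objective increased and otherwise grows it by 5%: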

def optimize(obj_up, theta0, nepochs=50, eps0=0.01, verbose=True):
    f, update = obj_up
    theta = theta0
    values = np.zeros(nepochs + 1)
    eps = np.zeros(nepochs + 1)
    values[0] = f(theta0)
    eps[0] = eps0

    for epoch in range(nepochs):
        if verbose:
            print(
                "Epoch {:3d}: f={:10.3f}, eps={:10.9f}".format(
                    epoch, values[epoch], eps[epoch]
                )
            )
        theta = update(theta, eps[epoch])

        values[epoch + 1] = f(theta)
        if values[epoch] < values[epoch + 1]:
            eps[epoch + 1] = eps[epoch] / 2.0
        else:
            eps[epoch + 1] = eps[epoch] * 1.05

    if verbose:
        print("Result after {} epochs: f={}".format(nepochs, values[-1]))
    return theta, values, eps

def gd(y, X):
    def objective(w):
        return -(l(y, X, w))

    def update(w, eps):
        return w + eps * dl(y, X, w)

    return (objective, update)
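
To see the step size rule in isolation, the sketch below applies it to a simple quadratic whose minimum is known. The function bold_driver_descent and the toy objective are hypothetical and written only for this illustration. It is a compact stand-in for optimize, not a replacement.

```python
import numpy as np

def bold_driver_descent(f, grad, theta0, nepochs=50, eps0=0.01):
    """Gradient descent whose step size follows the rule used in optimize()."""
    theta = np.asarray(theta0, dtype=float)
    eps = eps0
    values = [f(theta)]
    for _ in range(nepochs):
        theta = theta - eps * grad(theta)
        values.append(f(theta))
        if values[-1] > values[-2]:
            eps /= 2.0      # objective got worse: halve the step size
        else:
            eps *= 1.05     # objective did not get worse: grow the step size slightly
    return theta, np.array(values)

# Hypothetical toy objective with its minimum at (1, -2).
target = np.array([1.0, -2.0])

def f(t):
    return np.sum((t - target) ** 2)

def grad(t):
    return 2.0 * (t - target)

theta_hat, values = bold_driver_descent(f, grad, np.zeros(2), nepochs=200)
print(theta_hat, values[-1])
```

Because the step size keeps growing while the objective improves, the rule occasionally overshoots. The halving step then pulls it back, so the objective is not guaranteed to decrease in every single epoch.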

You can now run GD to obtain the optimized weights:

np.random.seed(0)
w0 = np.random.normal(size=D)
wz_gd, vz_gd, ez_gd = optimize(gd(y, Xz), w0, nepochs=100)

Prediction

Finally, define a predictor that returns a spam confidence value and a classifier that turns it into a label:

def predict(Xtest, w):
    return sigma(Xtest.dot(w))

def classify(Xtest, w):
    threshold = 0.5  # initial threshold
    return (sigma(Xtest.dot(w)) > threshold).astype(int)

yhat = predict(Xtestz, wz_gd)
ypred = classify(Xtestz, wz_gd)
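
Once hard predictions are available, they can be summarized with standard metrics. The sketch below uses scikit-learn's accuracy_score and confusion_matrix on made-up labels. The arrays y_true_toy and y_pred_toy are assumptions that stand in for ytest and ypred.

```python
import numpy as np
import sklearn.metrics

# Made-up labels and predictions, for illustration only.
y_true_toy = np.array([0, 0, 1, 1, 1, 0])
y_pred_toy = np.array([0, 1, 1, 1, 0, 0])

accuracy = sklearn.metrics.accuracy_score(y_true_toy, y_pred_toy)
# Rows are true classes, columns are predicted classes (0 first, then 1).
confusion = sklearn.metrics.confusion_matrix(y_true_toy, y_pred_toy)

print(accuracy)
print(confusion)
```

The off-diagonal entries separate the two kinds of error: non-spam flagged as spam (top right) and spam that slipped through (bottom left).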

Plot the precision-recall curve to find a better threshold:

precision, recall, thresholds = sklearn.metrics.precision_recall_curve(ytest, yhat)
plt.plot(recall, precision)
for x in np.linspace(0, 1, 10, endpoint=False):
    index = int(x * (precision.size - 1))
    plt.text(recall[index], precision[index], "{:3.2f}".format(thresholds[index]))
plt.xlabel("Recall")
plt.ylabel("Precision")
# 0.44 looks good
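
Reading a threshold off the plot works, but the choice can also be made programmatically. The sketch below picks the threshold that maximizes the F1 score. That criterion is an assumption made for illustration, since the tutorial chooses its threshold visually, and the labels and scores are made up.

```python
import numpy as np
import sklearn.metrics

# Made-up labels and scores, for illustration only.
y_true_toy = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])

precision, recall, thresholds = sklearn.metrics.precision_recall_curve(
    y_true_toy, scores_toy
)
# precision and recall have one more entry than thresholds; drop the last point.
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)  # guard against division by zero

best = np.argmax(f1)
best_threshold = thresholds[best]
print(best_threshold, f1[best])
```

If false positives are costlier than missed spam, a criterion that weights precision more heavily may be a better fit than F1.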

Look at the largest weights:

features[np.where(wz_gd>2)]
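
A fixed cutoff such as 2 depends on the scale of the learned weights. As an alternative, the sketch below ranks features by the absolute value of their weight and keeps the top few. The names and weights are hypothetical placeholders for features and wz_gd.

```python
import numpy as np

# Hypothetical feature names and weights, for illustration only.
names = np.array(["feat_a", "feat_b", "feat_c", "feat_d"])
weights = np.array([0.3, -2.5, 1.7, 0.9])

order = np.argsort(-np.abs(weights))  # indices sorted by decreasing |weight|
top_k = 2
for name, weight in zip(names[order[:top_k]], weights[order[:top_k]]):
    print("{:10s} {:+.2f}".format(name, weight))

top_names = list(names[order[:top_k]])
```

Sorting by absolute value also surfaces strongly negative weights, which point to features that indicate non-spam.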

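Unsurprisingly, you will find that char_freq_$ and capital_run_length_longest have a significant effect on the result. In other words, spam often contains dollar signs and words written in capital letters.

Conclusion

In this tutorial, you learned how to build an email spam detector with machine learning and Python. For further practice, try finding another dataset and building a binary classification model with the framework presented here.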
