Machine Learning: Logistic Regression


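1. A Shift in Thinking: From Linear Regression to Logistic Regression

1.1 Key differences

Dimension	Linear regression	Logistic regression
Output type	Continuous value	Probability in (0, 1)
Objective	Least squares	Maximum likelihood estimation
Mathematical expression	$y=w^Tx+b$	$p=\frac{1}{1+e^{-(w^Tx+b)}}$
Typical applications	House price prediction	Credit scoring, disease diagnosis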

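1.2 The Sigmoid function and its properties

# Check the behavior of the Sigmoid function at large positive and negative inputs
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print("Probability for input 10:", sigmoid(10))    # about 0.99995
print("Probability for input -10:", sigmoid(-10))  # about 0.000045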

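1.3 Logistic regression

Logistic regression needs a threshold p, chosen in advance, that sets the classification boundary: a sample whose predicted probability is at or above p is assigned to class 1, and any other sample to class 0.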
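To make the role of the threshold concrete, the sketch below turns a few probabilities into class labels. The probabilities and the second threshold value (0.7) are made up for illustration, and `classify` is a hypothetical helper name; the rule "at or above p gives class 1" mirrors the prediction function used later in this article with p = 0.5.

```python
import numpy as np

def classify(probabilities, threshold=0.5):
    """Return 1 where the probability reaches the threshold, else 0."""
    return (np.asarray(probabilities) >= threshold).astype(int)

# Hypothetical predicted probabilities for five samples
probs = np.array([0.05, 0.40, 0.50, 0.65, 0.95])

labels_default = classify(probs)       # threshold p = 0.5
labels_strict = classify(probs, 0.7)   # a stricter threshold p = 0.7

print(labels_default)  # [0 0 1 1 1]
print(labels_strict)   # [0 0 0 0 1]
```

Raising the threshold makes the model more reluctant to predict class 1, which trades missed positives for fewer false alarms.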

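2. Predicting Whether a Client Has a Disease

The data set provides many parameters, for example:

age: age.
sex: sex.
cp: chest pain type.
trestbps: resting blood pressure.

Using these to decide whether a person has the disease turns the task into a classification problem.

Another example of the same kind is deciding whether a person passes an exam.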

3. Loss Function

Linear regression

The loss function for linear regression is:
$L(w,b)=\frac{1}{N}\sum_{(x,y)\in D}Loss(h(x),y)=\frac{1}{N}\sum_{(x,y)\in D}Loss(y^{\prime},y)$

$\bar{e}=\frac{1}{n}\sum_{i=1}^n(wx_i-y_i)^2$

$\bar{e}=\frac{1}{n}\sum_{i=1}^nx_i^2w^2-2\frac{1}{n}\sum_{i=1}^nx_iy_iw+\frac{1}{n}\sum_{i=1}^ny_i^2$

$\bar{e}=aw^2+bw+c$

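$\bar{e}$ is therefore a quadratic function of w with a minimum that can be found directly, and its gradient with respect to w determines the step taken in gradient descent. For logistic regression, however, the squared error passed through the Sigmoid is no longer a simple bowl in w, so this approach may not reach the lowest point.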
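As a worked check of the expansion above, the following sketch computes the coefficients a, b and c for a small made-up data set and compares the vertex of the parabola, w = -b/(2a), with a brute-force grid search. The data values and the grid range are illustrative assumptions.

```python
import numpy as np

# Made-up 1-D data, roughly y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Coefficients of e(w) = a*w^2 + b*w + c, read off from the expansion
a = np.mean(x ** 2)
b = -2.0 * np.mean(x * y)
c = np.mean(y ** 2)

def mse(w):
    """Mean squared error of the model y' = w * x."""
    return np.mean((w * x - y) ** 2)

# The vertex of the parabola gives the minimizing weight in closed form
w_best = -b / (2.0 * a)

# Brute-force search over a grid as an independent check
grid = np.linspace(0.0, 4.0, 4001)
w_grid = grid[np.argmin([mse(w) for w in grid])]

print("closed form:", w_best, "grid search:", w_grid)
print("quadratic matches MSE:", np.isclose(a * 1.5 ** 2 + b * 1.5 + c, mse(1.5)))
```

Both routes land on the same weight, which is what makes the squared-error loss convenient for linear regression.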

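Logistic regression

$Loss(h(x),y)=\begin{cases}-\log(h(x)), & y=1\\-\log(1-h(x)), & y=0\end{cases}$

If the true value is 1 and the hypothesis function predicts a probability close to 0, the loss is very large.
If the true value is 0 and the hypothesis function predicts a probability close to 1, the loss is very large.

Combining the two cases into a single expression gives:
$loss=-[y\cdot\log(h(x))+(1-y)\cdot\log(1-h(x))]$
We then define the loss function as:
$L(w,b)=-\frac{1}{N}\sum_{(x,y)\in D}[y\cdot\log(h(x))+(1-y)\cdot\log(1-h(x))]$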
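A quick numerical look at the per-sample loss may help here. The sketch below evaluates -[y log(h) + (1-y) log(1-h)] for a confident correct prediction and a confident wrong one; the probabilities 0.99 and 0.01 are arbitrary illustrative values, and `cross_entropy` is a hypothetical helper name.

```python
import numpy as np

def cross_entropy(y, h):
    """Per-sample loss -[y*log(h) + (1-y)*log(1-h)]."""
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

# True label 1: a confident correct prediction vs. a confident wrong one
loss_good = cross_entropy(1, 0.99)
loss_bad = cross_entropy(1, 0.01)

# True label 0: the picture is mirrored
loss_good0 = cross_entropy(0, 0.01)
loss_bad0 = cross_entropy(0, 0.99)

print(f"y=1, h=0.99 -> {loss_good:.4f}")
print(f"y=1, h=0.01 -> {loss_bad:.4f}")
print(f"y=0, h=0.01 -> {loss_good0:.4f}")
print(f"y=0, h=0.99 -> {loss_bad0:.4f}")
```

The wrong-but-confident cases cost several hundred times more than the right-and-confident ones, which is the behavior described in the two sentences above.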

4. Gradient Descent

$\text{gradient}=\frac{\partial}{\partial w}L(w,b)=\frac{\partial}{\partial w}\left\{-\frac{1}{N}\sum_{(x,y)\in D}[y\cdot\log(h(x))+(1-y)\cdot\log(1-h(x))]\right\}$

Step 1: State the form of the loss function (cross-entropy loss)

$-\frac{1}{N}\sum_{i=1}^N \left[ y^{(i)} \log(h^{(i)}) + (1-y^{(i)})\log(1-h^{(i)}) \right]$

Step 2: Isolate the loss of a single sample (the i-th sample)

$\text{loss}_i = -\left[ y \log(h) + (1-y)\log(1-h) \right]$

Step 3: Differentiate with respect to the parameter w using the chain rule

$\frac{\partial\,\text{loss}_i}{\partial w} = \frac{\partial}{\partial h}\left(-\left[y\log(h)+(1-y)\log(1-h)\right]\right) \cdot \frac{\partial h}{\partial w}$

Term-by-term derivatives:

$\begin{aligned} &\text{term a: } \frac{\partial}{\partial h}[-y \log(h)] = -\frac{y}{h} \\ &\text{term b: } \frac{\partial}{\partial h}[-(1-y)\log(1-h)] = \frac{1-y}{1-h} \end{aligned}$

Combined result:

$\frac{\partial\,\text{loss}_i}{\partial h} = \frac{1-y}{1-h} - \frac{y}{h}$

Step 4: Derivative of the Sigmoid function (the key chain-rule step)

$\begin{aligned} h &= \sigma(w^Tx + b) = \frac{1}{1+e^{-(w^Tx + b)}} \\ \frac{\partial h}{\partial w} &= h(1-h)x \quad \text{(a convenient property of the Sigmoid function)} \end{aligned}$
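The property used in step 4 is stated without proof. A short derivation, writing z = w^T x + b (the symbol z is introduced here only for brevity), might read:

```latex
\begin{aligned}
\sigma(z) &= \frac{1}{1+e^{-z}} \\
\frac{d\sigma}{dz} &= \frac{e^{-z}}{(1+e^{-z})^{2}}
  = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}}
  = \sigma(z)\,\bigl(1-\sigma(z)\bigr) \\
\frac{\partial h}{\partial w} &= \frac{d\sigma}{dz} \cdot \frac{\partial z}{\partial w}
  = h(1-h)\,x
\end{aligned}
```

The middle line uses the identity 1 - σ(z) = e^{-z} / (1 + e^{-z}).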

Step 5: Combine and simplify the gradient term

$\begin{aligned} \text{gradient term} &= \left( \frac{1-y}{1-h} - \frac{y}{h} \right) \cdot h(1-h)x \\ &= \left[ (1-y)h - y(1-h) \right]x \\ &= (h - y)x \quad \text{(the denominators cancel)} \end{aligned}$

Step 6: Average over all samples to get the full gradient

$\frac{\partial L}{\partial w} = \frac{1}{N}\sum_{i=1}^N (h^{(i)} - y^{(i)})x^{(i)}$
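One way to gain confidence in the final formula is a numerical gradient check. The sketch below compares the analytic gradient (1/N) Σ (h - y) x with central finite differences of the cross-entropy loss on random toy data. The data, the seed, and the step size `eps` are arbitrary choices for illustration, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, y, w, b):
    """Mean cross-entropy L(w, b)."""
    h = sigmoid(X @ w + b)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_grad_w(X, y, w, b):
    """Gradient from the derivation: (1/N) * sum((h - y) * x)."""
    h = sigmoid(X @ w + b)
    return X.T @ (h - y) / X.shape[0]

rng = np.random.default_rng(0)   # fixed seed, arbitrary toy data
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)
b = 0.1

# Central finite differences, one weight at a time
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(w.size):
    step = np.zeros_like(w)
    step[j] = eps
    numeric[j] = (loss(X, y, w + step, b) - loss(X, y, w - step, b)) / (2 * eps)

analytic = analytic_grad_w(X, y, w, b)
max_diff = np.max(np.abs(numeric - analytic))
print("max abs difference:", max_diff)
```

A difference close to floating-point noise indicates that the sign and the form of the analytic gradient agree with the loss it was derived from.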

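Summary of key properties

Property	Mathematical expression	Interpretation
Gradient formula	$\frac{\partial L}{\partial w} = \frac{1}{N}\sum (h-y)x$	The prediction error drives the parameter update
Sigmoid derivative	$\frac{d\sigma(z)}{dz} = \sigma(z)(1-\sigma(z))$	Cancels the denominators of the cross-entropy derivative, leaving the simple $(h-y)x$ form
Probability	$\frac{1}{1+e^{-(w^Tx+b)}}$	Maps any real-valued input into the (0, 1) probability range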

In summary, the gradient is:
$\text{gradient}=\frac{1}{N}\sum_{i=1}^N(h(x^{(i)})-y^{(i)})\cdot x^{(i)}$
Adding the learning rate $\alpha$ gives the update rule:
$w=w-\alpha\cdot\frac{\partial}{\partial w}L(w)$

$w=w-\frac{\alpha}{N}\sum_{i=1}^N(h(x^{(i)})-y^{(i)})\cdot x^{(i)}$
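The update rule can be tried on a tiny problem before moving to real data. The following sketch runs it on a made-up "hours studied versus exam passed" data set with one feature; the data, the learning rate 0.1, and the 1000 iterations are assumptions chosen only so the example runs quickly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (made up): hours studied -> passed the exam (1) or not (0)
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1], dtype=float)

w, b = 0.0, 0.0   # initial parameters
alpha = 0.1       # learning rate (assumed value)
losses = []

for _ in range(1000):
    h = sigmoid(w * x + b)
    losses.append(-np.mean(y * np.log(h) + (1 - y) * np.log(1 - h)))
    grad_w = np.mean((h - y) * x)   # dL/dw
    grad_b = np.mean(h - y)         # dL/db
    w -= alpha * grad_w             # step against the gradient
    b -= alpha * grad_b

print("first loss:", losses[0], "last loss:", losses[-1])
print("learned w, b:", w, b)
```

The first loss equals log(2), the value for a model that predicts 0.5 everywhere, and the learned weight comes out positive because more hours go with passing in this toy data.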

5. Solving a Binary Classification Problem with Logistic Regression

5.1 Data preparation

5.1.1 Reading the data

import numpy as np # NumPy for numerical computation
import pandas as pd # Pandas for data handling
df_heart = pd.read_csv("/kaggle/input/logistic-regression/heart.csv")  # read the file
df_heart.head() # show the first 5 rows

df_heart.target.value_counts() # list the class values and the number of samples in each

import matplotlib.pyplot as plt # plotting library
# Scatter plot of the classes, using age and maximum heart rate as inputs
plt.scatter(x=df_heart.age[df_heart.target==1],y=df_heart.thalach[(df_heart.target==1)], c="red")
plt.scatter(x=df_heart.age[df_heart.target==0],y=df_heart.thalach[(df_heart.target==0)], marker='^')
plt.legend(["Disease", "No Disease"]) # show the legend
plt.xlabel("Age") # X axis: age
plt.ylabel("Heart Rate") # Y axis: maximum heart rate
plt.show() # display the scatter plot

5.1.2 Building the feature set and the label set

X = df_heart.drop(['target'], axis = 1) # feature set
y = df_heart.target.values # label set
y = y.reshape(-1,1) # -1 lets NumPy infer that dimension, here len(y)
print("Shape of tensor X:", X.shape)
print("Shape of tensor y:", y.shape)

5.1.3 Splitting the data set

Prepare the training and test sets with an 80%/20% split:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2)
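The call above draws a different random split on every run, so accuracy figures will vary. A minimal sketch of a repeatable, class-balanced split is shown below on synthetic stand-in data (not the heart data set); `random_state` and `stratify` are existing parameters of `train_test_split`, while the seed 42 and the 60/40 class mix are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and labels (not the heart data)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,   # fixed seed so the split is repeatable
    stratify=y_demo,   # keep the 60/40 class ratio in both parts
)

print(X_tr.shape, X_te.shape)      # (80, 2) (20, 2)
print(y_tr.mean(), y_te.mean())    # share of class 1 in each part
```

Fixing the seed makes it easier to compare two models fairly, because both see exactly the same training and test rows.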

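5.1.4 Feature scaling

from sklearn.preprocessing import MinMaxScaler # import the scaler
scaler = MinMaxScaler() # choose the min-max normalization scaler
X_train = scaler.fit_transform(X_train) # training set: fit the scaler, then transform
X_test = scaler.transform(X_test) # test set: transform only, reusing the training-set fit

For this particular data set, scaling the features with MinMaxScaler not only fails to improve efficiency, it also appears to lower prediction accuracy.

The takeaway: no theory is right in every case, and results in practice are the only real test.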
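The scaling code in this section calls `fit_transform` on the training set but only `transform` on the test set. The sketch below, on made-up numbers, illustrates what that implies: the test data is scaled with the minimum and maximum learned from the training data, so test values can fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single feature: training values span 30..70, the test set has a larger value
train = np.array([[30.0], [50.0], [70.0]])
test = np.array([[40.0], [80.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=30 and max=70
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())  # [0.  0.5 1. ]
print(test_scaled.ravel())   # [0.25 1.25]
```

Fitting the scaler on the test set as well would leak information about the test data into preprocessing, which is why only the training set is used for the fit.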

5.2 Building the logistic regression model

5.2.1 The logistic function

# First define a Sigmoid function that takes z and returns y'
def sigmoid(z):
    y_hat = 1/(1+ np.exp(-z))
    return y_hat

5.2.2 The loss function

# Then define the loss function
def loss_function(X,y,w,b):
    y_hat = sigmoid(np.dot(X,w) + b) # Sigmoid applied to the linear function (wX+b) gives y'
    loss = -(y*np.log(y_hat) + (1-y)*np.log(1-y_hat)) # per-sample loss
    cost = np.sum(loss) / X.shape[0]  # average loss over the whole data set
    return cost # return the average loss

$L(w,b)=-\frac{1}{N}\sum_{(x,y)\in D}\left[y\cdot\log(h(x))+(1-y)\cdot\log(1-h(x))\right]$

5.2.3 Implementing gradient descent

# Then build the gradient descent function
def gradient_descent(X,y,w,b,lr,iter): # gradient descent for logistic regression
    l_history = np.zeros(iter) # loss recorded at each iteration
    w_history = np.zeros((iter,w.shape[0],w.shape[1])) # weights recorded at each iteration
    b_history = np.zeros(iter) # bias recorded at each iteration
    for i in range(iter): # training iterations
        y_hat = sigmoid(np.dot(X,w) + b) # Sigmoid applied to the linear function (wX+b) gives y'
        derivative_w = np.dot(X.T,((y_hat-y)))/X.shape[0]  # gradient with respect to the weights
        derivative_b = np.sum(y_hat-y)/X.shape[0] # gradient with respect to the bias
        w = w - lr * derivative_w # update the weights; lr is the learning rate alpha
        b = b - lr * derivative_b   # update the bias; lr is the learning rate alpha
        l_history[i] = loss_function(X,y,w,b) # loss after this update
        print("Iteration", i+1, "training loss:", l_history[i])
        w_history[i] = w # weight history; note the shapes of w_history and w
        b_history[i] = b # bias history
    return l_history, w_history, b_history

$\text{gradient}=\frac{1}{N}\sum_{i=1}^N(h(x^{(i)})-y^{(i)})\cdot x^{(i)}$

$w=w-\alpha\cdot\frac{\partial}{\partial w}L(w)$

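5.2.4 Implementing class prediction

Define a function that performs the class prediction:

def predict(X,w,b): # prediction function
    z = np.dot(X,w) + b # linear function
    y_hat = sigmoid(z) # logistic transformation
    y_pred = np.zeros((y_hat.shape[0],1)) # initialize the prediction array
    for i in range(y_hat.shape[0]):
        if y_hat[i,0] < 0.5:
            y_pred[i,0] = 0 # probability below 0.5: output class 0
        else:
            y_pred[i,0] = 1 # probability of 0.5 or above: output class 1
    return y_pred # return the predicted classes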
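The element-by-element loop in `predict` can also be written as a single array comparison. The sketch below places a loop version next to a vectorized one and checks that they agree on random toy inputs; `predict_loop`, `predict_vectorized`, and the demo arrays are names introduced here for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_loop(X, w, b):
    """Same logic as the loop-based prediction function."""
    y_hat = sigmoid(np.dot(X, w) + b)
    y_pred = np.zeros((y_hat.shape[0], 1))
    for i in range(y_hat.shape[0]):
        y_pred[i, 0] = 0 if y_hat[i, 0] < 0.5 else 1
    return y_pred

def predict_vectorized(X, w, b, threshold=0.5):
    """Threshold all probabilities at once with a boolean comparison."""
    y_hat = sigmoid(np.dot(X, w) + b)
    return (y_hat >= threshold).astype(float)

rng = np.random.default_rng(1)        # arbitrary toy inputs
X_demo = rng.normal(size=(50, 4))
w_demo = rng.normal(size=(4, 1))
b_demo = 0.2

same = np.array_equal(predict_loop(X_demo, w_demo, b_demo),
                      predict_vectorized(X_demo, w_demo, b_demo))
print("identical predictions:", same)
```

The vectorized form is shorter and also makes the threshold an explicit parameter instead of a hard-coded 0.5.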

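5.3 Training the model

def logistic_regression(X,y,w,b,lr,iter): # logistic regression model
    l_history,w_history,b_history = gradient_descent(X,y,w,b,lr,iter) # gradient descent
    print("Final training loss:", l_history[-1]) # print the final loss
    y_pred = predict(X,w_history[-1],b_history[-1]) # predict on the training data
    training_acc = 100 - np.mean(np.abs(y_pred - y))*100 # accuracy, using the y passed in rather than the global y_train
    print("Logistic regression training accuracy: {:.2f}%".format(training_acc))  # print the accuracy
    return l_history, w_history, b_history # return the training history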

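# Initialize the parameters
dimension = X.shape[1] # len(X) is the number of rows; the dimension here is the number of columns
weight = np.full((dimension,1),0.1) # weight vector; vectors are usually 1D, but this creates a 2D tensor
bias = 0 # bias
# Initialize the hyperparameters
alpha = 1 # learning rate
iterations = 500 # number of iterations

# Train with the logistic regression function
loss_history, weight_history, bias_history = \
    logistic_regression(X_train,y_train,weight,bias,alpha,iterations)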

y_pred = predict(X_test,weight_history[-1],bias_history[-1]) # predict on the test set
testing_acc = 100 - np.mean(np.abs(y_pred - y_test))*100 # compute the accuracy
print("Logistic regression test accuracy: {:.2f}%".format(testing_acc))

Logistic regression test accuracy: 85.25%
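The accuracy formula used here, 100 minus the mean absolute difference times 100, works because both labels and predictions are 0 or 1, so each absolute difference is 1 for a miss and 0 for a hit. The sketch below checks this against a direct count of matches on made-up labels.

```python
import numpy as np

y_true = np.array([[1], [0], [1], [1], [0]])   # made-up labels
y_hat = np.array([[1], [0], [0], [1], [1]])    # made-up predictions

acc_abs = 100 - np.mean(np.abs(y_hat - y_true)) * 100   # formula used in this article
acc_eq = np.mean(y_hat == y_true) * 100                 # share of exact matches

print(acc_abs, acc_eq)   # 3 of 5 correct, so both are 60%
```

Both arrays need the same shape for this to hold: subtracting an array of shape (n,) from one of shape (n, 1) broadcasts to (n, n) in NumPy and silently gives a meaningless number, which is one reason the labels are reshaped to a column earlier on.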

5.4 Inspecting the predicted classes

print("Logistic regression predicted classes:",predict(X_test,weight_history[-1],bias_history[-1]))

5.5 Plotting the loss curves

loss_history_test = np.zeros(iterations) # initialize the test loss history
for i in range(iterations): # test-set loss for the parameters recorded at each iteration
    loss_history_test[i] = loss_function(X_test,y_test,weight_history[i],bias_history[i])
index = np.arange(0,iterations,1)
plt.plot(index, loss_history,c='blue',linestyle='solid')
plt.plot(index, loss_history_test,c='red',linestyle='dashed')
plt.legend(["Training Loss", "Test Loss"])
plt.xlabel("Number of Iteration")
plt.ylabel("Cost")
plt.show() # show the training and test loss curves together

After 80 to 100 iterations the training loss keeps falling, but the test loss no longer follows it and instead starts to rise.

This is a clear sign of overfitting, so training should stop at around 80 to 100 iterations.
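Rather than reading the stopping point off the plot, it can be located with `np.argmin`. The sketch below uses two synthetic curves as stand-ins for `loss_history` and `loss_history_test`; their shapes are invented to mimic the pattern described above and are not measured results. In practice this choice is usually made on a separate validation set so that the test set stays untouched.

```python
import numpy as np

# Made-up loss curves standing in for the recorded training and test losses
iters = np.arange(500)
train_loss = 0.7 * np.exp(-iters / 100) + 0.25          # keeps falling
test_loss = 0.45 + 0.25 * ((iters - 90) / 410) ** 2     # U-shaped, lowest at index 90

best_iter = int(np.argmin(test_loss))   # index of the lowest held-out loss
print("lowest held-out loss at index:", best_iter)
print("held-out loss there:", test_loss[best_iter])
print("training loss still falling afterwards:", train_loss[-1] < train_loss[best_iter])
```

With the weight history kept by the gradient descent function, the parameters at that index can then be used for prediction instead of the final ones.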

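6. Production-Style Implementation

In real projects few people write the code this way; most simply call library functions to get the job done.

from sklearn.linear_model import LogisticRegression # import the logistic regression model
lr = LogisticRegression() # lr is the logistic regression model
lr.fit(X_train,y_train.ravel()) # fit does the training that gradient descent did above; ravel() passes the labels as 1D
print("SK-learn logistic regression test accuracy {:.2f}%".format(lr.score(X_test,y_test)*100))

SK-learn logistic regression test accuracy 85.25%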

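The figure shown here matches the hand-written result above, and both vary from run to run because the train/test split is random. One way to try to improve on it is to encode the categorical features differently, as described next.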

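6.1 Dummy features

The cp field means "chest pain type" and takes the values 0, 1, 2 and 3. These are category labels whose magnitudes carry no meaning.

The problem is that the computer treats them as numbers: it sees the gap between 1 and 3 as larger than the gap between 1 and 2, instead of treating the categories as equals.

The solution is to split such a categorical feature into several dummy features. Since cp has the 4 categories 0, 1, 2 and 3, it is split into 4 features: cp_0, cp_1, cp_2 and cp_3. Each one is a binary feature whose answer is Yes or No, that is, the value 1 or 0.

# Convert 3 categorical variables into dummy variables
a = pd.get_dummies(df_heart['cp'], prefix = "cp")
b = pd.get_dummies(df_heart['thal'], prefix = "thal")
c = pd.get_dummies(df_heart['slope'], prefix = "slope")
# Add the dummy variables to the dataframe
frames = [df_heart, a, b, c]
df_heart = pd.concat(frames, axis = 1)
df_heart = df_heart.drop(columns = ['cp', 'thal', 'slope'])
df_heart.head() # show the new dataframe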
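To see what the encoding produces without loading the heart data, the sketch below applies `pd.get_dummies` to a four-row made-up frame. The frame `df_demo` and its values are invented for illustration; `dtype=int` is passed only so the output prints as 0/1 rather than booleans.

```python
import pandas as pd

# Tiny made-up frame with one categorical column coded 0..3
df_demo = pd.DataFrame({"age": [63, 37, 41, 56],
                        "cp": [3, 2, 1, 0]})

dummies = pd.get_dummies(df_demo["cp"], prefix="cp", dtype=int)
df_encoded = pd.concat([df_demo.drop(columns=["cp"]), dummies], axis=1)

print(df_encoded)
print(list(df_encoded.columns))  # ['age', 'cp_0', 'cp_1', 'cp_2', 'cp_3']
```

Each row has exactly one of the four cp columns set to 1, so the categories are now treated as parallel options rather than points on a numeric scale.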

6.2 Binary classification in practice: credit card fraud detection

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Read a real-world data set
df = pd.read_csv('creditcard.csv')
X = df.drop(['Class','Time'], axis=1)
y = df['Class']

# Stratified splitting keeps the class distribution in every fold
skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
    # Each pass overwrites the previous one, so the last fold is used below
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

# Model with class weights
model = LogisticRegression(class_weight={0:1, 1:10}, penalty='l1', solver='saga')
model.fit(X_train, y_train)

# Rank the features by coefficient
importance = pd.DataFrame({'feature':X.columns, 'coef':model.coef_[0]})
print(importance.sort_values('coef', ascending=False))
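The code above keeps only the last fold of the split. A sketch of how every fold could be used for evaluation follows, on synthetic imbalanced data from `make_classification` rather than the credit card file. Several choices here are assumptions that differ from the code above: the default L2 penalty and solver replace L1 with saga, a `StandardScaler` is added inside a pipeline, and recall is picked as the metric because accuracy says little when one class is rare.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 5% positives), NOT the credit card data
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
recalls = []
for train_idx, test_idx in skf.split(X_syn, y_syn):
    # Fit scaler and model inside the fold so nothing leaks from the held-out part
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000),
    )
    model.fit(X_syn[train_idx], y_syn[train_idx])
    y_hat = model.predict(X_syn[test_idx])
    recalls.append(recall_score(y_syn[test_idx], y_hat))

print("recall per fold:", np.round(recalls, 3))
print("mean recall:", np.mean(recalls))
```

Averaging over folds gives a steadier estimate than a single split, and the spread between folds hints at how much the result depends on which rows happen to be held out.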

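6.3 Multiclass classification in practice: handwritten digit recognition (MNIST)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Load the MNIST data set as NumPy arrays so that X[i] selects a row
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist["data"], mnist["target"].astype(np.uint8)

# Normalize the pixel values
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X[:10000]) # use a subset to speed up training

# One-vs-rest (OvR) strategy; recent scikit-learn releases deprecate multi_class
model = LogisticRegression(multi_class='ovr', penalty='l2', max_iter=1000)
model.fit(X_scaled, y[:10000])

# Show sample predictions
plt.figure(figsize=(12,6))
for i in range(10):
    plt.subplot(2,5,i+1)
    plt.imshow(X[i].reshape(28,28), cmap='gray')
    plt.title(f'Pred: {model.predict([X_scaled[i]])[0]}')
plt.tight_layout()
plt.show()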
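For a version of the one-vs-rest idea that runs without downloading MNIST, the sketch below uses the small 8x8 digits data set bundled with scikit-learn and wraps the model in `OneVsRestClassifier`. This is a swapped-in illustration under stated assumptions, not the code above: the data set, the split, and the wrapper class are all choices made here.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MinMaxScaler

digits = load_digits()   # small bundled data set: 8x8 images, 10 classes
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target,
    test_size=0.2, random_state=0, stratify=digits.target,
)

# Scale pixels to [0, 1] using statistics from the training part only
scaler = MinMaxScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

# One binary logistic regression per digit; the highest score wins
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_tr, y_tr)

n_classifiers = len(ovr.estimators_)
test_accuracy = ovr.score(X_te, y_te)
print("number of binary classifiers:", n_classifiers)
print("test accuracy:", test_accuracy)
```

The wrapper trains ten separate "this digit versus all others" models, which is the same strategy the `multi_class='ovr'` argument selects.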