[Big Data] Machine Learning: Linear Models

I. Basic Form of the Linear Model

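A linear model predicts its output from a linear combination of the input features. Its general form is:

$$\hat{y} = f(\mathbf{x}) = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b = \mathbf{w}^T\mathbf{x} + b$$

where:

$\mathbf{x}=(x_1,x_2,\cdots,x_d)$ is the input feature vector, containing $d$ features.
$\mathbf{w}=(w_1,w_2,\cdots,w_d)$ is the weight vector; each element $w_i$ indicates how strongly the corresponding feature influences the prediction.
$w_0 = b$ is the bias (intercept) term, which lets the model output a nonzero prediction even when every input feature is zero.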
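The linear combination described above can be sketched in a few lines of NumPy. The function name `linear_predict` and the numeric values are hypothetical, chosen only for illustration.

```python
import numpy as np

def linear_predict(X, w, b):
    """Return w^T x + b for every row x of X."""
    return X @ w + b

# Hypothetical example values (not from the article)
X = np.array([[1.0, 2.0],
              [0.0, 0.0]])   # two samples, d = 2 features
w = np.array([0.5, -1.0])    # weight vector
b = 3.0                      # bias term

y_hat = linear_predict(X, w, b)
print(y_hat)  # the all-zero second sample is predicted as exactly b
```

The second sample shows the role of the bias: with all features at zero, the prediction reduces to $b$.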

II. Linear Regression

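Linear regression predicts continuous values. The goal is to find the $\mathbf{w}$ and $b$ that minimize the mean squared error (MSE) between the predictions $\hat{y}$ and the true values $y$. Given a dataset of $m$ samples $\{(\mathbf{x}_1,y_1),(\mathbf{x}_2,y_2),\cdots,(\mathbf{x}_m,y_m)\}$, the MSE is:

$$J(\mathbf{w}, b) = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2, \qquad \hat{y}_i = \mathbf{w}^T\mathbf{x}_i + b$$

This objective is commonly optimized with gradient descent, using the following update rules.

For each weight $w_j$ ($j = 1,2,\cdots,d$):
$$w_j \leftarrow w_j - \alpha\,\frac{\partial J}{\partial w_j} = w_j - \frac{2\alpha}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)\,x_{ij}$$

For the bias term $b$:
$$b \leftarrow b - \alpha\,\frac{\partial J}{\partial b} = b - \frac{2\alpha}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)$$

where $x_{ij}$ is the $j$-th feature of sample $i$ and $\alpha$ is the learning rate, which controls the step size of each update.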
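As a minimal sketch of these updates, the loop below runs batch gradient descent on synthetic data. It assumes the MSE is averaged with a factor of $1/m$, so each gradient carries a factor of $2/m$; the function name `fit_linear_gd`, the learning rate, and the iteration count are illustrative choices rather than recommended settings.

```python
import numpy as np

def fit_linear_gd(X, y, alpha=0.1, n_iters=2000):
    """Fit w and b by batch gradient descent on the mean squared error."""
    m, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        residual = X @ w + b - y               # y_hat - y for every sample
        grad_w = (2.0 / m) * (X.T @ residual)  # dJ/dw_j for all j at once
        grad_b = (2.0 / m) * residual.sum()    # dJ/db
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

# Synthetic data: y = 4 + 3x plus a little Gaussian noise (assumed for the demo)
rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + 0.1 * rng.standard_normal(100)

w, b = fit_linear_gd(X, y)
print(f"w = {w}, b = {b}")
```

With a learning rate that is too large the iterates diverge, and with one that is too small convergence is slow, so in practice $\alpha$ is tuned to the scale of the features.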

III. Log-Odds Regression (Logistic Regression)

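Logistic regression handles binary classification by passing the output of the linear function through the logistic (sigmoid) function to obtain a probability. The logistic function is defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \mathbf{w}^T\mathbf{x} + b$$

The goal is to maximize the likelihood, which is equivalent to minimizing the negative log-likelihood loss. For labels $y_i \in \{0, 1\}$ and predicted probabilities $\hat{y}_i = \sigma(\mathbf{w}^T\mathbf{x}_i + b)$:

$$J(\mathbf{w}, b) = -\frac{1}{m}\sum_{i=1}^{m}\left[\,y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\,\right]$$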
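The sigmoid and the negative log-likelihood loss can be sketched directly in NumPy. The helper names `sigmoid` and `log_loss`, the clipping constant `eps`, and the sample values are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    """Mean negative log-likelihood for binary labels y in {0, 1}."""
    p = np.clip(p, eps, 1 - eps)  # keep log() away from 0
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical linear scores z = w^T x + b and their true labels
z = np.array([-2.0, 0.0, 2.0])
y = np.array([0, 1, 1])

p = sigmoid(z)
loss = log_loss(y, p)
print(p, loss)  # sigmoid(0) is exactly 0.5
```

Confident correct predictions (scores far from zero on the right side) contribute little to the loss, while the uncertain middle sample contributes the most.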

IV. Multiclass Learning

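For multiclass problems, the softmax function is commonly used to turn the outputs of the linear functions into a probability distribution. With $K$ classes, first compute the linear output for sample $i$ and each class $k$, $z_{ik}=\mathbf{w}_k^T\mathbf{x}_i + b_k$, then apply the softmax function:
$$p_{ik} = P(y_i = k \mid \mathbf{x}_i) = \frac{e^{z_{ik}}}{\sum_{j=1}^{K} e^{z_{ij}}}$$

The corresponding cross-entropy loss is:

$$J = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_{ik}\log p_{ik}$$

where $y_{ik}$ is the $k$-th entry of the one-hot encoded label of sample $i$: $y_{ik}=1$ if sample $i$ belongs to class $k$, and $y_{ik}=0$ otherwise.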
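A small NumPy sketch of softmax and the cross-entropy loss follows. The function names, the `eps` clipping constant, and the score matrix `Z` are illustrative assumptions; subtracting the row maximum before exponentiating is a common numerical-stability trick that leaves the result unchanged.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax: each row of scores becomes a probability distribution."""
    Z = Z - Z.max(axis=1, keepdims=True)  # stability shift, does not change the output
    exp_Z = np.exp(Z)
    return exp_Z / exp_Z.sum(axis=1, keepdims=True)

def cross_entropy(Y, P, eps=1e-12):
    """Mean cross-entropy between one-hot labels Y and predicted probabilities P."""
    return -np.mean(np.sum(Y * np.log(np.clip(P, eps, 1.0)), axis=1))

# Hypothetical scores z_ik for m = 2 samples and K = 3 classes
Z = np.array([[2.0, 1.0, 0.1],
              [0.0, 0.0, 0.0]])
# One-hot labels: sample 0 is class 0, sample 1 is class 2
Y = np.array([[1, 0, 0],
              [0, 0, 1]])

P = softmax(Z)
loss = cross_entropy(Y, P)
print(P, loss)
```

Because the labels are one-hot, only the log-probability of the true class contributes to each sample's loss.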

V. The Class Imbalance Problem

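Class imbalance arises when the number of samples differs greatly between classes, which can bias the model toward the majority class. Common remedies include:

1. Resampling:

Oversampling: duplicate minority-class samples (sampling with replacement) to increase their number.
Undersampling: discard majority-class samples to reduce their number.

2. Cost-sensitive learning:

Assign different weights to the classes in the loss function so that misclassifying a minority-class sample costs more.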
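Cost-sensitive learning can be sketched with scikit-learn's `class_weight` option, which reweights each class's contribution to the loss. The dataset settings and variable names below are assumptions for illustration; `class_weight="balanced"` sets each class weight to `n_samples / (n_classes * class_count)`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# The weights that "balanced" implies: the rarer class gets the larger weight
class_weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_train)
print(f"Class weights: {class_weights}")

# Unweighted baseline versus cost-sensitive model
plain = LogisticRegression().fit(X_train, y_train)
weighted = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print(f"Minority recall, unweighted: {recall_plain}")
print(f"Minority recall, balanced:   {recall_weighted}")
```

Unlike resampling, this leaves the training set untouched and changes only how errors are penalized, which typically trades some precision on the minority class for higher recall.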

Code examples:

Linear regression example:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Generate linear regression data: y = 4 + 3x + Gaussian noise
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Initialize the linear regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Print the model parameters
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")

# Compute the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# Visualize the results
plt.scatter(X_test, y_test)
plt.plot(X_test, y_pred, color='red')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression')
plt.show()


Logistic regression example:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Generate binary classification data: label is 1 when x1 + x2 > 0
np.random.seed(42)
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Initialize the logistic regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Compute the accuracy
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc}")

# Visualize the decision boundary on a grid with step size h
h = 0.02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Logistic Regression')
plt.show()


Multiclass logistic regression example:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Generate multiclass data with three classes.
# make_classification requires n_classes * n_clusters_per_class <= 2 ** n_informative,
# so with two informative features n_clusters_per_class is set to 1.
x, y = make_classification(n_samples=200, n_features=2, n_informative=2, n_redundant=0,
                           n_classes=3, n_clusters_per_class=1, random_state=42)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Initialize the multiclass logistic regression model.
# With the lbfgs solver and more than two classes, recent scikit-learn versions fit a
# multinomial (softmax) model by default; the multi_class argument is deprecated.
model = LogisticRegression(solver='lbfgs')

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Compute the accuracy
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc}")

# Visualize the decision regions on a grid with step size h
h = 0.02
x_min, x_max = x[:, 0].min() - 1, x[:, 0].max() + 1
y_min, y_max = x[:, 1].min() - 1, x[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
plt.scatter(x[:, 0], x[:, 1], c=y, edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Multiclass Logistic Regression')
plt.show()


Class imbalance example (oversampling):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.utils import resample

# Generate imbalanced data: roughly 90% class 0 and 10% class 1
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, weights=[0.9, 0.1], random_state=42)

# Split into training and test sets (fixed seed so the comparison is reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline model trained on the original, imbalanced training set
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Original Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Original F1-score: {f1_score(y_test, y_pred)}")

# Oversample the minority class (with replacement) up to the majority-class count.
# Only the training set is resampled; the test set keeps its original distribution.
X_minority = X_train[y_train == 1]
y_minority = y_train[y_train == 1]
X_minority_upsampled, y_minority_upsampled = resample(X_minority, y_minority, replace=True, n_samples=X_train[y_train == 0].shape[0], random_state=42)
X_train_upsampled = np.vstack((X_train[y_train == 0], X_minority_upsampled))
y_train_upsampled = np.hstack((y_train[y_train == 0], y_minority_upsampled))

# Model trained on the oversampled training set
model_upsampled = LogisticRegression()
model_upsampled.fit(X_train_upsampled, y_train_upsampled)
y_pred_upsampled = model_upsampled.predict(X_test)
print(f"Upsampled Accuracy: {accuracy_score(y_test, y_pred_upsampled)}")
print(f"Upsampled F1-score: {f1_score(y_test, y_pred_upsampled)}")