Machine Learning 1: Linear Regression

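The most commonly used linear regression models are:

Simple linear regression
Multiple linear regression
Polynomial regression
Ridge regression
Lasso regression
Elastic net regression
Stepwise regression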

I. Simple linear regression with one variable

1. Import the required libraries

# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

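2. Set display options

# Set display options
plt.rcParams['font.sans-serif'] = ['SimHei']   # font that can render the Chinese plot labels
plt.rcParams['axes.unicode_minus'] = False      # render the minus sign correctly with this font
pd.set_option('display.max_rows', None)      # no limit on displayed rows
pd.set_option('display.max_columns', None)   # no limit on displayed columns
pd.set_option('display.max_colwidth', None)  # no limit on column width
pd.set_option('display.width', None)         # detect the display width automatically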

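3. Load the data

data = pd.read_excel("汽车制造行业收入表.xlsx")   # read_excel already returns a DataFrame
x = data[["工龄"]]   # feature: years of service, kept 2-D as a DataFrame
y = data[["薪水"]]   # target: salary

4. Split into training and test sets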

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)

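Variables:

x_train: feature data of the training set, used to fit the model.

x_test: feature data of the test set, used to evaluate the model.

y_train: target values of the training set, aligned with x_train and used to fit the model.

y_test: target values of the test set, aligned with x_test and used to evaluate the model.

train_test_split parameters:

x: the input features. Accepts a NumPy array, a pandas DataFrame, or another indexable sequence such as a list.

y: the target variable, aligned with x; the values to predict.

test_size=0.2: the fraction of the data assigned to the test set. 0.2 puts 20% of the rows in the test set and the remaining 80% in the training set.

random_state=42: the random seed that makes the split reproducible. With a fixed integer (such as 42), every run produces the same split, which helps with debugging and reproducing results.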
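To make the effect of test_size and random_state concrete, here is a minimal sketch on made-up data (X_demo and y_demo are illustrative names, not part of the salary dataset). It splits the same ten rows twice with the same seed and checks that both calls select the same test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, one feature
X_demo = np.arange(10).reshape(-1, 1)
y_demo = 2 * np.arange(10)

# Two calls with the same seed produce the same split
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(len(a_train), len(a_test))        # 8 training rows, 2 test rows
print(np.array_equal(a_test, b_test))   # True: same seed, same test rows
```

Omitting random_state generally gives a different split on each run, so reported scores fluctuate.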

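5. Data preprocessing

Depending on the data, standardize, normalize, or binarize the features. This single-feature example skips the step.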
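As a rough illustration of the three options, the sketch below applies scikit-learn's StandardScaler, MinMaxScaler, and Binarizer to a made-up column of values. The array and the threshold of 3.0 are assumptions chosen only for the demonstration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, MinMaxScaler, StandardScaler

# Hypothetical single-feature data
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

standardized = StandardScaler().fit_transform(values)        # zero mean, unit variance
normalized = MinMaxScaler().fit_transform(values)            # rescaled to the range [0, 1]
binarized = Binarizer(threshold=3.0).fit_transform(values)   # 1 where value > 3.0, else 0

print(standardized.ravel())
print(normalized.ravel())   # [0.   0.25 0.5  0.75 1.  ]
print(binarized.ravel())    # [0. 0. 0. 1. 1.]
```

In a real workflow the scaler is fitted on the training set only and then applied to the test set, as the later sections do.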

6. Build the model

6.1 Create the linear regression model: y = a*x + b

model = LinearRegression()

6.2 Train the model

model.fit(x_train, y_train)

6.3 Print the regression coefficients

print("线性回归系数a:",model.coef_)
print("线性回归截距b:",model.intercept_)

线性回归系数a: [[1114.15442257]]
线性回归截距b: [7680.390625]
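With the coefficient and intercept printed above, a prediction can be reproduced by hand. The sketch below plugs those printed values into y = a*x + b; the input of 5 years of service is an arbitrary illustrative value, not a row from the dataset.

```python
# Coefficient and intercept copied from the printed output above
a = 1114.15442257
b = 7680.390625

years = 5                           # hypothetical input, chosen only for illustration
predicted_salary = a * years + b    # same formula the fitted model applies
print(round(predicted_salary, 2))   # 13251.16
```

Read this way, the coefficient means that each additional year of service adds roughly 1114 to the predicted salary.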

6.4 Predict

y_test_pred = model.predict(x_test)

6.5 Evaluate

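mse = mean_squared_error(y_test, y_test_pred)
r2 = r2_score(y_test, y_test_pred)
rmse = np.sqrt(mse)
# Convert y_test to a NumPy array so the mean is a single number on any pandas version
mae = float(np.mean(np.abs(y_test.to_numpy() - y_test_pred)))
# Adjusted R²: n = number of test samples, p = number of features
n = len(y_test)
p = x_train.shape[1]
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
cv_scores = cross_val_score(model, x, y, cv=5, scoring='r2')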

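1. mse = mean_squared_error(y_test, y_test_pred)

Purpose: compute the mean squared error (MSE) on the test set.
y_test: the true target values of the test set.
y_test_pred: the model's predictions on the test set.

The smaller the MSE, the closer the predictions are to the actual values.

2. r2 = r2_score(y_test, y_test_pred)

Purpose: compute the R-squared (R2) score on the test set.
y_test: the true target values of the test set.
y_test_pred: the model's predictions on the test set.
The closer R2 is to 1, the better the model fits.

3. rmse = np.sqrt(mse)

Purpose: take the square root of the MSE.
RMSE is expressed in the same units as the target, which makes it easier to interpret. As with MSE, a smaller RMSE means a better model.

4. mae: the mean of np.abs(y_test - y_test_pred)

Purpose: compute the mean absolute error (MAE) between predictions and actual values on the test set.
MAE is the average absolute difference between predictions and actual values; the smaller it is, the closer the predictions are. Because the errors are not squared, MAE is less sensitive to outliers than MSE.

5. adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

Purpose: compute the adjusted R2, where n is the number of samples and p is the number of features.
Adjusted R2 is more reliable when there are many features because it penalizes unnecessary complexity.

6. cv_scores = cross_val_score(model, x, y, cv=5, scoring='r2')

Purpose: evaluate the model's R2 with 5-fold cross-validation.
model: the estimator to evaluate, for example LinearRegression().
x: the feature data.

y: the target variable.

cv=5: use 5-fold cross-validation. The data is split into 5 subsets and each subset serves once as the validation fold.

scoring='r2': use R2 as the evaluation metric.

Output: cv_scores is an array holding the R2 score of each fold.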
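The call can be tried without the Excel file. The following sketch runs cross_val_score on synthetic data generated from an assumed line y = 3x + 5 plus noise; the data and variable names are illustrative only. It also shows that error metrics are requested through names such as 'neg_mean_squared_error', which scikit-learn reports as negative numbers so that larger is always better.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: a noisy straight line
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 3.0 * X_demo[:, 0] + 5.0 + rng.normal(0, 1.0, size=100)

model_demo = LinearRegression()
r2_scores = cross_val_score(model_demo, X_demo, y_demo, cv=5, scoring='r2')
neg_mse_scores = cross_val_score(model_demo, X_demo, y_demo, cv=5,
                                 scoring='neg_mean_squared_error')

print(r2_scores.shape)          # (5,): one score per fold
print(r2_scores.mean())         # average R² across folds
print(-neg_mse_scores.mean())   # flip the sign to recover the average MSE
```

A large spread between the fold scores suggests the model's performance depends heavily on which rows it was trained on.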

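The main metrics that can be used as the cross-validation score are:

1. R-squared (R2)

Range: (−∞, 1]

R2 = 1: the model fits the data perfectly.

R2 = 0: the model explains none of the variance; it does no better than always predicting the mean.

R2 < 0: the model does worse than always predicting the mean.

2. Mean squared error (MSE)

Range: [0, ∞)

The smaller the MSE, the closer the predictions are to the actual values.

An MSE of 0 means every prediction is exact.

3. Root mean squared error (RMSE)

Range: [0, ∞)

RMSE is the square root of MSE and is expressed in the same units as the target.

The smaller the RMSE, the closer the predictions are to the actual values.

4. Mean absolute error (MAE)

Range: [0, ∞)

The smaller the MAE, the closer the predictions are to the actual values.

An MAE of 0 means every prediction is exact.

5. Classification accuracy

Range: [0, 1]

An accuracy of 1 means every prediction is correct.

An accuracy of 0 means every prediction is wrong.

6. F1 score

Range: [0, 1]

An F1 score of 1 means perfect precision and recall.

An F1 score of 0 means the model did not correctly predict a single positive sample.

For regression problems the usual metrics are R2, MSE, RMSE, and MAE; for classification problems they are accuracy and the F1 score.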
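The ranges above are easier to remember with numbers attached. The sketch below evaluates each metric on small made-up arrays (not the salary data) with scikit-learn's metric functions, including one deliberately bad prediction that yields a negative R2.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, mean_absolute_error,
                             mean_squared_error, r2_score)

# Regression metrics on made-up numbers
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

mse = mean_squared_error(y_true, y_pred)      # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                           # about 1.22, same units as y
mae = mean_absolute_error(y_true, y_pred)     # (1 + 0 + 1 + 2) / 4 = 1.0
r2 = r2_score(y_true, y_pred)                 # 1 - 6 / 20 = 0.7

# Predictions worse than always predicting the mean give a negative R²
r2_bad = r2_score(y_true, y_true[::-1])       # 1 - 80 / 20 = -3.0

# Classification metrics on made-up labels
labels_true = [1, 0, 1, 1, 0, 1]
labels_pred = [1, 0, 0, 1, 1, 1]
acc = accuracy_score(labels_true, labels_pred)   # 4 correct out of 6
f1 = f1_score(labels_true, labels_pred)          # precision = recall = 3/4, so F1 = 0.75

print(mse, rmse, mae, r2, r2_bad, acc, f1)
```

Note how the single error of 2 contributes 4 to the squared-error sum but only 2 to the absolute-error sum, which is why MSE reacts more strongly to outliers than MAE.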

6.6 Visualization

plt.scatter(x_train, y_train, color='blue', label='训练数据')
plt.scatter(x_test, y_test, color='green', label='测试数据')
plt.plot(x_test, y_test_pred, color='red', linewidth=2, label='预测数据')
plt.xlabel('工龄')
plt.ylabel('薪水')
plt.title('简单的线性回归')
plt.legend()
plt.show()

6.7 Full code

# 1. Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# 2. Set display options
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
pd.set_option('display.max_rows', None)      # no limit on displayed rows
pd.set_option('display.max_columns', None)   # no limit on displayed columns
pd.set_option('display.max_colwidth', None)  # no limit on column width
pd.set_option('display.width', None)         # detect the display width automatically
# 3. Load the data
data = pd.read_excel("汽车制造行业收入表.xlsx")
x = data[["工龄"]]
y = data[["薪水"]]
# 4. Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# 5. Data preprocessing (skipped for this single-feature example)
# 6.1 Create the linear regression model
model = LinearRegression()
# 6.2 Train the model
model.fit(x_train, y_train)
# 6.3 Print the regression coefficients
print("线性回归系数a:", model.coef_)
print("线性回归截距b:", model.intercept_)
# 6.4 Predict
y_test_pred = model.predict(x_test)
# 6.5 Evaluate the model
mse = mean_squared_error(y_test, y_test_pred)
r2 = r2_score(y_test, y_test_pred)
rmse = np.sqrt(mse)
# Convert y_test to a NumPy array so the mean is a single number on any pandas version
mae = float(np.mean(np.abs(y_test.to_numpy() - y_test_pred)))
# Adjusted R²: n = number of test samples, p = number of features
n = len(y_test)
p = x_train.shape[1]
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
cv_scores = cross_val_score(model, x, y, cv=5, scoring='r2')
# Print the results
print("交叉验证评估:", cv_scores)  # per-fold R², a check on generalization and stability
print("平均交叉验证:", np.mean(cv_scores))
print("均方误差:", mse)  # mean of the squared differences between predictions and actual values
print("决定系数:", r2)
print("均方根误差 (RMSE):", rmse)
print("平均绝对误差 (MAE):", mae)
print("调整后的R^2:", adjusted_r2)
# Visualization
plt.scatter(x_train, y_train, color='blue', label='训练数据')
plt.scatter(x_test, y_test, color='green', label='测试数据')
plt.plot(x_test, y_test_pred, color='red', linewidth=2, label='预测数据')
plt.xlabel('工龄')
plt.ylabel('薪水')
plt.title('简单的线性回归')
plt.legend()
plt.show()

II. Multiple linear regression: the customer value dataset (客户价值数据表)

# 1. Import the required libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
from sklearn.preprocessing import StandardScaler
from scipy import stats
# 2. Display settings
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
pd.set_option('display.max_rows', None)      # no limit on displayed rows
pd.set_option('display.max_columns', None)   # no limit on displayed columns
pd.set_option('display.max_colwidth', None)  # no limit on column width
pd.set_option('display.width', None)         # detect the display width automatically
# 3. Load the data
data = pd.read_excel("客户价值数据表.xlsx")
# 4. Data preprocessing
# 4.1 Fill missing values with the column mean
print("缺失值统计:\n", data.isnull().sum())
data = data.apply(lambda col: col.fillna(col.mean()), axis=0)
# print(data.head())
# 4.2 Outlier handling
numeric_data = data.select_dtypes(include=[np.number])
z_scores = np.abs(stats.zscore(numeric_data))  # Z-scores for the numeric columns only
threshold = 3  # flag values more than 3 standard deviations from the mean
outliers = (z_scores > threshold).any(axis=1)  # rows with at least one outlying value
print("检测到的异常值行索引:\n", data[outliers].index.tolist())  # row indices of the outliers
print(data[outliers])
data = data[~outliers]  # drop the outlier rows
# 4.3 Split into training and test sets
x = data.drop(["客户价值"], axis=1)  # every column except the target "客户价值"
y = data["客户价值"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# 4.4 Standardize: fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
# 5. Build the model
# 5.1 Create the (multiple) linear regression model
linear = LinearRegression()
# 5.2 Train the model on the standardized data
linear.fit(x_train, y_train)
# 5.3 Print the coefficients
print("线性回归的系数a是:", linear.coef_)
print("线性回归的截距b是:", linear.intercept_)
# 5.4 Predict
y_pred = linear.predict(x_test)
# 5.5 Evaluate the model (rounded to two decimals)
print("回归得分:", round(linear.score(x_test, y_test), 2))  # R² on the test set
print("mse线性回归评估:", round(mse(y_test, y_pred), 2))
# 5.6 Visualization
plt.bar(range(len(linear.coef_)), linear.coef_)
plt.xlabel("特征")
plt.ylabel("系数")
plt.title("特征重要性")
plt.show()
plt.figure(figsize=(10, 6))
# tick_labels requires Matplotlib 3.9+; older versions use labels= instead
plt.boxplot(numeric_data.values, tick_labels=numeric_data.columns)
plt.title("箱线图检测异常值")
plt.xticks(rotation=45)
plt.show()

III. Polynomial regression

Suitable for nonlinear relationships, with either one feature or several.
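Polynomial regression is still a linear model: the features are first expanded with powers and cross terms, and an ordinary linear regression is fitted on the expanded features. The sketch below illustrates this on an assumed noise-free quadratic (synthetic data, not the customer dataset), comparing a straight-line fit with a degree-2 fit. Using make_pipeline here is a convenience choice, not something the listing below does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data following y = 1 + 2x + 0.5x²
x_demo = np.linspace(-3, 3, 50).reshape(-1, 1)
y_demo = 1.0 + 2.0 * x_demo[:, 0] + 0.5 * x_demo[:, 0] ** 2

# A straight line cannot capture the curvature
linear_fit = LinearRegression().fit(x_demo, y_demo)

# Expanding x into [x, x²] lets the same linear model fit the curve
poly_fit = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(x_demo, y_demo)

print(linear_fit.score(x_demo, y_demo))              # R² below 1
print(poly_fit.score(x_demo, y_demo))                # R² of essentially 1
print(poly_fit.named_steps['linearregression'].coef_)   # close to [2.0, 0.5]
```

Higher degrees fit the training data ever more closely, so the degree is usually chosen by validation rather than by training score.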

# 1. Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from scipy import stats
# 2. Set display options
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
pd.set_option('display.max_rows', None)      # no limit on displayed rows
pd.set_option('display.max_columns', None)   # no limit on displayed columns
pd.set_option('display.max_colwidth', None)  # no limit on column width
pd.set_option('display.width', None)         # detect the display width automatically
# 3. Load the data
data = pd.read_excel("客户价值数据表.xlsx")
# 4. Data preprocessing
# 4.1 Fill missing values with the column mean
print("缺失值统计:\n", data.isnull().sum())
data = data.apply(lambda col: col.fillna(col.mean()), axis=0)
# print(data.head())
# 4.2 Outlier handling
numeric_data = data.select_dtypes(include=[np.number])
z_scores = np.abs(stats.zscore(numeric_data))  # Z-scores for the numeric columns only
threshold = 3  # flag values more than 3 standard deviations from the mean
outliers = (z_scores > threshold).any(axis=1)  # rows with at least one outlying value
print("检测到的异常值行索引:\n", data[outliers].index.tolist())  # row indices of the outliers
print(data[outliers])
data = data[~outliers]  # drop the outlier rows
x = data.drop(["客户价值", "性别"], axis=1)  # drop the target "客户价值" and the "性别" (gender) column
y = data["客户价值"]
# 4.3 Polynomial feature expansion
degree = 2
poly = PolynomialFeatures(degree=degree)
x_poly = poly.fit_transform(x)
# 4.4 Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x_poly, y, test_size=0.2, random_state=42)
# 4.5 Standardize: fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
# 5. Build the model
# 5.1 Create the polynomial regression model (linear regression on the expanded features)
model = LinearRegression()
# 5.2 Train the model
model.fit(x_train, y_train)
# 5.3 Print the model parameters
print("模型系数（权重）:", model.coef_)
print("模型截距:", model.intercept_)
# 5.4 Predict
y_pred = model.predict(x_test)
# 5.5 Evaluate the model with MSE and R²
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("均方误差 (MSE):", mse)
print("R² 得分:", r2)

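IV. Ridge regression

1. Uses L2 regularization

Regularization is a technique for preventing a machine learning model from overfitting. Ridge regression adds a penalty on the squared size of the coefficients to the least-squares loss, which shrinks them toward zero.

Overfitting means the model performs well on the training set but poorly on the test set.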
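The shrinking effect of the L2 penalty can be observed directly. This is a small sketch on synthetic data (the true coefficients and the alpha values 10 and 1000 are arbitrary assumptions): as alpha grows, the norm of the fitted coefficient vector gets smaller.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical data: 30 samples, 5 features, known true coefficients
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(0, 0.5, size=30)

# Ordinary least squares has no penalty; Ridge penalizes alpha * sum(coef ** 2)
ols_norm = np.linalg.norm(LinearRegression().fit(X_demo, y_demo).coef_)
ridge_norm = np.linalg.norm(Ridge(alpha=10.0).fit(X_demo, y_demo).coef_)
strong_ridge_norm = np.linalg.norm(Ridge(alpha=1000.0).fit(X_demo, y_demo).coef_)

print(ols_norm, ridge_norm, strong_ridge_norm)   # decreasing from left to right
```

Too small an alpha leaves the model free to overfit, too large an alpha underfits, which is why the listing below searches over a grid of alpha values.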

# 1. Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from scipy import stats
# 2. Set display options
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
pd.set_option('display.max_rows', None)      # no limit on displayed rows
pd.set_option('display.max_columns', None)   # no limit on displayed columns
pd.set_option('display.max_colwidth', None)  # no limit on column width
pd.set_option('display.width', None)         # detect the display width automatically
# 3. Load the data
data = pd.read_excel("客户价值数据表.xlsx")
# 4. Data preprocessing
# 4.1 Fill missing values with the column mean
print("缺失值统计:\n", data.isnull().sum())
data = data.apply(lambda col: col.fillna(col.mean()), axis=0)
# print(data.head())
# 4.2 Outlier handling
numeric_data = data.select_dtypes(include=[np.number])
z_scores = np.abs(stats.zscore(numeric_data))  # Z-scores for the numeric columns only
threshold = 3  # flag values more than 3 standard deviations from the mean
outliers = (z_scores > threshold).any(axis=1)  # rows with at least one outlying value
print("检测到的异常值行索引:\n", data[outliers].index.tolist())  # row indices of the outliers
print(data[outliers])
data = data[~outliers]  # drop the outlier rows
x = data.drop(["客户价值"], axis=1)  # every column except the target "客户价值"
y = data["客户价值"]
# 4.3 Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# 4.4 Standardize: fit on the training set only; the scaled arrays replace the raw
#     ones so that every later step actually uses the standardized data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# 5. Build the model
# 5.1 Define the parameter grid
param_grid = {'alpha': np.logspace(-4, 4, 100)}  # 100 values from 0.0001 to 10000
# 5.2 Search for the best alpha with GridSearchCV
ridge = Ridge()
grid_search = GridSearchCV(ridge, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
# 5.3 Print the best parameter
best_alpha = grid_search.best_params_['alpha']
print("最佳 alpha:", best_alpha)
# 5.4 Train the final model with the best alpha
ridge_best = Ridge(alpha=best_alpha)
ridge_best.fit(X_train, y_train)
# 5.5 Predict
y_pred = ridge_best.predict(X_test)
# 5.6 Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
# 6. Residual analysis
residuals = y_test - y_pred
# 6.1 Plot residuals against predictions
plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("预测值")
plt.ylabel("残差")
plt.title("残差分析")
plt.show()
# 7. Compute AIC and BIC
# Note: these use the test-set MSE, so treat them as rough comparison scores
n = len(y_test)  # number of samples
k = X_train.shape[1]  # number of features
aic = n * np.log(mse) + 2 * (k + 1)  # +1 for the intercept
bic = n * np.log(mse) + np.log(n) * (k + 1)
print("AIC:", aic)
print("BIC:", bic)
# 8. Compute the adjusted R²
r2 = r2_score(y_test, y_pred)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print("调整后的 R²:", adjusted_r2)

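V. Lasso regression

Uses L1 regularization, which penalizes the absolute values of the coefficients and can set some of them exactly to zero.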
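The practical consequence of the L1 penalty is sparsity. The sketch below, on synthetic data where two of five features are irrelevant by construction (an assumption of this demo, as is alpha=0.3), counts how many coefficients are exactly zero under ordinary least squares and under Lasso.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Hypothetical data: features 2 and 3 have a true coefficient of 0
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(0, 0.5, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
lasso = Lasso(alpha=0.3).fit(X_demo, y_demo)

# Count coefficients that are exactly zero
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print(ols.coef_)     # small but nonzero values for the irrelevant features
print(lasso.coef_)   # irrelevant features are typically driven to exactly 0
print(n_zero_ols, n_zero_lasso)
```

This is why Lasso doubles as a feature selection method, and why the listing below counts the nonzero coefficients when computing AIC and BIC.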

# 1. Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from scipy import stats
# 2. Set display options
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
pd.set_option('display.max_rows', None)      # no limit on displayed rows
pd.set_option('display.max_columns', None)   # no limit on displayed columns
pd.set_option('display.max_colwidth', None)  # no limit on column width
pd.set_option('display.width', None)         # detect the display width automatically
# 3. Load the data
data = pd.read_excel("客户价值数据表.xlsx")
# 4. Data preprocessing
# 4.1 Fill missing values with the column mean
print("缺失值统计:\n", data.isnull().sum())
data = data.apply(lambda col: col.fillna(col.mean()), axis=0)
# print(data.head())
# 4.2 Outlier handling
numeric_data = data.select_dtypes(include=[np.number])
z_scores = np.abs(stats.zscore(numeric_data))  # Z-scores for the numeric columns only
threshold = 3  # flag values more than 3 standard deviations from the mean
outliers = (z_scores > threshold).any(axis=1)  # rows with at least one outlying value
print("检测到的异常值行索引:\n", data[outliers].index.tolist())  # row indices of the outliers
print(data[outliers])
data = data[~outliers]  # drop the outlier rows
x = data.drop(["客户价值"], axis=1)  # every column except the target "客户价值"
y = data["客户价值"]
# 4.3 Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# 4.4 Standardize: fit on the training set only; the scaled arrays replace the raw
#     ones so that every later step actually uses the standardized data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# 5. Build the model
# 5.1 Search for the best alpha with LassoCV
lasso_cv = LassoCV(alphas=np.logspace(-4, 4, 100), cv=5)  # 100 alpha values, 5-fold cross-validation
lasso_cv.fit(X_train, y_train)
# 5.2 Print the best parameter
best_alpha = lasso_cv.alpha_
print("最佳 alpha:", best_alpha)
# 5.3 Train the final model with the best alpha
lasso_best = Lasso(alpha=best_alpha)
lasso_best.fit(X_train, y_train)
# 5.4 Predict
y_pred = lasso_best.predict(X_test)
# 5.5 Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
# 5.6 Print the model coefficients
print("Coefficients:", lasso_best.coef_)
print("Intercept:", lasso_best.intercept_)
# 6. Compute AIC and BIC
# Note: these use the test-set MSE, so treat them as rough comparison scores
n = len(y_test)  # number of samples
k = np.sum(lasso_best.coef_ != 0)  # number of nonzero coefficients
aic = n * np.log(mse) + 2 * (k + 1)  # +1 for the intercept
bic = n * np.log(mse) + np.log(n) * (k + 1)
print("AIC:", aic)
print("BIC:", bic)
# 7. Compute the adjusted R²
r2 = r2_score(y_test, y_pred)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print("调整后的 R²:", adjusted_r2)
# 8. Residual analysis
residuals = y_test - y_pred
# 8.1 Plot residuals against predictions
plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("预测值")
plt.ylabel("残差")
plt.title("残差分析")
plt.show()

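VI. Elastic net regression

Combines L1 and L2 regularization; the l1_ratio parameter sets the mix between the two penalties.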
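In scikit-learn's ElasticNet the penalty is alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2, so l1_ratio=1 reduces to Lasso. The sketch below checks that on synthetic data; the data, alpha=0.1, and l1_ratio=0.5 are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Hypothetical data: 100 samples, 4 features, two of them irrelevant
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 0.0]) + rng.normal(0, 0.3, size=100)

enet_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)    # pure L1 penalty
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)                         # same objective
enet_mix = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_demo, y_demo)   # half L1, half L2

print(np.allclose(enet_l1.coef_, lasso.coef_))   # True
print(enet_mix.coef_)
```

Values of l1_ratio between 0 and 1 trade Lasso's sparsity against Ridge's stability when features are correlated, which is why the listing below searches over both alpha and l1_ratio.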

# 1. Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNetCV, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from scipy import stats
# 2. Set display options
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
pd.set_option('display.max_rows', None)      # no limit on displayed rows
pd.set_option('display.max_columns', None)   # no limit on displayed columns
pd.set_option('display.max_colwidth', None)  # no limit on column width
pd.set_option('display.width', None)         # detect the display width automatically
# 3. Load the data
data = pd.read_excel("客户价值数据表.xlsx")
# 4. Data preprocessing
# 4.1 Fill missing values with the column mean
print("缺失值统计:\n", data.isnull().sum())
data = data.apply(lambda col: col.fillna(col.mean()), axis=0)
# print(data.head())
# 4.2 Outlier handling
numeric_data = data.select_dtypes(include=[np.number])
z_scores = np.abs(stats.zscore(numeric_data))  # Z-scores for the numeric columns only
threshold = 3  # flag values more than 3 standard deviations from the mean
outliers = (z_scores > threshold).any(axis=1)  # rows with at least one outlying value
print("检测到的异常值行索引:\n", data[outliers].index.tolist())  # row indices of the outliers
print(data[outliers])
data = data[~outliers]  # drop the outlier rows
x = data.drop(["客户价值"], axis=1)  # every column except the target "客户价值"
y = data["客户价值"]
# 4.3 Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# 4.4 Standardize: fit on the training set only; the scaled arrays replace the raw
#     ones so that every later step actually uses the standardized data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# 5. Build the model
# 5.1 Search for the best alpha and l1_ratio with ElasticNetCV
alphas = np.logspace(-4, 4, 100)  # 100 alpha values
l1_ratios = np.linspace(0.1, 1, 10)  # 10 l1_ratio values from 0.1 to 1 (kept above 0)
model_cv = ElasticNetCV(alphas=alphas, l1_ratio=l1_ratios, cv=5, random_state=42, max_iter=5000, tol=1e-5)
model_cv.fit(X_train, y_train)
# 5.2 Print the best parameters
best_alpha = model_cv.alpha_
best_l1_ratio = model_cv.l1_ratio_
print("最佳 alpha:", best_alpha)
print("最佳 l1_ratio:", best_l1_ratio)
# 5.3 Train the final model with the best parameters
model = ElasticNet(alpha=best_alpha, l1_ratio=best_l1_ratio, random_state=42)
model.fit(X_train, y_train)
# 5.4 Predict
y_pred = model.predict(X_test)
# 5.5 Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("均方误差 (MSE):", mse)
print("R² 得分:", r2)
# 6. Visualization
# Plot the coefficients
plt.figure(figsize=(10, 6))
plt.bar(range(len(model.coef_)), model.coef_)
plt.xlabel("特征")
plt.ylabel("系数")
plt.title("弹性网络回归系数")
plt.show()
# 7. Residual analysis
residuals = y_test - y_pred
# 7.1 Plot residuals against predictions
plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("预测值")
plt.ylabel("残差")
plt.title("残差分析")
plt.show()
# 7.2 Histogram of the residuals
plt.figure(figsize=(10, 6))
plt.hist(residuals, bins=30, edgecolor='k')
plt.xlabel("残差")
plt.ylabel("频率")
plt.title("残差的直方图")
plt.show()
# 7.3 Q-Q plot of the residuals against a normal distribution
plt.figure(figsize=(10, 6))
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q 图")
plt.show()

Feature engineering 1

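# 1. Import the required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.feature_selection import mutual_info_regression
# 2. Display options
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
"""
1. Pearson correlation coefficient: measures the linear relationship between two continuous
   variables and ranges from -1 to 1. Values near 1 indicate a strong positive correlation,
   values near -1 a strong negative correlation, and values near 0 no linear correlation.
2. Spearman rank correlation coefficient: measures the monotonic relationship between two
   variables; suitable for data that is not normally distributed.
3. Kendall correlation coefficient: another rank-based measure of association; suitable for
   small samples.
"""
df = pd.read_excel("客户价值数据表.xlsx")
pearson = df.corr(method='pearson')    # Pearson correlation
spearman = df.corr(method='spearman')  # Spearman rank correlation
kendall = df.corr(method='kendall')    # Kendall correlation
correlation_matrices = [pearson, spearman, kendall]
names = ["pearson", "spearman", "kendall"]
# Draw one heatmap per correlation matrix
for matrix, name in zip(correlation_matrices, names):
    plt.figure(figsize=(10, 8))
    sns.heatmap(matrix, annot=True, fmt=".2f", cmap='coolwarm')
    plt.title(f"{name}相关性矩阵")
    plt.show()
# 3. VIF (variance inflation factor) detects multicollinearity by measuring how well each
#    feature is explained by the other features. The higher the VIF, the more strongly the
#    feature is related to the others; VIF > 10 is commonly taken as severe multicollinearity.
# Compute the VIF
X = df.drop('客户价值', axis=1)  # features
vif = pd.DataFrame()
vif['特征'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
# 4. Mutual information measures how much information two variables share and works for
#    both categorical and continuous variables. The higher the value, the stronger the dependence.
y = df["客户价值"]
mi = mutual_info_regression(X, y)
mi_scores = pd.Series(mi, index=X.columns)
print(mi_scores.sort_values(ascending=False))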

Feature selection methods:

I. Stepwise regression

# 1. Import the required libraries
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# 2. Set display options
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
pd.set_option('display.max_rows', None)      # no limit on displayed rows
pd.set_option('display.max_columns', None)   # no limit on displayed columns
pd.set_option('display.max_colwidth', None)  # no limit on column width
pd.set_option('display.width', None)         # detect the display width automatically
# 3. Load the data
data = pd.read_excel("客户价值数据表.xlsx")
x = data.drop(["客户价值"], axis=1)  # every column except the target "客户价值"
y = data["客户价值"]
# 4. Data preprocessing
# 4.1 Standardize, keeping the column names
#     (历史贷款金额, 贷款次数, 学历, 月收入, 性别) so features can be selected by name
scaler = StandardScaler()
x = pd.DataFrame(scaler.fit_transform(x), columns=x.columns, index=x.index)
# 4.2 Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# 5. Build the model
def stepwise_selection(X, y, initial_list=None, threshold_in=0.01, threshold_out=0.05, verbose=True):
    """
    Stepwise regression feature selection.
    :param X: feature data (DataFrame)
    :param y: target variable (Series)
    :param initial_list: initial list of features
    :param threshold_in: p-value threshold for adding a feature
    :param threshold_out: p-value threshold for removing a feature
    :param verbose: whether to print each step
    :return: list of selected features
    """
    included = list(initial_list) if initial_list is not None else []
    while True:
        changed = False
        # Forward step: try each excluded feature and record its p-value
        excluded = [col for col in X.columns if col not in included]
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included + [new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed = True
            if verbose:
                print(f"Add {best_feature} with p-value {best_pval:.6f}")
        # Backward step: drop the least significant included feature
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        pvalues = model.pvalues.iloc[1:]  # skip the intercept
        worst_pval = pvalues.max()
        if worst_pval > threshold_out:
            changed = True
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            if verbose:
                print(f"Remove {worst_feature} with p-value {worst_pval:.6f}")
        if not changed:
            break
    return included
# Run the stepwise regression
selected_features = stepwise_selection(X_train, y_train)
# Print the selected features
print("最终选择的特征:", selected_features)

II. Principal component analysis