Credit Card Fraud Detection

Credit card fraud detection is a Kaggle project whose data come from transactions made in 2013 by European credit card holders; see https://www.kaggle.com/mlg-ulb/creditcardfraud for details.
The goal is to predict whether a given transaction is fraudulent. What distinguishes it from most machine learning projects is the imbalance between positive and negative samples, which here is extreme, so this is the first problem the feature engineering has to deal with.
On the other hand, preprocessing is lighter in one respect: there are no missing values to handle, and most of the features have already been scaled.

Project Background and a First Look at the Data

# Import the basic libraries; model libraries are imported later when they are needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# Widen the pandas display settings so long rows are not truncated
pd.set_option('display.max_columns', 10000)
pd.set_option('display.max_colwidth', 10000)
pd.set_option('display.width', 10000)
# Load the data and look at its basic structure
data = pd.read_csv('D:/数据分析/kaggle/信用卡欺诈/creditcard.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     284807 non-null  float64
 22  V22     284807 non-null  float64
 23  V23     284807 non-null  float64
 24  V24     284807 non-null  float64
 25  V25     284807 non-null  float64
 26  V26     284807 non-null  float64
 27  V27     284807 non-null  float64
 28  V28     284807 non-null  float64
 29  Amount  284807 non-null  float64
 30  Class   284807 non-null  int64  
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
data.shape
(284807, 31)
data.describe()
TimeV1V2V3V4V5V6V7V8V9V10V11V12V13V14V15V16V17V18V19V20V21V22V23V24V25V26V27V28AmountClass
count284807.0000002.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+052.848070e+05284807.000000284807.000000
mean94813.8595753.919560e-155.688174e-16-8.769071e-152.782312e-15-1.552563e-152.010663e-15-1.694249e-15-1.927028e-16-3.137024e-151.768627e-159.170318e-16-1.810658e-151.693438e-151.479045e-153.482336e-151.392007e-15-7.528491e-164.328772e-169.049732e-165.085503e-161.537294e-167.959909e-165.367590e-164.458112e-151.453003e-151.699104e-15-3.660161e-16-1.206049e-1688.3496190.001727
std47488.1459551.958696e+001.651309e+001.516255e+001.415869e+001.380247e+001.332271e+001.237094e+001.194353e+001.098632e+001.088850e+001.020713e+009.992014e-019.952742e-019.585956e-019.153160e-018.762529e-018.493371e-018.381762e-018.140405e-017.709250e-017.345240e-017.257016e-016.244603e-016.056471e-015.212781e-014.822270e-014.036325e-013.300833e-01250.1201090.041527
min0.000000-5.640751e+01-7.271573e+01-4.832559e+01-5.683171e+00-1.137433e+02-2.616051e+01-4.355724e+01-7.321672e+01-1.343407e+01-2.458826e+01-4.797473e+00-1.868371e+01-5.791881e+00-1.921433e+01-4.498945e+00-1.412985e+01-2.516280e+01-9.498746e+00-7.213527e+00-5.449772e+01-3.483038e+01-1.093314e+01-4.480774e+01-2.836627e+00-1.029540e+01-2.604551e+00-2.256568e+01-1.543008e+010.0000000.000000
25%54201.500000-9.203734e-01-5.985499e-01-8.903648e-01-8.486401e-01-6.915971e-01-7.682956e-01-5.540759e-01-2.086297e-01-6.430976e-01-5.354257e-01-7.624942e-01-4.055715e-01-6.485393e-01-4.255740e-01-5.828843e-01-4.680368e-01-4.837483e-01-4.988498e-01-4.562989e-01-2.117214e-01-2.283949e-01-5.423504e-01-1.618463e-01-3.545861e-01-3.171451e-01-3.269839e-01-7.083953e-02-5.295979e-025.6000000.000000
50%84692.0000001.810880e-026.548556e-021.798463e-01-1.984653e-02-5.433583e-02-2.741871e-014.010308e-022.235804e-02-5.142873e-02-9.291738e-02-3.275735e-021.400326e-01-1.356806e-025.060132e-024.807155e-026.641332e-02-6.567575e-02-3.636312e-033.734823e-03-6.248109e-02-2.945017e-026.781943e-03-1.119293e-024.097606e-021.659350e-02-5.213911e-021.342146e-031.124383e-0222.0000000.000000
75%139320.5000001.315642e+008.037239e-011.027196e+007.433413e-016.119264e-013.985649e-015.704361e-013.273459e-015.971390e-014.539234e-017.395934e-016.182380e-016.625050e-014.931498e-016.488208e-015.232963e-013.996750e-015.008067e-014.589494e-011.330408e-011.863772e-015.285536e-011.476421e-014.395266e-013.507156e-012.409522e-019.104512e-027.827995e-0277.1650000.000000
max172792.0000002.454930e+002.205773e+019.382558e+001.687534e+013.480167e+017.330163e+011.205895e+022.000721e+011.559499e+012.374514e+011.201891e+017.848392e+007.126883e+001.052677e+018.877742e+001.731511e+019.253526e+005.041069e+005.591971e+003.942090e+012.720284e+011.050309e+012.252841e+014.584549e+007.519589e+003.517346e+003.161220e+013.384781e+0125691.1600001.000000
data.head().append(data.tail())
TimeV1V2V3V4V5V6V7V8V9V10V11V12V13V14V15V16V17V18V19V20V21V22V23V24V25V26V27V28AmountClass
00.0-1.359807-0.0727812.5363471.378155-0.3383210.4623880.2395990.0986980.3637870.090794-0.551600-0.617801-0.991390-0.3111691.468177-0.4704010.2079710.0257910.4039930.251412-0.0183070.277838-0.1104740.0669280.128539-0.1891150.133558-0.021053149.620
10.01.1918570.2661510.1664800.4481540.060018-0.082361-0.0788030.085102-0.255425-0.1669741.6127271.0652350.489095-0.1437720.6355580.463917-0.114805-0.183361-0.145783-0.069083-0.225775-0.6386720.101288-0.3398460.1671700.125895-0.0089830.0147242.690
21.0-1.358354-1.3401631.7732090.379780-0.5031981.8004990.7914610.247676-1.5146540.2076430.6245010.0660840.717293-0.1659462.345865-2.8900831.109969-0.121359-2.2618570.5249800.2479980.7716790.909412-0.689281-0.327642-0.139097-0.055353-0.059752378.660
31.0-0.966272-0.1852261.792993-0.863291-0.0103091.2472030.2376090.377436-1.387024-0.054952-0.2264870.1782280.507757-0.287924-0.631418-1.059647-0.6840931.965775-1.232622-0.208038-0.1083000.005274-0.190321-1.1755750.647376-0.2219290.0627230.061458123.500
42.0-1.1582330.8777371.5487180.403034-0.4071930.0959210.592941-0.2705330.8177390.753074-0.8228430.5381961.345852-1.1196700.175121-0.451449-0.237033-0.0381950.8034870.408542-0.0094310.798278-0.1374580.141267-0.2060100.5022920.2194220.21515369.990
284802172786.0-11.88111810.071785-9.834783-2.066656-5.364473-2.606837-4.9182157.3053341.9144284.356170-1.5931052.711941-0.6892564.626942-0.9244591.1076411.9916910.510632-0.6829201.4758290.2134540.1118641.014480-0.5093481.4368070.2500340.9436510.8237310.770
284803172787.0-0.732789-0.0550802.035030-0.7385890.8682291.0584150.0243300.2948690.584800-0.975926-0.1501890.9158021.214756-0.6751431.164931-0.711757-0.025693-1.221179-1.5455560.0596160.2142050.9243840.012463-1.016226-0.606624-0.3952550.068472-0.05352724.790
284804172788.01.919565-0.301254-3.249640-0.5578282.6305153.031260-0.2968270.7084170.432454-0.4847820.4116140.063119-0.183699-0.5106021.3292840.1407160.3135020.395652-0.5772520.0013960.2320450.578229-0.0375010.6401340.265745-0.0873710.004455-0.02656167.880
284805172788.0-0.2404400.5304830.7025100.689799-0.3779610.623708-0.6861800.6791450.392087-0.399126-1.933849-0.962886-1.0420820.4496241.962563-0.6085770.5099281.1139812.8978490.1274340.2652450.800049-0.1632980.123205-0.5691590.5466680.1088210.10453310.000
284806172792.0-0.533413-0.1897330.703337-0.506271-0.012546-0.6496171.577006-0.4146500.486180-0.915427-1.040458-0.031513-0.188093-0.0843160.041333-0.302620-0.6603770.167430-0.2561170.3829480.2610570.6430780.3767770.008797-0.473649-0.818267-0.0024150.013649217.000
data.Class.value_counts()
0    284315
1       492
Name: Class, dtype: int64
  • The data contain 284807 samples, 30 feature attributes and one class label. There are no missing values, so no missing-value handling is needed. The non-anonymized features are the time, the transaction amount and the class label; the 28 anonymized features V1-V28 have means essentially equal to 0 and standard deviations on the order of 1, so they have already been normalized. (The Kaggle description says the anonymized features are the result of a PCA transformation applied for confidentiality, and that Time holds the number of seconds elapsed between each transaction and the first transaction in the dataset. The data span two days, and the maximum of 172792 seconds is indeed about 48 hours.)
  • The classes are heavily imbalanced: the mean of Class in the descriptive statistics is 0.001727, so the vast majority of samples are 0. data.Class.value_counts() confirms this: 284315 normal transactions versus only 492 fraudulent ones. Handling this extreme imbalance is the main task of the feature engineering.
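
As a quick sanity check of the two points above (a small sketch; it only restates what the outputs already show):

# The Time column spans roughly 48 hours
print(data.Time.max() / 3600)
# Fraudulent transactions make up about 0.17% of the data
print(data.Class.value_counts(normalize=True))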

Exploratory Data Analysis (EDA)

* Univariate analysis

All the attributes are numeric, so there is no need to separate numeric from categorical attributes or to re-encode categorical ones. Below we examine individual attributes, starting with the target class.

The class distribution plot, together with the kurtosis and skewness values, shows the imbalance of the samples more directly.

# Bar chart of the fraud / non-fraud class distribution
sns.countplot('Class', data=data, color='blue')
plt.xlabel('values')
plt.ylabel('Counts')
plt.title('Class Distributions \n (0: No Fraud || 1: Fraud)')
Text(0.5, 1.0, 'Class Distributions \n (0: No Fraud || 1: Fraud)')

[Figure: class distribution bar chart (0: No Fraud, 1: Fraud)]

print('Kurtosis:', data.Class.kurt())
print('Skewness:', data.Class.skew())
Kurtosis: 573.887842782971
Skewness: 23.99757931064749

Next we look at the two attributes that have not been standardized: Time and Amount.

# Distributions of transaction time and amount; the amount is fitted with both a normal and a lognormal distribution
import scipy.stats as st
fig, ax = plt.subplots(1, 3, figsize=(18, 4))
print(ax)
sns.distplot(data.Amount, color='blue', ax=ax[0], kde=False, fit=st.norm)
ax[0].set_title('Distribution of transaction amount_normal')
sns.distplot(data.Amount, color='blue', ax=ax[1], fit=st.lognorm)
ax[1].set_title('Distribution of transaction amount_lognorm')
sns.distplot(data.Time, color='r', ax=ax[2])
ax[2].set_title('Distribution of transaction time')

[Figure: distribution of the transaction amount (normal fit and lognormal fit) and of the transaction time]

print(data.Amount.value_counts())
1.00       13688
1.98        6044
0.89        4872
9.99        4747
15.00       3280
             ...  
192.63         1
218.84         1
195.52         1
793.50         1
1080.06        1
Name: Amount, Length: 32767, dtype: int64
print('the ratio of Amount<5:', data.Amount[data.Amount < 5].value_counts(
).sum()/data.Amount.value_counts().sum())
print('the ratio of Amount<10:', data.Amount[data.Amount < 10].value_counts(
).sum()/data.Amount.value_counts().sum())
print('the ratio of Amount<20:', data.Amount[data.Amount < 20].value_counts(
).sum()/data.Amount.value_counts().sum())
print('the ratio of Amount<30:', data.Amount[data.Amount < 30].value_counts(
).sum()/data.Amount.value_counts().sum())
print('the ratio of Amount<50:', data.Amount[data.Amount < 50].value_counts(
).sum()/data.Amount.value_counts().sum())
print('the ratio of Amount<100:', data.Amount[data.Amount < 100].value_counts(
).sum()/data.Amount.value_counts().sum())
print('the ratio of Amount>5000:', data.Amount[data.Amount > 5000].value_counts(
).sum()/data.Amount.value_counts().sum())
the ratio of Amount<5: 0.2368726892246328
the ratio of Amount<10: 0.3416840175978821
the ratio of Amount<20: 0.481476929991187
the ratio of Amount<30: 0.562022703093674
the ratio of Amount<50: 0.6660791342909409
the ratio of Amount<100: 0.7985126770058321
the ratio of Amount>5000: 0.00019311323106524768

On the amount, the vast majority of transactions are small, under 50, with a few large values such as 1080. On the time axis there are two troughs, around 15000 s and 100000 s, i.e. roughly 4 hours and 27 hours after data collection started; presumably these correspond to 3-4 a.m., which matches reality, since few people shop in the middle of the night. Because the other, anonymized attributes have already been scaled, and the normal fit of the amount is better than the lognormal one, we also standardize Amount below; likewise, Time is converted to hours and then standardized.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
data['Amount'] = sc.fit_transform(data.Amount.values.reshape(-1, 1))
# reshape(-1, 1): -1 lets numpy infer the number of rows, 1 means a single column
data['Hour'] = data.Time.apply(lambda x: divmod(x, 3600)[0])
data['Hour'] = data.Hour.apply(lambda x: divmod(x, 24)[1])
# Fold the hour into a 24-hour cycle, since transaction density is periodic over the day
data['Hour'] = sc.fit_transform(data['Hour'].values.reshape(-1, 1))
data.drop(columns='Time', inplace=True)
data.head().append(data.tail())
V1V2V3V4V5V6V7V8V9V10V11V12V13V14V15V16V17V18V19V20V21V22V23V24V25V26V27V28AmountClassHour
0-1.359807-0.0727812.5363471.378155-0.3383210.4623880.2395990.0986980.3637870.090794-0.551600-0.617801-0.991390-0.3111691.468177-0.4704010.2079710.0257910.4039930.251412-0.0183070.277838-0.1104740.0669280.128539-0.1891150.133558-0.0210530.2449640-2.40693
11.1918570.2661510.1664800.4481540.060018-0.082361-0.0788030.085102-0.255425-0.1669741.6127271.0652350.489095-0.1437720.6355580.463917-0.114805-0.183361-0.145783-0.069083-0.225775-0.6386720.101288-0.3398460.1671700.125895-0.0089830.014724-0.3424750-2.40693
2-1.358354-1.3401631.7732090.379780-0.5031981.8004990.7914610.247676-1.5146540.2076430.6245010.0660840.717293-0.1659462.345865-2.8900831.109969-0.121359-2.2618570.5249800.2479980.7716790.909412-0.689281-0.327642-0.139097-0.055353-0.0597521.1606860-2.40693
3-0.966272-0.1852261.792993-0.863291-0.0103091.2472030.2376090.377436-1.387024-0.054952-0.2264870.1782280.507757-0.287924-0.631418-1.059647-0.6840931.965775-1.232622-0.208038-0.1083000.005274-0.190321-1.1755750.647376-0.2219290.0627230.0614580.1405340-2.40693
4-1.1582330.8777371.5487180.403034-0.4071930.0959210.592941-0.2705330.8177390.753074-0.8228430.5381961.345852-1.1196700.175121-0.451449-0.237033-0.0381950.8034870.408542-0.0094310.798278-0.1374580.141267-0.2060100.5022920.2194220.215153-0.0734030-2.40693
284802-11.88111810.071785-9.834783-2.066656-5.364473-2.606837-4.9182157.3053341.9144284.356170-1.5931052.711941-0.6892564.626942-0.9244591.1076411.9916910.510632-0.6829201.4758290.2134540.1118641.014480-0.5093481.4368070.2500340.9436510.823731-0.35015101.53423
284803-0.732789-0.0550802.035030-0.7385890.8682291.0584150.0243300.2948690.584800-0.975926-0.1501890.9158021.214756-0.6751431.164931-0.711757-0.025693-1.221179-1.5455560.0596160.2142050.9243840.012463-1.016226-0.606624-0.3952550.068472-0.053527-0.25411701.53423
2848041.919565-0.301254-3.249640-0.5578282.6305153.031260-0.2968270.7084170.432454-0.4847820.4116140.063119-0.183699-0.5106021.3292840.1407160.3135020.395652-0.5772520.0013960.2320450.578229-0.0375010.6401340.265745-0.0873710.004455-0.026561-0.08183901.53423
284805-0.2404400.5304830.7025100.689799-0.3779610.623708-0.6861800.6791450.392087-0.399126-1.933849-0.962886-1.0420820.4496241.962563-0.6085770.5099281.1139812.8978490.1274340.2652450.800049-0.1632980.123205-0.5691590.5466680.1088210.104533-0.31324901.53423
284806-0.533413-0.1897330.703337-0.506271-0.012546-0.6496171.577006-0.4146500.486180-0.915427-1.040458-0.031513-0.188093-0.0843160.041333-0.302620-0.6603770.167430-0.2561170.3829480.2610570.6430780.3767770.008797-0.473649-0.818267-0.0024150.0136490.51435501.53423
# Shuffle the rows: Hour is ordered, and this prepares for the later train/test split
data = data.sample(frac=1)
data.head().append(data.tail())
V1V2V3V4V5V6V7V8V9V10V11V12V13V14V15V16V17V18V19V20V21V22V23V24V25V26V27V28AmountClassHour
1072161.324299-1.0855630.237279-1.479829-1.161581-0.372790-0.742945-0.049038-2.3765631.5162491.6315160.1370250.6279110.0493870.025037-0.5329860.575155-0.556925-0.065486-0.214061-0.526897-1.3424340.251389-0.065977-0.029430-0.6363250.0206830.022199-0.05513200.848811
168047-0.2168300.175845-0.1269170.1161603.3483834.2926720.2967030.6221670.3899990.133185-0.6514320.098311-0.612388-0.477757-1.049531-1.5588860.199438-0.7729571.2040000.094183-0.318508-0.357454-0.1127930.685842-0.469426-0.769896-0.187049-0.232898-0.3452330-0.864737
32891.2502490.019063-1.326108-0.0390592.2323413.300602-0.3264350.757703-0.1563520.062703-0.1616490.029727-0.0184570.5309511.1075120.377201-0.9263630.295044-0.0067830.033750-0.009900-0.189322-0.1577341.0053260.838403-0.3155820.0114390.018031-0.2332870-2.406930
279921-4.8147084.736862-4.817819-1.103622-2.256585-2.425710-1.6570503.493293-0.2078190.013773-2.2603130.744399-0.9738003.250789-0.2825990.3134081.121537-0.022531-0.416665-0.1179940.3753610.5546910.3497640.0261270.2755420.1481780.1023200.185414-0.31996501.362876
1782972.100716-0.778343-0.596761-0.557506-0.5752070.293849-0.9582460.146132-0.1369210.7989090.4286480.9584040.715275-0.043584-0.140031-1.084006-0.6497641.591873-0.423356-0.580698-0.562834-1.0444070.4598200.203965-0.508869-0.6864750.053956-0.034934-0.3531890-0.693382
2249442.0533110.089735-1.6818360.4542120.298310-0.9535260.152003-0.2070710.587335-0.362047-0.589598-0.174712-0.621127-0.7035130.2719570.3186880.549365-0.2577860.016256-0.187421-0.361158-0.9842620.3541980.620709-0.2971380.166736-0.068299-0.029585-0.31728700.334747
1447941.9633770.175655-1.7915051.1803710.493289-1.1572600.678691-0.322212-0.2731200.5533500.9006650.380835-1.1977761.219675-0.368170-0.251083-0.5450040.080937-0.199778-0.2961840.1884540.525823-0.019335-0.0077000.374702-0.503352-0.043733-0.070953-0.2338870-2.406930
565611.1802640.668819-0.2423821.2847480.029324-1.0395430.202761-0.102430-0.340626-0.5083671.9780500.556068-0.337921-1.0959690.3914290.7408690.7266371.040940-0.411526-0.103660-0.0044790.014954-0.1089920.3978770.628531-0.3569310.0306630.049046-0.3492310-0.179318
37115-2.0910271.2490320.841086-0.777488-0.176500-0.077257-0.118603-0.2567510.178740-0.0003050.9918560.698911-0.9019700.341906-0.643972-0.011763-0.069715-0.449297-0.255400-0.5172880.631502-0.4132650.293367-0.000012-0.3186880.224045-0.725597-0.392266-0.3492310-0.693382
589511.1624470.2724580.6151651.058086-0.262004-0.359390-0.012728-0.0156200.066470-0.087054-0.0614590.4703500.2153490.2932571.308914-0.006260-0.233542-0.816695-0.644417-0.141656-0.243925-0.6937250.1683490.0312700.216868-0.6651920.0458230.031301-0.3052520-0.179318
# Now look at the distributions of the anonymized features
# Skewness and kurtosis of the anonymized features
numerical_columns = data.columns.drop(['Class', 'Hour', 'Amount'])
for num_col in numerical_columns:
    print('{:10}'.format(num_col), 'Skewness:', '{:8.2f}'.format(data[num_col].skew()),
          '         Kurtosis:', '{:8.2f}'.format(data[num_col].kurt()))
V1         Skewness:    -3.28          Kurtosis:    32.49
V2         Skewness:    -4.62          Kurtosis:    95.77
V3         Skewness:    -2.24          Kurtosis:    26.62
V4         Skewness:     0.68          Kurtosis:     2.64
V5         Skewness:    -2.43          Kurtosis:   206.90
V6         Skewness:     1.83          Kurtosis:    42.64
V7         Skewness:     2.55          Kurtosis:   405.61
V8         Skewness:    -8.52          Kurtosis:   220.59
V9         Skewness:     0.55          Kurtosis:     3.73
V10        Skewness:     1.19          Kurtosis:    31.99
V11        Skewness:     0.36          Kurtosis:     1.63
V12        Skewness:    -2.28          Kurtosis:    20.24
V13        Skewness:     0.07          Kurtosis:     0.20
V14        Skewness:    -2.00          Kurtosis:    23.88
V15        Skewness:    -0.31          Kurtosis:     0.28
V16        Skewness:    -1.10          Kurtosis:    10.42
V17        Skewness:    -3.84          Kurtosis:    94.80
V18        Skewness:    -0.26          Kurtosis:     2.58
V19        Skewness:     0.11          Kurtosis:     1.72
V20        Skewness:    -2.04          Kurtosis:   271.02
V21        Skewness:     3.59          Kurtosis:   207.29
V22        Skewness:    -0.21          Kurtosis:     2.83
V23        Skewness:    -5.88          Kurtosis:   440.09
V24        Skewness:    -0.55          Kurtosis:     0.62
V25        Skewness:    -0.42          Kurtosis:     4.29
V26        Skewness:     0.58          Kurtosis:     0.92
V27        Skewness:    -1.17          Kurtosis:   244.99
V28        Skewness:    11.19          Kurtosis:   933.40
f = pd.melt(data, value_vars=numerical_columns)
g = sns.FacetGrid(f, col='variable', col_wrap=4, sharex=False, sharey=False)
g = g.map(sns.distplot, 'value')

[Figure: distribution plots of the anonymized features V1-V28]

Because the data have been standardized, the skewness of the anonymized features is relatively small while the kurtosis is large. V5, V7, V8, V20, V21, V23, V27 and V28 have especially high kurtosis, meaning their values are tightly concentrated; the other features are spread out more evenly.

* Relationships between pairs of attributes
# Correlation analysis
plt.figure(figsize=(8, 6))
sns.heatmap(data.corr(), square=True,cmap='coolwarm_r', annot_kws={'size': 20})
plt.show()
data.corr()

[Figure: correlation heatmap of the full, imbalanced dataset]

V1V2V3V4V5V6V7V8V9V10V11V12V13V14V15V16V17V18V19V20V21V22V23V24V25V26V27V28AmountClassHour
V11.000000e+001.717515e-16-9.049057e-16-2.483769e-163.029761e-161.242968e-164.904952e-17-2.809306e-175.255584e-175.521194e-172.590354e-161.898607e-16-3.778206e-174.132544e-16-1.522953e-163.007488e-16-3.106689e-171.645281e-161.045442e-169.852066e-17-1.808291e-167.990046e-171.056762e-16-5.113541e-17-2.165870e-16-1.408070e-161.664818e-162.419798e-16-0.227709-0.101347-0.005214
V21.717515e-161.000000e+009.206734e-17-1.226183e-161.487342e-163.492970e-16-4.501183e-17-5.839820e-17-1.832049e-16-2.598347e-163.312314e-16-3.228806e-16-1.091369e-16-4.657267e-165.392819e-17-3.749559e-18-5.591430e-162.877862e-16-1.672445e-173.898167e-174.667231e-171.203109e-163.242403e-16-1.254911e-168.784216e-172.450901e-16-5.467509e-16-6.910935e-17-0.5314090.0912890.007802
V3-9.049057e-169.206734e-171.000000e+00-2.981464e-16-6.943096e-161.308147e-152.120327e-16-8.586741e-179.780262e-172.764966e-161.500352e-162.182812e-16-4.679364e-176.942636e-16-5.333200e-175.460118e-162.134100e-162.870116e-163.728807e-161.267443e-161.189711e-16-2.343257e-16-8.182206e-17-3.300147e-171.123060e-16-2.136494e-164.752587e-166.073110e-16-0.210880-0.192961-0.021569
V4-2.483769e-16-1.226183e-16-2.981464e-161.000000e+00-1.903391e-15-4.169652e-16-6.535390e-175.942856e-166.175719e-16-6.910284e-17-2.936726e-16-1.448546e-163.050372e-17-8.547776e-172.459280e-16-8.218577e-17-4.443050e-165.369916e-18-2.842719e-16-2.222520e-161.390687e-172.189964e-161.663593e-161.403733e-166.312530e-16-4.009636e-16-6.309346e-17-2.064064e-160.0987320.133447-0.035063
V53.029761e-161.487342e-16-6.943096e-16-1.903391e-151.000000e+001.159613e-157.659742e-177.328495e-164.435269e-161.632311e-166.784587e-164.520778e-16-2.979964e-162.516209e-161.016075e-166.264287e-164.535815e-164.196874e-16-1.277261e-16-2.414281e-169.325965e-17-6.982655e-17-1.848644e-16-9.892370e-16-1.561416e-163.403172e-163.299056e-16-3.491468e-16-0.386356-0.094974-0.035134
V61.242968e-163.492970e-161.308147e-15-4.169652e-161.159613e-151.000000e+00-2.949670e-16-3.474079e-16-1.008735e-161.322142e-168.380230e-162.570184e-16-1.251524e-163.531769e-16-6.825844e-17-1.823748e-161.161080e-166.313161e-176.136340e-17-1.318056e-16-4.925144e-17-9.729827e-17-3.176032e-17-1.125379e-155.563670e-16-2.627057e-16-4.040640e-164.612882e-170.215981-0.043643-0.018945
V74.904952e-17-4.501183e-172.120327e-16-6.535390e-177.659742e-17-2.949670e-161.000000e+003.038794e-17-5.250969e-173.186953e-16-3.362622e-167.265464e-16-1.485108e-167.720708e-17-1.845909e-164.901078e-167.173458e-161.638629e-16-1.132423e-161.889527e-16-7.597231e-17-6.887963e-161.393022e-162.078026e-17-1.507689e-17-7.709408e-16-2.647380e-182.115388e-170.397311-0.187257-0.009729
V8-2.809306e-17-5.839820e-17-8.586741e-175.942856e-167.328495e-16-3.474079e-163.038794e-171.000000e+004.683000e-16-3.022868e-161.499830e-163.887009e-17-3.213252e-16-2.288651e-161.109628e-162.500367e-16-3.808536e-16-3.119192e-16-3.559019e-161.098800e-17-2.338214e-16-6.701600e-182.701514e-16-2.444390e-16-1.792313e-161.092765e-173.921512e-16-5.158971e-16-0.1030790.0198750.032106
V95.255584e-17-1.832049e-169.780262e-176.175719e-164.435269e-16-1.008735e-16-5.250969e-174.683000e-161.000000e+00-4.733827e-163.289603e-16-1.339732e-159.374886e-169.287436e-16-8.883532e-16-5.409347e-167.071023e-161.471108e-161.293082e-16-3.112119e-162.755460e-16-2.171404e-16-1.011218e-16-2.940457e-162.137255e-16-1.039639e-16-1.499396e-167.982292e-16-0.044246-0.097733-0.189830
V105.521194e-17-2.598347e-162.764966e-16-6.910284e-171.632311e-161.322142e-163.186953e-16-3.022868e-16-4.733827e-161.000000e+00-3.633385e-168.563304e-16-4.013607e-166.638602e-163.932439e-161.882434e-166.617837e-164.829483e-164.623218e-17-1.340974e-151.048675e-15-2.890990e-161.907376e-16-7.312196e-17-3.457860e-16-4.117783e-16-3.115507e-163.949646e-16-0.101502-0.2168830.024177
V112.590354e-163.312314e-161.500352e-16-2.936726e-166.784587e-168.380230e-16-3.362622e-161.499830e-163.289603e-16-3.633385e-161.000000e+00-7.116039e-164.369928e-16-1.283496e-161.903820e-161.158881e-166.624541e-169.910529e-17-1.093636e-15-1.478641e-166.632474e-181.312323e-171.404725e-161.672342e-15-6.082687e-16-1.240097e-16-1.519253e-16-2.909057e-160.0001040.154876-0.135131
V121.898607e-16-3.228806e-162.182812e-16-1.448546e-164.520778e-162.570184e-167.265464e-163.887009e-17-1.339732e-158.563304e-16-7.116039e-161.000000e+00-2.297323e-144.486162e-16-3.033543e-164.714076e-16-3.797286e-16-6.830564e-161.782434e-162.673446e-165.724276e-16-3.587155e-173.029886e-164.510178e-166.970336e-181.653468e-16-2.721798e-167.065902e-16-0.009542-0.2605930.352459
V13-3.778206e-17-1.091369e-16-4.679364e-173.050372e-17-2.979964e-16-1.251524e-16-1.485108e-16-3.213252e-169.374886e-16-4.013607e-164.369928e-16-2.297323e-141.000000e+001.415589e-15-1.185819e-164.849394e-168.705885e-172.432753e-16-6.331767e-17-3.200986e-171.428638e-16-4.602453e-17-7.174408e-16-6.376621e-16-1.142909e-16-1.478991e-16-5.300185e-161.043260e-150.005293-0.004570-0.187981
V144.132544e-16-4.657267e-166.942636e-16-8.547776e-172.516209e-163.531769e-167.720708e-17-2.288651e-169.287436e-166.638602e-16-1.283496e-164.486162e-161.415589e-151.000000e+00-2.864454e-16-8.191302e-161.131442e-15-3.009169e-162.138702e-16-5.239826e-17-2.462983e-166.492362e-162.160339e-16-1.258007e-17-7.178656e-17-2.488490e-17-1.739150e-172.414117e-150.033751-0.302544-0.162918
V15-1.522953e-165.392819e-17-5.333200e-172.459280e-161.016075e-16-6.825844e-17-1.845909e-161.109628e-16-8.883532e-163.932439e-161.903820e-16-3.033543e-16-1.185819e-16-2.864454e-161.000000e+009.678376e-16-5.606434e-166.692616e-16-1.423455e-152.118638e-166.349939e-17-3.516820e-161.024768e-16-4.337014e-162.281677e-161.108681e-16-1.246909e-15-9.799748e-16-0.002986-0.0042230.112251
V163.007488e-16-3.749559e-185.460118e-16-8.218577e-176.264287e-16-1.823748e-164.901078e-162.500367e-16-5.409347e-161.882434e-161.158881e-164.714076e-164.849394e-16-8.191302e-169.678376e-161.000000e+001.641102e-15-2.666175e-151.138371e-154.407936e-16-4.180114e-162.653008e-167.410993e-16-3.508969e-16-3.341605e-16-4.690618e-168.147869e-167.042089e-16-0.003910-0.1965390.005517
V17-3.106689e-17-5.591430e-162.134100e-16-4.443050e-164.535815e-161.161080e-167.173458e-16-3.808536e-167.071023e-166.617837e-166.624541e-16-3.797286e-168.705885e-171.131442e-15-5.606434e-161.641102e-151.000000e+00-5.251666e-153.694474e-16-8.921672e-16-1.086035e-15-3.486998e-164.072307e-16-1.897694e-167.587211e-172.084478e-166.669179e-16-5.419071e-170.007309-0.326481-0.064803
V181.645281e-162.877862e-162.870116e-165.369916e-184.196874e-166.313161e-171.638629e-16-3.119192e-161.471108e-164.829483e-169.910529e-17-6.830564e-162.432753e-16-3.009169e-166.692616e-16-2.666175e-15-5.251666e-151.000000e+00-2.719935e-15-4.098224e-16-1.240266e-15-5.279657e-16-2.362311e-16-1.869482e-16-2.451121e-163.089442e-162.209663e-168.158517e-160.035650-0.111485-0.003518
V191.045442e-16-1.672445e-173.728807e-16-2.842719e-16-1.277261e-166.136340e-17-1.132423e-16-3.559019e-161.293082e-164.623218e-17-1.093636e-151.782434e-16-6.331767e-172.138702e-16-1.423455e-151.138371e-153.694474e-16-2.719935e-151.000000e+002.693620e-166.052450e-16-1.036140e-155.861740e-16-9.630049e-178.161694e-165.479257e-16-1.243578e-16-1.291833e-15-0.0561510.0347830.021566
V209.852066e-173.898167e-171.267443e-16-2.222520e-16-2.414281e-16-1.318056e-161.889527e-161.098800e-17-3.112119e-16-1.340974e-15-1.478641e-162.673446e-16-3.200986e-17-5.239826e-172.118638e-164.407936e-16-8.921672e-16-4.098224e-162.693620e-161.000000e+00-1.118296e-151.101689e-151.107203e-161.749671e-16-6.786605e-18-3.590893e-16-8.488785e-16-4.584320e-160.3394030.0200900.000978
V21-1.808291e-164.667231e-171.189711e-161.390687e-179.325965e-17-4.925144e-17-7.597231e-17-2.338214e-162.755460e-161.048675e-156.632474e-185.724276e-161.428638e-16-2.462983e-166.349939e-17-4.180114e-16-1.086035e-15-1.240266e-156.052450e-16-1.118296e-151.000000e+003.540128e-154.521934e-161.014531e-16-1.173906e-16-4.337929e-16-1.484206e-151.584856e-160.1059990.040413-0.011915
V227.990046e-171.203109e-16-2.343257e-162.189964e-16-6.982655e-17-9.729827e-17-6.887963e-16-6.701600e-18-2.171404e-16-2.890990e-161.312323e-17-3.587155e-17-4.602453e-176.492362e-16-3.516820e-162.653008e-16-3.486998e-16-5.279657e-16-1.036140e-151.101689e-153.540128e-151.000000e+003.086083e-166.736130e-17-9.827185e-16-2.194486e-171.478149e-16-5.686304e-16-0.0648010.000805-0.016610
V231.056762e-163.242403e-16-8.182206e-171.663593e-16-1.848644e-16-3.176032e-171.393022e-162.701514e-16-1.011218e-161.907376e-161.404725e-163.029886e-16-7.174408e-162.160339e-161.024768e-167.410993e-164.072307e-16-2.362311e-165.861740e-161.107203e-164.521934e-163.086083e-161.000000e+007.328447e-17-7.508801e-161.284451e-154.254579e-161.281294e-15-0.112633-0.0026850.006004
V24-5.113541e-17-1.254911e-16-3.300147e-171.403733e-16-9.892370e-16-1.125379e-152.078026e-17-2.444390e-16-2.940457e-16-7.312196e-171.672342e-154.510178e-16-6.376621e-16-1.258007e-17-4.337014e-16-3.508969e-16-1.897694e-16-1.869482e-16-9.630049e-171.749671e-161.014531e-166.736130e-177.328447e-171.000000e+001.242718e-151.863258e-16-2.894257e-16-2.844233e-160.005146-0.0072210.004328
V25-2.165870e-168.784216e-171.123060e-166.312530e-16-1.561416e-165.563670e-16-1.507689e-17-1.792313e-162.137255e-16-3.457860e-16-6.082687e-166.970336e-18-1.142909e-16-7.178656e-172.281677e-16-3.341605e-167.587211e-17-2.451121e-168.161694e-16-6.786605e-18-1.173906e-16-9.827185e-16-7.508801e-161.242718e-151.000000e+002.449277e-15-5.340203e-162.699748e-16-0.0478370.003308-0.003497
V26-1.408070e-162.450901e-16-2.136494e-16-4.009636e-163.403172e-16-2.627057e-16-7.709408e-161.092765e-17-1.039639e-16-4.117783e-16-1.240097e-161.653468e-16-1.478991e-16-2.488490e-171.108681e-16-4.690618e-162.084478e-163.089442e-165.479257e-16-3.590893e-16-4.337929e-16-2.194486e-171.284451e-151.863258e-162.449277e-151.000000e+00-2.939564e-16-2.558739e-16-0.0032080.0044550.001146
V271.664818e-16-5.467509e-164.752587e-16-6.309346e-173.299056e-16-4.040640e-16-2.647380e-183.921512e-16-1.499396e-16-3.115507e-16-1.519253e-16-2.721798e-16-5.300185e-16-1.739150e-17-1.246909e-158.147869e-166.669179e-162.209663e-16-1.243578e-16-8.488785e-16-1.484206e-151.478149e-164.254579e-16-2.894257e-16-5.340203e-16-2.939564e-161.000000e+00-2.403217e-160.0288250.017580-0.008676
V282.419798e-16-6.910935e-176.073110e-16-2.064064e-16-3.491468e-164.612882e-172.115388e-17-5.158971e-167.982292e-163.949646e-16-2.909057e-167.065902e-161.043260e-152.414117e-15-9.799748e-167.042089e-16-5.419071e-178.158517e-16-1.291833e-15-4.584320e-161.584856e-16-5.686304e-161.281294e-15-2.844233e-162.699748e-16-2.558739e-16-2.403217e-161.000000e+000.0102580.009536-0.007492
Amount-2.277087e-01-5.314089e-01-2.108805e-019.873167e-02-3.863563e-012.159812e-013.973113e-01-1.030791e-01-4.424560e-02-1.015021e-011.039770e-04-9.541802e-035.293409e-033.375117e-02-2.985848e-03-3.909527e-037.309042e-033.565034e-02-5.615079e-023.394034e-011.059989e-01-6.480065e-02-1.126326e-015.146217e-03-4.783686e-02-3.208037e-032.882546e-021.025822e-021.0000000.005632-0.006667
Class-1.013473e-019.128865e-02-1.929608e-011.334475e-01-9.497430e-02-4.364316e-02-1.872566e-011.987512e-02-9.773269e-02-2.168829e-011.548756e-01-2.605929e-01-4.569779e-03-3.025437e-01-4.223402e-03-1.965389e-01-3.264811e-01-1.114853e-013.478301e-022.009032e-024.041338e-028.053175e-04-2.685156e-03-7.220907e-033.307706e-034.455398e-031.757973e-029.536041e-030.0056321.000000-0.017109
Hour-5.214205e-037.802199e-03-2.156874e-02-3.506295e-02-3.513442e-02-1.894502e-02-9.729167e-033.210647e-02-1.898298e-012.417660e-02-1.351310e-013.524592e-01-1.879810e-01-1.629179e-011.122505e-015.517040e-03-6.480333e-02-3.518403e-032.156599e-029.780928e-04-1.191466e-02-1.660982e-026.004232e-034.328237e-03-3.497363e-031.146125e-03-8.676362e-03-7.492140e-03-0.006667-0.0171091.000000

Numerically, the features show no obvious correlation with one another, and because of the class imbalance their correlation with the target class is also weak. That is not what we want, so we balance the positive and negative samples first.

Before balancing, however, the data must be split into training and test sets. This keeps the evaluation fair: the test set has to come from the original data, since we cannot test on samples we constructed ourselves.
To keep the class distribution of the training and test sets consistent, we use StratifiedKFold for the split.

from sklearn.model_selection import StratifiedKFold
X_original = data.drop(columns='Class')
y_original = data['Class']
sss = StratifiedKFold(n_splits=5,random_state=None,shuffle=False)
for train_index, test_index in sss.split(X_original, y_original):
    print('Train:', train_index, 'test:', test_index)
    X_train, X_test = X_original.iloc[train_index], X_original.iloc[test_index]
    y_train, y_test = y_original.iloc[train_index], y_original.iloc[test_index]
print(X_train.shape)
print(X_test.shape)
Train: [ 52587  52905  53232 ... 284804 284805 284806] test: [    0     1     2 ... 56966 56967 56968]
Train: [     0      1      2 ... 284804 284805 284806] test: [ 52587  52905  53232 ... 113937 113938 113939]
Train: [     0      1      2 ... 284804 284805 284806] test: [104762 104905 105953 ... 170897 170898 170899]
Train: [     0      1      2 ... 284804 284805 284806] test: [162214 162834 162995 ... 227854 227855 227856]
Train: [     0      1      2 ... 227854 227855 227856] test: [222268 222591 222837 ... 284804 284805 284806]
(227846, 30)
(56961, 30)
# Check that the class distributions of the training and test sets match
print('Train:',[y_train.value_counts()/y_train.value_counts().sum()])
print('Test:',[y_test.value_counts()/y_test.value_counts().sum()])
Train: [0    0.998271
1    0.001729
Name: Class, dtype: float64]
Test: [0    0.99828
1    0.00172
Name: Class, dtype: float64]

For extremely imbalanced classes there are two main approaches: undersampling and oversampling.

  • Undersampling randomly draws from the majority class as many samples as the minority class has. For an extremely imbalanced problem this easily leads to underfitting, because very few samples remain. For example:
    from imblearn.under_sampling import RandomUnderSampler
    rus = RandomUnderSampler(random_state=1)
    X_undersampled, y_undersampled = rus.fit_resample(X, y)
  • Oversampling generates from the minority class as many samples as the majority class has. There are two common methods: random oversampling and SMOTE.
    • Random oversampling repeatedly draws existing samples from the minority class until it reaches the size of the majority class; this easily leads to overfitting.
    from imblearn.over_sampling import RandomOverSampler
    ros = RandomOverSampler(random_state = 1)
    X_oversampled,y_oversampled = ros.fit_resample(X,y)
    
    • SMOTE (Synthetic Minority Over-sampling Technique) is an improvement on random oversampling's tendency to overfit. It picks a random sample from the minority class, finds the minority-class samples nearest to it, and creates new samples as points on the line segments between the chosen sample and those neighbours, repeating until the minority class reaches the size of the majority class (a minimal interpolation sketch follows this list).
    from imblearn.over_sampling import SMOTE
    smote = SMOTE(random_state = 1)
    X_smotesampled,y_smotesampled = smote.fit_resample(X,y)
    

Counter(y_smotesampled) can be used to check that the two classes now contain the same number of samples.
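
For intuition, here is a minimal sketch of the interpolation step described above. It is illustrative only: the function name is made up, the neighbour is assumed to be given, and imblearn's actual implementation handles the neighbour search and the sampling strategy itself.

import numpy as np

def smote_interpolate(x_i, x_neighbor, rng=np.random):
    # A synthetic point lies somewhere on the segment between a minority
    # sample and one of its minority-class nearest neighbours
    gap = rng.uniform(0, 1)
    return x_i + gap * (x_neighbor - x_i)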

Below we balance the samples in two ways, plain random undersampling and SMOTE combined with random undersampling, and compare the resulting distributions.

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from collections import Counter
X = X_train.copy()
y = y_train.copy()
print('Imbalanced samples: ', Counter(y))
rus = RandomUnderSampler(random_state=1)
X_rus, y_rus = rus.fit_resample(X, y)
print('Random under sample: ', Counter(y_rus))
ros = RandomOverSampler(random_state=1)
X_ros, y_ros = ros.fit_resample(X, y)
print('Random over sample: ', Counter(y_ros))
smote = SMOTE(random_state=1,sampling_strategy=0.5)
X_smote, y_smote = smote.fit_resample(X, y)
print('SMOTE: ', Counter(y_smote))
under = RandomUnderSampler(sampling_strategy=1)
X_smote, y_smote = under.fit_resample(X_smote,y_smote)
print('SMOTE: ', Counter(y_smote))
Imbalanced samples:  Counter({0: 227452, 1: 394})
Random under sample:  Counter({0: 394, 1: 394})
Random over sample:  Counter({0: 227452, 1: 227452})
SMOTE:  Counter({0: 227452, 1: 113726})
SMOTE:  Counter({0: 113726, 1: 113726})

The output shows the sample counts after resampling: undersampling leaves 394 samples in each class, random oversampling brings both classes to 227452, and SMOTE followed by undersampling leaves 113726 in each. Next we examine the balanced samples.

Random undersampling

data_rus = pd.concat([X_rus,y_rus],axis=1)
# Distribution of each numeric feature
f = pd.melt(data_rus, value_vars=X_train.columns)
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False)
g = g.map(sns.distplot, 'value')

[Figure: distribution plots of the undersampled features]

# Box plot of each numeric feature
f = pd.melt(data_rus, value_vars=X_train.columns)
g = sns.FacetGrid(f,col='variable', col_wrap=3, sharex=False, sharey=False,size=5)
g = g.map(sns.boxplot, 'value', color='lightskyblue')

[Figure: box plots of the undersampled features]

# Violin plot of each numeric feature
f = pd.melt(data_rus, value_vars=X_train.columns)
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False,size=5)
g = g.map(sns.violinplot, 'value',color='lightskyblue')

[Figure: violin plots of the undersampled features]

The distribution plots show that most values are fairly concentrated, but some outliers exist (the box and violin plots make this more obvious). Outliers can pull the model off course, so we remove them. There are two common ways to flag outliers: the normal-distribution rule and the quartile rule. The former applies the 3σ principle; the latter marks values that lie more than a set multiple of the interquartile range beyond the lower or upper quartile as outliers.
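
The code below uses the quartile rule; for comparison, a 3σ version might look like the following sketch (it assumes the same data_rus frame and is not used in the rest of the analysis).

def outlier_process_3sigma(data, column):
    # Keep only rows within mean ± 3 standard deviations for this column
    mu, sigma = data[column].mean(), data[column].std()
    mask = (data[column] >= mu - 3 * sigma) & (data[column] <= mu + 3 * sigma)
    return data[mask]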

def outlier_process(data, column):
    Q1 = data[column].quantile(q=0.25)
    Q3 = data[column].quantile(q=0.75)
    low_whisker = Q1 - 3 * (Q3 - Q1)
    high_whisker = Q3 + 3 * (Q3 - Q1)
    # Drop the outliers
    data_drop = data[(data[column] >= low_whisker) & (data[column] <= high_whisker)]
    # Plot box plots before and after removal
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    sns.boxplot(y=data[column], ax=ax1, color='lightskyblue')
    ax1.set_title('before deleting outlier' + ' ' + column)
    sns.boxplot(y=data_drop[column], ax=ax2, color='lightskyblue')
    ax2.set_title('after deleting outlier' + ' ' + column)
    return data_drop

numerical_columns = data_rus.columns.drop('Class')
for col_name in numerical_columns:
    data_rus = outlier_process(data_rus, col_name)

[Figures: before/after box plots for each feature as its outliers are removed]

Some attributes, such as Hour, have fairly concentrated distributions and barely change after the outliers are removed.

With the dirty data cleaned up, we probe the correlations between the variables again.

# Correlation heatmaps before and after undersampling
fig,(ax1,ax2) = plt.subplots(2,1,figsize=(10,10))
sns.heatmap(data.corr(),cmap = 'coolwarm_r',ax=ax1,vmax=0.8)
ax1.set_title('the relationship on imbalanced samples')
sns.heatmap(data_rus.corr(),cmap = 'coolwarm_r',ax=ax2,vmax=0.8)
ax2.set_title('the relationship on random under samples')
Text(0.5, 1.0, 'the relationship on random under samples')

[Figure: correlation heatmaps on the imbalanced data and on the random undersample]

# Correlation between each numeric attribute and Class
data_rus.corr()['Class'].sort_values(ascending=False)
Class     1.000000
V4        0.722609
V11       0.694975
V2        0.481190
V19       0.245209
V20       0.151424
V21       0.126220
Amount    0.100853
V26       0.090654
V27       0.074491
V8        0.059647
V28       0.052788
V25       0.040578
V22       0.016379
V23      -0.003117
V15      -0.009488
V13      -0.055253
V24      -0.070806
Hour     -0.196789
V5       -0.383632
V6       -0.407577
V1       -0.423665
V18      -0.462499
V7       -0.468273
V17      -0.561113
V3       -0.561767
V9       -0.562542
V16      -0.592382
V10      -0.629362
V12      -0.691652
V14      -0.751142
Name: Class, dtype: float64

The comparison shows that after resampling the correlation between the numeric attributes and the class label is clearly stronger. By rank, the features most positively correlated with Class are V4, V11 and V2, and the most negatively correlated are V14, V12, V10, V3, V7 and V9. We plot these features against the class below.

# Violin plots of the positively correlated attributes vs Class
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(24, 6))
sns.violinplot(x='Class', y='V4', data=data_rus, palette=['lightsalmon', 'lightskyblue'], ax=ax1)
ax1.set_title('V4 vs Class Positive Correlation')
sns.violinplot(x='Class', y='V11', data=data_rus, palette=['lightsalmon', 'lightskyblue'], ax=ax2)
ax2.set_title('V11 vs Class Positive Correlation')
sns.violinplot(x='Class', y='V2', data=data_rus, palette=['lightsalmon', 'lightskyblue'], ax=ax3)
ax3.set_title('V2 vs Class Positive Correlation')
Text(0.5, 1.0, 'V2 vs Class Positive Correlation')

[Figure: violin plots of V4, V11 and V2 against Class]

# Violin plots of the negatively correlated attributes vs Class
fig, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3, figsize=(24, 12))
sns.violinplot(x='Class', y='V14', data=data_rus, palette=['lightsalmon', 'lightskyblue'], ax=ax1)
ax1.set_title('V14 vs Class Negative Correlation')
sns.violinplot(x='Class', y='V10', data=data_rus, palette=['lightsalmon', 'lightskyblue'], ax=ax2)
ax2.set_title('V10 vs Class Negative Correlation')
sns.violinplot(x='Class', y='V12', data=data_rus, palette=['lightsalmon', 'lightskyblue'], ax=ax3)
ax3.set_title('V12 vs Class Negative Correlation')
sns.violinplot(x='Class', y='V3', data=data_rus, palette=['lightsalmon', 'lightskyblue'], ax=ax4)
ax4.set_title('V3 vs Class Negative Correlation')
sns.violinplot(x='Class', y='V7', data=data_rus, palette=['lightsalmon', 'lightskyblue'], ax=ax5)
ax5.set_title('V7 vs Class Negative Correlation')
sns.violinplot(x='Class', y='V9', data=data_rus, palette=['lightsalmon', 'lightskyblue'], ax=ax6)
ax6.set_title('V9 vs Class Negative Correlation')
Text(0.5, 1.0, 'V9 vs Class Negative Correlation')

[Figure: violin plots of V14, V10, V12, V3, V7 and V9 against Class]

# Violin plots of the remaining attributes vs Class
other_fea = list(data_rus.columns.drop(['V11', 'V4', 'V2', 'V17', 'V14', 'V12', 'V10', 'V7', 'V3', 'V9', 'Class']))
fig, ax = plt.subplots(5, 4, figsize=(24, 36))
for fea in other_fea:
    row, col = divmod(other_fea.index(fea), 4)
    sns.violinplot(x='Class', y=fea, data=data_rus, palette=['lightsalmon', 'lightskyblue'], ax=ax[row, col])
    ax[row, col].set_title(fea)

[Figure: violin plots of the remaining attributes against Class]

The violin plots show that the positive and negative classes do differ in how these attributes are distributed, while for the remaining attributes the difference is comparatively small.

Out of curiosity about how the amount and the time relate to Class: normal transactions involve smaller, more tightly clustered amounts, while fraudulent amounts are larger and more spread out; and the time span of normal transactions is narrower than that of fraudulent ones, so fraud is more likely to happen while people are asleep.
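
A quick way to eyeball this on the undersampled data (a small sketch in the same style as the plots above; it was not part of the original output):

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
sns.violinplot(x='Class', y='Amount', data=data_rus, palette=['lightsalmon', 'lightskyblue'], ax=ax1)
ax1.set_title('Amount vs Class')
sns.violinplot(x='Class', y='Hour', data=data_rus, palette=['lightsalmon', 'lightskyblue'], ax=ax2)
ax2.set_title('Hour vs Class')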

Having looked at how the features relate to the class, we now turn to the relationships among the features themselves. The heatmap suggests the attributes are correlated with each other; below we check whether there is multicollinearity.

sns.set()
sns.pairplot(data[list(data_rus.columns)],kind='scatter',diag_kind = 'kde')
plt.show()

[Figure: pair plot of the selected features]

The pair plot shows that these attributes are not completely independent of one another; some pairs have a strong linear relationship. We check this further with the variance inflation factor (VIF).

from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = [variance_inflation_factor(data_rus.values, data_rus.columns.get_loc(i)) for i in data_rus.columns]
vif
[12.662373162589994,20.3132501576979,26.78027354608725,9.970255022795625,23.531563683157597,3.4386660732204946,67.84989394913013,5.76519495696649,7.129002458395831,23.226754020950764,11.753104213590975,29.49673779700361,1.3365891898690718,21.57973674600878,1.2669840488461022,27.61485162786757,31.081940593780782,14.088642210869459,2.502857511412321,4.96077803555917,5.169599871511768,3.1235143157354583,2.828445638986856,1.1937601054384332,1.628451339236206,1.1966413137632343,1.959903999050125,1.4573293665681395,6.314999796714301,2.0990707198901117,4.802392100187543]

A common rule of thumb: when VIF < 10 a variable has no multicollinearity with the others, when 10 < VIF < 100 the multicollinearity is fairly strong, and when VIF > 100 it is severe. By these values the variables clearly are multicollinear, i.e. there is redundant information, so we need feature extraction or feature selection.
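
To make the list above easier to read, each VIF value can be labelled with its column name (a small sketch reusing the vif list and data_rus from above):

vif_series = pd.Series(vif, index=data_rus.columns).sort_values(ascending=False)
print(vif_series)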

Feature extraction and feature selection are the two ways of reducing dimensionality: feature extraction produces new features that are mappings of the original ones, whereas feature selection picks a subset of the original features. Principal component analysis (PCA) and linear discriminant analysis (LDA) are two classic feature-extraction methods.
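
For reference, a minimal PCA sketch (shown only for illustration; it is not used later, since the V features are themselves already the result of a PCA, and the 95% threshold is arbitrary):

from sklearn.decomposition import PCA

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95, svd_solver='full')
X_pca = pca.fit_transform(X_rus)
print(X_pca.shape, pca.explained_variance_ratio_.sum())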

Feature selection: once the data have been processed, we need to choose meaningful features to feed into the learning algorithm. Two things mainly guide the choice:

  • Whether a feature varies at all. If a feature's variance is close to 0, the samples barely differ on it, so it is of little use for telling them apart.

  • The correlation between a feature and the target. Features strongly correlated with the target should be preferred.

    By how the selection is carried out, feature-selection methods fall into three groups (a sketch of each family follows this list):
    1. Filter methods: score each feature by its variance or by its correlation with the target and keep the features above a threshold, or the top k (variance threshold, correlation coefficient, chi-squared test, mutual information).
    2. Wrapper methods: add or remove a few features at a time according to an objective function, usually the predictive score (recursive feature elimination).
    3. Embedded methods: first train a model, obtain a weight for each feature, and then select features by those weights; similar in spirit to filter methods, except that the ranking comes from training (penalty-based selection, tree-based feature importance).
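
As a rough illustration of the three families (a sketch with scikit-learn on the undersampled X_rus / y_rus; the choice of k = 10 features is arbitrary):

from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Filter: keep the 10 features with the highest ANOVA F-score against the target
filt = SelectKBest(score_func=f_classif, k=10).fit(X_rus, y_rus)
print(X_rus.columns[filt.get_support()])

# Wrapper: recursive feature elimination driven by a logistic regression
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X_rus, y_rus)
print(X_rus.columns[rfe.get_support()])

# Embedded: rank features by random forest importance
forest = RandomForestClassifier(random_state=1).fit(X_rus, y_rus)
print(pd.Series(forest.feature_importances_, index=X_rus.columns).nlargest(10))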

Ridge regression and the Lasso are two ways of doing feature selection for linear models. Both add a regularisation term to guard against overfitting: ridge uses an L2 penalty, the Lasso an L1 penalty. The Lasso can drive the coefficients of weakly useful features to exactly 0, giving a sparse solution, which amounts to dimensionality reduction during training.
For tree models, a random forest classifier can rank the features by importance, which also serves as a filter.
Here we first apply these two feature-selection methods separately and compare the features they pick.

# Feature selection with the Lasso
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score
# Run LassoCV with cross-validation
model_lasso = LassoCV(alphas=[0.1,0.01,0.005,1],random_state=1,cv=5).fit(X_rus,y_rus)
# Print the features the model ends up keeping (non-zero coefficients)
coef = pd.Series(model_lasso.coef_,index=X_rus.columns)
print(coef[coef != 0].abs().sort_values(ascending = False))
V4        0.062065
V14       0.045851
Amount    0.040011
V26       0.038201
V13       0.031702
V7        0.028889
V22       0.028509
V18       0.028171
V6        0.019226
V1        0.018757
V21       0.016032
V10       0.014742
V28       0.012483
V8        0.011273
V20       0.010726
V9        0.010358
V24       0.010227
V17       0.007217
V2        0.006838
Hour      0.004757
V15       0.003393
V27       0.002588
V19       0.000275
dtype: float64
# Rank feature importance with a random forest
from sklearn.ensemble import RandomForestClassifier
rfc_fea_model = RandomForestClassifier(random_state=1)
rfc_fea_model.fit(X_rus,y_rus)
fea = X_rus.columns
importance = rfc_fea_model.feature_importances_
a = pd.DataFrame()
a['feature'] = fea
a['importance'] = importance
a = a.sort_values('importance',ascending = False)
plt.figure(figsize=(20,10))
plt.bar(a['feature'],a['importance'])
plt.title('the importance orders sorted by random forest')
plt.show()

[Figure: feature importances ranked by the random forest]

a.cumsum()
featureimportance
11V120.134515
9V12V100.268117
16V12V10V170.388052
13V12V10V17V140.503750
3V12V10V17V14V40.615973
10V12V10V17V14V4V110.687904
1V12V10V17V14V4V11V20.731179
15V12V10V17V14V4V11V2V160.774247
2V12V10V17V14V4V11V2V16V30.813352
6V12V10V17V14V4V11V2V16V3V70.846050
18V12V10V17V14V4V11V2V16V3V7V190.860559
17V12V10V17V14V4V11V2V16V3V7V19V180.873657
20V12V10V17V14V4V11V2V16V3V7V19V18V210.886416
28V12V10V17V14V4V11V2V16V3V7V19V18V21Amount0.896986
26V12V10V17V14V4V11V2V16V3V7V19V18V21AmountV270.906369
19V12V10V17V14V4V11V2V16V3V7V19V18V21AmountV27V200.915172
14V12V10V17V14V4V11V2V16V3V7V19V18V21AmountV27V20V150.923700
22V12V10V17V14V4V11V2V16V3V7V19V18V21AmountV27V20V15V230.931990
12V12V10V17V14V4V11V2V16V3V7V19V18V21AmountV27V20V15V23V130.939298
7V12V10V17V14V4V11V2V16V3V7V19V18V21AmountV27V20V15V23V13V80.946163
5V12V10V17V14V4V11V2V16V3V7V19V18V21AmountV27V20V15V23V13V8V60.952952
25V12V10V17V14V4V11V2V16V3V7V19V18V21AmountV27V20V15V23V13V8V6V260.959418
8V12V10V17V14V4V11V2V16V3V7V19V18V21AmountV27V20V15V23V13V8V6V26V90.965869
0V12V10V17V14V4V11V2V16V3V7V19V18V21AmountV27V20V15V23V13V8V6V26V9V10.971975
4V12V10V17V14V4V11V2V16V3V7V19V18V21AmountV27V20V15V23V13V8V6V26V9V1V50.977297
21V12V10V17V14V4V11V2V16V3V7V19V18V21AmountV27V20V15V23V13V8V6V26V9V1V5V220.982325
27V12V10V17V14V4V11V2V16V3V7V19V18V21AmountV27V20V15V23V13V8V6V26V9V1V5V22V280.986990
23V12V10V17V14V4V11V2V16V3V7V19V18V21AmountV27V20V15V23V13V8V6V26V9V1V5V22V28V240.991551
29V12V10V17V14V4V11V2V16V3V7V19V18V21AmountV27V20V15V23V13V8V6V26V9V1V5V22V28V24Hour0.995913
24V12V10V17V14V4V11V2V16V3V7V19V18V21AmountV27V20V15V23V13V8V6V26V9V1V5V22V28V24HourV251.000000

The two methods rank the features quite differently. To preserve as much information as possible we combine them and use ['V14','V12','V10','V17','V4','V11','V3','V2','V7','V16','V18','Amount','V19','V20','V23','V21','V15','V9','V6','V27','V25','V5','V13','V22','Hour','V28','V1','V8','V26']. The criterion is the set of features whose cumulative random-forest importance exceeds 95%, to which any feature picked by the Lasso but missing from that set is added; this yields 29 features (only V24 is left out).
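
The 95% cumulative-importance cut can also be read off programmatically (a small sketch using the a DataFrame built above; the final list in the text additionally folds in the Lasso ranking):

# Features sorted by importance, with their cumulative share
rf_cum = a.set_index('feature')['importance'].cumsum()
print(rf_cum[rf_cum <= 0.95])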

# Train and evaluate each model on the selected features
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression  # logistic regression
from sklearn.neighbors import KNeighborsClassifier  # KNN
from sklearn.naive_bayes import GaussianNB  # naive Bayes
from sklearn.svm import SVC  # support vector classifier
from sklearn.tree import DecisionTreeClassifier  # decision tree
from sklearn.ensemble import RandomForestClassifier  # random forest
from sklearn.ensemble import AdaBoostClassifier  # AdaBoost
from sklearn.ensemble import GradientBoostingClassifier  # GBDT
from xgboost import XGBClassifier  # XGBoost
from lightgbm import LGBMClassifier  # LightGBM
from sklearn.metrics import roc_curve  # for the ROC curves
Classifiers = {'LG': LogisticRegression(random_state=1),
               'KNN': KNeighborsClassifier(),
               'Bayes': GaussianNB(),
               'SVC': SVC(random_state=1, probability=True),
               'DecisionTree': DecisionTreeClassifier(random_state=1),
               'RandomForest': RandomForestClassifier(random_state=1),
               'Adaboost': AdaBoostClassifier(random_state=1),
               'GBDT': GradientBoostingClassifier(random_state=1),
               'XGboost': XGBClassifier(random_state=1),
               'LightGBM': LGBMClassifier(random_state=1)}
def train_test(Classifiers, X_train, y_train, X_test, y_test):
    y_pred = pd.DataFrame()
    Accuracy_Score = pd.DataFrame()
    for model_name, model in Classifiers.items():
        model.fit(X_train, y_train)
        y_pred[model_name] = model.predict(X_test)
        y_pred_pra = model.predict_proba(X_test)
        Accuracy_Score[model_name] = pd.Series(model.score(X_test, y_test))
        # Print precision/recall for each class
        print(model_name, '\n', classification_report(y_test, y_pred[model_name]))
        # Plot the confusion matrix
        fig, ax = plt.subplots(1, 1)
        plot_confusion_matrix(model, X_test, y_test, labels=[0, 1], cmap='Blues', ax=ax)
        ax.set_title(model_name)
        # Plot the ROC curve
        plt.figure()
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
        fpr, tpr, thres = roc_curve(y_test, y_pred_pra[:, -1])
        ax1.plot(fpr, tpr)
        ax1.set_title(model_name + ' ROC')
        ax1.set_xlabel('fpr')
        ax1.set_ylabel('tpr')
        # Plot the KS curve
        ax2.plot(thres[1:], tpr[1:])
        ax2.plot(thres[1:], fpr[1:])
        ax2.plot(thres[1:], tpr[1:] - fpr[1:])
        ax2.set_xlabel('threshold')
        ax2.legend(['tpr', 'fpr', 'tpr-fpr'])
        plt.sca(ax2)
        plt.gca().invert_xaxis()
        ax2.set_title(model_name + ' KS')
    return y_pred, Accuracy_Score
# test_cols = ['V12', 'V14', 'V10', 'V17', 'V11', 'V4', 'V2', 'V16', 'V7', 'V3',
#                     'V18', 'Amount', 'V19', 'V21', 'V20', 'V8', 'V15', 'V6', 'V27', 'V26', 'V1','V9','V13','V22','Hour','V23','V28']
test_cols = X_rus.columns.drop('V24')
Y_pred,Accuracy_score = train_test(Classifiers, X_rus[test_cols], y_rus, X_test[test_cols], y_test)
Accuracy_score
LGprecision    recall  f1-score   support0       1.00      0.96      0.98     568631       0.04      0.94      0.07        98accuracy                           0.96     56961macro avg       0.52      0.95      0.53     56961
weighted avg       1.00      0.96      0.98     56961KNNprecision    recall  f1-score   support0       1.00      0.97      0.99     568631       0.06      0.93      0.11        98accuracy                           0.97     56961macro avg       0.53      0.95      0.55     56961
weighted avg       1.00      0.97      0.99     56961Bayesprecision    recall  f1-score   support0       1.00      0.96      0.98     568631       0.04      0.88      0.08        98accuracy                           0.96     56961macro avg       0.52      0.92      0.53     56961
weighted avg       1.00      0.96      0.98     56961SVCprecision    recall  f1-score   support0       1.00      0.98      0.99     568631       0.08      0.91      0.14        98accuracy                           0.98     56961macro avg       0.54      0.94      0.57     56961
weighted avg       1.00      0.98      0.99     56961DecisionTreeprecision    recall  f1-score   support0       1.00      0.88      0.93     568631       0.01      0.95      0.03        98accuracy                           0.88     56961macro avg       0.51      0.91      0.48     56961
weighted avg       1.00      0.88      0.93     56961RandomForestprecision    recall  f1-score   support0       1.00      0.97      0.98     568631       0.05      0.93      0.09        98accuracy                           0.97     56961macro avg       0.52      0.95      0.54     56961
weighted avg       1.00      0.97      0.98     56961Adaboostprecision    recall  f1-score   support0       1.00      0.95      0.98     568631       0.03      0.95      0.07        98accuracy                           0.95     56961macro avg       0.52      0.95      0.52     56961
weighted avg       1.00      0.95      0.97     56961GBDTprecision    recall  f1-score   support0       1.00      0.96      0.98     568631       0.04      0.92      0.07        98accuracy                           0.96     56961macro avg       0.52      0.94      0.53     56961
weighted avg       1.00      0.96      0.98     56961[15:05:57] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGboostprecision    recall  f1-score   support0       1.00      0.97      0.98     568631       0.04      0.93      0.09        98accuracy                           0.97     56961macro avg       0.52      0.95      0.53     56961
weighted avg       1.00      0.97      0.98     56961LightGBMprecision    recall  f1-score   support0       1.00      0.97      0.98     568631       0.05      0.94      0.10        98accuracy                           0.97     56961macro avg       0.53      0.95      0.54     56961
weighted avg       1.00      0.97      0.98     56961
         LG       KNN     Bayes       SVC  DecisionTree  RandomForest  Adaboost      GBDT   XGboost  LightGBM
0  0.959446  0.973561  0.962887  0.981479      0.877899      0.967451  0.953407  0.960482  0.965678  0.969611

[Figures: confusion matrix, ROC curve and KS curve for each of the ten models]

# Ensemble learning
# Based on the accuracy, AUC and recall results above, combine several models in a voting classifier (here KNN, SVC and LightGBM)
from sklearn.ensemble import VotingClassifier
voting_clf = VotingClassifier(estimators=[
    #     ('LG', LogisticRegression(random_state=1)),
    ('KNN', KNeighborsClassifier()),
    #     ('Bayes', GaussianNB()),
    ('SVC', SVC(random_state=1, probability=True)),
    #     ('DecisionTree', DecisionTreeClassifier(random_state=1)),
    #     ('RandomForest', RandomForestClassifier(random_state=1)),
    #     ('Adaboost', AdaBoostClassifier(random_state=1)),
    #     ('GBDT', GradientBoostingClassifier(random_state=1)),
    #     ('XGboost', XGBClassifier(random_state=1)),
    ('LightGBM', LGBMClassifier(random_state=1))])
voting_clf.fit(X_rus[test_cols], y_rus)
y_final_pred = voting_clf.predict(X_test[test_cols])
print(classification_report(y_test, y_final_pred))
fig, ax = plt.subplots(1, 1)
plot_confusion_matrix(voting_clf, X_test[test_cols], y_test, labels=[0, 1], cmap='Blues', ax=ax)
              precision    recall  f1-score   support

           0       1.00      0.98      0.99     56863
           1       0.08      0.94      0.14        98

    accuracy                           0.98     56961
   macro avg       0.54      0.96      0.57     56961
weighted avg       1.00      0.98      0.99     56961
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x213686f0700>

[Figure: confusion matrix of the voting classifier]

All of the models reach a high accuracy; the support vector classifier gets to 98.1479%. For imbalanced data, however, recall is what matters more: the SVC recognises 91% of the fraudulent transactions and about 98% of the normal ones, misclassifying 1046 normal and 9 fraudulent samples respectively.
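
The exact error counts can be read off a numeric confusion matrix rather than the plots (a small sketch using the Y_pred frame returned by train_test above):

# Rows are the true class, columns the predicted class
cm = confusion_matrix(y_test, Y_pred['SVC'], labels=[0, 1])
print(cm)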

Before tuning, we also tried an ensemble built from the three most accurate models with default parameters, KNN, SVC and LightGBM. The ensemble also reaches 98% accuracy, misclassifying 6 fraudulent and 1114 normal samples: fraud detection improves somewhat, but more normal transactions get flagged as fraud.

Since the models so far all use their default parameters, we consider using grid search to find better ones; of course grid search is not guaranteed to find the best parameters either.

With an initial model in place, we now tune the hyperparameters. There are three common approaches: random search, grid search and Bayesian optimisation. Random search and grid search become extremely time-consuming when there are many hyperparameters, because they simply try combinations of candidate values, whereas Bayesian optimisation uses the information from earlier trials to keep updating a prior over the parameter space, so it needs fewer iterations and runs faster.
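
Grid search is what is used below; for reference, a Bayesian-optimisation version might look like this sketch (it assumes the scikit-optimize package, which is not used elsewhere in this post, and the parameter ranges are illustrative):

from skopt import BayesSearchCV
from skopt.space import Integer

# Bayesian search over a random forest's hyperparameters
opt = BayesSearchCV(
    RandomForestClassifier(random_state=1),
    {'n_estimators': Integer(10, 200),
     'min_samples_leaf': Integer(1, 10),
     'max_depth': Integer(3, 20)},
    n_iter=30, cv=5, random_state=1)
opt.fit(X_rus[test_cols], y_rus)
print(opt.best_estimator_)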

# Grid search for the best parameters
from sklearn.model_selection import GridSearchCV

def reg_best(X_train, y_train):
    log_reg_parames = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.05, 0.1, 1, 10]}
    grid_log_reg = GridSearchCV(LogisticRegression(), log_reg_parames)
    grid_log_reg.fit(X_train, y_train)
    log_reg_best = grid_log_reg.best_estimator_
    print(log_reg_best)
    return log_reg_best

def KNN_best(X_train, y_train):
    KNN_parames = {'n_neighbors': [3, 5, 7, 9, 11, 15], 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}
    grid_KNN = GridSearchCV(KNeighborsClassifier(), KNN_parames)
    grid_KNN.fit(X_train, y_train)
    KNN_best_ = grid_KNN.best_estimator_
    print(KNN_best_)
    return KNN_best_

def SVC_best(X_train, y_train):
    SVC_parames = {'C': [0.5, 0.7, 0.9, 1], 'kernel': ['rbf', 'poly', 'sigmoid', 'linear'], 'probability': [True]}
    grid_SVC = GridSearchCV(SVC(), SVC_parames)
    grid_SVC.fit(X_train, y_train)
    SVC_best = grid_SVC.best_estimator_
    print(SVC_best)
    return SVC_best

def DecisionTree_best(X_train, y_train):
    DT_parames = {"criterion": ["gini", "entropy"], "max_depth": list(range(2, 4, 1)), "min_samples_leaf": list(range(5, 7, 1))}
    grid_DT = GridSearchCV(DecisionTreeClassifier(), DT_parames)
    grid_DT.fit(X_train, y_train)
    DT_best = grid_DT.best_estimator_
    print(DT_best)
    return DT_best

def RandomForest_best(X_train, y_train):
    RF_params = {'n_estimators': [10, 50, 100, 150, 200], 'criterion': ['gini', 'entropy'], "min_samples_leaf": list(range(5, 7, 1))}
    grid_RF = GridSearchCV(RandomForestClassifier(), RF_params)
    grid_RF.fit(X_train, y_train)
    RT_best = grid_RF.best_estimator_
    print(RT_best)
    return RT_best

def Adaboost_best(X_train, y_train):
    Adaboost_params = {'n_estimators': [10, 50, 100, 150, 200], 'learning_rate': [0.01, 0.05, 0.1, 0.5, 1], 'algorithm': ['SAMME', 'SAMME.R']}
    grid_Adaboost = GridSearchCV(AdaBoostClassifier(), Adaboost_params)
    grid_Adaboost.fit(X_train, y_train)
    Adaboost_best_ = grid_Adaboost.best_estimator_
    print(Adaboost_best_)
    return Adaboost_best_

def GBDT_best(X_train, y_train):
    GBDT_params = {'n_estimators': [10, 50, 100, 150], 'loss': ['deviance', 'exponential'], 'learning_rate': [0.01, 0.05, 0.1], 'criterion': ['friedman_mse', 'mse']}
    grid_GBDT = GridSearchCV(GradientBoostingClassifier(), GBDT_params)
    grid_GBDT.fit(X_train, y_train)
    GBDT_best_ = grid_GBDT.best_estimator_
    print(GBDT_best_)
    return GBDT_best_

def XGboost_best(X_train, y_train):
    XGB_params = {'n_estimators': [10, 50, 100, 150, 200], 'max_depth': [5, 10, 15, 20], 'learning_rate': [0.01, 0.05, 0.1, 0.5, 1]}
    grid_XGB = GridSearchCV(XGBClassifier(), XGB_params)
    grid_XGB.fit(X_train, y_train)
    XGB_best_ = grid_XGB.best_estimator_
    print(XGB_best_)
    return XGB_best_

def LGBM_best(X_train, y_train):
    LGBM_params = {'boosting_type': ['gbdt', 'dart', 'goss', 'rf'], 'num_leaves': [21, 31, 51], 'n_estimators': [10, 50, 100, 150, 200], 'max_depth': [5, 10, 15, 20], 'learning_rate': [0.01, 0.05, 0.1, 0.5, 1]}
    grid_LGBM = GridSearchCV(LGBMClassifier(), LGBM_params)
    grid_LGBM.fit(X_train, y_train)
    LGBM_best_ = grid_LGBM.best_estimator_
    print(LGBM_best_)
    return LGBM_best_

Classifiers = {'LG': reg_best(X_rus[test_cols], y_rus),
               'KNN': KNN_best(X_rus[test_cols], y_rus),
               'Bayes': GaussianNB(),
               'SVC': SVC_best(X_rus[test_cols], y_rus),
               'DecisionTree': DecisionTree_best(X_rus[test_cols], y_rus),
               'RandomForest': RandomForest_best(X_rus[test_cols], y_rus),
               'Adaboost': Adaboost_best(X_rus[test_cols], y_rus),
               'GBDT': GBDT_best(X_rus[test_cols], y_rus),
               'XGboost': XGboost_best(X_rus[test_cols], y_rus),
               'LightGBM': LGBM_best(X_rus[test_cols], y_rus)}
LogisticRegression(C=0.05)
KNeighborsClassifier(n_neighbors=3)
SVC(C=0.7, probability=True)
DecisionTreeClassifier(criterion='entropy', max_depth=2, min_samples_leaf=5)
RandomForestClassifier(min_samples_leaf=5)
AdaBoostClassifier(algorithm='SAMME', learning_rate=0.5, n_estimators=100)
GradientBoostingClassifier(criterion='mse', loss='exponential')
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.5, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=10, n_jobs=8, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
LGBMClassifier(boosting_type='dart', learning_rate=1, max_depth=5,
               n_estimators=150, num_leaves=21)
# Retrain and evaluate with the tuned parameters
Y_pred,Accuracy_score = train_test(Classifiers, X_rus[test_cols], y_rus, X_test[test_cols], y_test)
Accuracy_score
LG
              precision    recall  f1-score   support
           0       1.00      0.97      0.99     56863
           1       0.05      0.92      0.10        98
    accuracy                           0.97     56961
   macro avg       0.53      0.95      0.54     56961
weighted avg       1.00      0.97      0.98     56961
KNN
              precision    recall  f1-score   support
           0       1.00      0.96      0.98     56863
           1       0.04      0.93      0.08        98
    accuracy                           0.96     56961
   macro avg       0.52      0.95      0.53     56961
weighted avg       1.00      0.96      0.98     56961
Bayes
              precision    recall  f1-score   support
           0       1.00      0.96      0.98     56863
           1       0.04      0.88      0.08        98
    accuracy                           0.96     56961
   macro avg       0.52      0.92      0.53     56961
weighted avg       1.00      0.96      0.98     56961
SVC
              precision    recall  f1-score   support
           0       1.00      0.98      0.99     56863
           1       0.09      0.91      0.16        98
    accuracy                           0.98     56961
   macro avg       0.54      0.95      0.57     56961
weighted avg       1.00      0.98      0.99     56961
DecisionTree
              precision    recall  f1-score   support
           0       1.00      0.90      0.95     56863
           1       0.02      0.95      0.03        98
    accuracy                           0.90     56961
   macro avg       0.51      0.92      0.49     56961
weighted avg       1.00      0.90      0.94     56961
RandomForest
              precision    recall  f1-score   support
           0       1.00      0.97      0.99     56863
           1       0.06      0.93      0.11        98
    accuracy                           0.97     56961
   macro avg       0.53      0.95      0.55     56961
weighted avg       1.00      0.97      0.99     56961
Adaboost
              precision    recall  f1-score   support
           0       1.00      0.96      0.98     56863
           1       0.04      0.94      0.08        98
    accuracy                           0.96     56961
   macro avg       0.52      0.95      0.53     56961
weighted avg       1.00      0.96      0.98     56961
GBDT
              precision    recall  f1-score   support
           0       1.00      0.97      0.98     56863
           1       0.05      0.93      0.09        98
    accuracy                           0.97     56961
   macro avg       0.52      0.95      0.53     56961
weighted avg       1.00      0.97      0.98     56961
[15:36:45] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGboost
              precision    recall  f1-score   support
           0       1.00      0.96      0.98     56863
           1       0.04      0.93      0.07        98
    accuracy                           0.96     56961
   macro avg       0.52      0.94      0.53     56961
weighted avg       1.00      0.96      0.98     56961
LightGBM
              precision    recall  f1-score   support
           0       1.00      0.97      0.98     56863
           1       0.05      0.93      0.10        98
    accuracy                           0.97     56961
   macro avg       0.53      0.95      0.54     56961
weighted avg       1.00      0.97      0.98     56961
         LG       KNN     Bayes       SVC  DecisionTree  RandomForest  Adaboost      GBDT   XGboost  LightGBM
0  0.972508  0.963747  0.962887  0.983129      0.897087      0.974474  0.962185  0.966152  0.960043  0.969769

(10 figures omitted)
# Ensemble learning
# Based on the AUC and recall results above, use LG, SVC and RandomForest as base models
from sklearn.ensemble import VotingClassifier
voting_clf = VotingClassifier(estimators=[
    ('LG', LogisticRegression(random_state=1, C=0.05)),
    ('SVC', SVC(random_state=1, probability=True, C=0.7)),
    ('RandomForest', RandomForestClassifier(random_state=1, min_samples_leaf=5)),
])
voting_clf.fit(X_rus[test_cols], y_rus)
y_final_pred=voting_clf.predict(X_test[test_cols])
print(classification_report(y_test, y_final_pred))
fig, ax=plt.subplots(1, 1)
plot_confusion_matrix(voting_clf, X_test[test_cols], y_test, labels=[0, 1], cmap='Blues', ax=ax)
              precision    recall  f1-score   support
           0       1.00      0.98      0.99     56863
           1       0.07      0.92      0.14        98
    accuracy                           0.98     56961
   macro avg       0.54      0.95      0.56     56961
weighted avg       1.00      0.98      0.99     56961
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x2135666d070>

(figure: confusion matrix of the voting classifier)
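The voting classifier above uses the default hard (majority) voting. Since every base model here can produce class probabilities (SVC has probability=True), a soft-voting variant that averages predicted probabilities is a natural thing to try; the following is only a sketch, not a result from the original run:

# Sketch: soft voting averages the predicted probabilities of the base models
soft_voting_clf = VotingClassifier(
    estimators=[('LG', LogisticRegression(random_state=1, C=0.05)),
                ('SVC', SVC(random_state=1, probability=True, C=0.7)),
                ('RandomForest', RandomForestClassifier(random_state=1, min_samples_leaf=5))],
    voting='soft')
soft_voting_clf.fit(X_rus[test_cols], y_rus)
print(classification_report(y_test, soft_voting_clf.predict(X_test[test_cols])))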

Judging by accuracy, most models improve after tuning, while a few actually get worse. Recall shows no clear gain, and for some models it drops noticeably, which means fraud samples are still not being identified reliably. Since the undersampled training set is far smaller than the test set, the mediocre test performance is not surprising.

Combining SMOTE oversampling with random undersampling

As before, we first examine the resampled training data and then train the models.
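For reference, X_smote / y_smote are assumed to be produced by chaining SMOTE oversampling with random undersampling on the training split; a typical way to build such a training set with imbalanced-learn is sketched below (the sampling ratios are illustrative, not necessarily the ones used in this project):

# Sketch of the resampling step, assuming the imbalanced-learn package is available
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline

resampler = ImbPipeline(steps=[
    ('smote', SMOTE(sampling_strategy=0.5, random_state=1)),              # oversample fraud up to 50% of the majority
    ('under', RandomUnderSampler(sampling_strategy=0.8, random_state=1)), # then shrink the majority class
])
X_smote, y_smote = resampler.fit_resample(X_train, y_train)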

data_smote = pd.concat([X_smote,y_smote],axis=1)
# Distribution of each individual numeric feature
f = pd.melt(data_smote, value_vars=X_train.columns)
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False)
g = g.map(sns.distplot, 'value')

(figure: distribution plot of each numeric feature)
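Note that sns.distplot is deprecated in recent seaborn releases; if these plots need to be reproduced on seaborn 0.11 or later, a roughly equivalent sketch uses histplot instead:

# Equivalent distribution plot on newer seaborn, where distplot is deprecated
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False)
g = g.map(sns.histplot, 'value', kde=True)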

# Box plot of each individual numeric feature
f = pd.melt(data_smote, value_vars=X_train.columns)
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False, height=5)  # height= replaces the deprecated size= argument
g = g.map(sns.boxplot, 'value', color='lightskyblue')

(figure: box plot of each numeric feature)

# Violin plot of each individual numeric feature
f = pd.melt(data_smote, value_vars=X_train.columns)
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False, height=5)
g = g.map(sns.violinplot, 'value', color='lightskyblue')

(figure: violin plot of each numeric feature)

numerical_columns = data_smote.columns.drop('Class')
for col_name in numerical_columns:
    data_smote = outlier_process(data_smote, col_name)
    print(data_smote.shape)
(214362, 31)
(213002, 31)
(212497, 31)
(212497, 31)
(204320, 31)
(202425, 31)
(198165, 31)
(189719, 31)
(189492, 31)
(189427, 31)
(189020, 31)
(189019, 31)
(189019, 31)
(189019, 31)
(189016, 31)
(188944, 31)
(184836, 31)
(184556, 31)
(184552, 31)
(180426, 31)
(178559, 31)
(178559, 31)
(174526, 31)
(174457, 31)
(174354, 31)
(174098, 31)
(169488, 31)
(166755, 31)
(156423, 31)
(156423, 31)

(29 figures omitted)
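outlier_process is defined earlier in the notebook; for readers jumping in here, an IQR-style filter of the kind it presumably implements is sketched below (the 1.5×IQR rule is the usual convention, not necessarily the exact thresholds used above):

# Hypothetical IQR filter; the real outlier_process may use different bounds
def iqr_filter(df, col, k=1.5):
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[col] >= lower) & (df[col] <= upper)]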

# Correlation heatmaps before and after resampling
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 10))
sns.heatmap(data.corr(), cmap='coolwarm_r', ax=ax1, vmax=0.8)
ax1.set_title('the relationship on imbalanced samples')
sns.heatmap(data_smote.corr(), cmap='coolwarm_r', ax=ax2, vmax=0.8)
ax2.set_title('the relationship on SMOTE + random under samples')
Text(0.5, 1.0, 'the relationship on SMOTE + random under samples')

(figure: correlation heatmaps before and after resampling)

# Correlation between each numeric feature and Class
data_smote.corr()['Class'].sort_values(ascending=False)
Class     1.000000
V11       0.713013
V4        0.705057
V2        0.641477
V21       0.466069
V27       0.459954
V28       0.383395
V20       0.381292
V8        0.253905
V25       0.147601
V26       0.052486
V19       0.008546
Amount   -0.001365
V15      -0.023988
V22      -0.028250
Hour     -0.048206
V5       -0.098754
V13      -0.142258
V24      -0.154369
V23      -0.157138
V18      -0.170994
V1       -0.295297
V17      -0.445826
V6       -0.465282
V16      -0.500016
V7       -0.547495
V9       -0.567131
V3       -0.658294
V12      -0.713942
V10      -0.748285
V14      -0.790148
Name: Class, dtype: float64

According to this ranking, the features most positively correlated with Class are V4, V11 and V2, and the most negatively correlated are V14, V10, V12, V3, V7 and V9. We plot these features against the target below.

# Distribution of the positively correlated features vs Class
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(24, 6))
sns.violinplot(x='Class', y='V4', data=data_smote, palette=['lightsalmon', 'lightskyblue'], ax=ax1)
ax1.set_title('V4 vs Class Positive Correlation')
sns.violinplot(x='Class', y='V11', data=data_smote, palette=['lightsalmon', 'lightskyblue'], ax=ax2)
ax2.set_title('V11 vs Class Positive Correlation')
sns.violinplot(x='Class', y='V2', data=data_smote, palette=['lightsalmon', 'lightskyblue'], ax=ax3)
ax3.set_title('V2 vs Class Positive Correlation')
Text(0.5, 1.0, 'V2 vs Class Positive Correlation')

(figure: violin plots of V4, V11 and V2 by Class)

# Distribution of the negatively correlated features vs Class
fig, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3, figsize=(24, 14))
sns.violinplot(x='Class', y='V14', data=data_smote, palette=['lightsalmon', 'lightskyblue'], ax=ax1)
ax1.set_title('V14 vs Class negative Correlation')
sns.violinplot(x='Class', y='V10', data=data_smote, palette=['lightsalmon', 'lightskyblue'], ax=ax2)
ax2.set_title('V10 vs Class negative Correlation')
sns.violinplot(x='Class', y='V12', data=data_smote, palette=['lightsalmon', 'lightskyblue'], ax=ax3)
ax3.set_title('V12 vs Class negative Correlation')
sns.violinplot(x='Class', y='V3', data=data_smote, palette=['lightsalmon', 'lightskyblue'], ax=ax4)
ax4.set_title('V3 vs Class negative Correlation')
sns.violinplot(x='Class', y='V7', data=data_smote, palette=['lightsalmon', 'lightskyblue'], ax=ax5)
ax5.set_title('V7 vs Class negative Correlation')
sns.violinplot(x='Class', y='V9', data=data_smote, palette=['lightsalmon', 'lightskyblue'], ax=ax6)
ax6.set_title('V9 vs Class negative Correlation')
Text(0.5, 1.0, 'V9 vs Class negative Correlation')

(figure: violin plots of V14, V10, V12, V3, V7 and V9 by Class)

vif = [variance_inflation_factor(data_smote.values, data_smote.columns.get_loc(i)) for i in data_smote.columns]
vif
[8.91488639686892,41.95138589208644,29.439659166987383,9.321076032190051,18.073107065112527,7.88968653431388,38.13243240821064,2.61807436913295,4.202219415722627,20.898417802753006,5.976908659263689,10.856930462897152,1.2514060420970867,20.23958581764367,1.176425772463202,6.444784613229281,6.980222815257359,2.7742773520511372,2.4906782119059176,4.348667463801223,3.409678717638936,1.9626453781659197,2.1167419900555884,1.1352046295467655,1.9935984979230046,1.1029041559046275,3.084861887885401,1.9565486505075638,13.535498930988794,1.7451075607624895,4.64505815138509]

Compared with plain random undersampling, the multicollinearity situation is somewhat improved here.
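If multicollinearity needed to be reduced further, a common recipe is to repeatedly drop the feature with the largest VIF until every VIF falls below a threshold; a sketch (the threshold of 10 is a rule of thumb, not a value taken from this analysis):

# Sketch: iteratively drop the column with the largest VIF until all VIFs are below the threshold
def drop_high_vif(df, threshold=10.0):
    cols = list(df.columns)
    while True:
        vifs = pd.Series([variance_inflation_factor(df[cols].values, i) for i in range(len(cols))], index=cols)
        if vifs.max() < threshold:
            return cols
        cols.remove(vifs.idxmax())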

# Feature selection with Lasso
# Fit LassoCV with cross-validation over a small grid of alphas
model_lasso = LassoCV(alphas=[0.1,0.01,0.005,1],random_state=1,cv=5).fit(X_smote,y_smote)
# Show the features the model keeps (non-zero coefficients)
coef = pd.Series(model_lasso.coef_,index=X_smote.columns)
print(coef[coef != 0])
V1    -0.019851
V2     0.004587
V3     0.000523
V4     0.052236
V5    -0.000597
V6    -0.012474
V7     0.030960
V8    -0.012043
V9     0.007895
V10   -0.023509
V11    0.005633
V12    0.009648
V13   -0.036565
V14   -0.053919
V15    0.012297
V17   -0.009149
V18    0.030941
V20    0.010266
V21    0.013880
V22    0.019031
V23   -0.009253
V26   -0.068311
V27   -0.003680
V28    0.008911
dtype: float64
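The same selection can be expressed with SelectFromModel, which also makes the regularization strength chosen by cross-validation explicit; a minimal sketch using the already fitted model_lasso:

# Sketch: pull the retained features out of the fitted LassoCV via SelectFromModel
from sklearn.feature_selection import SelectFromModel

print('chosen alpha:', model_lasso.alpha_)
selector = SelectFromModel(model_lasso, prefit=True)
lasso_features = X_smote.columns[selector.get_support()]
print(lasso_features.tolist())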
# Rank feature importance with a random forest
from sklearn.ensemble import RandomForestClassifier
rfc_fea_model = RandomForestClassifier(random_state=1)
rfc_fea_model.fit(X_smote,y_smote)
fea = X_smote.columns
importance = rfc_fea_model.feature_importances_
a = pd.DataFrame()
a['feature'] = fea
a['importance'] = importance
a = a.sort_values('importance',ascending = False)
plt.figure(figsize=(20,10))
plt.bar(a['feature'],a['importance'])
plt.title('the importance orders sorted by random forest')
plt.show()

(figure: feature importance ranking from the random forest)

# Cumulative feature importance in descending order
a['cum_importance'] = a['importance'].cumsum()
a[['feature', 'cum_importance']]
    feature  cum_importance
13      V14        0.140330
11      V12        0.274028
9       V10        0.392515
16      V17        0.501863
3        V4        0.592110
10      V11        0.680300
1        V2        0.728448
2        V3        0.770334
15      V16        0.808083
6        V7        0.842341
17      V18        0.857924
7        V8        0.869777
20      V21        0.880431
28   Amount        0.890861
0        V1        0.900571
8        V9        0.909978
12      V13        0.918623
18      V19        0.926978
26      V27        0.935072
4        V5        0.942688
29     Hour        0.950275
19      V20        0.957287
14      V15        0.963873
5        V6        0.970225
25      V26        0.976519
27      V28        0.982202
22      V23        0.987770
21      V22        0.992335
24      V25        0.996295
23      V24        1.000000

The two methods rank feature importance quite differently. To retain as much information as possible, we combine both selections: keep the features whose cumulative random-forest importance reaches 95%, and add any feature kept by the Lasso that is not already in that set. This leaves 28 features, i.e. everything except V24 and V25.
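That selection rule can also be written down explicitly; the sketch below reproduces it with the objects defined earlier (a from the random-forest ranking, coef from the Lasso), with the 0.95 cutoff taken from the text:

# Sketch: features covering 95% cumulative RF importance, plus any extra features kept by the Lasso
cum = a['importance'].cumsum()
n_keep = int((cum < 0.95).sum()) + 1        # number of top features needed to reach 95% cumulative importance
rf_keep = set(a['feature'].iloc[:n_keep])
lasso_keep = set(coef[coef != 0].index)
combined = [c for c in X_smote.columns if c in (rf_keep | lasso_keep)]
print(len(combined), combined)              # expected: 28 features, everything except V24 and V25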

# test_cols = ['V12', 'V14', 'V10', 'V17', 'V4', 'V11', 'V2', 'V7', 'V16', 'V3', 'V18',
#              'V8', 'Amount', 'V19', 'V21', 'V1', 'V5', 'V13', 'V27','V6','V15','V26']
test_cols = X_smote.columns.drop(['V24','V25'])
Classifiers = {'LG': LogisticRegression(random_state=1),
               'KNN': KNeighborsClassifier(),
               'Bayes': GaussianNB(),
               'SVC': SVC(random_state=1, probability=True),
               'DecisionTree': DecisionTreeClassifier(random_state=1),
               'RandomForest': RandomForestClassifier(random_state=1),
               'Adaboost': AdaBoostClassifier(random_state=1),
               'GBDT': GradientBoostingClassifier(random_state=1),
               'XGboost': XGBClassifier(random_state=1),
               'LightGBM': LGBMClassifier(random_state=1)
               }
Y_pred, Accuracy_score = train_test(Classifiers, X_smote[test_cols], y_smote, X_test[test_cols], y_test)
print(Accuracy_score)
Y_pred.head()
LG
              precision    recall  f1-score   support
           0       1.00      0.98      0.99     56863
           1       0.06      0.86      0.12        98
    accuracy                           0.98     56961
   macro avg       0.53      0.92      0.55     56961
weighted avg       1.00      0.98      0.99     56961
KNN
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     56863
           1       0.29      0.83      0.42        98
    accuracy                           1.00     56961
   macro avg       0.64      0.91      0.71     56961
weighted avg       1.00      1.00      1.00     56961
Bayes
              precision    recall  f1-score   support
           0       1.00      0.98      0.99     56863
           1       0.06      0.84      0.11        98
    accuracy                           0.98     56961
   macro avg       0.53      0.91      0.55     56961
weighted avg       1.00      0.98      0.99     56961
SVC
              precision    recall  f1-score   support
           0       1.00      0.99      0.99     56863
           1       0.09      0.86      0.17        98
    accuracy                           0.99     56961
   macro avg       0.55      0.92      0.58     56961
weighted avg       1.00      0.99      0.99     56961
DecisionTree
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     56863
           1       0.27      0.80      0.40        98
    accuracy                           1.00     56961
   macro avg       0.63      0.90      0.70     56961
weighted avg       1.00      1.00      1.00     56961
RandomForest
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     56863
           1       0.81      0.82      0.81        98
    accuracy                           1.00     56961
   macro avg       0.90      0.91      0.91     56961
weighted avg       1.00      1.00      1.00     56961
Adaboost
              precision    recall  f1-score   support
           0       1.00      0.98      0.99     56863
           1       0.06      0.89      0.12        98
    accuracy                           0.98     56961
   macro avg       0.53      0.93      0.55     56961
weighted avg       1.00      0.98      0.99     56961
GBDT
              precision    recall  f1-score   support
           0       1.00      0.99      0.99     56863
           1       0.13      0.85      0.22        98
    accuracy                           0.99     56961
   macro avg       0.56      0.92      0.61     56961
weighted avg       1.00      0.99      0.99     56961
[15:24:00] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGboost
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     56863
           1       0.69      0.84      0.76        98
    accuracy                           1.00     56961
   macro avg       0.84      0.92      0.88     56961
weighted avg       1.00      1.00      1.00     56961
LightGBM
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     56863
           1       0.54      0.83      0.66        98
    accuracy                           1.00     56961
   macro avg       0.77      0.91      0.83     56961
weighted avg       1.00      1.00      1.00     56961
         LG       KNN     Bayes       SVC  DecisionTree  RandomForest  Adaboost    GBDT  XGboost  LightGBM
0  0.977862  0.996138  0.976335  0.985113      0.995874       0.99935  0.977054  0.9898  0.99907  0.998508
   LG  KNN  Bayes  SVC  DecisionTree  RandomForest  Adaboost  GBDT  XGboost  LightGBM
0   1    1      1    0             0             1         1     1        1         1
1   1    1      1    0             0             0         1     1        0         1
2   1    1      1    1             1             1         1     1        1         1
3   1    1      1    1             1             1         1     1        1         1
4   1    1      1    1             1             1         1     1        1         1

(10 figures omitted)

The accuracy figures show that combining SMOTE oversampling with random undersampling helps a great deal: RandomForest reaches 99.94% accuracy, missing only 13 fraud samples and flagging 18 legitimate transactions as fraud, and other algorithms such as XGBoost, LightGBM and KNN also exceed 99% accuracy. Next we build an ensemble from the most accurate models.
Because the resampled training set is large and grid search is time-consuming, and the results are already good, the final ensemble below keeps the default parameters of its base models; with enough time they could be tuned exactly as before.
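When tuning on a training set this large, RandomizedSearchCV is a cheaper stand-in for the exhaustive grid search used earlier; a sketch for the random forest (the parameter distributions below are illustrative):

# Sketch: randomized search samples a fixed number of parameter combinations instead of trying them all
from sklearn.model_selection import RandomizedSearchCV

rf_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions={'n_estimators': [50, 100, 150, 200],
                         'min_samples_leaf': [1, 3, 5, 7],
                         'criterion': ['gini', 'entropy']},
    n_iter=10, cv=3, n_jobs=-1, random_state=1)
rf_search.fit(X_smote[test_cols], y_smote)
print(rf_search.best_estimator_)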

Classifiers = {'LG': reg_best(X_smote[test_cols], y_smote),
               'KNN': KNN_best(X_smote[test_cols], y_smote),
               'Bayes': GaussianNB(),
               'SVC': SVC_best(X_smote[test_cols], y_smote),
               'DecisionTree': DecisionTree_best(X_smote[test_cols], y_smote),
               'RandomForest': RandomForest_best(X_smote[test_cols], y_smote),
               'Adaboost': Adaboost_best(X_smote[test_cols], y_smote),
               'GBDT': GBDT_best(X_smote[test_cols], y_smote),
               'XGboost': XGboost_best(X_smote[test_cols], y_smote),
               'LightGBM': LGBM_best(X_smote[test_cols], y_smote)}
LogisticRegression(C=1)
KNeighborsClassifier(n_neighbors=3)
SVC(C=1, probability=True)
DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
RandomForestClassifier(criterion='entropy', min_samples_leaf=5)
AdaBoostClassifier(learning_rate=1, n_estimators=200)
GradientBoostingClassifier(n_estimators=150)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.5, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=200, n_jobs=8, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
LGBMClassifier(learning_rate=0.5, max_depth=15, num_leaves=51)
Y_pred,Accuracy_score = train_test(Classifiers, X_smote[test_cols], y_smote, X_test[test_cols], y_test)
print(Accuracy_score)
Y_pred.head().append(Y_pred.tail())  # on pandas >= 2.0 use pd.concat([Y_pred.head(), Y_pred.tail()]), since DataFrame.append was removed
LG
              precision    recall  f1-score   support
           0       1.00      0.97      0.99     56863
           1       0.06      0.91      0.11        98
    accuracy                           0.97     56961
   macro avg       0.53      0.94      0.55     56961
weighted avg       1.00      0.97      0.99     56961
KNN
              precision    recall  f1-score   support
           0       1.00      0.98      0.99     56863
           1       0.09      0.92      0.17        98
    accuracy                           0.98     56961
   macro avg       0.55      0.95      0.58     56961
weighted avg       1.00      0.98      0.99     56961
Bayes
              precision    recall  f1-score   support
           0       1.00      0.98      0.99     56863
           1       0.07      0.88      0.12        98
    accuracy                           0.98     56961
   macro avg       0.53      0.93      0.56     56961
weighted avg       1.00      0.98      0.99     56961
SVC
              precision    recall  f1-score   support
           0       1.00      0.98      0.99     56863
           1       0.08      0.90      0.15        98
    accuracy                           0.98     56961
   macro avg       0.54      0.94      0.57     56961
weighted avg       1.00      0.98      0.99     56961
DecisionTree
              precision    recall  f1-score   support
           0       1.00      0.96      0.98     56863
           1       0.04      0.90      0.08        98
    accuracy                           0.96     56961
   macro avg       0.52      0.93      0.53     56961
weighted avg       1.00      0.96      0.98     56961
RandomForest
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     56863
           1       0.33      0.90      0.48        98
    accuracy                           1.00     56961
   macro avg       0.66      0.95      0.74     56961
weighted avg       1.00      1.00      1.00     56961
Adaboost
              precision    recall  f1-score   support
           0       1.00      0.98      0.99     56863
           1       0.08      0.90      0.15        98
    accuracy                           0.98     56961
   macro avg       0.54      0.94      0.57     56961
weighted avg       1.00      0.98      0.99     56961
GBDT
              precision    recall  f1-score   support
           0       1.00      0.99      0.99     56863
           1       0.11      0.89      0.19        98
    accuracy                           0.99     56961
   macro avg       0.55      0.94      0.59     56961
weighted avg       1.00      0.99      0.99     56961
[17:58:32] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGboost
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     56863
           1       0.24      0.90      0.38        98
    accuracy                           0.99     56961
   macro avg       0.62      0.95      0.69     56961
weighted avg       1.00      0.99      1.00     56961
LightGBM
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     56863
           1       0.30      0.90      0.45        98
    accuracy                           1.00     56961
   macro avg       0.65      0.95      0.72     56961
weighted avg       1.00      1.00      1.00     56961
        LG       KNN     Bayes       SVC  DecisionTree  RandomForest  Adaboost      GBDT   XGboost  LightGBM
0  0.97393  0.984463  0.978477  0.982936      0.963273      0.996629  0.982093  0.987026  0.994944  0.996208
       LG  KNN  Bayes  SVC  DecisionTree  RandomForest  Adaboost  GBDT  XGboost  LightGBM
0       0    0      0    0             0             0         0     0        0         0
1       0    0      0    0             0             0         0     0        0         0
2       0    0      0    0             0             0         0     0        0         0
3       0    0      0    0             0             0         0     0        0         0
4       0    0      1    0             0             0         0     0        0         0
56956   0    0      0    0             0             0         0     0        0         0
56957   0    0      0    0             0             0         0     0        0         0
56958   0    0      0    0             0             0         0     0        0         0
56959   0    0      1    0             0             0         0     0        0         0
56960   0    0      0    0             0             0         0     0        0         0
# Ensemble learning
# Based on the accuracy results above, use KNN, DecisionTree, RandomForest, XGboost and LightGBM as base models
from sklearn.ensemble import VotingClassifier
voting_clf = VotingClassifier(estimators=[
    # ('LG', LogisticRegression(random_state=1)),
    ('KNN', KNeighborsClassifier(n_neighbors=3)),
    # ('Bayes', GaussianNB()),
    # ('SVC', SVC(random_state=1, probability=True)),
    ('DecisionTree', DecisionTreeClassifier(random_state=1)),
    ('RandomForest', RandomForestClassifier(random_state=1)),
    # ('Adaboost', AdaBoostClassifier(random_state=1)),
    # ('GBDT', GradientBoostingClassifier(random_state=1)),
    ('XGboost', XGBClassifier(random_state=1)),
    ('LightGBM', LGBMClassifier(random_state=1))])
voting_clf.fit(X_smote[test_cols], y_smote)
y_final_pred = voting_clf.predict(X_test[test_cols])
print(classification_report(y_test, y_final_pred))
fig, ax = plt.subplots(1, 1)
plot_confusion_matrix(voting_clf, X_test[test_cols], y_test, labels=[0, 1], cmap='Blues', ax=ax)
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     56863
           1       0.71      0.83      0.76        98
    accuracy                           1.00     56961
   macro avg       0.86      0.91      0.88     56961
weighted avg       1.00      1.00      1.00     56961
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1fe3047f3a0>

(figure: confusion matrix of the final voting classifier)
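The miss and false-alarm counts quoted in the conclusion below can be read straight off the confusion matrix; a small sketch using the fitted voting classifier:

# Sketch: unpack the confusion matrix into TN / FP / FN / TP counts
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_final_pred).ravel()
print(f'missed frauds (FN): {fn}, false alarms (FP): {fp}')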

In the end, the selected ensemble reaches nearly 100% accuracy: 17 fraud samples go undetected and only 33 legitimate transactions are flagged as fraud, a large improvement in accuracy over plain random undersampling. There is still room to improve the detection of fraud samples, though.
