Hotel Booking Analysis

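Goal: build meaningful predictors from the dataset, then choose the model with the best predictive performance by comparing accuracy scores and ROC curves across several ML models.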

1- EDA

2- Preprocessing

3- Models and ROC Curve Comparison

  • Logistic Regression
  • Gaussian Naive Bayes
  • Support Vector Classification
  • Decision Tree Model
  • Random Forest
  • Model Tuning for Random Forest
  • XGBoost
  • Neural Network
  • Model Tuning for Neural Network
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, confusion_matrix, auc
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from warnings import filterwarnings
filterwarnings('ignore')
df = pd.read_csv("../kaggle/hotel_bookings.csv")
df.head()
   hotel  is_canceled  lead_time  arrival_date_year  arrival_date_month  arrival_date_week_number  arrival_date_day_of_month  stays_in_weekend_nights  stays_in_week_nights  adults  ...  deposit_type  agent  company  days_in_waiting_list  customer_type  adr  required_car_parking_spaces  total_of_special_requests  reservation_status  reservation_status_date
0  Resort Hotel  0  342  2015  July  27  1  0  0  2  ...  No Deposit  NaN  NaN  0  Transient  0.0  0  0  Check-Out  2015-07-01
1  Resort Hotel  0  737  2015  July  27  1  0  0  2  ...  No Deposit  NaN  NaN  0  Transient  0.0  0  0  Check-Out  2015-07-01
2  Resort Hotel  0  7  2015  July  27  1  0  1  1  ...  No Deposit  NaN  NaN  0  Transient  75.0  0  0  Check-Out  2015-07-02
3  Resort Hotel  0  13  2015  July  27  1  0  1  1  ...  No Deposit  304.0  NaN  0  Transient  75.0  0  0  Check-Out  2015-07-02
4  Resort Hotel  0  14  2015  July  27  1  0  2  2  ...  No Deposit  240.0  NaN  0  Transient  98.0  0  1  Check-Out  2015-07-03

5 rows × 32 columns

df.shape
(119390, 32)
print("# of NaN in each columns:", df.isnull().sum(), sep='\n')
# of NaN in each columns:
hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company                           112593
days_in_waiting_list                   0
customer_type                          0
adr                                    0
required_car_parking_spaces            0
total_of_special_requests              0
reservation_status                     0
reservation_status_date                0
dtype: int64
# Work on a copy so the original dataset stays available if it is needed later.
data = df.copy()

1. EDA

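Conditional distribution: the plot below counts canceled and non-canceled bookings for new and repeat guests. Repeat guests rarely cancel, whereas a sizeable share of new guests do.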

sns.set(style = "darkgrid")
ax = sns.countplot(x = "is_canceled", hue = 'is_repeated_guest', data = data)
plt.title("Canceled or not", fontdict = {'fontsize': 20})
plt.show()

[Figure: output_11_0.png]
It is no surprise that repeat guests seldom cancel their bookings, although there are exceptions. Note also that most customers are not repeat guests.

Box plots of nights stayed, by market segment and hotel type

plt.figure(figsize = (15,10))
sns.boxplot(x = "market_segment", y = "stays_in_week_nights", data = data, hue = "hotel", palette = 'Set1');

[Figure: output_14_0.png]

plt.figure(figsize=(15,10))
sns.boxplot(x = "market_segment", y = "stays_in_weekend_nights", data = data, hue = "hotel", palette = 'Set1')
plt.show()

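[Figure: output_15_0.png]
Customers in the Aviation segment do not appear to stay at the resort hotel, and they stay relatively few nights on average. Apart from that, the weekend and weekday averages are roughly equal. Aviation customers probably travel for business and stay only briefly. It is also likely that most airports are some distance from the sea and closer to city hotels.

Clearly, guests who go to resort hotels tend to stay longer.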

Count plots by market segment

sns.set(style = "darkgrid")
plt.figure(figsize = (13,10))
ax = sns.countplot(x = "market_segment", hue = 'deposit_type', data = data)
plt.title("Countplot Distribution of Segment by Deposit Type", fontdict = {'fontsize':20})
plt.show()

[Figure: output_18_0.png]

plt.figure(figsize = (13,10))
sns.set(style = "darkgrid")
plt.title("Countplot Distribution of Segments by Cancellation", fontdict = {'fontsize':20})
ax = sns.countplot(x = "market_segment", hue = 'is_canceled', data = data)
plt.show()

[Figure: output_19_0.png]

Density curves of lead time, by cancellation status

(sns.FacetGrid(data, hue = 'is_canceled',height = 6,xlim = (0,500)).map(sns.kdeplot, 'lead_time', shade = True).add_legend());
plt.show()

[Figure: output_21_0.png]

Monthly cancellations and customers by hotel type

plt.figure(figsize =(13,10))
sns.set(style="darkgrid")
plt.title("Total Customers - Monthly ", fontdict={'fontsize': 20})
ax = sns.countplot(x = "arrival_date_month", hue = 'hotel', data = data)

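[Figure: output_23_0.png]

How to read the next two plots: sns.barplot groups the rows by the x column (here arrival_date_month), applies the estimator (the mean by default) to the y values in each group, and draws the result as the bar height. Because is_canceled is 0 or 1, each bar is the cancellation rate for that month. The error bar on each bar shows the uncertainty of that estimate (by default a bootstrapped confidence interval).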

plt.figure(figsize =(13,10))
sns.barplot(x = 'arrival_date_month', y = 'is_canceled', data = data)
plt.show()
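
As a rough cross-check of what the bars represent, the same per-month means can be computed directly with groupby. This is only a sketch: the tiny DataFrame below is made up for illustration, and only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy data; only the column names match the real dataset.
toy = pd.DataFrame({
    "arrival_date_month": ["July", "July", "July", "August", "August"],
    "is_canceled":        [1,      0,      0,      1,        1],
})

# The mean of a 0/1 column per group is the cancellation rate per month,
# which is the bar height sns.barplot draws with its default estimator.
cancel_rate = toy.groupby("arrival_date_month")["is_canceled"].mean()
print(cancel_rate)
```

On the real data the same call would be applied to data instead of toy.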

[Figure: output_25_0.png]

plt.figure(figsize = (20,10))
sns.barplot(x = 'arrival_date_month', y = 'is_canceled', hue = 'hotel', data = data)
plt.show()

[Figure: output_26_0.png]

2. Preprocessing

Missing values, feature engineering, and standardization

print("# of NaN in each columns:", df.isnull().sum(), sep='\n')
# of NaN in each columns:
hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company                           112593
days_in_waiting_list                   0
customer_type                          0
adr                                    0
required_car_parking_spaces            0
total_of_special_requests              0
reservation_status                     0
reservation_status_date                0
dtype: int64

Computing missing-value ratios

def perc_mv(x, y):
    # Percentage of missing values in column y, relative to the number of rows in x
    perc = y.isnull().sum() / len(x) * 100
    return perc

print('Missing value ratios:\nCompany: {}\nAgent: {}\nCountry: {}'.format(
    perc_mv(df, df['company']),
    perc_mv(df, df['agent']),
    perc_mv(df, df['country'])))
Missing value ratios:
Company: 94.30689337465449
Agent: 13.686238378423655
Country: 0.40874445095904177
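
The same ratios can be obtained for every column at once, since the mean of a boolean mask is the fraction of True values. A minimal sketch on a made-up frame (the values are hypothetical; the notebook would apply this to df):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same three column names.
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 40.0],
    "agent":   [9.0, np.nan, 240.0, 304.0],
    "country": ["PRT", "GBR", "PRT", "ESP"],
})

# isnull() yields booleans; their column-wise mean is the missing fraction.
missing_pct = toy.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))
```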
data["agent"].value_counts().count()
333

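94.3% of the company column is missing, so we drop that column.

13.68% of the agent column is missing. That is not enough to justify dropping the column, but we should not drop the rows either: 13.68% is a large share of the data, and those rows may carry important information. There are 333 unique agents, which is too many to impute reliably.
The NA values may also correspond to agents that are not among the 333 listed. We cannot predict the agent, and because the missing values make up about 13% of the data, we cannot delete those rows either. I will decide how to handle agent after the correlation section.

Dropping the rows with a missing country would not be a problem, since they are only about 0.4% of the data. Still, I will wait for the correlation analysis.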

# company is dropped
data = data.drop(['company'], axis = 1)
# The children column also has 4 missing values. With no information about children, we assume those customers have none.
data['children'] = data['children'].fillna(0)

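Feature engineering

We should review the features to create some more meaningful variables and reduce the number of features where possible.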

data.dtypes
hotel                              object
is_canceled                         int64
lead_time                           int64
arrival_date_year                   int64
arrival_date_month                 object
arrival_date_week_number            int64
arrival_date_day_of_month           int64
stays_in_weekend_nights             int64
stays_in_week_nights                int64
adults                              int64
children                          float64
babies                              int64
meal                               object
country                            object
market_segment                     object
distribution_channel               object
is_repeated_guest                   int64
previous_cancellations              int64
previous_bookings_not_canceled      int64
reserved_room_type                 object
assigned_room_type                 object
booking_changes                     int64
deposit_type                       object
agent                             float64
days_in_waiting_list                int64
customer_type                      object
adr                               float64
required_car_parking_spaces         int64
total_of_special_requests           int64
reservation_status                 object
reservation_status_date            object
dtype: object
# Label these two columns manually; the rest are handled with get_dummies or LabelEncoder.
data['hotel'] = data['hotel'].map({'Resort Hotel':0, 'City Hotel':1})
data['arrival_date_month'] = data['arrival_date_month'].map({'January':1, 'February': 2, 'March':3, 'April':4, 'May':5, 'June':6, 'July':7,
                                                             'August':8, 'September':9, 'October':10, 'November':11, 'December':12})

The code above maps the string values to integers.

def family(data):
    # 1 if at least one adult travels with children or babies, else 0
    if ((data['adults'] > 0) & (data['children'] > 0)):
        val = 1
    elif ((data['adults'] > 0) & (data['babies'] > 0)):
        val = 1
    else:
        val = 0
    return val

def deposit(data):
    # 0 for 'No Deposit' or 'Refundable', 1 otherwise
    if ((data['deposit_type'] == 'No Deposit') | (data['deposit_type'] == 'Refundable')):
        return 0
    else:
        return 1

def feature(data):
    data["is_family"] = data.apply(family, axis = 1)
    data["total_customer"] = data["adults"] + data["children"] + data["babies"]
    data["deposit_given"] = data.apply(deposit, axis=1)
    data["total_nights"] = data["stays_in_weekend_nights"]+ data["stays_in_week_nights"]
    return data

data = feature(data)

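What the code above does: data["is_family"] collapses three columns into one 0/1 variable, which is 1 when at least one adult travels with children or babies and 0 otherwise; data["total_customer"] is the total head count (adults + children + babies); data["deposit_given"] turns data['deposit_type'] into a 0/1 variable; data["total_nights"] is the total number of nights stayed.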
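
Row-wise apply is slow on roughly 119,000 rows. The same four features can be sketched with vectorized column operations. This is an alternative formulation, not the author's code: add_features is a hypothetical name, and the toy rows and their category labels are invented for illustration.

```python
import pandas as pd

def add_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Vectorized sketch of the feature() helper; returns a new frame."""
    out = frame.copy()
    has_adult = out["adults"] > 0
    has_minor = (out["children"] > 0) | (out["babies"] > 0)
    # Same logic as family(): an adult together with a child or a baby
    out["is_family"] = (has_adult & has_minor).astype(int)
    out["total_customer"] = out["adults"] + out["children"] + out["babies"]
    # Same logic as deposit(): anything other than these two labels counts as 1
    no_deposit = out["deposit_type"].isin(["No Deposit", "Refundable"])
    out["deposit_given"] = (~no_deposit).astype(int)
    out["total_nights"] = out["stays_in_weekend_nights"] + out["stays_in_week_nights"]
    return out

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "adults": [2, 2, 1, 0],
    "children": [0.0, 1.0, 0.0, 2.0],
    "babies": [0, 0, 1, 0],
    "deposit_type": ["No Deposit", "Non Refund", "Refundable", "No Deposit"],
    "stays_in_weekend_nights": [0, 2, 1, 0],
    "stays_in_week_nights": [1, 3, 2, 4],
})
result = add_features(toy)
print(result[["is_family", "total_customer", "deposit_given", "total_nights"]])
```

Unlike feature(), this sketch copies the frame instead of modifying it in place, which avoids surprises when the input is reused.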

With the new features in place, drop the columns that are no longer needed:

data = data.drop(columns = ['adults', 'babies', 'children', 'deposit_type', 'reservation_status_date'])

Correlation

data.columns
Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'meal', 'country', 'market_segment',
       'distribution_channel', 'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'agent',
       'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'is_family', 'total_customer', 'deposit_given',
       'total_nights'],
      dtype='object')
cor_data = data.copy()

The correlations are computed on a copy, so the data frame used for modeling later is left unchanged.

le = LabelEncoder()
cor_data['meal'] = le.fit_transform(cor_data['meal'])
cor_data['distribution_channel'] = le.fit_transform(cor_data['distribution_channel'])
cor_data['reserved_room_type'] = le.fit_transform(cor_data['reserved_room_type'])
cor_data['assigned_room_type'] = le.fit_transform(cor_data['assigned_room_type'])
cor_data['agent'] = le.fit_transform(cor_data['agent'])
cor_data['customer_type'] = le.fit_transform(cor_data['customer_type'])
cor_data['reservation_status'] = le.fit_transform(cor_data['reservation_status'])
cor_data['market_segment'] = le.fit_transform(cor_data['market_segment'])
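# 'country' is still a string column; on pandas >= 2.0 call cor_data.corr(numeric_only=True) here and below.
cor_data.corr()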
   hotel  is_canceled  lead_time  arrival_date_year  arrival_date_month  arrival_date_week_number  arrival_date_day_of_month  stays_in_weekend_nights  stays_in_week_nights  meal  ...  days_in_waiting_list  customer_type  adr  required_car_parking_spaces  total_of_special_requests  reservation_status  is_family  total_customer  deposit_given  total_nights
hotel  1.000000  0.136531  0.075381  0.035267  0.001817  0.001270  -0.001862  -0.186596  -0.234020  0.008018  ...  0.072432  0.047531  0.096719  -0.218873  -0.043390  -0.124331  -0.058306  -0.040821  0.172003  -0.247479
is_canceled  0.136531  1.000000  0.293123  0.016660  0.011022  0.008148  -0.006130  -0.001791  0.024765  -0.017678  ...  0.054186  -0.068140  0.047557  -0.195498  -0.234658  -0.917196  -0.013010  0.046522  0.481457  0.017779
lead_time  0.075381  0.293123  1.000000  0.040142  0.131424  0.126871  0.002268  0.085671  0.165799  0.000349  ...  0.170084  0.073403  -0.063077  -0.116451  -0.095712  -0.302175  -0.043972  0.072265  0.380179  0.157167
arrival_date_year  0.035267  0.016660  0.040142  1.000000  -0.527739  -0.540561  -0.000221  0.021497  0.030883  0.065840  ...  -0.056497  -0.006149  0.197580  -0.013684  0.108531  -0.017683  0.052711  0.052127  -0.065963  0.031438
arrival_date_month  0.001817  0.011022  0.131424  -0.527739  1.000000  0.995105  -0.026063  0.018440  0.019212  -0.015205  ...  0.019045  -0.029753  0.079315  0.000257  0.028026  -0.021090  0.010427  0.027252  0.008746  0.021536
arrival_date_week_number  0.001270  0.008148  0.126871  -0.540561  0.995105  1.000000  0.066809  0.018208  0.015558  -0.017381  ...  0.022933  -0.028432  0.075791  0.001920  0.026149  -0.017387  0.010611  0.025220  0.007773  0.018719
arrival_date_day_of_month  -0.001862  -0.006130  0.002268  -0.000221  -0.026063  0.066809  1.000000  -0.016354  -0.028174  -0.007086  ...  0.022728  0.012188  0.030245  0.008683  0.003062  0.011460  0.014710  0.006742  -0.008616  -0.027408
stays_in_weekend_nights  -0.186596  -0.001791  0.085671  0.021497  0.018440  0.018208  -0.016354  1.000000  0.498969  0.045744  ...  -0.054151  -0.109220  0.049342  -0.018554  0.072671  0.008558  0.052306  0.101426  -0.114275  0.762790
stays_in_week_nights  -0.234020  0.024765  0.165799  0.030883  0.019212  0.015558  -0.028174  0.498969  1.000000  0.036742  ...  -0.002020  -0.127223  0.065237  -0.024859  0.068192  -0.021607  0.050424  0.101665  -0.079999  0.941005
meal  0.008018  -0.017678  0.000349  0.065840  -0.015205  -0.017381  -0.007086  0.045744  0.036742  1.000000  ...  -0.007132  0.044658  0.059098  -0.038923  0.023136  0.015393  -0.041727  -0.005975  -0.090725  0.045277
market_segment  0.083795  0.059338  0.013797  0.107697  0.001293  -0.000510  -0.004088  0.115350  0.108569  0.145132  ...  -0.041503  -0.165814  0.232763  -0.062226  0.274373  -0.061584  0.080450  0.213221  -0.183880  0.126052
distribution_channel  0.174419  0.167600  0.220414  0.022644  0.007381  0.005699  0.001578  0.093097  0.087185  0.116957  ...  0.048642  -0.069640  0.092396  -0.132280  0.098815  -0.171330  0.000464  0.144357  0.102548  0.101407
is_repeated_guest  -0.050421  -0.084793  -0.124410  0.010341  -0.030729  -0.030131  -0.006145  -0.087239  -0.097245  -0.057009  ...  -0.022235  -0.017111  -0.134314  0.077090  0.013050  0.083504  -0.035127  -0.136748  -0.058423  -0.106626
previous_cancellations  -0.012292  0.110133  0.086042  -0.119822  0.037479  0.035501  -0.027011  -0.012775  -0.013992  -0.003772  ...  0.005929  -0.008188  -0.065646  -0.018492  -0.048384  -0.110758  -0.027262  -0.020058  0.143314  -0.015429
previous_bookings_not_canceled  -0.004441  -0.057358  -0.073548  0.029218  -0.021640  -0.020904  -0.000300  -0.042715  -0.048743  -0.040417  ...  -0.009397  -0.012259  -0.072144  0.047653  0.037824  0.055051  -0.022815  -0.099097  -0.031509  -0.053049
reserved_room_type  -0.249677  -0.061282  -0.106089  0.092809  -0.007923  -0.007997  0.016929  0.142083  0.168616  -0.120749  ...  -0.068821  -0.120978  0.392060  0.131583  0.137466  0.058693  0.323910  0.383357  -0.201348  0.181296
assigned_room_type  -0.307834  -0.176028  -0.172219  0.036141  -0.006378  -0.005684  0.011646  0.086643  0.100795  -0.120792  ...  -0.068676  -0.084427  0.258134  0.160131  0.124683  0.172537  0.292940  0.302422  -0.246602  0.109042
booking_changes  -0.072820  -0.144381  0.000149  0.030872  0.004809  0.005508  0.010613  0.063281  0.096209  0.024650  ...  -0.011634  0.092029  0.019618  0.065620  0.052833  0.140799  0.079121  -0.003173  -0.119333  0.096498
agent  -0.158500  -0.127883  -0.171430  -0.017723  -0.000799  0.001638  -0.002271  -0.110284  -0.110354  -0.095428  ...  -0.039667  0.066095  -0.126407  0.113648  -0.085429  0.123264  -0.032656  -0.155423  -0.013898  -0.125406
days_in_waiting_list  0.072432  0.054186  0.170084  -0.056497  0.019045  0.022933  0.022728  -0.054151  -0.002020  -0.007132  ...  1.000000  0.099121  -0.040756  -0.030600  -0.082730  -0.057927  -0.036312  -0.026431  0.120249  -0.022652
customer_type  0.047531  -0.068140  0.073403  -0.006149  -0.029753  -0.028432  0.012188  -0.109220  -0.127223  0.044658  ...  0.099121  1.000000  -0.077155  -0.030060  -0.135624  0.066004  -0.060139  -0.113232  -0.086745  -0.137577
adr  0.096719  0.047557  -0.063077  0.197580  0.079315  0.075791  0.030245  0.049342  0.065237  0.059098  ...  -0.040756  -0.077155  1.000000  0.056628  0.172185  -0.050520  0.309360  0.368105  -0.087608  0.067945
required_car_parking_spaces  -0.218873  -0.195498  -0.116451  -0.013684  0.000257  0.001920  0.008683  -0.018554  -0.024859  -0.038923  ...  -0.030600  -0.030060  0.056628  1.000000  0.082626  0.179310  0.069141  0.047934  -0.094982  -0.025794
total_of_special_requests  -0.043390  -0.234658  -0.095712  0.108531  0.028026  0.026149  0.003062  0.072671  0.068192  0.023136  ...  -0.082730  -0.135624  0.172185  0.082626  1.000000  0.225674  0.128205  0.156834  -0.268034  0.079259
reservation_status  -0.124331  -0.917196  -0.302175  -0.017683  -0.021090  -0.017387  0.011460  0.008558  -0.021607  0.015393  ...  -0.057927  0.066004  -0.050520  0.179310  0.225674  1.000000  0.013117  -0.055273  -0.478747  -0.012781
is_family  -0.058306  -0.013010  -0.043972  0.052711  0.010427  0.010611  0.014710  0.052306  0.050424  -0.041727  ...  -0.036312  -0.060139  0.309360  0.069141  0.128205  0.013117  1.000000  0.579899  -0.106643  0.058049
total_customer  -0.040821  0.046522  0.072265  0.052127  0.027252  0.025220  0.006742  0.101426  0.101665  -0.005975  ...  -0.026431  -0.113232  0.368105  0.047934  0.156834  -0.055273  0.579899  1.000000  -0.080676  0.115463
deposit_given  0.172003  0.481457  0.380179  -0.065963  0.008746  0.007773  -0.008616  -0.114275  -0.079999  -0.090725  ...  0.120249  -0.086745  -0.087608  -0.094982  -0.268034  -0.478747  -0.106643  -0.080676  1.000000  -0.104314
total_nights  -0.247479  0.017779  0.157167  0.031438  0.021536  0.018719  -0.027408  0.762790  0.941005  0.045277  ...  -0.022652  -0.137577  0.067945  -0.025794  0.079259  -0.012781  0.058049  0.115463  -0.104314  1.000000

29 rows × 29 columns
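
A 29 by 29 matrix is hard to scan by eye. One way to summarize it is to rank the features by the absolute value of their correlation with the target. The sketch below uses a small invented frame (values are hypothetical, column names are from the dataset); on the real data it would be applied to cor_data.

```python
import pandas as pd

# Hypothetical toy data; all columns are numeric so corr() needs no extra arguments.
toy = pd.DataFrame({
    "is_canceled":   [0, 1, 0, 1, 1, 0],
    "deposit_given": [0, 1, 0, 1, 0, 0],
    "lead_time":     [10, 200, 30, 150, 90, 5],
    "adr":           [80.0, 75.0, 90.0, 60.0, 95.0, 70.0],
})

# Correlation of every feature with the target, excluding the target itself
target_corr = toy.corr()["is_canceled"].drop("is_canceled")

# Order by strength of association, keeping the sign for interpretation
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```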

cor_data.corr()['stays_in_week_nights']
hotel                            -0.234020
is_canceled                       0.024765
lead_time                         0.165799
arrival_date_year                 0.030883
arrival_date_month                0.019212
arrival_date_week_number          0.015558
arrival_date_day_of_month        -0.028174
stays_in_weekend_nights           0.498969
stays_in_week_nights              1.000000
meal                              0.036742
market_segment                    0.108569
distribution_channel              0.087185
is_repeated_guest                -0.097245
previous_cancellations           -0.013992
previous_bookings_not_canceled   -0.048743
reserved_room_type                0.168616
assigned_room_type                0.100795
booking_changes                   0.096209
agent                            -0.110354
days_in_waiting_list             -0.002020
customer_type                    -0.127223
adr                               0.065237
required_car_parking_spaces      -0.024859
total_of_special_requests         0.068192
reservation_status               -0.021607
is_family                         0.050424
total_customer                    0.101665
deposit_given                    -0.079999
total_nights                      0.941005
Name: stays_in_week_nights, dtype: float64

Drop some columns:

cor_data = cor_data.drop(columns = ['total_nights', 'arrival_date_week_number', 'stays_in_weekend_nights', 'arrival_date_month', 'agent'], axis = 1)

Drop the rows with a missing country:

indices = cor_data.loc[pd.isna(cor_data["country"]), :].index 
cor_data = cor_data.drop(cor_data.index[indices])   
cor_data.isnull().sum()
hotel                             0
is_canceled                       0
lead_time                         0
arrival_date_year                 0
arrival_date_day_of_month         0
stays_in_week_nights              0
meal                              0
country                           0
market_segment                    0
distribution_channel              0
is_repeated_guest                 0
previous_cancellations            0
previous_bookings_not_canceled    0
reserved_room_type                0
assigned_room_type                0
booking_changes                   0
days_in_waiting_list              0
customer_type                     0
adr                               0
required_car_parking_spaces       0
total_of_special_requests         0
reservation_status                0
is_family                         0
total_customer                    0
deposit_given                     0
dtype: int64

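Drop the rows with a missing country and a similar set of columns from the modeling data (total_nights is kept here):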

indices = data.loc[pd.isna(data["country"]), :].index 
data = data.drop(data.index[indices])   
data = data.drop(columns = ['arrival_date_week_number', 'stays_in_weekend_nights', 'arrival_date_month', 'agent'], axis = 1)
data.columns
Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_day_of_month', 'stays_in_week_nights', 'meal', 'country',
       'market_segment', 'distribution_channel', 'is_repeated_guest',
       'previous_cancellations', 'previous_bookings_not_canceled',
       'reserved_room_type', 'assigned_room_type', 'booking_changes',
       'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'is_family', 'total_customer', 'deposit_given',
       'total_nights'],
      dtype='object')
df1 = data.copy()

Convert the categorical variables to dummy variables:

#one-hot-encoding
df1 = pd.get_dummies(data = df1, columns = ['meal', 
'market_segment', 'distribution_channel',
'reserved_room_type', 'assigned_room_type','customer_type', 'reservation_status'])
df1['country'] = le.fit_transform(df1['country']) 

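le.fit_transform (LabelEncoder) likewise converts string values to numbers: each distinct country becomes one integer code instead of one column per category.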
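
To make the difference between the two encodings concrete, here is a small sketch on invented rows (the values are illustrative; only the column names come from the dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows.
toy = pd.DataFrame({
    "meal": ["BB", "HB", "BB", "SC"],
    "country": ["PRT", "GBR", "ESP", "PRT"],
})

# One-hot encoding: one indicator column per category of 'meal'
encoded = pd.get_dummies(toy, columns=["meal"])

# Label encoding: one integer per category, assigned in sorted order
encoded["country"] = LabelEncoder().fit_transform(encoded["country"])
print(encoded)
```

Label encoding keeps the frame narrow for a high-cardinality column such as country, at the cost of imposing an arbitrary numeric order on the categories.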

Decision Tree Model (reservation_status included)

y = df1["is_canceled"]
X = df1.drop(["is_canceled"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 42)
cart = DecisionTreeClassifier(max_depth = 12)
cart_model = cart.fit(X_train, y_train)
y_pred = cart_model.predict(X_test)
print('Decision Tree Model')
print('Accuracy Score: {}\n\nConfusion Matrix:\n {}\n\nAUC Score: {}'.format(accuracy_score(y_test,y_pred), confusion_matrix(y_test,y_pred), roc_auc_score(y_test,y_pred)))
Decision Tree Model
Accuracy Score: 1.0

Confusion Matrix:
 [[22353     0]
 [    0 13318]]

AUC Score: 1.0

The accuracy is 100%.

pd.DataFrame(data = cart_model.feature_importances_*100,
             columns = ["Importances"],
             index = X_train.columns).sort_values("Importances", ascending = False)[:20].plot(kind = "barh", color = "r")
plt.xlabel("Feature Importances (%)")
plt.show()

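[Figure: feature importances of the decision tree]
The correlation analysis already showed that reservation_status is very strongly associated with the target (a correlation of -0.917 with is_canceled). Keeping this variable in the model lets it dominate every other feature, and with reservation_status in the data the model can reach 100% accuracy. The status is only known once the outcome of the booking is known, so it leaks the target. For the rest of the analysis, reservation_status is dropped.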
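
A quick way to check whether a feature determines the target outright is a cross-tabulation: if every category of the feature falls into exactly one target class, a model can reproduce the target from that feature alone. The sketch below uses invented rows with the dataset's column names; it is an illustration, not a result computed on the real data.

```python
import pandas as pd

# Hypothetical toy rows.
toy = pd.DataFrame({
    "reservation_status": ["Check-Out", "Canceled", "No-Show", "Check-Out", "Canceled"],
    "is_canceled":        [0,           1,          1,         0,           1],
})

table = pd.crosstab(toy["reservation_status"], toy["is_canceled"])
print(table)

# A row with a single non-zero cell means that status always maps to one target value.
leaks = bool(((table > 0).sum(axis=1) == 1).all())
print("status determines target:", leaks)
```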

Final preparation before comparing the models

df2 = df1.drop(columns = ['reservation_status_Canceled', 'reservation_status_Check-Out', 'reservation_status_No-Show'], axis = 1)

These three columns are the dummy variables generated from reservation_status, so all three must be dropped, not just reservation_status_Check-Out.

y = df2["is_canceled"]
X = df2.drop(["is_canceled"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 42)

Define helper functions that fit and evaluate a model and plot its ROC curve:

def model(algorithm, X_train, X_test, y_train, y_test):
    alg = algorithm
    alg_model = alg.fit(X_train, y_train)
    # y_prob and y_pred are globals so that ROC() can be called right after model()
    global y_prob, y_pred
    y_prob = alg.predict_proba(X_test)[:,1]
    y_pred = alg_model.predict(X_test)
    print('Accuracy Score: {}\n\nConfusion Matrix:\n {}'.format(accuracy_score(y_test,y_pred), confusion_matrix(y_test,y_pred)))

def ROC(y_test, y_prob):
    false_positive_rate, true_positive_rate, threshold = roc_curve(y_test, y_prob)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    plt.figure(figsize = (10,10))
    plt.title('Receiver Operating Characteristic')
    plt.plot(false_positive_rate, true_positive_rate, color = 'red', label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1], linestyle = '--')
    plt.axis('tight')
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

A note on predict_proba in sklearn and how it differs from predict: predict returns the class label, while predict_proba returns one probability per class. Column 1 is the probability of the positive class (is_canceled = 1), which is the score that roc_curve needs.
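
The difference between predict and predict_proba can be sketched on a toy problem. The data below is invented and perfectly separable, so the numbers say nothing about the hotel dataset; the point is only the shapes and the link to the ROC computation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical one-feature data: class 1 for larger x.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression(solver="liblinear").fit(X_toy, y_toy)

labels = clf.predict(X_toy)        # shape (6,), hard labels 0 or 1
proba = clf.predict_proba(X_toy)   # shape (6, 2), each row sums to 1
pos_scores = proba[:, 1]           # probability of class 1

# ROC curves are built from scores, not from hard labels.
fpr, tpr, thresholds = roc_curve(y_toy, pos_scores)
area_from_curve = auc(fpr, tpr)
area_direct = roc_auc_score(y_toy, pos_scores)
print(labels, pos_scores.round(3), area_from_curve, area_direct)
```

Passing hard labels to roc_auc_score, as the first decision tree cell does with y_pred, collapses the curve to a single threshold, so the probability scores are the better input whenever the model provides them.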

Model and ROC Curve Comparison

Logistic Regression Model

print('Model: Logistic Regression\n')
model(LogisticRegression(solver = "liblinear"), X_train, X_test, y_train, y_test)
Model: Logistic Regression

Accuracy Score: 0.8038742956463233

Confusion Matrix:
 [[20486  1867]
 [ 5129  8189]]

cross_val_score: 8-fold cross-validation on the full dataset

LogR = LogisticRegression(solver = "liblinear")
cv_scores = cross_val_score(LogR, X, y, cv = 8, scoring = 'accuracy')
print('Mean Score of CV: ', cv_scores.mean())
Mean Score of CV:  0.7701217519101682
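
The cross-validated mean is noticeably lower than the hold-out accuracy. One possible explanation, stated here as an assumption that has not been verified on this dataset, is that an integer cv builds contiguous, unshuffled folds, while train_test_split shuffles the rows. A sketch of cross-validation with explicitly shuffled stratified folds, on synthetic data from make_classification:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X and y; not the hotel data.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Shuffle before splitting so each fold mixes rows from the whole dataset.
cv = StratifiedKFold(n_splits=8, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(solver="liblinear"),
                         X_demo, y_demo, cv=cv, scoring="accuracy")
print("Mean Score of CV:", scores.mean())
```

If the rows of the real dataset are ordered (for example by hotel or by date), passing such a cv object instead of cv = 8 would show whether the gap comes from the fold layout.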
ROC(y_test, y_prob)

[Figure: output_76_0.png]

Gaussian Naive Bayes Model

print('Model: Gaussian Naive Bayes\n')
model(GaussianNB(), X_train, X_test, y_train, y_test)
Model: Gaussian Naive Bayes

Accuracy Score: 0.586246530795324

Confusion Matrix:
 [[ 9604 12749]
 [ 2010 11308]]
NB = GaussianNB()
cv_scores = cross_val_score(NB, X, y, cv = 8, scoring = 'accuracy')
print('Mean Score of CV: ', cv_scores.mean())
Mean Score of CV:  0.5624280984012298
ROC(y_test, y_prob)

[Figure: output_80_0.png]

Support Vector Classification Model

print('Model: SVC\n')

def model1(algorithm, X_train, X_test, y_train, y_test):
    # Variant of model() without predict_proba, which SVC only provides when probability=True
    alg = algorithm
    alg_model = alg.fit(X_train, y_train)
    global y_pred
    y_pred = alg_model.predict(X_test)
    print('Accuracy Score: {}\n\nConfusion Matrix:\n {}'.format(accuracy_score(y_test,y_pred), confusion_matrix(y_test,y_pred)))

model1(SVC(kernel = 'linear'), X_train, X_test, y_train, y_test)

Decision Tree Model

print('Model: Decision Tree\n')
model(DecisionTreeClassifier(max_depth = 12), X_train, X_test, y_train, y_test)
DTC = DecisionTreeClassifier(max_depth = 12)
cv_scores = cross_val_score(DTC, X, y, cv = 8, scoring = 'accuracy')
print('Mean Score of CV: ', cv_scores.mean())
Mean Score of CV:  0.6725617115938002
ROC(y_test, y_prob)

[Figure: output_86_0.png]

Random Forest

print('Model: Random Forest\n')
model(RandomForestClassifier(), X_train, X_test, y_train, y_test)
Model: Random Forest

Accuracy Score: 0.8835748927700373

Confusion Matrix:
 [[20946  1407]
 [ 2746 10572]]
RFC = RandomForestClassifier()
cv_scores = cross_val_score(RFC, X, y, cv = 8, scoring = 'accuracy')
print('Mean Score of CV: ', cv_scores.mean())
Mean Score of CV:  0.6697106885103477
ROC(y_test, y_prob)

[Figure: output_90_0.png]

Random Forest Model Tuning

rf_parameters = {"max_depth": [10,13],"n_estimators": [10,100,500],"min_samples_split": [2,5]}
rf_model = RandomForestClassifier()
rf_cv_model = GridSearchCV(rf_model, rf_parameters, cv = 10, n_jobs = -1, verbose = 2)
rf_cv_model.fit(X_train, y_train)
Fitting 10 folds for each of 12 candidates, totalling 120 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed: 12.5min finished
GridSearchCV(cv=10, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'max_depth': [10, 13], 'min_samples_split': [2, 5],
                         'n_estimators': [10, 100, 500]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=2)
print('Best parameters: ' + str(rf_cv_model.best_params_))
Best parameters: {'max_depth': 13, 'min_samples_split': 2, 'n_estimators': 500}
rf_tuned = RandomForestClassifier(max_depth = 13, min_samples_split = 2, n_estimators = 500)
print('Model: Random Forest Tuned\n')
model(rf_tuned, X_train, X_test, y_train, y_test)
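Model: Random Forest Tuned

Accuracy Score: 0.8515320568529057

Confusion Matrix:
 [[21151  1202]
 [ 4094  9224]]

The tuned model scores lower than the default model. The default model places no limit on tree depth, whereas the grid only tried depths of 10 and 13. Greater depth gives a better accuracy score here, but it may reduce generalization.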
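
One way to let the search compare against the unlimited-depth default is to include None among the max_depth candidates. The sketch below shows the idea on synthetic data with a deliberately small grid; the grid values and the data are assumptions chosen to keep the example fast, not tuned settings for the hotel dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train and y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# None means "no depth limit", so the default behaviour takes part in the comparison.
param_grid = {"max_depth": [5, 10, None], "n_estimators": [20, 50]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, n_jobs=1)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Fixing random_state also makes the comparison between candidates repeatable, which the untuned and tuned forests above do not guarantee.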

XGBoost Model

print('Model: XGBoost\n')
model(XGBClassifier(), X_train, X_test, y_train, y_test)
Model: XGBoost

Accuracy Score: 0.8696980740657677

Confusion Matrix:
 [[20570  1783]
 [ 2865 10453]]
XGB = XGBClassifier()
cv_scores = cross_val_score(XGB, X, y, cv = 8, scoring = 'accuracy')
print('Mean Score of CV: ', cv_scores.mean())
Mean Score of CV:  0.651031688035794
ROC(y_test, y_prob)

[Figure: output_101_0.png]

Neural Network Model

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print('Model: Neural Network\n')
model(MLPClassifier(), X_train_scaled, X_test_scaled, y_train, y_test)
Model: Neural Network

Accuracy Score: 0.8486445572033304

Confusion Matrix:
 [[20212  2141]
 [ 3258 10060]]
ROC(y_test, y_prob)

[Figure: output_105_0.png]

Neural Network Model Tuning

mlpc_parameters = {"alpha": [1, 0.1, 0.01, 0.001],"hidden_layer_sizes": [(50,50,50),(100,100)],"solver": ["adam", "sgd"],"activation": ["logistic", "relu"]}
mlpc = MLPClassifier()
mlpc_cv_model = GridSearchCV(mlpc, mlpc_parameters, cv = 10, n_jobs = -1, verbose = 2)
mlpc_cv_model.fit(X_train_scaled, y_train)
Fitting 10 folds for each of 32 candidates, totalling 320 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed: 13.5min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 123.4min
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed: 290.8min finished
GridSearchCV(cv=10, error_score=nan,
             estimator=MLPClassifier(activation='relu', alpha=0.0001,
                                     batch_size='auto', beta_1=0.9,
                                     beta_2=0.999, early_stopping=False,
                                     epsilon=1e-08, hidden_layer_sizes=(100,),
                                     learning_rate='constant',
                                     learning_rate_init=0.001, max_fun=15000,
                                     max_iter=200, momentum=0.9,
                                     n_iter_no_change=10,
                                     nesterovs_momentum=True, power_t=0.5,
                                     random_state=None, shuffle=True,
                                     solver='adam', tol=0.0001,
                                     validation_fraction=0.1, verbose=False,
                                     warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'activation': ['logistic', 'relu'],
                         'alpha': [1, 0.1, 0.01, 0.001],
                         'hidden_layer_sizes': [(50, 50, 50), (100, 100)],
                         'solver': ['adam', 'sgd']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=2)
print('Best parameters: ' + str(mlpc_cv_model.best_params_))
Best parameters: {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': (100, 100), 'solver': 'adam'}
mlpc_tuned = MLPClassifier(activation = 'relu',alpha = 0.1,hidden_layer_sizes = (100,100),solver = 'adam')
print('Model: Neural Network Tuned\n')
model(mlpc_tuned, X_train_scaled, X_test_scaled, y_train, y_test)
Model: Neural Network Tuned

Accuracy Score: 0.859409604440582

Confusion Matrix:
 [[20464  1889]
 [ 3126 10192]]
ROC(y_test, y_prob)

[Figure: output_112_0.png]

Conclusion

Feature Importances

randomf = RandomForestClassifier()
rf_model1 = randomf.fit(X_train, y_train)
pd.DataFrame(data = rf_model1.feature_importances_*100,
             columns = ["Importances"],
             index = X_train.columns).sort_values("Importances", ascending = False)[:15].plot(kind = "barh", color = "r")
plt.xlabel("Feature Importances (%)")
Text(0.5, 0, 'Feature Importances (%)')

[Figure: output_115_1.png]

Summary Table of the Models

# A "ROC | Auc" of 0 marks the models for which no ROC curve was computed (Support Vector, Random Forest Tuned).
table = pd.DataFrame({"Model": ["Decision Tree (reservation status included)", "Logistic Regression",
                                "Naive Bayes", "Support Vector", "Decision Tree", "Random Forest",
                                "Random Forest Tuned", "XGBoost", "Neural Network", "Neural Network Tuned"],
                      "Accuracy Scores": ["1", "0.804", "0.582", "0.794", "0.846",
                                          "0.883", "0.851", "0.869", "0.848", "0.859"],
                      "ROC | Auc": ["1", "0.88", "0.78", "0",
                                    "0.92", "0.95", "0", "0.94",
                                    "0.93", "0.94"]})
table["Model"] = table["Model"].astype("category")
table["Accuracy Scores"] = table["Accuracy Scores"].astype("float32")
table["ROC | Auc"] = table["ROC | Auc"].astype("float32")
pd.pivot_table(table, index = ["Model"]).sort_values(by = 'Accuracy Scores', ascending=False)

The summary uses a pandas pivot table:

                                             Accuracy Scores  ROC | Auc
Model
Decision Tree (reservation status included)            1.000       1.00
Random Forest                                          0.883       0.95
XGBoost                                                0.869       0.94
Neural Network Tuned                                   0.859       0.94
Random Forest Tuned                                    0.851       0.00
Neural Network                                         0.848       0.93
Decision Tree                                          0.846       0.92
Logistic Regression                                    0.804       0.88
Support Vector                                         0.794       0.00
Naive Bayes                                            0.582       0.78
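
Because every model appears exactly once, the pivot table's default aggregation (the mean) leaves the values unchanged, so the call mainly moves Model into the index. A sketch with three rows copied from the table above, compared with the more direct set_index:

```python
import pandas as pd

# Three rows taken from the summary table, for illustration.
table = pd.DataFrame({
    "Model": ["Random Forest", "XGBoost", "Naive Bayes"],
    "Accuracy Scores": [0.883, 0.869, 0.582],
    "ROC | Auc": [0.95, 0.94, 0.78],
})

# pivot_table averages per Model; with one row per model nothing changes.
pivoted = pd.pivot_table(table, index=["Model"]).sort_values(
    by="Accuracy Scores", ascending=False)

# Equivalent and more explicit for a table without duplicate models.
direct = table.set_index("Model").sort_values(
    by="Accuracy Scores", ascending=False)
print(pivoted)
```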
