[AI] Sklearn


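Related links:
Math for AI: linear algebra, calculus, probability, optimization
Python
numpy_pandas_matplotlib_spicy

Suggested learning path: machine learning -> deep learning -> reinforcement learning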


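Contents

  • Preprocessing
  • Model selection
  • Classification
    • Example: binary classification competition + grid search
    • Example: MNIST digit classification
  • Regression
  • Clustering
  • Dimensionality reduction
  • Comprehensive example 1: the iris dataset
  • Comprehensive example 2: using 8 different algorithms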


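Sklearn (full name Scikit-Learn) is a machine learning toolkit for Python. It is built on NumPy, SciPy, and Matplotlib, and works well alongside Pandas. Its API is well designed: every object exposes the same small set of methods, which makes it friendly to newcomers.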
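To make the "same small set of methods" point concrete, here is a minimal sketch of the usual construct / fit / predict / score cycle. The synthetic data from make_classification and the choice of KNeighborsClassifier are assumptions for illustration only; any other classifier can be swapped in without changing the remaining lines.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = KNeighborsClassifier(n_neighbors=5)  # 1. construct the estimator
clf.fit(X_train, y_train)                  # 2. learn from the training data
pred = clf.predict(X_test)                 # 3. predict labels for unseen data
acc = clf.score(X_test, y_test)            # 4. mean accuracy on held-out data
print(pred[:5], acc)
```

The examples later in this post follow the same four steps, only with different estimators.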

Official documentation: sklearn

Preprocessing

Model selection

Classification

Example: binary classification competition + grid search

import numpy as np
import pandas as pd

train_data = pd.read_csv('train_data.csv')
train_data.head()
# train_data
train_data.drop(['ID'], inplace=True, axis=1)
train_data.head()

# Split the training data into the input features and the target to predict
# (.copy() so that the column assignments below do not write into a view)
train_X = train_data.iloc[:, train_data.columns != 'y'].copy()
print(train_X.head())
train_y = train_data.iloc[:, train_data.columns == 'y']
print(train_y.head())

test_data = pd.read_csv('test_set.csv')
test_data.head()
test_data.drop(['ID'], inplace=True, axis=1)
test_data.head()

# Feature extraction
# LabelEncoder
# pd.Categorical().codes directly returns the integer code of each original value. See the pandas docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Categorical.html
# This is equivalent to label encoding
c = ['A','A','A','B','B','C','C','C','C']
category = pd.Categorical(c)
# Now inspect the codes of the category
print(category.codes)  # [0 0 0 1 1 2 2 2 2]
print(category.dtype)  # category

# factorize is another way to encode
job_feature = train_X['job'].unique()  # distinct values
# print(job_feature)
len(job_feature)
example = train_X  # note: an alias of train_X, not a copy
example['job'], uniques = pd.factorize(example['job'])
# pd.factorize: Encode the object as an enumerated type or categorical variable.
print(pd.factorize(example['job']))
# print(example['job'])
# example.head()
train_X['job'] = train_X['job'] + 1

marital_feature = train_X['marital'].unique()
print(marital_feature)
len(marital_feature)
train_X['marital'], unique = pd.factorize(train_X['marital'])
train_X['marital'] = train_X['marital'] + 1
train_X.head()

education_feature = train_X['education'].unique()
print(education_feature)
len(education_feature)
train_X['education'], unique = pd.factorize(train_X['education'])
train_X['education'] = train_X['education'] + 1
train_X.head()

contact_feature = train_X['contact'].unique()
print(contact_feature)
len(contact_feature)
train_X['contact'], unique = pd.factorize(train_X['contact'])
train_X['contact'] = train_X['contact'] + 1
train_X.head()

month_feature = train_X['month'].unique()
print(month_feature)
len(month_feature)
train_X['month'], unique = pd.factorize(train_X['month'])
train_X['month'] = train_X['month'] + 1
train_X.head()

poutcome_feature = train_X['poutcome'].unique()
print(poutcome_feature)
len(poutcome_feature)
train_X['poutcome'], unique = pd.factorize(train_X['poutcome'])
train_X['poutcome'] = train_X['poutcome'] + 1
train_X.head()

default_feature = train_X['default'].unique()
print(default_feature)
len(default_feature)
train_X['default'], unique = pd.factorize(train_X['default'])
train_X['default'] = train_X['default'] + 1
train_X.head()

housing_feature = train_X['housing'].unique()
print(housing_feature)
len(housing_feature)
train_X['housing'], unique = pd.factorize(train_X['housing'])
train_X['housing'] = train_X['housing'] + 1
train_X.head()

loan_feature = train_X['loan'].unique()
print(loan_feature)
len(loan_feature)
train_X['loan'], unique = pd.factorize(train_X['loan'])
train_X['loan'] = train_X['loan'] + 1
train_X.head()

# Convert the test set to numbers
# Caveat: factorize numbers categories in order of first appearance, so encoding
# the test set separately can give the same category a different integer than in train_X.
test_data.head()
test_data['job'], jnum = pd.factorize(test_data['job'])
test_data['job'] = test_data['job'] + 1
test_data.head()
test_data['marital'], jnum = pd.factorize(test_data['marital'])
test_data['marital'] = test_data['marital'] + 1
test_data['education'], jnum = pd.factorize(test_data['education'])
test_data['education'] = test_data['education'] + 1
test_data['default'], jnum = pd.factorize(test_data['default'])
test_data['default'] = test_data['default'] + 1
test_data['housing'], jnum = pd.factorize(test_data['housing'])
test_data['housing'] = test_data['housing'] + 1
test_data['loan'], jnum = pd.factorize(test_data['loan'])
test_data['loan'] = test_data['loan'] + 1
test_data['contact'], jnum = pd.factorize(test_data['contact'])
test_data['contact'] = test_data['contact'] + 1
test_data['month'], jnum = pd.factorize(test_data['month'])
test_data['month'] = test_data['month'] + 1
test_data['poutcome'], jnum = pd.factorize(test_data['poutcome'])
test_data['poutcome'] = test_data['poutcome'] + 1
test_data.head()

# LogisticRegression
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression()
LR.fit(train_X, train_y.values.ravel())  # pass the target as a 1-D array
# Predict on the test set
test_y = LR.predict(test_data)
test_y
df_test = pd.read_csv('test_set.csv')
df_test['pred'] = test_y.tolist()
df_result = df_test.loc[:, ['ID', 'pred']]
# Save the result
df_result.to_csv('LR.csv', index=False)

# SVM
from sklearn.svm import LinearSVC
classifierSVM = LinearSVC()
classifierSVM.fit(train_X, train_y.values.ravel())
test_ySVM = classifierSVM.predict(test_data)
df_test = pd.read_csv('test_set.csv')
df_test['pred'] = test_ySVM.tolist()
df_result = df_test.loc[:, ['ID', 'pred']]
df_result.to_csv('LSVM.csv', index=False)

# knn
# decision tree
# (The k-NN and decision-tree code is not shown in the original post;
#  test_yKNN and test_yTree below are their predictions on test_data.)

# Average prediction
test_yAver = (test_y + test_ySVM + test_yKNN + test_yTree) / 4
test_yAver  # array([0.  , 0.  , 0.  , ..., 0.25, 0.  , 0.25])
df_test = pd.read_csv('test_set.csv')
df_test['pred'] = test_yAver.tolist()
df_result = df_test.loc[:, ['ID', 'pred']]
df_result.to_csv('Aver.csv', index=False)

# Improve generalization
'''
GridSearchCV (grid search)
Exhaustive search over specified parameter values for an estimator.
The parameters of the estimator used to apply these methods are
optimized by cross-validated grid-search over a parameter grid.

param_grid:
e.g. {'n_estimators': list(range(10, 401, 10))}
In each round, one candidate in params is {'n_estimators': x}, with x taken from the list in order.
Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.

scoring: Strategy to evaluate the performance of the cross-validated model on the test set.

cv: Determines the cross-validation splitting strategy.

n_estimators:
The number of boosting stages to perform (that is, the number of trees).
Gradient boosting is fairly robust to over-fitting
so a large number usually results in better performance.
Values must be in the range [1, inf).

min_samples_split:
The minimum number of samples required to split an internal node.

min_samples_leaf:
The minimum number of samples required to be at a leaf node.
A split point at any depth will only be considered if it
leaves at least min_samples_leaf training samples in each of the left
and right branches.
This may have the effect of smoothing the model, especially in regression.
--------------
GradientBoostingClassifier
Based on decision trees (DT).

subsample: The fraction of samples to be used for fitting the individual base learners.

max_features: The number of features to consider when looking for the best split.
Choosing max_features < n_features leads to a reduction of variance and an increase in bias.
The search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
(In other words: if a node has not found a valid split yet, the search continues even after more than max_features features have been inspected.)

random_state: Controls the random seed given to each Tree estimator at each boosting iteration. In addition, it controls the random permutation of the features at each split (see Notes for more details).
'''
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Assumption: train_y2 is never defined in the original post;
# here it is taken to be the target column as a 1-D array.
train_y2 = train_y.values.ravel()

# Note: the iid argument and the grid_scores_ attribute used in the original
# have been removed from scikit-learn; cv_results_ replaces grid_scores_.

# Grid search over the number of boosting iterations
param_test1 = {'n_estimators': list(range(10, 401, 10))}
gsearch1 = GridSearchCV(estimator=GradientBoostingClassifier(learning_rate=0.1, max_features=None, subsample=0.8, random_state=10), param_grid=param_test1, scoring='roc_auc', cv=3)
gsearch1.fit(train_X.values, train_y2)
gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_
## {'n_estimators': 350}, 0.8979275309747781)
## With a suitable number of iterations found, start tuning the decision trees.
'''
grid_scores_ (cv_results_ in current scikit-learn):
the mean / std / params of every round

best_params_:
e.g. {'n_estimators': 350}, i.e. the setting with 350 rounds
Parameter setting that gave the best results on the hold out data.

best_score_:
Mean cross-validated score of the best_estimator
'''
# Grid search over max_depth and min_samples_split
param_test2 = {'max_depth': list(range(3, 14, 2)), 'min_samples_split': list(range(20, 100, 10))}
gsearch2 = GridSearchCV(estimator=GradientBoostingClassifier(learning_rate=0.1, n_estimators=350, min_samples_leaf=20, max_features=None, subsample=0.8, random_state=10), param_grid=param_test2, scoring='roc_auc', cv=3)
gsearch2.fit(train_X.values, train_y2)
gsearch2.cv_results_, gsearch2.best_params_, gsearch2.best_score_
# {'max_depth': 3, 'min_samples_split': 90}, 0.8973756708021962)
'''
The tree depth can be fixed now,
but the minimum number of samples needed for a split (min_samples_split) cannot be fixed yet,
because it interacts with other tree parameters.
Next, tune min_samples_split (minimum samples to split an internal node) together with min_samples_leaf (minimum samples at a leaf).
'''
param_test3 = {'min_samples_split': list(range(80, 1080, 100)), 'min_samples_leaf': list(range(60, 101, 10))}
gsearch3 = GridSearchCV(estimator=GradientBoostingClassifier(learning_rate=0.1, n_estimators=350, max_depth=3, max_features=None, subsample=0.8, random_state=10), param_grid=param_test3, scoring='roc_auc', cv=3)
gsearch3.fit(train_X.values, train_y2)
gsearch3.cv_results_, gsearch3.best_params_, gsearch3.best_score_
## {'min_samples_leaf': 60, 'min_samples_split': 280}, 0.8976660805899851)

## With these parameters tuned, plug them into GBDT and check the result
from sklearn import metrics  # needed for accuracy_score and roc_auc_score below

gbm1 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=350, max_depth=3, min_samples_leaf=60, min_samples_split=280, max_features=None, subsample=0.8, random_state=10)
gbm1.fit(train_X.values, train_y2)
y_pred = gbm1.predict(train_X.values)
y_predprob = gbm1.predict_proba(train_X.values)[:, 1]
print("Accuracy : %.4g" % metrics.accuracy_score(train_y.values, y_pred))
print("AUC score(Train):%f" % metrics.roc_auc_score(train_y, y_predprob))

## Grid search over the maximum number of features, max_features
## (iid and grid_scores_ from the original were removed from scikit-learn; cv_results_ is used instead)
param_test4 = {'max_features': list(range(4, 16, 2))}
gsearch4 = GridSearchCV(estimator=GradientBoostingClassifier(learning_rate=0.1, n_estimators=350, max_depth=3, min_samples_leaf=60, min_samples_split=280, subsample=0.8, random_state=10), param_grid=param_test4, scoring='roc_auc', cv=3)
gsearch4.fit(train_X.values, train_y2)
gsearch4.cv_results_, gsearch4.best_params_, gsearch4.best_score_
## {'max_features': 14}, 0.8971037288653009)

## Grid search over the subsampling ratio
param_test5 = {'subsample': [0.6, 0.7, 0.75, 0.8, 0.85, 0.9]}
gsearch5 = GridSearchCV(estimator=GradientBoostingClassifier(learning_rate=0.1, n_estimators=350, max_depth=3, min_samples_leaf=60, min_samples_split=280, max_features=14, random_state=10), param_grid=param_test5, scoring='roc_auc', cv=3)
gsearch5.fit(train_X.values, train_y2)
gsearch5.cv_results_, gsearch5.best_params_, gsearch5.best_score_
## {'subsample': 0.85}, 0.8976770026809427)

# All tuned parameters are now available. To improve generalization,
# halve the step size (learning_rate) and double the maximum number of iterations.
gbm2 = GradientBoostingClassifier(learning_rate=0.05, n_estimators=350, max_depth=3, min_samples_leaf=60, min_samples_split=280, max_features=14, subsample=0.85, random_state=10)
gbm2.fit(train_X.values, train_y2)
y_pred = gbm2.predict(train_X.values)
y_predprob = gbm2.predict_proba(train_X.values)[:, 1]
print("Accuracy : %.4g" % metrics.accuracy_score(train_y.values, y_pred))
print("AUC Score(Train): %f" % metrics.roc_auc_score(train_y, y_predprob))

gbm5 = GradientBoostingClassifier(learning_rate=0.05, n_estimators=700, max_depth=3, min_samples_leaf=60, min_samples_split=280, max_features=14, subsample=0.85, random_state=10)
gbm5.fit(train_X.values, train_y2)
y_pred = gbm5.predict(train_X.values)
y_predprob = gbm5.predict_proba(train_X.values)[:, 1]
print("Accuracy : %.4g" % metrics.accuracy_score(train_y.values, y_pred))
print("AUC Score(Train): %f" % metrics.roc_auc_score(train_y, y_predprob))

# Keep reducing the step size and increasing the number of iterations
gbm3 = GradientBoostingClassifier(learning_rate=0.01, n_estimators=350, max_depth=3, min_samples_leaf=60, min_samples_split=280, max_features=14, subsample=0.85, random_state=10)
gbm3.fit(train_X.values, train_y2)
y_pred = gbm3.predict(train_X.values)
y_predprob = gbm3.predict_proba(train_X.values)[:, 1]
print("Accuracy : %.4g" % metrics.accuracy_score(train_y.values, y_pred))
print("AUC Score(Train): %f" % metrics.roc_auc_score(train_y, y_predprob))

# Keep reducing the step size and increasing the number of iterations
gbm4 = GradientBoostingClassifier(learning_rate=0.01, n_estimators=600, max_depth=3, min_samples_leaf=60, min_samples_split=280, max_features=14, subsample=0.85, random_state=10)
gbm4.fit(train_X.values, train_y2)
y_pred = gbm4.predict(train_X.values)
y_predprob = gbm4.predict_proba(train_X.values)[:, 1]
print("Accuracy : %.4g" % metrics.accuracy_score(train_y.values, y_pred))
print("AUC Score(Train): %f" % metrics.roc_auc_score(train_y, y_predprob))

# Keep reducing the step size and increasing the number of iterations
gbm6 = GradientBoostingClassifier(learning_rate=0.005, n_estimators=1200, max_depth=3, min_samples_leaf=60, min_samples_split=280, max_features=14, subsample=0.85, random_state=10)
gbm6.fit(train_X.values, train_y2)
y_pred = gbm6.predict(train_X.values)
y_predprob = gbm6.predict_proba(train_X.values)[:, 1]
print("Accuracy : %.4g" % metrics.accuracy_score(train_y.values, y_pred))
print("AUC Score(Train): %f" % metrics.roc_auc_score(train_y, y_predprob))

gbm7 = GradientBoostingClassifier(learning_rate=0.05, n_estimators=1200, max_depth=3, min_samples_leaf=60, min_samples_split=280, max_features=14, subsample=0.85, random_state=10)
gbm7.fit(train_X.values, train_y2)
y_pred = gbm7.predict(train_X.values)
y_predprob = gbm7.predict_proba(train_X.values)[:, 1]
print("Accuracy : %.4g" % metrics.accuracy_score(train_y.values, y_pred))
print("AUC Score(Train): %f" % metrics.roc_auc_score(train_y, y_predprob))

gbm8 = GradientBoostingClassifier(learning_rate=0.01, n_estimators=1200, max_depth=3, min_samples_leaf=60, min_samples_split=280, max_features=14, subsample=0.85, random_state=10)
gbm8.fit(train_X.values, train_y2)
y_pred = gbm8.predict(train_X.values)
y_predprob = gbm8.predict_proba(train_X.values)[:, 1]
print("Accuracy : %.4g" % metrics.accuracy_score(train_y.values, y_pred))
print("AUC Score(Train): %f" % metrics.roc_auc_score(train_y, y_predprob))

# After trying these settings, gbm7 has the highest accuracy (0.954668, measured on the training set), so its predictions are saved
test_y_predprob = gbm7.predict_proba(test_data.values)[:, 1]
df_test['pred'] = test_y_predprob.tolist()
df_result = df_test.loc[:, ['ID', 'pred']]
df_result.to_csv('GBDToptimiza.csv', index=False)
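One detail of the encoding step in this example deserves a second look: pd.factorize is called separately on the training and test columns, and it numbers categories by order of first appearance. The sketch below uses two tiny made-up frames (hypothetical values, not the competition data) to show how the codes can disagree, and one possible way to share a single mapping through pd.Categorical. It is an illustration under those assumptions, not the author's pipeline.

```python
import pandas as pd

# Toy frames (hypothetical values) standing in for train_X and test_data.
train = pd.DataFrame({"job": ["admin", "technician", "admin", "services"]})
test = pd.DataFrame({"job": ["services", "admin", "technician"]})

# Separate factorize calls: codes follow order of first appearance,
# so "admin" is 0 in train but 1 in test.
train_codes_sep, _ = pd.factorize(train["job"])
test_codes_sep, _ = pd.factorize(test["job"])
print(train_codes_sep, test_codes_sep)

# Shared mapping: learn the category list once, on the training column...
categories = pd.Categorical(train["job"]).categories
# ...then encode both frames against it (+1 mirrors the offset used above).
train["job_code"] = pd.Categorical(train["job"], categories=categories).codes + 1
test["job_code"] = pd.Categorical(test["job"], categories=categories).codes + 1
print(train)
print(test)
```

With the shared mapping, a category that never appears in the training column gets code -1 from pd.Categorical, so it ends up as 0 after the +1 offset and is easy to spot.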

Example: MNIST digit classification

This example uses logistic regression.
Note that the accuracy of this l1-penalized linear model is significantly below what can be reached by an l2-penalized linear model or a non-linear multi-layer perceptron model on this dataset.

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import check_random_state

# Turn down for faster convergence
t0 = time.time()
train_samples = 10000

# Load data from https://www.openml.org/d/554
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
# type: ndarray
# y: labels
# X: matrix of 70000 images

random_state = check_random_state(0)  # returns <class 'numpy.random.mtrand.RandomState'>
permutation = random_state.permutation(X.shape[0])  # a random ordering of the 70000 indices
X = X[permutation]  # shuffle: reorder the images and labels by the permutation
y = y[permutation]
# X = X.reshape((X.shape[0], -1))  # not really needed here, X is already 70000*784

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=train_samples, test_size=10000
)

scaler = StandardScaler()  # both the training and the test set must be standardized
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Turn up tolerance for faster convergence
clf = LogisticRegression(C=50.0 / train_samples, penalty="l1", solver="saga", tol=0.1)
# C: inverse of regularization strength; the smaller C is, the stronger the regularization
# solver: algorithm to use in the optimization problem; saga suits larger datasets
# tol: tolerance for the stopping criteria, i.e. when to stop
clf.fit(X_train, y_train)
# print(clf.coef_.shape)  # 10 * 784 == 7840 coefficients in total
print(np.mean(clf.coef_ == 0))  # coef_ holds the model coefficients (weights); True=1, False=0 when taking the mean
# print(np.sum(clf.coef_ == 0))
# print(np.sum(clf.coef_ != 0))

sparsity = np.mean(clf.coef_ == 0) * 100  # share of coefficients that are exactly zero
# this measures how sparse the model is
# equivalent to np.sum(clf.coef_ == 0) / (clf.coef_.shape[0] * clf.coef_.shape[1])
score = clf.score(X_test, y_test)
# print('Best C % .4f' % clf.C_)
print("Sparsity with L1 penalty: %.2f%%" % sparsity)
print("Test score with L1 penalty: %.4f" % score)

coef = clf.coef_.copy()
plt.figure(figsize=(10, 5))
scale = np.abs(coef).max()  # largest absolute coefficient value
for i in range(10):
    l1_plot = plt.subplot(2, 5, i + 1)  # place the (i+1)-th subplot
    # the coefficients of each class, drawn as an image, roughly outline the digit
    l1_plot.imshow(
        coef[i].reshape(28, 28),
        interpolation="nearest",  # interpolation method
        cmap=plt.cm.RdBu,
        vmin=-scale,
        vmax=scale,
    )
    l1_plot.set_xticks(())
    l1_plot.set_yticks(())
    l1_plot.set_xlabel("Class %i" % i)
plt.suptitle("Classification vector for...")

run_time = time.time() - t0
print("Example run in %.3f s" % run_time)
plt.show()
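The note at the top of this example contrasts the L1 penalty with the L2 penalty. A rough way to see the difference without downloading MNIST is to repeat the experiment on the small digits dataset bundled with scikit-learn. The sketch below does that; load_digits, the value C=0.05, and the other settings are assumptions chosen to keep the run short, so the numbers are not comparable to the MNIST figures.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Small 8x8 digits dataset bundled with scikit-learn (stand-in for MNIST).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

results = {}
for penalty in ("l1", "l2"):
    clf = LogisticRegression(
        C=0.05, penalty=penalty, solver="saga",
        tol=0.01, max_iter=200, random_state=0,
    )
    clf.fit(X_train, y_train)
    sparsity = np.mean(clf.coef_ == 0) * 100  # percent of exactly-zero weights
    results[penalty] = (sparsity, clf.score(X_test, y_test))
    print("%s: sparsity %.2f%%, test score %.4f"
          % (penalty, sparsity, results[penalty][1]))
```

The L1 run should report a clearly larger share of zero coefficients, which is the property the sparsity printout in the MNIST example measures.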


Regression

Clustering

Dimensionality reduction

Comprehensive example 1: the iris dataset

# Download the iris dataset
import seaborn as sns
iris = sns.load_dataset("iris")

# Inspect the data
type(iris)  # pandas.core.frame.DataFrame
iris.shape  # (150, 5)
iris.head()
iris.info()
iris.describe()
iris.species.value_counts()  # number of samples in each of the 3 classes
sns.pairplot(data=iris, hue="species")  # color by species; one panel for every pair of features

# Data cleaning
iris_simple = iris.drop(["sepal_length", "sepal_width"], axis=1)
iris_simple.head()
# these two columns are now removed

# Label encoding
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
iris_simple["species"] = encoder.fit_transform(iris_simple["species"])
# the species strings are now encoded as ints

# Standardize the dataset
from sklearn.preprocessing import StandardScaler
import pandas as pd
trans = StandardScaler()
_iris_simple = trans.fit_transform(iris_simple[["petal_length", "petal_width"]])
_iris_simple = pd.DataFrame(_iris_simple, columns = ["petal_length", "petal_width"])
_iris_simple.describe()
# Note: _iris_simple is only inspected here; the split below uses the unscaled iris_simple.

# Build the training and test sets
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(iris_simple, test_size=0.2)
test_set.head()

iris_x_train = train_set[["petal_length", "petal_width"]]
iris_x_train.head()

iris_y_train = train_set["species"].copy()
iris_y_train.head()

iris_x_test = test_set[["petal_length", "petal_width"]]
iris_x_test.head()

iris_y_test = test_set["species"].copy()
iris_y_test.head()
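In the code above the standardized frame is computed but the split is taken from the unscaled data. If scaling is wanted during training, one common pattern is to put the scaler and the model into a single pipeline, so that the scaler is fitted on the training part only. The sketch below assumes scikit-learn's bundled load_iris (with its own column names) instead of the seaborn download, plus a fixed random_state; both are choices made here for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]]  # same two features
y = iris.target                                           # already integer-coded

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The pipeline fits the scaler on X_train only, then reuses it on X_test.
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print("accuracy: {:.0%}".format(acc))
```

For tree-based models scaling makes little difference, but distance-based models such as k-NN and SVM are sensitive to feature scale.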

Different machine learning algorithms are now applied to the dataset above.

  • k-nearest neighbors
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()  # create a classifier object
clf
clf.fit(iris_x_train, iris_y_train)  # train
res = clf.predict(iris_x_test)  # predict
print(res)
print(iris_y_test.values)  # print both for comparison

# Reverse the encoding: turn the ints back into the original class strings
encoder.inverse_transform(res)

# Evaluate
accuracy = clf.score(iris_x_test, iris_y_test)
print("Prediction accuracy: {:.0%}".format(accuracy))

# Save the data
out = iris_x_test.copy()
out["y"] = iris_y_test
out["pre"] = res  # prediction
out
out.to_csv("iris_predict.csv")

# Visualization
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

def draw(clf):
    # Build a grid
    M, N = 500, 500
    x1_min, x2_min = iris_simple[["petal_length", "petal_width"]].min(axis=0)
    x1_max, x2_max = iris_simple[["petal_length", "petal_width"]].max(axis=0)
    t1 = np.linspace(x1_min, x1_max, M)
    t2 = np.linspace(x2_min, x2_max, N)
    x1, x2 = np.meshgrid(t1, t2)  # turn the two vectors into coordinate arrays
    # Predict
    x_show = np.stack((x1.flat, x2.flat), axis=1)  # stack as columns
    y_predict = clf.predict(x_show)
    # Colors
    cm_light = mpl.colors.ListedColormap(["#A0FFA0", "#FFA0A0", "#A0A0FF"])
    cm_dark = mpl.colors.ListedColormap(["g", "r", "b"])
    # Draw the predicted regions
    plt.figure(figsize=(10, 6))
    plt.pcolormesh(t1, t2, y_predict.reshape(x1.shape), cmap=cm_light)  # Create a pseudocolor plot with a non-regular rectangular grid.
    # Draw the original data points
    plt.scatter(iris_simple["petal_length"], iris_simple["petal_width"], label=None,
                c=iris_simple["species"], cmap=cm_dark, marker='o', edgecolors='k')
    plt.xlabel("petal_length")
    plt.ylabel("petal_width")
    # Draw the legend
    color = ["g", "r", "b"]
    # LabelEncoder assigns codes in alphabetical order: 0=setosa, 1=versicolor, 2=virginica
    species = ["setosa", "versicolor", "virginica"]
    for i in range(3):
        plt.scatter([], [], c=color[i], s=40, label=species[i])  # empty points, used only for the legend
    plt.legend(loc="best")  # place the legend; "best" picks the best location
    plt.title('iris_classfier')

draw(clf)
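The grid step inside draw is the least obvious part of the helper: np.meshgrid expands two 1-D axes into two 2-D coordinate arrays, and stacking their flattened versions gives one (x1, x2) row per grid point, which is the shape predict expects. A tiny sketch with a 3 by 2 grid (values chosen here only for illustration) makes the shapes visible:

```python
import numpy as np

t1 = np.linspace(0.0, 1.0, 3)    # 3 values along the first feature
t2 = np.linspace(10.0, 20.0, 2)  # 2 values along the second feature

x1, x2 = np.meshgrid(t1, t2)     # both arrays have shape (2, 3)
# One row per grid point: column 0 is the first feature, column 1 the second.
x_show = np.stack((x1.ravel(), x2.ravel()), axis=1)

print(x1.shape, x_show.shape)
print(x_show)
```

After prediction, reshaping the flat result back to x1.shape restores the grid layout that pcolormesh needs.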
  • Naive Bayes
    Question it answers: when X=(x1, x2) is observed, which class yk has the highest probability?
# Same steps as before
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()  # create the classifier object
clf.fit(iris_x_train, iris_y_train)  # train
res = clf.predict(iris_x_test)  # predict
print(res)
print(iris_y_test.values)
accuracy = clf.score(iris_x_test, iris_y_test)  # evaluate
print("Prediction accuracy: {:.0%}".format(accuracy))
draw(clf)  # visualize
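The "which yk is most probable" question can be checked directly: predict_proba returns one posterior probability per class, and the predicted label is the class with the largest value. The sketch below assumes scikit-learn's bundled load_iris and the two petal columns, and fits on the full data for brevity.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X = X[:, 2:4]                      # petal length and petal width
nb = GaussianNB().fit(X, y)

proba = nb.predict_proba(X[:5])    # P(y = ck | x) for each class, per sample
picked = nb.classes_[np.argmax(proba, axis=1)]  # class with highest posterior

print(proba.round(3))
print(picked, nb.predict(X[:5]))   # the two label arrays agree
```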
  • Decision tree
    CART algorithm: at each step, use one feature to split the data into two subsets that are as pure as possible, then keep splitting recursively.
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(iris_x_train, iris_y_train)
res = clf.predict(iris_x_test)
print(res)
print(iris_y_test.values)
accuracy = clf.score(iris_x_test, iris_y_test)
print("Prediction accuracy: {:.0%}".format(accuracy))
draw(clf)
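"As pure as possible" is usually measured with Gini impurity, which is also the default criterion of DecisionTreeClassifier. The sketch below computes it by hand for a toy feature and compares two candidate thresholds; the helper names gini and split_impurity are made up for this illustration and are not part of scikit-learn.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 (0 means the node is pure)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(x, y, threshold):
    """Size-weighted impurity of the two children of the split x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

x = np.array([1.0, 2.0, 3.0, 4.0])  # one toy feature
y = np.array([0, 0, 1, 1])          # two classes

print(gini(y))                      # impurity before splitting
print(split_impurity(x, y, 1.5))    # an uneven split, still impure
print(split_impurity(x, y, 2.5))    # separates the classes completely
```

CART evaluates such candidate splits for every feature, keeps the one with the lowest weighted impurity, and repeats the procedure inside each child node.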
  • Logistic regression
    Training: a mapping turns the features X=(x1, x2) into P(y=ck); find the parameters of the mapping that maximize the product of these probabilities over all samples (maximum likelihood).
    Prediction: compute P(y=ck) and take the class with the highest probability as the predicted class.
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(solver='saga', max_iter=1000)
'''
solver: Algorithm to use in the optimization problem.
Default is 'lbfgs'.
- For small datasets, 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large ones;
- For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle multinomial loss;
- 'liblinear' and 'newton-cholesky' can only handle binary classification by default. To apply a one-versus-rest scheme for the multiclass setting one can wrap it with the OneVsRestClassifier.
- 'newton-cholesky' is a good choice for n_samples >> n_features, especially with one-hot encoded categorical features with rare categories. Be aware that the memory usage of this solver has a quadratic dependency on n_features because it explicitly computes the Hessian matrix.
'''
clf.fit(iris_x_train, iris_y_train)
res = clf.predict(iris_x_test)
print(res)
print(iris_y_test.values)
accuracy = clf.score(iris_x_test, iris_y_test)
print("Prediction accuracy: {:.0%}".format(accuracy))
draw(clf)
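For the multiclass case, the mapping from features to P(y=ck) is a linear score per class followed by a softmax. The sketch below recomputes the probabilities by hand from decision_function and compares them with predict_proba. It assumes a recent scikit-learn, where the default lbfgs solver fits a multinomial model, and uses the bundled load_iris data fitted on all samples for brevity.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X = X[:, 2:4]                                   # petal length and petal width
clf = LogisticRegression(max_iter=1000).fit(X, y)

scores = clf.decision_function(X[:5])           # one linear score per class
# Softmax, shifted by the row maximum for numerical stability.
exp = np.exp(scores - scores.max(axis=1, keepdims=True))
softmax = exp / exp.sum(axis=1, keepdims=True)

pred = clf.classes_[softmax.argmax(axis=1)]     # most probable class
print(softmax.round(3))
print(pred)
```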
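  • Support vector machine
    Taking binary classification as an example, and assuming the data can be separated completely:
    use a hyperplane that separates the two classes completely while maximizing the distance from the nearest points to the plane.
from sklearn.svm import SVC
clf = SVC()
clf  # print it to see its parameters
clf.fit(iris_x_train, iris_y_train)
res = clf.predict(iris_x_test)
print(res)
print(iris_y_test.values)
accuracy = clf.score(iris_x_test, iris_y_test)
print("Prediction accuracy: {:.0%}".format(accuracy))
draw(clf)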
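The maximum-margin idea can be seen on a four-point toy problem. With a linear kernel, the separating hyperplane is w·x + b = 0 and the width of the margin is 2 / ||w||. The sketch below uses made-up points and a very large C to approximate the hard-margin case described above; note that the iris code uses the default RBF kernel, for which coef_ is not available.

```python
import numpy as np
from sklearn.svm import SVC

# Two classes separated along the first axis (toy data for illustration).
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C approximates a hard margin
w = clf.coef_[0]
b = clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)             # distance between the two margin lines

print("w =", w, "b =", b)
print("margin width =", margin)
print("support vectors:\n", clf.support_vectors_)
```

Here the two classes sit at x1 = 0 and x1 = 2, so the widest possible gap between them is 2, and the fitted margin should come out close to that value.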
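  • Ensemble method: random forest
    Given a training set of size m, draw m samples at random with replacement to form one sample set; draw n such sample sets in total.
    Train one weak classifier on each of the n sample sets, giving n weak classifiers. Weak classifiers are usually decision trees or neural networks.
    Combine the n weak classifiers into a strong classifier.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf
clf.fit(iris_x_train, iris_y_train)
res = clf.predict(iris_x_test)
print(res)
print(iris_y_test.values)
accuracy = clf.score(iris_x_test, iris_y_test)
print("Prediction accuracy: {:.0%}".format(accuracy))
draw(clf)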
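The sampling step described above (bootstrap sampling) is easy to try in isolation. Drawing m indices with replacement from m rows leaves roughly 63% of the distinct rows in each sample set, which is why every tree sees a somewhat different dataset. The sizes and the seed below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000        # size of the training set
n_sets = 5        # number of bootstrap sample sets

fractions = []
for _ in range(n_sets):
    idx = rng.integers(0, m, size=m)            # m draws with replacement
    fractions.append(np.unique(idx).size / m)   # share of distinct rows drawn

print(fractions)
print("theoretical limit 1 - 1/e =", 1 - np.exp(-1))
```

The rows that a given sample set misses are called out-of-bag samples; RandomForestClassifier can use them for a built-in validation estimate when oob_score=True is set.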
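  • Ensemble method: AdaBoost
    Given a training set of size m, train the first weak classifier with the initial sample weights, compute the classifier's coefficient from its error rate, and update the sample weights.
    Train the second weak classifier with the new weights, and so on.
    Combine all weak classifiers into a strong classifier as a weighted sum, using their coefficients.
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier()
clf
clf.fit(iris_x_train, iris_y_train)
res = clf.predict(iris_x_test)
print(res)
print(iris_y_test.values)
accuracy = clf.score(iris_x_test, iris_y_test)
print("Prediction accuracy: {:.0%}".format(accuracy))
draw(clf)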
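One round of the weight update described above can be worked through by hand. The sketch below follows the classic binary AdaBoost formulas with labels in {-1, +1}; scikit-learn's AdaBoostClassifier uses a multiclass variant (SAMME), so this is an illustration of the idea rather than a copy of its internals. The five labels and predictions are made up.

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, 1, -1, -1, -1])       # the weak learner misses the last sample
w = np.full(len(y_true), 1 / len(y_true))   # initial uniform sample weights

miss = y_pred != y_true
err = np.sum(w[miss])                       # weighted error rate
alpha = 0.5 * np.log((1 - err) / err)       # coefficient of this weak learner

# Misclassified samples are scaled up, correct ones scaled down, then renormalized.
w_new = w * np.exp(-alpha * y_true * y_pred)
w_new /= w_new.sum()

print("err =", err, "alpha =", alpha)
print("new weights:", w_new)
```

After the update, the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.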
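  • Ensemble method: gradient boosted decision trees (GBDT)
    Given a training set of size m, fit a first weak learner, compute the residuals, then keep fitting new learners to the residuals.
    Add up all the weak learners to obtain the strong learner.
    (In mathematical statistics, a residual is the difference between the observed value and the estimated (fitted) value.)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier()
clf
clf.fit(iris_x_train, iris_y_train)
res = clf.predict(iris_x_test)
print(res)
print(iris_y_test.values)
accuracy = clf.score(iris_x_test, iris_y_test)
print("Prediction accuracy: {:.0%}".format(accuracy))
draw(clf)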
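The "keep fitting the residuals" loop is easiest to see in a regression setting with squared loss, where the negative gradient is exactly the residual. The sketch below builds such a loop by hand with small regression trees on synthetic data; the data, learning rate, and tree depth are assumptions for illustration. GradientBoostingClassifier itself fits trees to gradients of a classification loss, so this shows the principle, not its exact internals.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=(80, 1)), axis=0)
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=80)   # noisy sine curve

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # the first "model" is just a constant
mse_start = np.mean((y - prediction) ** 2)

trees = []
for _ in range(50):
    residual = y - prediction                         # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)     # add the new weak learner
    trees.append(tree)

mse_end = np.mean((y - prediction) ** 2)
print("training MSE: %.4f -> %.4f" % (mse_start, mse_end))
```

A smaller learning_rate needs more trees to reach the same fit, which is the trade-off explored with learning_rate and n_estimators in the grid search example earlier.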
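  • More commonly used models
    [1] xgboost
    GBDT expands the loss function only to first order in the error term (negative gradient, first-order Taylor expansion).
    XGBoost uses a second-order Taylor expansion of the loss, which is more accurate and converges faster.

[2] lightgbm
From Microsoft: a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms.

[3] stacking
Stacking, also called model blending.
First train several simple models; a second-level learner is then trained on the predictions of the first-level models.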
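scikit-learn ships a StackingClassifier that implements this two-level idea. The sketch below is one possible configuration, assumed here for illustration: k-NN and a random forest as first-level models, logistic regression as the second-level learner, on the bundled iris data.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # second-level learner
    cv=5,  # first-level predictions come from cross-validation folds
)
stack.fit(X_train, y_train)
acc = stack.score(X_test, y_test)
print("accuracy: {:.0%}".format(acc))
```

The cv argument matters: the second-level learner is trained on out-of-fold predictions, so it does not simply learn to trust models that memorized the training set.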

[4] Neural networks

Comprehensive example 2: using 8 different algorithms

Use 8 different algorithms.

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas_profiling as ppf
import seaborn as sns

def load_data(file_path):
    '''
    Load the data.
    :param file_path: path to the data file
    :return: the data as a list of rows
    '''
    f = open(file_path)
    data = []
    for line in f.readlines():
        row = []  # holds one line
        lines = line.strip().split("\t")
        for x in lines:
            row.append(x)
        data.append(row)
    f.close()
    return data

data = load_data('datingTestSet.txt')
# data
# Column names, in order: flight distance per year, percentage of time spent playing
# video games, liters of ice cream consumed per week, degree of liking
data = pd.DataFrame(data, columns=['每年的飞行距离', '玩视频游戏所耗时间的百分比', '每周消费冰激凌的公升数', '喜欢的程度'])
data = data.astype(float)
# data['喜欢的程度'] = data['喜欢的程度'].astype(int)
data['喜欢的程度'].value_counts()  # number of rows for each value
ppf.ProfileReport(data)  # output the report

# Workaround on Windows for Chinese text in sns.pairplot()
from matplotlib.font_manager import FontProperties
myfont = FontProperties(fname=r'C:\Windows\Fonts\simhei.ttf', size=14)
sns.set(font=myfont.get_name())
sns.pairplot(data=data, hue='喜欢的程度')

# Data preprocessing: label encoding, handling missing values, standardization
# This example needs no label encoding and has no missing values; it does need standardization
from sklearn.preprocessing import StandardScaler
trans = StandardScaler()
data_simple = trans.fit_transform(data[['每年的飞行距离', '玩视频游戏所耗时间的百分比', '每周消费冰激凌的公升数']])
# Note: the next line selects columns from the raw data, so the scaled array above
# is discarded and everything below works on unscaled values.
data_simple = pd.DataFrame(data, columns=['每年的飞行距离', '玩视频游戏所耗时间的百分比', '每周消费冰激凌的公升数'])
data_simple.head(10)

# Build the training and test sets
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(data, test_size=0.2)
train_set.head()

data_x_train = train_set[['每年的飞行距离', '玩视频游戏所耗时间的百分比', '每周消费冰激凌的公升数']]
data_y_train = train_set['喜欢的程度'].copy()
# data_x_train.head()
data_y_train.head()

data_x_test = test_set[['每年的飞行距离', '玩视频游戏所耗时间的百分比', '每周消费冰激凌的公升数']]
data_y_test = test_set['喜欢的程度'].copy()

# Use 8 different algorithms to train on the dataset, obtain classification models,
# test them on the test set, and finally save the predictions to local files
# k-nearest neighbors
# Naive Bayes
# Decision tree
# Logistic regression
# Support vector machine
# Ensemble method: random forest
# Ensemble method: AdaBoost
# Ensemble method: gradient boosted decision trees (GBDT)

# Pick one algorithm that performs well and compare model performance with and without one unimportant feature
data = data.drop(['每周消费冰激凌的公升数'], axis=1)
data_simple = trans.fit_transform(data[['每年的飞行距离', '玩视频游戏所耗时间的百分比']])
data_simple = pd.DataFrame(data, columns=['每年的飞行距离', '玩视频游戏所耗时间的百分比'])  # raw values again, as above
data_simple.head(10)
# data.head()

train_set, test_set = train_test_split(data, test_size=0.2)
train_set.head()

data_x_train = train_set[['每年的飞行距离', '玩视频游戏所耗时间的百分比']]
data_y_train = train_set['喜欢的程度'].copy()
data_y_train.head()

data_x_test = test_set[['每年的飞行距离', '玩视频游戏所耗时间的百分比']]
data_y_test = test_set['喜欢的程度'].copy()

from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier()
clf.fit(data_x_train, data_y_train)
res = clf.predict(data_x_test)  # predictions
accuracy = clf.score(data_x_test, data_y_test)
print("Prediction accuracy: {:.0%}".format(accuracy))

# Visualization
def draw(clf):
    # Build a grid
    M, N = 500, 500
    x1_min, x2_min = data_simple[['每年的飞行距离', '玩视频游戏所耗时间的百分比']].min(axis=0)
    x1_max, x2_max = data_simple[['每年的飞行距离', '玩视频游戏所耗时间的百分比']].max(axis=0)
    t1 = np.linspace(x1_min, x1_max, M)
    t2 = np.linspace(x2_min, x2_max, N)
    x1, x2 = np.meshgrid(t1, t2)
    # Predict
    x_show = np.stack((x1.flat, x2.flat), axis=1)
    y_predict = clf.predict(x_show)
    # Colors
    cm_light = mpl.colors.ListedColormap(["#A0FFA0", "#FFA0A0", "#A0A0FF"])
    cm_dark = mpl.colors.ListedColormap(["g", "r", "b"])
    # Draw the predicted regions
    plt.figure(figsize=(10, 6))
    plt.pcolormesh(t1, t2, y_predict.reshape(x1.shape), cmap=cm_light)
    # Draw the original data points
    plt.scatter(data_simple["每年的飞行距离"], data_simple["玩视频游戏所耗时间的百分比"], label=None,
                c=data["喜欢的程度"], cmap=cm_dark, marker='o', edgecolors='k')
    plt.xlabel("每年的飞行距离")
    plt.ylabel("玩视频游戏所耗时间的百分比")
    # Draw the legend
    color = ["g", "r", "b"]
    species = ["1", "2", "3"]
    for i in range(3):
        plt.scatter([], [], c=color[i], s=40, label=species[i])  # empty points, used only for the legend
        # s: The marker size in points**2 (typographic points are 1/72 in.)
    plt.legend(loc="best")
    plt.title('data_classfier')

draw(clf)  # call the helper, as in the iris example
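The eight algorithms listed as comments in this example all share the fit / score interface, so they can be compared in a single loop. The sketch below is one way to do that; it runs on synthetic three-feature, three-class data from make_classification (an assumption, since datingTestSet.txt is not available here), and the dictionary keys are just display names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dating dataset: 3 features, 3 classes.
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "k-NN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Random forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "GBDT": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)                   # same call for every model
    results[name] = model.score(X_test, y_test)   # accuracy on the test set

for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.0%}")
```

To use it on the real data, replace X_train, X_test, y_train, and y_test with data_x_train, data_x_test, data_y_train, and data_y_test from the code above.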
