Python机器学习笔记（十七、分箱、离散化、线性模型与树）

数据表示的最佳方法：取决于数据的语义，所使用的模型种类。

线性模型与基于树的模型（决策树、梯度提升树和随机森林）是两种成员很多同时又非常常用的模型，它们在处理不同的特征表示时就具有非常不同的性质。我们使用wave回归数据集（只有一个输入特征）。先学习线性回归模型与决策树回归在这个数据集上的对比表现：

import numpy as np
import mglearn
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
X, y = mglearn.datasets.make_wave(n_samples=100)
line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
reg = DecisionTreeRegressor(min_samples_split=3).fit(X, y)
plt.plot(line, reg.predict(line), label="decision tree")
reg = LinearRegression().fit(X, y)
plt.plot(line, reg.predict(line), label="linear regression")
plt.plot(X[:, 0], y, 'o', c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")
plt.show()

输出图形：

如图：线性模型只能对线性关系建模，单个特征的情况就是直线。决策树可以构建更为复杂的数据模型，但强烈依赖于数据表示。有一种方法可以让线性模型在连续数据上变得更加强大，就是使用特征分箱（binning，也叫离散化discretization）将其划分为多个特征，即：假设将特征的输入范围（在这个例子中是从-3到3）划分成固定个数的箱子（bin），比如10个，那么数据点就可以用它所在的箱子来表示。为了确定这一点，我们首先需要定义箱子。在下面的例子中，我们在-3和3之间定义10个均匀分布的箱子。用np.linspace函数创建11个元素，从而创建10个箱子，即两个连续边界之间的空间：

import numpy as npbins = np.linspace(-3, 3, 11)
print("bins: {}".format(bins))

输出：bins: [-3. -2.4 -1.8 -1.2 -0.6 0. 0.6 1.2 1.8 2.4 3. ]

第一个箱子包含特征取值在-3到-2.4之间的所有数据点，第二个箱子包含特征取值在-2.4到-1.8之间的所有数据点，以此类推。然后，我们记录每个数据点所属的箱子。这可以用 np.digitize 函数轻松计算出来：

import numpy as np
import mglearnX, y = mglearn.datasets.make_wave(n_samples=100)
bins = np.linspace(-3, 3, 11)
which_bin = np.digitize(X, bins=bins)
print("\nData points:\n", X[:5])
print("\nBin membership for data points:\n", which_bin[:5])

输出结果：

Data points:
[[-0.75275929]
[ 2.70428584]
[ 1.39196365]
[ 0.59195091]
[-2.06388816]]

Bin membership for data points:
[[ 4]
[10]
[ 8]
[ 6]
[ 2]]

这里做的是将wave数据集中单个连续输入特征变换为一个分类特征，用于表示数据点所在的箱子。要想在这个数据上使用scikit-learn模型，可以利用preprocessing模块的OneHotEncoder将这个离散特征变换为one-hot编码。OneHotEncoder实现的编码与pandas.get_dummies相同，但目前它只适用于值为整数的分类变量：

import numpy as np
import mglearn
from sklearn.preprocessing import OneHotEncoderX, y = mglearn.datasets.make_wave(n_samples=100)
bins = np.linspace(-3, 3, 11)
which_bin = np.digitize(X, bins=bins)# 使用OneHotEncoder进行变换
encoder = OneHotEncoder(sparse=False)
# encoder.fit找到which_bin中的唯一值
encoder.fit(which_bin)
# transform创建one-hot编码
X_binned = encoder.transform(which_bin)
print(X_binned[:5])

这里会输出一个异常：

Traceback (most recent call last):
File "d:\Softs\python-workspace\ml\test.py", line 27, in <module>
encoder = OneHotEncoder(sparse=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: OneHotEncoder.__init__() got an unexpected keyword argument 'sparse'

这个问题的原因是，在较新版本的scikit-learn库中（从0.23版本开始），OneHotEncoder类的构造函数不再接受sparse参数。在旧版本中（0.22及以下），sparse参数用来控制编码后的输出是稀疏矩阵（sparse=True，默认）还是密集数组（sparse=False）。

为了解决这个问题，可以根据scikit-learn版本采取不同的措施：scikit-learn 0.23或更高版本‌移除sparse=False参数，因为在新版本中，OneHotEncoder默认返回密集数组（numpy数组），不再支持生成稀疏矩阵。如果需要生成稀疏矩阵，则降低scikit-learn的版本，或用toarray()转换。

import numpy as np
import mglearn
from sklearn.preprocessing import OneHotEncoderX, y = mglearn.datasets.make_wave(n_samples=100)
bins = np.linspace(-3, 3, 11)
which_bin = np.digitize(X, bins=bins)# 使用OneHotEncoder进行变换
encoder = OneHotEncoder()
# encoder.fit找到which_bin中的唯一值
encoder.fit(which_bin)
# transform创建one-hot编码
X_binned = encoder.transform(which_bin)
print(X_binned[:5])
print(X_binned.toarray()[:10])

输出结果：

<Compressed Sparse Row sparse matrix of dtype 'float64'
with 5 stored elements and shape (5, 10)>
Coords Values
(0, 3) 1.0
(1, 9) 1.0
(2, 7) 1.0
(3, 5) 1.0
(4, 1) 1.0
[[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]]

如上结果中第一行 (0, 3) 1.0 表示：第0行第3列值为1.0，其余列为0。由于我们指定了 10 个箱子，所以变换后的 X_binned 数据集现在包含10个特征：

import numpy as np
import mglearn
from sklearn.preprocessing import OneHotEncoderX, y = mglearn.datasets.make_wave(n_samples=100)
bins = np.linspace(-3, 3, 11)
which_bin = np.digitize(X, bins=bins)# 使用OneHotEncoder进行变换
encoder = OneHotEncoder()
# encoder.fit找到which_bin中的唯一值
encoder.fit(which_bin)
# transform创建one-hot编码
X_binned = encoder.transform(which_bin)
#print(X_binned[:5])
print("X_binned.shape: {}".format(X_binned.shape))

输出结果：X_binned.shape: (100, 10)

下面在one-hot编码后的数据上构建新的线性模型和新的决策树模型。如下代码示例：

import numpy as np
import mglearn
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
#from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import KBinsDiscretizerX, y = mglearn.datasets.make_wave(n_samples=100)
line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
kb = KBinsDiscretizer(n_bins=10, strategy='uniform', encode='onehot-dense')
kb.fit(X)
X_binned = kb.transform(X)
line_binned = kb.transform(line)reg = LinearRegression().fit(X_binned, y)
plt.plot(line, reg.predict(line_binned), label='linear regression binned')
reg = DecisionTreeRegressor(min_samples_split=3).fit(X_binned, y)
plt.plot(line, reg.predict(line_binned), label='decision tree binned')
plt.plot(X[:, 0], y, 'o', c='k')
plt.vlines(kb.bin_edges_[0], -3, 3, linewidth=1, alpha=.2)
plt.legend(loc="best")
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.show()

输出：

上图是在分箱特征上比较线性回归和决策树回归。蓝线和黄线完全重合，说明线性回归模型和决策树做出了完全相同的预测。对于每个bin箱子，二者都预测一个常数值。因为每个箱子内的特征是不变的，所以对于一个箱子内的所有点，任何模型都会预测相同的值。比较对特征进行分箱前后模型学到的内容，我们发现，线性模型变得更加灵活了，因为现在它对每个箱子具有不同的取值，而决策树模型的灵活性降低了。分箱特征对基于树的模型通常不会产生更好的效果，因为这种模型可以学习在任何位置划分数据。从某种意义上来看，决策树可以学习如何分箱对预测这些数据最为有用。此外，决策树可以同时查看多个特征，而分箱通常针对的是单个特征。不过，线性模型的表现力在数据变换后得到了极大的提高。对于特定的数据集，如果有充分的理由使用线性模型——比如数据集很大、维度很高，但有些特征与输出的关系是非线性的——那么分箱是提高建模能力的好方法。

本文来自互联网用户投稿，该文观点仅代表作者本人，不代表本站立场。本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如若转载，请注明出处：http://www.rhkb.cn/news/500442.html

如若内容造成侵权/违法违规/事实不符，请联系长河编程网进行投诉反馈email:809451989@qq.com，一经查实，立即删除！