前言
统计学作为一门重要的数据分析领域,为我们理解和解释数据提供了有力的工具。而Python是用来进行统计自动化和画图的重要工具。本文总结了与统计学相关的Python数据库和不同类型的统计图的关键知识点,帮助读者更好地理解工具,以及各知识点之间的逻辑,以便未来利用这些工具进行数据分析和可视化。
目录
- 前言
- 库
- Pandas DataFrame 的数据结构
- Script
- MatPlotLib(画图)
- Seaborn
- 散点图sns.scatterplot
- 各类型统计图
- 变量类型:
- Histogram直方图:
- Bar charts 条形图:
- Bar charts 和Histograms的区别:
- Pie Chart 饼状图:
- Scatter Plot:
库
Pandas DataFrame 的数据结构
1.读取
df = pd.read_csv('filename.csv')
- 创建
# 从列表创建df = pd.DataFrame([['Alice', 25], ['Bob', 30], ['Charlie', 35]], columns=['Name', 'Age'])# 从字典创建data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}df = pd.DataFrame(data)
- 删除missing case
da["BMXWT"].dropna().describe()
da[“BMXWT”].dropna().describe() 生成摘要之前先删除missing cases
Script
- CDF - Cumulative Distribution Function(累积分布函数):
- 定义: 对于一个随机变量 X,它的 CDF 是一个函数 F(x),定义为 P(X ≤ x),表示随机变量小于或等于 x 的概率。
- 公式: F(x) = P(X ≤ x)
- 性质: CDF 是单调非减的,总是在 [0, 1] 范围内,并且在 x 增加时不减小。
- 应用: 对于二项分布
binom
,binom.cdf(k, n, p)
k=想成功的次数;n=实验次数;p=成功概率。
stats.binom(10, 0.5)
二项分布,试验10次,每次成功概率0.5
print(stats.expon.cdf(3))
计算随机变量小于等于给定值的概率。在这里,参数是 3,表示计算累积分布函数为 3 时的概率。
- PDF - Probability Density Function(概率密度函数):
- 定义: 对于连续型随机变量 X,PDF 是一个函数 f(x),表示在给定点 x 处的概率密度,即 X 在 x 处的概率密度。(曲线下面函数面积值)
- 性质: PDF 的值并不直接对应概率,而是在给定范围内的概率密度。总体积分为 1
- 公式:
- ∫ − ∞ ∞ f ( x ) d x \int_{-\infty}^{\infty} f(x) \, dx ∫−∞∞f(x)dx
- 应用:
print(stats.norm.pdf(1))
调用正态分布对象的pdf方法,该方法计算随机变量取值为给定值的概率密度。在这里,参数是 1,表示计算概率密度函数在值为 1 处的概率密度。
- PPF概率百分点函数Percent Point Function:
在统计学中,PPF 是累积分布函数(CDF)的逆函数。对于给定的概率,PPF 返回相应的随机变量值。- 公式: F − 1 ( p ) = x F^{-1}(p) = x F−1(p)=x
- 应用:
print(stats.t(10).ppf(0.5))
计算自由度为 10 的 t 分布的累积分布函数(CDF=0.5时)的逆函数(返回随机变量值),即概率百分点函数(PPF).
MatPlotLib(画图)
plt.grid(True)
plt.plot(x, y, ":", lw=5, color="orange") #lw=线条宽度
plt.ylabel("Y", size=15)
plt.xlabel("X", size=15)
-
绘图接口:
pyplot
模块是 Matplotlib 的绘图接口,它提供了类似于 MATLAB 的绘图功能。通常使用import matplotlib.pyplot as plt
的方式引入。
-
基本绘图:
- 使用
plt.plot(x, y)
可以绘制线图,其中x
和y
是数据点的坐标。
- 使用
-
样式和颜色:
- 可以通过参数设置线条样式、颜色、标记等。例如,
plt.plot(x, y, linestyle='--', color='blue', marker='o')
。
- 可以通过参数设置线条样式、颜色、标记等。例如,
-
图表类型:
- Matplotlib 支持绘制多种类型的图表,包括散点图、柱状图、饼图等。
-
标签和标题:
- 使用
plt.xlabel()
,plt.ylabel()
,plt.title()
可以设置坐标轴标签和图表标题。
- 使用
-
图例:
- 使用
plt.legend()
可以添加图例,说明每条线或每个数据集的含义。
- 使用
Seaborn
散点图sns.scatterplot
sns.scatterplot(x='x', y='y', hue='group', data=df)
plt.show()
x=‘x’ 和 y=‘y’:分别指定 x 轴和 y 轴的数据。
hue=‘group’:通过 ‘group’ 列的取值来着色散点,即根据 ‘group’ 列的不同取值,点的颜色会有所区分。
各类型统计图
变量类型:
-
Categorical Ordinal: 有顺序的。The variable represents categories or groups (adult or not adult). Would imply an == ordered relationship== among categories (e.g., low, medium, high).
-
Nominal: There is no inherent order or ranking among the categories; they are simply different groups.
-
Quantitative Continuous: number within a range(取值范围内所有数都可以取)
-
Quantitative Discrete: Would represent numeric values that are distinct and separate.
Histogram直方图:
- Description: Histograms are used to visualize the distribution of a continuous variable by dividing the data into bins and displaying the frequency of observations in each bin.
- Types:
- Single-Peak (Unimodal): One clear peak in the distribution.
- Bimodal双峰:两个峰必须整体趋势一致才叫biomodal.
- Skewed (Left or Right): left (negatively skewed,小的值多,平均数小于中位数) or right (positively skewed,大的值多,平均数大于中位).
- Bell-Shaped: Symmetrical distribution resembling a bell curve, often observed in normal distributions.
- Use: Assessing Spread and Dispersion etc.
Bar charts 条形图:
- Description: represent categorical data with rectangular bars. The lengths of the bars are proportional to the values they represent.
- Use: Useful for comparing the values of different categories. Bar charts are versatile and can be used for both nominal and ordinal categorical data.
Bar charts 和Histograms的区别:
A histogram is the graphical representation of data where data is grouped into == continuous number ranges== and each range corresponds to a vertical bar.
Pie Chart 饼状图:
- Description: Pie charts represent data in a circular graph where each category is shown as a wedge, and the size of each wedge corresponds to the proportion of that category in the whole.
- Use: Useful for displaying the composition of a whole, highlighting the relative sizes of different categories.
- Box Plot (Box-and-Whisker Plot):
- Description: Box plots provide a visual summary of the distribution of a numerical variable through quartiles (25th, 50th, and 75th percentiles) and identify potential outliers.
- Use: Useful for comparing the spread and central tendency of different groups or variables.
知识点:四分位距。
-
Five-Number Summary:
- Explanation of the five-number summary: == minimum, first quartile (Q1, 25%在这个value以下), median(50%), third quartile (Q3,75%), and maximum. ==
- Example using the adult male heights histogram with values for each parameter.
-
Interquartile Range (IQR) 四分位距:
- Introduction of the Interquartile Range (IQR) as a measure of spread.
- Calculation of IQR using Q3 minus Q1 in the adult male heights example.
-
Comparison of Measures:
- Emphasis on the robustness of the median as an estimate of the center, less influenced by outliers.
- Standard deviation as an average distance from the mean.
- Preference for == IQR== over the == range== due to robustness against outliers.
· 注意异常: Remember that even though outliers are plotting individually in boxplots, they are still part of the data set. 会影响Mean值。
Scatter Plot:
- Description: Scatter plots are used to display the relationship between two continuous variables. Each point on the plot represents an observation with values on both X and Y axes.
- Use: They help determine whether there is a positive, negative, or no correlation between variables. (一段关系是不是线性的)
- Judgement:
- r=1: Perfect positive correlation. As one variable increases, the other variable increases proportionally.
- r=−1: Perfect negative correlation. As one variable increases, the other variable decreases proportionally.
- r=0: No linear correlation. The variables are not linearly related.