Statistics with Python知识总结：库、统计图

前言

统计学作为一门重要的数据分析领域，为我们理解和解释数据提供了有力的工具。而Python是用来进行统计自动化和画图的重要工具。本文总结了与统计学相关的Python数据库和不同类型的统计图的关键知识点，帮助读者更好地理解工具，以及各知识点之间的逻辑，以便未来利用这些工具进行数据分析和可视化。

库

Pandas DataFrame 的数据结构

1.读取

df = pd.read_csv('filename.csv')

创建

 # 从列表创建df = pd.DataFrame([['Alice', 25], ['Bob', 30], ['Charlie', 35]], columns=['Name', 'Age'])# 从字典创建data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}df = pd.DataFrame(data)

删除missing case

da["BMXWT"].dropna().describe()

da[“BMXWT”].dropna().describe() 生成摘要之前先删除missing cases

Script

CDF - Cumulative Distribution Function（累积分布函数）:
- 定义： 对于一个随机变量 X，它的 CDF 是一个函数 F(x)，定义为 P(X ≤ x)，表示随机变量小于或等于 x 的概率。
- 公式： F(x) = P(X ≤ x)
- 性质： CDF 是单调非减的，总是在 [0, 1] 范围内，并且在 x 增加时不减小。
- 应用： 对于二项分布 binom，binom.cdf(k, n, p) k=想成功的次数；n=实验次数；p=成功概率。

stats.binom(10, 0.5)

二项分布，试验10次，每次成功概率0.5

print(stats.expon.cdf(3))

计算随机变量小于等于给定值的概率。在这里，参数是 3，表示计算累积分布函数为 3 时的概率。

PDF - Probability Density Function（概率密度函数）:
- 定义： 对于连续型随机变量 X，PDF 是一个函数 f(x)，表示在给定点 x 处的概率密度，即 X 在 x 处的概率密度。（曲线下面函数面积值）
- 性质： PDF 的值并不直接对应概率，而是在给定范围内的概率密度。总体积分为 1
- 公式：

$\int_{-\infty}^{\infty} f(x) \, dx$
- 应用：

print(stats.norm.pdf(1))

调用正态分布对象的pdf方法，该方法计算随机变量取值为给定值的概率密度。在这里，参数是 1，表示计算概率密度函数在值为 1 处的概率密度。

PPF概率百分点函数Percent Point Function：
在统计学中，PPF 是累积分布函数（CDF）的逆函数。对于给定的概率，PPF 返回相应的随机变量值。
- 公式： $F^{-1}(p) = x$
- 应用：

print(stats.t(10).ppf(0.5))

计算自由度为 10 的 t 分布的累积分布函数（CDF=0.5时）的逆函数（返回随机变量值），即概率百分点函数（PPF）.

MatPlotLib（画图）

plt.grid(True)
plt.plot(x, y, ":", lw=5, color="orange") #lw=线条宽度
plt.ylabel("Y", size=15)
plt.xlabel("X", size=15)

绘图接口：
- pyplot 模块是 Matplotlib 的绘图接口，它提供了类似于 MATLAB 的绘图功能。通常使用 import matplotlib.pyplot as plt 的方式引入。
基本绘图：
- 使用 plt.plot(x, y) 可以绘制线图，其中 x 和 y 是数据点的坐标。
样式和颜色：
- 可以通过参数设置线条样式、颜色、标记等。例如，plt.plot(x, y, linestyle='--', color='blue', marker='o')。
图表类型：
- Matplotlib 支持绘制多种类型的图表，包括散点图、柱状图、饼图等。
标签和标题：
- 使用 plt.xlabel(), plt.ylabel(), plt.title() 可以设置坐标轴标签和图表标题。
图例：
- 使用 plt.legend() 可以添加图例，说明每条线或每个数据集的含义。

Seaborn

散点图sns.scatterplot

sns.scatterplot(x='x', y='y', hue='group', data=df)
plt.show()

x=‘x’ 和 y=‘y’：分别指定 x 轴和 y 轴的数据。
hue=‘group’：通过 ‘group’ 列的取值来着色散点，即根据 ‘group’ 列的不同取值，点的颜色会有所区分。

各类型统计图

变量类型：

Categorical Ordinal: 有顺序的。The variable represents categories or groups (adult or not adult). Would imply an == ordered relationship== among categories (e.g., low, medium, high).
Nominal: There is no inherent order or ranking among the categories; they are simply different groups.
Quantitative Continuous: number within a range（取值范围内所有数都可以取）
Quantitative Discrete: Would represent numeric values that are distinct and separate.

Histogram直方图:

Description: Histograms are used to visualize the distribution of a continuous variable by dividing the data into bins and displaying the frequency of observations in each bin.
Types:
- Single-Peak (Unimodal): One clear peak in the distribution.
- Bimodal双峰：两个峰必须整体趋势一致才叫biomodal.
- Skewed (Left or Right): left (negatively skewed，小的值多，平均数小于中位数) or right (positively skewed，大的值多，平均数大于中位).
- Bell-Shaped: Symmetrical distribution resembling a bell curve, often observed in normal distributions.
Use: Assessing Spread and Dispersion etc.

Bar charts 条形图：

Description: represent categorical data with rectangular bars. The lengths of the bars are proportional to the values they represent.
Use: Useful for comparing the values of different categories. Bar charts are versatile and can be used for both nominal and ordinal categorical data.

Bar charts 和Histograms的区别：

A histogram is the graphical representation of data where data is grouped into == continuous number ranges== and each range corresponds to a vertical bar.

Pie Chart 饼状图:

Description: Pie charts represent data in a circular graph where each category is shown as a wedge, and the size of each wedge corresponds to the proportion of that category in the whole.
Use: Useful for displaying the composition of a whole, highlighting the relative sizes of different categories.

Box Plot (Box-and-Whisker Plot):
- Description: Box plots provide a visual summary of the distribution of a numerical variable through quartiles (25th, 50th, and 75th percentiles) and identify potential outliers.
- Use: Useful for comparing the spread and central tendency of different groups or variables.

在这里插入图片描述

知识点：四分位距。

Five-Number Summary:
- Explanation of the five-number summary: == minimum, first quartile (Q1, 25%在这个value以下), median(50%), third quartile (Q3,75%), and maximum. ==
- Example using the adult male heights histogram with values for each parameter.
Interquartile Range (IQR) 四分位距:
- Introduction of the Interquartile Range (IQR) as a measure of spread.
- Calculation of IQR using Q3 minus Q1 in the adult male heights example.
Comparison of Measures:
- Emphasis on the robustness of the median as an estimate of the center, less influenced by outliers.
- Standard deviation as an average distance from the mean.
- Preference for == IQR== over the == range== due to robustness against outliers.

· 注意异常： Remember that even though outliers are plotting individually in boxplots, they are still part of the data set. 会影响Mean值。

Scatter Plot:

Description: Scatter plots are used to display the relationship between two continuous variables. Each point on the plot represents an observation with values on both X and Y axes.
Use: They help determine whether there is a positive, negative, or no correlation between variables. （一段关系是不是线性的）
Judgement:
- r=1: Perfect positive correlation. As one variable increases, the other variable increases proportionally.
- r=−1: Perfect negative correlation. As one variable increases, the other variable decreases proportionally.
- r=0: No linear correlation. The variables are not linearly related.

在这里插入图片描述