1. 数据分析的开端,Tidyverse?
tidyverse 是一个清晰的 R 包集合,在数据操作、探索和可视化领域提供统一的数据科学解决方案,这些解决方案具有共同的设计理念。它是由 RStudio 背后的首席科学家 Hadley Wickham 创建的。 tidyverse 中的 R 包旨在提高统计学家和数据科学家的工作效率。包引导他们完成工作流程,促进沟通并产生可重复的工作产品。 tidyverse 本质上侧重于使工作流成为可能的工具的互连。在数据科学项目中采用 tidyverse有很多优势。它提供一致的功能、工作流覆盖范围、数据科学教育、数据科学工具开发的简化路径以及提高生产力的潜力。
它的主要目标之一是帮助任何需要分析数据的人高效地工作。如下图所示,tidyverse可以帮助你实现:
- 数据导入/导出
- 数据清洗处理
- 批量建模
- 数据/模型结果可视化
- 生成(可交互)的分析报告. (eg. pdf, word, ppt)
本文主要讨论前两个目标及数据的创建导入导出,以及数据清洗处理。
- tidyverse核心: 管道操作 %>%
进行数据分析前先导入必要的包和数据:
library(tidyverse)
library(rio)
data("german", package = "rchallenge")
首先查看你的数据集:
german %>% glimpse()
## Rows: 1,000
## Columns: 21
## $ status <fct> no checking account, no checking account, ... …
## $ duration <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history <fct> all credits at this bank paid back duly, all c…
## $ purpose <fct> car (used), others, retraining, others, others…
## $ amount <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex <fct> female : non-single or male : single, male : m…
## $ other_debtors <fct> none, none, none, none, none, none, none, none…
## $ present_residence <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property <fct> car or other, unknown / no property, unknown /…
## $ age <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing <fct> for free, for free, for free, for free, rent, …
## $ number_credits <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ job <fct> skilled employee/official, skilled employee/of…
## $ people_liable <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
## $ credit_risk <fct> good, good, good, good, good, good, good, good…
如你想对数据集german进行如下操作,:
- 选择status,savings,amount,age,credit_risk这些列
german %>% select(status,savings,amount, age, credit_risk) %>%
- 再按分类变量status分类
german %>% select(status,savings,amount, age, credit_risk) %>% group_by(status) %>% glimpse()
## Rows: 1,000
## Columns: 5
## Groups: status [4]
## $ status <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ savings <fct> unknown/no savings account, unknown/no savings account, ..…
## $ amount <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361, 1098, 3758,…
## $ age <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24, 31, 31, 23…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…
- 并计算每一类的个数。
german %>% select(status,savings,amount, age, credit_risk) %>% group_by(status) %>% glimpse()
## Rows: 1,000
## Columns: 5
## Groups: status [4]
## $ status <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ savings <fct> unknown/no savings account, unknown/no savings account, ..…
## $ amount <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361, 1098, 3758,…
## $ age <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24, 31, 31, 23…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…
通过tidyverse这样的工作流程,我们可以按照自己的思路一步一步的处理清洗我们的数据。
2. 数据导入/导出
- 数据读取
这里建议使用rio包里的import函数导入。可以导入xlxs,.csv,.txt, SPSS, Stata, SAS等大部分的格式。
当然每个格式也有单独对应的包,如readr可以用来读取csv,readxl读取xls,haven读取SPSS,Stat,SAS文件等。
library(rio)
df = import("yourdirectory/data.csv")
df = import("yourdirectory/data.xlxs")
- 数据导出
同理这里推荐使用rio包里的export函数。
export(df, "yourdirectory/name.csv")
export(df, "yourdirectory/name.xlxs")
3. tidyverse数据清洗处理流程
这里主要介绍关于行列创建修改的基本操作,以及一些分组汇总知识。更多细节请查看。
R语言编程–基于tidyverse
3.1选择列
主要使用到的函数有 relocate/select
- select 选择所需要的列
- relocate 选择需要的列并排序,它保留了所有列,但为重新排序提供了更大的灵活性。
german %>% select(status,credit_risk,job,age) %>% glimpse()
## Rows: 1,000
## Columns: 4
## $ status <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…
## $ job <fct> skilled employee/official, skilled employee/official, unsk…
## $ age <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24, 31, 31, 23…
german %>% select(2,3,5,7) %>% glimpse()
## Rows: 1,000
## Columns: 4
## $ duration <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6, 48…
## $ credit_history <fct> all credits at this bank paid back duly, all credi…
## $ amount <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361, 109…
## $ employment_duration <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 <= .…
# 选取这些列排到最前面,保留剩余列
german %>% relocate(status,credit_risk,job,age) %>% glimpse()
## Rows: 1,000
## Columns: 21
## $ status <fct> no checking account, no checking account, ... …
## $ credit_risk <fct> good, good, good, good, good, good, good, good…
## $ job <fct> skilled employee/official, skilled employee/of…
## $ age <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24…
## $ duration <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history <fct> all credits at this bank paid back duly, all c…
## $ purpose <fct> car (used), others, retraining, others, others…
## $ amount <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex <fct> female : non-single or male : single, male : m…
## $ other_debtors <fct> none, none, none, none, none, none, none, none…
## $ present_residence <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property <fct> car or other, unknown / no property, unknown /…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing <fct> for free, for free, for free, for free, rent, …
## $ number_credits <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ people_liable <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
# 选取这些列,且以status-age 为顺序,age是倒数第三列。
german %>% relocate(status,credit_risk,job,age, .after = last_col(offset = 3))
- 使用函数选择
german %>% select(starts_with("s")) %>% glimpse() # 选择列名是以s开始的
## Rows: 1,000
## Columns: 2
## $ status <fct> no checking account, no checking account, ... < 0 DM, no check…
## $ savings <fct> unknown/no savings account, unknown/no savings account, ... < …
german %>% select(ends_with("s")) %>% glimpse() # 选择列名是以s结束的
## Rows: 1,000
## Columns: 5
## $ status <fct> no checking account, no checking account, ... …
## $ savings <fct> unknown/no savings account, unknown/no savings…
## $ other_debtors <fct> none, none, none, none, none, none, none, none…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ number_credits <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
german %>% select(contains("s")) # 选择列名中包含s的
german %>% select(where(is.factor)) # 选择是列属性是factor的
- 使用逻辑符选择
# 选择这三列,并且这三列中是因子的
german %>% select(c(status,age,credit_risk) & where(is.factor)) %>% glimpse()
## Rows: 1,000
## Columns: 2
## $ status <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…
# 不选择这age,status这两列
german %>% select(!c(age,status))
german %>% select(-c(age,status))
3.2 修改列
- 创建新列/修改新列/修改旧列 mutate
german %>% # 创建新列age_mean计算age的均值,一般默认会把新列放在最后,.before=1使得放在第一列mutate(age_mean = mean(age), .before = 1) %>% # 也可以修改原有的列mutate(age = age + 1) %>% #也可以同时创建多个列,顺序是依次往后计算,所以前面创建的列可以在后面直接用。mutate(duration_mean = mean(duration),duration_median = median(duration),duration_sd = sd(duration),order = duration_mean > duration_median,order = as.numeric(order),.before= 1) %>% glimpse()
## Rows: 1,000
## Columns: 26
## $ duration_mean <dbl> 20.903, 20.903, 20.903, 20.903, 20.903, 20.903…
## $ duration_median <dbl> 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18…
## $ duration_sd <dbl> 12.05881, 12.05881, 12.05881, 12.05881, 12.058…
## $ order <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ age_mean <dbl> 35.542, 35.542, 35.542, 35.542, 35.542, 35.542…
## $ status <fct> no checking account, no checking account, ... …
## $ duration <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history <fct> all credits at this bank paid back duly, all c…
## $ purpose <fct> car (used), others, retraining, others, others…
## $ amount <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex <fct> female : non-single or male : single, male : m…
## $ other_debtors <fct> none, none, none, none, none, none, none, none…
## $ present_residence <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property <fct> car or other, unknown / no property, unknown /…
## $ age <dbl> 22, 37, 24, 40, 39, 49, 40, 41, 66, 24, 37, 25…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing <fct> for free, for free, for free, for free, rent, …
## $ number_credits <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ job <fct> skilled employee/official, skilled employee/of…
## $ people_liable <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
## $ credit_risk <fct> good, good, good, good, good, good, good, good…
- 同时修改多个列 across
german %>%# 将选定列转化为字符型mutate(across(c(duration, age), as.character)) %>% # 将每一列转化为数字型mutate(across(everything(), as.numeric)) %>% # 分别计算duration, age 列的均值和方差mutate(across(c(duration, age),.fns = list(mean = mean, stdev =sd)),.before = 1)
german %>% group_by(status) %>% summarise(across(c(duration,age), .fns = list(mean = mean, stdev = sd)),.groups = "drop")
-
对列进行重新编码
- if_else() 多用于处理二分类,也可以多分类但不推荐
# 当age>=60,老人,1
german %>% mutate(old = if_else(age<60,0,1))
# 也可以用多个if_else 嵌套处理多分类
# age<30:0 ; age<60:1; else:2
german %>% mutate(age_category = if_else(age<30,0,if_else(age<60,1,2)),.before=1) %>% glimpse()
## Rows: 1,000
## Columns: 22
## $ age_category <dbl> 0, 1, 0, 1, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1…
## $ status <fct> no checking account, no checking account, ... …
## $ duration <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history <fct> all credits at this bank paid back duly, all c…
## $ purpose <fct> car (used), others, retraining, others, others…
## $ amount <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex <fct> female : non-single or male : single, male : m…
## $ other_debtors <fct> none, none, none, none, none, none, none, none…
## $ present_residence <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property <fct> car or other, unknown / no property, unknown /…
## $ age <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing <fct> for free, for free, for free, for free, rent, …
## $ number_credits <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ job <fct> skilled employee/official, skilled employee/of…
## $ people_liable <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
## $ credit_risk <fct> good, good, good, good, good, good, good, good…
- case_when() 用于处理多分类,要比if_else()更直接好用
# 同理对年龄进行多分类
german %>% mutate(age_category = case_when(age<30 ~ 0,age<60 ~ 1,TRUE ~ 2),.before=1)
german %>% mutate(age_category = case_when(age<30 ~ "young",age<60 ~ "middle",TRUE ~ "old"),.before=1) %>% glimpse()
- one-hot 处理 correlationfunnel
把所有变量进行one-hot处理,生成二进制 (0/1) 变量的特征集。
-
Numeric data 被切分(使用n_bins)成分类数据,然后所有分类数据都被单热编码以产生二进制特征。为了防止低频类别(高基数类别)增加维数(结果数据框的宽度),我们使用 thresh_infreq = 0.01 和 name_infreq = “OTHER” 对多余的类别进行分组。
-
Categorical data : one-hot encoding
# 把所有变量进行one-hot 处理
# install.packages("correlationfunnel")
german %>% correlationfunnel::binarize(n_bins = 5, thresh_infreq = 0.1, name_infreq = "OTHER",one_hot = TRUE) %>% glimpse()
## Rows: 1,000
## Columns: 74
## $ status__no_checking_account <dbl> 1, 1, 0,…
## $ `status__..._<_0_DM` <dbl> 0, 0, 1,…
## $ `status__..._>=_200_DM_/_salary_for_at_least_1_year` <dbl> 0, 0, 0,…
## $ status__OTHER <dbl> 0, 0, 0,…
........
## $ duration__12_15 <dbl> 0, 0, 0,…
## $ duration__15_24 <dbl> 1, 0, 0,…
## $ duration__24_30 <dbl> 0, 0, 0,…
## $ `telephone__yes_(under_customer_name)` <dbl> 0, 0, 0,…
## $ foreign_worker__no <dbl> 1, 1, 1,…
## $ foreign_worker__OTHER <dbl> 0, 0, 0,…
## $ credit_risk__bad <dbl> 0, 0, 0,…
## $ credit_risk__good <dbl> 1, 1, 1,…
关于correlationfunnel包的更多具体使用请查看R packages: Correlationfunnel 相关漏斗图,可视化因变量自变量之间的相关关系
3.3 筛选行 filter()
# 选取满足多个条件的行,这里, 相当于 & 表示和的意思
german %>% filter(age > 60, credit_risk == "good", telephone == "no") %>% glimpse()
german %>% filter(age != 35 | age>60, credit_risk == "good") %>% glimpse()
# 选取age列和duration列都大于50的行
german %>% filter(if_all(c(age, duration), ~ .x > 50)) %>% glimpse()
分组汇总
german %>% group_by(status) %>% count(credit_risk, name = "n") %>% glimpse()
## Rows: 8
## Columns: 3
## Groups: status [4]
## $ status <fct> no checking account, no checking account, ... < 0 DM, ... …
## $ credit_risk <fct> bad, good, bad, good, bad, good, bad, good
## $ n <int> 135, 139, 105, 164, 14, 49, 46, 348
german_wider = german %>% group_by(status) %>% count(credit_risk, name = "n") %>% pivot_wider(names_from = credit_risk,values_from = n,values_fill = 0) %>% ungroup() %>% glimpse()
## Rows: 4
## Columns: 3
## $ status <fct> no checking account, ... < 0 DM, 0<= ... < 200 DM, ... >= 200 D…
## $ bad <int> 135, 105, 14, 46
## $ good <int> 139, 164, 49, 348
生成分组表格的快捷方式
library(table1)
# 汇总如下变量
table1(~ duration + age + status + job , data = german)
总结
tidyverse 的优势包括一致的功能、工作流覆盖、数据科学教育的途径、开发数据科学工具的简约方法以及提高生产力的可能性。tidyverse 用一种更加优雅且符合人直观思路的方式,以管道式、泛函式编程技术实现了数据科学的
整个流程:数据导入、数据清洗、数据操作、数据可视化、数据建模、可重现与交互报告。
如果你也面临以下问题待解决:
- 处理自己领域真实的数据
- 对数据进行建模分析、可视化
- 得出有用的结论
欢迎使用tidyverse工作流。是从basic R 还是tidy R入门R语言的争议一直都有,没有绝对的对错,只有合适与否。基于我自身,我需要解决的就是上述三个问题,因此tidy R 工作流程更适合我。
1. 数据分析的开端,Tidyverse?
tidyverse 是一个清晰的 R 包集合,在数据操作、探索和可视化领域提供统一的数据科学解决方案,这些解决方案具有共同的设计理念。它是由 RStudio 背后的首席科学家 Hadley Wickham 创建的。 tidyverse 中的 R 包旨在提高统计学家和数据科学家的工作效率。包引导他们完成工作流程,促进沟通并产生可重复的工作产品。 tidyverse 本质上侧重于使工作流成为可能的工具的互连。在数据科学项目中采用 tidyverse有很多优势。它提供一致的功能、工作流覆盖范围、数据科学教育、数据科学工具开发的简化路径以及提高生产力的潜力。
它的主要目标之一是帮助任何需要分析数据的人高效地工作。如下图所示,tidyverse可以帮助你实现:
- 数据导入/导出
- 数据清洗处理
- 批量建模
- 数据/模型结果可视化
- 生成(可交互)的分析报告. (eg. pdf, word, ppt)
本文主要讨论前两个目标及数据的创建导入导出,以及数据清洗处理。
- tidyverse核心: 管道操作 %>%
进行数据分析前先导入必要的包和数据:
library(tidyverse)
library(rio)
data("german", package = "rchallenge")
首先查看你的数据集:
german %>% glimpse()
## Rows: 1,000
## Columns: 21
## $ status <fct> no checking account, no checking account, ... …
## $ duration <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history <fct> all credits at this bank paid back duly, all c…
## $ purpose <fct> car (used), others, retraining, others, others…
## $ amount <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex <fct> female : non-single or male : single, male : m…
## $ other_debtors <fct> none, none, none, none, none, none, none, none…
## $ present_residence <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property <fct> car or other, unknown / no property, unknown /…
## $ age <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing <fct> for free, for free, for free, for free, rent, …
## $ number_credits <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ job <fct> skilled employee/official, skilled employee/of…
## $ people_liable <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
## $ credit_risk <fct> good, good, good, good, good, good, good, good…
如你想对数据集german进行如下操作,:
- 选择status,savings,amount,age,credit_risk这些列
german %>% select(status,savings,amount, age, credit_risk) %>%
- 再按分类变量status分类
german %>% select(status,savings,amount, age, credit_risk) %>% group_by(status) %>% glimpse()
## Rows: 1,000
## Columns: 5
## Groups: status [4]
## $ status <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ savings <fct> unknown/no savings account, unknown/no savings account, ..…
## $ amount <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361, 1098, 3758,…
## $ age <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24, 31, 31, 23…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…
- 并计算每一类的个数。
german %>% select(status,savings,amount, age, credit_risk) %>% group_by(status) %>% glimpse()
## Rows: 1,000
## Columns: 5
## Groups: status [4]
## $ status <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ savings <fct> unknown/no savings account, unknown/no savings account, ..…
## $ amount <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361, 1098, 3758,…
## $ age <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24, 31, 31, 23…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…
通过tidyverse这样的工作流程,我们可以按照自己的思路一步一步的处理清洗我们的数据。
2. 数据导入/导出
- 数据读取
这里建议使用rio包里的import函数导入。可以导入xlxs,.csv,.txt, SPSS, Stata, SAS等大部分的格式。
当然每个格式也有单独对应的包,如readr可以用来读取csv,readxl读取xls,haven读取SPSS,Stat,SAS文件等。
library(rio)
df = import("yourdirectory/data.csv")
df = import("yourdirectory/data.xlxs")
- 数据导出
同理这里推荐使用rio包里的export函数。
export(df, "yourdirectory/name.csv")
export(df, "yourdirectory/name.xlxs")
3. tidyverse数据清洗处理流程
这里主要介绍关于行列创建修改的基本操作,以及一些分组汇总知识。更多细节请查看。
R语言编程–基于tidyverse
3.1选择列
主要使用到的函数有 relocate/select
- select 选择所需要的列
- relocate 选择需要的列并排序,它保留了所有列,但为重新排序提供了更大的灵活性。
german %>% select(status,credit_risk,job,age) %>% glimpse()
## Rows: 1,000
## Columns: 4
## $ status <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…
## $ job <fct> skilled employee/official, skilled employee/official, unsk…
## $ age <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24, 31, 31, 23…
german %>% select(2,3,5,7) %>% glimpse()
## Rows: 1,000
## Columns: 4
## $ duration <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6, 48…
## $ credit_history <fct> all credits at this bank paid back duly, all credi…
## $ amount <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361, 109…
## $ employment_duration <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 <= .…
# 选取这些列排到最前面,保留剩余列
german %>% relocate(status,credit_risk,job,age) %>% glimpse()
## Rows: 1,000
## Columns: 21
## $ status <fct> no checking account, no checking account, ... …
## $ credit_risk <fct> good, good, good, good, good, good, good, good…
## $ job <fct> skilled employee/official, skilled employee/of…
## $ age <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24…
## $ duration <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history <fct> all credits at this bank paid back duly, all c…
## $ purpose <fct> car (used), others, retraining, others, others…
## $ amount <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex <fct> female : non-single or male : single, male : m…
## $ other_debtors <fct> none, none, none, none, none, none, none, none…
## $ present_residence <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property <fct> car or other, unknown / no property, unknown /…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing <fct> for free, for free, for free, for free, rent, …
## $ number_credits <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ people_liable <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
# 选取这些列,且以status-age 为顺序,age是倒数第三列。
german %>% relocate(status,credit_risk,job,age, .after = last_col(offset = 3))
- 使用函数选择
german %>% select(starts_with("s")) %>% glimpse() # 选择列名是以s开始的
## Rows: 1,000
## Columns: 2
## $ status <fct> no checking account, no checking account, ... < 0 DM, no check…
## $ savings <fct> unknown/no savings account, unknown/no savings account, ... < …
german %>% select(ends_with("s")) %>% glimpse() # 选择列名是以s结束的
## Rows: 1,000
## Columns: 5
## $ status <fct> no checking account, no checking account, ... …
## $ savings <fct> unknown/no savings account, unknown/no savings…
## $ other_debtors <fct> none, none, none, none, none, none, none, none…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ number_credits <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
german %>% select(contains("s")) # 选择列名中包含s的
german %>% select(where(is.factor)) # 选择是列属性是factor的
- 使用逻辑符选择
# 选择这三列,并且这三列中是因子的
german %>% select(c(status,age,credit_risk) & where(is.factor)) %>% glimpse()
## Rows: 1,000
## Columns: 2
## $ status <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…
# 不选择这age,status这两列
german %>% select(!c(age,status))
german %>% select(-c(age,status))
3.2 修改列
- 创建新列/修改新列/修改旧列 mutate
german %>% # 创建新列age_mean计算age的均值,一般默认会把新列放在最后,.before=1使得放在第一列mutate(age_mean = mean(age), .before = 1) %>% # 也可以修改原有的列mutate(age = age + 1) %>% #也可以同时创建多个列,顺序是依次往后计算,所以前面创建的列可以在后面直接用。mutate(duration_mean = mean(duration),duration_median = median(duration),duration_sd = sd(duration),order = duration_mean > duration_median,order = as.numeric(order),.before= 1) %>% glimpse()
## Rows: 1,000
## Columns: 26
## $ duration_mean <dbl> 20.903, 20.903, 20.903, 20.903, 20.903, 20.903…
## $ duration_median <dbl> 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18…
## $ duration_sd <dbl> 12.05881, 12.05881, 12.05881, 12.05881, 12.058…
## $ order <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ age_mean <dbl> 35.542, 35.542, 35.542, 35.542, 35.542, 35.542…
## $ status <fct> no checking account, no checking account, ... …
## $ duration <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history <fct> all credits at this bank paid back duly, all c…
## $ purpose <fct> car (used), others, retraining, others, others…
## $ amount <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex <fct> female : non-single or male : single, male : m…
## $ other_debtors <fct> none, none, none, none, none, none, none, none…
## $ present_residence <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property <fct> car or other, unknown / no property, unknown /…
## $ age <dbl> 22, 37, 24, 40, 39, 49, 40, 41, 66, 24, 37, 25…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing <fct> for free, for free, for free, for free, rent, …
## $ number_credits <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ job <fct> skilled employee/official, skilled employee/of…
## $ people_liable <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
## $ credit_risk <fct> good, good, good, good, good, good, good, good…
- 同时修改多个列 across
german %>%# 将选定列转化为字符型mutate(across(c(duration, age), as.character)) %>% # 将每一列转化为数字型mutate(across(everything(), as.numeric)) %>% # 分别计算duration, age 列的均值和方差mutate(across(c(duration, age),.fns = list(mean = mean, stdev =sd)),.before = 1)
german %>% group_by(status) %>% summarise(across(c(duration,age), .fns = list(mean = mean, stdev = sd)),.groups = "drop")
-
对列进行重新编码
- if_else() 多用于处理二分类,也可以多分类但不推荐
# 当age>=60,老人,1
german %>% mutate(old = if_else(age<60,0,1))
# 也可以用多个if_else 嵌套处理多分类
# age<30:0 ; age<60:1; else:2
german %>% mutate(age_category = if_else(age<30,0,if_else(age<60,1,2)),.before=1) %>% glimpse()
## Rows: 1,000
## Columns: 22
## $ age_category <dbl> 0, 1, 0, 1, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1…
## $ status <fct> no checking account, no checking account, ... …
## $ duration <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history <fct> all credits at this bank paid back duly, all c…
## $ purpose <fct> car (used), others, retraining, others, others…
## $ amount <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex <fct> female : non-single or male : single, male : m…
## $ other_debtors <fct> none, none, none, none, none, none, none, none…
## $ present_residence <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property <fct> car or other, unknown / no property, unknown /…
## $ age <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing <fct> for free, for free, for free, for free, rent, …
## $ number_credits <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ job <fct> skilled employee/official, skilled employee/of…
## $ people_liable <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
## $ credit_risk <fct> good, good, good, good, good, good, good, good…
- case_when() 用于处理多分类,要比if_else()更直接好用
# 同理对年龄进行多分类
german %>% mutate(age_category = case_when(age<30 ~ 0,age<60 ~ 1,TRUE ~ 2),.before=1)
german %>% mutate(age_category = case_when(age<30 ~ "young",age<60 ~ "middle",TRUE ~ "old"),.before=1) %>% glimpse()
- one-hot 处理 correlationfunnel
把所有变量进行one-hot处理,生成二进制 (0/1) 变量的特征集。
-
Numeric data 被切分(使用n_bins)成分类数据,然后所有分类数据都被单热编码以产生二进制特征。为了防止低频类别(高基数类别)增加维数(结果数据框的宽度),我们使用 thresh_infreq = 0.01 和 name_infreq = “OTHER” 对多余的类别进行分组。
-
Categorical data : one-hot encoding
# 把所有变量进行one-hot 处理
# install.packages("correlationfunnel")
german %>% correlationfunnel::binarize(n_bins = 5, thresh_infreq = 0.1, name_infreq = "OTHER",one_hot = TRUE) %>% glimpse()
## Rows: 1,000
## Columns: 74
## $ status__no_checking_account <dbl> 1, 1, 0,…
## $ `status__..._<_0_DM` <dbl> 0, 0, 1,…
## $ `status__..._>=_200_DM_/_salary_for_at_least_1_year` <dbl> 0, 0, 0,…
## $ status__OTHER <dbl> 0, 0, 0,…
........
## $ duration__12_15 <dbl> 0, 0, 0,…
## $ duration__15_24 <dbl> 1, 0, 0,…
## $ duration__24_30 <dbl> 0, 0, 0,…
## $ `telephone__yes_(under_customer_name)` <dbl> 0, 0, 0,…
## $ foreign_worker__no <dbl> 1, 1, 1,…
## $ foreign_worker__OTHER <dbl> 0, 0, 0,…
## $ credit_risk__bad <dbl> 0, 0, 0,…
## $ credit_risk__good <dbl> 1, 1, 1,…
关于correlationfunnel包的更多具体使用请查看R packages: Correlationfunnel 相关漏斗图,可视化因变量自变量之间的相关关系
3.3 筛选行 filter()
# 选取满足多个条件的行,这里, 相当于 & 表示和的意思
german %>% filter(age > 60, credit_risk == "good", telephone == "no") %>% glimpse()
german %>% filter(age != 35 | age>60, credit_risk == "good") %>% glimpse()
# 选取age列和duration列都大于50的行
german %>% filter(if_all(c(age, duration), ~ .x > 50)) %>% glimpse()
分组汇总
german %>% group_by(status) %>% count(credit_risk, name = "n") %>% glimpse()
## Rows: 8
## Columns: 3
## Groups: status [4]
## $ status <fct> no checking account, no checking account, ... < 0 DM, ... …
## $ credit_risk <fct> bad, good, bad, good, bad, good, bad, good
## $ n <int> 135, 139, 105, 164, 14, 49, 46, 348
german_wider = german %>% group_by(status) %>% count(credit_risk, name = "n") %>% pivot_wider(names_from = credit_risk,values_from = n,values_fill = 0) %>% ungroup() %>% glimpse()
## Rows: 4
## Columns: 3
## $ status <fct> no checking account, ... < 0 DM, 0<= ... < 200 DM, ... >= 200 D…
## $ bad <int> 135, 105, 14, 46
## $ good <int> 139, 164, 49, 348
生成分组表格的快捷方式
library(table1)
# 汇总如下变量
table1(~ duration + age + status + job , data = german)
总结
tidyverse 的优势包括一致的功能、工作流覆盖、数据科学教育的途径、开发数据科学工具的简约方法以及提高生产力的可能性。tidyverse 用一种更加优雅且符合人直观思路的方式,以管道式、泛函式编程技术实现了数据科学的
整个流程:数据导入、数据清洗、数据操作、数据可视化、数据建模、可重现与交互报告。
如果你也面临以下问题待解决:
- 处理自己领域真实的数据
- 对数据进行建模分析、可视化
- 得出有用的结论
欢迎使用tidyverse工作流。是从basic R 还是tidy R入门R语言的争议一直都有,没有绝对的对错,只有合适与否。基于我自身,我需要解决的就是上述三个问题,因此tidy R 工作流程更适合我。