R语言入门-数据分析实操(tyidyverse工作流+代码)

1. 数据分析的开端,Tidyverse?

tidyverse 是一个清晰的 R 包集合,在数据操作、探索和可视化领域提供统一的数据科学解决方案,这些解决方案具有共同的设计理念。它是由 RStudio 背后的首席科学家 Hadley Wickham 创建的。 tidyverse 中的 R 包旨在提高统计学家和数据科学家的工作效率。包引导他们完成工作流程,促进沟通并产生可重复的工作产品。 tidyverse 本质上侧重于使工作流成为可能的工具的互连。在数据科学项目中采用 tidyverse有很多优势。它提供一致的功能、工作流覆盖范围、数据科学教育、数据科学工具开发的简化路径以及提高生产力的潜力。

它的主要目标之一是帮助任何需要分析数据的人高效地工作。如下图所示,tidyverse可以帮助你实现:

  1. 数据导入/导出
  2. 数据清洗处理
  3. 批量建模
  4. 数据/模型结果可视化
  5. 生成(可交互)的分析报告. (eg. pdf, word, ppt)

本文主要讨论前两个目标及数据的创建导入导出,以及数据清洗处理。

在这里插入图片描述

  • tidyverse核心: 管道操作 %>%

进行数据分析前先导入必要的包和数据:

library(tidyverse)
library(rio)
data("german", package = "rchallenge") 

首先查看你的数据集:

german %>% glimpse()
## Rows: 1,000
## Columns: 21
## $ status                  <fct> no checking account, no checking account, ... …
## $ duration                <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history          <fct> all credits at this bank paid back duly, all c…
## $ purpose                 <fct> car (used), others, retraining, others, others…
## $ amount                  <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings                 <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration     <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate        <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex     <fct> female : non-single or male : single, male : m…
## $ other_debtors           <fct> none, none, none, none, none, none, none, none…
## $ present_residence       <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property                <fct> car or other, unknown / no property, unknown /…
## $ age                     <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing                 <fct> for free, for free, for free, for free, rent, …
## $ number_credits          <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ job                     <fct> skilled employee/official, skilled employee/of…
## $ people_liable           <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone               <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker          <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
## $ credit_risk             <fct> good, good, good, good, good, good, good, good…

如你想对数据集german进行如下操作,:

  • 选择status,savings,amount,age,credit_risk这些列
german %>% select(status,savings,amount, age, credit_risk) %>% 
  • 再按分类变量status分类
german %>% select(status,savings,amount, age, credit_risk) %>% group_by(status) %>% glimpse()
## Rows: 1,000
## Columns: 5
## Groups: status [4]
## $ status      <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ savings     <fct> unknown/no savings account, unknown/no savings account, ..…
## $ amount      <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361, 1098, 3758,…
## $ age         <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24, 31, 31, 23…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…
  • 并计算每一类的个数。
german %>% select(status,savings,amount, age, credit_risk) %>% group_by(status) %>% glimpse()
## Rows: 1,000
## Columns: 5
## Groups: status [4]
## $ status      <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ savings     <fct> unknown/no savings account, unknown/no savings account, ..…
## $ amount      <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361, 1098, 3758,…
## $ age         <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24, 31, 31, 23…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…

通过tidyverse这样的工作流程,我们可以按照自己的思路一步一步的处理清洗我们的数据。

2. 数据导入/导出

  • 数据读取

这里建议使用rio包里的import函数导入。可以导入xlxs,.csv,.txt, SPSS, Stata, SAS等大部分的格式。

当然每个格式也有单独对应的包,如readr可以用来读取csv,readxl读取xls,haven读取SPSS,Stat,SAS文件等。

library(rio)
df = import("yourdirectory/data.csv")
df = import("yourdirectory/data.xlxs")
  • 数据导出

同理这里推荐使用rio包里的export函数。

export(df, "yourdirectory/name.csv")
export(df, "yourdirectory/name.xlxs")

3. tidyverse数据清洗处理流程

这里主要介绍关于行列创建修改的基本操作,以及一些分组汇总知识。更多细节请查看。

R语言编程–基于tidyverse

3.1选择列

主要使用到的函数有 relocate/select

  • select 选择所需要的列
  • relocate 选择需要的列并排序,它保留了所有列,但为重新排序提供了更大的灵活性。
german %>% select(status,credit_risk,job,age) %>% glimpse()
## Rows: 1,000
## Columns: 4
## $ status      <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…
## $ job         <fct> skilled employee/official, skilled employee/official, unsk…
## $ age         <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24, 31, 31, 23…
german %>% select(2,3,5,7) %>% glimpse()
## Rows: 1,000
## Columns: 4
## $ duration            <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6, 48…
## $ credit_history      <fct> all credits at this bank paid back duly, all credi…
## $ amount              <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361, 109…
## $ employment_duration <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 <= .…
# 选取这些列排到最前面,保留剩余列
german %>% relocate(status,credit_risk,job,age) %>% glimpse()
## Rows: 1,000
## Columns: 21
## $ status                  <fct> no checking account, no checking account, ... …
## $ credit_risk             <fct> good, good, good, good, good, good, good, good…
## $ job                     <fct> skilled employee/official, skilled employee/of…
## $ age                     <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24…
## $ duration                <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history          <fct> all credits at this bank paid back duly, all c…
## $ purpose                 <fct> car (used), others, retraining, others, others…
## $ amount                  <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings                 <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration     <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate        <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex     <fct> female : non-single or male : single, male : m…
## $ other_debtors           <fct> none, none, none, none, none, none, none, none…
## $ present_residence       <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property                <fct> car or other, unknown / no property, unknown /…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing                 <fct> for free, for free, for free, for free, rent, …
## $ number_credits          <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ people_liable           <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone               <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker          <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
# 选取这些列,且以status-age 为顺序,age是倒数第三列。
german %>% relocate(status,credit_risk,job,age, .after = last_col(offset = 3)) 
  • 使用函数选择
german %>% select(starts_with("s")) %>% glimpse() # 选择列名是以s开始的
## Rows: 1,000
## Columns: 2
## $ status  <fct> no checking account, no checking account, ... < 0 DM, no check…
## $ savings <fct> unknown/no savings account, unknown/no savings account, ... < …
german %>% select(ends_with("s")) %>% glimpse() # 选择列名是以s结束的
## Rows: 1,000
## Columns: 5
## $ status                  <fct> no checking account, no checking account, ... …
## $ savings                 <fct> unknown/no savings account, unknown/no savings…
## $ other_debtors           <fct> none, none, none, none, none, none, none, none…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ number_credits          <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
german %>% select(contains("s")) # 选择列名中包含s的
german %>% select(where(is.factor))  # 选择是列属性是factor的
  • 使用逻辑符选择
# 选择这三列,并且这三列中是因子的
german %>% select(c(status,age,credit_risk) & where(is.factor)) %>% glimpse()
## Rows: 1,000
## Columns: 2
## $ status      <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…
# 不选择这age,status这两列
german %>% select(!c(age,status))
german %>% select(-c(age,status))

3.2 修改列

  • 创建新列/修改新列/修改旧列 mutate
german %>% # 创建新列age_mean计算age的均值,一般默认会把新列放在最后,.before=1使得放在第一列mutate(age_mean = mean(age), .before = 1) %>% # 也可以修改原有的列mutate(age = age + 1) %>% #也可以同时创建多个列,顺序是依次往后计算,所以前面创建的列可以在后面直接用。mutate(duration_mean = mean(duration),duration_median = median(duration),duration_sd = sd(duration),order = duration_mean > duration_median,order = as.numeric(order),.before= 1) %>% glimpse()
## Rows: 1,000
## Columns: 26
## $ duration_mean           <dbl> 20.903, 20.903, 20.903, 20.903, 20.903, 20.903…
## $ duration_median         <dbl> 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18…
## $ duration_sd             <dbl> 12.05881, 12.05881, 12.05881, 12.05881, 12.058…
## $ order                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ age_mean                <dbl> 35.542, 35.542, 35.542, 35.542, 35.542, 35.542…
## $ status                  <fct> no checking account, no checking account, ... …
## $ duration                <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history          <fct> all credits at this bank paid back duly, all c…
## $ purpose                 <fct> car (used), others, retraining, others, others…
## $ amount                  <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings                 <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration     <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate        <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex     <fct> female : non-single or male : single, male : m…
## $ other_debtors           <fct> none, none, none, none, none, none, none, none…
## $ present_residence       <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property                <fct> car or other, unknown / no property, unknown /…
## $ age                     <dbl> 22, 37, 24, 40, 39, 49, 40, 41, 66, 24, 37, 25…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing                 <fct> for free, for free, for free, for free, rent, …
## $ number_credits          <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ job                     <fct> skilled employee/official, skilled employee/of…
## $ people_liable           <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone               <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker          <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
## $ credit_risk             <fct> good, good, good, good, good, good, good, good…
  • 同时修改多个列 across
german %>%# 将选定列转化为字符型mutate(across(c(duration, age), as.character)) %>% # 将每一列转化为数字型mutate(across(everything(), as.numeric)) %>% # 分别计算duration, age 列的均值和方差mutate(across(c(duration, age),.fns = list(mean = mean, stdev =sd)),.before = 1)
german %>% group_by(status) %>% summarise(across(c(duration,age), .fns = list(mean = mean, stdev = sd)),.groups = "drop") 
  • 对列进行重新编码

    • if_else() 多用于处理二分类,也可以多分类但不推荐
# 当age>=60,老人,1
german %>% mutate(old = if_else(age<60,0,1)) 
# 也可以用多个if_else 嵌套处理多分类
# age<30:0 ; age<60:1; else:2 
german %>% mutate(age_category = if_else(age<30,0,if_else(age<60,1,2)),.before=1) %>% glimpse()
## Rows: 1,000
## Columns: 22
## $ age_category            <dbl> 0, 1, 0, 1, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1…
## $ status                  <fct> no checking account, no checking account, ... …
## $ duration                <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history          <fct> all credits at this bank paid back duly, all c…
## $ purpose                 <fct> car (used), others, retraining, others, others…
## $ amount                  <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings                 <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration     <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate        <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex     <fct> female : non-single or male : single, male : m…
## $ other_debtors           <fct> none, none, none, none, none, none, none, none…
## $ present_residence       <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property                <fct> car or other, unknown / no property, unknown /…
## $ age                     <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing                 <fct> for free, for free, for free, for free, rent, …
## $ number_credits          <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ job                     <fct> skilled employee/official, skilled employee/of…
## $ people_liable           <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone               <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker          <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
## $ credit_risk             <fct> good, good, good, good, good, good, good, good…
  • case_when() 用于处理多分类,要比if_else()更直接好用
# 同理对年龄进行多分类
german %>% mutate(age_category = case_when(age<30 ~ 0,age<60 ~ 1,TRUE ~ 2),.before=1) 
german %>% mutate(age_category = case_when(age<30 ~ "young",age<60 ~ "middle",TRUE ~ "old"),.before=1) %>% glimpse()
  • one-hot 处理 correlationfunnel

把所有变量进行one-hot处理,生成二进制 (0/1) 变量的特征集。

  • Numeric data 被切分(使用n_bins)成分类数据,然后所有分类数据都被单热编码以产生二进制特征。为了防止低频类别(高基数类别)增加维数(结果数据框的宽度),我们使用 thresh_infreq = 0.01 和 name_infreq = “OTHER” 对多余的类别进行分组。

  • Categorical data : one-hot encoding

# 把所有变量进行one-hot 处理
# install.packages("correlationfunnel")
german %>% correlationfunnel::binarize(n_bins = 5, thresh_infreq = 0.1, name_infreq = "OTHER",one_hot = TRUE) %>% glimpse()
## Rows: 1,000
## Columns: 74
## $ status__no_checking_account                                   <dbl> 1, 1, 0,…
## $ `status__..._<_0_DM`                                          <dbl> 0, 0, 1,…
## $ `status__..._>=_200_DM_/_salary_for_at_least_1_year`          <dbl> 0, 0, 0,…
## $ status__OTHER                                                 <dbl> 0, 0, 0,…
........
## $ duration__12_15                                               <dbl> 0, 0, 0,…
## $ duration__15_24                                               <dbl> 1, 0, 0,…
## $ duration__24_30                                               <dbl> 0, 0, 0,…
## $ `telephone__yes_(under_customer_name)`                        <dbl> 0, 0, 0,…
## $ foreign_worker__no                                            <dbl> 1, 1, 1,…
## $ foreign_worker__OTHER                                         <dbl> 0, 0, 0,…
## $ credit_risk__bad                                              <dbl> 0, 0, 0,…
## $ credit_risk__good                                             <dbl> 1, 1, 1,…

关于correlationfunnel包的更多具体使用请查看R packages: Correlationfunnel 相关漏斗图,可视化因变量自变量之间的相关关系

3.3 筛选行 filter()

# 选取满足多个条件的行,这里, 相当于 & 表示和的意思
german %>% filter(age > 60, credit_risk == "good", telephone == "no") %>% glimpse()
german %>% filter(age != 35 | age>60, credit_risk == "good") %>% glimpse()
# 选取age列和duration列都大于50的行
german %>% filter(if_all(c(age, duration), ~ .x > 50)) %>% glimpse()

分组汇总

german %>% group_by(status) %>% count(credit_risk, name = "n") %>% glimpse()
## Rows: 8
## Columns: 3
## Groups: status [4]
## $ status      <fct> no checking account, no checking account, ... < 0 DM, ... …
## $ credit_risk <fct> bad, good, bad, good, bad, good, bad, good
## $ n           <int> 135, 139, 105, 164, 14, 49, 46, 348
german_wider = german %>% group_by(status) %>% count(credit_risk, name = "n") %>% pivot_wider(names_from = credit_risk,values_from = n,values_fill = 0) %>% ungroup() %>% glimpse()
## Rows: 4
## Columns: 3
## $ status <fct> no checking account, ... < 0 DM, 0<= ... < 200 DM, ... >= 200 D…
## $ bad    <int> 135, 105, 14, 46
## $ good   <int> 139, 164, 49, 348

生成分组表格的快捷方式

library(table1)
# 汇总如下变量 
table1(~  duration + age + status + job , data = german)

总结

tidyverse 的优势包括一致的功能、工作流覆盖、数据科学教育的途径、开发数据科学工具的简约方法以及提高生产力的可能性。tidyverse 用一种更加优雅且符合人直观思路的方式,以管道式、泛函式编程技术实现了数据科学的
整个流程:数据导入、数据清洗、数据操作、数据可视化、数据建模、可重现与交互报告。

如果你也面临以下问题待解决:

  • 处理自己领域真实的数据
  • 对数据进行建模分析、可视化
  • 得出有用的结论

欢迎使用tidyverse工作流。是从basic R 还是tidy R入门R语言的争议一直都有,没有绝对的对错,只有合适与否。基于我自身,我需要解决的就是上述三个问题,因此tidy R 工作流程更适合我。

1. 数据分析的开端,Tidyverse?

tidyverse 是一个清晰的 R 包集合,在数据操作、探索和可视化领域提供统一的数据科学解决方案,这些解决方案具有共同的设计理念。它是由 RStudio 背后的首席科学家 Hadley Wickham 创建的。 tidyverse 中的 R 包旨在提高统计学家和数据科学家的工作效率。包引导他们完成工作流程,促进沟通并产生可重复的工作产品。 tidyverse 本质上侧重于使工作流成为可能的工具的互连。在数据科学项目中采用 tidyverse有很多优势。它提供一致的功能、工作流覆盖范围、数据科学教育、数据科学工具开发的简化路径以及提高生产力的潜力。

它的主要目标之一是帮助任何需要分析数据的人高效地工作。如下图所示,tidyverse可以帮助你实现:

  1. 数据导入/导出
  2. 数据清洗处理
  3. 批量建模
  4. 数据/模型结果可视化
  5. 生成(可交互)的分析报告. (eg. pdf, word, ppt)

本文主要讨论前两个目标及数据的创建导入导出,以及数据清洗处理。

  • tidyverse核心: 管道操作 %>%

进行数据分析前先导入必要的包和数据:

library(tidyverse)
library(rio)
data("german", package = "rchallenge") 

首先查看你的数据集:

german %>% glimpse()
## Rows: 1,000
## Columns: 21
## $ status                  <fct> no checking account, no checking account, ... …
## $ duration                <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history          <fct> all credits at this bank paid back duly, all c…
## $ purpose                 <fct> car (used), others, retraining, others, others…
## $ amount                  <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings                 <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration     <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate        <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex     <fct> female : non-single or male : single, male : m…
## $ other_debtors           <fct> none, none, none, none, none, none, none, none…
## $ present_residence       <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property                <fct> car or other, unknown / no property, unknown /…
## $ age                     <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing                 <fct> for free, for free, for free, for free, rent, …
## $ number_credits          <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ job                     <fct> skilled employee/official, skilled employee/of…
## $ people_liable           <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone               <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker          <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
## $ credit_risk             <fct> good, good, good, good, good, good, good, good…

如你想对数据集german进行如下操作,:

  • 选择status,savings,amount,age,credit_risk这些列
german %>% select(status,savings,amount, age, credit_risk) %>% 
  • 再按分类变量status分类
german %>% select(status,savings,amount, age, credit_risk) %>% group_by(status) %>% glimpse()
## Rows: 1,000
## Columns: 5
## Groups: status [4]
## $ status      <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ savings     <fct> unknown/no savings account, unknown/no savings account, ..…
## $ amount      <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361, 1098, 3758,…
## $ age         <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24, 31, 31, 23…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…
  • 并计算每一类的个数。
german %>% select(status,savings,amount, age, credit_risk) %>% group_by(status) %>% glimpse()
## Rows: 1,000
## Columns: 5
## Groups: status [4]
## $ status      <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ savings     <fct> unknown/no savings account, unknown/no savings account, ..…
## $ amount      <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361, 1098, 3758,…
## $ age         <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24, 31, 31, 23…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…

通过tidyverse这样的工作流程,我们可以按照自己的思路一步一步的处理清洗我们的数据。

2. 数据导入/导出

  • 数据读取

这里建议使用rio包里的import函数导入。可以导入xlxs,.csv,.txt, SPSS, Stata, SAS等大部分的格式。

当然每个格式也有单独对应的包,如readr可以用来读取csv,readxl读取xls,haven读取SPSS,Stat,SAS文件等。

library(rio)
df = import("yourdirectory/data.csv")
df = import("yourdirectory/data.xlxs")
  • 数据导出

同理这里推荐使用rio包里的export函数。

export(df, "yourdirectory/name.csv")
export(df, "yourdirectory/name.xlxs")

3. tidyverse数据清洗处理流程

这里主要介绍关于行列创建修改的基本操作,以及一些分组汇总知识。更多细节请查看。

R语言编程–基于tidyverse

3.1选择列

主要使用到的函数有 relocate/select

  • select 选择所需要的列
  • relocate 选择需要的列并排序,它保留了所有列,但为重新排序提供了更大的灵活性。
german %>% select(status,credit_risk,job,age) %>% glimpse()
## Rows: 1,000
## Columns: 4
## $ status      <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…
## $ job         <fct> skilled employee/official, skilled employee/official, unsk…
## $ age         <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24, 31, 31, 23…
german %>% select(2,3,5,7) %>% glimpse()
## Rows: 1,000
## Columns: 4
## $ duration            <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6, 48…
## $ credit_history      <fct> all credits at this bank paid back duly, all credi…
## $ amount              <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361, 109…
## $ employment_duration <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 <= .…
# 选取这些列排到最前面,保留剩余列
german %>% relocate(status,credit_risk,job,age) %>% glimpse()
## Rows: 1,000
## Columns: 21
## $ status                  <fct> no checking account, no checking account, ... …
## $ credit_risk             <fct> good, good, good, good, good, good, good, good…
## $ job                     <fct> skilled employee/official, skilled employee/of…
## $ age                     <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24…
## $ duration                <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history          <fct> all credits at this bank paid back duly, all c…
## $ purpose                 <fct> car (used), others, retraining, others, others…
## $ amount                  <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings                 <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration     <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate        <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex     <fct> female : non-single or male : single, male : m…
## $ other_debtors           <fct> none, none, none, none, none, none, none, none…
## $ present_residence       <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property                <fct> car or other, unknown / no property, unknown /…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing                 <fct> for free, for free, for free, for free, rent, …
## $ number_credits          <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ people_liable           <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone               <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker          <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
# 选取这些列,且以status-age 为顺序,age是倒数第三列。
german %>% relocate(status,credit_risk,job,age, .after = last_col(offset = 3)) 
  • 使用函数选择
german %>% select(starts_with("s")) %>% glimpse() # 选择列名是以s开始的
## Rows: 1,000
## Columns: 2
## $ status  <fct> no checking account, no checking account, ... < 0 DM, no check…
## $ savings <fct> unknown/no savings account, unknown/no savings account, ... < …
german %>% select(ends_with("s")) %>% glimpse() # 选择列名是以s结束的
## Rows: 1,000
## Columns: 5
## $ status                  <fct> no checking account, no checking account, ... …
## $ savings                 <fct> unknown/no savings account, unknown/no savings…
## $ other_debtors           <fct> none, none, none, none, none, none, none, none…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ number_credits          <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
german %>% select(contains("s")) # 选择列名中包含s的
german %>% select(where(is.factor))  # 选择是列属性是factor的
  • 使用逻辑符选择
# 选择这三列,并且这三列中是因子的
german %>% select(c(status,age,credit_risk) & where(is.factor)) %>% glimpse()
## Rows: 1,000
## Columns: 2
## $ status      <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…
# 不选择这age,status这两列
german %>% select(!c(age,status))
german %>% select(-c(age,status))

3.2 修改列

  • 创建新列/修改新列/修改旧列 mutate
german %>% # 创建新列age_mean计算age的均值,一般默认会把新列放在最后,.before=1使得放在第一列mutate(age_mean = mean(age), .before = 1) %>% # 也可以修改原有的列mutate(age = age + 1) %>% #也可以同时创建多个列,顺序是依次往后计算,所以前面创建的列可以在后面直接用。mutate(duration_mean = mean(duration),duration_median = median(duration),duration_sd = sd(duration),order = duration_mean > duration_median,order = as.numeric(order),.before= 1) %>% glimpse()
## Rows: 1,000
## Columns: 26
## $ duration_mean           <dbl> 20.903, 20.903, 20.903, 20.903, 20.903, 20.903…
## $ duration_median         <dbl> 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18…
## $ duration_sd             <dbl> 12.05881, 12.05881, 12.05881, 12.05881, 12.058…
## $ order                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ age_mean                <dbl> 35.542, 35.542, 35.542, 35.542, 35.542, 35.542…
## $ status                  <fct> no checking account, no checking account, ... …
## $ duration                <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history          <fct> all credits at this bank paid back duly, all c…
## $ purpose                 <fct> car (used), others, retraining, others, others…
## $ amount                  <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings                 <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration     <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate        <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex     <fct> female : non-single or male : single, male : m…
## $ other_debtors           <fct> none, none, none, none, none, none, none, none…
## $ present_residence       <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property                <fct> car or other, unknown / no property, unknown /…
## $ age                     <dbl> 22, 37, 24, 40, 39, 49, 40, 41, 66, 24, 37, 25…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing                 <fct> for free, for free, for free, for free, rent, …
## $ number_credits          <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ job                     <fct> skilled employee/official, skilled employee/of…
## $ people_liable           <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone               <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker          <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
## $ credit_risk             <fct> good, good, good, good, good, good, good, good…
  • 同时修改多个列 across
german %>%# 将选定列转化为字符型mutate(across(c(duration, age), as.character)) %>% # 将每一列转化为数字型mutate(across(everything(), as.numeric)) %>% # 分别计算duration, age 列的均值和方差mutate(across(c(duration, age),.fns = list(mean = mean, stdev =sd)),.before = 1)
german %>% group_by(status) %>% summarise(across(c(duration,age), .fns = list(mean = mean, stdev = sd)),.groups = "drop") 
  • 对列进行重新编码

    • if_else() 多用于处理二分类,也可以多分类但不推荐
# 当age>=60,老人,1
german %>% mutate(old = if_else(age<60,0,1)) 
# 也可以用多个if_else 嵌套处理多分类
# age<30:0 ; age<60:1; else:2 
german %>% mutate(age_category = if_else(age<30,0,if_else(age<60,1,2)),.before=1) %>% glimpse()
## Rows: 1,000
## Columns: 22
## $ age_category            <dbl> 0, 1, 0, 1, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1…
## $ status                  <fct> no checking account, no checking account, ... …
## $ duration                <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history          <fct> all credits at this bank paid back duly, all c…
## $ purpose                 <fct> car (used), others, retraining, others, others…
## $ amount                  <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings                 <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration     <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate        <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex     <fct> female : non-single or male : single, male : m…
## $ other_debtors           <fct> none, none, none, none, none, none, none, none…
## $ present_residence       <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property                <fct> car or other, unknown / no property, unknown /…
## $ age                     <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing                 <fct> for free, for free, for free, for free, rent, …
## $ number_credits          <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ job                     <fct> skilled employee/official, skilled employee/of…
## $ people_liable           <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone               <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker          <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
## $ credit_risk             <fct> good, good, good, good, good, good, good, good…
  • case_when() 用于处理多分类,要比if_else()更直接好用
# 同理对年龄进行多分类
german %>% mutate(age_category = case_when(age<30 ~ 0,age<60 ~ 1,TRUE ~ 2),.before=1) 
german %>% mutate(age_category = case_when(age<30 ~ "young",age<60 ~ "middle",TRUE ~ "old"),.before=1) %>% glimpse()
  • one-hot 处理 correlationfunnel

把所有变量进行one-hot处理,生成二进制 (0/1) 变量的特征集。

  • Numeric data 被切分(使用n_bins)成分类数据,然后所有分类数据都被单热编码以产生二进制特征。为了防止低频类别(高基数类别)增加维数(结果数据框的宽度),我们使用 thresh_infreq = 0.01 和 name_infreq = “OTHER” 对多余的类别进行分组。

  • Categorical data : one-hot encoding

# 把所有变量进行one-hot 处理
# install.packages("correlationfunnel")
german %>% correlationfunnel::binarize(n_bins = 5, thresh_infreq = 0.1, name_infreq = "OTHER",one_hot = TRUE) %>% glimpse()
## Rows: 1,000
## Columns: 74
## $ status__no_checking_account                                   <dbl> 1, 1, 0,…
## $ `status__..._<_0_DM`                                          <dbl> 0, 0, 1,…
## $ `status__..._>=_200_DM_/_salary_for_at_least_1_year`          <dbl> 0, 0, 0,…
## $ status__OTHER                                                 <dbl> 0, 0, 0,…
........
## $ duration__12_15                                               <dbl> 0, 0, 0,…
## $ duration__15_24                                               <dbl> 1, 0, 0,…
## $ duration__24_30                                               <dbl> 0, 0, 0,…
## $ `telephone__yes_(under_customer_name)`                        <dbl> 0, 0, 0,…
## $ foreign_worker__no                                            <dbl> 1, 1, 1,…
## $ foreign_worker__OTHER                                         <dbl> 0, 0, 0,…
## $ credit_risk__bad                                              <dbl> 0, 0, 0,…
## $ credit_risk__good                                             <dbl> 1, 1, 1,…

关于correlationfunnel包的更多具体使用请查看R packages: Correlationfunnel 相关漏斗图,可视化因变量自变量之间的相关关系

3.3 筛选行 filter()

# 选取满足多个条件的行,这里, 相当于 & 表示和的意思
german %>% filter(age > 60, credit_risk == "good", telephone == "no") %>% glimpse()
german %>% filter(age != 35 | age>60, credit_risk == "good") %>% glimpse()
# 选取age列和duration列都大于50的行
german %>% filter(if_all(c(age, duration), ~ .x > 50)) %>% glimpse()

分组汇总

german %>% group_by(status) %>% count(credit_risk, name = "n") %>% glimpse()
## Rows: 8
## Columns: 3
## Groups: status [4]
## $ status      <fct> no checking account, no checking account, ... < 0 DM, ... …
## $ credit_risk <fct> bad, good, bad, good, bad, good, bad, good
## $ n           <int> 135, 139, 105, 164, 14, 49, 46, 348
german_wider = german %>% group_by(status) %>% count(credit_risk, name = "n") %>% pivot_wider(names_from = credit_risk,values_from = n,values_fill = 0) %>% ungroup() %>% glimpse()
## Rows: 4
## Columns: 3
## $ status <fct> no checking account, ... < 0 DM, 0<= ... < 200 DM, ... >= 200 D…
## $ bad    <int> 135, 105, 14, 46
## $ good   <int> 139, 164, 49, 348

生成分组表格的快捷方式

library(table1)
# 汇总如下变量 
table1(~  duration + age + status + job , data = german)

在这里插入图片描述

总结

tidyverse 的优势包括一致的功能、工作流覆盖、数据科学教育的途径、开发数据科学工具的简约方法以及提高生产力的可能性。tidyverse 用一种更加优雅且符合人直观思路的方式,以管道式、泛函式编程技术实现了数据科学的
整个流程:数据导入、数据清洗、数据操作、数据可视化、数据建模、可重现与交互报告。

如果你也面临以下问题待解决:

  • 处理自己领域真实的数据
  • 对数据进行建模分析、可视化
  • 得出有用的结论

欢迎使用tidyverse工作流。是从basic R 还是tidy R入门R语言的争议一直都有,没有绝对的对错,只有合适与否。基于我自身,我需要解决的就是上述三个问题,因此tidy R 工作流程更适合我。

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.rhkb.cn/news/42327.html

如若内容造成侵权/违法违规/事实不符,请联系长河编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

从零开发基于chatGPT的社区交友系统 (前后端分离)丨主页框架 03

03 主页框架 前言开工引用iconfont字体图标template代码style代码页面展示 路由局部跳转script代码 总结 前言 上一章我们实现了基础的主题切换功能&#xff0c;这一章我们开始搭建主页主要框架&#xff0c;并测试一下我们的切换主题功能。 开工 引用iconfont字体图标 这里…

制造焦虑的一边去! AI 不会让我们失业,还会带来大量新机遇!

点击上方“编程技术进阶”&#xff0c;加"星标" 重磅干货&#xff0c;第一时间送达 大家好&#xff0c;我是编哥。先送上涛哥用 AI 画的美女&#xff0c;目测这个是涛哥最近的梦中情人 声明&#xff1a;本文引用图片来自涛哥&#xff08;公众号&#xff1a;涛哥聊Pyt…

AI崛起,我们应该保持自我意识!

目录 前言西部世界剧情&#xff1f;辅助还是威胁&#xff1f;如何自我保护?总结 前言 人工智能技术发展迅速&#xff0c;为我们的生活带来了很多便利。但就个人来说这些技术都是可控的&#xff0c;我并不怎么担心。直到ChatGPT的到来&#xff0c;把我惊呆了。一边感叹AI技术的…

GPT帮你写简历:求职日企篇

正文共 927 字&#xff0c;阅读大约需要 3 分钟 求职者必备技巧&#xff0c;您将在3分钟后获得以下超能力&#xff1a; 快速生成日文简历 Beezy评级 &#xff1a;B级 *经过简单的寻找&#xff0c; 大部分人能立刻掌握。主要节省时间。 推荐人 | nanako 编辑者 | Linda 在应聘…

chatgpt赋能Python-python如何构造矩阵

Intro Python是一门强大的编程语言&#xff0c;拥有广泛的应用领域&#xff0c;其中矩阵运算是其中一个重要的方向。在本文中&#xff0c;我们将介绍如何使用Python构造矩阵&#xff0c;并介绍常见的矩阵运算。 什么是矩阵&#xff1f; 矩阵是由一组有序数排列成的矩形阵列&…

chatgpt赋能python:Python怎么手动输入矩阵

Python怎么手动输入矩阵 若想在使用Python进行科学计算或机器学习时&#xff0c;我们经常会使用矩阵。要能够利用Python处理这些矩阵&#xff0c;我们需要了解如何输入一个矩阵。本文将介绍如何通过代码手动输入矩阵。 什么是矩阵 矩阵是一个二维数组&#xff0c;它包含了行…

新知实验室 TRTC在线上教育场景中的一种应用实践

一、参赛说明 很荣幸参与此次征文活动&#xff0c;我是从2年多以前&#xff0c;接触腾讯云TRTC相关产品&#xff0c;正逢疫情肆虐&#xff0c;很多业务迫切需要从线下转移到线上。 经过2年的实践&#xff0c;TRTC本身也经历了多次的迭代&#xff0c;功能更强大&#xff0c;也…

Auto_GPT:如何使用Auto-GPT、安装、开发

文章目录 前言一、使用前提二、如何在您的 PC 上安装 Auto-GPT第 1 步&#xff1a;安装 Python第 2 步&#xff1a;获取 Auto-GPT 源代码和 OpenAI API 密钥第 3 步&#xff1a;在 Windows、MacOS和 Linux上安装 Auto-GPT 三、如何在你的 PC 上运行Auto-GPT总结 前言 Auto-GPT…

【霸指拓客】抖音自动引流脚本源码终端开发

<<抖音自动引流脚本源码.>> auto(); //等待广告加载 launchApp("抖音极速版"); sleep(Math.random()*10010000); click("我知道了"); sleep(3000); back(); sleep(2000); var w device.width; var h device.height; while(true){ var i…

Autojs抖音小助手抢红包福袋脚本实战分享

此代码由日赚3万网autojs开发交流群整理提供&#xff0c;欢迎大家一起技术交流&#xff0c;查看更多脚本和源码&#xff1a; 一键加群。如有侵权,请联系我删除&#xff01; // 最新抖音小助手抢福袋脚本已更新至V5.18&#xff0c;可自动换号批量多开操作&#xff0c;单机日入15…

不要再用 C/C++ 的这种说法了!

我们对“C/C”这种写法或说法似乎在无形之中早已习以为常&#xff0c;然而&#xff0c;这种做法真的是对的吗&#xff1f; 在今天这篇文章中&#xff0c;有开发者呼吁应该立即停止使用“C/C”这种说法&#xff0c;因为这属于两种完全不同的编程语言&#xff0c;以下是他解释的原…

真正的Python多线程来了!

【CSDN 编者按】IBM工程师Martin Heinz发文表示&#xff0c;Python即将迎来了真正的多线程时刻&#xff01; 原文&#xff1a;https://martinheinz.dev/blog/97 未经授权&#xff0c;禁止转载&#xff01; 作者 | Martin Heinz 责编 | 梦依丹 翻译工具 | ChatGPT 32岁的Python…

Circular lollipop | 哇咔咔!!!环形棒棒糖图好吃又好玩!~

1写在前面 今天不想废话了&#xff0c;直接看图吧。&#x1f447; 复现代码step by step&#xff0c;自己看吧。&#x1f92a; 2用到的包 rm(list ls())library(tidyverse)library(ggtext)library(patchwork) 3示例数据 df_pw <- read.csv("./passwords.csv",row…

最新突破!天然产物首次实现全合成,轰动整个化学界

01 科研背景 生物医学是综合医学、生命科学和生物学的理论和方法而发展起来的前沿交叉学科,基本任务是运用生物学及工程技术手段研究和解决生命科学&#xff0c;特别是医学中的有关问题。机器学习技术能利用复杂的算法在大规模、异质性数据集中进行运行,在生物医学方面、人类…

基于人工智能与机器学习等多种方法的研究内容

导语 生物医学是综合医学、生命科学和生物学的理论和方法而发展起来的前沿交叉学科,基本任务是运用生物学及工程技术手段研究和解决生命科学&#xff0c;特别是医学中的有关问题。机器学习技术能利用复杂的算法在大规模、异质性数据集中进行运行,在生物医学方面、人类基因组项…

《简说基因》公众号编辑招募公告

【简说基因】免费交流群&#xff0c;资料领取/学习交流 生物信息学是什么&#xff1f;看一看 ChatGPT 的回答&#xff1a; 生物信息学是一个跨学科领域&#xff0c;结合了生物学和计算机科学的原理和技术&#xff0c;旨在研究和解释生物学数据。它利用计算机算法和统计方法来分…

挖呀挖!AutoDock 分子对接极简实践方法

挖呀挖&#xff01;AutoDock 分子对接极简实践方法&#x1f44b; 哈哈哈 开心&#x1f606; 在小小的花园里面挖呀挖呀挖 &#x1f331; 种下一颗小小的种子&#x1fad8; 开出一朵小小的花&#x1f338; 在大大的花园里面挖呀挖呀挖&#x1f331; 种下一颗大大的种子&#x1f…

【TOP生物信息】CNS图表复现,单细胞marker基因展示的另一种方式——蜂巢图

扫码关注下方公粽号&#xff0c;回复推文合集&#xff0c;获取400页单细胞学习资源&#xff01; 本文共计1359字&#xff0c;阅读大约需要4分钟。 Sten Linnarsson大神的单细胞绘图堪称极致美学&#xff0c;在这里&#xff0c;小编选择了发表在nature上展示marker基因的绘图进…