5 Feature Engineering

Created with AI

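Parts of this chapter were written with the help of generative AI tools such as ChatGPT, so the following problems may be present:

Some statements may be inaccurate, for example misattributed or out of date.
Some of the code has been debugged and runs, but it may not follow current coding conventions or recommended usage.

Even so, the material remains a useful reference.

Please verify it yourself before relying on it.

Suggestions for corrections and improvements are welcome.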

5.1 Configuring the default environment

# Set knitr options
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    message = FALSE,
    warning = FALSE
)

# Show error messages in English
Sys.setenv(LANG = "en")

# Use the rmodels conda environment
reticulate::use_condaenv("rmodels",
    conda = "/opt/homebrew/anaconda3/bin/conda"
)

# Load the tidyverse package
library("tidyverse")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Load the cailab.utils package
if (Sys.getenv("USER") == "gaoch") {
    devtools::load_all("~/GitHub/cailab.utils")
} else {
    library("cailab.utils")
}

ℹ Loading cailab.utils
Registered S3 methods overwritten by 'treeio':
  method              from    
  MRCA.phylo          tidytree
  MRCA.treedata       tidytree
  Nnode.treedata      tidytree
  Ntip.treedata       tidytree
  ancestor.phylo      tidytree
  ancestor.treedata   tidytree
  child.phylo         tidytree
  child.treedata      tidytree
  full_join.phylo     tidytree
  full_join.treedata  tidytree
  groupClade.phylo    tidytree
  groupClade.treedata tidytree
  groupOTU.phylo      tidytree
  groupOTU.treedata   tidytree
  inner_join.phylo    tidytree
  inner_join.treedata tidytree
  is.rooted.treedata  tidytree
  nodeid.phylo        tidytree
  nodeid.treedata     tidytree
  nodelab.phylo       tidytree
  nodelab.treedata    tidytree
  offspring.phylo     tidytree
  offspring.treedata  tidytree
  parent.phylo        tidytree
  parent.treedata     tidytree
  root.treedata       tidytree
  rootnode.phylo      tidytree
  sibling.phylo       tidytree

# Load the tidymodels package
library(tidymodels)

Registered S3 method overwritten by 'parsnip':
  method          from 
  print.nullmodel vegan
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
✔ broom        1.0.5     ✔ rsample      1.2.0
✔ dials        1.2.0     ✔ tune         1.1.2
✔ infer        1.0.5     ✔ workflows    1.1.3
✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
✔ parsnip      1.1.1     ✔ yardstick    1.2.0
✔ recipes      1.0.9     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org

# Set the default ggplot2 theme
theme_set(theme_bw())

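5.2 What is feature engineering

Feature engineering is the part of data preprocessing that turns raw data into inputs a machine learning model can learn from and predict with more effectively. Its main steps are:

Feature understanding: examine where the data come from, their types and distributions, and any missing values or outliers, to learn what each variable means.
Feature selection: keep the features that are useful to the model, judged by relevance, importance and redundancy, to reduce dimensionality and noise.
Feature extraction: derive new features from the raw data with mathematical or statistical methods such as principal component analysis (PCA), linear discriminant analysis (LDA) or singular value decomposition (SVD).
Feature construction: use domain knowledge to create new features, for example by combining, decomposing, transforming or encoding existing ones.
Feature transformation: convert features into a form the model can use, for example by standardization, normalization, discretization or one-hot encoding.

Feature engineering is as much an art as a science: finding a good set of features takes repeated experimentation and refinement, and its quality often determines how well the model performs.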

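In tidymodels, feature engineering is done with the recipes package, which lets you build and preprocess features with a dplyr-like pipe syntax. Its main advantages are:

It handles common tabular data types such as numeric, categorical and date variables; extension packages such as textrecipes add support for text.
Steps can easily be added, removed, modified and combined, and their parameters and options adjusted.
It estimates the statistics each step needs from the training data and applies them unchanged to new data, which keeps the preprocessing consistent.
It integrates seamlessly with other tidymodels packages such as rsample, parsnip and tune to build a complete machine learning workflow.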
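The point about reusing training statistics on new data can be sketched as follows. This is a minimal illustration, not part of the original chapter: the split of mtcars into `train` and `test` rows is an arbitrary assumption, and step_normalize() is used only because its estimated quantities (mean and standard deviation) are easy to check by hand.

```r
library(recipes)

# Hypothetical split of a built-in dataset into training and test rows
train <- mtcars[1:20, ]
test  <- mtcars[21:32, ]

rec <- recipe(mpg ~ ., data = train) %>%
  step_normalize(all_numeric_predictors())

# prep() estimates the means and standard deviations from the training rows only
rec_prepped <- prep(rec, training = train)

# bake() reuses those training statistics on rows the recipe has never seen
test_baked <- bake(rec_prepped, new_data = test)

# Manual check for one column: scale test values with the *training* mean and sd
manual_disp <- (test$disp - mean(train$disp)) / sd(train$disp)
all.equal(test_baked$disp, manual_disp)
```

Because the test rows are scaled with the training statistics, their mean is generally not exactly 0, which is the intended behaviour.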

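To learn more about the recipes package, see the following resources:

The recipes website: detailed documentation, articles and examples for the package.
Feature Engineering and Selection: an online book on the theory and practice of feature engineering, with examples that use recipes.
R Recipes: A Problem-Solution Approach: a general cookbook of problems and solutions for data analysis in R. Its title uses "recipes" in the cookbook sense, and it does not appear to cover the recipes package itself.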

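5.3 Concepts in recipes

Source: https://recipes.tidymodels.org/articles/recipes.html

First, a few definitions:

Variables: the columns of the original dataset. In the traditional formula Y ~ A + B + A:B, the variables are A, B and Y.
Roles: define how a variable is used in the model, for example predictor (independent variable), response (called outcome in recipes) and case weight. The set of roles is meant to be open-ended and extensible.
Terms: the columns of the design matrix, such as A, B and A:B. They can also be other derived entities, such as a set of principal components or a set of columns that define a basis function for a variable. Terms are synonymous with features in machine learning. Variables with the predictor role automatically become main-effect terms.

In short, recipes lets you state what each variable is for by assigning it a role, and then create the terms that serve as the model's features. This gives a flexible framework that makes preprocessing and feature engineering simpler and more intuitive.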
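As a small sketch of how roles look in practice, the example below assigns roles through the formula and then changes one of them with update_role(). Treating `cyl` as an "id" column is purely an assumption made for illustration; "id" is a custom role name, not one that recipes requires.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  # Hypothetical choice: treat `cyl` as an identifier rather than a predictor
  update_role(cyl, new_role = "id")

# summary() lists every variable with its type, role and source
role_info <- summary(rec)
role_info[role_info$variable %in% c("mpg", "cyl", "disp"), c("variable", "role")]
```

Variables that carry a role other than predictor or outcome stay in the data but are ignored by selectors such as all_predictors().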

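5.4 A minimal example

The code below loads the ames dataset from the modeldata package and log-transforms the Sale_Price column (the sale price) with log10(). It then defines a recipe, a preprocessing pipeline made up of several feature engineering steps, prepares it on the training rows and applies it to data.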

Note

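About the ames dataset

The Ames Housing dataset contains residential sales records from Ames, Iowa, USA. It was compiled by Professor Dean De Cock for teaching, in particular data cleaning and advanced regression techniques.

It describes nearly 3,000 homes sold in Ames between 2006 and 2010 with 79 features, such as dwelling type, year built, number of rooms, basement condition, garage size and building materials, together with each home's final sale price. The version shipped in modeldata is a cleaned variant with 2,930 rows and 74 columns.

Each row is one property and each column one characteristic; Sale_Price is the usual target variable. The data are typically used for regression or machine learning tasks such as predicting house prices.

Because it has many features of different kinds (nominal, ordinal and numeric), it is a good dataset for practising and demonstrating preprocessing, feature engineering and model tuning.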

data(ames, package = "modeldata")

ames <- mutate(ames, Sale_Price = log10(Sale_Price))

ames_rec <-
  recipe(Sale_Price ~ ., data = ames[-(1:6), ]) %>%
  step_other(Neighborhood, threshold = 0.05) %>%
  step_dummy(all_nominal()) %>%
  step_interact(~ starts_with("Central_Air"):Year_Built) %>%
  step_ns(Longitude, Latitude, deg_free = 2) %>%
  step_zv(all_predictors())

ames_rec <- prep(ames_rec)

# return the training set (already embedded in ames_rec)
bake(ames_rec, new_data = NULL)
#> # A tibble: 2,924 × 259
#>    Lot_Frontage Lot_Area Year_Built Year_Remod_Add Mas_Vnr_Area BsmtFin_SF_1
#>           <dbl>    <int>      <int>          <int>        <dbl>        <dbl>
#>  1           41     4920       2001           2001            0            3
#>  2           43     5005       1992           1992            0            1
#>  3           39     5389       1995           1996            0            3
#>  4           60     7500       1999           1999            0            7
#>  5           75    10000       1993           1994            0            7
#>  6            0     7980       1992           2007            0            1
#>  7           63     8402       1998           1998            0            7
#>  8           85    10176       1990           1990            0            3
#>  9            0     6820       1985           1985            0            3
#> 10           47    53504       2003           2003          603            1
#> # ℹ 2,914 more rows
#> # ℹ 253 more variables: BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>,
#> #   Total_Bsmt_SF <dbl>, First_Flr_SF <int>, Second_Flr_SF <int>,
#> #   Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>,
#> #   Full_Bath <int>, Half_Bath <int>, Bedroom_AbvGr <int>, Kitchen_AbvGr <int>,
#> #   TotRms_AbvGrd <int>, Fireplaces <int>, Garage_Cars <dbl>,
#> #   Garage_Area <dbl>, Wood_Deck_SF <int>, Open_Porch_SF <int>, …

# apply processing to other data:
bake(ames_rec, new_data = head(ames))
#> # A tibble: 6 × 259
#>   Lot_Frontage Lot_Area Year_Built Year_Remod_Add Mas_Vnr_Area BsmtFin_SF_1
#>          <dbl>    <int>      <int>          <int>        <dbl>        <dbl>
#> 1          141    31770       1960           1960          112            2
#> 2           80    11622       1961           1961            0            6
#> 3           81    14267       1958           1958          108            1
#> 4           93    11160       1968           1968            0            1
#> 5           74    13830       1997           1998            0            3
#> 6           78     9978       1998           1998           20            3
#> # ℹ 253 more variables: BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>,
#> #   Total_Bsmt_SF <dbl>, First_Flr_SF <int>, Second_Flr_SF <int>,
#> #   Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>,
#> #   Full_Bath <int>, Half_Bath <int>, Bedroom_AbvGr <int>, Kitchen_AbvGr <int>,
#> #   TotRms_AbvGrd <int>, Fireplaces <int>, Garage_Cars <dbl>,
#> #   Garage_Area <dbl>, Wood_Deck_SF <int>, Open_Porch_SF <int>,
#> #   Enclosed_Porch <int>, Three_season_porch <int>, Screen_Porch <int>, …

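Each step is explained below:

recipe(Sale_Price ~ ., data = ames[-(1:6), ]): creates a recipe object with Sale_Price as the outcome and every other column as a predictor. ames[-(1:6), ] drops the first 6 rows, which are held back to stand in for new data.
step_other(Neighborhood, threshold = 0.05): pools the Neighborhood levels that account for less than 5% of the training rows into a single new level, "other".
step_dummy(all_nominal()): converts every nominal variable into dummy (indicator) variables. By default a factor with C levels yields C - 1 columns, with the first level as the reference; full one-hot encoding requires one_hot = TRUE.
step_interact(~ starts_with("Central_Air"):Year_Built): creates interaction terms, the products of Year_Built with each dummy column whose name starts with Central_Air. starts_with() is needed because step_dummy() has already replaced Central_Air with its dummy column.
step_ns(Longitude, Latitude, deg_free = 2): replaces Longitude and Latitude with natural spline basis columns with 2 degrees of freedom each, which lets a linear model capture non-linear relationships.
step_zv(all_predictors()): removes zero-variance predictors, that is, columns with the same value in every row.

prep(ames_rec) estimates the recipe from the training data, computing whatever each step needs (here, for example, which Neighborhood levels to pool, the factor levels behind the dummy columns, the spline bases and the zero-variance columns). bake() then applies the prepared recipe to data: bake(ames_rec, new_data = NULL) returns the processed training set (the data supplied when the recipe was created), while bake(ames_rec, new_data = head(ames)) applies the same processing to the first 6 rows of ames, which were left out of training.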

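5.5 Feature engineering methods

The recipes package has two broad families of functions: selectors, which choose variables and work much like those in tidyselect, and transformations, whose names usually start with step_.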

Selectors

use basic variable names (e.g. x1, x2),
dplyr functions for selecting variables: contains(), ends_with(), everything(), matches(), num_range(), and starts_with(),
functions that subset on the role of the variables that have been specified so far: all_outcomes(), all_predictors(), has_role(),
similar functions for the type of data: all_nominal(), all_numeric(), and has_type(), or compound selectors such as all_nominal_predictors() or all_numeric_predictors().
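The selector families listed above can be combined within one recipe. The sketch below uses iris as stand-in data and two arbitrary steps chosen only to show selection by name and selection by role and type; tidy() is then used to see which columns each step actually picked up.

```r
library(recipes)

rec <- recipe(Species ~ ., data = iris) %>%
  # tidyselect helper: select columns by name
  step_log(starts_with("Petal")) %>%
  # compound selector: select by role (predictor) and type (numeric)
  step_normalize(all_numeric_predictors()) %>%
  prep()

# Columns resolved by each selector once the recipe has been prepared
log_terms  <- tidy(rec, number = 1)$terms
norm_terms <- unique(tidy(rec, number = 2)$terms)

log_terms
norm_terms
```

Selectors are resolved when prep() runs, so a step sees the columns that exist at that point in the pipeline, including any created by earlier steps.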

Step

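recipes provides many step_ functions for preprocessing and feature engineering. Some of the main categories are:

Centering and scaling: for example step_center() and step_scale(). Used together they rescale numeric variables to mean 0 and standard deviation 1; step_normalize() does both in one step. For example: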

recipe <- recipe(~ ., data = your_data) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric())

Discretization: for example step_discretize(), which splits a continuous variable into ranges ("bins") and converts it to a factor.

recipe <- recipe(~ ., data = your_data) %>%
  step_discretize(Age, num_breaks = 5)
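A runnable version of the discretization idea, under stated assumptions: the `toy` data frame and its `age` column are invented for illustration, and the number of bins is set through the num_breaks argument of step_discretize().

```r
library(recipes)

# Hypothetical data: 100 distinct ages, enough unique values for binning
toy <- data.frame(
  age = as.numeric(1:100),
  y   = seq_len(100) / 10
)

binned <- recipe(y ~ age, data = toy) %>%
  step_discretize(age, num_breaks = 4) %>%
  prep() %>%
  bake(new_data = NULL)

# age is now a factor; each value falls into one of four quantile-based bins
table(binned$age)
```

The bin boundaries are quantiles estimated from the training data, so new data are cut at the same boundaries when baked.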

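Dummy variables: for example step_dummy(), which converts a categorical variable into binary indicator columns. By default one level serves as the reference and gets no column; set one_hot = TRUE for one column per level.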

recipe <- recipe(~ ., data = your_data) %>%
  step_dummy(Gender)
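The difference between the default dummy coding and one-hot encoding can be checked on a small example. The `toy` data frame and its `colour` factor are assumptions made only for this sketch.

```r
library(recipes)

# Hypothetical data: one factor with three levels (blue, green, red)
toy <- data.frame(
  colour = factor(c("red", "green", "blue", "green")),
  y      = c(1, 2, 3, 4)
)

# Default: C - 1 indicator columns, the first level (blue) is the reference
default_cols <- recipe(y ~ colour, data = toy) %>%
  step_dummy(colour) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  names()

# one_hot = TRUE: one indicator column per level
one_hot_cols <- recipe(y ~ colour, data = toy) %>%
  step_dummy(colour, one_hot = TRUE) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  names()

default_cols
one_hot_cols
```

The default coding avoids perfectly collinear columns in linear models, while one-hot encoding is often preferred for models that do not need a reference level.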

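Interactions and polynomials: for example step_interact() and step_poly(). step_interact() creates interaction terms (products of two or more variables), and step_poly() creates polynomial features from numeric variables.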

recipe <- recipe(~ ., data = your_data) %>%
  step_interact(~ starts_with("x1"):starts_with("x2")) %>%
  step_poly(Age, degree = 2)

Missing values: for example step_impute_knn() and step_impute_median(), which fill in missing values by different methods (k-nearest-neighbour or median imputation).

recipe <- recipe(~ ., data = your_data) %>%
  step_impute_knn(all_predictors()) %>%
  step_impute_median(Age)
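A small worked case makes the median imputation concrete. The `toy` data frame is invented for illustration; the median of the observed ages (20, 30, 40, 100) is 35, so that is the value the step should fill in.

```r
library(recipes)

# Hypothetical data with one missing age
toy <- data.frame(
  age = c(20, 30, NA, 40, 100),
  y   = c(1, 2, 3, 4, 5)
)

imputed <- recipe(y ~ age, data = toy) %>%
  step_impute_median(age) %>%
  prep() %>%
  bake(new_data = NULL)

# The NA in row 3 is replaced by the training median
imputed$age
```

The median is estimated once in prep() and stored in the recipe, so missing values in new data are filled with the same training median.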

recipes also has filter-type steps that keep or drop particular observations or variables. For example:

step_slice(): keeps or drops observations by row position.

recipe <- recipe(~ ., data = your_data) %>%
  step_slice(5:10)

The code above keeps rows 5 to 10.

step_rm(): removes the specified variables from the data.

recipe <- recipe(~ ., data = your_data) %>%
  step_rm(Gender)

The code above removes the Gender column.

step_zv(): removes zero-variance predictors, that is, columns with the same value in every row.

recipe <- recipe(~ ., data = your_data) %>%
  step_zv(all_predictors())

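The code above removes every zero-variance predictor.

step_corr(): for highly correlated predictors (pairs whose absolute correlation exceeds a threshold), removes enough variables that the remaining pairs fall below the threshold.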

recipe <- recipe(~ ., data = your_data) %>%
  step_corr(all_numeric(), threshold = 0.9)

The code above removes numeric variables until no remaining pair has an absolute correlation above 0.9.
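The behaviour of the correlation filter can be seen on simulated data. Everything in the `toy` data frame is an assumption for illustration: `x2` is built as a near copy of `x1`, while `x3` is unrelated to both.

```r
library(recipes)

set.seed(1)
x1 <- rnorm(100)

# Hypothetical data: x2 is almost identical to x1, x3 is independent noise
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(100, sd = 0.01),
  x3 = rnorm(100),
  y  = rnorm(100)
)

kept <- recipe(y ~ ., data = toy) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  names()

# Only one of x1 and x2 survives; x3 and the outcome are untouched
kept
```

Using all_numeric_predictors() rather than all_numeric() keeps the outcome out of the filter, so it can never be dropped by this step.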

These are only some examples; recipes offers many more step_ functions. Choose the preprocessing steps that suit your data and modelling needs.

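5.6 Other feature engineering tools

Besides recipes, other packages provide additional feature engineering steps. They are listed at https://www.tidymodels.org/find/recipes/.

For example, textrecipes is a tidymodels extension package for the text preprocessing and feature engineering steps common in natural language processing (NLP). It follows the same design as recipes and provides step_ functions for text data.

Some of its main step_ functions are:

step_tokenize(): splits text into words or other tokens.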

recipe <- recipe(~ ., data = your_data) %>%
  step_tokenize(text_column)

step_stopwords(): removes common words that are assumed to carry little information for the model (such as "the" and "and").

recipe <- recipe(~ ., data = your_data) %>%
  step_tokenize(text_column) %>%
  step_stopwords(text_column)

step_tfidf(): computes the TF-IDF (term frequency-inverse document frequency) score of each token, a common measure of how important a term is.

recipe <- recipe(~ ., data = your_data) %>%
  step_tokenize(text_column) %>%
  step_tfidf(text_column)

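step_sequence_onehot(): encodes each token sequence as a fixed-length sequence of positions in a vocabulary (a positional one-hot encoding). Like the steps above, it expects tokenized input.

recipe <- recipe(~ ., data = your_data) %>%
  step_tokenize(text_column) %>%
  step_sequence_onehot(text_column)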

These are only some of the functions in textrecipes; others handle specific tasks such as stemming and lemmatization. The package is a useful extension of recipes for working with text data.
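A runnable sketch of a tokenize-then-TF-IDF pipeline, assuming the textrecipes package is installed. The three sentences in `docs` are invented for illustration, and the expectation that the output columns are prefixed with tfidf_text_ reflects the package's default naming as I understand it rather than anything stated in this chapter.

```r
library(recipes)
library(textrecipes)  # assumed to be installed alongside recipes

# Hypothetical corpus of three short documents
docs <- data.frame(
  text = c(
    "the cat sat on the mat",
    "the dog chased the cat",
    "birds sing at dawn"
  ),
  stringsAsFactors = FALSE
)

tfidf <- recipe(~ text, data = docs) %>%
  step_tokenize(text) %>%   # split each document into word tokens
  step_tfidf(text) %>%      # one TF-IDF column per token in the vocabulary
  prep() %>%
  bake(new_data = NULL)

dim(tfidf)
head(names(tfidf))
```

On a real corpus the vocabulary is usually trimmed first, for example with step_stopwords() or step_tokenfilter(), to keep the number of columns manageable.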

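You can also write your own step_* functions; see https://www.tidymodels.org/learn/develop/recipes/.

5.7 Creating new variables

This example shows how to create new variables with step_mutate() and how to embed external objects in a recipe correctly. Let's go through it step by step.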

rec <-
  recipe(~., data = iris) %>%
  step_mutate(
    dbl_width = Sepal.Width * 2,
    half_length = Sepal.Length / 2
  )

prepped <- prep(rec, training = iris %>% slice(1:75))

library(dplyr)

dplyr_train <-
  iris %>%
  as_tibble() %>%
  slice(1:75) %>%
  mutate(
    dbl_width = Sepal.Width * 2,
    half_length = Sepal.Length / 2
  )

rec_train <- bake(prepped, new_data = NULL)
all.equal(dplyr_train, rec_train)
#> [1] TRUE

dplyr_test <-
  iris %>%
  as_tibble() %>%
  slice(76:150) %>%
  mutate(
    dbl_width = Sepal.Width * 2,
    half_length = Sepal.Length / 2
  )
rec_test <- bake(prepped, iris %>% slice(76:150))
all.equal(dplyr_test, rec_test)
#> [1] TRUE

# Embedding objects:
const <- 1.414

qq_rec <-
  recipe(~., data = iris) %>%
  step_mutate(
    bad_approach = Sepal.Width * const,
    best_approach = Sepal.Width * !!const
  ) %>%
  prep(training = iris)

bake(qq_rec, new_data = NULL, contains("appro")) %>% slice(1:4)
#> # A tibble: 4 × 2
#>   bad_approach best_approach
#>          <dbl>         <dbl>
#> 1         4.95          4.95
#> 2         4.24          4.24
#> 3         4.52          4.52
#> 4         4.38          4.38

# The difference:
tidy(qq_rec, number = 1)
#> # A tibble: 2 × 3
#>   terms         value               id          
#>   <chr>         <chr>               <chr>       
#> 1 bad_approach  Sepal.Width * const mutate_Bv7YY
#> 2 best_approach Sepal.Width * 1.414 mutate_Bv7YY

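First, a recipe is defined that creates two new columns from the iris data: dbl_width (twice Sepal.Width) and half_length (half of Sepal.Length). prep() then prepares (trains) the recipe on the first 75 rows.

Next, the same transformations are applied by hand with dplyr to the training rows (1 to 75) and the test rows (76 to 150), and all.equal() confirms that the results match those produced by the recipe.

The second part shows the right way to embed an external object. Both new columns multiply Sepal.Width by the constant const, but best_approach uses the !! (injection) operator from tidy evaluation, which inserts the current value of const into the expression when the recipe is defined. Without !!, the recipe stores the symbol const and looks it up only when the step runs, so the result depends on const still existing with the same value, which may not hold in a new R session or on a parallel worker.

Finally, tidy() shows the details of the step. In the output, the expression for best_approach contains the actual value 1.414 in place of const, whereas bad_approach still refers to const by name.

Overall, the example shows how to create variables with recipes and how to embed external objects safely.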
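Why the injected version is safer can be demonstrated by changing the constant after the recipe is defined. This is a sketch under assumptions: the column names `late_lookup` and `injected` and the replacement value 100 are invented for illustration.

```r
library(recipes)

const <- 1.414

rec <- recipe(~ ., data = iris) %>%
  step_mutate(
    late_lookup = Sepal.Width * const,    # stores the symbol `const`
    injected    = Sepal.Width * !!const   # stores the value 1.414
  )

# The constant changes between defining and preparing the recipe
const <- 100

out <- bake(prep(rec, training = iris), new_data = NULL)

# late_lookup silently picks up the new value; injected keeps the original one
head(out[, c("Sepal.Width", "late_lookup", "injected")], 3)
```

The same mechanism explains failures in fresh sessions: if const no longer exists when the step runs, the expression without !! cannot be evaluated at all.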