
R for Data Science (2e), free Chinese translation (Chapter 9): Layers (1)

news 2025/9/28 8:45:23

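Preface

This series of posts is a Chinese translation of R for Data Science (2e). All content is published on GitHub as free, open-source material, and contributions are welcome. For details, see:
Books-zh-cn project introduction:
Books-zh-cn: a free and open-source community for Chinese-language books
r4ds-zh-cn GitHub repository:
https://github.com/Books-zh-cn/r4ds-zh-cn
r4ds-zh-cn website:
https://books-zh-cn.github.io/r4ds-zh-cn/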

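In 9 Layers, you will learn about the layered grammar of graphics.
In 10 Exploratory data analysis, you will combine visualization with your curiosity and skepticism to ask and answer interesting questions about data.
Finally, in 11 Communication, you will learn how to take your exploratory graphics, elevate them, and turn them into expository graphics that help a newcomer to your analysis understand what is going on as quickly and easily as possible.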

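These three chapters get you started in the visualization world, but there is much more to learn. The best place to learn more is the ggplot2 book, ggplot2: Elegant graphics for data analysis. It goes into much more depth about the underlying theory and has many more examples of how to combine the individual pieces to solve practical problems. Another great resource is the ggplot2 extensions gallery, https://exts.ggplot2.tidyverse.org/gallery/. This site lists many of the packages that extend ggplot2 with new geoms and scales. It is a great place to start if you are trying to do something that seems hard with ggplot2.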

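9.1 Introduction

In Chapter 1, you learned much more than just how to make scatterplots, bar charts, and boxplots. You learned a foundation that you can use to make any type of plot with ggplot2.

In this chapter, you will expand on that foundation as you learn about the layered grammar of graphics. We will start with a deeper dive into aesthetic mappings, geometric objects, and facets. Then you will learn about the statistical transformations ggplot2 makes under the hood when creating a plot. These transformations are used to calculate new values to plot, such as the heights of bars in a bar plot or the medians in a box plot. You will also learn about position adjustments, which modify how geoms are displayed in your plots. Finally, we will briefly introduce coordinate systems.

We will not cover every single function and option for each of these layers, but we will walk you through the most important and commonly used functionality provided by ggplot2 and introduce you to packages that extend ggplot2.

9.1.1 Prerequisites

This chapter focuses on ggplot2. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code: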

library(tidyverse)

9.2 Aesthetic mappings

"The greatest value of a picture is when it forces us to notice what we never expected to see." --- John Tukey

Remember that the mpg data frame bundled with the ggplot2 package contains 234 observations on 38 car models.

mpg
#> # A tibble: 234 × 11
#>   manufacturer model displ  year   cyl trans      drv     cty   hwy fl   
#>   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr>
#> 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p    
#> 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p    
#> 3 audi         a4      2    2008     4 manual(m6) f        20    31 p    
#> 4 audi         a4      2    2008     4 auto(av)   f        21    30 p    
#> 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p    
#> 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p    
#> # ℹ 228 more rows
#> # ℹ 1 more variable: class <chr>
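The number of observations and the number of distinct car models in mpg can be computed directly from the data. The snippet below is a minimal sketch using base R only; the variable names n_obs and n_models are ours, introduced for illustration.

```r
library(ggplot2)  # provides the mpg data frame

# Number of observations (rows) in mpg
n_obs <- nrow(mpg)

# Number of distinct car models
n_models <- length(unique(mpg$model))

c(observations = n_obs, models = n_models)
```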

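Among the variables in mpg are:

displ: A car's engine size, in liters. A numerical variable.
hwy: A car's fuel efficiency on the highway, in miles per gallon (mpg). A car with low fuel efficiency consumes more fuel than a car with high fuel efficiency when they travel the same distance. A numerical variable.
class: Type of car. A categorical variable.

Let's start by visualizing the relationship between displ and hwy for various classes of cars. We can do this with a scatterplot where the numerical variables are mapped to the x and y aesthetics and the categorical variable is mapped to an aesthetic like color or shape.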

# Left
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Right
ggplot(mpg, aes(x = displ, y = hwy, shape = class)) +
  geom_point()
#> Warning: The shape palette can deal with a maximum of 6 discrete values because more
#> than 6 becomes difficult to discriminate
#> ℹ you have requested 7 values. Consider specifying shapes manually if you
#>   need that many of them.
#> Warning: Removed 62 rows containing missing values or values outside the scale range
#> (`geom_point()`).

When class is mapped to shape, we get two warnings:

1: The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to discriminate; you have 7. Consider specifying shapes manually if you must have them.

2: Removed 62 rows containing missing values (geom_point()).

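Since ggplot2 will only use six shapes at a time, by default, additional groups will go unplotted when you use the shape aesthetic. The second warning is related: there are 62 SUVs in the dataset and they are not plotted.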
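The first warning suggests specifying shapes manually. A minimal sketch of one way to do that uses scale_shape_manual() with seven shape numbers, one per class. The particular numbers chosen here (0 to 6) are an arbitrary assumption, and the object name p is ours.

```r
library(ggplot2)

# Supply one shape per level of class so that no group is dropped.
p <- ggplot(mpg, aes(x = displ, y = hwy, shape = class)) +
  geom_point() +
  scale_shape_manual(values = 0:6)

p
```

With seven shapes supplied, every class, including the SUVs, should receive a shape, although telling seven shapes apart may still be hard for readers.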

Similarly, we can map class to the size or alpha aesthetics as well, which control the size and the transparency of the points, respectively.

# Left
ggplot(mpg, aes(x = displ, y = hwy, size = class)) +
  geom_point()
#> Warning: Using size for a discrete variable is not advised.

# Right
ggplot(mpg, aes(x = displ, y = hwy, alpha = class)) +
  geom_point()
#> Warning: Using alpha for a discrete variable is not advised.

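Both of these produce warnings as well:

Using alpha for a discrete variable is not advised.

Mapping an unordered discrete (categorical) variable (class) to an ordered aesthetic (size or alpha) is generally not a good idea because it implies a ranking that does not in fact exist.

Once you map an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For the x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line provides the same information as a legend: it explains the mapping between locations and values.

You can also set the visual properties of your geom manually as an argument of your geom function (outside of aes()) instead of relying on a variable mapping to determine the appearance. For example, we can make all of the points in our plot blue: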

ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(color = "blue")

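Here, the color does not convey information about a variable, but only changes the appearance of the plot. You will need to pick a value that makes sense for that aesthetic:

The name of a color as a character string, e.g., color = "blue"
The size of a point in mm, e.g., size = 1
The shape of a point as a number, e.g., shape = 1, as shown in Figure 9.1.

Figure 9.1: R has 25 built-in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the color and fill aesthetics. The hollow shapes (0 to 14) have a border determined by color; the solid shapes (15 to 20) are filled with color; the filled shapes (21 to 24) have a border of color and are filled with fill. Shapes are arranged to keep similar shapes next to each other.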
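To illustrate how color and fill interact for the filled shapes described in Figure 9.1, here is a small sketch that sets both outside of aes(). The specific values (shape 21, a black border, a light blue interior) are arbitrary choices for illustration, and the object name p is ours.

```r
library(ggplot2)

# Shape 21 is a filled circle: color sets the border, fill sets the interior.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(shape = 21, color = "black", fill = "lightblue", size = 2)

p
```

Because these values are set rather than mapped, they apply to every point and no legend is drawn for them.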

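So far we have discussed aesthetics that we can map or set in a scatterplot, when using a point geom. You can learn more about all possible aesthetic mappings in the aesthetic specifications vignette at https://ggplot2.tidyverse.org/articles/ggplot2-specs.html.

The specific aesthetics you can use for a plot depend on the geom you use to represent the data. In the next section we dive deeper into geoms.

9.2.1 Exercises

Create a scatterplot of hwy vs. displ where the points are pink filled-in triangles.

Why did the following code not result in a plot with blue points?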

ggplot(mpg) + geom_point(aes(x = displ, y = hwy, color = "blue"))

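What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
What happens if you map an aesthetic to something other than a variable name, like aes(color = displ < 5)? Note that you will also need to specify x and y.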

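9.3 Geometric objects

How are these two plots similar?

Both plots contain the same x variable and the same y variable, and both describe the same data. But the plots are not identical. Each plot uses a different geometric object, or geom, to represent the data. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.

To change the geom in your plot, change the geom function that you add to ggplot(). For instance, to make the plots above, you can use the following code: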

# Left
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Right
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth()
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

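Every geom function in ggplot2 takes a mapping argument, either defined locally in the geom layer or globally in the ggplot() layer. However, not every aesthetic works with every geom. You could set the shape of a point, but you could not set the "shape" of a line. If you try, ggplot2 will silently ignore that aesthetic mapping. On the other hand, you could set the linetype of a line. geom_smooth() will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype.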

# Left
ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) +
  geom_smooth()

# Right
ggplot(mpg, aes(x = displ, y = hwy, linetype = drv)) +
  geom_smooth()

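Here, geom_smooth() separates the cars into three lines based on their drv value, which describes a car's drive train. One line describes all of the points that have a 4 value, one line describes all of the points that have an f value, and one line describes all of the points that have an r value. Here, 4 stands for four-wheel drive, f for front-wheel drive, and r for rear-wheel drive.

If this sounds strange, we can make it clearer by overlaying the lines on top of the raw data and then coloring everything according to drv.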

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(aes(linetype = drv))

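Notice that this plot contains two geoms in the same graph.

Many geoms, like geom_smooth(), use a single geometric object to display multiple rows of data. For these geoms, you can set the group aesthetic to a categorical variable to draw multiple objects. ggplot2 will draw a separate object for each unique value of the grouping variable. In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the linetype example). It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the geoms.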

# Left
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth()

# Middle
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(aes(group = drv))

# Right
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(aes(color = drv), show.legend = FALSE)
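One way to see the grouping at work is to inspect the data ggplot2 computes for a layer. The sketch below uses layer_data(), a ggplot2 helper, and counts the distinct groups with and without the group aesthetic. Using method = "lm" with an explicit formula is our simplification to keep the example quick and quiet; it is not what the plots above use. All object names are ours.

```r
library(ggplot2)

# One smooth for all cars
p_all <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(method = "lm", formula = y ~ x)

# One smooth per value of drv
p_grouped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(aes(group = drv), method = "lm", formula = y ~ x)

# Count the separate objects each layer will draw
n_all <- length(unique(layer_data(p_all)$group))
n_grouped <- length(unique(layer_data(p_grouped)$group))

c(ungrouped = n_all, grouped = n_grouped)
```

The grouped layer should report one group per value of drv, which matches the three separate lines in the middle plot.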

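If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings for that layer only. This makes it possible to display different aesthetics in different layers.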

ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(aes(color = class)) + geom_smooth()

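You can use the same idea to specify different data for each layer. Here, we use red points as well as open circles to highlight two-seater cars. The local data argument in geom_point() overrides the global data argument in ggplot() for that layer only.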

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_point(
    data = mpg |> filter(class == "2seater"),
    color = "red"
  ) +
  geom_point(
    data = mpg |> filter(class == "2seater"),
    shape = "circle open", size = 3, color = "red"
  )

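Geoms are the fundamental building blocks of ggplot2. You can completely transform the look of your plot by changing its geom, and different geoms can reveal different features of your data. For example, the histogram and density plot below reveal that the distribution of highway mileage is bimodal and right skewed, while the boxplot reveals two potential outliers.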

# Left
ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Middle
ggplot(mpg, aes(x = hwy)) +
  geom_density()

# Right
ggplot(mpg, aes(x = hwy)) +
  geom_boxplot()

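ggplot2 provides more than 40 geoms, but these do not cover all the possible plots one could make. If you need a different geom, we recommend looking into extension packages first to see if someone else has already implemented it (see https://exts.ggplot2.tidyverse.org/gallery/ for a sampling). For example, the ggridges package (https://wilkelab.org/ggridges/) is useful for making ridgeline plots, which can help visualize the density of a numerical variable for different levels of a categorical variable. In the following plot, not only did we use a new geom (geom_density_ridges()), but we have also mapped the same variable to multiple aesthetics (drv to y, fill, and color) as well as set an aesthetic (alpha = 0.5) to make the density curves transparent.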

library(ggridges)

ggplot(mpg, aes(x = hwy, y = drv, fill = drv, color = drv)) +
  geom_density_ridges(alpha = 0.5, show.legend = FALSE)
#> Picking joint bandwidth of 1.28

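The best place to get a comprehensive overview of all of the geoms ggplot2 offers, as well as all functions in the package, is the reference page: https://ggplot2.tidyverse.org/reference. To learn more about any single geom, use the help (e.g., ?geom_smooth).

9.3.1 Exercises

What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
Earlier in this chapter we used show.legend without explaining it: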
```
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(aes(color = drv), show.legend = FALSE)
```
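What does show.legend = FALSE do here? What happens if you remove it? Why do you think we used it earlier?
What does the se argument to geom_smooth() do?
Recreate the R code necessary to generate the following graphs. Note that wherever a categorical variable is used in the plot, it is drv.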

9.4 Facets

In Chapter 1 you learned about faceting with facet_wrap(), which splits a plot into subplots that each display one subset of the data based on a categorical variable.

ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() + facet_wrap(~cyl)

To facet your plot with the combination of two variables, switch from facet_wrap() to facet_grid(). The first argument of facet_grid() is also a formula, but now it is a double-sided formula: rows ~ cols.

ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() + facet_grid(drv ~ cyl)
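facet_grid() also accepts its faceting variables through the rows and cols arguments wrapped in vars(), which some readers find easier to read than the formula. The sketch below should produce the same layout as drv ~ cyl; the object name p is ours.

```r
library(ggplot2)

# Equivalent to facet_grid(drv ~ cyl): drv defines the rows, cyl the columns.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(rows = vars(drv), cols = vars(cyl))

p
```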

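By default, each of the facets shares the same scale and range for the x and y axes. This is useful when you want to compare data across facets, but it can be limiting when you want to visualize the relationship within each facet better. Setting the scales argument in a faceting function to "free" allows the axis scales to vary across both rows and columns. In facet_grid(), "free_x" lets the x-axis scale vary across columns, and "free_y" lets the y-axis scale vary across rows.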

ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() + facet_grid(drv ~ cyl, scales = "free_y")
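To check what "free_y" does in facet_grid(), one can inspect the axis ranges of the built plot. The sketch below relies on ggplot_build() and on the panel_params element of its layout, which is an internal structure that may change between ggplot2 versions, so treat it as an assumption rather than a stable interface. All object names are ours.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl, scales = "free_y")

built <- ggplot_build(p)
layout <- built$layout$layout        # one row per panel, with ROW and COL
params <- built$layout$panel_params  # axis ranges, one entry per panel

# Collect the x and y axis ranges of every panel (one row per panel)
x_ranges <- as.data.frame(t(vapply(params, function(pp) pp$x.range, numeric(2))))
y_ranges <- as.data.frame(t(vapply(params, function(pp) pp$y.range, numeric(2))))

# Group the y ranges by facet row: shared within a row, free between rows
y_by_row <- split(y_ranges, layout$ROW)
lapply(y_by_row, unique)
```

In this sketch the x range stays common to all panels, while each facet row gets its own y range.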

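9.4.1 Exercises

What happens if you facet on a continuous variable?
What do the empty cells in the plot above with facet_grid(drv ~ cyl) mean? Run the following code. How do they relate to the resulting plot?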
```
ggplot(mpg) + geom_point(aes(x = drv, y = cyl))
```

What plots does the following code make? What does . do?

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

Take the first faceted plot in this section:
```
ggplot(mpg) + geom_point(aes(x = displ, y = hwy)) + facet_wrap(~ class, nrow = 2)
```
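What are the advantages of using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?
Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn't facet_grid() have nrow and ncol arguments?
Which of the following plots makes it easier to compare engine size (displ) across cars with different drive trains? What does this say about when to place a faceting variable across rows or columns?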
```
ggplot(mpg, aes(x = displ)) +
  geom_histogram() +
  facet_grid(drv ~ .)

ggplot(mpg, aes(x = displ)) +
  geom_histogram() +
  facet_grid(. ~ drv)
```
Recreate the following plot using facet_wrap() instead of facet_grid(). How do the positions of the facet labels change?
```
ggplot(mpg) + geom_point(aes(x = displ, y = hwy)) +facet_grid(drv ~ .)
```