6  Data wrangling in dplyr

6.1 Lesson preamble

6.1.1 Learning Objectives

  • Learn to use data wrangling commands select, filter, %>%, and mutate from the dplyr package.
  • Understand the split-apply-combine concept for data analysis.
  • Use summarize, group_by, and tally to split a data frame into groups of observations, apply a summary statistics for each group, and then combine the results.
  • Learn to switch between long and wide format

6.1.2 Lesson outline

  • Continue data wrangling in dplyr (30 mins)
  • Split-apply-combine techniques in dplyr (20 mins)
  • Using group_by and tally to summarize categorical data (25 mins)
  • Reshaping data (15 mins)

6.2 Getting ready to code

We are going to pick up where we left off last lecture, and continue to work with the same desert animal survey data.

We’ll start by doing our checks. You can set up your yaml block and check your working directory with getwd() on your own. Make sure you are in the directory where you saved our dataset last time and clear your workspace for our new session. We’ll also load the required packages (mostly tidyverse for this class) and data. Make sure what you need to be comfortable is within reach such as tea, coffee, cookies, blankets.

# setwd("~/Documents/UofT/PhD/Teaching/2022-2023/EEB313/2022/Lectures")

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.1.2
Warning: package 'tibble' was built under R version 4.1.2
Warning: package 'tidyr' was built under R version 4.1.2
Warning: package 'readr' was built under R version 4.1.2
Warning: package 'stringr' was built under R version 4.1.2
Warning: package 'forcats' was built under R version 4.1.2
Warning: package 'lubridate' was built under R version 4.1.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
surveys <- read.csv("portal_data.csv", na.strings = c("",".","NA"))
surveys %>% head()
  record_id month day year plot_id species_id  sex hindfoot_length weight
1         1     7  16 1977       2         NL    M              32     NA
2        72     8  19 1977       2         NL    M              31     NA
3       224     9  13 1977       2         NL <NA>              NA     NA
4       266    10  16 1977       2         NL <NA>              NA     NA
5       349    11  12 1977       2         NL <NA>              NA     NA
6       363    11  12 1977       2         NL <NA>              NA     NA
    genus  species   taxa plot_type
1 Neotoma albigula Rodent   Control
2 Neotoma albigula Rodent   Control
3 Neotoma albigula Rodent   Control
4 Neotoma albigula Rodent   Control
5 Neotoma albigula Rodent   Control
6 Neotoma albigula Rodent   Control
surveys_subset <- surveys[seq(1, 34786, 8), ]

6.3 Data wrangling with dplyr

Wrangling here is used in the sense of maneuvering, managing, controlling, and turning your data upside down and inside out to look at it from different angles in order to understand it. The package dplyr provides easy tools for the most common data manipulation tasks. It is built to work directly with data frames, with many common tasks optimized by being written in a compiled language (C++), which means that many operations run much faster than similar tools in R. An additional feature is the ability to work directly with data stored in an external database, such as SQL-databases. The ability to work with databases is great because you are able to work with much bigger datasets (100s of GB) than your computer could normally handle. We will not talk in detail about this in class, but there are great resources online to learn more (e.g. this lecture from Data Carpentry).

6.3.1 Selecting columns and filtering rows

We’re going to learn some of the most common dplyr functions: select(), filter(), mutate(), group_by(), arrange, and summarise(). To select columns of a data frame, use select(). The first argument to this function is the data frame (surveys_subset), and the subsequent arguments are the columns to keep.

select(surveys_subset, plot_id, species_id, weight, year) %>% head()
   plot_id species_id weight year
1        2         NL     NA 1977
9        2         NL    218 1978
17       2         NL    218 1978
25       2         NL     87 1980
33       2         NL    211 1982
41       2         NL    152 1982

Note: Unlike in base R, here we don’t need to use quotations around column names.

To choose rows based on a specific criteria, use filter():

filter(surveys_subset, year == 1995) %>% head()
  record_id month day year plot_id species_id sex hindfoot_length weight
1     22044     2   4 1995       2         DM   M              37     46
2     22550     8  26 1995       2         DM   F              37     26
3     22997    12   2 1995       2         DM   M              36     50
4     23136    12  21 1995       2         DM   M              33     27
5     22441     7  20 1995       2         PP   F              19     17
6     22669     9  23 1995       2         PP   F              20     17
        genus      species   taxa plot_type
1   Dipodomys     merriami Rodent   Control
2   Dipodomys     merriami Rodent   Control
3   Dipodomys     merriami Rodent   Control
4   Dipodomys     merriami Rodent   Control
5 Chaetodipus penicillatus Rodent   Control
6 Chaetodipus penicillatus Rodent   Control

Note2: To check for equality, R requires two equal signs (==). This is different than object assignment where we use (‘<-’) or (‘=’) to assign values to an object. With filter we want to pull out all rows where year is equal to 1995 not assign the value 1995 to an object named year, so we use the == symbol.

Note3: In general, when you want to find all rows that equal a numeric number you don’t have to use quotes. However, when you want to find all rows that equal a character you do need to put quotes around the value (for instance taxa == "Rodent").

6.3.1.1 An aside on conditionals

Within filter you might want to filter rows using conditionals. Basic conditionals in R are broadly similar to how they’re expressed mathematically:

2 < 3
[1] TRUE
5 > 9
[1] FALSE
5 == 5
[1] TRUE

However, there are a few idiosyncrasies to be mindful of for other conditionals:

2 != 3 # Not equal
[1] TRUE
2 <= 3 # Less than or equal to
[1] TRUE
5 >= 9 # Greater than or equal to
[1] FALSE

Finally, the %in% operator is used to check for membership:

2 %in% c(2, 3, 4) # Checks whether 2 is in c(2, 3, 4), returns logical vector
[1] TRUE

All of the above conditionals are compatible with filter, with the key difference being that filter expects column names as part of conditional statements instead of individual numbers.

6.3.2 Chaining functions together using pipes

But what if you wanted to select and filter at the same time? There are three ways to do this: use intermediate steps, nesting functions, or pipes.

With intermediate steps, you essentially create a temporary data frame and use that as input to the next function:

temp_df <- select(surveys_subset, plot_id, species_id, weight, year)
filter(temp_df, year == 1995) %>% head()
  plot_id species_id weight year
1       2         DM     46 1995
2       2         DM     26 1995
3       2         DM     50 1995
4       2         DM     27 1995
5       2         PP     17 1995
6       2         PP     17 1995

This can quickly clutter up your workspace with lots of objects.

You can also nest functions (i.e., one function inside of another).

filter(select(surveys_subset, plot_id, species_id, weight, year), year == 1995) %>% head()
  plot_id species_id weight year
1       2         DM     46 1995
2       2         DM     26 1995
3       2         DM     50 1995
4       2         DM     27 1995
5       2         PP     17 1995
6       2         PP     17 1995

This is handy, but can be difficult to read if too many functions are nested as they are evaluated from the inside out.

The last option, forward pipes, are a fairly recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>% and are made available via the magrittr package that is a part of the tidyverse. If you use RStudio, you can type the pipe with Ctrl/Cmd + Shift + M.

Fun fact: The name magrittr comes from the Belgian artist Rene Magritte, who has a painting called “The Treachery of Images” that says in French “This is not a pipe”.

surveys_subset %>% 
    select(., plot_id, species_id, weight, year) %>% 
    filter(., year == 1995) %>% head()
  plot_id species_id weight year
1       2         DM     46 1995
2       2         DM     26 1995
3       2         DM     50 1995
4       2         DM     27 1995
5       2         PP     17 1995
6       2         PP     17 1995

The . refers to the object that is passed from the previous line. In this example, the data frame surveys_subset is passed to the . in the select() statement. Then, the modified data frame (which is the result of the select() operation) is passed to the . in the filter() statement. Put more simply: whatever was the result from the line above will be used in the current line.

Since it gets a bit tedious to write out all the dots, dplyr allows for them to be omitted. In the dplyr family of functions, the first argument is always a data frame, and by default the pipe will pass the output from the line above to this argument. The chunk below, with the . omitted, gives the same output as the one above:

surveys_subset %>% 
    select(plot_id, species_id, weight, year) %>% 
    filter(year == 1995) %>% head()
  plot_id species_id weight year
1       2         DM     46 1995
2       2         DM     26 1995
3       2         DM     50 1995
4       2         DM     27 1995
5       2         PP     17 1995
6       2         PP     17 1995

If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the head() of the data. Pipes work with non-dplyr functions, too, as long as either the dplyr or magrittr package is loaded.

surveys_subset %>% 
    select(plot_id, species_id, weight, year) %>% 
    filter(year == 1995) %>%
    head()
  plot_id species_id weight year
1       2         DM     46 1995
2       2         DM     26 1995
3       2         DM     50 1995
4       2         DM     27 1995
5       2         PP     17 1995
6       2         PP     17 1995

If we wanted to create a new object with this smaller version of the data, we could do so by assigning it a new name:

surveys_1995 <- surveys_subset %>% 
    select(plot_id, species_id, weight, year) %>% 
    filter(year == 1995)

surveys_1995 %>% head()
  plot_id species_id weight year
1       2         DM     46 1995
2       2         DM     26 1995
3       2         DM     50 1995
4       2         DM     27 1995
5       2         PP     17 1995
6       2         PP     17 1995

6.3.2.1 Challenge

Use the pipe to subset the data frame, keeping only rows where weight is less than 10, and only the columns species_id, sex, and weight.

surveys_subset %>%
  filter(weight < 10) %>%
  select(species_id, sex, weight) %>% head()
  species_id sex weight
1         PF   M      8
2         PF   M      6
3         RM   F      8
4         RM   F      8
5         PF   F      6
6         PF   M      8

We could write a single expression to filter for several criteria, either matching all criteria (&) or any criteria (|):

surveys_subset %>% 
    filter(taxa == 'Rodent' & sex == 'F') %>% 
    select(sex, taxa) %>% head()
  sex   taxa
1   F Rodent
2   F Rodent
3   F Rodent
4   F Rodent
5   F Rodent
6   F Rodent
surveys_subset %>% 
    filter(species == 'clarki' | species == 'audubonii') %>% 
    select(species, taxa) %>% head()
    species    taxa
1 audubonii  Rabbit
2 audubonii  Rabbit
3 audubonii  Rabbit
4 audubonii  Rabbit
5    clarki Reptile
6 audubonii  Rabbit

6.3.3 Creating new columns with mutate

Frequently, you’ll want to create new columns based on the values in existing columns. For instance, you might want to do unit conversions, or find the ratio of values in two columns. For this we’ll use mutate().

To create a new column of weight in kg:

surveys_subset %>%
    mutate(weight_kg = weight / 1000) %>% head()
   record_id month day year plot_id species_id sex hindfoot_length weight
1          1     7  16 1977       2         NL   M              32     NA
9        588     2  18 1978       2         NL   M              NA    218
17      1453    11   5 1978       2         NL   M              NA    218
25      3000     5  18 1980       2         NL   F              31     87
33      5558     3  29 1982       2         NL   M              33    211
41      6500     8  16 1982       2         NL   F              33    152
     genus  species   taxa plot_type weight_kg
1  Neotoma albigula Rodent   Control        NA
9  Neotoma albigula Rodent   Control     0.218
17 Neotoma albigula Rodent   Control     0.218
25 Neotoma albigula Rodent   Control     0.087
33 Neotoma albigula Rodent   Control     0.211
41 Neotoma albigula Rodent   Control     0.152

You can also create a second new column based on the first new column within the same call of mutate():

surveys_subset %>%
    mutate(weight_kg = weight / 1000,
           weight_kg2 = weight_kg * 2) %>% head()
   record_id month day year plot_id species_id sex hindfoot_length weight
1          1     7  16 1977       2         NL   M              32     NA
9        588     2  18 1978       2         NL   M              NA    218
17      1453    11   5 1978       2         NL   M              NA    218
25      3000     5  18 1980       2         NL   F              31     87
33      5558     3  29 1982       2         NL   M              33    211
41      6500     8  16 1982       2         NL   F              33    152
     genus  species   taxa plot_type weight_kg weight_kg2
1  Neotoma albigula Rodent   Control        NA         NA
9  Neotoma albigula Rodent   Control     0.218      0.436
17 Neotoma albigula Rodent   Control     0.218      0.436
25 Neotoma albigula Rodent   Control     0.087      0.174
33 Neotoma albigula Rodent   Control     0.211      0.422
41 Neotoma albigula Rodent   Control     0.152      0.304

We can see that there is some NAs in our new column. If we wanted to remove those we could insert a filter() in the chain, paired with the !is.na notation we learned in the last lecture:

surveys_subset %>%
    filter(!is.na(weight)) %>%
    mutate(weight_kg = weight / 1000) %>% head()
  record_id month day year plot_id species_id sex hindfoot_length weight
1       588     2  18 1978       2         NL   M              NA    218
2      1453    11   5 1978       2         NL   M              NA    218
3      3000     5  18 1980       2         NL   F              31     87
4      5558     3  29 1982       2         NL   M              33    211
5      6500     8  16 1982       2         NL   F              33    152
6      8657    11  13 1983       2         NL   F              32    158
    genus  species   taxa plot_type weight_kg
1 Neotoma albigula Rodent   Control     0.218
2 Neotoma albigula Rodent   Control     0.218
3 Neotoma albigula Rodent   Control     0.087
4 Neotoma albigula Rodent   Control     0.211
5 Neotoma albigula Rodent   Control     0.152
6 Neotoma albigula Rodent   Control     0.158

6.3.3.1 Challenge

Create a new data frame from the surveys_subset data that meets the following criteria: contains only the species_id column and a new column called hindfoot_half containing values that are half the hindfoot_length values. In this hindfoot_half column, there should be no NAs and all values should be less than 30. (Hint: think about how the commands should be ordered to produce this data frame.)

## Answer
surveys_hindfoot_half <- surveys_subset %>%
    filter(!is.na(hindfoot_length)) %>%
    mutate(hindfoot_half = hindfoot_length / 2) %>%
    filter(hindfoot_half < 30) %>%
    select(species_id, hindfoot_half)

surveys_hindfoot_half %>% head()
  species_id hindfoot_half
1         NL          16.0
2         NL          15.5
3         NL          16.5
4         NL          16.5
5         NL          16.0
6         NL          16.0

6.4 Split-apply-combine techniques in dplyr

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results.

dplyr facilitates this workflow through the use of group_by() and summarize(), which collapses each group into a single-row summary of that group. The arguments to group_by() are the column names that contain the categorical variables for which you want to calculate the summary statistics. Let’s view the mean weight by sex.

surveys_subset %>%
    group_by(sex) %>%
    summarize(mean_weight = mean(weight))
# A tibble: 3 × 2
  sex   mean_weight
  <chr>       <dbl>
1 F              NA
2 M              NA
3 <NA>           NA

The mean weights become NA since there are individual observations that are NA. Let’s remove those observations.

surveys_subset %>%
    filter(!is.na(weight)) %>%
    group_by(sex) %>%
    summarize(mean_weight = mean(weight))
# A tibble: 3 × 2
  sex   mean_weight
  <chr>       <dbl>
1 F            43.0
2 M            42.1
3 <NA>         60.2

There is one row here that is neither male nor female. These are observations where the animal escaped before the sex could be determined. Let’s remove those as well.

surveys_subset %>%
    filter(!is.na(weight) & !is.na(sex)) %>%
    group_by(sex) %>%
    summarize(mean_weight = mean(weight))
# A tibble: 2 × 2
  sex   mean_weight
  <chr>       <dbl>
1 F            43.0
2 M            42.1

You can also group by multiple columns:

surveys_subset %>%
    filter(!is.na(weight) & !is.na(sex)) %>%
    group_by(genus, sex) %>%
    summarize(mean_weight = mean(weight))
`summarise()` has grouped output by 'genus'. You can override using the
`.groups` argument.
# A tibble: 18 × 3
# Groups:   genus [9]
   genus           sex   mean_weight
   <chr>           <chr>       <dbl>
 1 Baiomys         F            7.5 
 2 Baiomys         M            8   
 3 Chaetodipus     F           23.8 
 4 Chaetodipus     M           24.3 
 5 Dipodomys       F           55.8 
 6 Dipodomys       M           55.5 
 7 Neotoma         F          157.  
 8 Neotoma         M          180.  
 9 Onychomys       F           26.2 
10 Onychomys       M           25.9 
11 Perognathus     F            9   
12 Perognathus     M            8.12
13 Peromyscus      F           22.7 
14 Peromyscus      M           20.3 
15 Reithrodontomys F           11.5 
16 Reithrodontomys M            9.96
17 Sigmodon        F           84   
18 Sigmodon        M           57.1 

Since we will use the same filtered and grouped data frame in multiple code chunks below, we could assign this subset of the data to a new name and use this data frame in the subsequent code chunks instead of typing out the functions each time.

filtered_surveys <- surveys_subset %>%
    filter(!is.na(weight) & !is.na(sex)) %>%
    group_by(genus, sex)

Once the data are grouped, you can also summarize multiple variables at the same time. For instance, we could add a column indicating the minimum weight for each species for each sex:

filtered_surveys %>%
    summarize(mean_weight = mean(weight),
              min_weight = min(weight))
`summarise()` has grouped output by 'genus'. You can override using the
`.groups` argument.
# A tibble: 18 × 4
# Groups:   genus [9]
   genus           sex   mean_weight min_weight
   <chr>           <chr>       <dbl>      <int>
 1 Baiomys         F            7.5           7
 2 Baiomys         M            8             8
 3 Chaetodipus     F           23.8           5
 4 Chaetodipus     M           24.3           7
 5 Dipodomys       F           55.8          13
 6 Dipodomys       M           55.5          18
 7 Neotoma         F          157.           68
 8 Neotoma         M          180.           48
 9 Onychomys       F           26.2          10
10 Onychomys       M           25.9           9
11 Perognathus     F            9             6
12 Perognathus     M            8.12          5
13 Peromyscus      F           22.7          10
14 Peromyscus      M           20.3           9
15 Reithrodontomys F           11.5           5
16 Reithrodontomys M            9.96          4
17 Sigmodon        F           84            50
18 Sigmodon        M           57.1          35

6.4.0.1 Challenge

  1. Using the surveys_subset dataframe, use group_by() and summarize() to find the mean hindfoot length for the species Ammospermophilus harrisi.

  2. What was the heaviest animal measured in 1979?

# Answer 1
surveys_subset %>%
    filter(!is.na(hindfoot_length)) %>%
    group_by(species) %>%
    summarize(mean_hindfoot_length = mean(hindfoot_length)) # 31
# A tibble: 22 × 2
   species     mean_hindfoot_length
   <chr>                      <dbl>
 1 albigula                    32.4
 2 baileyi                     26.2
 3 eremicus                    20.1
 4 flavus                      15.8
 5 fulvescens                  17.4
 6 fulviventer                 25  
 7 harrisi                     31  
 8 hispidus                    28.3
 9 intermedius                 22.5
10 leucogaster                 20.7
# ℹ 12 more rows
# Answer 2
surveys_subset %>%
    filter(!is.na(weight)) %>%
    filter(year == 1979) %>% 
    filter(weight == max(weight)) %>% 
    select(genus, species) # Neotoma albigula (white-throated woodrat)
    genus  species
1 Neotoma albigula

6.5 Using tally to summarize categorical data

When working with data, it is also common to want to know the number of observations found for each factor or combination of factors. For this, dplyr provides tally(). For example, if we want to group by taxa and find the number of observations for each taxa, we would do:

surveys_subset %>%
    group_by(taxa) %>%
    tally()
# A tibble: 4 × 2
  taxa        n
  <chr>   <int>
1 Bird       57
2 Rabbit      7
3 Reptile     3
4 Rodent   4282

You can also use count() to quickly count the unique values of one or more variables. count() combines group_by() and summarise(), so the following will give the same result as the code above:

surveys_subset %>%
    count(taxa)
     taxa    n
1    Bird   57
2  Rabbit    7
3 Reptile    3
4  Rodent 4282

We can also use tally() or count() when grouping on multiple variables:

surveys_subset %>%
    group_by(taxa, sex) %>%
    tally()
# A tibble: 6 × 3
# Groups:   taxa [4]
  taxa    sex       n
  <chr>   <chr> <int>
1 Bird    <NA>     57
2 Rabbit  <NA>      7
3 Reptile <NA>      3
4 Rodent  F      1979
5 Rodent  M      2141
6 Rodent  <NA>    162
surveys_subset %>% 
  count(taxa, sex)
     taxa  sex    n
1    Bird <NA>   57
2  Rabbit <NA>    7
3 Reptile <NA>    3
4  Rodent    F 1979
5  Rodent    M 2141
6  Rodent <NA>  162

Here, tally() is the action applied to the groups created by group_by() and counts the total number of records for each category.

If there are many groups, tally() is not that useful on its own. For example, when we want to view the five most abundant species among the observations:

surveys_subset %>%
    group_by(species) %>%
    tally()
# A tibble: 32 × 2
   species             n
   <chr>           <int>
 1 albigula          158
 2 audubonii           7
 3 baileyi           356
 4 bilineata          36
 5 brunneicapillus     5
 6 chlorurus           6
 7 clarki              1
 8 eremicus          164
 9 flavus            199
10 fulvescens          8
# ℹ 22 more rows

Since there are 32 rows in this output, we would like to order the table to display the most abundant species first. In dplyr, we say that we want to arrange() the data.

surveys_subset %>%
    group_by(species) %>%
    tally() %>%
    arrange(n) # `n` is the name of the column `tally` generated
# A tibble: 32 × 2
   species             n
   <chr>           <int>
 1 clarki              1
 2 montanus            1
 3 savannarum          1
 4 squamata            1
 5 intermedius         2
 6 melanocorys         2
 7 taylori             4
 8 brunneicapillus     5
 9 gramineus           5
10 leucopus            5
# ℹ 22 more rows

Still not that useful. Since we are interested in the most abundant species, we want to display those with the highest count first, in other words, we want to arrange the column n in descending order:

surveys_subset %>%
    group_by(species) %>%
    tally() %>%
    arrange(desc(n)) %>%
    head(5)
# A tibble: 5 × 2
  species          n
  <chr>        <int>
1 merriami      1324
2 penicillatus   394
3 ordii          378
4 baileyi        356
5 megalotis      328

If we want to include more attributes (like taxa and genus) about these species, we can include these in the call to group_by():

surveys_subset %>%
    group_by(species, taxa, genus) %>%
    tally() %>%
    arrange(desc(n)) %>%
    head(5)
# A tibble: 5 × 4
# Groups:   species, taxa [5]
  species      taxa   genus               n
  <chr>        <chr>  <chr>           <int>
1 merriami     Rodent Dipodomys        1324
2 penicillatus Rodent Chaetodipus       394
3 ordii        Rodent Dipodomys         378
4 baileyi      Rodent Chaetodipus       356
5 megalotis    Rodent Reithrodontomys   328

Here, we are using additional columns that are unique. Be careful not to include anything that would split the group into subgroups, such as sex, year etc.

6.5.0.1 Challenge

How many individuals were caught in the rodent enclosure plot type?

surveys_subset %>%
    group_by(plot_type) %>%
    tally()
# A tibble: 5 × 2
  plot_type                     n
  <chr>                     <int>
1 Control                    1951
2 Long-term Krat Exclosure    639
3 Rodent Exclosure            530
4 Short-term Krat Exclosure   739
5 Spectab exclosure           490

6.5.0.2 Challenge

You saw above how to count the number of individuals of each sex using a combination of group_by() and tally(). How could you get the same result using group_by() and summarize()? (Hint: see ?n.)

surveys_subset %>%
  group_by(sex) %>%
  summarize(n = n())
# A tibble: 3 × 2
  sex       n
  <chr> <int>
1 F      1979
2 M      2141
3 <NA>    229

6.6 Reshaping with gather and spread

The survey data presented here is almost in what we call a long format – every observation of every individual is its own row. This is an ideal format for data with a rich set of information per observation. It makes it difficult, however, to look at the relationships between measurements across plots. For example, what is the relationship between mean weights of different genera across all plots?

To answer that question, we want each plot to have a single row, with all of the measurements in a single plot having their own column. This is called a wide data format. For the surveys_subset data as we have it right now, this is going to be one heck of a wide data frame! However, if we were to summarize data within plots and species, we can reduce the dataset and begin to look for some relationships we’d want to examine. We need to create a new table where each row is the values for a particular variable associated with each plot. In practical terms, this means the values in genus would become the names of column variables and the cells would contain the values of the mean weight observed on each plot by genus.

So, in summary:

Long format:

  • every column is a variable
    • first column(s) repeat
  • every row is an observation

Wide format:

  • each row is a measured thing
  • each column is an independent observation
    • first column does not repeat

We can use the functions called pivot_wider() and pivot_longer() (these are newer replacements for spread() and gather(), which were the older functions). Both functions are explained, but take some time to see what you prefer using!

Let’s start by using dplyr to create a data frame with the mean body weight of each genus by plot.

surveys_gw <- surveys_subset %>%
    filter(!is.na(weight)) %>%
    group_by(genus, plot_id) %>%
    summarize(mean_weight = mean(weight))
`summarise()` has grouped output by 'genus'. You can override using the
`.groups` argument.
surveys_gw %>% head()
# A tibble: 6 × 3
# Groups:   genus [2]
  genus       plot_id mean_weight
  <chr>         <int>       <dbl>
1 Baiomys           3         8  
2 Baiomys           5         7  
3 Baiomys          19         8  
4 Chaetodipus       1        21.4
5 Chaetodipus       2        24.5
6 Chaetodipus       3        24.4

6.6.1 Long to Wide with spread and pivot_wider

Now, to make this long data wide, we use spread() from tidyr to spread out the different taxa into columns. spread() takes three arguments: the data, the key column (or column with identifying information), and the values column (the one with the numbers/values). We’ll use a pipe so we can ignore the data argument.

surveys_gw_wide1 <- surveys_gw %>%
  spread(key = genus, value = mean_weight) 

head(surveys_gw_wide1)
# A tibble: 6 × 10
  plot_id Baiomys Chaetodipus Dipodomys Neotoma Onychomys Perognathus Peromyscus
    <int>   <dbl>       <dbl>     <dbl>   <dbl>     <dbl>       <dbl>      <dbl>
1       1      NA        21.4      60.8    155.      28.7       13          20.2
2       2      NA        24.5      55.9    175.      26.6        7          22.4
3       3       8        24.4      47.6    157.      23.5        7.5        22.4
4       4      NA        23.1      57.6    200       29.9        7.67       20.5
5       5       7        16.1      51.3    198.      26.3        7.6        20.5
6       6      NA        24.2      59.1    184.      26.2        7.86       22  
# ℹ 2 more variables: Reithrodontomys <dbl>, Sigmodon <dbl>

Notice that some genera have NA values. That’s because some of those genera don’t have any record in that plot. Sometimes it is fine to leave those as NA. Sometimes we want to fill them as zeros, in which case we would add the argument fill=0. Remember, if arguments are presented in the correct order, you don’t have to specify them.

surveys_gw_wide1_fill0 <- surveys_gw %>%
  spread(genus, mean_weight, fill = 0)

head(surveys_gw)
# A tibble: 6 × 3
# Groups:   genus [2]
  genus       plot_id mean_weight
  <chr>         <int>       <dbl>
1 Baiomys           3         8  
2 Baiomys           5         7  
3 Baiomys          19         8  
4 Chaetodipus       1        21.4
5 Chaetodipus       2        24.5
6 Chaetodipus       3        24.4

Another way to spread your data out is to use pivot_wider(), which takes 3 arguments as well: the data, the names_from column variable that will eventually become the column names, and the values_from column variable that will fill in the values.

surveys_gw_wide2 <- surveys_gw %>% 
  pivot_wider(names_from = genus, values_from = mean_weight)

head(surveys_gw_wide2)
# A tibble: 6 × 10
  plot_id Baiomys Chaetodipus Dipodomys Neotoma Onychomys Perognathus Peromyscus
    <int>   <dbl>       <dbl>     <dbl>   <dbl>     <dbl>       <dbl>      <dbl>
1       3       8        24.4      47.6    157.      23.5        7.5        22.4
2       5       7        16.1      51.3    198.      26.3        7.6        20.5
3      19       8        25.3      41       NA       23.3        8.06       20.4
4       1      NA        21.4      60.8    155.      28.7       13          20.2
5       2      NA        24.5      55.9    175.      26.6        7          22.4
6       4      NA        23.1      57.6    200       29.9        7.67       20.5
# ℹ 2 more variables: Reithrodontomys <dbl>, Sigmodon <dbl>

Now we can go back to our original question: what is the relationship between mean weights of different genera across all plots? We can easily see the weights for each genus in each plot!

6.6.2 Wide to long with gather and pivot_longer

What if we had the opposite problem, and wanted to go from a wide to long format? For that, we can use gather() to sweep up a set of columns into one key-value pair. We give it the arguments of a new key and value column name, and then we specify which columns we either want or do not want gathered up. So, to go backwards from surveys_gw_wide, and exclude plot_id from the gathering, we would do the following:

surveys_gw_long1 <- surveys_gw_wide1 %>%
  gather(genus, mean_weight, -plot_id) 

head(surveys_gw_long1)
# A tibble: 6 × 3
  plot_id genus   mean_weight
    <int> <chr>         <dbl>
1       1 Baiomys          NA
2       2 Baiomys          NA
3       3 Baiomys           8
4       4 Baiomys          NA
5       5 Baiomys           7
6       6 Baiomys          NA

Note that now the NA genera are included in the long format. Going from wide to long to wide can be a useful way to balance out a dataset so every replicate has the same composition.

We could also have used a specification for what columns to include. This can be useful if you have a large number of identifying columns, and it’s easier to specify what to gather than what to leave alone. And if the columns are sequential, we don’t even need to list them all out - just use the : operator! Say we only wanted the columns from Baiomys to Onychomys to be gathered together.

surveys_gw_wide1 %>%
  gather(genus, mean_weight_of_some_genera, Baiomys:Onychomys)
# A tibble: 120 × 7
   plot_id Perognathus Peromyscus Reithrodontomys Sigmodon genus  
     <int>       <dbl>      <dbl>           <dbl>    <dbl> <chr>  
 1       1       13          20.2           13        NA   Baiomys
 2       2        7          22.4           10.6      71.5 Baiomys
 3       3        7.5        22.4            9.85     70.7 Baiomys
 4       4        7.67       20.5            9.5      NA   Baiomys
 5       5        7.6        20.5           11.3      NA   Baiomys
 6       6        7.86       22             11.7      83.5 Baiomys
 7       7       NA          21.3           11.2      NA   Baiomys
 8       8        6.83       17.3           10        NA   Baiomys
 9       9        7.75       18.8           11.8      36   Baiomys
10      10       NA          20.5           10.2     110   Baiomys
# ℹ 110 more rows
# ℹ 1 more variable: mean_weight_of_some_genera <dbl>

Another method is to use pivot_longer(), which takes 4 arguments: the data, the names_to column variable that comes from the column names, the values_to column with the values, and cols which specifies which columns we want to keep or drop. Again, we will pipe from the dataset so we don’t have to specify the data argument:

surveys_gw_long2 <- surveys_gw_wide2 %>% 
  pivot_longer(names_to = "genus", values_to = "mean_weight", cols = -plot_id)

surveys_gw_long2
# A tibble: 216 × 3
   plot_id genus           mean_weight
     <int> <chr>                 <dbl>
 1       3 Baiomys                8   
 2       3 Chaetodipus           24.4 
 3       3 Dipodomys             47.6 
 4       3 Neotoma              157.  
 5       3 Onychomys             23.5 
 6       3 Perognathus            7.5 
 7       3 Peromyscus            22.4 
 8       3 Reithrodontomys        9.85
 9       3 Sigmodon              70.7 
10       5 Baiomys                7   
# ℹ 206 more rows

If the columns are directly adjacent as they are here, we don’t even need to list the all out: we can just use the : operator, as before.

surveys_gw_wide2 %>% 
  pivot_longer(names_to = "genus", values_to = "mean_weight", cols = Baiomys:Sigmodon)
# A tibble: 216 × 3
   plot_id genus           mean_weight
     <int> <chr>                 <dbl>
 1       3 Baiomys                8   
 2       3 Chaetodipus           24.4 
 3       3 Dipodomys             47.6 
 4       3 Neotoma              157.  
 5       3 Onychomys             23.5 
 6       3 Perognathus            7.5 
 7       3 Peromyscus            22.4 
 8       3 Reithrodontomys        9.85
 9       3 Sigmodon              70.7 
10       5 Baiomys                7   
# ℹ 206 more rows

6.6.2.1 Challenge

Starting with the surveys_gw_wide2 dataset, how would you create a new dataset that gathers the mean weight of all the genera (excluding NAs) except for the genus Perognathus?

surveys_gw_wide2 %>%
  gather(genus, mean_weight, -plot_id, -Perognathus) %>%
  filter(!is.na(mean_weight)) %>%
  head()
# A tibble: 6 × 4
  plot_id Perognathus genus       mean_weight
    <int>       <dbl> <chr>             <dbl>
1       3        7.5  Baiomys             8  
2       5        7.6  Baiomys             7  
3      19        8.06 Baiomys             8  
4       3        7.5  Chaetodipus        24.4
5       5        7.6  Chaetodipus        16.1
6      19        8.06 Chaetodipus        25.3