Getting started with R

Session 2 of 4: Data import and manipulation

Freddie Heather

Recap from session 1

Recap questions

  • What is a vector?
  • What is an object?
  • What is an object class? - Name four classes
  • What is a function? - How do you identify them in code?
  • What does <- mean?
  • What does == mean?
  • What is the preferred file type for reading data in R?
  • What function(s) is used to read in csv data?

Session 2

  • Data import
  • Data checking
  • Data manipulation
  • Data cleaning

Task

  1. Open the R project from last week.
  2. Create a new script and call it session2 or day2, something informative to you.
  3. Comment your name, date, and a title at the top of the script.

Data Import

Download some example data

  • We will use Reef Life Survey data (a very small subset)
  • Data has location information, species information, and abundance information

Task

  1. Download rls_example_canada.csv from https://github.com/FreddieJH/r_workshop
  2. Store this csv file in your working directory
  3. Read in the csv file, and assign it to an object

Data checking

Functions to look at your data

my_data
print(my_data) # prints the data to the screen

head(my_data)
glimpse(my_data)
tail(my_data)
View(my_data)

Task

  • Run the above functions. What do they do?
  • How many columns are in the dataset?
  • How many rows are in the dataset?
  • What is the class of the first column?

The pipe operator %>%

  • Say I have three fictional functions; import(), clean(), and visualise() and I want to apply them to some dataset x.
# it's hard to read
output <- visualise(clean(import(x)))
  • Using the pipe operator
# easier to read! (does exactly the same)
output <-
  x %>%
  import() %>%
  clean() %>%
  visualise()

Data manipululation

Filtering your data

  • Say that I only care about the “Tappers Cove” site
# Only Tappers cove rows
my_data %>%
  filter(site_name == "Tappers Cove")

Task

Create a new object called arctic_sites which has only the rows where the ‘Realm’ is ‘Arctic’

More than just “is equal to…”

You can use other operators beyond just the == operator that tests for equality:

  • > means “greater than”
  • < means “less than”
  • >= means “greater than or equal to”
  • <= means “less than or equal to”
  • != means “not equal to”
  • & means “and”
  • | means “or”

Being more specific in your selection

  • Using >, <, >=, !=, &, |
# example of an OR statement
my_data %>%
  filter(site_name == "Tappers Cove" | site_name == "Middle Cove")

Task

  1. Select all rows that are either the “Bacon Cove” or “Tappers Cove” sites.
  2. Select all rows that have a latitude higher than 48 degrees.
  3. How many sites have a latitude higher than 48 degrees?
  4. Select rows between 5 and 9 meters depth.

Data manipulation functions (tidyverse)

library(tidyverse) # where the functions are stored

filter() # subsetting the data (e.g., order == "Aves")
select() # Selecting a single or multiple columns
mutate() # Creating a new column
rename() # Renaming a column
summarise() # Summarising a dataset

arrange() # Ordering a column (E.g., sort by smallest arrival time)
arrange(desc()) # ... or by largest arrival time
distinct() # give me only the unique rows (no repeats)
pull() # pull out a column and make it a vector
count() # count the number of rows

select()

Selecting a single or multiple columns

my_data %>%
  select(survey_id, species_name, total)

Task

Create a new object called slim_data which has the columns survey_id, site_code, species_name, total

mutate()

Creating a new column

# base R
# my_data$depth_cm <- my_data$depth*10

# tidyverse way
my_data %>%
  mutate(depth_cm = depth * 100)

# combining the two steps (select and mutate)
my_data %>%
  select(survey_id, site_code, depth, species_name, total) %>%
  mutate(depth_cm = depth * 100)

Task

Create a new object called cunner which is only the species Tautogolabrus adspersus and has a new column called biomass_kg which is the biomass, but in kg instead of g.

rename()

Rename an existing column

  • Say we want to be clear about the unit in the column name
# base R
# note this will overwrite the object my_data
# names(my_data)[names(my_data) == "depth"] <- "depth_m"

# tidyverse way (this won't overwrite)
my_data %>%
  rename(depth_m = depth)

Task

Rename the visibility column to something a bit more informative, maybe to include the units?

summarise()

summarise()

Summarise some data (e.g. taking a mean)

  • similar to the summarise() function
my_data %>%
  select(survey_id, site_code, depth, species_name, biomass) %>%
  mutate(depth_cm = depth * 100) %>%
  rename(depth_m = depth) %>%
  summarise(mean_biomass = mean(biomass, na.rm = TRUE))

Task

What is the mean_biomass of all the rows? What happens if you don’t include na.rm=TRUE in your code? Why?

The .by argument

Group by a factor for analysis

You may see this written as group_by(), it’s the same

# I want the mean abundance for each species at a survey
my_data %>%
  summarise(mean_abun = mean(total, na.rm = TRUE), .by = species_name)

# is the same as this
my_data %>%
  group_by(species_name) %>%
  summarise(mean_abun = mean(total, na.rm = TRUE)) %>%
  ungroup()

Task

Which species, on average, has approximately 1.27 individuals per survey?

arrange()

Sorting a column (alphabetically or numerically)

my_data %>%
  summarise(mean_abun = mean(total, na.rm = TRUE), .by = species_name) %>%
  arrange(mean_abun)

Task

Order the dataframe so the mean_abun is descending (High to low)

count()

Counting the number of occurrences

my_data %>%
  count(site_code)

Task

Looking only at Tautogolabrus adspersus, what is the most common site is it observed at?

Base R vs. Tidyverse

Action Base R code Tidyverse code
Extract a column from a dataframe df$col1 OR df[,"col1"] df %>% pull(col1)
Filter specific rows df[df$col1=="ABC",] df %>% filter(col1 == "ABC")
Select specific columns df[,c(col1,col3:col5)] df %>% select(col1,col3:col5)
Create a new column df$col9 <- sqrt(df$col8) df %>% mutate(col9 = sqrt(col8))
Rename a column names(df)[names(df) == "old_col_name"] <- "new_col_name" df %>% rename(new_col_name = old_col_name)
Summarise data (e.g. mean of a column) mean(df$col1) df %>% summarise(mean(col1))
Sort a column df[order(df$col1), ] df %>% arrange(col1)
Filter unique rows df[!duplicated(df[,c("col1", "col2")]), ] df %>% distinct()
Count number of rows of a group as.data.frame(table(df$col1)) df %>% count(col1)

Extra

Task

Using the example dataset, for Chordates only, get the mean abundance of individuals (rename ‘total’ to ‘abundance’), within each taxonomic family.