Getting started with R

Session 2 of 4: Data import and manipulation

Freddie Heather

Recap from session 1

Recap questions

What is a vector?
What is an object?
What is an object class? - Name four classes
What is a function? - How do you identify them in code?
What does <- mean?
What does == mean?
What is the preferred file type for reading data in R?
What function(s) is used to read in csv data?

Session 2

Data import
Data checking
Data manipulation
Data cleaning

Task

Open the R project from last week.
Create a new script and call it session2 or day2, something informative to you.
Comment your name, date, and a title at the top of the script.

Data Import

Download some example data

We will use Reef Life Survey data (a very small subset)
Data has location information, species information, and abundance information

Task

Download rls_example_canada.csv from https://github.com/FreddieJH/r_workshop
Store this csv file in your working directory
Read in the csv file, and assign it to an object

Data checking

Functions to look at your data

my_data
print(my_data) # prints the data to the screen

head(my_data)
glimpse(my_data)
tail(my_data)
View(my_data)

Task

Run the above functions. What do they do?
How many columns are in the dataset?
How many rows are in the dataset?
What is the class of the first column?

The pipe operator `%>%`

Say I have three fictional functions; import(), clean(), and visualise() and I want to apply them to some dataset x.

# it's hard to read
output <- visualise(clean(import(x)))

Using the pipe operator

# easier to read! (does exactly the same)
output <-
  x %>%
  import() %>%
  clean() %>%
  visualise()

Data manipululation

Filtering your data

Say that I only care about the “Tappers Cove” site

# Only Tappers cove rows
my_data %>%
  filter(site_name == "Tappers Cove")

Task

Create a new object called arctic_sites which has only the rows where the ‘Realm’ is ‘Arctic’

More than just “is equal to…”

You can use other operators beyond just the == operator that tests for equality:

> means “greater than”
< means “less than”
>= means “greater than or equal to”
<= means “less than or equal to”
!= means “not equal to”
& means “and”
| means “or”

Being more specific in your selection

Using >, <, >=, !=, &, |

# example of an OR statement
my_data %>%
  filter(site_name == "Tappers Cove" | site_name == "Middle Cove")

Task

Select all rows that are either the “Bacon Cove” or “Tappers Cove” sites.
Select all rows that have a latitude higher than 48 degrees.
How many sites have a latitude higher than 48 degrees?
Select rows between 5 and 9 meters depth.

Data manipulation functions (tidyverse)

library(tidyverse) # where the functions are stored

filter() # subsetting the data (e.g., order == "Aves")
select() # Selecting a single or multiple columns
mutate() # Creating a new column
rename() # Renaming a column
summarise() # Summarising a dataset

arrange() # Ordering a column (E.g., sort by smallest arrival time)
arrange(desc()) # ... or by largest arrival time
distinct() # give me only the unique rows (no repeats)
pull() # pull out a column and make it a vector
count() # count the number of rows

`select()`

Selecting a single or multiple columns

my_data %>%
  select(survey_id, species_name, total)

Task

Create a new object called slim_data which has the columns survey_id, site_code, species_name, total

`mutate()`

Creating a new column

# base R
# my_data$depth_cm <- my_data$depth*10

# tidyverse way
my_data %>%
  mutate(depth_cm = depth * 100)

# combining the two steps (select and mutate)
my_data %>%
  select(survey_id, site_code, depth, species_name, total) %>%
  mutate(depth_cm = depth * 100)

Task

Create a new object called cunner which is only the species Tautogolabrus adspersus and has a new column called biomass_kg which is the biomass, but in kg instead of g.

`rename()`

Rename an existing column

Say we want to be clear about the unit in the column name

# base R
# note this will overwrite the object my_data
# names(my_data)[names(my_data) == "depth"] <- "depth_m"

# tidyverse way (this won't overwrite)
my_data %>%
  rename(depth_m = depth)

Task

Rename the visibility column to something a bit more informative, maybe to include the units?

`summarise`()

Summarise some data (e.g. taking a mean)

similar to the summarise() function

my_data %>%
  select(survey_id, site_code, depth, species_name, biomass) %>%
  mutate(depth_cm = depth * 100) %>%
  rename(depth_m = depth) %>%
  summarise(mean_biomass = mean(biomass, na.rm = TRUE))

Task

What is the mean_biomass of all the rows? What happens if you don’t include na.rm=TRUE in your code? Why?

The `.by` argument

Group by a factor for analysis

You may see this written as group_by(), it’s the same

# I want the mean abundance for each species at a survey
my_data %>%
  summarise(mean_abun = mean(total, na.rm = TRUE), .by = species_name)

# is the same as this
my_data %>%
  group_by(species_name) %>%
  summarise(mean_abun = mean(total, na.rm = TRUE)) %>%
  ungroup()

Task

Which species, on average, has approximately 1.27 individuals per survey?

`arrange()`

Sorting a column (alphabetically or numerically)

my_data %>%
  summarise(mean_abun = mean(total, na.rm = TRUE), .by = species_name) %>%
  arrange(mean_abun)

Task

Order the dataframe so the mean_abun is descending (High to low)

`count()`

Counting the number of occurrences

my_data %>%
  count(site_code)

Task

Looking only at Tautogolabrus adspersus, what is the most common site is it observed at?

Base R vs. Tidyverse

Action	Base R code	Tidyverse code
Extract a column from a dataframe	df$col1 OR df[,"col1"]	df %>% pull(col1)
Filter specific rows	df[df$col1=="ABC",]	df %>% filter(col1 == "ABC")
Select specific columns	df[,c(col1,col3:col5)]	df %>% select(col1,col3:col5)
Create a new column	df$col9 <- sqrt(df$col8)	df %>% mutate(col9 = sqrt(col8))
Rename a column	names(df)[names(df) == "old_col_name"] <- "new_col_name"	df %>% rename(new_col_name = old_col_name)
Summarise data (e.g. mean of a column)	mean(df$col1)	df %>% summarise(mean(col1))
Sort a column	df[order(df$col1), ]	df %>% arrange(col1)
Filter unique rows	df[!duplicated(df[,c("col1", "col2")]), ]	df %>% distinct()
Count number of rows of a group	as.data.frame(table(df$col1))	df %>% count(col1)

Extra

Task

Using the example dataset, for Chordates only, get the mean abundance of individuals (rename ‘total’ to ‘abundance’), within each taxonomic family.

Getting started with R

Recap from session 1

Recap questions

Session 2

Data Import

Download some example data

Data checking

Functions to look at your data

The pipe operator %>%

Data manipululation

Filtering your data

More than just “is equal to…”

Being more specific in your selection

Data manipulation functions (tidyverse)

select()

Selecting a single or multiple columns

mutate()

Creating a new column

rename()

Rename an existing column

summarise()

summarise()

Summarise some data (e.g. taking a mean)

The .by argument

Group by a factor for analysis

arrange()

Sorting a column (alphabetically or numerically)

count()

Counting the number of occurrences

Base R vs. Tidyverse

Extra

The pipe operator `%>%`

`select()`

`mutate()`

`rename()`

`summarise`()

`summarise`()

The `.by` argument

`arrange()`

`count()`