Getting started with R

Session 4 of 4: Combining learned skills

Freddie Heather

Recap from sessions 1, 2, & 3

Recap questions

  • How do I know if I am working in a project?
  • How would you import a CSV into R?
  • How would you check if the data is imported correctly?
  • How would you visualise your data? dbl-dbl, dbl-chr, dbl.
  • I want to remove rows. How?
  • I want to add a new column? e.g. log_x = log(x). How?
  • I want to calculate the mean for each level of a factor. How?
  • What goes inside the aes() function in ggplot()?

Today: Session 4

  1. Import a csv
  2. Data checks
  3. Initial data clean
  4. Data visualisation
  5. Data manipulation
  6. Repeat steps 4 & 5 (when you would begin stats)

I will run through all steps

Using R markdown (.Rmd)

Hypothesis

  • Your data? You probably have a hypothesis in mind
  • New data? Import, and check data first, then think of a hypothesis

R Scipt preamble

# Freddie Heather
# R workshop session 4/4
# Bringing together everything we've learned so far

# packages
library(tidyverse)

Data import

# imports a csv file and outputs a tibble
# saves tibble into an object called "raw_dat"
raw_dat <- read_csv("analysis1.csv")

Data checking

  • Until now we have been using clean and tidy data
  • Normally, the excel file will have mistakes, we’re only human. If I add white-space after a number in excel, the entire column will be a chr
  • Data checking is a very important step when using real data (esp. ecological data)
raw_dat # how many cols? rows?
glimpse(raw_dat) # checking the col classes (e.g. abundance is int, not chr)
head(raw_dat)
View(raw_dat)

Initial clean

  • Avoid removing cols or rows in the initial clean
  • Convert data type as.numeric(), as.character(), as.factor() (look out for errors when converting)
clean_dat <-
  raw_dat %>%
  rename(size_cm = size) %>%
  mutate(abundance = as.numeric(abundance))

Data visualisation (one var)

  • Quick look at single variables
  • Play around with the variables, visualise as much as you can, this is also where you pick up errors
# abundance is numeric
clean_dat %>%
  ggplot() +
  aes(x = abundance) +
  geom_density()

# Or using a histogram?
clean_dat %>%
  ggplot() +
  aes(x = abundance) +
  geom_histogram()

Data visualisation (two vars)

  • Quick look at correlations
# abundance is numeric
# sex is categorical
clean_dat %>%
  ggplot() +
  aes(x = sex, y = abundance) +
  geom_violin()

# abundance is numeric
# sea surface temperature (SST) is numeric
clean_dat %>%
  ggplot() +
  aes(x = sst, y = abundance) +
  geom_point()

Data manipulation

  • During the [fictitious] initial visualisation I noticed:
  • Abundance is very heavily skewed (I want to transform the data)
  • I found some outliers (I want to remove them)
  • %in% is like == but for multiple comparisons
plot_dat <-
  clean_dat %>%
  filter(sex %in% c("m", "f", "juv")) %>% # found an extra var "g"
  filter(abundance != 93845) %>% # outlier removed
  mutate(log1p_abun = log1p(abundance)) # transformation

Data visualisation (again)

  • Now I’ve modified the data
# log1p_abun is new col I created
plot_dat %>%
  ggplot() +
  aes(x = sst, y = log1p_abun) +
  geom_point()

Extra functions/arguments

  • case_when() is like ifelse() but more flexible
  • stat_smooth() puts a generic line through the data
plot_dat %>%
  mutate(
    sex = case_when(
      sex == "g" ~ "f", # we assume the 'g' should be an 'f'
      TRUE ~ sex
    )
  ) %>%
  ggplot() +
  aes(x = sst, y = log1p_abun) +
  geom_point(alpha = 0.5) + # transparency because of overlap
  stat_smooth(se = TRUE) + # adds a generic smooth to the plot (with  or w/o error)
  facet_wrap(~sex) # I want to look at the each sex separately

# stat_smooth(method = "lm") to make linear line

What next?

  • Are you happy with the plot_dat or are there more changes you want to make (e.g. you found more true-outliers, or you want to do more transformations)
  • If happy this is where you might begin statistical analyses

Your turn

Function cheatsheet

# Packages ----------------------------------------------------------------
library(tidyverse) # where the functions are stored
# if package does not exist: `install.packages()` first then, `library()`

# Data import -------------------------------------------------------------
read_csv() # import data in csv format

# Data checking -----------------------------------------------------------
head() # first six rows
View() # view tibble in new window
glimpse() # view structure of data

# Data manipulation -------------------------------------------------------
filter() # subsetting the data (e.g., order == "Aves")
select() # Selecting a single or multiple columns
mutate() # Creating a new column
rename() # Renaming a column
summarise(.by = ) # Summarising a variable (e.g., taking a mean)

arrange() # Ordering a column (E.g., sort by smallest arrival time)
arrange(desc()) # ... or by largest arrival time
distinct() # give me only the unique rows (no repeats)
pull() # pull out a single column and make it a vector
count(.by = ) # count the number of rows per group

# Data visualisation ------------------------------------------------------
ggplot() +
  aes(x = column1, y = column2, colour = column3) +
  geom_point() # or geom_violin() or geom_boxplot() or geom_XXXX()

Your own hypothesis (& data?)

Task

  1. Working inside a project (new or existing)
  2. Create new script and save (check to see if XXXX.R is in your directory)
  3. Read in csv (your own or example_data.csv)
  4. Check import of data
  5. Think of hypothesis (if not done already)
  6. Visualise data (initial)
  7. Manipulate data
  8. Visualise again
  9. Any qualitative support for hypothesis?
  10. Statistics begin (another day)

Downloading big data

Download parquet file from here: AODN_output.parquet

# the RLS data would be too big to download from online
# lets download it and then convert it to a csv
install.packages("arrow")
library(arrow)

# convert from one filetype to another
read_parquet("AODN_output.parquet") %>%
  write_csv("AODN_output.csv")

# read in the csv as you normally would
rls_raw <-
  read_csv("AODN_output.csv")

Downloading other example data

# the 'datasets' package is a collection of example datasets
# https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html
install.packages("datasets")
library(datasets)
# E.g.
raw_data <- datasets::PlantGrowth
raw_data <- datasets::pressure
raw_data <- datasets::co2

# or -----------------------------------------------
install.packages("nycflights13")
library(nycflights13)
raw_data <- nycflights13::flights

# or -----------------------------------------------
install.packages("babynames")
library(babynames)
raw_data <- babynames::babynames

Yay! πŸŽ‰πŸŽ‰πŸŽ‰

From raw data, to cleaning, manipulating, and visualising in R