3 Intro to dplyr

3.1 Load in the dataset

mcm <- read.csv("/Users/eleanorchodroff/Google Drive/QuantitativeMethods/datasets/McMurrayJongmanFricativeAcoustics.csv")
View(mcm)

3.2 Installing packages

In general, it’s useful to know how to install and update packages. Packages are bundles of code that you can import into R so that you can use that code. If you’re connected to the internet, you can install or update a package with the following code:

install.packages("readr")

You’re not done yet! You now “own” the package but you still need to pull the package “off the shelf” to use the code. You can import the package with either the require() or library() functions – they’re basically the same thing.
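One practical difference between the two (not strictly necessary to know, but handy): require() returns TRUE or FALSE depending on whether the package could be loaded, while library() stops with an error if the package is missing. A common pattern that takes advantage of this is sketched below:

```r
# require() returns TRUE or FALSE (invisibly) depending on whether the
# package could be loaded; library() would stop with an error instead.
ok <- require("readr", quietly = TRUE)

# This makes require() handy for install-if-missing logic:
if (!ok) {
  install.packages("readr")  # only runs if readr isn't installed yet
  library(readr)
}
```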

Alternatively, you can click the “Packages” tab in the lower righthand corner and check the box for the packages you want to use; checking a box runs the corresponding library() call for you.

The readr package provides newer code that also reads in datasets. It sometimes interprets columns a little better (or at least differently) than the built-in read.csv() function. When you select Import in the Environment window, you should note there is also an option to choose “From text (readr)”: this option uses the readr package. The option we previously used was “From text (base)”, which simply uses what’s called “base R” – R’s built-in code.

require(readr)

The readr package imports datasets and is one of many packages in the so-called tidyverse. (Google it! It’s cool!) The tidyverse contains several R packages that are incredibly useful for data analysis and visualization. These packages include ggplot2, dplyr, tidyr, readr, and some others. The tidyverse is also a philosophy for maintaining datasets and general data and code hygiene. While we may not have enough time to go over this in detail, we will implicitly be adhering to several (but possibly not all) of these principles throughout the module. The principles are definitely worth reading about.

Now back to installing packages: we can install all the tidyverse packages at once with the following line of code:

install.packages("tidyverse")

We’ll now import them into our workspace so we can use the code – yes, you do have to write this line of code every time you restart R and want to access the code again.

require(tidyverse)

Or alternatively:

library(tidyverse)

3.3 Introduction to dplyr

If you don’t already have it loaded, import the dplyr package using either the require() or library() function. It should already be ready to go if you ran the library(tidyverse) function above.

library(dplyr)

dplyr has a few very useful functions:

  • filter()
  • select()
  • mutate()
  • group_by()
  • summarise()

Note that some of these are effectively equivalent to things we’ve seen before, but dplyr code is sometimes a little cleaner than what we’ve seen before (and by clean, I don’t mean that what we were doing doesn’t work; I just mean it’s more streamlined, less clunky, etc.). We’ll be focusing on the filter(), group_by(), and summarise() functions, but I highly recommend looking up how select() and mutate() work if you like this style of coding.
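Since we won’t cover select() and mutate() below, here is a quick sketch of what they do. The data frame and column names here are invented purely for illustration:

```r
library(dplyr)

# A small toy data frame standing in for a real dataset
df <- data.frame(word = c("sip", "ship"),
                 dur  = c(120, 135),
                 f0   = c(210, 190))

# select() keeps only the named columns
df_small <- select(df, word, dur)

# mutate() adds (or overwrites) a column computed from existing ones
df_sec <- mutate(df, dur_sec = dur / 1000)
```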

3.3.1 filter()

Get a subset of the rows in the dataset using the filter() function. It works almost identically to the subset() function.

mcm2 <- filter(mcm, dur_f > 50)
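filter() can also take several conditions at once, which are combined with a logical AND. A sketch using a toy data frame (the values are invented, but the columns mimic our dataset):

```r
library(dplyr)

toy <- data.frame(Fricative = c("s", "s", "f", "z"),
                  dur_f     = c(60, 40, 80, 90))

# Multiple conditions: keep only rows where BOTH hold
toy2 <- filter(toy, dur_f > 50, Fricative == "s")
```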

3.3.2 The pipe %>%

dplyr lets you use what’s called a pipe, which is written like this: %>%

The pipe allows for an assembly line of functions: it starts with the original dataset, then applies the function to that dataset. If you have more functions (which we will soon), the output of one function serves as the input to the next function in the chain of pipes.

The two lines of code below do the exact same thing:

mcm2 <- filter(mcm, dur_f > 70)
mcm2 <- mcm %>% filter(dur_f > 70)

If you’re a nerd like me, you can think of it like topicalization: It is the mcm dataset from which we are filtering (or retaining) only those rows with durations greater than 70.
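The same topicalized reading extends to longer chains: each step hands its result to the next. A sketch with a toy data frame (invented values):

```r
library(dplyr)

toy <- data.frame(dur_f = c(65, 72, 80, 95))

# Each pipe passes its output on as the input to the next function
toy2 <- toy %>%
  filter(dur_f > 70) %>%   # first keep durations above 70
  filter(dur_f < 90)       # then, of those, keep durations below 90
```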

Allowing multiple pipes brings us to one of the most useful sequences of functions in R / dplyr: group_by() %>% summarise()

3.3.3 Getting descriptive statistics: group_by() %>% summarise()

SUPER USEFUL!

Why do we need this? Remember the summary() function? Well, that was useful, but let’s say we wanted to get the mean duration of all 8 fricative categories in our dataset. We would have to create 8 separate subsets of the data and run the mean() (or summary()) function on all 8 subsets. That’s effortful. Instead, we can use the group_by() function to create subsets, and then derive the mean using the summarise/summarize() function – yes, both spellings work. The output will then be stored in a new dataset that we’re calling f_means:

f_means <- mcm %>% 
  group_by(Fricative) %>% 
  summarise(meandur = mean(dur_f))
View(f_means)

The above code takes the mcm dataset, groups it by fricative category, then gets the mean of the dur_f column and fills that value into the meandur column of f_means.

You can have all sorts of functions embedded in the summarise function:

  • mean()
  • sd()
  • median()
  • max()
  • min()
  • length() – this is one way to get the number of tokens

f_means <- mcm %>% 
  group_by(Fricative) %>% 
  summarise(meandur = mean(dur_f), sddur = sd(dur_f))

f_means <- mcm %>% 
  group_by(Fricative) %>% 
  summarise(meandur = mean(dur_f), 
            sddur = sd(dur_f), 
            mediandur = median(dur_f))

f_means <- mcm %>% 
  group_by(Fricative) %>% 
  summarise(meandur = mean(dur_f), 
            sddur = sd(dur_f), 
            mediandur = median(dur_f), 
            maxdur = max(dur_f), 
            mindur = min(dur_f), 
            count = length(dur_f))

View(f_means)
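As a side note, dplyr also provides a helper called n() for use inside summarise(); it counts the rows in each group, which does the same job as length() on a column. A sketch with toy data (invented values):

```r
library(dplyr)

toy <- data.frame(Fricative = c("s", "s", "f"),
                  dur_f     = c(60, 70, 80))

# n() counts the rows in each group, just like length(dur_f) would
counts <- toy %>%
  group_by(Fricative) %>%
  summarise(count = n())
```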

An important note: if you have “NA” values in your vector (column), R will be unable to take the mean or perform other standard mathematical functions on it. You’ll know if this happens to you because the returned value (e.g., for the mean) will be “NA”. To fix this, use the optional argument “na.rm = TRUE”, where na.rm means “remove the NA values”.

f_means <- mcm %>% 
  group_by(Fricative) %>% 
  summarise(meandur = mean(dur_f, na.rm = TRUE), 
            sddur = sd(dur_f, na.rm = TRUE))
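You can see this behavior with a plain vector, outside of summarise():

```r
x <- c(100, 120, NA)

mean(x)                # returns NA because one value is missing
mean(x, na.rm = TRUE)  # drops the NA first, then averages the rest: 110
```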

Let’s say you wanted to get talker-specific means for each fricative. You can add more groups to the group_by function:

f_means <- mcm %>% 
  group_by(Talker, Fricative) %>% 
  summarise(meandur = mean(dur_f), 
            sddur = sd(dur_f), 
            mediandur = median(dur_f), 
            maxdur = max(dur_f), 
            mindur = min(dur_f), 
            count = length(dur_f))

View(f_means)

3.3.4 Get unique elements

Use the unique() function to see the individual categories in a column

unique(mcm$Talker)
unique(mcm$Fricative)

3.3.5 Get the length of unique elements

Use length(unique()) to get the number of unique categories in a column. The length() function is simply wrapped around the unique() function. (You could alternatively write two lines of code to get this.)

length(unique(mcm$Talker))

unique_items <- unique(mcm$Talker)
length(unique_items)
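dplyr also offers a shortcut, n_distinct(), which does the same thing as length(unique()):

```r
library(dplyr)

# A toy vector of talker IDs (invented for illustration)
talkers <- c("T1", "T2", "T1", "T3")

length(unique(talkers))  # base R way: 3 unique talkers
n_distinct(talkers)      # dplyr equivalent, same answer
```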

3.4 Getting help

You can get help by typing a question mark before the name of the function you want more information on. You can also get help via Google or the Help tab in the lower righthand corner. I highly recommend Google!

?mean
?max
?filter
?subset

3.5 Practice

Create an R script to save the answers to these questions.

  1. Import ‘L2_English_Lexical_Decision_Data.csv’ into R and call it ‘lex’. This data set contains reaction times (RT) in milliseconds to words and nonwords of English from L2 English speaking participants. More info about the data and project here.
  2. Create a subset of lex using filter() that contains only the data points where the dominant language (lex$domLang) is not English.
  3. Create a variable called ‘langs’ that contains a list of the unique dominant languages in the newly created subset.
  4. Get the number of unique languages in ‘langs’ using code. (Don’t just look at the environment window.)
  5. Get the number of unique participants in the subset. Participant IDs are in the ‘workerID’ column.
  6. From the subset of non-English participants, remove data points that have reaction times below 500 ms and above 2000 ms. This will likely require two steps.
  7. For each participant in this new subset, get the mean reaction time, the standard deviation of the reaction time, and the median reaction time. Store this data in a dataset called ‘subj_data’.
  8. What is the mean, median and range of by-participant means?
  9. What is the mean, median and range of by-participant standard deviations?
  10. What is the mean, median and range of by-participant medians?
  11. For each dominant language in the new subset, get the mean reaction time. Which language has the lowest mean, which has the highest mean?
  12. For each dominant language in the new subset, get the number of unique participants. Hint: in the summarise() function you will need to use length(unique()).

You can find the answer key here.