Almost everyone who writes computer commands starts by copying and modifying existing commands. To do this, you need to be able to read command expressions. Once you can read, you will know enough to identify the patterns you need for any given task and consider what needs to be modified to suit your particular purpose.

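As you read this post, you will likely get an inkling of what the commands used in examples are intended to do, but that’s not what’s important now. Instead, focus on how the commands are put together: the patterns they follow and the components they are built from.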

You are not expected at this point to be able to write R expressions. You’ll have plenty of opportunity to do that once you’ve learned to read and recognize the several different patterns used in R data wrangling and visualization commands.

Note to experienced programmers. You may be wondering where common programming constructs like looping and conditional flow, indexing, lists, and function definition fit in with this book. This book uses a couple of domain-specific sub-languages, particularly dplyr and ggplot2. A “sub-language” is a part of a computer language that can be used almost like a language of its own. The functions in dplyr and ggplot2 already contain within them those programming constructs, so there is no need to use them explicitly. This is analogous to driving a car. The sub-language is the use of the steering wheel, brake, and accelerator. You can use these to accomplish your task without having to know how the engine or suspension works.

Language Patterns and Syntax

In a human language like English or Chinese, syntax is the arrangement of words and phrases to create well-formed sentences. For example, “Four horses pulled the king’s carriage,” combines noun phrases (“Four horses”, “the king’s carriage”) with a verb.

Consider this pair of English language sentence patterns, a statement and a question:

  1. [I] did [go to] [school].
  2. Did [I] [go to] [school]?

The content of each of the boxes (shown here in square brackets) can be replaced by an equivalent object.

  • “I” can be replaced with “you”, “we”, “he”, “Janice”, “the President”, and so on.
  • “go to” can be replaced with “stay in”, “attend”, “drive to”, “see”, …
  • “school” can be replaced with “the lake”, “the movie”, “Fred’s parents’ house”, …

Such replacements produce sentences like these:

  • Did we stay in Fred’s parents’ house?
  • Janice did drive to the lake.
  • Did the President see the movie?
  • Did he attend school?

Each sentence expresses something different, but they all follow the same patterns.

Commands: Sentences in R

Unlike natural languages like English, R has just a few basic sentence patterns that suffice for many data wrangling, statistics, and visualization tasks. Every computation involves the transformation of inputs into an output. The object that performs computations is called a function. The inputs are called arguments to the function. That is, every computation is the application of a function to arguments to produce an output.

The basic syntax involves some punctuation.

  • function ( arguments )
    This is called function application. Functions can take two (or more) arguments, which are always separated by commas, e.g. function(argument1, argument2)

  • name %>% function ( arguments )
    This is called chaining syntax.

  • name %>% function ( arguments ) %>% function ( arguments )
    This is an extended form of chaining syntax. Such chains can be extended indefinitely.

Any of the above forms can be preceded by name <-. The output of the function will then be stored under the name to the left of <-.
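These patterns can be tried out on something small. The following is only a sketch: it assumes the dplyr package is installed (loading it makes %>% available), and the names area and rounded_root are invented for the illustration.

```r
library(dplyr)  # loading dplyr makes the %>% pipe available

# Function application: function(arguments)
sqrt(16)

# Two arguments, separated by a comma
round(3.14159, 2)

# Chaining: the object named on the left of %>% becomes
# the first input to the function on the right
area <- 10
area %>% sqrt()

# Extended chain, with the output stored under a name using <-
rounded_root <-
  area %>%
  sqrt() %>%
  round(2)

rounded_root
```

Reading the last command from top to bottom: start with area, take its square root, round the result to two digits, and store the output under the name rounded_root.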

Seven components of commands

There are seven types of components in the data wrangling, visualization, and statistics commands we will be working with.

  1. Functions
  2. Arguments
  3. Data frames
  4. Variables
  5. Formulas
  6. Constants and Names
  7. Punctuation

Just as it helps in English to know what’s a noun and what’s a verb, etc., by identifying each kind of component in a command you’ll have an easier time understanding what the command does.

1. Functions are the objects that transform an input into an output. They are easy to spot. Functions are used by applying them to arguments, so you will see an opening parenthesis right after the function name. The corresponding closing parenthesis comes after the arguments.

There are a few functions that have a different syntax, e.g. the simple mathematical functions, which are written between their two arguments, e.g. 3 + 2 or 7 * 4. This is called infix notation and is meant to mimic traditional arithmetic notation. Some of the other infix functions you’ll encounter are ==, <, >, and so on.
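One way to convince yourself that an infix function really is a function is to apply it with the ordinary function ( arguments ) syntax. The short sketch here uses only base R; the numbers are arbitrary.

```r
# Infix notation: the function name sits between its two arguments
3 + 2

# The same function applied with the usual function(arguments) syntax;
# the backticks let R treat + as an ordinary name
`+`(3, 2)

# Both forms produce the same output
identical(3 + 2, `+`(3, 2))
```

You will almost never write the backtick form yourself. It is shown only to make the point that + takes two arguments and produces an output, just like any other function.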

2. Arguments describe the details of what a function is to do. They appear between the parentheses that follow a function name. One important exception: data frames are typically presented as an input to a function using the chaining notation %>%.

Many functions take named arguments where the name of the argument is followed by a = sign and then the value of that argument. For instance, by = x gives x as the value of the argument named by.
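As a small illustration of named arguments, consider the base R function seq(), which produces a sequence of numbers and has arguments named from, to, and by. The particular numbers here are arbitrary.

```r
# Arguments given by position: start at 1, go up to 10
seq(1, 10)

# Adding a named argument: by = 3 gives 3 as the value of the argument named by
seq(1, 10, by = 3)

# Naming every argument; the order in which they are written then no longer matters
seq(to = 10, by = 3, from = 1)
```

The last two commands produce the same sequence. Only the way the arguments are written differs.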

3. Data frames contain tidy data. A data frame comprises one or more variables. It’s easy to distinguish functions from data frames and variables: the function name will always be followed immediately by an open parenthesis. Often, data frames appear at the start of a chain, just before the first %>%. Alternatively, there may be a named data = argument whose value will be a data frame.

4. Variables are the columns in a data frame. In these notes, they will only be in function arguments, that is, between parentheses.

5. Formulas express relationships between variables. Formulas will always appear as a function argument. They are easy to identify, since they always involve the tilde character (~). Example: height ~ age
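A formula is an R object in its own right, so it can be inspected. The sketch that follows uses only base R; the name relationship is invented for the illustration, and no data frame containing height or age is needed just to create the formula.

```r
# A formula is built with the tilde; the names in it are not looked up yet
relationship <- height ~ age

class(relationship)      # the kind of object: "formula"
all.vars(relationship)   # the variable names the formula mentions
```

In the commands in these notes you will not usually store a formula under a name. It is simply written in place as an argument to a function.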

6. Constants are single values, most commonly a number or a character string. Character strings will always be in quotation marks, "like this." Numerals are the written form of numbers, for instance -42, 1984, 3.14159. Sometimes you’ll see numerals written in scientific notation, e.g. 6.0221413e+23 or 6.62606957e-34.

In contrast to a character string constant, a name will never appear in quotes. Names are used to identify R objects, be they data frames, functions, or variables.

7. Punctuation ties the various other components together.

  • The parentheses that contain the arguments to a function.
  • The commas that separate the arguments to a function.
  • The <- punctuation, called “assignment,” that gives a name to the output of a function.
  • The %>% punctuation, called “pipe,” that sends the output of a function into another function as an input.

Writing conventions

To help distinguish data frames from the variables in them, these notes will use a simple naming convention.

  • The names of data frames will start with a CAPITAL letter. For instance: WorldCities, NCI60, BabyNames, and so on.
  • The names of variables within data frames will start with a lower-case letter. For instance: latitude, country, population, date, sex, count, countryRegion, population_density.

These are conventions, not rules enforced by the R language. As you create your own data frames and variables, it is up to you to follow the convention.

Example: Classifying objects in an expression. Consider this command:

BabyNames %>% 
  filter(name == "Arjun") %>%
  summarise(total = sum(count))
##   total
## 1  5578

The statement involves a data frame (hints: it starts with a capital letter and is not followed by an opening parenthesis), three functions (hint: look for the names followed by an opening parenthesis), and a named argument, total (hint: it is inside the parentheses and followed by a single equal sign). The string "Arjun" is a constant. The remaining names, name and count, are variables (hint: they appear in the arguments to functions).

The filter() function is being given the argument name == "Arjun". This can be confusing. Although == somewhat resembles =, the single = is always just punctuation in a named argument. The double == is a function — one of those few infix functions that don’t involve parentheses.
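To see that == is a function that produces an output, it can help to apply it on its own. This sketch uses base R and a handful of made-up values standing in for the name variable; it is not drawn from the BabyNames data.

```r
# A few made-up values, standing in for the name variable in BabyNames
name <- c("Arjun", "Maya", "Arjun", "Lee")

# == is an infix function: it compares each value to the constant "Arjun"
# and gives TRUE or FALSE for each one
name == "Arjun"

# The TRUE/FALSE output can itself be the input to another function;
# sum() counts the TRUEs
sum(name == "Arjun")
```

Inside filter(), the TRUE/FALSE output of == is what decides which cases are kept.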

Named arguments and functions in arguments

A command such as BabyNames %>% head(4) involves an argument to a function: head() is given the number 4 to specify how many cases to show. Sometimes the argument will be the name of a variable or a function applied to a variable. Here’s an example:

BabyNames %>%
  group_by(sex) %>%
  summarise(total = sum(count))

Taking apart the above expression, you can see three functions: group_by(), summarise(), and sum(). The name of a function is always followed by an opening parenthesis. Inside those parentheses are the arguments. The argument to group_by() is sex, a variable. How do you know? It’s evidently not a data frame, since it’s not capitalized. It’s neither a character string nor a numerical constant. And it’s not a function, since sex isn’t followed by an opening parenthesis. By the process of elimination, this suggests that sex is a variable. The argument to summarise() is the expression total = sum(count).

The argument to summarise() is in named-argument form. Note that sum() is a function. You can tell this from the expression: sum is followed immediately by an open parenthesis.
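The same pattern can be run on a data frame small enough to check by hand. This is a sketch under stated assumptions: dplyr is installed, and ToyNames, Totals, and every value in them are invented for the illustration (the real BabyNames data is far larger).

```r
library(dplyr)

# A tiny made-up data frame
ToyNames <- data.frame(
  name  = c("Ada", "Ada", "Lee", "Lee", "Sam"),
  sex   = c("F", "F", "M", "F", "M"),
  count = c(10, 20, 5, 7, 3)
)

Totals <-
  ToyNames %>%
  group_by(sex) %>%               # sex is a variable
  summarise(total = sum(count))   # total = ... is a named argument; sum() is a function

Totals
```

Adding up the count values by hand for each sex (10 + 20 + 7 for F, 5 + 3 for M) gives a way to confirm what the command computed.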

Constant objects

Sometimes you will use assignment to store a constant. For instance:

data_file_name <- "tiny.cc/mosaic/engines.csv"
age_cutoff <- 21

Reminder: The two kinds of constants we will use are quoted character strings and numerals.

Naming constants in this way can help to make your data wrangling expressions more readable. For instance, by using age_cutoff in your expressions, you make it easy to update your expressions if you decide to change the age cutoff; just change the 21 to whatever the new value is to be.
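Here is one way a named constant might appear inside a wrangling expression. The data frame People and its values are hypothetical, made up only to show age_cutoff in use, and the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data, invented for this illustration
People <- data.frame(
  person = c("A", "B", "C", "D"),
  age    = c(18, 21, 25, 30)
)

# The constant is stored once, under a name ...
age_cutoff <- 21

# ... and the name is used wherever the value is needed
Adults <-
  People %>%
  filter(age >= age_cutoff)

Adults
```

To use a different cutoff, only the line that assigns age_cutoff has to change.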

Wrangling command pattern

Wrangling statements start with a data table, which is piped into a function, called a “data verb,” that transforms the input into another data table. Wrangling statements often have several steps, each connected by a pipe from the earlier step. A nice style is to put each step on its own line. If the step is to be followed by another, the pipe connecting them must be at the end of the line with the first step.

To illustrate, here’s a command sequence for counting the number of babies named “Princess” each year, for both boys and girls:

Princess <-
  BabyNames %>%
  filter(name == "Princess") %>%
  group_by(year, sex) %>%
  summarise(yearlyTotal = sum(count))
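The rule about where the pipe goes is easiest to see in a command you can run. This sketch repeats the pattern on a made-up data frame; ToyNames, ToyPrincess, and all the values are invented, and dplyr is assumed to be installed.

```r
library(dplyr)

# A tiny made-up stand-in for BabyNames
ToyNames <- data.frame(
  name  = c("Princess", "Princess", "Princess", "Ada", "Princess"),
  year  = c(2000, 2000, 2001, 2001, 2001),
  sex   = c("F", "F", "F", "F", "M"),
  count = c(100, 50, 120, 30, 5)
)

ToyPrincess <-
  ToyNames %>%                          # the pipe ends the line, so R keeps reading
  filter(name == "Princess") %>%        # keep only the cases for one name
  group_by(year, sex) %>%               # one group for each year and sex
  summarise(yearlyTotal = sum(count))   # last step: no pipe after it

ToyPrincess
```

If a line ended without a pipe and the next line began with one, R would treat the first line as a finished command and then fail on the line that starts with %>%.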

Visualization command pattern

Data visualization always starts with a data frame. There can be many layers and specifications to a graphic. Each layer or specification is accomplished with a function (generally beginning with gf_ in these notes). The various layers and specifications are piped together. Here’s an example, a visualization of the Princess data frame that involves the data, a vertical-line annotation, and a specification of axis limits.

Princess %>%
  gf_point(yearlyTotal ~ year, color = ~ sex) %>%
  gf_vline(xintercept = 1978) %>%
  gf_lims(y = c(0,640), x = c(1880, 2015))

Figure 1: The number of babies born each year given the name ‘Princess’.

Judging from Figure 1, the name “Princess” has been increasing in popularity over the last 40 years. One possible explanation is the popularity of the musician Prince. The vertical line in the graph marks the year that Prince’s first album was released: 1978.

Statistics command pattern

Statistics commands often involve just a single function application. They may be preceded by multi-line wrangling statements to produce a data frame in the right format for the statistical calculations. But the statistics themselves can be done in a single line.

The statistics commands we will use always have two basic arguments: a formula and a data= argument specifying the data frame to use as input. Additional arguments may be needed to specify details of the calculation. For example, here is a calculation of the mean height (with its confidence interval), computed separately for each sex, of the people in the Galton data frame.

df_stats(height ~ sex, data = Galton, average = mean, confint = ci.mean)
##   sex  average confint_lower confint_upper
## 1   F 64.11016      63.88627      64.33405
## 2   M 69.22882      68.98900      69.46863

Other examples:

  • a t-test

    t_test(height ~ sex, data = Galton)
    ## 
    ##  Welch Two Sample t-test
    ## 
    ## data:  height by sex
    ## t = -30.662, df = 895.02, p-value < 2.2e-16
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  -5.446293 -4.791018
    ## sample estimates:
    ## mean in group F mean in group M 
    ##        64.11016        69.22882
  • linear regression (of the child’s height on the mother’s, taking into account sex)

    lm(height ~ mother + sex, data = Galton)
    ## 
    ## Call:
    ## lm(formula = height ~ mother + sex, data = Galton)
    ## 
    ## Coefficients:
    ## (Intercept)       mother         sexM  
    ##     41.4495       0.3531       5.1767
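If the Galton data frame is not at hand, the same formula-plus-data pattern can be tried with a data frame that ships with R. This sketch substitutes the built-in mtcars data (which does not follow the capitalization convention used in these notes) and the base R function t.test() in place of t_test(); the names model and comparison are invented for the illustration.

```r
# mtcars ships with R: mpg is fuel economy, wt is weight,
# am is transmission type (0 or 1)

# A formula and a data = argument, just as in the examples above
model <- lm(mpg ~ wt + am, data = mtcars)

# The coefficient names echo the variables to the right of the tilde
names(coef(model))

# The same two-argument pattern with a t-test
comparison <- t.test(mpg ~ am, data = mtcars)
class(comparison)
```

In each command the formula says which variables are involved and the data = argument says which data frame holds them.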