# conveRt to R: the short course
### Chris Hanretty
### January 2020

---

# About this course

- This course is a **two-day, ten-hour** course on R
 - It is targeted at social scientists who **already know** an
   existing stats package (Stata, SPSS, others)
 - It could be useful for people who learned R some time ago and
   forgot it, or who are not familiar with modern R programming
   (`tidyverse`)
 - It focuses on **data wrangling**, plotting, and estimating
   different kinds of **regression models**.
 - It does not cover R programming, writing functions, or language
   fundamentals. 
 
---

# Structure of the course

.pull-left[
- **Unit 1:** Installing R/RStudio and understanding the R ecosystem
 - **Unit 2:** Data tidying and plotting
 - **Unit 3:** Regression modelling and exporting results
 - **Unit 4:** More on data tidying, including reshaping and merging
 - **Unit 5:** Different regression modelling strategies
 - **Unit 6:** Things R does poorly
]
.pull-right[
We will try to cover units 1-4 in the first day. If we have time, we
will try making nice figures.

We will cover units 5 and 6 on the morning of day two. In the afternoon, we will "translate" your existing replication code into R.
]

---

# How we'll work

Although there's a lot to cover, I also want you to be trying things
out on your laptops.

You'll spend time typing in code from the screen. It's **better to
type** than to copy-paste.

It's also better to **work in pairs**. Pair-programming is a
recognised development technique. We'll switch at random points.

**Pair up now!**

---

# An Intro to R

- **What is R?** R is a *programming language* and *environment* for
   statistical computing.
- **Who is behind R?** Ross Ihaka and Robert Gentleman
   (Univ. Auckland) started developing it in the early 1990s; it is now
   developed by a core team supported by the R Foundation.
- **Why would you use it?** Because it's free, because it's open
   source, because many people constantly expand its functionality
   
   
---

# An Intro to RStudio

- **What is RStudio?** RStudio is a front-end for the R programming
   language
- **Who is behind RStudio?** RStudio is developed by RStudio Inc., a
   commercial company.
- **Why would you use it?** Because it's free, because it's open
   source, because it reduces the costs of learning R, and because it
   integrates nicely with other things (version control, literate
   programming)
   

---

# How do I get R, or RStudio?

- You can download R on its own
   through [http://www.r-project.org](http://www.r-project.org) or the UK
   mirror
   at
   [https://www.stats.bris.ac.uk/R/](https://www.stats.bris.ac.uk/R/)
 - You can download RStudio (which needs R to be installed as well)
   at
   [https://rstudio.com/products/rstudio/download/](https://rstudio.com/products/rstudio/download/)
 - Both R and RStudio are available for lots of different types of
   computers. You may need administrative access to your computer to
   install them.

---

# Should I use RStudio?

- **You should probably use RStudio**.
 - I don't use RStudio, but something very geeky (R integration into
   the text editor Emacs through Emacs Speaks Statistics). I **would not**
   recommend this to people learning R unless you already use Emacs.
 - RStudio encourages the use of **literate programming** and does not
   hide any of R's functionality from you. It is not a set of training
   wheels; it just makes things nicer.
   
---

# How does R work?

- R can work in **interactive mode** or by **running code from a
   file**.
 - In this respect it is similar to Stata but unlike SPSS.
 - We'll start running R interactively and then progress from there.

---

# My first R code

Following computer science tradition, let's run the following code.

```r
print("Hello world!")
```

```
## [1] "Hello world!"
```

What happened?

---

# Three things to note

- The code we type in is in `typewriter font` on a grey background; the
output we get back is in `typewriter font` on a yellow background.
 - We didn't just get "Hello world!", we also got `[1]`. This is how R
prints results: `[1]` tells us that the line starts with the first
element of the result.
 - We didn't need to put anything at the end of the line, we just hit return.
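
The role of `[1]` is easier to see with a longer result. The sketch below is not from the original slides; it uses `seq()`, a base R function that has not been introduced yet, purely to produce enough values for the output to wrap over several lines.

```r
# Thirty consecutive numbers. On most screens the output wraps, and each
# line of output starts with the position of its first value in brackets.
seq(101, 130)
```

The exact number of values per line depends on how wide your console is.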

---

# Three things to try

- Try capitalizing `print(...)`. What happens?

- Try putting a space between `print` and `("Hello world!")`. What
happens?

- Try just entering `"Hello world!"`. What happens?

---

# Three things you just learned

- R is **case-sensitive**.
 - R does not care about **whitespace**
 - R will **print** results by default
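
If you want to compare notes, the three experiments can be written out roughly as follows. The failing call is left as a comment so that the rest of the sketch runs.

```r
# Whitespace between a function name and its brackets is ignored.
print ("Hello world!")

# A bare value is printed automatically, without calling print().
"Hello world!"

# Case matters. Uncommenting the next line gives the error
# 'could not find function "Print"'.
# Print("Hello world!")
```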

---

# R as a calculator

We can now use R as a calculator. Let's try the following.

```r
2 + 2
```

```
## [1] 4
```

```r
4 * 2
```

```
## [1] 8
```

```r
8 / 3
```

```
## [1] 2.666667
```

```r
exp(log(8) - log(3))
```

```
## [1] 2.666667
```

---

# Assignment

We will often want to save the results of our calculations, rather than spit them out to the screen.

To do so, we'll use the **assignment operator**, `<-`.

(We could also use `=` as the assignment operator. *This is
discouraged*. You will confuse it with other uses of `=`, and with the test
for equality, `==`.)

Here's an example:

```r
foo <- log(8)
bar <- log(3)
```

I could have called these variables anything, but `foo` and `bar` are traditional.
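
Once values are stored, we can reuse them by name. A short sketch that reuses `foo` and `bar` from above; the name `ratio` is made up for this example.

```r
foo <- log(8)
bar <- log(3)

# Reuse the stored values. This repeats the last calculator example,
# so the result is 8 / 3, printed as 2.666667.
ratio <- exp(foo - bar)
ratio
```

Note that the assignment itself prints nothing; we only see the value when we type the variable's name.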

---

# Concatenation

We will often want to work on sequences of values, rather than specific values.

To do so, we'll use the concatenation function, `c(...)`.

```r
fibonacci <- c(2, 3, 5, 8, 13, 21, 34, 55)
```
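
Most operations in R work on the whole sequence at once. A minimal sketch using the `fibonacci` values from above; `longer` is a made-up name, and `length()` is a base R function not otherwise covered on these slides.

```r
fibonacci <- c(2, 3, 5, 8, 13, 21, 34, 55)

# Arithmetic is applied to every element of the sequence.
fibonacci * 2

# c() can also join an existing sequence to new values.
longer <- c(fibonacci, 89)

# length() counts the values: eight originals plus the new one.
length(longer)
```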

---

# Logicals

It can be useful to know whether our values meet certain conditions.

For example, we want to know whether our values are even, or prime, or bigger than a million.

In addition to **character values** (which we saw when we called `print("Hello world!")`), R also allows **logical values**, namely `TRUE` and `FALSE`.

Here we check whether our numbers are double-digit numbers or not.

```r
is_double_digit <- fibonacci > 9
is_double_digit
```

```
## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
```
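
Logical values become useful once we count them or use them to pick out values. The sketch below goes slightly beyond the slides: `sum()` and square-bracket indexing are base R features that have not been introduced yet, and the names `n_double_digit` and `double_digit_fibs` are made up.

```r
fibonacci <- c(2, 3, 5, 8, 13, 21, 34, 55)
is_double_digit <- fibonacci > 9

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUEs.
n_double_digit <- sum(is_double_digit)
n_double_digit

# Square brackets keep only the values where the condition is TRUE.
double_digit_fibs <- fibonacci[is_double_digit]
double_digit_fibs
```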

---

# Data frames

We will often want to work with multiple values representing different
variables organized by observations (**"tidy data"**).

We store these in a **data frame**. We can either build our own or load a
built-in data frame.

.pull-left[
```r
dat <- data.frame(
  iso3c = c("BEL", "CAN", "EST", "HUN",
            "MLT", "UZB", "HRV", "CRC"),
  fibs = fibonacci
)
dat
```
]
.pull-right[

```
##   iso3c fibs
## 1   BEL    2
## 2   CAN    3
## 3   EST    5
## 4   HUN    8
## 5   MLT   13
## 6   UZB   21
## 7   HRV   34
## 8   CRC   55
```
]

---

# Data frames (2)

Alternatively, we can load a built-in data frame:

```r
data("USArrests")
USArrests
```

```
##                Murder Assault UrbanPop Rape
## Alabama          13.2     236       58 21.2
## Alaska           10.0     263       48 44.5
## Arizona           8.1     294       80 31.0
## Arkansas          8.8     190       50 19.5
## California        9.0     276       91 40.6
## Colorado          7.9     204       78 38.7
## Connecticut       3.3     110       77 11.1
## Delaware          5.9     238       72 15.8
## Florida          15.4     335       80 31.9
## Georgia          17.4     211       60 25.8
## Hawaii            5.3      46       83 20.2
## Idaho             2.6     120       54 14.2
## Illinois         10.4     249       83 24.0
## Indiana           7.2     113       65 21.0
## Iowa              2.2      56       57 11.3
## Kansas            6.0     115       66 18.0
## Kentucky          9.7     109       52 16.3
## Louisiana        15.4     249       66 22.2
## Maine             2.1      83       51  7.8
## Maryland         11.3     300       67 27.8
## Massachusetts     4.4     149       85 16.3
## Michigan         12.1     255       74 35.1
## Minnesota         2.7      72       66 14.9
## Mississippi      16.1     259       44 17.1
## Missouri          9.0     178       70 28.2
## Montana           6.0     109       53 16.4
## Nebraska          4.3     102       62 16.5
## Nevada           12.2     252       81 46.0
## New Hampshire     2.1      57       56  9.5
## New Jersey        7.4     159       89 18.8
## New Mexico       11.4     285       70 32.1
## New York         11.1     254       86 26.1
## North Carolina   13.0     337       45 16.1
## North Dakota      0.8      45       44  7.3
## Ohio              7.3     120       75 21.4
## Oklahoma          6.6     151       68 20.0
## Oregon            4.9     159       67 29.3
## Pennsylvania      6.3     106       72 14.9
## Rhode Island      3.4     174       87  8.3
## South Carolina   14.4     279       48 22.5
## South Dakota      3.8      86       45 12.8
## Tennessee        13.2     188       59 26.9
## Texas            12.7     201       80 25.5
## Utah              3.2     120       80 22.9
## Vermont           2.2      48       32 11.2
## Virginia          8.5     156       63 20.7
## Washington        4.0     145       73 26.2
## West Virginia     5.7      81       39  9.3
## Wisconsin         2.6      53       66 10.8
## Wyoming           6.8     161       60 15.6
```

---

# Looking at data frames

We will often want to see a portion of the data frame, rather than printing out the whole data frame.

We can use different *functions* like `summary(...)`, `head(...)`,
`tail(...)` or even `View(...)`.

Try these functions now. How would you describe them?
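
Once you have tried them, you can compare with the sketch below. It assumes the built-in `USArrests` data and leaves out `View(...)`, which opens a spreadsheet-style window rather than printing.

```r
data("USArrests")

# The first three and the last three rows.
head(USArrests, n = 3)
tail(USArrests, n = 3)

# One block of summary statistics per variable
# (minimum, quartiles, mean, maximum).
summary(USArrests)
```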

---

# Accessing variables in data frames

We use data frames because they organize (the values of) our variables.

But how do we **access** variables inside a data frame?

We use the **dollar sign**. To access the variable named `bar` in the data
frame named `foo`, we type `foo$bar`.

How would we conduct a `summary` of the `Murder` variable in the
`USArrests` data frame?

```r
summary(USArrests$Murder)
```

or, if we only want the mean,

```r
mean(USArrests$Murder)
```
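
What `$` gives back is an ordinary sequence of values, so everything we did with `fibonacci` also works here. A small sketch under that assumption; `max()` is a base R function not covered above, and the names `highest_murder_rate` and `assaults_per_murder` are made up.

```r
data("USArrests")

# The dollar sign returns the whole column as a sequence of values,
# so functions that work on a sequence work here too.
highest_murder_rate <- max(USArrests$Murder)
highest_murder_rate

# Columns can be combined in calculations, element by element.
assaults_per_murder <- USArrests$Assault / USArrests$Murder
head(assaults_per_murder)
```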

---

# Recap

What have we learned so far?

- That R is a case-sensitive programming language which doesn't care about
   whitespace.
 - That the results of operations are **assigned** to variables using
   the assignment operator `<-`.
 - That we call different **functions** by writing their name followed by
   (round) brackets, with any arguments inside the brackets.
 - That we can group variables of different kinds in containers called
   **data frames**.
 - That we access the variables (columns) in data frames using the
   dollar sign (`$`).
   
**Question**: Go back over the previous slides and copy out all the
functions you've been introduced to.

???

summary, head, tail, View, data.frame, c, data, print, log, exp, mean

Technically also operators such as `+`, `*` and `/`.

---

# Switch places!

---

# Packages

One key advantage of using R is that other people write code to do new
stuff.

Code can be bundled up into **packages**.

Packages need to be **installed** once, and **loaded** each time you
start R afresh.

Let's load a common package, and install a very useful one.

---

# Loading a package

Let's load the **tidyverse** package. To do so, we type

```r
library("tidyverse")
```

```
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
```

```
## ✔ ggplot2 3.3.0.9000     ✔ purrr   0.3.3     
## ✔ tibble  2.1.3          ✔ dplyr   0.8.4     
## ✔ tidyr   1.0.2          ✔ stringr 1.4.0     
## ✔ readr   1.3.1          ✔ forcats 0.4.0
```

```
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
```

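(You can think of this as fetching the package from the `library`.)

This package should already be installed. You should see a message like
the one above; it is informational, not an error. The "Conflicts" lines
only say that two `dplyr` functions now take precedence over functions
of the same name in `stats`.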

---

# Loading a package (2)

(In this particular instance, we could have been lazy and omitted the
quotes. But putting double quotes around character strings passed as
arguments to functions is consistent with how R normally works).

To see what new **functions** we have access to, we can type

```r
help(package = "readr")
```

This code also demonstrates a new pattern. Here, we called the `help`
function and **named** one of the arguments (`package`), using the
equals sign. We'll see more of this later.
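
The same pattern works for most functions. A brief sketch, using `round()` (a base R function not covered above) and its `digits` argument, plus `head()` with its `n` argument; the names `rounded` and `first_rows` are made up.

```r
# Positional: R matches arguments by their order.
round(8 / 3, 2)

# Named: the same call, spelling out which argument is which.
rounded <- round(8 / 3, digits = 2)
rounded

# The same pattern with a function we have already met.
first_rows <- head(USArrests, n = 3)
first_rows
```

Naming arguments is more typing, but it makes code easier to read back later.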

---

# Installing an R package

Let's now install an R package which is very useful.

```r
install.packages("rio", dependencies = TRUE)
```

This command will seek out the `rio` package from CRAN, the
Comprehensive R Archive Network. This means that you will need a
working internet connection to carry out this command.

You only need to install a package *once* per machine.

If you couldn't load `tidyverse` before, try installing it by adapting
the code above.
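
If you are not sure whether a package is already installed, you can check before installing. This is only a sketch: `requireNamespace()` is a base R function not covered in this course, `rio_installed` is a made-up name, and the install line is commented out so that running the sketch never triggers a download.

```r
# requireNamespace() returns TRUE if the package can be loaded
# and FALSE otherwise.
rio_installed <- requireNamespace("rio", quietly = TRUE)
rio_installed

# Install only if it is missing (needs an internet connection).
# if (!rio_installed) install.packages("rio", dependencies = TRUE)
```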

---

# Where do packages come from?

Most stable R packages are on **CRAN**. Some packages are available on
their authors' GitHub pages.

You can see **all 15,385 packages** (as of Friday 14th February 2020) at [https://cran.r-project.org/web/packages/available_packages_by_name.html](https://cran.r-project.org/web/packages/available_packages_by_name.html)

Alternatively, you can check out some "Task Views":

- [Econometrics](https://cran.r-project.org/web/views/Econometrics.html)
 - [Social Sciences](https://cran.r-project.org/web/views/SocialSciences.html)
 - [Bayesian methods](https://cran.r-project.org/web/views/Bayesian.html)

---

# Breaking things

Let's now try and get a flavour of R's error messages.

**Misspelling the name of a function**:

```r
sumary(USArrests$Murder)
```

```
## Error in sumary(USArrests$Murder): could not find function "sumary"
```

**Misspelling the name of a variable**:

```r
summary(USArrests$Muder)
```

```
## Length  Class   Mode 
##      0   NULL   NULL
```
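
Notice that the misspelled variable gave an empty result rather than an error, which makes this mistake easy to miss. One way to guard against it, sketched here with base R's `names()` and `is.null()` (neither is covered elsewhere on these slides), is to check which names the data frame actually has.

```r
data("USArrests")

# List the variable (column) names exactly as R knows them.
names(USArrests)

# A misspelled name is simply not there, so `$` returns NULL.
is.null(USArrests$Muder)
```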

---

# Breaking things (2)

**Overwriting with something of the wrong length**:

```r
dat$fibs <- c(2, 3, 5, 8, 13)
```

```
## Error in `$<-.data.frame`(`*tmp*`, fibs, value = c(2, 3, 5, 8, 13)): replacement has 5 rows, data has 8
```

---

# How to respond to errors

- Check, very carefully, what you typed. You might have **misspelled**
   something.
 - Search the web for the error message. See whether there is a
   **Stack Overflow** page in the list of results.
 - If you know the error is associated with a particular function,
   check out the **help page** for that function: `help("function_name")`.
 - Try and remove elements from your code until you get to something
   that works, then add stuff back in (create a **"minimal working
   example"**, or MWE).
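
As an illustration of the last point, here is what a minimal working example of the length error from the earlier slide might look like. The data frame `small` and its columns are made up; the failing line is commented out so that the sketch runs.

```r
# A three-row data frame is enough to reproduce the problem.
small <- data.frame(x = c(1, 2, 3))

# This line fails with "replacement has 2 rows, data has 3":
# small$y <- c(10, 20)

# Supplying one value per row works.
small$y <- c(10, 20, 30)
small
```

Stripping the problem down like this often shows the cause straight away, and it gives you something short to paste into a question if it doesn't.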

---

# Recap

- So far, you've learned some very basic elements of the R
   language, including **assignment** and **functions**.
 - You've also learned about R **packages**. You'll need this knowledge
   for the sessions that follow.
 - You've learned about **R data frames**, but not yet learned how to read
   in data.

---

# What's next

- Reading in data
 - Cleaning data
 - A new language element, the pipe (`%>%`)
 - Graphing data

---