class: center, middle, inverse, title-slide

# conveRt to R: the short course
### Chris Hanretty
### January 2020

---

<style type="text/css">
.remark-slide-content {
    font-size: 20px;
    line-height: 175%;
}
</style>

# About this course

- This course is a **two-day / ten-hour** course on R
- It is targeted at social scientists who **already know** an existing stats package (Stata, SPSS, others)
- It could be useful for people who learned R some time ago and forgot it, or who are not familiar with modern R programming (`tidyverse`)
- It focuses on **data wrangling**, plotting, and estimating different kinds of **regression models**.
- It does not cover R programming, writing functions, or language fundamentals.

---

# Structure of the course

.pull-left[

The course is split into six units:

- **Unit 1:** Installing R/RStudio and understanding the R ecosystem
- **Unit 2:** Data tidying and plotting
- **Unit 3:** Regression modelling and exporting results
- **Unit 4:** More on data tidying, including reshaping and merging
- **Unit 5:** Different regression modelling strategies
- **Unit 6:** Things R does poorly

]

.pull-right[

We will try to cover units 1-4 on the first day. If we have time, we will try making nice figures.

We will cover units 5 and 6 on the morning of day two.

In the afternoon, we will "translate" your existing replication code into R.

]

---

# How we'll work

Although there's a lot to cover, I also want you to be trying things out on your laptops.

You'll spend time typing in code from the screen. It's **better to type** than to copy-paste.

It's also better to **work in pairs**. Pair-programming is a recognised development technique. We'll switch at random points.

**Pair up now!**

---

# An Intro to R

.pull-left[

]

.pull-right[

- **What is R?** R is a *programming language* and *environment* for statistical computing.
- **Who is behind R?** Ross Ihaka and Robert Gentleman (Univ.
Auckland) began developing it in the early 1990s; it is now developed by a core team supported by the R Foundation.
- **Why would you use it?** Because it's free, because it's open source, because many people constantly expand its functionality

]

---

# An Intro to RStudio

.pull-left[

]

.pull-right[

- **What is RStudio?** RStudio is a front-end for the R programming language
- **Who is behind RStudio?** RStudio is developed by RStudio Inc., a commercial company.
- **Why would you use it?** Because it's free, because it's open source, because it reduces the costs of learning R, and because it integrates nicely with other things (version control, literate programming)

]

---

# How do I get R, or RStudio?

- You can download R on its own through [www.r-project.org](http://www.r-project.org) or the UK mirror at [https://www.stats.bris.ac.uk/R/](https://www.stats.bris.ac.uk/R/)
- You can download RStudio (which needs R to be installed too) at [https://rstudio.com/products/rstudio/download/](https://rstudio.com/products/rstudio/download/)
- Both R and RStudio are available for lots of different types of computers. You may need administrative access to your computer to install them.

---

# Should I use RStudio?

- **You should probably use RStudio**.
- I don't use RStudio, but something very geeky (R integration into the text editor Emacs through Emacs Speaks Statistics). I **would not** recommend this to people learning R unless you already use Emacs.
- RStudio encourages the use of **literate programming** and does not hide any of R's functionality from you. It is not a set of training wheels; it just makes things nicer.

---

# How does R work?

- R can work in **interactive mode** or by **running code from a file**.
- In this respect it is similar to Stata but unlike SPSS.
- We'll start running R interactively and then progress from there.

---

# My first R code

Following computer science tradition, let's run the following code.

```r
print("Hello world!")
```

```
## [1] "Hello world!"
```

What happened?
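---

# Running the same code from a file

The line above was typed interactively, but it could equally live in a file. A minimal sketch, assuming a throwaway script in R's temporary directory (the file name `hello.R` is made up for illustration):

```r
# Path to a scratch script; "hello.R" is a hypothetical name
script <- file.path(tempdir(), "hello.R")

# Save our one line of code to that file
writeLines('print("Hello world!")', script)

# Run the file from top to bottom, as if we had typed each line
source(script)
```

In RStudio you would normally write the file in the editor pane rather than with `writeLines()`.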
---

# Three things to note

- The code we type in is in `typewriter font` on a grey background; the output we get back is in `typewriter font` on a yellow background.
- We didn't just get "Hello world!"; we also got `[1]`. This is how R prints values: the `[1]` tells us that the line starts with the first element of the result.
- We didn't need to put anything at the end of the line; we just hit return.

---

# Three things to try

- Try capitalizing `print(...)`. What happens?
- Try putting a space between `print` and `("Hello world!")`. What happens?
- Try just entering `"Hello world!"`. What happens?

--

# Three things you just learned

- R is **case-sensitive**.
- R does not care about **whitespace**.
- R will **print** results by default.

---

# R as a calculator

We can now use R as a calculator. Let's try the following.

```r
2 + 2
```

```
## [1] 4
```

--

```r
4 * 2
```

```
## [1] 8
```

--

```r
8 / 3
```

```
## [1] 2.666667
```

```r
exp(log(8) - log(3))
```

```
## [1] 2.666667
```

---

# Assignment

We will often want to save the results of our calculations, rather than spit them out to the screen.

To do so, we'll use the **assignment operator**, `<-`.

(We could also use `=` as the assignment operator. *This is discouraged*: you may confuse it with other uses of `=`, and with the test for equality, `==`.)

Here's an example:

```r
foo <- log(8)
bar <- log(3)
```

I could have given these variables any names, but `foo` and `bar` are traditional.

---

# Concatenation

We will often want to work on sequences of values, rather than specific values.

To do so, we'll use the concatenation function, `c(...)`.

```r
fibonacci <- c(2, 3, 5, 8, 13, 21, 34, 55)
```

---

# Logicals

It can be useful to know whether our values meet certain conditions. For example, we might want to know whether our values are even, or prime, or bigger than a million.

In addition to **character values** (which we saw when we called `print("Hello world!")`), R also has **logical values**: `TRUE` and `FALSE`.
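A quick sketch of where logical values come from: comparisons return them. The variable names here are made up for illustration.

```r
# A comparison returns a single logical value
sum_is_four <- 2 + 2 == 4          # == tests for equality
same_word   <- "print" == "Print"  # comparing text is case-sensitive

sum_is_four
same_word
```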
Here we check whether or not each of our numbers is a double-digit number.

```r
is_double_digit <- fibonacci > 9
is_double_digit
```

```
## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
```

---

# Data frames

We will often want to work with the values of several variables, organized so that each row is an observation and each column is a variable (**"tidy data"**).

We store these in a **data frame**. We can either build our own or load a built-in data frame.

.pull-left[

```r
dat <- data.frame(iso3c = c("BEL", "CAN", "EST", "HUN",
                            "MLT", "UZB", "HRV", "CRI"),
                  fibs = fibonacci)
```

]

.pull-right[

```
##   iso3c fibs
## 1   BEL    2
## 2   CAN    3
## 3   EST    5
## 4   HUN    8
## 5   MLT   13
## 6   UZB   21
## 7   HRV   34
## 8   CRI   55
```

]

---

# Data frames (2)

Alternatively, we can load a built-in data frame:

```r
data("USArrests")
USArrests
```

```
##                Murder Assault UrbanPop Rape
## Alabama          13.2     236       58 21.2
## Alaska           10.0     263       48 44.5
## Arizona           8.1     294       80 31.0
## Arkansas          8.8     190       50 19.5
## California        9.0     276       91 40.6
## Colorado          7.9     204       78 38.7
## Connecticut       3.3     110       77 11.1
## Delaware          5.9     238       72 15.8
## Florida          15.4     335       80 31.9
## Georgia          17.4     211       60 25.8
## Hawaii            5.3      46       83 20.2
## Idaho             2.6     120       54 14.2
## Illinois         10.4     249       83 24.0
## Indiana           7.2     113       65 21.0
## Iowa              2.2      56       57 11.3
## Kansas            6.0     115       66 18.0
## Kentucky          9.7     109       52 16.3
## Louisiana        15.4     249       66 22.2
## Maine             2.1      83       51  7.8
## Maryland         11.3     300       67 27.8
## Massachusetts     4.4     149       85 16.3
## Michigan         12.1     255       74 35.1
## Minnesota         2.7      72       66 14.9
## Mississippi      16.1     259       44 17.1
## Missouri          9.0     178       70 28.2
## Montana           6.0     109       53 16.4
## Nebraska          4.3     102       62 16.5
## Nevada           12.2     252       81 46.0
## New Hampshire     2.1      57       56  9.5
## New Jersey        7.4     159       89 18.8
## New Mexico       11.4     285       70 32.1
## New York         11.1     254       86 26.1
## North Carolina   13.0     337       45 16.1
## North Dakota      0.8      45       44  7.3
## Ohio              7.3     120       75 21.4
## Oklahoma          6.6     151       68 20.0
## Oregon            4.9     159       67 29.3
## Pennsylvania      6.3     106       72 14.9
## Rhode Island      3.4     174       87  8.3
## South Carolina   14.4     279       48 22.5
## South Dakota      3.8      86       45 12.8
## Tennessee        13.2     188       59 26.9
## Texas            12.7     201       80 25.5
## Utah              3.2     120       80 22.9
## Vermont           2.2      48       32 11.2
## Virginia          8.5     156       63 20.7
## Washington        4.0     145       73 26.2
## West Virginia     5.7      81       39  9.3
## Wisconsin         2.6      53       66 10.8
## Wyoming           6.8     161       60 15.6
```

---

# Looking at data frames

We will often want to see a portion of the data frame, rather than printing out the whole data frame.

We can use different *functions* like `summary(...)`, `head(...)`, `tail(...)` or even `View(...)`.

Try these functions now. How would you describe them?

---

# Accessing variables in data frames

We use data frames because they organize (the values of) our variables. But how do we **access** variables inside a data frame?

We use the **dollar sign**. To access the variable named `bar` in the data frame named `foo`, we type `foo$bar`.

How would we conduct a `summary` of the `Murder` variable in the `USArrests` data frame?

--

```r
summary(USArrests$Murder)
```

or alternatively

```r
mean(USArrests$Murder)
```

---

# Recap

What have we learned so far?

- R is a case-sensitive programming language which doesn't care about whitespace.
- The results of operations are **assigned** to variables using the assignment operator `<-`.
- We call different **functions**, which are followed by brackets containing their arguments.
- We can group variables of different kinds in containers called **data frames**.
- We access the variables (columns) in data frames using the dollar sign (`$`).

**Question**: Go back over the previous slides and copy out all the functions you've been introduced to.

???

summary, head, tail, View, data.frame, c, data, print, log, exp, mean

Technically also + and /.

---

class: center, middle, inverse

# Switch places!

---

# Packages

One key advantage of using R is that other people write code to do new stuff.

Code can be bundled up into **packages**.

Packages need to be **installed** once, and **loaded** each time you start R afresh.

Let's load a common package, and install a very useful one.
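Before doing either, it can be handy to ask R whether a package is already installed. A minimal sketch using base R's `requireNamespace()`; the object names are made up for illustration:

```r
# TRUE if the package is installed and can be loaded, FALSE otherwise
has_stats     <- requireNamespace("stats", quietly = TRUE)      # ships with R
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)  # may or may not be installed

has_stats
has_tidyverse
```

If the second value is `FALSE`, the package needs installing before it can be loaded.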
---

# Loading a package

Let's load the **tidyverse** package. To do so, we type

```r
library("tidyverse")
```

```
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
```

```
## ✔ ggplot2 3.3.0.9000     ✔ purrr   0.3.3
## ✔ tibble  2.1.3          ✔ dplyr   0.8.4
## ✔ tidyr   1.0.2          ✔ stringr 1.4.0
## ✔ readr   1.3.1          ✔ forcats 0.4.0
```

```
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
```

(You can think of this as getting the package from the `library`.)

This package should already be installed. You should see a message, but it is an informative message rather than an error.

---

# Loading a package (2)

(In this particular instance, we could have been lazy and omitted the quotes. But putting double quotes around character strings passed as arguments to functions is consistent with how R normally works.)

To see what new **functions** we have access to (here, those in `readr`, one of the packages that tidyverse attaches), we can type

```r
help(package = "readr")
```

This code also demonstrates a new pattern. Here, we called the `help` function and **named** one of the arguments (`package`), using the equals sign. We'll see more of this later.

---

# Installing an R package

Let's now install an R package which is very useful.

```r
install.packages("rio", dependencies = TRUE)
```

This command will seek out the `rio` package from CRAN, the Comprehensive R Archive Network.

This means that you will need a working internet connection to carry out this command.

You only need to install a package *once* per machine.

If you couldn't load `tidyverse` before, try installing it by adapting the code above.

---

# Where do packages come from?

Most stable R packages are on **CRAN**. Some packages are available on their authors' GitHub pages.
You can see **all 15,385 packages** (as of Friday 14th February) at [https://cran.r-project.org/web/packages/available_packages_by_name.html](https://cran.r-project.org/web/packages/available_packages_by_name.html)

Alternatively, you can check out some "Task Views":

- [Econometrics](https://cran.r-project.org/web/views/Econometrics.html)
- [Social Sciences](https://cran.r-project.org/web/views/SocialSciences.html)
- [Bayesian methods](https://cran.r-project.org/web/views/Bayesian.html)

---

# Breaking things

Let's now try and get a flavour of R's error messages.

**Misspelling the name of a function**:

```r
sumary(USArrests$Murder)
```

```
## Error in sumary(USArrests$Murder): could not find function "sumary"
```

**Misspelling the name of a variable**:

```r
summary(USArrests$Muder)
```

```
## Length  Class   Mode 
##      0   NULL   NULL
```

Note that R raises no error here: asking for a column that does not exist simply returns `NULL`.

---

# Breaking things (2)

**Overwriting with something of the wrong length**:

```r
dat$fibs <- c(2, 3, 5, 8, 13)
```

```
## Error in `$<-.data.frame`(`*tmp*`, fibs, value = c(2, 3, 5, 8, 13)): replacement has 5 rows, data has 8
```

---

class: inverse
background-color: black
background-image: url(bobross.gif)

---

# How to respond to errors

- Check, very carefully, what you typed. You might have **misspelled** something.
- Search the web for the error message. See whether there is a **Stack Overflow** page in the list of results.
- If you know the error is associated with a particular function, check out the **help page** for that function: `help("function_name")`.
- Try removing elements from your code until you get to something that works, then add stuff back in (create a **"minimal working example"**, or MWE).

---

# Recap

- So far, you've learned some very basic elements of the R language, including **assignment** and **functions**.
- You've also learned about R **packages**. You'll need this knowledge for the sessions that follow.
- You've learned about **R data frames**, but not yet learned how to read in data

---

# What's next

- Reading in data
- Cleaning data
- A new language element, the pipe (`%>%`)
- Graphing data

---

class: center, middle, inverse