class: center, middle, inverse, title-slide

# conveRt to R: the short course
### Chris Hanretty
### January 2020

---
class: center, middle, inverse

# Unit 2: Data tidying and plotting

---

# Reminder

.huge[
Remember to call

```r
library("tidyverse")
```

otherwise things will not work
]

---

# Preparatory stuff

We'll be using a particular file to demonstrate file i/o.

It's called `vdem_short.csv` and you can get it at [http://www.chrishanretty.co.uk/vdem_short.csv](http://www.chrishanretty.co.uk/vdem_short.csv)

**Download** this file into the directory you are using for this session.

(If you don't have a directory, **create one!**).

---

# Setting the working directory

Stata users will be used to "setting the working directory".

You can set the working directory in one of two ways:

- Use the `setwd(...)` command
- From the Session menu, "Set Working Directory"

On my machine, I might type

```r
setwd("~/Dropbox/teaching/rstats/")
```

On your machine, you might type

```r
setwd("C:\\Users\\Alice\\Documents\\Courses\\R course")
```

---

# Reading in the data

```r
dat <- read.csv("vdem_short.csv")
```

If this doesn't work, you can try the following command, which will allow you to select the file using your operating system.

```r
filepath <- file.choose()
dat <- read.csv(filepath)
```

Check you can access the data. Use some of the functions we used in unit 1 (`summary(...)`, `head(...)`).

---

# Description of the data

- **country_name**: the name of the country
- **country_text_id**: the ISO 3166 3 letter code for the country
- **year**: the year in which the observation was made
- **v2x_polyarchy**: a continuous measure of electoral democracy on a 0-1 scale from the V-Dem project
- **e_boix_regime**: a dichotomous measure of democracy from Carles Boix
- **e_migdppc**: GDP per capita in US dollars
- **e_regionpol_6C**: region the country is located in

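---

# Checking the columns

A minimal sketch of how one might check a data frame against the description above. The `toy` data frame and every value in it are invented for illustration (it is not the real V-Dem extract); with the real file you would run the same calls on `dat`.

```r
# Toy stand-in for vdem_short.csv: the column names follow the description,
# but every value below is invented for illustration
toy <- data.frame(
    country_name = c("Freedonia", "Sylvania", "Ruritania"),
    country_text_id = c("FRE", "SYL", "RUR"),
    year = c(2015, 2015, 2015),
    v2x_polyarchy = c(0.85, 0.21, NA),
    e_boix_regime = c(1, 0, NA),
    e_migdppc = c(32000, 4100, NA),
    e_regionpol_6C = c("Region A", "Region B", "Region A"),
    stringsAsFactors = FALSE
)

# One line per variable: name, type, first few values
str(toy)

# Number of missing values in each column
n_missing <- colSums(is.na(toy))
print(n_missing)

# Check that the democracy score stays within its documented 0-1 range
in_range <- all(toy$v2x_polyarchy >= 0 & toy$v2x_polyarchy <= 1, na.rm = TRUE)
print(in_range)
```

Counting missing values at this stage is useful because several later steps (filtering, plotting) silently drop or warn about rows with `NA`.
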
---

# Piping

.left-column[
[Image not available]
]

.right-column[
There's one **operator** in the R language we didn't touch on in the previous unit.

That is the **pipe operator**, or `%>%`.

The pipe operator is used to connect sequences of operations.

It passes the output of the previous function as the first argument to the next function.

We can show how the pipe operator works when we look at how to **filter** data.
]

---

# Filtering

Suppose we want to work on a **subset** of our data.

For example, we might want to look only at observations with **complete information** on GDP per capita, and which don't have extreme values.

Here's code which achieves that:

```r
dat2 <- dat %>%
    filter(!is.na(e_migdppc)) %>%
    filter(e_migdppc < 100000) %>%
    filter(e_migdppc > 1000)
```

This code uses two new functions (`is.na` and `filter`) and a new operator (`!`, meaning "not").

It creates a new data frame which has only non-missing values for GDP per capita (i.e., not-NA values), and in which GDP per capita is both less than 100,000 and greater than 1,000.

---

# Did it work?

To find out whether this worked, we can use another helpful function, `nrow(...)`.

```r
nrow(dat)
```

```
## [1] 178
```

```r
nrow(dat2)
```

```
## [1] 154
```

---

# Filtering (2)

An alternative way of writing this without the pipe would be:

```r
dat2 <- filter(dat, !is.na(e_migdppc))
dat2 <- filter(dat2, e_migdppc < 100000)
dat2 <- filter(dat2, e_migdppc > 1000)
```

Why should we prefer using the pipe? It's **tidier** and **uses fewer characters**.

---

# Selecting variables

In the same way that we use `filter` to select **rows** of data, we use `select` to select **columns** of data (that is, variables).

If you look at the data (by using `head(dat)` or `summary(dat)`) you'll see that the data contains both the name of each country and its ISO 3 letter code.

Maybe you are confident that you and your audience always recognise ISO 3 letter codes. If so, the country name is redundant, and there are two ways of dropping it.

```r
dat <- dat %>%
    select(country_text_id, year, v2x_polyarchy,
           e_boix_regime, e_migdppc, e_regionpol_6C)
```

or alternatively

```r
dat <- dat %>%
    select(-country_name)
```

---

# Selecting (2)

In the first pattern, we list explicitly the variables we wish to keep.

In the second pattern, we list explicitly the variables we wish to drop.

The second pattern can be rewritten as:

```r
dat <- dat %>%
    select(everything(), -country_name)
```

This is because, according to the [documentation](https://dplyr.tidyverse.org/reference/select.html), "If the first expression [in the call] is negative, select() will automatically start with all variables."

The documentation also describes how to select ranges of variables and variables based on patterns, and is well worth reading.

---

# Doing things to variables

Over the next few slides I'll look at various ways of transforming variables.

The general pattern for these transformations is something like:

```r
dat <- dat %>%
    mutate(newvar = do_something(oldvar))
```

This means we'll be **keeping the same data frame** (and not making any copies of it).

This also means we'll be **creating new variables** (and not over-writing existing variables).

If you create lots of new versions of variables, you can always use `select` to trim your data frame back down.

---

# Transforming continuous variables

.small[
It's best to transform some variables before using them in an analysis.

These transformations can be very simple. We may, for example, wish to measure GDP per capita in thousands of dollars per year, rather than in dollars per year.

These transformations can be more complicated. We may wish to take the log of a strictly positive variable like GDP per capita.

Here I show both:
]

```r
dat <- dat %>%
    mutate(gdppercap = e_migdppc / 1000,
           gdppercap_l = log(gdppercap))
```

.small[
Notice that I create the logged variable `gdppercap_l` from the `gdppercap` variable I created in the previous line.
]

---

# Discretizing continuous variables

.small[
Sometimes we want to reduce the complexity of continuous variables.

We might want to recode them into halves, or terciles, or quartiles.

Here's an example of how to do this, using the `ifelse` function. We'll split our continuous measure of democracy (`v2x_polyarchy`) into halves.
]

```r
dat <- dat %>%
    mutate(discret_dem = ifelse(v2x_polyarchy > median(v2x_polyarchy),
                                "More democratic than autocratic",
                                "More autocratic than democratic"))
```

.small[
We could also have created a dummy variable (not run):
]

```r
dat <- dat %>%
    mutate(discret_dem = ifelse(v2x_polyarchy > median(v2x_polyarchy),
                                1,
                                0))
```

If we wanted to split at a percentile other than the 50th (the median), we could have used the `quantile` function, illustrated in the next slide.

---

# Discretizing continuous variables (2)

.small[
The `ifelse` command is helpful for dichotomizing things.

For more complicated scenarios, the `case_when` command is helpful.

Here is an example of how to turn our continuous democracy score into a trichotomy.
]

```r
dat <- dat %>%
    mutate(discret_dem = case_when(v2x_polyarchy > quantile(v2x_polyarchy, 2/3) ~ "Democratic",
                                   v2x_polyarchy > quantile(v2x_polyarchy, 1/3) ~ "Average",
                                   TRUE ~ "Autocratic"))
```

.small[
Here's a translation: if the value of `v2x_polyarchy` is greater than the 2/3 quantile (roughly the 67th percentile), assign the value that follows the tilde (`~`), which is "Democratic". **Otherwise**, if the value of `v2x_polyarchy` is greater than the 1/3 quantile (roughly the 33rd percentile), assign "Average". **In all other cases**, assign the value "Autocratic". The conditions are checked in order, and the first one that is true wins.
]

---

# Fixing individual values

Suppose that we know that the value for a particular observation is **wrong**.

We could fix this in the raw data, but maybe it's someone else's data and we want to **record** (=be transparent about) our alterations.
Let's consider two cases: fixing a continuous variable and fixing a character variable.

---

# Fixing a single continuous value

Let's replace UK GDP per capita (in thousands of dollars) for the most recent year with a different value.

```r
dat2 <- dat %>%
    mutate(gdppercap = ifelse(country_text_id == "GBR" & year == max(year),
                              36,
                              gdppercap))
```

--

Did it work?

```r
dat2 %>%
    filter(country_text_id == "GBR" & year == max(year)) %>%
    select(gdppercap)
```

```
##   gdppercap
## 1        36
```

---

# Fixing discrete values

At the moment, Australia and New Zealand are recorded as belonging to "Western Europe and North America".

.pull-left[
```r
dat %>%
    filter(e_regionpol_6C == "Western Europe and North America") %>%
    select(country_text_id)
```
]

.pull-right[
|    | country_text_id |
|----|-----------------|
| 1  | AUS |
| 2  | AUT |
| 3  | BEL |
| 4  | CAN |
| 5  | CHE |
| 6  | CYP |
| 7  | DEU |
| 8  | DNK |
| 9  | ESP |
| 10 | FIN |
| 11 | FRA |
| 12 | GBR |
| 13 | GRC |
| 14 | IRL |
| 15 | ISL |
| 16 | ITA |
| 17 | LUX |
| 18 | MLT |
| 19 | NLD |
| 20 | NOR |
| 21 | NZL |
| 22 | PRT |
| 23 | SWE |
| 24 | USA |
]

---

# Putting Oz in its place

Let's put Australia (and New Zealand) in their rightful place.

```r
dat <- dat %>%
    mutate(new_regions = ifelse(country_text_id %in% c("AUS", "NZL"),
                                "Asia and Pacific",
                                e_regionpol_6C))
```

Note the `%in%` operator. You can read that as **"is contained in", or "matches any one of"**.

Now, has this worked?

---
class: inverse
background-color: black
background-image: url(bobross.gif)

---

# Has this worked?

Let's check...

```r
table(dat$new_regions)
```

```
## 
##                1                2                3                4 
##               28               30               25               21 
##                5                6 Asia and Pacific 
##               50               22                2 
```

---

# Factors

R stores non-numeric variables either as **character strings** or **factors**.

**Factors** are discrete variables with a set number of possible values or factor **levels**.

When R reads in data, it will **by default** convert character variables to factors. (Though this will [change with R 4.0.0](https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/)!!)

Often if you are manipulating factors it is easiest to convert to character and then reconvert.

Here's an example...
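
A minimal sketch of that round trip first, on a toy factor built only for illustration. The object names (`region`, `bad`, `region_chr`, `region_fixed`) are hypothetical and not part of the course data; the region labels are borrowed from the data but their assignment here is invented.

```r
# Toy factor with two levels; the assignment of labels is invented
region <- factor(c("Western Europe and North America",
                   "Sub-Saharan Africa",
                   "Western Europe and North America"))

# ifelse() on a factor hands back the underlying integer codes,
# not the labels, for the elements it leaves unchanged
bad <- ifelse(c(TRUE, FALSE, FALSE), "Asia and Pacific", region)
print(bad)

# Safer route: convert to character, edit, then convert back to a factor
region_chr <- as.character(region)
region_chr[1] <- "Asia and Pacific"
region_fixed <- factor(region_chr)
print(levels(region_fixed))
```

The first result mixes a label with bare numbers, which is the same symptom as in the table of `new_regions`; the second keeps readable labels throughout.
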
---

```r
dat <- dat %>%
    mutate(e_regionpol_6C = as.character(e_regionpol_6C),
           e_regionpol_6C = ifelse(country_text_id %in% c("AUS", "NZL"),
                                   "Asia and Pacific",
                                   e_regionpol_6C))
table(dat$e_regionpol_6C)
```

```
## 
##                 Asia and Pacific  Eastern Europe and Central Asia 
##                               30                               30 
##  Latin America and the Caribbean  Middle East and Northern Africa 
##                               25                               21 
##               Sub-Saharan Africa Western Europe and North America 
##                               50                               22 
```

---

# Recoding multiple discrete values

Using `ifelse` is cumbersome when recoding multiple discrete values. For this, `recode` is more useful.

```r
dat <- dat %>%
    mutate(hemisphere = recode(e_regionpol_6C,
                               "Asia and Pacific" = "Southern",
                               "Eastern Europe and Central Asia" = "Northern",
                               "Latin America and the Caribbean" = "Southern",
                               "Middle East and Northern Africa" = "Northern",
                               "Sub-Saharan Africa" = "Southern",
                               "Western Europe and North America" = "Northern"))
```

---
class: center, inverse, middle

# Unit 2b: Visualization

---

# Reminder

.huge[
Remember to call

```r
library("tidyverse")
```

otherwise things will not work
]

---

.pull-left[
[Image not available]
]

.pull-right[
- We've now got data that is slightly tidier -- can we **visualize it**?
- To visualize data, we'll use the `ggplot2` library (loaded when we loaded `tidyverse`)
- `ggplot2` is used regularly by the graphics teams at the New York Times and the Financial Times. It is capable of producing publication quality graphics with limited effort.
]

---

# Let's scatterplot

- Scatterplots are (to my mind) the most basic and yet effective form of data visualization
- They require two variables to be mapped to the x and y axes. In the language of ggplot, these are the necessary **aesthetics**.
- We can **map other variables** on to other aesthetics
- We can also **customize the axes**
- We'll start with the most basic version

---

# Version 1

```r
p1 <- ggplot(dat, aes(x = e_migdppc, y = v2x_polyarchy)) +
    geom_point()
print(p1)
```

[Figure: `p1`, scatterplot of `v2x_polyarchy` against `e_migdppc`]

```r
### Default units are (inexplicably) inches
ggsave("gdp_by_democracy.png", plot = p1, width = 7, height = 4)
```

---

# Problems with version 1

- The **axes** are not intelligible unless you know our variable names
- The data on the horizontal axis are compressed and could benefit from **transformation**
- **Outliers** could usefully be labelled
- The **background** is ugly and may not be accepted for print publication
- We could encode additional information using other aesthetics

We will solve these by adding elements to our simple template.

---

# Axis labelling

```r
p2 <- p1 +
    scale_x_continuous("GDP per capita in US dollars") +
    scale_y_continuous("Level of electoral democracy\n[0-1, higher values indicate more democracy]")
```

(**Note:** `\n` means new line; ggplot will not word-wrap automatically)

---

# Version 2

```
## Warning: Removed 16 rows containing missing values (geom_point).
```

[Figure: `p2`, the scatterplot with labelled axes]

---

# Axis labelling (2)

Let's try labelling the x axis by telling ggplot2 we're using dollars.

```r
p2a <- p2 +
    scale_x_continuous("GDP per capita in US dollars",
                       labels = scales::dollar_format())
```

```
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
```

See the `scales` [package documentation](https://scales.r-lib.org/reference/index.html) for more possible label scales.

---

# Version 2a

```
## Warning: Removed 16 rows containing missing values (geom_point).
```

[Figure: `p2a`, the scatterplot with dollar-formatted x axis labels]

---

# Variable/axis transformations

We have two options. We can **transform the variable**, or we can **transform the axis**.

```r
dat <- dat %>%
    mutate(gdppc_l = log(e_migdppc))
p3a <- ggplot(dat, aes(x = gdppc_l, y = v2x_polyarchy)) +
    geom_point()
p3b <- p2a +
    scale_x_log10("GDP per capita in US dollars",
                  labels = scales::dollar_format())
```

```
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
```

---

# Both versions

[Figure: the two versions, `p3a` and `p3b`]

---

# (Manual) Outlier labelling

So far, we have used just one function beginning **`geom_`**, namely **`geom_point`**.

When we used that function, we didn't supply any data to it. It just inherited the same data frame we passed to `ggplot`.

We will now use a new geom, namely `geom_text`, and pass a **subset** of our data.

---

# The code

```r
### hjust = 1 right-aligns each label at its point
p4 <- p3b +
    geom_text(data = filter(dat, e_migdppc > 40000 & v2x_polyarchy < 0.5),
              aes(label = country_text_id),
              hjust = 1)
```

---

```r
p4
```

```
## Warning: Removed 16 rows containing missing values (geom_point).
```

[Figure: `p4`, the scatterplot with labelled outliers]

---

# Changing the background

```r
p5 <- p4 + theme_bw()
p5b <- p4 + theme_minimal()
```

---

```r
### plot_grid() comes from the cowplot package, which tidyverse does not load
cowplot::plot_grid(p5, p5b, labels = c('BW', 'Minimal'), label_size = 12)
```

[Figure: `p5` and `p5b` side by side]

---

# Mapping on another variable

- We have used position on the horizontal and vertical axis to encode information about two variables
- There are other aesthetics that we could use to encode information:
  * the **size** or **saturation/intensity** of the plotted point (continuous)
  * the **colour** or **shape** of the plotted point (categorical)
- Let's experiment with mapping the colour and shape of plotted points to the region of the plotted point.
---

# The code

```r
p6 <- ggplot(dat, aes(x = e_migdppc, y = v2x_polyarchy)) +
    geom_point(aes(shape = e_regionpol_6C,
                   colour = e_regionpol_6C)) +
    scale_shape_discrete("Region") +
    scale_colour_discrete("Region") +
    scale_x_log10("GDP per capita in US dollars",
                  labels = scales::dollar_format()) +
    scale_y_continuous("Level of electoral democracy\n[0-1, higher values indicate more democracy]") +
    geom_text(data = filter(dat, e_migdppc > 40000 & v2x_polyarchy < 0.5),
              aes(label = country_text_id),
              hjust = 1) +
    theme_minimal() +
    theme(legend.position = "bottom")
```

---

```r
print(p6)
```

```
## Warning: Removed 16 rows containing missing values (geom_point).
```

[Figure: `p6`, the scatterplot with colour and shape mapped to region]

---

# Small multiples

You might think the last plot is "too busy", and that it does not make sense to encode additional information about region using shape and colour.

An alternative is to plot multiple small scatterplots, one for each region.

This is **incredibly easy** in `ggplot2`.

```r
p7 <- p6 + facet_wrap(~e_regionpol_6C)
```

---

```r
print(p7)
```

```
## Warning: Removed 16 rows containing missing values (geom_point).
```

[Figure: `p7`, one small scatterplot per region]

---

# Adding on trend lines

We can use a further `geom` to improve our plot. This is `geom_smooth`.

By default, `geom_smooth` will give you a local regression with an estimated standard error.

This is a bit much for my tastes, so I pass the arguments `method = "lm"` and `se = FALSE`.

We'll see later why the value passed to the method argument is "lm".

---

```r
p8 <- p7 + geom_smooth(method = "lm", se = FALSE)
```

```
## `geom_smooth()` using formula 'y ~ x'
```

[Figure: `p8`, the small multiples with linear trend lines]

---

# Recap

You've learned:

- How to read in a CSV file
- How to select rows of data (`filter`) and columns of data (`select`)
- How to transform continuous data
- How to recode categorical data, and how R uses `factors`
- How to visualize data using `ggplot2`

---

# What about you?

Can you reproduce a graphic from one of your articles?

Do you need a different `geom`?
Check out the list at [https://ggplot2.tidyverse.org/reference/#section-layer-geoms](https://ggplot2.tidyverse.org/reference/#section-layer-geoms)

Useful geoms include:

- `geom_boxplot()` and `geom_pointrange()`; the latter is particularly useful for coefficient plots
- `geom_col()` (usually preferable to `geom_bar()`)
- `geom_line()`

Do you need tips on visualization?

Check out Kieran Healy's [Data Visualization: A Practical Introduction](https://socviz.co/), a brilliant book available for free in draft.

---
class: center, middle, inverse