class: center, middle, inverse, title-slide # Data management with R ### Ariane Aumaitre ### European University Institute ### 14/05/2020 --- background-image: url(https://i.ibb.co/kyMCbQ7/r-first-then.png) background-size: 1000px background-position: 50% 50% class: bottom ##### Illustration: Allison Horst --- background-image: url(https://i.imgur.com/bmJmSXI.png) background-size: 800px background-position: 50% 75% # Goals of this workshop * Giving you a comprehensive set of tools to **manage your data** * Focus on how to **get your data ready** for analysis 💻 * We will cover **importing**, **tidying**, **transforming** and some **visualization** --- # Getting started: libraries and data First, libraries: ```r library(tidyverse) ``` Loading a csv file: ```r data <- read.csv("workshop_data.csv") head(data) ``` ``` ## person_id household_id year_born sex employed retired education ## 1 101 1 1949 1 0 1 300 ## 2 102 1 1951 2 0 0 300 ## 3 301 3 1965 2 1 0 500 ## 4 302 3 1993 2 1 0 500 ## 5 303 3 1996 2 0 0 344 ## 6 401 4 1958 2 0 0 500 ## net_income gross_income ## 1 0.0 0.00 ## 2 0.0 0.00 ## 3 18470.4 22095.67 ## 4 6240.0 7042.30 ## 5 2210.0 2396.34 ## 6 3120.0 3362.77 ``` --- #Importing data: packages <style type="text/css"> .tg {border-collapse:collapse;border-color:#aaa;border-spacing:0;} .tg td{background-color:#fff;border-bottom-width:1px;border-color:#aaa;border-style:solid;border-top-width:1px; border-width:0px;color:#333;font-family:Josefin Sans;font-size:16px;overflow:hidden;padding:16px 20px; word-break:normal;} .tg th{background-color:#f38630;border-bottom-width:1px;border-color:#aaa;border-style:solid;border-top-width:1px; border-width:0px;color:#fff;font-family:Josefin Sans;font-size:16px;font-weight:normal;overflow:hidden; padding:14px 18px;word-break:normal;} .tg .tg-y3zf{background-color:#ffffff;border-color:#00bffe;text-align:center;vertical-align:top} .tg .tg-f4o9{background-color:#ffffff;border-color:#00bffe;font-style:italic;text-align:center;vertical-align:top} .tg .tg-ogoq{background-color:#ffffff;border-color:#00bffe;font-style:italic;text-align:center;vertical-align:top} .tg .tg-xads{background-color:#ffffff;border-color:#00bffe;color:#00bffe;font-weight:bold;text-align:center;vertical-align:top} .tg .tg-24j6{background-color:#ffffff;border-color:#00bffe;font-weight:bold;text-align:center;vertical-align:top} .tg .tg-elfb{background-color:#ffffff;border-color:#00bffe;font-weight:bold;text-align:center;vertical-align:top} .tg .tg-b2qw{background-color:#ffffff;border-color:#00bffe;text-align:center;vertical-align:top} </style> <table class="tg"> <thead> <tr> <th class="tg-xads">Package</th> <th class="tg-xads">Files</th> <th class="tg-xads">Functions</th> </tr> </thead> <tbody> <tr> <td class="tg-24j6">haven</td> <td class="tg-ogoq">Stata, SPSS, SAS</td> <td class="tg-y3zf">read_dta("yourfile.dta")<br>read_sav()<br>read_sas()<br> <br></td> </tr> <tr> <td class="tg-elfb">readr</td> <td class="tg-f4o9">csv</td> <td class="tg-b2qw">read_csv("yourfile.csv")<br><br> </td> </tr> <tr> <td class="tg-24j6">readxl</td> <td class="tg-ogoq">xls / xlsx</td> <td class="tg-y3zf">read_excel("yourfile.xlsx")<br>read_excel("yourfile.xlsx", sheet="name")<br></td> </tr> </tbody> </table> --- class: inverse, center, middle # 1. Getting to know your data --- # Quick revision of data types 📋 * In **R**, uni-dimendional data are called **vectors**. They can be **numeric, integers, character, factors or logical**. ```r character_vector <- c("a", "b") numeric_vector <- c(1,2) ``` * Useful functions to convert vectors: `as.numeric(data$variable)`, `as.character()`,... * Most times we will be working with many vectors at the same time, brought together into a **data frame** or a **tibble** ```r my_df <- data.frame(character_vector, numeric_vector) ``` * Not everything in R can be stored in a data frame - sometimes you will be working with **lists** --- # Getting to know your data🔎️ `head()` shows the first 6 rows of your data. To see more (or less), you can do `head(data, n)` ```r head(data) ``` ``` ## person_id household_id year_born sex employed retired education ## 1 101 1 1949 1 0 1 300 ## 2 102 1 1951 2 0 0 300 ## 3 301 3 1965 2 1 0 500 ## 4 302 3 1993 2 1 0 500 ## 5 303 3 1996 2 0 0 344 ## 6 401 4 1958 2 0 0 500 ## net_income gross_income ## 1 0.0 0.00 ## 2 0.0 0.00 ## 3 18470.4 22095.67 ## 4 6240.0 7042.30 ## 5 2210.0 2396.34 ## 6 3120.0 3362.77 ``` --- # Getting to know your data🔎 `str()` gives you an overview of data types ```r str(data) ``` ``` ## 'data.frame': 33734 obs. of 9 variables: ## $ person_id : int 101 102 301 302 303 401 402 403 501 601 ... ## $ household_id: int 1 1 3 3 3 4 4 4 5 6 ... ## $ year_born : int 1949 1951 1965 1993 1996 1958 1991 1997 1938 1941 ... ## $ sex : int 1 2 2 2 2 2 2 1 2 2 ... ## $ employed : int 0 0 1 1 0 0 1 0 0 0 ... ## $ retired : int 1 0 0 0 0 0 0 0 0 0 ... ## $ education : int 300 300 500 500 344 500 500 344 0 100 ... ## $ net_income : num 0 0 18470 6240 2210 ... ## $ gross_income: num 0 0 22096 7042 2396 ... ``` --- # Getting to know your data🔎 `summary()` gives you summary statistics of your data. You can apply it to the whole data frame or **to a single variable** ```r summary(data$net_income) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 0 0 92 7722 13571 248676 5362 ``` ```r summary(data$gross_income) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 0.0 0.0 99.7 9744.9 15758.4 398312.6 5362 ``` --- class: inverse, center, middle # 2. Data cleaning and transformation --- background-image: url(https://www.khstats.com/img/dplyr_image.jpg) background-size: 400px background-position: 50% 90% # Let's start cleaning 🧹 * You will learn how to **filter**, **count** and **arrange** observations, **select** columns, **transform** variables, **create** new columns... * We will also learn how to **group** dataframes to **summarize** information. * Our main tool: the **dplyr** package --- # Filtering data </br> * The `filter()` function 'returns rows with matching conditions' ```r filtered_data <- filter(data, net_income >0) head(filtered_data) ``` ``` ## person_id household_id year_born sex employed retired education ## 1 301 3 1965 2 1 0 500 ## 2 302 3 1993 2 1 0 500 ## 3 303 3 1996 2 0 0 344 ## 4 401 4 1958 2 0 0 500 ## 5 402 4 1991 2 1 0 500 ## 6 602 6 1971 2 1 0 500 ## net_income gross_income ## 1 18470.4 22095.67 ## 2 6240.0 7042.30 ## 3 2210.0 2396.34 ## 4 3120.0 3362.77 ## 5 5304.0 5751.23 ## 6 26000.0 33732.85 ``` --- # How to use filter() </br> * Comparison operators: `==, >, <, >=, <=, !=` ```r filter(data, net_income > 0, sex == 2) ``` * The `%in%` operator ```r filter(data, year_born %in% 1950:1960) #you can use any kind of vector after the in #use ! to state "not in" filter(data, !education %in% 200:450) ``` * Filtering NAs out with `is.na()` (be careful ⚠️) ```r filter(data, !is.na(education)) ``` --- background-image: url(https://i.imgur.com/8oZIMsr.png) background-size: 800px background-position: 50% 80% # How to use filter() </br> * Logical operators: `&` and `|` (or) ```r filter(mydata, year_born > 1999 | year_born < 1952) ``` --- # Some syntax: the pipe %>% Compare these two chunks *that produce the same output* .pull_left[ ```r filtered_data <- filter(data, net_income >0) ``` ] .pull_right[ ```r filtered_data <- data%>% filter(net_income>0) ``` ] * The **pipe operator** can be roughly ranslated as **'then'**: _take my data, do this operation, then take the output and perform another one_ * A way of combining **multiple operations** in your code * It helps writing shorter, more efficient code and **avoid repetitions** * Code becomes readable from left-to-right, top-to-bottom --- background-image: url(https://i.imgur.com/qG4Ad2E.png) background-size: 400px background-position: 50% 20% #An illustration </br> </br> </br> </br> </br> </br> </br> </br> </br> </br> </br> .center[] --- # The select() function </br> Use `select()` to keep only the variables you're interested in ```r selected_data <- data%>% select(sex, year_born, net_income) head(selected_data) ``` ``` ## sex year_born net_income ## 1 1 1949 0.0 ## 2 2 1951 0.0 ## 3 2 1965 18470.4 ## 4 2 1993 6240.0 ## 5 2 1996 2210.0 ## 6 2 1958 3120.0 ``` --- # Using select() </br> * Select by **column name**: ```r data%>% select(a,b,c) ``` * Select by column **position**: ```r data%>% select(1:5, 8) ``` * Select all columns **except one**: ```r data%>% select(-1) ``` * Negative selection of a **group of columns**: ```r data%>% select(-c(a,b,c)) ``` --- # Chaining operations and arrange() </br> `arrange()` can be used to order observations by a variable, in ascending or descending order ```r arranged_data <- data%>% filter(sex ==2)%>% select(year_born, gross_income)%>% arrange(desc(gross_income)) head(arranged_data) ``` ``` ## year_born gross_income ## 1 1954 259113.2 ## 2 1951 251850.7 ## 3 1963 229012.6 ## 4 1969 162385.6 ## 5 1954 156385.6 ## 6 1961 134490.9 ``` --- # Transforming data with mutate() </br> `mutate()` can be used for creating new variables, while preserving existing ones ```r new_data <- data%>% mutate(taxes_and_transfers = gross_income - net_income) summary(new_data$taxes_and_transfers) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 0 0 0 2023 2112 162088 5362 ``` ```r head(new_data) ``` ``` ## person_id household_id year_born sex employed retired education ## 1 101 1 1949 1 0 1 300 ## 2 102 1 1951 2 0 0 300 ## 3 301 3 1965 2 1 0 500 ## 4 302 3 1993 2 1 0 500 ## 5 303 3 1996 2 0 0 344 ## 6 401 4 1958 2 0 0 500 ## net_income gross_income taxes_and_transfers ## 1 0.0 0.00 0.00 ## 2 0.0 0.00 0.00 ## 3 18470.4 22095.67 3625.27 ## 4 6240.0 7042.30 802.30 ## 5 2210.0 2396.34 186.34 ## 6 3120.0 3362.77 242.77 ``` --- # Useful functions within mutate() </br> * `ifelse()` to return a column filled with one of two values depending on a condition ```r data%>% mutate(high_education = ifelse(education == 500, 1, 0)) ``` * `case_when()` is the easiest way to expand ifelse: ```r data%>% mutate(edu_clean = case_when(education %in% 0:100 ~ "Primary or below", education %in% 200:354 ~ "Lower secondary", education %in% 400:450 ~ "Upper secondary", education == 500 ~ "Higher education")) ``` * `replace()` to change values in a column for a replacement * `cumsum()` --- background-image: url(https://pbs.twimg.com/media/EFASUtGUcAANGYr.jpg) background-size: 400px background-position: 50% 90% </br> # Packages you may find useful * **stringr** for working with text data and character vectors - functions for string cleaning, trimming and working with patterns * **forcats** for working with categorical data: recoding factors, collapsing values... * **lubridate** makes working with dates easier --- # Working with groups </br> * If you want to work with groups within your data, you need to use the `group_by()` function * `group_by()` does not produce any output, but whichever operation comes afterwards will be performed *by group* * After group by, you can either use `mutate()` if you want to keep all your observations or `summarize()` if you are only interested in the group summary * If you want to keep working with the dataframe as a whole after your group operations, you should always remember to `ungroup()` --- # summarize() </br> * After grouping by your variable(s) of interest, you can use `summarize()` with any function you want to create summary variables ```r summary_data <- data%>% mutate(age = 2018 - year_born, age_group = case_when(age %in% 0:18 ~ "0-18", age %in% 19:35 ~ "19-35", age %in% 36:65 ~ "36-65", age > 65 ~ "65+"))%>% group_by(age_group)%>% summarize(mean_income = mean(net_income, na.rm = TRUE), min_income = min(net_income, na.rm = TRUE), max_income = max(net_income, na.rm = TRUE)) head(summary_data) ``` ``` ## # A tibble: 4 x 4 ## age_group mean_income min_income max_income ## <chr> <dbl> <dbl> <dbl> ## 1 0-18 222. 0 27213. ## 2 19-35 6350. 0 57258. ## 3 36-65 11906. 0 248676. ## 4 65+ 548. 0 151738. ``` --- # Useful functions for inequality </br> * Useful packages: **laeken**, **ineq** * They include functions for computing different inequality and poverty measures, as well as some basic plotting ```r library(ineq) ``` ``` ## Warning: package 'ineq' was built under R version 3.5.2 ``` ```r summary_data <- data%>% summarize(gini_net = Gini(net_income), gini_gross = Gini(gross_income)) summary_data ``` ``` ## gini_net gini_gross ## 1 0.7099678 0.7291097 ``` --- # Other useful dplyr functions </br> * `distinct()` to select unique rows from a data frame * `top_n()` to filter the top n observations (you can use it in combination with `arrange()`) * `ntile()` ranks values into the number of groups you provide * `transmute()` works just like `mutate`, but drops existing variables (it works as a combination of `select` and `mutate`) * `rename()` to change the name of the variables in your data frame * All functions: https://dplyr.tidyverse.org/reference/index.html