Data management with R

class: center, middle, inverse, title-slide

# Data management with R
### Ariane Aumaitre
### European University Institute
### 14/05/2020

---

background-image: url(https://i.ibb.co/kyMCbQ7/r-first-then.png)
background-size: 1000px
background-position: 50% 50%
class: bottom

##### Illustration: Allison Horst
---
background-image: url(https://i.imgur.com/bmJmSXI.png)
background-size: 800px
background-position: 50% 75%

# Goals of this workshop
&nbsp;
&nbsp;
* Giving you a comprehensive set of tools to **manage your data**

* Focus on how to **get your data ready** for analysis 💻

* We will cover **importing**, **tidying**, **transforming** and some **visualization**

---

# Getting started: libraries and data

First, libraries:

```r
library(tidyverse)
```

Loading a csv file:

```r
data <- read.csv("workshop_data.csv")

head(data)
```

```
##   person_id household_id year_born sex employed retired education
## 1       101            1      1949   1        0       1       300
## 2       102            1      1951   2        0       0       300
## 3       301            3      1965   2        1       0       500
## 4       302            3      1993   2        1       0       500
## 5       303            3      1996   2        0       0       344
## 6       401            4      1958   2        0       0       500
##   net_income gross_income
## 1        0.0         0.00
## 2        0.0         0.00
## 3    18470.4     22095.67
## 4     6240.0      7042.30
## 5     2210.0      2396.34
## 6     3120.0      3362.77
```

---

#Importing data: packages
&nbsp;
&nbsp;
&nbsp;

---
class: inverse, center, middle
# 1. Getting to know your data
---
# Quick revision of data types 📋
&nbsp;
&nbsp;
&nbsp;

* In **R**, uni-dimendional data are called **vectors**. They can be **numeric, integers, character, factors or logical**.

```r
character_vector <- c("a", "b")
numeric_vector <- c(1,2)
```

* Useful functions to convert vectors: `as.numeric(data$variable)`, `as.character()`,...

* Most times we will be working with many vectors at the same time, brought together into a **data frame** or a **tibble**

```r
my_df <- data.frame(character_vector, numeric_vector)
```
* Not everything in R can be stored in a data frame - sometimes you will be working with **lists**

---
# Getting to know your data🔎️
&nbsp;
&nbsp;  
`head()` shows the first 6 rows of your data.

To see more (or less), you can do `head(data, n)`

```r
head(data)
```

```r
str(data)
```

```
## 'data.frame':	33734 obs. of  9 variables:
##  $ person_id   : int  101 102 301 302 303 401 402 403 501 601 ...
##  $ household_id: int  1 1 3 3 3 4 4 4 5 6 ...
##  $ year_born   : int  1949 1951 1965 1993 1996 1958 1991 1997 1938 1941 ...
##  $ sex         : int  1 2 2 2 2 2 2 1 2 2 ...
##  $ employed    : int  0 0 1 1 0 0 1 0 0 0 ...
##  $ retired     : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ education   : int  300 300 500 500 344 500 500 344 0 100 ...
##  $ net_income  : num  0 0 18470 6240 2210 ...
##  $ gross_income: num  0 0 22096 7042 2396 ...
```
---
# Getting to know your data🔎
&nbsp;
&nbsp;

`summary()` gives you summary statistics of your data. 
You can apply it  to the whole data frame or **to a single variable**

```r
summary(data$net_income)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       0      92    7722   13571  248676    5362
```

```r
summary(data$gross_income)
```

```
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##      0.0      0.0     99.7   9744.9  15758.4 398312.6     5362
```
---
class: inverse, center, middle
# 2. Data cleaning and transformation
---
background-image: url(https://www.khstats.com/img/dplyr_image.jpg)
background-size: 400px
background-position: 50% 90%
# Let's start cleaning 🧹
&nbsp;
&nbsp;

* You will learn how to **filter**, **count** and **arrange** observations, **select** columns, **transform** variables, **create** new columns...

* We will also learn how to **group** dataframes to **summarize** information.

* Our main tool: the **dplyr** package

---
# Filtering data
</br>
* The `filter()` function 'returns rows with matching conditions'

```r
filtered_data <- filter(data, net_income >0)

head(filtered_data)
```

```
##   person_id household_id year_born sex employed retired education
## 1       301            3      1965   2        1       0       500
## 2       302            3      1993   2        1       0       500
## 3       303            3      1996   2        0       0       344
## 4       401            4      1958   2        0       0       500
## 5       402            4      1991   2        1       0       500
## 6       602            6      1971   2        1       0       500
##   net_income gross_income
## 1    18470.4     22095.67
## 2     6240.0      7042.30
## 3     2210.0      2396.34
## 4     3120.0      3362.77
## 5     5304.0      5751.23
## 6    26000.0     33732.85
```

---
# How to use filter()

</br>

* Comparison operators: `==, >, <, >=, <=, !=`

```r
filter(data, net_income > 0, sex == 2)
```
* The `%in%` operator

```r
filter(data, year_born %in% 1950:1960) #you can use any kind of vector after the in

#use ! to state "not in"
filter(data, !education %in% 200:450)
```

* Filtering NAs out with `is.na()` (be careful ⚠️)

```r
filter(data, !is.na(education))
```

---
background-image: url(https://i.imgur.com/8oZIMsr.png)
background-size: 800px
background-position: 50% 80%
  
# How to use filter()
  </br>

* Logical operators: `&` and `|` (or)

```r
filter(mydata, year_born > 1999 | year_born < 1952)
```
---

# Some syntax: the pipe %>%

Compare these two chunks *that produce the same output*

.pull_left[

```r
filtered_data <- filter(data, net_income >0)
```
]
.pull_right[

```r
filtered_data <- data%>%
filter(net_income>0)
```
]

* The **pipe operator** can be roughly ranslated as **'then'**: _take my data, do this operation, then take the output and perform another one_

* A way of combining **multiple operations** in your code

* It helps writing shorter, more efficient code and **avoid repetitions**

* Code becomes readable from left-to-right, top-to-bottom

---
background-image: url(https://i.imgur.com/qG4Ad2E.png)
background-size: 400px
background-position: 50% 20%

#An illustration
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>

.center[![](https://i.imgur.com/itdDc7Q.gif)]

---
# The select() function

</br>
Use `select()` to keep only the variables you're interested in

```r
selected_data <- data%>%
                    select(sex, year_born, net_income)

head(selected_data)
```

```
##   sex year_born net_income
## 1   1      1949        0.0
## 2   2      1951        0.0
## 3   2      1965    18470.4
## 4   2      1993     6240.0
## 5   2      1996     2210.0
## 6   2      1958     3120.0
```

---

# Using select()
</br>

* Select by **column name**:

```r
data%>%
  select(a,b,c)
```

* Select by column **position**:

```r
data%>%
  select(1:5, 8)
```

* Select all columns **except one**:

```r
data%>%
  select(-1)
```

* Negative selection of a **group of columns**:

```r
data%>%
  select(-c(a,b,c))
```

---
# Chaining operations and arrange()
</br>

`arrange()` can be used to order observations by a variable, in ascending or descending order

```r
arranged_data <- data%>%
  filter(sex ==2)%>%
  select(year_born, gross_income)%>%
  arrange(desc(gross_income))

head(arranged_data)
```

```
##   year_born gross_income
## 1      1954     259113.2
## 2      1951     251850.7
## 3      1963     229012.6
## 4      1969     162385.6
## 5      1954     156385.6
## 6      1961     134490.9
```

---

# Transforming data with mutate()

</br>

`mutate()` can be used for creating new variables, while preserving existing ones

```r
new_data <- data%>%
  mutate(taxes_and_transfers = gross_income - net_income)

summary(new_data$taxes_and_transfers)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       0       0    2023    2112  162088    5362
```

```r
head(new_data)
```

```
##   person_id household_id year_born sex employed retired education
## 1       101            1      1949   1        0       1       300
## 2       102            1      1951   2        0       0       300
## 3       301            3      1965   2        1       0       500
## 4       302            3      1993   2        1       0       500
## 5       303            3      1996   2        0       0       344
## 6       401            4      1958   2        0       0       500
##   net_income gross_income taxes_and_transfers
## 1        0.0         0.00                0.00
## 2        0.0         0.00                0.00
## 3    18470.4     22095.67             3625.27
## 4     6240.0      7042.30              802.30
## 5     2210.0      2396.34              186.34
## 6     3120.0      3362.77              242.77
```
---
# Useful functions within mutate()
</br>

* `ifelse()` to return a column filled with one of two values depending on a condition

```r
data%>%
  mutate(high_education = ifelse(education == 500, 1, 0))
```

* `case_when()` is the easiest way to expand ifelse:

```r
data%>%
  mutate(edu_clean = case_when(education %in% 0:100 ~ "Primary or below",
                                   education %in% 200:354 ~ "Lower secondary",
                                   education %in% 400:450 ~ "Upper secondary",
                                   education == 500 ~ "Higher education"))
```

* `replace()` to change values in a column for a replacement

* `cumsum()`

---
background-image: url(https://pbs.twimg.com/media/EFASUtGUcAANGYr.jpg)
background-size: 400px
background-position: 50% 90%
</br>
# Packages you may find useful

* **stringr** for working with text data and character vectors - functions for string cleaning, trimming and working with patterns

* **forcats** for working with categorical data: recoding factors, collapsing values...

* **lubridate** makes working with dates easier

---
# Working with groups
</br>
* If you want to work with groups within your data, you need to use the `group_by()` function

* `group_by()` does not produce any output, but whichever operation comes afterwards will be performed *by group*

* After group by, you can either use `mutate()` if you want to keep all your observations or `summarize()` if you are only interested in the group summary

* If you want to keep working with the dataframe as a whole after your group operations, you should always remember to `ungroup()`

---
# summarize()
</br>
* After grouping by your variable(s) of interest, you can use `summarize()` with any function you want to create summary variables

```r
summary_data <- data%>%
  mutate(age = 2018 - year_born,
         age_group = case_when(age %in% 0:18 ~ "0-18",
                      age %in% 19:35 ~ "19-35",
                      age %in% 36:65 ~ "36-65",
                      age > 65 ~ "65+"))%>%
  group_by(age_group)%>%
  summarize(mean_income = mean(net_income, na.rm = TRUE),
            min_income = min(net_income, na.rm = TRUE),
            max_income = max(net_income, na.rm = TRUE))

head(summary_data)
```

```
## # A tibble: 4 x 4
##   age_group mean_income min_income max_income
##   <chr>           <dbl>      <dbl>      <dbl>
## 1 0-18             222.          0     27213.
## 2 19-35           6350.          0     57258.
## 3 36-65          11906.          0    248676.
## 4 65+              548.          0    151738.
```
---
# Useful functions for inequality

</br> 
* Useful packages: **laeken**, **ineq**

* They include functions for computing different inequality and poverty measures, as well as some basic plotting

```r
library(ineq)
```

```
## Warning: package 'ineq' was built under R version 3.5.2
```

```r
summary_data <- data%>%
  summarize(gini_net = Gini(net_income),
            gini_gross = Gini(gross_income))

summary_data
```

```
##    gini_net gini_gross
## 1 0.7099678  0.7291097
```

---

# Other useful dplyr functions
</br>
* `distinct()` to select unique rows from a data frame

* `top_n()` to filter the top n observations (you can use it in combination with `arrange()`)

* `ntile()` ranks values into the number of groups you provide

* `transmute()` works just like `mutate`, but drops existing variables (it works as a combination of `select` and `mutate`)

* `rename()` to change the name of the variables in your data frame

* All functions: https://dplyr.tidyverse.org/reference/index.html