Chapter 8 Life expectancy and GDP from graphs to animation
Install required packages
#install.packages(c("gganimate", "gifski", "av"))
Load packages
Load Life expectancy data
Make a line graph of life expectancy for South Africa
The data comes in wide format, with one column per year:
## # A tibble: 1 × 302
## country `1800` `1801` `1802` `1803` `1804` `1805` `1806` `1807` `1808` `1809`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 South A… 33.5 33.5 33.5 33.5 33.5 33.5 33.5 33.5 33.5 33.5
## # ℹ 291 more variables: `1810` <dbl>, `1811` <dbl>, `1812` <dbl>, `1813` <dbl>,
## # `1814` <dbl>, `1815` <dbl>, `1816` <dbl>, `1817` <dbl>, `1818` <dbl>,
## # `1819` <dbl>, `1820` <dbl>, `1821` <dbl>, `1822` <dbl>, `1823` <dbl>,
## # `1824` <dbl>, `1825` <dbl>, `1826` <dbl>, `1827` <dbl>, `1828` <dbl>,
## # `1829` <dbl>, `1830` <dbl>, `1831` <dbl>, `1832` <dbl>, `1833` <dbl>,
## # `1834` <dbl>, `1835` <dbl>, `1836` <dbl>, `1837` <dbl>, `1838` <dbl>,
## # `1839` <dbl>, `1840` <dbl>, `1841` <dbl>, `1842` <dbl>, `1843` <dbl>, …
life_expectancy_sa <- life_expectancy |>
  filter(country == "South Africa") |>
  # Reshape from one column per year to one row per year
  pivot_longer(-country,
               names_to = "Year",
               values_to = "LifeExpectancy") |>
  mutate(Year_Numeric = as.numeric(Year))
## # A tibble: 301 × 4
## country Year LifeExpectancy Year_Numeric
## <chr> <chr> <dbl> <dbl>
## 1 South Africa 1800 33.5 1800
## 2 South Africa 1801 33.5 1801
## 3 South Africa 1802 33.5 1802
## 4 South Africa 1803 33.5 1803
## 5 South Africa 1804 33.5 1804
## 6 South Africa 1805 33.5 1805
## 7 South Africa 1806 33.5 1806
## 8 South Africa 1807 33.5 1807
## 9 South Africa 1808 33.5 1808
## 10 South Africa 1809 33.5 1809
## # ℹ 291 more rows
Graph the data
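The plotting code for this step did not survive in these notes. A minimal sketch, assuming a table shaped like life_expectancy_sa above; toy_sa and its values are made up for illustration and are not the real data:

```r
library(ggplot2)
library(tibble)

# Toy stand-in for life_expectancy_sa (illustrative values only)
toy_sa <- tibble(
  country = "South Africa",
  Year_Numeric = c(1800, 1900, 2000),
  LifeExpectancy = c(33.5, 35, 56)
)

# One line: year on the x axis, life expectancy on the y axis
p_sa <- ggplot(toy_sa) +
  geom_line(aes(x = Year_Numeric, y = LifeExpectancy)) +
  labs(x = "Year", y = "Life expectancy (years)")

p_sa
```

Using the numeric year column matters here: with the character Year column, ggplot would treat each year as a separate category.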
8.1 Make the same graph but for three random countries
## # A tibble: 3 × 1
## country
## <chr>
## 1 Nauru
## 2 Djibouti
## 3 South Sudan
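The code that drew the three countries is not shown above. One way it could be done, sketched here with a made-up toy_life table and dplyr's slice_sample(); the seed is only there to make the draw repeatable and is an assumption, not part of the original:

```r
library(dplyr)
library(tibble)

# Toy stand-in for the wide life_expectancy table (illustrative values only)
toy_life <- tibble(
  country = c("Nauru", "Djibouti", "South Sudan", "South Africa", "Chile"),
  `1800` = c(30, 31, 32, 33, 34)
)

set.seed(42)  # assumption: fixed seed so the same rows come back each run

# Keep only the country column, then sample three rows without replacement
random3 <- toy_life |>
  select(country) |>
  slice_sample(n = 3)

random3
```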
life_expectancy_random3 <- life_expectancy |>
  filter(country %in% random3$country) |>
  # Reshape from one column per year to one row per country and year
  pivot_longer(-country,
               names_to = "Year",
               values_to = "LifeExpectancy") |>
  mutate(Year_Numeric = as.numeric(Year))
## # A tibble: 903 × 4
## country Year LifeExpectancy Year_Numeric
## <chr> <chr> <dbl> <dbl>
## 1 Djibouti 1800 29.9 1800
## 2 Djibouti 1801 29.9 1801
## 3 Djibouti 1802 29.9 1802
## 4 Djibouti 1803 29.9 1803
## 5 Djibouti 1804 29.9 1804
## 6 Djibouti 1805 29.9 1805
## 7 Djibouti 1806 29.9 1806
## 8 Djibouti 1807 29.9 1807
## 9 Djibouti 1808 29.9 1808
## 10 Djibouti 1809 29.9 1809
## # ℹ 893 more rows
life_expectancy_random3 |>
  ggplot() +
  geom_line(aes(x = Year_Numeric, y = LifeExpectancy, color = country))
## Warning: Removed 150 rows containing missing values or values outside the
## scale range (`geom_line()`).
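Filter for life expectancy of zero. Did we ever complete this in class?
# Note: this overwrites the wide life_expectancy table with a long one;
# reload the original before pivoting it again later in the chapter.
life_expectancy <- life_expectancy |>
  filter(country %in% random3$country) |>
  pivot_longer(-country,
               names_to = "Year",
               values_to = "LifeExpectancy") |>
  mutate(Year_Numeric = as.numeric(Year))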
Load the GDP per capita data
## Rows: 195 Columns: 302
## ── Column specification ─────────────────────────────────────────────
## Delimiter: ","
## chr (199): country, 1901, 1903, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 19...
## dbl (103): 1800, 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809, 1810,...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Look out, there are k's in the data for 1,000! We should replace k with 000 and . with nothing.
Check out the stringr cheat sheet! Test case for 10k:
## [1] NA
## [1] 10000
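The two outputs above are consistent with the following test, reconstructed here as a sketch since the original chunk is missing. The "10.5k" input is a hypothetical extra case, added to show why multiplying by 1000 is safer than pasting zeros onto the string:

```r
library(stringr)

# A direct conversion fails because "k" is not part of a number
direct <- suppressWarnings(as.numeric("10k"))

# Stripping the "k" first and multiplying by 1000 works
converted <- as.numeric(str_remove("10k", "k")) * 1000

# Multiplying also handles a decimal value (hypothetical input)
decimal_case <- as.numeric(str_remove("10.5k", "k")) * 1000

direct        # NA
converted     # 10000
decimal_case  # 10500
```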
Try out the conversion
testconversion <- gdp |>
  mutate(`1800` = case_when(
    str_detect(`1800`, "k") ~ as.numeric(str_remove(`1800`, "k")) * 1000,
    !str_detect(`1800`, "k") ~ as.numeric(`1800`),
    # The `1901` branches below only apply to rows where `1800` is NA
    str_detect(`1901`, "k") ~ as.numeric(str_remove(`1901`, "k")) * 1000,
    !str_detect(`1901`, "k") ~ as.numeric(`1901`)
  ))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `1800 = case_when(...)`.
## Caused by warning:
## ! NAs introduced by coercion
OK, what were those NAs about, and can we find a better way to do this than copying and pasting code?
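On the NA question: case_when() computes every right-hand side for the whole vector before choosing values, so as.numeric() still sees the strings containing k and warns, even though those NAs never reach the result. A small sketch with a made-up vector x; the warning-free variant at the end is one possible alternative, not the approach used in class:

```r
library(dplyr)
library(stringr)

x <- c("10k", "950")  # toy input

# Warns: as.numeric(x) is evaluated for "10k" too, although that NA is discarded
result <- case_when(
  str_detect(x, "k") ~ as.numeric(str_remove(x, "k")) * 1000,
  !str_detect(x, "k") ~ as.numeric(x)
)

# Warning-free variant: always strip the "k", then pick the multiplier per element
result_quiet <- as.numeric(str_remove(x, "k")) *
  if_else(str_detect(x, "k"), 1000, 1)

result
result_quiet
```

Both versions return the same numbers, so the coercion warning in the first one is noise rather than a sign of lost data.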
Apply a function to many columns simultaneously:
gdp_numeric <- gdp |>
  select(-country) |>
  # Apply the same conversion to every remaining (year) column
  mutate_all(~ case_when(
    str_detect(.x, "k") ~ as.numeric(str_remove(.x, "k")) * 1000,
    !str_detect(.x, "k") ~ as.numeric(.x)
  )) |>
  cbind(country = gdp$country)
## Warning: There were 198 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `1901 = (structure(function (..., .x = ..1, .y = ..2,
## . = ..1) ...`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 197 remaining
## warnings.
Ok, now pivot
gdp_long <- gdp_numeric |>
  pivot_longer(-country,
               names_to = "Year",
               values_to = "GDP_Per_Capita") |>
  mutate(Year = as.numeric(Year))
And graph
8.1.1 Join the life expectancy data together with income data
gdp_long needs to be combined with life_expectancy.
Save the long form of life expectancy
life_expectancy_long <- life_expectancy |>
  pivot_longer(-country,
               names_to = "Year",
               values_to = "LifeExpectancy") |>
  mutate(Year = as.numeric(Year))
Now join them!
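The join itself is missing from these notes. A minimal sketch on two toy long tables, assuming the match is on country and Year; the choice of full_join() is an assumption (left_join() or inner_join() would keep fewer rows), and the values are made up:

```r
library(dplyr)
library(tibble)

# Toy stand-ins for gdp_long and life_expectancy_long (illustrative values only)
toy_gdp_long <- tibble(
  country = c("A", "A", "B"),
  Year = c(1999, 2000, 2000),
  GDP_Per_Capita = c(1000, 1100, 2000)
)
toy_life_long <- tibble(
  country = c("A", "A", "C"),
  Year = c(1999, 2000, 2000),
  LifeExpectancy = c(60, 61, 70)
)

# Match rows on both keys; unmatched rows from either side are kept with NA
life_gdp <- toy_gdp_long |>
  full_join(toy_life_long, by = c("country", "Year"))

life_gdp
```

Both Year columns need the same type for the keys to match, which is why the year is converted with as.numeric() in each long table first.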
Graph all countries for a specific year
life_gdp |>
  filter(Year == 2000) |>
  ggplot() +
  geom_point(aes(x = GDP_Per_Capita, y = LifeExpectancy))
## Warning: Removed 1 row containing missing values or values outside the scale
## range (`geom_point()`).
Load joined data
## Rows: 58695 Columns: 5
## ── Column specification ─────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (4): Year, GDP_Per_Capita, LifeExpectancy, Population
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
8.2 Size points by population
pop_gdp_le |>
  filter(Year == 2000) |>
  ggplot() +
  geom_point(aes(x = GDP_Per_Capita,
                 y = LifeExpectancy,
                 size = Population))
8.3 Web scrape region data
so that we can color the points by region.
Search “country by region plain text”
Save the table to a CSV file
Load region table from CSV file
## # A tibble: 249 × 7
## No `Country or Area` `ISO-alpha3 Code` `M49 Code` `Region 1` `Region 2`
## <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 1 Afghanistan AFG 4 Southern A… <NA>
## 2 2 Åland Islands ALA 248 Northern E… <NA>
## 3 3 Albania ALB 8 Southern E… <NA>
## 4 4 Algeria DZA 12 Northern A… <NA>
## 5 5 American Samoa ASM 16 Polynesia <NA>
## 6 6 Andorra AND 20 Southern E… <NA>
## 7 7 Angola AGO 24 Middle Afr… Sub-Sahar…
## 8 8 Anguilla AIA 660 Caribbean Latin Ame…
## 9 9 Antarctica ATA 10 Antarctica <NA>
## 10 10 Antigua and Barbuda ATG 28 Caribbean Latin Ame…
## # ℹ 239 more rows
## # ℹ 1 more variable: Continent <chr>
all_country_data <- pop_gdp_le |>
  left_join(
    region_data |> select(`Country or Area`, Continent),
    by = c("country" = "Country or Area")
  )
You can disambiguate functions using ::, e.g. dplyr::select specifies that you want the select function specifically from the dplyr package.
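As a small illustration of the :: syntax, here is a sketch on a made-up table. filter is a handy example because base R's stats package also has a function called filter (for time series), so the prefix removes any doubt about which one runs:

```r
library(dplyr)
library(tibble)

toy <- tibble(country = c("A", "B"), Year = c(1999, 2000))  # toy data

# Explicitly call the dplyr versions, whatever else is loaded
kept <- toy |>
  dplyr::filter(Year == 2000) |>
  dplyr::select(country)

kept
```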
8.4 Life expectancy, GDP and Region
all_country_data |>
  filter(Year == 2000) |>
  ggplot() +
  geom_point(aes(
    x = GDP_Per_Capita,
    y = LifeExpectancy,
    size = Population,
    color = Continent
  ))
Some continent data were missing!
Let's inspect the rows with a missing continent.
## # A tibble: 29 × 6
## country Year GDP_Per_Capita LifeExpectancy Population Continent
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 UAE 2000 92400 69.1 3280000 <NA>
## 2 Bolivia 2000 5390 66.5 8590000 <NA>
## 3 Brunei 2000 74300 72.5 334000 <NA>
## 4 Cote d'Ivoire 2000 4000 51.1 16800000 <NA>
## 5 Congo, Dem. Rep. 2000 707 53.9 48600000 <NA>
## 6 Congo, Rep. 2000 4580 53 3130000 <NA>
## 7 Cape Verde 2000 4370 70.3 458000 <NA>
## 8 Czech Republic 2000 24900 75.2 10200000 <NA>
## 9 Micronesia, Fed. St… 2000 3450 62.6 112000 <NA>
## 10 UK 2000 39800 78 58900000 <NA>
## # ℹ 19 more rows
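A listing like the one above can be produced by filtering on is.na(). A sketch on a toy table, assuming the same column names as all_country_data; the rows here are a tiny illustrative subset:

```r
library(dplyr)
library(tibble)

# Toy stand-in for all_country_data after the join
toy_all <- tibble(
  country = c("UK", "Albania", "UAE"),
  Year = c(2000, 2000, 2000),
  Continent = c(NA, "Europe", NA)
)

# Keep one year and only the rows where the join found no continent
missing_continent <- toy_all |>
  filter(Year == 2000, is.na(Continent))

missing_continent
```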
We are missing a lot of countries because of inconsistent spelling. So why not use ISO codes?
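A rough sketch of what joining on ISO codes could look like. The name_to_iso lookup table and its iso3 column are hypothetical helpers introduced for illustration, and the Continent values in toy_regions are placeholders; only the `ISO-alpha3 Code` column name comes from the region table above:

```r
library(dplyr)
library(tibble)

# Names as they are spelled in the GDP and life expectancy data
toy_data <- tibble(country = c("UK", "UAE", "Bolivia"))

# Hypothetical lookup from those spellings to ISO alpha-3 codes
name_to_iso <- tibble(
  country = c("UK", "UAE", "Bolivia"),
  iso3 = c("GBR", "ARE", "BOL")
)

# Toy stand-in for the region table (Continent values are placeholders)
toy_regions <- tibble(
  `Country or Area` = c(
    "United Kingdom of Great Britain and Northern Ireland",
    "United Arab Emirates",
    "Bolivia (Plurinational State of)"
  ),
  `ISO-alpha3 Code` = c("GBR", "ARE", "BOL"),
  Continent = c("Europe", "Asia", "Americas")
)

# Attach the code first, then join on the code instead of the name
joined <- toy_data |>
  left_join(name_to_iso, by = "country") |>
  left_join(
    toy_regions |> select(`ISO-alpha3 Code`, Continent),
    by = c("iso3" = "ISO-alpha3 Code")
  )

joined
```

The names differ between the two sources in every row here, yet every row gets a continent because the codes agree.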