AE 01: Visualizing world development indicators using the grammar of graphics and {ggplot2}

Application exercise

Modified

January 28, 2025

Important

Go to the course GitHub organization and locate the repo titled ae-01-YOUR_GITHUB_USERNAME to get started.

This AE is due January 28 at 11:59pm.

The World Bank maintains an extensive database of global development data. In this application exercise we will visualize a handful of development indicators.

Getting started

Packages

We need to use two packages for these exercises:

{tidyverse} for the data visualization
{scales} for formatting plot labels

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::%||%()   masks base::%||%()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(scales)


Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

Data

The data are stored as a CSV (comma separated values) file in the data folder of your repository. Let’s read it from there and save it as an object called wdi.

wdi <- read_csv("data/wdi.csv")

Get to know the data

We can use the glimpse() function to get an overview (or “glimpse”) of the data.

# add code here
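As a minimal sketch of how these overview functions behave, the example below uses a made-up table rather than the course data. The object name toy and all of its values are hypothetical.

```r
library(dplyr)  # part of the tidyverse; provides tibble() and glimpse()

# Hypothetical toy data, not the course data set
toy <- tibble(
  country = c("A", "B", "C"),
  year = c(2020, 2020, 2020),
  life_exp = c(71.2, NA, 80.5)
)

glimpse(toy)  # one line per column: name, type, and the first few values
nrow(toy)     # number of observations (rows)
ncol(toy)     # number of variables (columns)
```

The same three calls can be applied to any data frame, including wdi.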

What does each observation (row) in the data set represent?

Each observation represents a ___.

How many observations (rows) are in the data set?

There are 2821 observations in the dataset.

How many variables (columns) are in the data set?

There are ___ columns in the dataset.

Variables of interest

The data contain the following variables:

country - name of the country
iso2c and iso3c - standardized two- and three-letter codes, respectively, for each country
region - geographic region of the country (the World Bank classifies countries into seven regions)
year - the year for which the measures are reported
life_exp - life expectancy at birth, in years
pop - total population
gdp - GDP per capita (inflation-adjusted 2015 U.S. dollars)

Visualizing data with {ggplot2}

{ggplot2} is the package and ggplot() is the core function used to create a plot.

ggplot() creates the initial base coordinate system, and we will add layers to that base. We first specify the data set we will use with data = wdi.

ggplot(data = wdi)

The mapping argument defines which variables in the data frame are mapped to which aesthetics, the visual channels used to communicate information in the graph.

ggplot(
  data = wdi,
  mapping = aes(x = gdp, y = life_exp)
)
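One way to see what the mapping argument does is to store the plot as an object and inspect it. This is only a sketch on a hypothetical data frame called toy; the values are invented for illustration.

```r
library(ggplot2)

# Hypothetical toy data, not the course data set
toy <- data.frame(
  gdp = c(1000, 5000, 20000),
  life_exp = c(60, 70, 80)
)

p <- ggplot(
  data = toy,
  mapping = aes(x = gdp, y = life_exp)
)

p$mapping         # the x aesthetic points to gdp, the y aesthetic to life_exp
length(p$layers)  # no layers yet, so the axes are drawn but no data
```

The plot has axes because x and y are mapped, but it stays empty until a layer is added.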

The geom_*() function specifies the type of plot we want to use to represent the data. In the code below, we use geom_point() which creates a plot where each observation is represented by a point.

ggplot(
  data = wdi,
  mapping = aes(x = gdp, y = life_exp)
) +
  geom_point()

Warning: Removed 639 rows containing missing values or values outside the scale range
(`geom_point()`).

Note that this results in a warning as well. What does the warning mean?

Add response here.

GDP vs. life expectancy

Step 1 - Your turn

Modify the following plot to change the color of all points to a different color.

Specifying colors in R

See http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf for many color options you can use by name in R or use the hex code for a color of your choice.

ggplot(
  data = wdi,
  mapping = aes(x = gdp, y = life_exp)
) +
  geom_point(color = "orange")

Warning: Removed 639 rows containing missing values or values outside the scale range
(`geom_point()`).
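The two ways of specifying a color mentioned above, by name or by hex code, can be checked in base R. A small sketch, using "orange" only because it appears in the plot above:

```r
# Base R only; no packages needed

is_named <- "orange" %in% colors()  # TRUE: "orange" is a built-in color name
col2rgb("orange")                   # red, green, and blue components of the color

# Build the hex code for the same color from its components
hex <- rgb(255, 165, 0, maxColorValue = 255)
hex                                 # "#FFA500"
```

Either "orange" or "#FFA500" can be passed to the color argument of geom_point().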

Step 2 - Your turn

Add labels for the title and \(x\) and \(y\) axes.

ggplot(
  data = wdi,
  mapping = aes(x = gdp, y = life_exp)
) +
  geom_point(color = "orange") +
  labs(
    x = "___",
    y = "___",
    title = "___"
  )

Warning: Removed 639 rows containing missing values or values outside the scale range
(`geom_point()`).

Step 3 - Your turn

An aesthetic is a visual property of one of the objects in your plot. Commonly used aesthetic options are:

color
fill
shape
size
alpha (transparency)
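Any of these aesthetics can either be mapped to a variable (inside aes()) or set to a single constant value (outside aes()). The sketch below contrasts the two on a hypothetical data frame; the names toy, group, p_mapped, and p_set are made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data, not the course data set
toy <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(2, 4, 3, 5),
  group = c("a", "a", "b", "b")
)

# Mapped: color varies with a variable, so it goes inside aes() and gets a legend
p_mapped <- ggplot(toy, aes(x = x, y = y, color = group)) +
  geom_point()

# Set: one constant color for every point, so it goes outside aes()
p_set <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(color = "orange")
```
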

Modify the plot below, so the color of the points is based on the variable region.

ggplot(
  data = wdi,
  mapping = aes(x = gdp, y = life_exp, color = region)
) +
  geom_point() +
  labs(
    x = "GDP per capita (2015 US dollars)",
    y = "Life expectancy",
    title = "GDP vs. life expectancy, by ______"
  )

Warning: Removed 639 rows containing missing values or values outside the scale range
(`geom_point()`).

Step 4 - Your turn

Expand on your plot from the previous step to make the size of your points based on pop.

# add code here

Step 5 - Your turn

Expand on your plot from the previous step to make the transparency (alpha) of the points 0.5.

# add code here

Step 6 - Your turn

Expand on your plot from the previous step by using facet_wrap() to display the association between GDP and life expectancy for each region.

# add code here

Step 7 - Demo

Improve your plot from the previous step by making the \(x\) axis labels and the size legend more legible.

Tip

Make use of the {scales} package, specifically the scale_x_continuous() and scale_size_area() functions.
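As a rough sketch of how the {scales} label helpers work, the example below formats a few made-up numbers and then passes the same helpers to the labels argument of the two scale functions. The data frame toy and its columns are hypothetical.

```r
library(ggplot2)
library(scales)

# label_*() functions return a function that turns numbers into strings
dollars <- label_dollar()(c(1000, 25000))   # "$1,000" "$25,000"
counts <- label_comma()(c(1000, 1234567))   # "1,000" "1,234,567"

# Hypothetical toy data, not the course data set
toy <- data.frame(
  income = c(1000, 25000, 60000),
  score = c(1, 2, 3),
  people = c(1e5, 2e6, 5e7)
)

# The same helpers can be supplied to the labels argument of a scale
p <- ggplot(toy, aes(x = income, y = score, size = people)) +
  geom_point() +
  scale_x_continuous(labels = label_dollar()) +
  scale_size_area(labels = label_comma())
```
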

# add code here

Step 8 - Demo

What other improvements could we make to this plot?

Recommended improvements

Remove the legend for region since it duplicates the facet labels
Move the legend to the top of the plot
Add a clear title for the legend
Transform the \(x\) axis to use log-10 scaling (to account for skewness in the GDP variable)
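To see why log-10 scaling helps with skewed values, consider a small numeric sketch. The values in gdp_toy are invented; each is ten times the previous one.

```r
# Base R only; hypothetical values, not taken from the course data
gdp_toy <- c(500, 5000, 50000)

diff(gdp_toy)         # gaps on the original scale grow tenfold each time
logged <- log10(gdp_toy)
diff(logged)          # gaps on the log-10 scale are all equal to 1
```

On a log-10 axis, equal distances correspond to equal multiplicative changes, which spreads out the many small values that a linear axis squeezes together. In {ggplot2}, scale_x_log10() applies this transformation to the \(x\) axis.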

# add code here

Life expectancy by region

Step 1 - Your turn

Create side-by-side vertical box plots of life_exp by region.

# add code here

Step 2 - Your turn

Many of the region labels on the \(x\) axis are overlapping and difficult to read. Recreate the plot orienting the box plots horizontally.

# add code here

What does this graph tell you about life expectancy?

Add response here.

Acknowledgments

This assignment is inspired by Bechdel + data visualization