AE 01: Visualizing world development indicators using the grammar of graphics and {ggplot2}

Application exercise
Modified

January 28, 2025

Important

Go to the course GitHub organization and locate the repo titled ae-01-YOUR_GITHUB_USERNAME to get started.

This AE is due January 28 at 11:59pm.

The World Bank maintains an extensive database of global development data. In this application exercise we will visualize a handful of development indicators.

Getting started

Packages

We need to use two packages for these exercises:

  • {tidyverse} for the data visualization
  • {scales} for formatting plot labels
library(tidyverse)
library(scales)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::%||%()   masks base::%||%()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

Data

The data are stored as a CSV (comma separated values) file in the data folder of your repository. Let’s read it from there and save it as an object called wdi.

wdi <- read_csv("data/wdi.csv")

Get to know the data

We can use the glimpse() function to get an overview (or “glimpse”) of the data.
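
As a minimal sketch of what glimpse() reports, the code below builds a small made-up tibble (the object toy and its values are hypothetical, not taken from wdi) and inspects it. nrow() and ncol() return the same row and column counts that glimpse() prints at the top of its output.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, used only to illustrate the functions
toy <- tibble(
  country = c("Country A", "Country B", "Country C"),
  year = c(2020, 2020, 2020),
  life_exp = c(70.1, 81.3, NA)
)

# One line per column: name, type, and the first few values
glimpse(toy)

# Row and column counts
nrow(toy)
ncol(toy)
```

The same calls should work on wdi once it has been read in.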

# add code here
  • What does each observation (row) in the data set represent?

Each observation represents a ___.

  • How many observations (rows) are in the data set?

There are 2821 observations in the data set.

  • How many variables (columns) are in the data set?

There are ___ columns in the data set.

Variables of interest

The data contains the following variables:

  • country - name of the country
  • iso2c and iso3c - standardized two- and three-letter codes, respectively, for each country
  • region - the geographic region of the country (the World Bank classifies countries into seven regions)
  • year - the year for which the measures are reported
  • life_exp - life expectancy at birth for the total population, in years
  • pop - total population
  • gdp - GDP per capita (inflation-adjusted 2015 U.S. dollars)

Visualizing data with {ggplot2}

{ggplot2} is the package and ggplot() is the core function used to create a plot.

  • ggplot() creates the initial base coordinate system, and we will add layers to that base. We first specify the data set we will use with data = wdi.
ggplot(data = wdi)

  • The mapping argument defines which variables in the data frame are mapped to which aesthetics, the visual channels used to communicate information in the graph.
ggplot(
  data = wdi,
  mapping = aes(x = gdp, y = life_exp)
)

  • The geom_*() function specifies the type of plot we want to use to represent the data. In the code below, we use geom_point() which creates a plot where each observation is represented by a point.
ggplot(
  data = wdi,
  mapping = aes(x = gdp, y = life_exp)
) +
  geom_point()
Warning: Removed 639 rows containing missing values or values outside the scale range
(`geom_point()`).

Note that this results in a warning as well. What does the warning mean?

Add response here.

GDP vs. life expectancy

Step 1 - Your turn

Modify the following plot so that all points are drawn in a different color.

Specifying colors in R

See http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf for many colors you can use by name in R, or use the hex code for a color of your choice.
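
A short sketch of both ways to specify a color, using a made-up data frame (toy and its values are hypothetical). Base R's colors() returns the built-in color names, so it can be used to check whether a name is valid; a hex code is passed as a string in the same place a name would go.

```r
library(ggplot2)

# Built-in color names known to R
head(colors())
"orange" %in% colors()

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  gdp = c(1000, 5000, 20000),
  life_exp = c(60, 70, 80)
)

# A named color
p_named <- ggplot(toy, aes(x = gdp, y = life_exp)) +
  geom_point(color = "orange")

# A hex code, arbitrarily chosen here
p_hex <- ggplot(toy, aes(x = gdp, y = life_exp)) +
  geom_point(color = "#1B9E77")
```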

ggplot(
  data = wdi,
  mapping = aes(x = gdp, y = life_exp)
) +
  geom_point(color = "orange")
Warning: Removed 639 rows containing missing values or values outside the scale range
(`geom_point()`).

Step 2 - Your turn

Add labels for the title and \(x\) and \(y\) axes.

ggplot(
  data = wdi,
  mapping = aes(x = gdp, y = life_exp)
) +
  geom_point(color = "orange") +
  labs(
    x = "___",
    y = "___",
    title = "___"
  )
Warning: Removed 639 rows containing missing values or values outside the scale range
(`geom_point()`).

Step 3 - Your turn

An aesthetic is a visual property of one of the objects in your plot. Commonly used aesthetic options are:

  • color
  • fill
  • shape
  • size
  • alpha (transparency)
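
An aesthetic can either be mapped to a variable inside aes() or set to a fixed value outside of it. The sketch below contrasts the two using shape and a made-up data frame (toy and its values are hypothetical): the first plot draws a different shape for each region and adds a legend, while the second draws every point with the same shape.

```r
library(ggplot2)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  gdp = c(1200, 8000, 30000, 45000),
  life_exp = c(58, 69, 78, 82),
  region = c("Region A", "Region A", "Region B", "Region B")
)

# Mapped: shape varies with the values of region
p_mapped <- ggplot(toy, aes(x = gdp, y = life_exp, shape = region)) +
  geom_point()

# Set: every point uses shape 17 (a filled triangle)
p_set <- ggplot(toy, aes(x = gdp, y = life_exp)) +
  geom_point(shape = 17)
```

The same distinction applies to color, fill, size, and alpha.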

Modify the plot below, so the color of the points is based on the variable region.

ggplot(
  data = wdi,
  mapping = aes(x = gdp, y = life_exp, color = region)
) +
  geom_point() +
  labs(
    x = "GDP per capita (in 2015 U.S. dollars)",
    y = "Life expectancy",
    title = "GDP vs. life expectancy, by ______"
  )
Warning: Removed 639 rows containing missing values or values outside the scale range
(`geom_point()`).

Step 4 - Your turn

Expand on your plot from the previous step to make the size of your points based on pop.

# add code here

Step 5 - Your turn

Expand on your plot from the previous step to make the transparency (alpha) of the points 0.5.

# add code here

Step 6 - Your turn

Expand on your plot from the previous step by using facet_wrap() to display the association between GDP and life expectancy for each region.

# add code here

Step 7 - Demo

Improve your plot from the previous step by making the \(x\)-axis and size guides more legible.

Tip

Make use of the {scales} package, specifically the scale_x_continuous() and scale_size_area() functions.
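
As a rough illustration of how {scales} formats labels, the label_*() functions each return a function that turns numbers into formatted strings. The choice of label_dollar() and label_comma() here is an assumption about what suits these variables, and the input values are made up.

```r
library(scales)

# label_dollar() returns a formatting function; call it on numbers
dollar_labels <- label_dollar()(c(1000, 25000))
dollar_labels
# "$1,000"  "$25,000"

comma_labels <- label_comma()(1500000)
comma_labels
# "1,500,000"
```

A formatter like these can be passed to the labels argument of a scale function, for example scale_x_continuous(labels = label_dollar()).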

# add code here

Step 8 - Demo

What other improvements could we make to this plot?

Recommended improvements
  • Remove the legend for region since it duplicates the facet labels
  • Move the legend to the top of the plot
  • Add a clear title for the legend
  • Transform the \(x\) axis to use log-10 scaling (to account for skewness in the GDP variable)
# add code here
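
One possible way to apply the recommended improvements, sketched on a made-up data frame (toy, its values, and the legend title "Population" are all hypothetical). This is only one of several reasonable approaches, not a definitive solution.

```r
library(ggplot2)
library(scales)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  gdp = c(800, 4000, 15000, 52000),
  life_exp = c(57, 68, 76, 83),
  pop = c(2e6, 5e7, 1.2e8, 9e6),
  region = c("Region A", "Region A", "Region B", "Region B")
)

p <- ggplot(toy, aes(x = gdp, y = life_exp, color = region, size = pop)) +
  geom_point(alpha = 0.5) +
  facet_wrap(vars(region)) +
  # Log-10 x axis with dollar-formatted labels
  scale_x_log10(labels = label_dollar()) +
  scale_size_area(labels = label_comma()) +
  # Drop the color legend, which repeats the facet labels
  guides(color = "none") +
  # Give the remaining (size) legend a clear title
  labs(size = "Population") +
  # Move the legend above the plot
  theme(legend.position = "top")
```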

Life expectancy by region

Step 1 - Your turn

Create side-by-side vertical box plots of life_exp by region.

# add code here

Step 2 - Your turn

Many of the region labels on the \(x\) axis are overlapping and difficult to read. Recreate the plot orienting the box plots horizontally.
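
To illustrate box plot orientation without using wdi, the sketch below uses the mpg data set that ships with {ggplot2}. geom_boxplot() infers the orientation from which axis carries the categorical variable, so swapping the x and y mappings turns vertical box plots into horizontal ones.

```r
library(ggplot2)

# Vertical: categorical variable on the x axis
p_vertical <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# Horizontal: categorical variable on the y axis
p_horizontal <- ggplot(mpg, aes(x = hwy, y = class)) +
  geom_boxplot()
```

With the categories on the \(y\) axis, the labels are written horizontally and no longer overlap.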

# add code here

What does this graph tell you about life expectancy?

Add response here.

Acknowledgments