AE 08: Scraping articles from the Cornell Review
Application exercise
We will use the following packages in this application exercise.
- {tidyverse}: For data import, wrangling, and visualization.
- {rvest}: For scraping HTML files.
- {robotstxt}: For verifying if we can scrape a website.
Data scraping
Complete this part in the scrape-cornell-review.R script. Save the resulting data frame in the data folder.
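Before filling in the blanks, it may help to see how the {rvest} extraction functions behave on a small page. The sketch below uses a hypothetical HTML snippet built with minimal_html(); the class names (.title, .author) and example.com URLs are assumptions for illustration and are not the Cornell Review's actual markup.

```r
library(rvest)

# Hypothetical HTML standing in for a listing page; not the real site markup
page <- minimal_html('
  <article>
    <h2 class="title"><a href="https://example.com/post-1">First post</a></h2>
    <span class="author">Ann Author</span>
  </article>
  <article>
    <h2 class="title"><a href="https://example.com/post-2">Second post</a></h2>
    <span class="author">Bo Writer</span>
  </article>
')

# Text of every element matching the CSS selector
titles <- html_elements(x = page, css = ".title a") |>
  html_text2()

authors <- html_elements(x = page, css = ".author") |>
  html_text2()

# Value of an attribute (here href) rather than the visible text
post_urls <- html_elements(x = page, css = ".title a") |>
  html_attr(name = "href")
```

Each call returns a character vector with one entry per matched element, so selectors that match the same set of articles produce vectors of equal length.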
# load packages
library(tidyverse)
library(rvest)
library(robotstxt)
# check that we can scrape data from the cornell review
# read the first page
page <- read_html("")
# page <- read_html("data/cornell-review-raw.html") # use this if we break the website
# extract desired components
titles <- html_elements(x = page, css = "______") |>
  html_text2()
authors <- html_elements(x = page, css = "______") |>
  html_text2()
article_dates <- html_elements(x = page, css = "______") |>
  html_text2()
topics <- html_elements(x = page, css = "______") |>
  html_text2()
abstracts <- html_elements(x = page, css = "______") |>
  html_text2()
post_urls <- html_elements(x = page, css = "______") |>
  html_______(______)
# create a tibble with this data
## add code here
# clean up the data
## add code here
# save to disk
write_csv(x = review, file = "data/cornell-review.csv")
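The tibble and clean-up steps could look roughly like the sketch below. The input vectors, the "By " prefix, and the date format are hypothetical stand-ins, since the real values depend on what the scraping step returns; the date parsing with lubridate::mdy() is one possible choice, not necessarily the intended solution.

```r
library(tidyverse)

# Hypothetical scraped vectors; real values come from the extraction step
titles <- c("First post", "Second post")
authors <- c("By Ann Author", "By Bo Writer")
article_dates <- c("October 3, 2023", "September 28, 2023")

# Combine the equal-length vectors into one row per article
review <- tibble(
  title = titles,
  author = authors,
  date = article_dates
)

# Example cleaning: drop a leading "By " and parse the date strings
review <- review |>
  mutate(
    author = str_remove(author, "^By "),
    date = lubridate::mdy(date)
  )

# Write to a temporary file here; the exercise saves to data/cornell-review.csv
out_file <- tempfile(fileext = ".csv")
write_csv(x = review, file = out_file)
```

Building the tibble only works when every vector has the same length, so a mismatch at this step usually points to a selector that matched too many or too few elements.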