Lecture 11
Cornell University
INFO 2951 - Spring 2025
February 27, 2025
An increasing amount of data is available on the web
These data are provided in an unstructured format: you can always copy & paste, but it’s time-consuming and prone to errors
Web scraping is the process of extracting data from the source code of websites reproducibly and transforming it into a structured dataset
read_html() - Read HTML data from a URL or character string
html_element() / html_elements() - Select a specified element(s) from an HTML document
html_table() - Parse an HTML table into a data frame
html_text() - Extract text from an element
html_text2() - Extract text from an element and lightly format it to match how text looks in the browser
html_name() - Extract elements’ names
html_attr() / html_attrs() - Extract a single attribute or all attributes
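To make these functions concrete, here is a minimal sketch that runs them against a small HTML string instead of a live site. The page content, the element names, and the example.com link are all made up for illustration; only the rvest functions themselves come from the list above.

```r
library(rvest)

# Hypothetical page, passed as a character string so nothing is downloaded
page <- read_html('
<html><body>
  <h1 id="title">Course staff</h1>
  <a href="https://example.com/syllabus">Syllabus</a>
  <table>
    <tr><th>name</th><th>role</th></tr>
    <tr><td>Ada</td><td>Instructor</td></tr>
    <tr><td>Grace</td><td>TA</td></tr>
  </table>
</body></html>')

# Select the first <h1> and extract its text
title <- page |> html_element("h1") |> html_text2()

# Select every table cell and report each element's tag name
cell_names <- page |> html_elements("td") |> html_name()

# Extract a single attribute from the first link
link <- page |> html_element("a") |> html_attr("href")

# Parse the first table into a data frame (a tibble)
staff <- page |> html_element("table") |> html_table()
```

With a real website, the only change would be passing a URL to read_html() in place of the string; the selection and extraction steps stay the same.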
ae-09
Instructions
Find your ae-09 repo (the repo name will be suffixed with your GitHub name).
Run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
When working in a Quarto document, your analysis is re-run each time you render
If web scraping in a Quarto document, you’d be re-scraping the data each time you render, which is undesirable (and not nice)!
An alternative workflow:
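One way to avoid re-scraping on every render is to scrape once in a separate R script, save the result to disk, and have the Quarto document read the saved file. The sketch below is an illustration under assumptions, not necessarily the exact workflow intended here: the HTML string stands in for a real page, and the file path is hypothetical (a temporary directory is used so the example runs anywhere; in a project this might be a data folder).

```r
library(rvest)
library(readr)

# --- In a standalone scraping script, run once (or whenever data must be refreshed) ---

# Hypothetical stand-in for a page that would normally be fetched from a URL
page <- read_html(
  "<table><tr><th>name</th><th>score</th></tr><tr><td>a</td><td>1</td></tr></table>"
)
scraped <- page |> html_element("table") |> html_table()

# Save the interim data; the path is an assumption for this example
out_path <- file.path(tempdir(), "scraped.csv")
write_csv(scraped, out_path)

# --- In the Quarto document, rendered as often as needed ---

# Read the saved file instead of contacting the website again
analysis_data <- read_csv(out_path, show_col_types = FALSE)
```

Rendering the document then touches only the local file, so the website is contacted just once.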
Source: Brian Resnick, Researchers just released profile data on 70,000 OkCupid users without permission, Vox.