Exploratory data analysis

Project
Modified: January 22, 2025

Settle on a single idea and state your research question(s) clearly. You will carry out most of your data collection and cleaning, compute some relevant summary statistics, and show some plots of your data as applicable to your research question(s).

Write the data collection and exploratory data analysis in the eda.qmd file in your project repo. It should include the following sections:

Footnote 1: If you have written code to collect your data (e.g., using an API or web scraping), store it in a separate .qmd file or .R script in the repo.

Warning

Thorough EDA requires substantial review and analysis of your data. You should not expect to complete this phase in a single day. You should expect to iterate through 20-30 charts, sets of summary statistics, etc., to get a good understanding of your data.

Visualizations are not expected to look perfect at this point since they are mainly intended for you and your team members. Standard expectations for visualizations (e.g. clearly labeled charts and axes, optimized color palettes) are not necessary at the EDA stage.

This document provides an example of what EDA might look like for your project.

Evaluation criteria

| Category | Less developed projects | Typical projects | More developed projects |
|---|---|---|---|
| Research question(s) | Question is not clearly stated or significantly limits potential analysis. | Clearly states the research question(s), which have moderate potential for interesting analyses. | Clearly states complex research question(s) that lead to significant potential for interesting analyses. |
| Data cleaning | Data is minimally cleaned, with little documentation and description of the steps undertaken. | Completes all necessary data cleaning for subsequent analyses. Describes cleaning steps with some detail. | Completes all necessary data cleaning for subsequent analyses. Describes all cleaning steps in full detail, so that the reader has an excellent grasp of how the raw data was transformed into the analysis-ready dataset. |
| Data description | Simple description of some aspects of the dataset, with little consideration for sources. The description is missing answers to applicable questions detailed in the “Datasheets for Datasets” paper. | Answers all relevant questions in the “Datasheets for Datasets” paper. | All expectations of typical projects, plus credits and values data sources. |
| Data limitations | The limitations are not explained in depth. There is no mention of how these limitations may affect the meaning of results. | Identifies potential harms and data gaps, and describes how these could affect the meaning of results. | Creatively identifies potential harms and data gaps, and describes how these could affect the meaning of results and the impact of results on people. It is evident that significant thought has been put into the limitations of the collected data. |
| Exploratory data analysis | Motivation for choice of analysis methods is unclear. Does not justify decisions to either confirm or update research questions and data description. | Sufficient plots (20-30) and summary statistics to identify typical values in single variables and connections between pairs of variables. Uses exploratory analysis to confirm or update research questions and data description. | All expectations of typical projects, plus analysis methods that are carefully chosen to identify important characteristics of the data. |
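
The expectation for typical projects (summary statistics that identify typical values in single variables and connections between pairs of variables) can be illustrated with a minimal sketch. It is written in Python using only the standard library, although your own analysis may well be in R inside eda.qmd. The variables `hours` and `score`, their values, and the helper names `summarize` and `pearson` are hypothetical and are not part of the assignment.

```python
import statistics

# Hypothetical toy data: two numeric variables recorded on the same rows.
# None marks a missing value, which EDA should count rather than hide.
hours = [2.0, 3.5, 1.0, 4.0, None, 5.5]
score = [65.0, 72.0, 58.0, 80.0, 70.0, 91.0]


def summarize(values):
    """Summarize one variable: counts, center, and spread."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "sd": statistics.stdev(present),  # sample standard deviation
    }


def pearson(xs, ys):
    """Pearson correlation using only rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mean_x = statistics.mean([x for x, _ in pairs])
    mean_y = statistics.mean([y for _, y in pairs])
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / (sxx * syy) ** 0.5


# Single-variable summaries (typical values) ...
hours_summary = summarize(hours)
score_summary = summarize(score)
# ... and one pairwise summary (connection between two variables).
r = pearson(hours, score)

print("hours:", hours_summary)
print("score:", score_summary)
print(f"correlation(hours, score) = {r:.3f}")
```

Reporting the missing count next to each summary keeps data gaps visible, which also feeds into the data limitations section. A correlation like this is only a starting point; a scatterplot of the same pair would usually accompany it during EDA.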