Exploratory data analysis
Settle on a single idea and state your research question(s) clearly. You will carry out most of your data collection and cleaning, compute some relevant summary statistics, and show some plots of your data as applicable to your research question(s).
Write the data collection and exploratory data analysis in the eda.qmd file in your project repo. It should include the following sections:
- Research question(s). State your research question(s) clearly.
- Data collection and cleaning. Have an initial draft of your data cleaning appendix. Document every step that takes your raw data file(s) and turns them into the analysis-ready data set that you would submit with your final project. Include a text narrative describing your data collection (downloading, scraping, surveys, etc.) and any additional data curation/cleaning (merging data frames, filtering, transformations of variables, etc.). Include code for data curation/cleaning, but not collection.[1]
- Data description. Have an initial draft of your data description section. Your data description should be about your analysis-ready data.
- Data limitations. Identify any potential problems with your dataset.
- Exploratory data analysis. Perform an (initial) exploratory data analysis. This should include summary statistics, visualizations, and any other relevant information about the data. This is where you start to understand your data and identify any initial trends or relationships. You should also identify any potential problems with your dataset.
[1] If you have written code to collect your data (e.g. using an API or web scraping), store this in a separate .qmd file or .R script in the repo.
Thorough EDA requires substantial review and analysis of your data. You should not expect to complete this phase in a single day. You should expect to iterate through 20-30 charts, sets of summary statistics, etc., to get a good understanding of your data.
Visualizations are not expected to look perfect at this point since they are mainly intended for you and your team members. Standard expectations for visualizations (e.g. clearly labeled charts and axes, optimized color palettes) are not necessary at the EDA stage.
This document provides an example of what EDA might look like for your project.
- Questions for reviewers. List specific questions for your peer reviewers and project mentor to answer in giving you feedback on this phase.
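As a rough illustration of the "document every step" expectation for data cleaning and the summary statistics expected in the EDA, here is a minimal sketch in Python using only the standard library. Everything in it is hypothetical: the inline data, the column names (id, age, income), and the assumed plausible age range of 0-120 are placeholders, and your own eda.qmd may well use a different language and different steps. The point is only that each cleaning step is named, applied in order, and logged with the number of rows that remain.

```python
import csv
import io
import statistics

# Hypothetical raw data; a real project would read its raw file(s) instead.
RAW_CSV = """id,age,income
1,34,52000
2,,61000
3,29,
4,41,48000
4,41,48000
5,230,75000
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Apply each cleaning step in order and log how many rows remain."""
    log = [("raw", len(rows))]

    # Step 1: drop exact duplicate rows, keeping the first occurrence.
    seen, deduped = set(), []
    for row in rows:
        key = tuple(row.items())
        if key not in seen:
            seen.add(key)
            deduped.append(row)
    log.append(("drop duplicates", len(deduped)))

    # Step 2: drop rows with a missing age or income.
    complete = [r for r in deduped if r["age"] and r["income"]]
    log.append(("drop missing values", len(complete)))

    # Step 3: convert types, then filter implausible ages (assumed range 0-120).
    typed = [
        {"id": int(r["id"]), "age": int(r["age"]), "income": float(r["income"])}
        for r in complete
    ]
    valid = [r for r in typed if 0 <= r["age"] <= 120]
    log.append(("filter age to 0-120", len(valid)))
    return valid, log


def summarize(rows, column):
    """Return basic summary statistics for one numeric column."""
    values = [r[column] for r in rows]
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


analysis_ready, cleaning_log = clean(load_rows(RAW_CSV))
for step, n in cleaning_log:
    print(f"{step}: {n} rows")
print(summarize(analysis_ready, "age"))
```

Keeping a row count after every step makes it easy to describe in the narrative how much data each cleaning decision removed, which also feeds directly into the data limitations section.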
Evaluation criteria
Category | Less developed projects | Typical projects | More developed projects |
Research question(s) | Question is not clearly stated or significantly limits potential analysis. | Clearly states the research question(s), which have moderate potential for interesting analyses. | Clearly states complex research question(s) that lead to significant potential for interesting analyses. |
Data cleaning | Data is minimally cleaned, with little documentation and description of the steps undertaken. | Completes all necessary data cleaning for subsequent analyses. Describes cleaning steps with some detail. | Completes all necessary data cleaning for subsequent analyses. Describes all cleaning steps in full detail, so that the reader has an excellent grasp of how the raw data was transformed into the analysis-ready dataset. |
Data description | Simple description of some aspects of the dataset, little consideration for sources. The description is missing answers to applicable questions detailed in the “Datasheets for Datasets” paper. | Answers all relevant questions in the “Datasheets for Datasets” paper. | All expectations of typical projects + credits and values data sources. |
Data limitations | The limitations are not explained in depth. There is no mention of how these limitations may affect the meaning of results. | Identifies potential harms and data gaps, and describes how these could affect the meaning of results. | Creatively identifies potential harms and data gaps, and describes how these could affect the meaning of results, and the impact of results on people. It is evident that significant thought has been put into the limitations of the collected data. |
Exploratory data analysis | Motivation for choice of analysis methods is unclear. Does not justify decisions to either confirm / update research questions and data description. | Sufficient plots (20-30) and summary statistics to identify typical values in single variables and connections between pairs of variables. Uses exploratory analysis to confirm/update research questions and data description. | All expectations of typical projects + analysis methods are carefully chosen to identify important characteristics of data. |