Proposal

Project
Modified

February 9, 2026

There are two main purposes of the project proposal:

Identify three research questions you're interested in investigating for the final project. Each question should be answerable with data that you can feasibly collect or access by the end of the semester. If you're unsure where to find data, use the list of potential data sources in the Tips + Resources section as a starting point. It may also help to first think of topics that interest you and then look for datasets on those topics.

Write the proposal in the proposal.qmd file in your project repo.

Important

You must use one of the research questions from your proposal for the final project, unless instructed otherwise in the feedback you receive.

Criteria for research questions

Your research questions should be original, interesting, and answerable with data that you can feasibly collect or access by the end of the semester. As a proof of concept, you should include some (if not all) of the data you would need to answer the question in the data folder of your project repo, and provide a glimpse of the data using the skimr::skim() function.

As a general rule of thumb, the datasets should meet the following criteria:

  • At least 500 observations
  • At least 8 columns
  • At least 6 of the columns must be useful and unique explanatory variables.
    • Identifier variables such as “name”, “social security number”, etc. are not useful explanatory variables.
    • If you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique explanatory variables.
  • You may not use data that has previously been used in any course materials, or any derivation of data that has been used in course materials.
  • You may not use data from a secondary data archive. In plainest terms, do not use datasets you find from Kaggle or the UCI Machine Learning Repository. Your data should come from your own collection process (e.g. API or web scraping) or the primary source (e.g. government agency, research group, etc.).
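
The two size guidelines above can be checked with a few lines of base R. The sketch below is only an illustration: the helper name `check_size_guidelines()` and the stand-in data frame are hypothetical and not part of the course materials. It does not judge whether columns are useful or unique explanatory variables, which still takes a careful read of the data.

```r
# Hypothetical helper (an assumption, not a course-provided function):
# reports whether a data frame meets the rule-of-thumb size guidelines.
check_size_guidelines <- function(df, min_rows = 500, min_cols = 8) {
  c(
    enough_observations = nrow(df) >= min_rows,
    enough_columns = ncol(df) >= min_cols
  )
}

# Stand-in data frame with 600 rows and 9 columns, used only to show the call;
# in practice, pass the data frame you loaded from your data folder.
example_df <- data.frame(matrix(0, nrow = 600, ncol = 9))
check_size_guidelines(example_df)
```

A `FALSE` here does not automatically rule out a dataset; treat the result as a prompt to check with the course staff.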

Please ask a member of the course staff if you’re unsure whether your dataset meets the criteria.

If you have your heart set on a dataset that has fewer observations or variables than suggested here, that might still be okay; use these numbers as guidance for a successful proposal, not as minimum requirements.

Resources for datasets

You can find data wherever you like, but here are some recommendations to get you started. You shouldn't feel constrained to datasets that are already in a tidy format: you can start with data that needs cleaning and tidying, scrape data off the web, or collect your own data.

Tip: Using generative AI to find datasets

You can use generative AI tools to help you find datasets, but do not use generative AI to generate datasets for you. If it cannot provide an external link to the original data source, do not use it.

Proposal components

Research question

When formulating a well-written research question, be sure to consider the following:

  • Why is this question important? What is the motivation for answering this question?
  • What is the general research topic? What kinds of hypotheses could you derive?
  • What types of variables are involved? Categorical? Quantitative?
  • What is your target population?
  • Is the question original?
  • Can the question be answered?

Data

For each dataset related to this research question:

  • Identify the source of the data.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

  • Write a brief description of the observations.

  • Address ethical concerns about the data, if any.

Glimpse of data

For each dataset:

  • Place the file containing your data in the data folder of the project repo.
  • Use the skimr::skim() function to provide a glimpse of the dataset.
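
As a rough sketch of what this could look like in a code chunk of proposal.qmd, assuming the data is a CSV file and that the {readr} and {skimr} packages are installed: the file name `data/my-dataset.csv` is a hypothetical placeholder, so replace it with the file you actually placed in the data folder.

```r
library(readr)
library(skimr)

# Hypothetical path (an assumption): point this at your own file in data/
data_path <- "data/my-dataset.csv"

if (file.exists(data_path)) {
  # Read the CSV and summarize every column by type
  my_data <- read_csv(data_path, show_col_types = FALSE)
  skim(my_data)
} else {
  # Nothing to skim yet; remind yourself to fix the path
  message("No file found at ", data_path, "; update data_path to match your repo.")
}
```

If your data is stored in another format, swap `read_csv()` for a reader suited to that format; the call to `skim()` stays the same.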

Tip: Did you install a new package for the proposal?

Make sure to update your {renv} lockfile.
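
One common way to do this, assuming your project repo was already set up with {renv}, is to run the following in the R console from the project root. This is a minimal sketch rather than a course-mandated workflow.

```r
# Report whether renv.lock is out of sync with the packages the project uses
renv::status()

# Record the currently used package versions in renv.lock
renv::snapshot()
```

After the snapshot, commit the updated renv.lock file along with your proposal.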

Evaluation criteria

Category: Ideas

Less developed projects:

  • Fewer than three research questions are included.
  • Ideas are vague and/or data is impossible or excessively difficult to collect.

Typical projects:

  • Three distinct research questions are included and all or most datasets could feasibly be collected or accessed by the end of the semester.
  • Each dataset is described alongside a note about availability with a source cited.

More developed projects:

  • Three distinct research questions are included and all or most datasets could feasibly be collected or accessed by the end of the semester.
  • Each dataset is described alongside a note about availability with (possibly multiple) sources cited.
  • Each research question could reasonably be part of a data science project.