Proposal
There are two main purposes of the project proposal:
- To help you think about the project early, so you can get a head start on finding data, reading relevant literature, thinking about the questions you wish to answer, etc.
- To ensure that the data you wish to analyze, methods you plan to use, and the scope of your analysis are feasible and will allow you to be successful for this project.
Identify 3 primary or originally-collected datasets you’re interested in potentially using for the final project. If you’re unsure where to find data, you can use the list of potential data sources in the Tips + Resources section as a starting point. It may also help to think of topics you’re interested in investigating and find datasets on those topics.
Write the proposal in the `proposal.qmd` file in your project repo.
You must use one of the datasets in the proposal for the final project, unless instructed otherwise when given feedback.
Criteria for datasets
The datasets should meet the following criteria:
- At least 500 observations
- At least 8 columns
- At least 6 of the columns must be useful and unique explanatory variables.
  - Identifier variables such as “name”, “social security number”, etc. are not useful explanatory variables.
  - If you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique explanatory variables.
- You may not use data that has previously been used in any course materials, or any derivation of data that has been used in course materials.
- You may not use data from a secondary data archive. In plain terms, do not use datasets you find on Kaggle or in the UCI Machine Learning Repository. Your data should come from your own collection process (e.g. API or web scraping) or from the primary source (e.g. government agency, research group, etc.).
Please ask a member of the course staff if you’re unsure whether your dataset meets the criteria.
If you have your heart set on a dataset that has fewer observations or variables than what’s suggested here, that might still be okay; use these numbers as guidance for a successful proposal, not as minimum requirements.
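As a quick sanity check, the size guidelines above can be tested in a few lines of R. This is only a minimal sketch: `check_dataset_size()` and `example_df` are hypothetical names made up for illustration, and the check covers only the row and column counts. Whether at least 6 columns are useful and unique explanatory variables still needs your own judgment.

```r
# Hypothetical helper (not part of any package): compares a data frame
# against the suggested minimum numbers of observations and columns.
check_dataset_size <- function(df, min_rows = 500, min_cols = 8) {
  c(
    enough_rows = nrow(df) >= min_rows,
    enough_cols = ncol(df) >= min_cols
  )
}

# Illustrative stand-in for a real dataset: 600 rows, 9 columns of zeros.
example_df <- data.frame(matrix(0, nrow = 600, ncol = 9))

check_dataset_size(example_df)
```

A `FALSE` in either position does not rule the dataset out, since the numbers are guidance rather than minimum requirements, but it is a good prompt to ask a member of the course staff.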
Resources for datasets
You can find data wherever you like, but here are some recommendations to get you started. You shouldn’t feel constrained to datasets that are already in a tidy format; you can start with data that needs cleaning and tidying, scrape data off the web, or collect your own data.
- Awesome public datasets
- CDC
- Chicago Open Data Portal
- Data.gov
- Data is Plural
- Election Studies
- European Statistics
- FiveThirtyEight
- General Social Survey
- Goodreads
- Google Dataset Search
- Harvard Dataverse
- International Monetary Fund
- IPUMS survey data from around the world
- Los Angeles Open Data
- National Weather Service
- NHS Scotland Open Data
- NYC OpenData
- Open access to Scotland’s official statistics
- Pew Research
- Project Gutenberg
- Reddit posts and/or comments
- Sports Reference
- Statistics Canada
- The National Bureau of Economic Research
- UK Government Data
- UNICEF Data
- United Nations Data
- United Nations Statistics Division
- US Census Data
- World Bank Data
- Youth Risk Behavior Surveillance System (YRBSS)
Proposal components
Introduction and data
For each dataset:
- Identify the source of the data.
- State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- Write a brief description of the observations.
- Address ethical concerns about the data, if any.
Research question
Your research question should involve at least three variables, including a mix of categorical and quantitative variables. When writing a research question, please think about the following:
- What is your target population?
- Is the question original?
- Can the question be answered?
For each dataset, include the following:
- A well-formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, at least one per dataset is required.)
- A statement of why this question is important.
- A description of the research topic along with a concise statement of your hypotheses on this topic.
- The types of variables in your research question (categorical or quantitative).
Glimpse of data
For each dataset:
- Place the file containing your data in the `data` folder of the project repo.
- Use the `skimr::skim()` function to provide a glimpse of the dataset.
Make sure to update your {renv} lockfile.
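The glimpse step might look like the sketch below in `proposal.qmd`. It assumes your data is stored as a CSV file; the helper name `glimpse_dataset()` and the file name `data/dataset-1.csv` are hypothetical placeholders, so adapt the reading step to your own file format.

```r
library(skimr)  # provides skim()

# Hypothetical helper: read one CSV file from the data folder and
# return the skim() summary (one row per variable).
glimpse_dataset <- function(path) {
  df <- read.csv(path)
  skim(df)
}

# Example call, assuming a file named data/dataset-1.csv exists in your repo:
# glimpse_dataset("data/dataset-1.csv")

# After adding a package such as skimr, record it in the lockfile:
# renv::snapshot()
```

The last two calls are left commented out because they depend on your own files and project state; run them interactively once the data is in place.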
Evaluation criteria
Category | Less developed projects | Typical projects | More developed projects |
---|---|---|---|
Dataset ideas | Fewer than three dataset ideas are included. Dataset ideas are vague and impossible or excessively difficult to collect. | Three dataset ideas are included and all or most datasets could feasibly be collected or accessed by the end of the semester. Each dataset is described alongside a note about availability with a source cited. | Three dataset ideas are included and all or most datasets could feasibly be collected or accessed by the end of the semester. Each dataset is described alongside a note about availability with (possibly multiple) sources cited. Each dataset could reasonably be part of a data science project, driven by an interesting research question. |