Overview
Important dates
- Proposal due Wed, Mar 12th
- Data collection and exploratory data analysis due Wed, Mar 26th
- Preregistration of analyses due Wed, Apr 16th
- Draft report due Wed, Apr 23rd
- Peer review due Fri, Apr 25th
- Presentation + slides due Fri, May 2nd
- Final report + reproducibility due Tue, May 6th
The details will be updated as the project date approaches.
Introduction
TL;DR: Ask a question you’re curious about and answer it with a dataset of your choice. This is your project in a nutshell.
This project is designed to give you experience with the full cycle of data science, from collecting observations to modeling to making arguments. Alumni often tell us that the final project was the most useful and memorable part of this class.
Your project should be something you are proud of and can display as part of a portfolio for job applications. The idea is that you take what you’ve learned through the course (via application exercises, labs, homework, and exams) and apply it to a specific domain of interest to you. To do well on the final project, simply show us what you’ve learned!
A high-level overview
Pick a topic you find interesting, where there may be some quantitative data to analyze. Think hobbies, past classes you’ve found interesting, something you really care about, etc. It will be a lot easier to dedicate time to the project if you find the topic fundamentally interesting.
To get started, review the curriculum we’ve covered over the semester; it has been designed to give you a sense of a common data science workflow. Apply that workflow to your project:
- Collect data. Find (good) data (which will depend on the research questions you end up formulating). This may take some time, and you may not find exactly the data you want in one file or in one sitting. Your interests may also shift as you search and learn what kind of data is available within the topic. Be willing to keep looking for (additional) data and iterating on your topic!
- Explore your data. Start with the most basic types of analyses (summary statistics, histograms, scatterplots, etc.) to get a sense of the data. What could the data tell you? What kind of questions would it fail to answer?
- Write down a few concrete research questions and hypotheses. What have you noticed in exploring your data? What do you know about the processes that generated the data? Discuss these ideas with your collaborators (if working with others), with friends in the course, and with course staff. Do you need to gather any additional data? Slightly different data?
- Select your tools. Which analyses would be most appropriate for the type of data you’ve gathered and the questions you’d like to ask? What types of investigations has this class prepared you to conduct?
- Build models, analyze them, and test your hypotheses. Don’t throw out analyses that fail to show significance; these can still teach you something about the context from which your data came, or the data itself, if interpreted properly.
- Interpret your results. This is so important. Your results don’t matter unless you can make people understand why they matter. What do these results mean for the “real world” beyond your dataset? Were you limited in your conclusions because of issues with the data? What were the issues and how did they specifically impact your analyses?
- Rinse and repeat. Doesn’t that look like a nice sequence of tasks? Unfortunately, it never works exactly that way! You will constantly go back and forth between steps. Instead of imagining the steps as 1 ➡️ 2 ➡️ … ➡️ 5, think of them as 1 ↔︎️ 2 ↔︎️ … ↔︎️ 5. Keep trying things until you feel like you’ve put together an interesting story for your final report.
- Prepare a focused final report. Pick the clearest and most interesting analyses and contextualize them in your final report. Don’t just present numbers and plots: explain to the reader what they mean. Answer the question “so what?”. Put additional analyses tangential to the final direction of your project in (optional) appendices.
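As a rough illustration of the "Explore your data" step above, here is a minimal sketch in base R. It assumes the built-in `mtcars` dataset as a stand-in for your own data, and the column names (`mpg`, `wt`) belong to that stand-in, not to anything the project requires. Packages covered in class may offer more convenient ways to do the same things.

```r
# Stand-in data: `mtcars` ships with base R. Replace it with your own dataset.
dat <- mtcars

# Inspect the structure and summary statistics of every column.
str(dat)
summary(dat)

# Count missing values per column to spot data-quality issues early.
missing_per_column <- colSums(is.na(dat))
print(missing_per_column)

# Histogram of a single numeric variable.
hist(dat$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")

# Scatterplot of two numeric variables.
plot(dat$wt, dat$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```

Basic checks like these can help you judge what questions the data could answer, and which ones it could not, before you commit to research questions.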
The project is very open-ended. There is no limit on what tools or packages you may use, but sticking to packages we learned in class is recommended. Neatness, coherence, and clarity will count. All analyses must be done in RStudio, using R, and all components of the project must be reproducible.
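One small piece of reproducibility is making sure any random steps give the same result on every run. The sketch below is only an illustration under assumptions: the seed value, the bootstrap example, and the use of the built-in `mtcars` data are all arbitrary choices, not course requirements.

```r
# Fix the random seed so resampling gives the same numbers on every run.
# The seed value 2951 is arbitrary.
set.seed(2951)
boot_means <- replicate(1000, mean(sample(mtcars$mpg, replace = TRUE)))

# Re-running from the same seed reproduces the result exactly.
set.seed(2951)
boot_means_again <- replicate(1000, mean(sample(mtcars$mpg, replace = TRUE)))
identical(boot_means, boot_means_again)

# Record the R version and loaded packages alongside the results.
sessionInfo()
```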
You will work on the project with your lab teams.
The three primary deliverables for the final project are:
- A project proposal with three dataset ideas.
- A reproducible project report of your analysis, with one required draft along the way.
- A presentation with slides.
There will be additional submissions throughout the semester to facilitate completion of the final report and presentation.
Teams
Projects will be completed in teams of 3-5 students. Every team member should be involved in all aspects of planning and executing the project, and each should make an equal contribution to all parts of it. The expected scope of your project depends on the number of contributing team members: we will expect a larger project from a team with 4 contributing members than from a team with 3.
Some lab section meetings will be devoted to work on the project, so all teams will be formed within each lab section (i.e. only students in your lab section can be your team members). The course staff will assign students to teams. To facilitate this process, we will provide a short survey identifying study and communication habits. Once teams are assigned, they cannot be changed.
Team conflicts
Conflict is a healthy part of any team relationship. If your team doesn’t have conflict, then your team members are likely not communicating their issues with each other. Use your team contract (written at the beginning of the project) to help keep your team dynamic healthy.
When you have conflict, you should follow this procedure:
- Refer to the team contract and follow it to address the conflict.
- If you resolve the conflict without issue, great! Otherwise, update the team contract and try to resolve the conflict yourselves.
- If your team is unable to resolve your conflict, please contact info2951@cornell.edu and explain your situation.
- We’ll ask to meet with all the group members and figure out how we can work together to move forward.
Please do not avoid confrontation if you have conflict. If there’s a conflict, the best way to handle it is to bring it into the open and address it.
Project grade adjustments
Remember, do not do the work for a slacking team member. This only rewards their bad behavior. Simply leave their work unfinished. (We will not increase your grade during adjustments for doing more than your fair share.)
Your team will initially receive a final grade assuming that all team members contributed to your project. If you have a 5-person team but only 3 people contributed, your team will likely receive a lower grade initially because only 3 people’s worth of effort exists for a 5-person project. About a week after the initial project grades are released, adjustments will be made to each individual team member’s group project grade.
We use your project’s Git history (to view the contributions of each team member) and the peer evaluations to adjust each team member’s grade. Your grade may be adjusted up or down based on your individual contributions.
For example, if you have a 4-person team but only 3 contributing members, the 3 contributing members may have their grades increased to reflect that the project was completed with only 3 people’s effort. The non-contributing member will likely have their grade decreased significantly.
I am serious about every member of the team equitably contributing to the project. Students who fail to contribute equitably may receive up to a 100% deduction on their project grade.
Please be patient with the grade adjustments; they take time to do fairly. Please know that I (the instructor) handle this entire process myself, and I take it very seriously. If you think your initial group project grade is unfair, please wait for your grade adjustment before you contact us.
The slacking team member
Please do not cover for a slacking/freeloading team member. Please do not do their work for them! This only rewards their bad behavior. Simply leave their work unfinished. (We will not increase your grade during adjustments for doing more than your fair share.)
Remember, we have your Git history. We can see who contributes to the project and who doesn’t. If a team member rarely commits to Git and only makes very small commits, we can see that they did not contribute their fair share.
All students should make their project contributions through their own GitHub account. Do not commit changes to the repository from another team member’s GitHub account. Your Git history should reflect your individual contributions to the project.
Overall grading
The grade breakdown is as follows:
Component | Points |
---|---|
Project proposal | 10 pts |
Data collection and exploratory data analysis | 15 pts |
Preregistration of analyses | 5 pts |
Draft report | 10 pts |
Peer review | 5 pts |
Final report | 70 pts |
Slides + presentation | 15 pts |
Slides + presentation (peer) | 5 pts |
Reproducibility + organization | 20 pts |
Total | 155 pts |
Late work policy
No late work is accepted for this project. Be sure to turn in your work early to avoid any technical mishaps.