Proposals due Friday, July 9 at 11:59pm

Report, slides, and repository due Tuesday, July 20 at 11:59pm

Presentations will occur on Wednesday, July 21 and Thursday, July 22 during class

Introduction

TL;DR: Pick (or create) a dataset and do something with it. That is your final project.

The final project for this class will consist of an analysis of a dataset of your own choosing or creation. The dataset may already exist, or you may collect your own data using a survey, by conducting an experiment, or by scraping the web. You can choose the data based on your interests or based on work in other courses or research projects. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like) and apply them to a dataset in a meaningful way.

Brief project logistics

The four deliverables for the final project are

  • A written proposal detailing the data you chose, along with some exploratory data analysis (EDA) and at least two questions you are interested in answering
  • A written, reproducible report detailing your analysis
  • A GitHub repository corresponding to your report
  • A brief set of slides that correspond to your intended oral presentation (ten minutes)

No late projects are accepted.

The grade breakdown is as follows:

Total 125 pts
Proposal 15 pts
Written report 60 pts
Slides 15 pts
Repository 5 pts
Presentation 15 pts
Participation 10 pts
Overall neatness and presentation style 5 pts

Data sources

In order for you to have the greatest chance of success with this project, it is important that you choose a manageable dataset. This means that the data should be readily accessible and large enough that multiple relationships can be explored. As such, your dataset must have at least 50 observations and at least 10 variables (exceptions can be made, but you must speak with me first). The dataset’s variables should include categorical variables, discrete numerical variables, and continuous numerical variables.
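
As a quick self-check against the size and variable-type requirements above, a sketch along these lines may help. The data frame `my_data` is simulated purely for illustration, and treating integer columns as "discrete" and other numeric columns as "continuous" is a rough assumption rather than an official rule.

```r
# Hypothetical stand-in for your own dataset; replace with your data.
set.seed(1)
my_data <- data.frame(
  group  = factor(sample(c("a", "b", "c"), 60, replace = TRUE)),  # categorical
  visits = rpois(60, lambda = 3),                                  # discrete numeric
  height = rnorm(60, mean = 170, sd = 10)                          # continuous numeric
)

# Requirement check: at least 50 observations and at least 10 variables.
meets_rows <- nrow(my_data) >= 50
meets_cols <- ncol(my_data) >= 10

# Rough classification of each column by how R stores it (an assumption).
var_kind <- sapply(my_data, function(x) {
  if (is.factor(x) || is.character(x) || is.logical(x)) "categorical"
  else if (is.integer(x)) "discrete numeric"
  else if (is.numeric(x)) "continuous numeric"
  else "other"
})

table(var_kind)
```

In this toy example `meets_cols` is FALSE because the simulated data has only three columns; a real project dataset would need at least ten.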

All analyses must be done in RStudio, and your final written report and analysis must be reproducible. This means that you must create an R Markdown document attached to a GitHub repository that will create your written report exactly upon knitting.

If you are using a dataset that comes in a format that we haven’t encountered in class (for instance, a .DAT file), make sure that you are able to load it into RStudio as this can be tricky depending on the source. If you are having trouble, ask for help before it is too late.
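
Many .DAT files are plain text with whitespace-separated columns, in which case base R's `read.table()` is often enough. The sketch below writes a small made-up file to a temporary location and reads it back; the file contents and column names are hypothetical, and a real .DAT file may need different arguments (or may be a binary format that requires another tool).

```r
# Create a small whitespace-delimited file to stand in for a real .DAT file.
dat_path <- tempfile(fileext = ".DAT")
writeLines(c("id score group",
             "1 3.5 a",
             "2 4.0 b",
             "3 2.5 a"), dat_path)

# header = TRUE treats the first line as column names;
# the default sep = "" splits on any run of whitespace.
scores <- read.table(dat_path, header = TRUE, stringsAsFactors = TRUE)

# Inspect the column types to confirm the file was parsed as intended.
str(scores)
```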

Reusing datasets from class: Do not reuse datasets used in examples / homework in the class.

Project components

Proposal

Your proposal must be done using R Markdown. You should describe the dataset that you would like to use, and define the variables in the dataset that you intend to explore. You must include some EDA (e.g., univariate or bivariate plots, tables of summary statistics, etc.), and you must also list at least two questions that you are interested in answering using the data.
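
To give a sense of scale, the EDA in a proposal might look something like the sketch below. It uses base R only, and the `survey` data frame is simulated for illustration; your own variables, plots, and summaries will differ.

```r
# Simulated data standing in for a real dataset (hypothetical variables).
set.seed(42)
survey <- data.frame(
  year  = factor(sample(c("first", "second", "third"), 80, replace = TRUE)),
  hours = round(runif(80, min = 0, max = 20), 1)
)

# Univariate summaries: a numeric variable and a categorical variable.
summary(survey$hours)
table(survey$year)

# Bivariate summary: mean hours within each year.
mean_by_year <- aggregate(hours ~ year, data = survey, FUN = mean)
mean_by_year

# Univariate and bivariate plots.
hist(survey$hours, main = "Weekly study hours", xlab = "Hours")
boxplot(hours ~ year, data = survey, xlab = "Year", ylab = "Hours")
```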

There is no page limit or requirement. For submission, submit the .pdf document to Gradescope by Friday, July 9 at 11:59pm. The main purpose of this component of the project is to help you get started, and so Becky can give feedback/suggestions about the data and questions of interest.

Written report

Your written report must be done using R Markdown. You must contribute to the GitHub repository with regular, meaningful commits/pushes. Before you finalize your write-up, make sure the printing of code chunks is turned off with the option echo = FALSE.
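
One common way to hide code throughout the document is to set the option once in a setup chunk near the top of the .Rmd file, rather than on every chunk. A minimal sketch, assuming the knitr package that R Markdown uses for knitting:

```r
# Place inside the first (setup) chunk of the .Rmd file.
# Hides code in the knitted output while still running every chunk.
knitr::opts_chunk$set(echo = FALSE)
```

An individual chunk can still override this default by setting echo = TRUE in its own chunk header.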

Your final report must match your GitHub repository exactly. The mandatory components of the report are as follows, but feel free to expand with additional sections as necessary. There is no page limit or requirement – however, you must comprehensively address all aspects below. Please be judicious in what you decide to include in your final write-up. For submission, submit the .pdf document to Gradescope.

The written report is worth 60 points, broken down as

Total 60 pts
Introduction/data 10 pts
Methodology 20 pts
Results 20 pts
Discussion 10 pts

Introduction and data

The introduction should introduce your general research question and your data (where it came from, how it was collected, what are the cases, what are the variables, etc.).

Methodology

The methodology section should include the variables used to address your research question, as well as any useful visualizations or summary statistics. You should also introduce and justify the statistical method(s) that you believe will be useful in answering your research question.

Results

Showcase how you arrived at answers to your question using any techniques we have learned in this class (and some beyond, if you’re feeling adventurous). Provide the main results from your analysis. The goal is not to do an exhaustive data analysis (i.e., do not calculate every statistic and procedure you have learned for every variable), but rather let me know that you are proficient at asking meaningful questions and answering them with results of data analysis, that you are proficient in using R, and that you are proficient at interpreting and presenting the results. Focus on methods that help you begin to answer your research questions.

Discussion

This section is a conclusion and discussion. This will require a summary of what you have learned about your research question along with statistical arguments supporting your conclusions. Also, critique your own methods and provide suggestions for improving your analysis. Issues pertaining to the reliability and validity of your data and appropriateness of the statistical analysis should also be discussed here. A paragraph on what you would do differently if you were able to start over with the project or what you would do next if you were going to continue work on the project should also be included.

Repository

In addition to your Gradescope submissions, we will be checking your GitHub repository. This repository should include:

  • R Markdown files (formatted to clearly present all of your code and results) that will output the proposal and final write-up in two separate documents
  • Meaningful README file on the GitHub repository that contains a codebook for relevant variables
  • Dataset(s) (in csv or RData format, in a /data folder)
  • Presentation (if using Keynote/PowerPoint/Google Slides, export to PDF and put in repo, in a /presentation folder). To do this, create a folder in your file environment in RStudio called presentation and upload the slides there.
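
The folder layout above can also be set up from the R console. This is only a sketch: `project_dir` is a temporary directory standing in for your cloned repository, and `my_data` and the file name `my_data.csv` are made up for illustration.

```r
# Hypothetical project location; in practice, use the path to your repository.
project_dir <- file.path(tempdir(), "final-project")

# Create the /data and /presentation folders.
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_dir, "presentation"), showWarnings = FALSE)

# Save a dataset in csv format inside /data.
my_data <- data.frame(id = 1:3, score = c(3.5, 4.0, 2.5))
csv_path <- file.path(project_dir, "data", "my_data.csv")
write.csv(my_data, csv_path, row.names = FALSE)

# List the files now in the project folder.
list.files(project_dir, recursive = TRUE)
```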

Style and format do count for this assignment, so please take the time to make sure everything looks good and your data and code are properly formatted.

Slides

In addition to the write-up, you must also create presentation slides that summarize and showcase your project. Introduce your research question and dataset, showcase visualizations, and provide some conclusions. These slides should serve as a brief visual accompaniment to your write-up and will be graded for content and quality. They can also be used for your Presentation. For submission, convert these slides to a .pdf document to be submitted to Gradescope.

Presentation

On the last two days of class, Wednesday, July 21 and Thursday, July 22, everyone will present their projects during class. There will be a seven-minute time limit for each presentation followed by three minutes for questions, for a total of ten minutes per presentation. You may, and should, use the slides as detailed in the previous section during your presentation.

Participation

You are expected to attend both days of presentations, and be actively engaged by asking questions and providing feedback.

Asynchronous options

If you cannot attend or give presentations due to taking the course asynchronously, you must let Becky know by Monday, July 12 at 11:59pm. Note that only students who enrolled in the asynchronous section of the course may take this option.

You will still be required to complete the final project. The details of the proposal, written report, slides, repository, and overall neatness/presentation style are the same as for the synchronous students. The presentation and participation will be slightly modified as follows:

Presentation

You will be required to upload a 7-minute video presentation by Wednesday, July 21 at 10:15am EST. You will receive questions and feedback from Becky, and you must respond to the questions by e-mail within 24 hours of receiving them. You will post the presentation video in Warpwire, which is accessible from the course Sakai site (bottom of the left-hand tool bar).

To post your video on Warpwire:

  • Click the Warpwire tab in the course Sakai site.
  • Click the “+” and select “Upload files”.
  • Locate the video on your computer and click to upload.
  • Once you’ve uploaded the video to Warpwire, click to share the video and make a copy of the video’s URL. You will need this when you submit the video.

To post the video for assignment:

  • Click the Assignments tab in the course Sakai site.
  • Click the Project assignment.
  • Click the Warpwire icon (between the flag and shopping cart icons). Select your video, then click “Insert 1 item.” This will embed your video in the conversation. Under the video, paste the URL to your video.
  • You’re done!

For the presentation, you can speak over your slide deck, similar to the lecture content videos. I recommend using Zoom to record your presentation; however, you can use whatever platform works best for you.

Participation

You are expected to watch each day’s recorded lecture/presentations. For each day of live presentations, you are required to submit feedback/questions via a provided form by 11:59pm of that night. For example, for presentations that take place on Wednesday, July 21, you should submit the feedback by 11:59pm Wednesday, July 21.

Overall notes

The project is very open-ended. For instance, in creating compelling visualizations of your data in R, there is no limit on what tools or packages you may use. You do not need to visualize all of the data at once. A single high-quality visualization will receive a much higher grade than a large number of poor-quality visualizations.

Finally, pay attention to details in your write-up and presentation. Neatness, coherency, and clarity will count.

Tips

  • Ask questions if any of the expectations are unclear.

  • Code: In your write-up, your code should be hidden (echo = FALSE) so that your document is neat and easy to read. However, your document should include all your code, such that if I re-knit your Rmd file I should be able to obtain the results you presented. Exception: If you want to highlight something specific about a piece of code, you’re welcome to show that portion.

Grading

Grading of the project will take into account the following:

  • Content - What is the quality of the research and/or policy question, and how relevant is the data to that question?
  • Correctness - Are statistical procedures carried out and explained correctly?
  • Writing and Presentation - What is the quality of the statistical presentation, writing, and explanations?
  • Creativity and Critical Thought - Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?

A general breakdown of scoring is as follows:

  • 90%-100%: Outstanding effort. Student understands how to apply all statistical concepts, can put the results into a cogent argument, can identify weaknesses in the argument, and can clearly communicate the results to others.
  • 80%-89%: Good effort. Student understands most of the concepts, puts together an adequate argument, identifies some weaknesses of their argument, and communicates most results clearly to others.
  • 70%-79%: Passing effort. Student misunderstands concepts in several areas, has some trouble putting results together in a cogent argument, and communication of results is sometimes unclear.
  • 60%-69%: Struggling effort. Student is making some effort, but misunderstands many concepts and is unable to put together a cogent argument. Communication of results is unclear.
  • Below 60%: Student is not making a sufficient effort.

Late work policy

No late work is accepted on this project. Be sure to turn in your work early to avoid any technological mishaps.