## Analysis Schedule

For the Fall 2020 semester, you will complete three analyses.

| Analysis         | Files               | Deadline                |
|------------------|---------------------|-------------------------|
| Heart Disease    | heart-analysis.zip  | Wednesday, November 11  |
| Credit Fraud     | credit-analysis.zip | Wednesday, November 18  |
| MLB Pitching     | pitch-analysis.zip  | Wednesday, December 2   |
| Analysis Revisit | None needed!        | Wednesday, December 9   |

The files and documentation needed to get started with each analysis are provided in two formats:

• A GitHub repo that you may fork, clone, download, or use however you’d like. Use the link in the Analysis column above. Using GitHub is not necessary, but if you already use it as part of your workflow, we have provided this resource for you.
• A .zip file that is a mirror of the GitHub repo. Use the link in the Files column above.

Use whichever is easier for you. Whichever you choose, start by reading the README.md file. Both options are set up so that you only need to edit the analysis.Rmd file contained within. To complete your analysis, simply knit this file, re-zip the folder provided, and submit!
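If you prefer the console to RStudio’s Knit button, knitting can also be done with a single function call. This is a sketch, assuming the rmarkdown package is installed and your working directory is the unzipped analysis folder:

```r
# Knit the report; assumes the working directory is the unzipped
# analysis folder and the rmarkdown package is installed
rmarkdown::render("analysis.Rmd")  # writes analysis.html next to the .Rmd
```

The Knit button in RStudio runs the same process; use whichever you prefer.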

### Analysis 04

In place of Analysis 04, you will revisit one of the three previous analyses. You pick! Now that you’re familiar with the analyses, and have had time to work on your own, revisit an analysis with fresh eyes and more video examples from the instructor.

• Submit a .zip file as you did for the original analysis.
• Include files called analysis_original.Rmd and analysis_original.html that contain your original work.
• Add a section to the updated analysis called “Changelog” that briefly describes the changes made from the original analysis.

You can pick whatever you’d like to focus on in the revision. You do not need to completely redo an analysis; in fact, it is best to keep your revision focused. It could be as simple as comparing an additional modeling technique, or just cleaning up your R Markdown document and your writing.

## Videos

| Analysis      | Video         | Mirror           | Topics                                      |
|---------------|---------------|------------------|---------------------------------------------|
| Heart Disease | 1.1 - YouTube | 1.1 - MediaSpace | Goal, Data, Response, Features              |
| Heart Disease | 1.2 - YouTube | 1.2 - MediaSpace | Data Splitting, Missing Data, EDA           |
| Heart Disease | 1.3 - YouTube | 1.3 - MediaSpace | Baseline Performance, Binary Classification |
| Heart Disease | 1.4 - YouTube | 1.4 - MediaSpace | ML Pipelines                                |
| Heart Disease | 1.5 - YouTube | 1.5 - MediaSpace | Recap and Next Steps                        |

## Learning Objectives

Our hope is that the process of completing these analyses will help you see how to put what we have learned this semester into practice. Specifically, after completing these analyses and receiving feedback, we hope that students are able to:

• formulate practical, real-world, problems as machine learning problems.
• implement learning methods using a statistical computing environment.
• evaluate the effectiveness of learning methods when used as tools for data analysis.
• communicate analysis results in a concise and meaningful manner.

Beyond that, we hope you use these analyses as an opportunity for exploration. Hopefully the foundation we have created throughout the semester will give you the confidence to try some things we only briefly discussed, or did not discuss at all!

For these analyses, you will do the following:

• Analyze the provided data however you please! Each analysis has a suggested goal, but you may use the data however you choose!
• Write a report in R Markdown using the IMRAD organization structure.
• Write an abstract.
• Write an introduction.
• Write a methods section.
• Write a results section.
• Write a discussion.

Please review the IMRAD Cheat Sheet developed by the Carnegie Mellon University Global Communication Center.

### Abstract

Even though it is the first thing to appear in the report, the abstract should be the last thing that you write. Generally the abstract should serve as a summary of the entire report. Reading only the abstract, the reader should have a good idea about what to expect from the rest of the document. Abstracts can vary greatly in length, but a good heuristic is to use a sentence for each of the main sections of the IMRAD:

• Introduction: Why are you doing this analysis?
• Methods: What did you do?
• Results: What did you find?
• Discussion: What does it mean? Why does it matter?

### Introduction

You do not need to provide a complete data dictionary in the introduction, but you should include one in the appendix. Often the data would be introduced in the methods section, but here the data is very closely linked to the motivation of the analysis, so it at least needs to be introduced in the introduction.

Consider including some exploratory data analysis here, and providing some of it to the reader in the report if you feel it helps present and introduce the data.

### Methods

The methods section should discuss what you did. The methods that you are using are those learned in class. This section should contain the bulk of your “work,” and it will contain most of the R code that is used to generate the results. Your R code is not expected to be perfectly idiomatic R, but it is expected to be understood by a reader without too much effort. The majority of your code should be suppressed from the final report, but consider displaying code that helps illustrate the analysis you performed, for example, the training of models.
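One common pattern for suppressing code by default while still displaying selected chunks is to set global knitr chunk options in a setup chunk, then override them per chunk. This is a sketch, assuming a standard R Markdown document; the chunk label `model-fit` is hypothetical:

```{r, include=FALSE}
# Setup chunk: hide code, messages, and warnings everywhere by default
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
```

```{r model-fit, echo=TRUE}
# echo=TRUE overrides the default for a chunk worth showing,
# for example, the call that trains your chosen model
```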

Consider adding subsections in this section. One potential set of subsections could be data and models. The data section would describe your data in detail. What are the rows? What are the columns? Which columns will be used as the response and the features? What is the source of the data? How will it be used in performing your analysis? What if any preprocessing have you done to it? The models section would describe the modeling methods that you will consider, as well as strategies for comparison and evaluation.

Your goal is not to use as many methods as possible. Your task is to use appropriate methods to accomplish the stated goal of the analysis.

### Results

The results section should contain numerical or graphical summaries of your results. What are the results of applying the methods you described? Consider reporting a “final” or “best” model you have chosen. There is not necessarily one, singular correct model, but certainly some methods and models are better than others in certain situations. The results section is about reporting your results. In addition to tables or graphics, state the results in plain English.

### Discussion

The discussion section is where you comment on your results, framing them in the context of the data. What do your results mean? Why would someone care about these results? Results are often just numbers; here you need to explain what they tell you about the analysis you performed. The results section tells the reader what the results are. The discussion section tells the reader why those results matter. The discussion section is the most important section: it takes numbers on a page and gives them meaning.

### Appendix

The appendix section should contain any additional code, tables, and graphics that are not explicitly referenced in the narrative of the report. The appendix should contain a data dictionary.

Submit a .zip file to Compass that contains:

• A .Rmd file that is your IMRAD report.
• A .html file that is the result of knitting your .Rmd file.
• Any additional files needed to knit your .Rmd file.

You may simply use the files and template provided. If so, you only need to modify the given analysis.Rmd file, knit, and zip the folder before submitting.

Submit your .zip file to the correct assignment on Compass2g. You are granted an unlimited number of submissions. Your last submission before each of the deadlines will be graded, provided you do not make a late submission. Because we will be releasing solutions, we cannot offer a super flexible late submission policy like with quizzes. You may submit up to 24 hours late with a 3 point reduction. If you make a submission after the initial deadline, this penalty will be applied. No exceptions! When in doubt, it is generally not worth re-submitting after the deadline to fix a small mistake.

For Fall 2020 all grading will be based on completion. Submit a reasonable looking report and you will receive full credit. Feel free to simply follow along with the videos posted above and submit those results as your analysis. We’re more concerned with your engagement than being critical of your work via grading. That said, you should be critical of your own work, as that is how you will learn and become more proficient at analysis.

Everything beyond this point was the “original” plan, which no longer applies.

The analyses will be graded out of 20 points. Each of the following criteria will be worth two points. A score of 0, 1, or 2 is possible for each criterion:

• 0: Criterion is largely ignored or incorrect.
• 1: Criterion is only partially satisfied.
• 2: Criterion is met.

The following criteria will be evaluated:

1. [eye-test] Final rendered document passes the “eye test.”
• Do not “show” too much code. That is, most code should only appear in the source .Rmd file. Only make code visible in the final report if it is a short, concise, easy to understand way to communicate what you have done.
• Final document should be free of any rendering errors.
• Final document should obviously follow the suggested IMRAD structure and, at a glance, should contain the relevant content.
• No R warnings or messages should be visible in the final document.
2. [code-style] R code and R Markdown follow suggested STAT 432 style guidelines.
• Any data files submitted should be loaded with a relative reference. Absolute references should only be used for files accessible via the web.
• See here for additional information.
• When in doubt, follow the tidyverse style guide.
3. [imrad-style] Document is well written and follows the IMRAD template.
• Appropriate content is in the appropriate section.
• Do not narrate what your code does from a code perspective. Narrate your document according to what is happening from a data analysis perspective.
• Use complete sentences. Mostly use paragraphs. Use bulleted lists where appropriate.
• Use spell check! There is a spell check button in RStudio!
4. [data-exp] Data is well explained.
• Reader should understand what a row of the data is. To do so will likely require some explanation of the domain the data comes from.
• Reader should understand what a column of the data is. To do so will likely require some explanation of the domain the data comes from.
• Reader should be made aware of the source of the data including the original and intermediate sources.
5. [pred-just] There is a clear (context driven) justification for making predictions about the response variable.
• It should be made clear to the reader why it is useful to predict this variable, in the context of the data’s domain.
6. [feat-style] There is a clear (context driven) justification for using the features considered.
• It should be clear to the reader that the features used will all be available at test time, that is, when making future predictions.
7. [train-test] Train and test data are used for appropriate tasks.
8. [causation] No causal claims are made.
9. [result-scrutiny] Any “chosen” model is scrutinized beyond a single numeric metric.
• Simply reporting for example an RMSE or accuracy of a chosen model is insufficient.
10. [issues] Potential shortcomings of the analysis are made clear to the reader.
• Suppose someone was going to actually use your suggested model and you will be held accountable for any issue caused by using the model in practice.
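To illustrate the relative-reference guideline in criterion 2, here is a sketch; the file names and URL are hypothetical, and the assumed layout is a data/ folder inside the submitted zip:

```r
# Good: relative path, knits for anyone who unzips your submission
heart = read.csv("data/heart.csv")

# Also fine: an absolute reference to a file accessible via the web
heart = read.csv("https://example.com/data/heart.csv")

# Bad: absolute path to your own machine, will not knit for the graders
# heart = read.csv("C:/Users/me/Desktop/heart.csv")
```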

The instructor and graders reserve the right to apply additional deductions for submissions that are extremely poor, containing so little content that it cannot be evaluated based on the above criteria. Our hope is that this grading structure allows students to feel confident in their grade, while being able to perform the analysis however they choose.

For Analysis 01, you will be given feedback according to the above criteria, but for Analysis 01 only, grading will actually be based on completion. That is, if you submit anything before the deadline, you will receive full credit. But please, do not abuse this policy! Our hope is that you will still attempt the analysis in order to obtain feedback that you can carry forward to future analyses so that you feel confident in the grading procedure.

## FAQ

How long should the report be?

• There is no explicit minimum. There is an implicit maximum. On one hand, you need to provide results and evidence to support your decisions, and you need to be thorough and diligent as you walk through the steps of your analysis. On the other hand, a well-crafted data analysis values brevity and conciseness. If you have a point to make, get to it. If you find yourself writing things simply for the sake of padding the word count, you are writing the wrong things. Respect your reader’s time.

What if I don’t get good results?

• If you review the grading criteria above, you will note that your grade does not depend on the strength of your results.
• Making a strong conclusion that is not supported by your analysis could result in a grade reduction.
• Making a correct but weak conclusion that is supported by your analysis will result in full marks.

Do I need to use method X or Y or Z?

• That is 100% up to you.