class: center, middle, inverse, title-slide # STATS 250 Lab 04 ## Probability and Scatterplots ### Nick Seewald
nseewald@umich.edu ### Week of 09/21/2020 --- class: center, middle # Reminders 💡 Your tasks for the week running Friday 9/18 - Friday 9/25 (plus an extra): | Task | Due Date | Submission | |:-----|:---------|:-----------| | Quiz 1 | Monday 9/21 11:59PM ET | Canvas | | MWrite 1 Initial Draft | Wednesday 9/23 5PM ET | Canvas | Homework 3 | Friday 9/25 8AM ET | course.work | | Lab 4 | Friday 9/25 8AM ET | Canvas | *No office hours or Piazza today because of the quiz!* --- # Lab 2 Comments - Really great work on lab 2! - R orders levels of categorical variables *alphabetically* - It's a good idea to have descriptive axis labels for plots. People might not know what "ConditionTreeHealth" means! - In Dive Deeper 2, the data were collected by the City of Ann Arbor. Only trees on city-owned land are included! - In Dive Deeper 4, it turns out that the map only contains information about trees on city-owned land. Since the City of Ann Arbor does not own the Arboretum, the data could not be used to answer that research question. --- # Homework 2 Comments 📝 - A causal *statement* is any sentence that is about causation. - "There is not evidence to say that eating chia seeds causes weight loss" **is** a causal statement - "Chia seeds do not cause weight loss" **is** a causal statement - Causal *statements* do not require causal *relationships* - Generalizability to a population is a result of sampling: how are data collected? - Sample size isn't really a big deal - Good (random) sampling = generalizable; bad sampling = not generalizable --- # Homework 2 Comments 📝 - Review your homework! Even questions you got full credit on. - You should have comments on every question you lost points on. - Remember we don't grade every question for correctness. --- class: center, middle # Weekly Advice We're focusing a lot on random sampling this week. ###Your mileage may vary! Your results *will* be different from mine and from your collaborators'. --- # Integer Sequences in R - A **vector** is a way to hold a collection of things in R. Think of it as a pill organizer. - This week, we're going to create a vector that holds a sequence of consecutive integers. ```r 1:6 ``` ``` [1] 1 2 3 4 5 6 ``` - Read the colon `:` as "through", so `1:6` is "1 through 6" --- # Sampling in R 🎲 - Think of `1:6` as representing a six-sided die. - We can "roll" the die by taking a `sample()` from the vector `1:6` ```r sample(1:6, size = 1) ``` ``` [1] 3 ``` - Run the `sampleDieRoll` chunk (line ~63) and type what you got in the chat! --- # Sampling With vs. Without Replacement - Consider randomly selecting 6 values from the set {1, 2, 3, 4, 5, 6}. - Say our first pick is 3. - What do we do with 3? Do we take 3 out of the set (don't *replace* it), or do we put it back in (*replace* it)? -- ```r sample(1:6, size = 6, replace = F) ``` ``` [1] 3 6 5 1 2 4 ``` -- ```r sample(1:6, size = 6, replace = T) ``` ``` [1] 1 2 2 5 4 2 ``` -- - Which of these strategies represents die-rolling in real life? --- # Law of Large Numbers As you collect more data, sample averages will get close to population averages ("*expected values*"). .pull-left[ Roll dice ``` [1] 5 [1] 4 6 [1] 2 5 2 [1] 1 2 6 3 [1] 3 3 4 6 1 ``` ] .pull-right[ Average of rolls ``` [1] 5 [1] 5 [1] 3 [1] 3 [1] 3.4 ``` ] -- The mean seems like it's trying to do something, but it's too variable to really see what's happening. --- # Law of Large Numbers <img src="lab04-slides-sync_files/figure-html/lln-1.png" style="display: block; margin: auto;" /> --- # Expected Value We can compute the value that the sample averages will converge to! $$ \sum_{i=1}^n x_i \cdot p_i $$ - `\(\Sigma\)` means "summation" (addition) - `\(x_i\)` is the value (in our case, 1, 2, 3, 4, 5, or 6) - `\(p_i\)` is the *probability* of observing the value For the six-sided die, the expected value is -- $$ 1 \cdot \left(\frac{1}{6}\right) + 2 \cdot \left(\frac{1}{6}\right) + 3 \cdot \left(\frac{1}{6}\right) + 4 \cdot \left(\frac{1}{6}\right) + 5 \cdot \left(\frac{1}{6}\right) + 6 \cdot \left(\frac{1}{6}\right) = 3.5$$ --- class:inverse ## Penguins! ```r penguins <- read.csv(url("https://raw.githubusercontent.com/STATS250SBI/palmerpenguins/master/inst/extdata/penguins_NArm.csv")) ``` .pull-left[ ```r str(penguins) 'data.frame': 333 obs. of 8 variables: $ species : chr "Adelie" "Adelie" "Adelie" "Adelie" ... $ island : chr "Torgersen" "Torgersen" "Torgersen" "Torgersen" ... $ bill_length_mm : num 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ... $ bill_depth_mm : num 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ... $ flipper_length_mm: int 181 186 195 193 190 181 195 182 191 198 ... $ body_mass_g : int 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ... $ sex : chr "male" "female" "female" "female" ... $ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ... ``` ] .pull-right[![Illustration of penguin species. Artwork by @allison_horst.](https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/lter_penguins.png) ] --- # Scatterplots A **scatterplot** is a way to display the relationship between quantitative explanatory (x) and response (y) variables. The data are paired (x, y) and then each pair is plotted on a grid. We can use scatterplots to look for **associations** between these quantitative variables. --- # Scatterplots (line ~142) .pull-left[ ```r plot(penguins$bill_length_mm, penguins$body_mass_g, main = "Scatterplot of Bill Length versus Body Mass of Penguins", xlab = "Bill Length in mm", ylab = "Body Mass in g") ``` - positive association - reasonably linear - moderately strong - no apparent unusual points ] .pull-right[ ![](lab04-slides-sync_files/figure-html/scatterplot-1.png)<!-- --> ] --- # Correlation - The **correlation** between two quantitative variables quantifies the strength of the *linear* association. - Denote correlation by `\(r\)` - As `\(\lvert r \rvert\)` gets close to 1, the linear relationship becomes stronger <img src="lab04-slides-sync_files/figure-html/corExample-1.png" style="display: block; margin: auto;" /> --- # Correlation (line ~155) We can find the correlation between two quantitative variables in R using the `cor()` function. ```r cor(penguins$bill_length_mm, penguins$body_mass_g) ``` ``` [1] 0.5894511 ``` --- class:inverse # Lab Project ⌨️ You will be **randomly** moved to a breakout room for the rest of the lab (minus ~10 minutes) .pull-left[ ### Your tasks 1. Introduce yourself to your collaborators! 1. **Work together** to complete the "Try It!" and "Dive Deeper" portions of the lab assignment by copy/pasting and modifying appropriate code from earlier in the document. ] .pull-right[ ### How to get help - I'll be floating around between breakout rooms to check on everyone - Use the "Ask for help" button to flag me down - Let me know when you're done ] --- class: center, middle, inverse # What questions do you have? Any issues? --- class: center, middle # "Exit Ticket" Please take 1-2 minutes to complete the survey at ### [bit.ly/njs_ticket4](https://bit.ly/njs_ticket4)