Advanced Biological Data Analysis

BIOSCI 738

Author
Affiliation

Charlotte Jones-Todd

University of Auckland

Published

2026

Notes

All BIOSCI 738 lectures require your active involvement! See the course policies for further information.

Throughout this runsheet you’ll find a number of different callout boxes:

Tip: this box will contain the expected learning objectives and/or outcomes for the material
Note: this box will contain some useful information
Warning: this box will ask you to discuss/write/sketch/code something! Completing these tasks will make up part of your weekly task-based participation mark.

Week 01

A big part of the learning experience in BIOSCI738 comes from you engaging with all the material suggested. Have you familiarised yourself with the BIOSCI738 CANVAS page and the Course policies and FAQs? Let’s see…

What’s the name of my dog? (hint: he stars in the introductory video)

Note: Need a stats or R refresher?

Try out the R²-D2 AI agent specifically designed for this purpose. I suggest working through around 30 questions, or until you are completely comfortable with the content (basically Module 1 from BIOSCI 220). If you’re stuck, R²-D2 will guide you through the answers too!

Tasks

Warning Introduction to BIOSCI738

BIOSCI738 kicks off a little differently in week 1. Rather than a typical lecture, we have 30-minute small-group sessions (you only need to attend one).

Sign up for a slot via the BIOSCI738 CANVAS calendar and come along for an introduction to the course. During the session, I’ll go over my expectations, introduce the diagnostic tasks (below), and answer any questions you may have. Be ready to share what you hope to get out of the course!

If you’re also enrolled in BIOSCI747 and can’t attend, please reach out to me directly (c.jonestodd@auckland.ac.nz).

Warning Diagnostic Coding Task

Below are a few tasks designed for you to self-assess your data wrangling skills and to make sure that you are all set up for the semester. You are encouraged to use any tools you wish, including AI-based tools; however, the responsibility is yours to ensure you fully understand and verify all work you submit.

You are not expected to have prior experience with GitHub and should refer to this section of the courseguide for an introduction (as well as some common issue fixes).

  1. Install git (follow these instructions)

  2. Register an account with GitHub (choose the free option!)

  3. Link to the GitHub Classroom repository you were invited to (via the CANVAS announcement) and ensure it is correctly set up.

  4. Download and open the repository in RStudio, and make sure to familiarise yourself with its structure.

  5. Update the tasks/week_01.r script to reproduce Figure 1 of this paper using the data the script already reads in (sourced from the paper).

    • All data manipulation must be done in R and be fully reproducible
    • Useful tidyverse functions: group_by(), summarise_all(), pivot_longer(), pivot_wider(), separate(), mutate(), str_remove(), str_replace().
  6. Finally, add and commit your changes with a meaningful commit message. Then push the changes to GitHub.

Any problems then please reach out via c.jonestodd@auckland.ac.nz. This task should be completed by the end of Week 01.

Other info

Note: 📢 Call for class rep! 📢 First meeting 31 March, 1-2pm

Note: Wellbeing Resources

Week 02 - Lecture 01

Rough Agenda

4:05 pm - 5:55 pm

Getting into Groups

Each lecture will involve a few group-based activities/discussions. It is your responsibility to organise yourselves into groups (see below). Feel free to reorganise the tables to suit your needs!

Note: In groups of ~4

Designate roles:

  • Facilitator Responsible for seeking out opinions from all group members and ensuring that everyone has the opportunity to contribute.
  • Notetaker/Coder Responsible for recording the key points and/or implementing the code for the group. This does NOT mean the sole coder; rather, the person responsible for running any suggested code etc.
  • Reporter Responsible for reporting back to the class on behalf of the group. Seek consensus from the group about the most important aspects of your discussion to share.
  • Timekeeper Responsible for managing time for the group, making sure you stay on track and spend equal time on each question.

Determine who is filling each role by order of upcoming birthdays:

  • The person with the next birthday → Notetaker
  • The next birthday after that → Reporter
  • The third birthday → Timekeeper
  • Next → Facilitator

Gorilla Gambit

In your groups quickly determine who is filling what role.

Once you have allocated group roles your Reporter should come to me to retrieve the first instruction!

Let’s talk AI - for programming

Who’s heard of bullshit?

Recently, there has been considerable interest in large language models: machine learning systems which produce human-like text and dialogue. Applications of these systems have been plagued by persistent inaccuracies in their output; these are often called “AI hallucinations”. We argue that these falsehoods, and the overall activity of large language models, is better understood as bullshit in the sense explored by Frankfurt (On Bullshit, Princeton, 2005): the models are in an important way indifferent to the truth of their outputs

This is an excerpt from the abstract of Hicks, Humphries, and Slater (2024).

If a student is confused about a concept, they can sit with ChatGPT and it will talk to them for hours about that particular concept.

It is a really great tool to create code but also a really great tool to prevent yourself from learning.

It’s teachers’ responsibility to motivate them and make such a problem that [students] are keen to solve and in a way that they actually would like to learn something and realize that they need these skills also in the future.

…there have always been so many ways of cheating, but I don’t think I’ve ever been aware of such an obvious, cheap, and easy way of cheating. Students can get [an AI tool] to answer any question I can ask them at the moment and therefore I have lost my ability to confidently assess any work that students hand in.

I think we need different kinds of professionals with different understandings of computing. Some need to be very deeply involved with how our programming languages work … others might only need some kind of overall understanding. They are not programmers by themselves, but they still should understand how software is produced.

All the above are quotes garnered in Sheard et al. (2024).

Questions

Below generative AI refers to tools that can generate text, code, explanations, or other content in response to prompts (e.g., large language models and AI coding assistants).

Warning Quick Poll

Open StatChat (log into your at.aucklanduni account)

  1. To what extent do you think your peers use generative AI in ways that instructors would consider inappropriate? Answer Either: A = None, B = A Few, C = Some, D = Many
  2. I rely on generative AI as a regular part of my workflow for coursework. Answer Either: A = Strongly Disagree, B = Disagree, C = Neutral, D = Agree, E = Strongly Agree
  3. Reusing small portions of code from external sources without acknowledgement is unethical. Answer Either: A = Strongly Disagree, B = Disagree, C = Neutral, D = Agree, E = Strongly Agree

Group Code of Conduct

Tip: A class-wide agreed group working Code of Conduct

Upon completion of this activity I will summarise the main themes/suggestions from Section 3.5.2 (that I deem appropriate). This will become the class-agreed group working Code of Conduct that you are expected to adhere to during each activity.

Why I think this is important

As a University of Auckland student, you are responsible for understanding and abiding by the requirements of the Student Charter.

In this activity we’re going to be talking about my expectations and yours when working in a group during this class; see this section of the course guide for further details.

A Code of Conduct is not just a strange thing the university makes you sign. Codes of conduct are a large part of many professional and research-focused bodies beyond university. The following are just a few examples of societies or institutes you will likely come across during a biostats career in NZ.

Write your own

Warning Propose your own code of conduct for group work in this class

As a group discuss your expectations of yourself and each other during group work activities. You may want to think about

  • how to ensure all members feel welcome,
  • your expectations around being an active (and equal) contributor, and
  • your expectations around peers’ use of technologies and generative AI during activities.

Individually submit your summaries via StatChat

Good programming practice

Tip: Abide by good programming practices to avoid future heartache!

For a more in-depth and general discussion I recommend reading Wilson et al. (2017).

IMO

Following this section of the course guide, let’s talk about what good programming practice looks like in this course.

You should always start with a clean workspace.

Honestly, I think the default RStudio behaviour of restoring .RData files etc. just makes everyone lazy…

Devise a workflow

During this course, very likely in other courses you’ll be taking this semester, and in your future careers, you will have to deal with many different datasets, wrangle “dirty” data, and handle data from different sources (at the very least). The key thing is to ensure that ANY ANALYSIS YOU CARRY OUT is TRANSPARENT and FULLY REPRODUCIBLE (either for your peers or for future you). This is where setting good foundations and devising a well-thought-out workflow is imperative!

It’s not all about length

Note: Lines of code written is NOT a measure of skill!

Some people think that writing a large number of lines of code demonstrates prowess. It does not. Surely we’ve all added nonsense to essays to “fill up” the word count!

On the other hand, some people strive to carry out operations in the fewest number of lines possible. This typically makes their work impossible to follow!

I recommend finding a spot you’re comfortable with in between the following two code snippets.1 Most importantly, keep your style readable & consistent!

grepl("^[[:alnum:]._-]+@[[:alnum:].-]+[:alnum:]+$", c("larry@gmail.com", "larry-sally@sally.com", "larry@sally.larry.com", "test@test.com.", "charlottejones-todd"))
[1]  TRUE  TRUE  TRUE FALSE FALSE

or

email_addresses <- c("larry@gmail.com", "larry-sally@sally.com", "larry@sally.larry.com", "test@test.com.", "charlottejones-todd")
contain_at <- function(x){
  grep("@", x)
}
idx <- contain_at(email_addresses)
correct_email <- email_addresses[idx]
correct_email
[1] "larry@gmail.com"       "larry-sally@sally.com" "larry@sally.larry.com"
[4] "test@test.com."       
contain_notrailing <- function(x){
  grep("^[:alnum:]+", x)
}
idx01 <- contain_notrailing(correct_email)
final_correct_email <- correct_email[idx01]
final_correct_email
[1] "larry@gmail.com"       "larry-sally@sally.com" "larry@sally.larry.com"

Keep the inside thoughts inside

Note: An R script (or equivalent) is a roadmap to your work.

You should present the cleanest most direct route you can!

I recommend the latter approach below (if you were going to pass on your solution, that is). It’s not that each step shouldn’t be carried out. On the contrary, exploring your data via printing and plotting it is very important! But when you have a solution, pare down your script! No need to take everyone on your journey.

## read in data
data <- readr::read_csv("a_valid_filename")
## printing data
print(data)
View(data) ##view opens up new window
data$variable
## Create new data object
newdata <- data$variable
newdata
## plot data
plot(newdata)
## calculate mean
mean <- mean(newdata)
print(mean)
## round
round(mean(newdata))

vs

data <- readr::read_csv("a_valid_filename")
round(mean(data$variable))

Telltale signs…

Note: An R script (or equivalent) showcases your approach!

There are a few quirks that AI insists on including in R scripts; with a bit of knowledge these are unnecessary. Rightly or wrongly, as soon as I see these in a script I become suspicious!

  1. print() statements e.g., print(variable)
  2. cat() statements e.g., cat("There are", variable, "students enrolled.\n")

Anyone know why these are telltale signs?
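A hint for the question above: at the top level of an R session (or a script run via Rscript), R automatically prints the value of any expression, so wrapping objects in print() is redundant, and cat() narration is just extra noise to wade through. A tiny illustration:

```r
x <- c(2, 3, 5)
mean(x)        # auto-printed at the top level: [1] 3.333333
print(mean(x)) # exactly the same output; the explicit print() adds nothing
```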

Make it EASY for me to give you marks

  • Think about your audience!
  • Ensure your code is reproducible!
  • Do not include any irrelevant or overly verbose output; this makes it difficult to find and therefore award the relevant components!
  • Keep your code tidy and your plots neat and professional. For example, use informative, readable axis labels rather than allowing the default behaviour of printing the R object name.

Peer-share

Warning Share your approach to the diagnostic coding task

Below is my solution to the diagnostic coding task. This is not the only solution! In your groups, run through my approach and share your own (especially if you got stuck in a few places)! Sharing and talking things through will make things clearer. Trust me! 😉 (Also a great task to start practising for your interactive Oral Assessments.)

library(tidyverse)
url <- "https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/dicots_proportions.csv"
data <- read_csv(url)
data %>%
    select(starts_with(c("Calluna", "Treat"))) %>%
    ## select only heather & treatment cols
    group_by(`Treat!`) %>%
    ## group by treatment so calcs are done by group
    summarise_all(list(mean = mean, sd = sd)) %>%
    ## cal mean and sd of each group
    pivot_longer(!`Treat!`) %>%
    ## flip the data frame to "long" format
    separate(name, c(NA, "year", "calc")) %>%
    ## extract and separate info from name column
    mutate(year = as.numeric(str_remove(year, "vulgaris"))) %>%
    ## keep numeric year info only
    pivot_wider(names_from = "calc", values_from = "value") %>%
    ## data into wider format based on mean & sd
    mutate(`Treat!` = str_replace(`Treat!`, "HB", "BH")) %>%
    ## change treatment label HB to BH
    mutate(`Treat!` = fct_relevel(`Treat!`,c("C", "B", "H", "BH"))) %>%
    ## relevel treatment to help with legend ordering later on
    ggplot(., aes(x = year, y = mean, group = `Treat!`)) + ## set up plot
    geom_point(aes(pch = `Treat!`)) + geom_line(aes(linetype = `Treat!`)) +
    ## add mean points & lines
    ylab("Percentage cover") + xlab("Year") + ## axis labels
    geom_errorbar(aes(ymin = mean - sd/sqrt(6),
                      ymax = mean + sd/sqrt(6)), width = .05) +
    ## add error-bars note we need the standard error of the mean of the proportion
    scale_x_continuous(breaks = seq(8, 12, 1),
                       labels = seq(2008, 2012, 1),
                       expand = expansion(0.01),
                       limits = c(08, 12)) + ## match x-axis labels and limits
    scale_y_continuous(breaks = seq(0, 1, 0.1), labels = seq(0, 100, 10),
                       expand = expansion(0.01),
                       limits = c(0, 0.7)) + ## match y-axis labels and limits
    theme_classic() + ## closest in-built ggplot theme I could find
    theme(legend.title = element_blank()) ## remove legend title
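A quick aside on the error bars in the snippet above: the sd/sqrt(6) term is the usual standard error of a sample mean, here assuming (as the code implies) \(n = 6\) replicate observations per treatment per year:

\[ \mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{n}} = \frac{s}{\sqrt{6}} \]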

Week 02 - Lecture 02

Rough Agenda

2:05 pm - 3:55 pm

Before we begin, sort yourselves into groups following Section 3.2

Tip: Before modelling, you should first be able to simulate!
  • Translate a data-generating story into a formal (linear) model
  • Identify and justify the assumptions required for simulation
  • Separate assumptions about covariates from assumptions about errors
  • Explore the consequences of assumption violations

Simulation

Warning What is a Model?

Discuss, then open StatChat and answer!

Let’s improv!

I’m going to attempt to do this on the fly. So, when I prompt you, yell out an answer to the questions below!

  • Something measurable we might be interested in (e.g., commute time to university)
  • Something that might affect this thing of interest (e.g., distance to university)
  • Something else that might affect this thing of interest (e.g., mode of transport)
  • Anything else…

Simulate your own scenario

In your groups

  • Define a measurable outcome of interest
  • Identify predictors (something you think will influence the outcome)
  • What do you think this relationship will be (e.g., linear)? Formulate this systematic relationship
  • Now, think about the random variation (error)
  • Simulate your outcome.
  • What are your assumptions? Write them down!

Be prepared to discuss your progress/thoughts with me as I wander around the class.

Extra things to think about: Could there be non-linear relationships? Could there be interaction effects? Could some predictors be correlated? Could errors be non-constant or clustered?
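If your group is unsure where to start, below is one possible sketch using the commute-time example from the improv above. Every number is made up, and it bakes in strong assumptions (a linear systematic part; independent, constant-variance Normal errors) that you should question:

```r
set.seed(738)                                 # make the simulation reproducible
n <- 100
## assumptions about the covariates
distance <- runif(n, 1, 20)                   # km to university
mode <- sample(c("bike", "bus", "car"), n, replace = TRUE)
## systematic component: linear in distance, shifted by transport mode
mode_effect <- c(bike = 4, bus = 10, car = 2) # minutes added per mode (made up)
mu <- 5 + 2 * distance + mode_effect[mode]
## random component: independent Normal errors with constant variance
commute <- mu + rnorm(n, mean = 0, sd = 3)    # simulated commute times (minutes)
## always plot what you've simulated!
plot(distance, commute, col = factor(mode))
```

Refitting lm(commute ~ distance + mode) to these simulated data is a good check that you can (roughly) recover the parameters you chose.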

Warning

Once you’ve simulated a scenario you’re happy with (individually) paste your code and assumptions into StatChat.

From numbers to plots

Below is some R output showing data summaries and linear model summary output. In your groups, read through and interpret the output; then, on the whiteboards, sketch the model fit (including uncertainty) for each case.

data
# A tibble: 30 × 21
   film   category `worldwide gross ($m)` `% budget recovered` `critics % score`
   <chr>  <chr>                     <dbl> <chr>                <chr>            
 1 Ant-M… Ant-Man                     518 398%                 83%              
 2 Ant-M… Ant-Man                     623 479%                 87%              
 3 Aveng… Avengers                   1395 382%                 76%              
 4 Aveng… Avengers                   2797 699%                 94%              
 5 Aveng… Avengers                   2048 683%                 85%              
 6 Black… Black P…                   1336 668%                 96%              
 7 Black… Black P…                    855 342%                 84%              
 8 Black… Unique                      379 190%                 79%              
 9 Capta… Captain…                    370 264%                 79%              
10 Capta… Captain…                   1151 460%                 90%              
# ℹ 20 more rows
# ℹ 16 more variables: `audience % score` <chr>,
#   `audience vs critics % deviance` <chr>, budget <dbl>,
#   `domestic gross ($m)` <dbl>, `international gross ($m)` <dbl>,
#   `opening weekend ($m)` <dbl>, `second weekend ($m)` <dbl>,
#   `1st vs 2nd weekend drop off` <chr>, `% gross from opening weekend` <dbl>,
#   `% gross from domestic` <chr>, `% gross from international` <chr>, …
str(data)
tibble [30 × 21] (S3: tbl_df/tbl/data.frame)
 $ film                          : chr [1:30] "Ant-Man" "Ant-Man & The Wasp" "Avengers: Age of Ultron" "Avengers: End Game" ...
 $ category                      : chr [1:30] "Ant-Man" "Ant-Man" "Avengers" "Avengers" ...
 $ worldwide gross ($m)          : num [1:30] 518 623 1395 2797 2048 ...
 $ % budget recovered            : chr [1:30] "398%" "479%" "382%" "699%" ...
 $ critics % score               : chr [1:30] "83%" "87%" "76%" "94%" ...
 $ audience % score              : chr [1:30] "85%" "80%" "82%" "90%" ...
 $ audience vs critics % deviance: chr [1:30] "-2%" "7%" "-6%" "4%" ...
 $ budget                        : num [1:30] 130 130 365 400 300 200 250 200 140 250 ...
 $ domestic gross ($m)           : num [1:30] 180 216 459 858 678 700 453 183 176 408 ...
 $ international gross ($m)      : num [1:30] 338 406 936 1939 1369 ...
 $ opening weekend ($m)          : num [1:30] 57 75.8 191 357 257 202 181 80.3 65 179 ...
 $ second weekend ($m)           : num [1:30] 24 29 77 147 114 111 66 25.8 25 72.6 ...
 $ 1st vs 2nd weekend drop off   : chr [1:30] "-58%" "-62%" "-60%" "-59%" ...
 $ % gross from opening weekend  : num [1:30] 31.8 35 41.7 41.6 38 28.9 48.6 43.8 36.8 43.9 ...
 $ % gross from domestic         : chr [1:30] "34.7%" "34.7%" "32.9%" "30.7%" ...
 $ % gross from international    : chr [1:30] "65.3%" "65.2%" "67.1%" "69.3%" ...
 $ % budget opening weekend      : chr [1:30] "43.8%" "58.3%" "52.3%" "89.3%" ...
 $ year                          : num [1:30] 2015 2018 2015 2019 2018 ...
 $ source                        : chr [1:30] "https://www.the-numbers.com/movie/Ant-Man#tab=summary" "https://www.the-numbers.com/movie/Ant-Man-and-the-Wasp#tab=summary" "https://www.the-numbers.com/movie/Avengers-Age-of-Ultron#tab=summary" "https://www.the-numbers.com/movie/Avengers-Endgame-(2019)#tab=summary" ...
 $ critics_score                 : num [1:30] 83 87 76 94 85 96 84 79 79 90 ...
 $ budget_recovered              : num [1:30] 398 479 382 699 683 668 342 190 264 460 ...
##### ##### ##### ###
##### Model 1 #######
##### ##### ##### ###
data %>%
    lm(budget_recovered ~ critics_score, data = .) |>
    summary()

Call:
lm(formula = budget_recovered ~ critics_score, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-240.48 -101.16  -27.18  108.53  410.57 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)   -242.513    215.335  -1.126  0.26964   
critics_score    8.472      2.585   3.278  0.00279 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 156.8 on 28 degrees of freedom
Multiple R-squared:  0.2773,    Adjusted R-squared:  0.2515 
F-statistic: 10.74 on 1 and 28 DF,  p-value: 0.002795
##### ##### ##### ###
##### Model 2 #######
##### ##### ##### ###
names(table(data$category))
 [1] "Ant-Man"         "Avengers"        "Black Panther"   "Captain America"
 [5] "Dr Strange"      "Guardians"       "Iron Man"        "Spider-Man"     
 [9] "Thor"            "Unique"         
mod <- data %>%
    lm(budget_recovered ~  category, data = .) 
mod |> summary()

Call:
lm(formula = budget_recovered ~ category, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-227.25  -95.92  -11.50   71.25  341.60 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              438.500    109.360   4.010 0.000688 ***
categoryAvengers         170.750    133.938   1.275 0.216977    
categoryBlack Panther     66.500    154.658   0.430 0.671808    
categoryCaptain America  -57.167    141.183  -0.405 0.689841    
categoryDr Strange         4.500    154.658   0.029 0.977076    
categoryGuardians          5.500    154.658   0.036 0.971984    
categoryIron Man          -9.167    141.183  -0.065 0.948876    
categorySpider-Man       283.500    141.183   2.008 0.058339 .  
categoryThor             -64.000    133.938  -0.478 0.637950    
categoryUnique          -135.100    129.397  -1.044 0.308906    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 154.7 on 20 degrees of freedom
Multiple R-squared:  0.4977,    Adjusted R-squared:  0.2717 
F-statistic: 2.202 on 9 and 20 DF,  p-value: 0.06791
##### ##### ##### ###
##### Model 3 #######
##### ##### ##### ###
mod <- data %>%
    lm(budget_recovered ~ critics_score + category, data = .) 
mod |> summary()

Call:
lm(formula = budget_recovered ~ critics_score + category, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-221.617  -82.872    1.029   59.007  310.560 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)  
(Intercept)               26.257    279.136   0.094   0.9260  
critics_score              4.850      3.041   1.595   0.1272  
categoryAvengers         163.475    129.131   1.266   0.2208  
categoryBlack Panther     42.250    149.789   0.282   0.7809  
categoryCaptain America  -63.633    136.092  -0.468   0.6454  
categoryDr Strange        21.475    149.395   0.144   0.8872  
categoryGuardians        -11.475    149.395  -0.077   0.9396  
categoryIron Man           8.616    136.488   0.063   0.9503  
categorySpider-Man       251.167    137.534   1.826   0.0836 .
categoryThor             -15.501    132.585  -0.117   0.9082  
categoryUnique           -74.961    130.253  -0.576   0.5717  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 149 on 19 degrees of freedom
Multiple R-squared:  0.557, Adjusted R-squared:  0.3239 
F-statistic: 2.389 on 10 and 19 DF,  p-value: 0.04909
##### ##### ##### ###
##### Model 4 #######
##### ##### ##### ###
data <- data %>%
    rename(., "worldwide" = `worldwide gross ($m)`, "domestic" = `domestic gross ($m)`,
                 "international" = `international gross ($m)`)
mod <- lm(worldwide ~ 0 + domestic + international, data = data) 
mod |> summary()

Call:
lm(formula = worldwide ~ 0 + domestic + international, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.4167 -0.3363  0.4036  0.6414  0.8259 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
domestic      1.0016977  0.0010233   978.9   <2e-16 ***
international 0.9995891  0.0006227  1605.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6566 on 28 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 4.147e+07 on 2 and 28 DF,  p-value: < 2.2e-16
##### ##### ##### ###
##### Model 5 #######
##### ##### ##### ###
mod <- lm(worldwide ~ budget*critics_score, data = data) 
mod |> summary()

Call:
lm(formula = worldwide ~ budget * critics_score, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-491.98 -169.76  -36.49  153.46  792.52 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)  
(Intercept)          1936.38319 1685.07029   1.149   0.2610  
budget                -12.31026    8.23297  -1.495   0.1469  
critics_score         -26.74895   19.92735  -1.342   0.1911  
budget:critics_score    0.22214    0.09679   2.295   0.0301 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 298.6 on 26 degrees of freedom
Multiple R-squared:  0.7479,    Adjusted R-squared:  0.7189 
F-statistic: 25.72 on 3 and 26 DF,  p-value: 6.07e-08

Extra Activities (time dependent)

Reproducible analysis

Tip: Exploring data

Building on Section 3.3, here you’ll carry out some basic EDA and interpret hypothesis tests.

In this task you’re going to be using these data collected as part of a (now retracted) study (Shu et al. (2012))2.

The data comes from a study that claims to show that people are less likely to act dishonestly when they sign an honesty pledge at the top of a form rather than at the bottom of a form. Participants received a worksheet with 20 math puzzles and were offered $1 for each puzzle they (reported to have) solved correctly within 5 minutes. After the 5 minutes passed, participants were asked to count how many puzzles they solved correctly and then throw away their worksheets. The goal was to mislead participants into thinking that the experimenter could not observe their true performance, when in fact they could because each worksheet had a unique identifier. Thus, participants could cheat (and earn more money) without fear of being caught, while the researchers could observe how much each participant had cheated. Participants then completed a “tax” form reporting how much money they had earned, and also how much time and money they spent coming to the lab. The experimenters partially compensated participants for those costs.

The paper reported very large effects. Signing at the top vs. the bottom lowered the share of people over-reporting their math puzzle performance from 79% to 37% (p = .0013), and lowered the average amount of over-reporting from 3.94 puzzles to 0.77 puzzles (p < .00001). Similarly, it nearly halved the average amount of claimed commuting expenses, from $9.62 to $5.27 (p = .0014).

Variables of interest

  • Cond: = 0, No signature; = 1, Signed at the top; and = 2, Signed at the bottom.
  • CheatedOnMatrixTax: = 0, participants didn’t cheat/over-report their maths puzzle performance; and = 1, participants did cheat/over-report their maths puzzle performance.
  • OverReport: the number of puzzles by which participants over-reported their performance.
  • SumDeduction: the amount of claimed commuting expenses $US (corrected for the true amount).

As a group:

  1. Discuss what the presented results indicate/claim.

  2. Use the variables listed above and reproduce the results quoted above using the appropriate statistical analysis/test. How easy was it to replicate these results given the way the results were presented? How might you improve the way the results were presented?

  3. There is an additional column of interest, flag; this is a binary variable that indicates whether the observations were considered dodgy (i.e., fraudulent) or not. Create a visualisation/carry out some analysis that could be used as evidence for/against this belief.
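For question 2, the sketch below shows the kinds of tests involved, using the variable coding listed above. The shu data frame here is tiny FAKE data, invented purely so the snippet runs; swap in the real study data (and think carefully about whether these are the appropriate analyses):

```r
## FAKE stand-in data, invented so this sketch runs; replace with the real study data
set.seed(1)
shu <- data.frame(
  Cond = rep(c(1, 2), each = 20),   # 1 = signed at the top, 2 = signed at the bottom
  CheatedOnMatrixTax = rbinom(40, 1, rep(c(0.8, 0.4), each = 20)),
  OverReport = rpois(40, rep(c(4, 1), each = 20)),
  SumDeduction = rnorm(40, rep(c(9.6, 5.3), each = 20), sd = 2)
)
## the 79% vs 37% claim: compare the proportions who over-reported
cheated <- tapply(shu$CheatedOnMatrixTax, shu$Cond, sum)
prop.test(x = cheated, n = as.vector(table(shu$Cond)))
## the over-reporting and commuting-expenses claims: two-sample t-tests
t.test(OverReport ~ Cond, data = shu)
t.test(SumDeduction ~ Cond, data = shu)
```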

Fry time

Warning GitHub script upload

Below is a plot of some data collected as part of a study carried out by Abdul Hamid, Omar, and Sanny (2019).

Figure 1 from Abdul Hamid, Omar, and Sanny (2019)

In the tasks folder of your GitHub repository, create, add, and push a reproducible R script named week_0x.r (where x is the current semester week) that simulates these data as best you can. Remember to update your README.md accordingly.

Activity solutions

Note: Suggested R code solutions

This section will be updated with my suggested solutions to the class-wide activities after their conclusion. I strongly recommend that you review these outside of class, if only to familiarise yourself with a different approach.

Suggested solution to Section 3.3

require(tidyverse)
## read in data
data <- readr::read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/BMI.csv")
data
## Mean BMI
data %>%
    summarise(mean(bmi)) %>%
    round()
## Max BMI males
data %>%
    filter(gender == "male") %>%
    filter(bmi == max(bmi))
## t-test BMI males vs females
t.test(bmi ~ gender, data = data)
lm(bmi ~ gender, data = data) |> summary()
## t-test steps males vs females
t.test(steps ~ gender, data = data) 
lm(steps ~ gender, data = data) |> summary()
## plotting. ALWAYS PLOT YOUR DATA!
ggplot(data, aes(x = steps, y = bmi, col = gender)) +
    geom_point()

Suggested solution to Section 4.3

## data source: https://docs.google.com/spreadsheets/d/1YSJ4ypkYLq6j1mIBJCgUHhHjJZQ0Rkfe1qW2WC5HLiw/edit#gid=748627588
library(tidyverse)
data <- read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/marvel_movies.csv") %>%
    mutate(critics_score = as.numeric(str_replace_all(`critics % score`, "%", ""))) %>%
    mutate(budget_recovered = as.numeric(str_replace_all(`% budget recovered`, "%", "")))
##### ##### ##### ###
##### Model 1 #######
##### ##### ##### ###
data %>%
    lm(budget_recovered ~ critics_score, data = .) |>
    summary()
## Ans
data %>%
    ggplot(., aes(y = budget_recovered, x = critics_score)) +
    geom_smooth(method = "lm", se = FALSE) + ylab(" Estimated budget_recovered")
## w. data
data %>%
    ggplot(., aes(y = budget_recovered, x = critics_score)) +
    geom_smooth(method = "lm", se = FALSE) + geom_point()
##### ##### ##### ###
##### Model 2 #######
##### ##### ##### ###
names(table(data$category))
mod <- data %>%
    lm(budget_recovered ~  category, data = .) 
mod |> summary()
## Ans
data$pred_vals <- predict(mod)
## plot
data %>%
    ggplot(aes(y = pred_vals, x = category)) +
    geom_point(size = 15, pch = "-") + ylab(" Estimated budget_recovered")

##### ##### ##### ###
##### Model 3 #######
##### ##### ##### ###
mod <- data %>%
    lm(budget_recovered ~ critics_score + category, data = .) 
mod |>  summary()
## Ans
data$pred_vals <- predict(mod)
## plot
data %>%
    ggplot(aes(y = pred_vals, x = critics_score, color = category)) +
    geom_line() + ylab(" Estimated budget_recovered")

##### ##### ##### ###
##### Model 4 #######
##### ##### ##### ###
data <- data %>%
    rename(., "worldwide" = `worldwide gross ($m)`, "domestic" = `domestic gross ($m)`,
                 "international" = `international gross ($m)`)
mod <- lm(worldwide ~ 0 + domestic + international, data = data) 
mod |> summary()
## Ans
## plot
require(rsm) ## install if not available
persp(mod, form =  ~ 0 + domestic + international)
image(mod, form =  ~ 0 + domestic + international)
##### ##### ##### ###
##### Model 5 #######
##### ##### ##### ###
mod <- lm(worldwide ~ budget*critics_score, data = data) 
mod |>  summary()
## Ans
## plot
persp(mod, form =  ~ budget*critics_score)
image(mod, form =  ~ budget*critics_score)

References

Abdul Hamid, N., S. Omar, and M. Sanny. 2019. “Effect of Thawing Conditions and Corresponding Frying Temperature Profiles on the Formation of Acrylamide in French Fries.” Journal of the Saudi Society of Agricultural Sciences 18 (4): 396–400. https://doi.org/https://doi.org/10.1016/j.jssas.2018.01.002.
Hicks, M. T., J. Humphries, and J. Slater. 2024. “ChatGPT Is Bullshit.” Ethics and Information Technology 26 (38). https://doi.org/10.1007/s10676-024-09775-5.
Sheard, Judy, Paul Denny, Arto Hellas, Juho Leinonen, Lauri Malmi, and Simon. 2024. “Instructor Perceptions of AI Code Generation Tools - a Multi-Institutional Interview Study.” In. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3626252.3630880.
Shu, Lisa L., Nina Mazar, Francesca Gino, Dan Ariely, and Max H. Bazerman. 2012. “RETRACTED: Signing at the Beginning Makes Ethics Salient and Decreases Dishonest Self-Reports in Comparison to Signing at the End.” Proceedings of the National Academy of Sciences 109 (38): 15197–200. https://doi.org/10.1073/pnas.1209746109.
Wilson, Greg, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K. Teal. 2017. “Good Enough Practices in Scientific Computing.” PLoS Computational Biology 13 (6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510.

Footnotes

  1. The former example is a solution posed on this StackOverflow post asking about email address validation. The latter snippet is my attempt at a long-winded (and far less stable) approach.↩︎

  2. See this Guardian article about the debacle↩︎