All BIOSCI 738 lectures require your active involvement! See the course policies for further information.
Throughout this runsheet you’ll find a number of different callout boxes:
A big part of the learning experience in BIOSCI738 comes from you engaging with all the material suggested. Have you familiarised yourself with the BIOSCI738 CANVAS page and the Course policies and FAQs? Let’s see…
What’s the name of my dog? (Hint: he stars in the introductory video.)
Try out the \(R^2\)-D2 AI agent specifically designed for this purpose. I would suggest tackling around 30 questions, or continuing until you are completely comfortable with the content (basically Module 1 from BIOSCI 220). If you’re stuck, \(R^2\)-D2 will guide you through the answers too!


Each lecture will involve a few group-based activities/discussions. It is your responsibility to organise yourselves into groups (see below). Feel free to reorganise the tables to suit your needs!
Designate roles:
Determine who is filling each role by order of upcoming birthdays:
In your groups quickly determine who is filling what role.
Once you have allocated group roles your Reporter should come to me to retrieve the first instruction!
Recently, there has been considerable interest in large language models: machine learning systems which produce human-like text and dialogue. Applications of these systems have been plagued by persistent inaccuracies in their output; these are often called “AI hallucinations”. We argue that these falsehoods, and the overall activity of large language models, is better understood as bullshit in the sense explored by Frankfurt (On Bullshit, Princeton, 2005): the models are in an important way indifferent to the truth of their outputs.
This is an excerpt from the abstract of Hicks, Humphries, and Slater (2024).
If a student is confused about a concept, they can sit with ChatGPT and it will talk to them for hours about that particular concept.
It is a really great tool to create code but also a really great tool to prevent yourself from learning.
It’s teachers’ responsibility to motivate them and make such a problem that [students] are keen to solve and in a way that they actually would like to learn something and realize that they need these skills also in the future.
…there have always been so many ways of cheating, but I don’t think I’ve ever been aware of such an obvious, cheap, and easy way of cheating. Students can get [an AI tool] to answer any question I can ask them at the moment and therefore I have lost my ability to confidently assess any work that students hand in.
I think we need different kinds of professionals with different understandings of computing. Some need to be very deeply involved with how our programming languages work … others might only need some kind of overall understanding. They are not programmers by themselves, but they still should understand how software is produced.
All the above are quotes gathered in Sheard et al. (2024).
Below, generative AI refers to tools that can generate text, code, explanations, or other content in response to prompts (e.g., large language models and AI coding assistants).
Upon completion of this activity I will summarise the main themes/suggestions from Section 3.5.2 (that I deem appropriate). This will become the class-agreed group-working Code of Conduct that you are expected to adhere to during each activity.
As a University of Auckland student, you are responsible for understanding and abiding by the requirements of the Student Charter.
In this activity we’re going to be talking about my and your expectations when working in a group during this class; see this section of the course guide for further details.
A Code of Conduct is not just a strange thing the university makes you sign. Codes of conduct are a large part of many professional and research-focused bodies beyond university. The following lists just a few examples of societies or institutes you will likely come across during a biostats career in NZ.
For a more in-depth and general discussion I recommend reading Wilson et al. (2017).
Following this section of the course guide let’s talk about what good programming practice looks like in this course.
Honestly, I think the default RStudio behaviour of restoring .RData files etc. just makes everyone lazy…
During this course, very likely in other courses you’ll be taking this semester, and in your future careers, you will have to deal with many different datasets, wrangle “dirty” data, and deal with data from different sources (at the very least). The key thing is to ensure that ANY ANALYSIS YOU CARRY OUT is TRANSPARENT and FULLY REPRODUCIBLE (either for your peers or future you). This is where setting good foundations and devising a well-thought-out workflow is imperative!
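As a rough sketch only (the file names, seed, and comments here are placeholders, not course requirements), a transparent and reproducible script might be structured something like this:

```r
## analysis.R -- a fresh R session should run this file top to bottom
library(tidyverse)    ## load all required packages up front

set.seed(738)         ## fix randomness so any simulated results repeat

## read data from a stable, relative location (placeholder path)
data <- readr::read_csv("data/my_data.csv")

## ... wrangling and analysis steps, each briefly commented ...

## never rely on objects left over in the workspace or in .RData;
## everything needed should be created by the script itself
```

The point is not this exact layout, but that someone (including future you) can rerun the whole analysis from a clean session without guessing at hidden state.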
Some people think that writing a large number of lines of code demonstrates prowess. It does not. Surely we’ve all added nonsense to essays to “fill up” the word count!
On the other hand, some people strive to carry out operations in the fewest lines possible. This typically makes their work impossible to follow!
I recommend finding a spot you’re comfortable in between the following code snippets.1 Most importantly, keep your style readable & consistent!
grepl("^[[:alnum:]._-]+@[[:alnum:].-]+[:alnum:]+$",
      c("larry@gmail.com", "larry-sally@sally.com", "larry@sally.larry.com",
        "test@test.com.", "charlottejones-todd"))
[1]  TRUE  TRUE  TRUE FALSE FALSE
or
email_addresses <- c("larry@gmail.com", "larry-sally@sally.com", "larry@sally.larry.com", "test@test.com.", "charlottejones-todd")
contain_at <- function(x){
grep("@", x)
}
idx <- contain_at(email_addresses)
correct_email <- email_addresses[idx]
correct_email
[1] "larry@gmail.com"       "larry-sally@sally.com" "larry@sally.larry.com"
[4] "test@test.com."
contain_notrailing <- function(x){
grep("^[:alnum:]+", x)
}
idx01 <- contain_notrailing(correct_email)
final_correct_email <- correct_email[idx01]
final_correct_email
[1] "larry@gmail.com"       "larry-sally@sally.com" "larry@sally.larry.com"
An R script (or equivalent) is a roadmap to your work.
You should present the cleanest, most direct route you can!
I recommend the latter approach below (if you were going to pass on your solution, that is). It’s not that each step shouldn’t be carried out. On the contrary, exploring your data via printing and plotting it is very important! But when you have a solution, pare down your script! No need to take everyone on your journey.
## read in data
data <- readr::read_csv("a_valid_filename")
## printing data
print(data)
View(data) ## View() opens a new window
data$variable
## Create new data object
newdata <- data$variable
newdata
## plot data
plot(newdata)
## calculate mean
mean <- mean(newdata)
print(mean)
## round
round(mean(newdata))
vs
data <- readr::read_csv("a_valid_filename")
round(mean(data$variable))
An R script (or equivalent) showcases your approach!
There are a few quirks that AI insists on including in R scripts; with a bit of knowledge these are unnecessary. Rightly or wrongly, as soon as I see these in a script I become suspicious!
print() statements, e.g., print(variable)
cat() statements, e.g., cat("There are", variable, "students enrolled.\n")
Anyone know why these are telltale signs?
I’m going to attempt to do this on the fly. So, when I prompt you, yell out an answer to the questions below!
In your groups
Be prepared to discuss your progress/thoughts with me as I wander around the class.
Extra things to think about: Could there be non-linear relationships? Could there be interaction effects? Could some predictors be correlated? Could errors be non-constant or clustered?
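As a hedged sketch of how these ideas map onto R’s model formula syntax (y, x1, x2, and df below are placeholder names on toy data, not variables from the course datasets):

```r
set.seed(1)
## toy data standing in for real predictors (placeholder names)
df <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
df$y <- 2 * df$x1 + rnorm(30)

## non-linear relationship: add a quadratic term via poly()
summary(lm(y ~ poly(x1, 2), data = df))

## interaction effect: x1 * x2 expands to x1 + x2 + x1:x2
summary(lm(y ~ x1 * x2, data = df))

## correlated predictors: inspect pairwise correlations first
cor(df[, c("x1", "x2")])

## non-constant or clustered errors: plot residuals against fitted values
mod <- lm(y ~ x1 + x2, data = df)
plot(fitted(mod), resid(mod))
```

These are diagnostic starting points, not a recipe; which (if any) applies depends on the data in front of you.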
Below is some R output showing data summaries and linear model summary output. In your groups, read through and interpret the output; then, on the whiteboards, sketch the model fit (including uncertainty) for each case.
data
# A tibble: 30 × 21
film category `worldwide gross ($m)` `% budget recovered` `critics % score`
<chr> <chr> <dbl> <chr> <chr>
1 Ant-M… Ant-Man 518 398% 83%
2 Ant-M… Ant-Man 623 479% 87%
3 Aveng… Avengers 1395 382% 76%
4 Aveng… Avengers 2797 699% 94%
5 Aveng… Avengers 2048 683% 85%
6 Black… Black P… 1336 668% 96%
7 Black… Black P… 855 342% 84%
8 Black… Unique 379 190% 79%
9 Capta… Captain… 370 264% 79%
10 Capta… Captain… 1151 460% 90%
# ℹ 20 more rows
# ℹ 16 more variables: `audience % score` <chr>,
# `audience vs critics % deviance` <chr>, budget <dbl>,
# `domestic gross ($m)` <dbl>, `international gross ($m)` <dbl>,
# `opening weekend ($m)` <dbl>, `second weekend ($m)` <dbl>,
# `1st vs 2nd weekend drop off` <chr>, `% gross from opening weekend` <dbl>,
# `% gross from domestic` <chr>, `% gross from international` <chr>, …
str(data)
tibble [30 × 21] (S3: tbl_df/tbl/data.frame)
$ film : chr [1:30] "Ant-Man" "Ant-Man & The Wasp" "Avengers: Age of Ultron" "Avengers: End Game" ...
$ category : chr [1:30] "Ant-Man" "Ant-Man" "Avengers" "Avengers" ...
$ worldwide gross ($m) : num [1:30] 518 623 1395 2797 2048 ...
$ % budget recovered : chr [1:30] "398%" "479%" "382%" "699%" ...
$ critics % score : chr [1:30] "83%" "87%" "76%" "94%" ...
$ audience % score : chr [1:30] "85%" "80%" "82%" "90%" ...
$ audience vs critics % deviance: chr [1:30] "-2%" "7%" "-6%" "4%" ...
$ budget : num [1:30] 130 130 365 400 300 200 250 200 140 250 ...
$ domestic gross ($m) : num [1:30] 180 216 459 858 678 700 453 183 176 408 ...
$ international gross ($m) : num [1:30] 338 406 936 1939 1369 ...
$ opening weekend ($m) : num [1:30] 57 75.8 191 357 257 202 181 80.3 65 179 ...
$ second weekend ($m) : num [1:30] 24 29 77 147 114 111 66 25.8 25 72.6 ...
$ 1st vs 2nd weekend drop off : chr [1:30] "-58%" "-62%" "-60%" "-59%" ...
$ % gross from opening weekend : num [1:30] 31.8 35 41.7 41.6 38 28.9 48.6 43.8 36.8 43.9 ...
$ % gross from domestic : chr [1:30] "34.7%" "34.7%" "32.9%" "30.7%" ...
$ % gross from international : chr [1:30] "65.3%" "65.2%" "67.1%" "69.3%" ...
$ % budget opening weekend : chr [1:30] "43.8%" "58.3%" "52.3%" "89.3%" ...
$ year : num [1:30] 2015 2018 2015 2019 2018 ...
$ source : chr [1:30] "https://www.the-numbers.com/movie/Ant-Man#tab=summary" "https://www.the-numbers.com/movie/Ant-Man-and-the-Wasp#tab=summary" "https://www.the-numbers.com/movie/Avengers-Age-of-Ultron#tab=summary" "https://www.the-numbers.com/movie/Avengers-Endgame-(2019)#tab=summary" ...
$ critics_score : num [1:30] 83 87 76 94 85 96 84 79 79 90 ...
$ budget_recovered : num [1:30] 398 479 382 699 683 668 342 190 264 460 ...
##### ##### ##### ###
##### Model 1 #######
##### ##### ##### ###
data %>%
lm(budget_recovered ~ critics_score, data = .) |>
summary()
Call:
lm(formula = budget_recovered ~ critics_score, data = .)
Residuals:
Min 1Q Median 3Q Max
-240.48 -101.16 -27.18 108.53 410.57
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -242.513 215.335 -1.126 0.26964
critics_score 8.472 2.585 3.278 0.00279 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 156.8 on 28 degrees of freedom
Multiple R-squared: 0.2773, Adjusted R-squared: 0.2515
F-statistic: 10.74 on 1 and 28 DF, p-value: 0.002795
##### ##### ##### ###
##### Model 2 #######
##### ##### ##### ###
names(table(data$category))
 [1] "Ant-Man"         "Avengers"        "Black Panther"   "Captain America"
[5] "Dr Strange" "Guardians" "Iron Man" "Spider-Man"
[9] "Thor" "Unique"
mod <- data %>%
lm(budget_recovered ~ category, data = .)
mod |> summary()
Call:
lm(formula = budget_recovered ~ category, data = .)
Residuals:
Min 1Q Median 3Q Max
-227.25 -95.92 -11.50 71.25 341.60
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 438.500 109.360 4.010 0.000688 ***
categoryAvengers 170.750 133.938 1.275 0.216977
categoryBlack Panther 66.500 154.658 0.430 0.671808
categoryCaptain America -57.167 141.183 -0.405 0.689841
categoryDr Strange 4.500 154.658 0.029 0.977076
categoryGuardians 5.500 154.658 0.036 0.971984
categoryIron Man -9.167 141.183 -0.065 0.948876
categorySpider-Man 283.500 141.183 2.008 0.058339 .
categoryThor -64.000 133.938 -0.478 0.637950
categoryUnique -135.100 129.397 -1.044 0.308906
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 154.7 on 20 degrees of freedom
Multiple R-squared: 0.4977, Adjusted R-squared: 0.2717
F-statistic: 2.202 on 9 and 20 DF, p-value: 0.06791
##### ##### ##### ###
##### Model 3 #######
##### ##### ##### ###
mod <- data %>%
lm(budget_recovered ~ critics_score + category, data = .)
mod |> summary()
Call:
lm(formula = budget_recovered ~ critics_score + category, data = .)
Residuals:
Min 1Q Median 3Q Max
-221.617 -82.872 1.029 59.007 310.560
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.257 279.136 0.094 0.9260
critics_score 4.850 3.041 1.595 0.1272
categoryAvengers 163.475 129.131 1.266 0.2208
categoryBlack Panther 42.250 149.789 0.282 0.7809
categoryCaptain America -63.633 136.092 -0.468 0.6454
categoryDr Strange 21.475 149.395 0.144 0.8872
categoryGuardians -11.475 149.395 -0.077 0.9396
categoryIron Man 8.616 136.488 0.063 0.9503
categorySpider-Man 251.167 137.534 1.826 0.0836 .
categoryThor -15.501 132.585 -0.117 0.9082
categoryUnique -74.961 130.253 -0.576 0.5717
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 149 on 19 degrees of freedom
Multiple R-squared: 0.557, Adjusted R-squared: 0.3239
F-statistic: 2.389 on 10 and 19 DF, p-value: 0.04909
##### ##### ##### ###
##### Model 4 #######
##### ##### ##### ###
data <- data %>%
rename(., "worldwide" = `worldwide gross ($m)`, "domestic" = `domestic gross ($m)`,
"international" = `international gross ($m)`)
mod <- lm(worldwide ~ 0 + domestic + international, data = data)
mod |> summary()
Call:
lm(formula = worldwide ~ 0 + domestic + international, data = data)
Residuals:
Min 1Q Median 3Q Max
-1.4167 -0.3363 0.4036 0.6414 0.8259
Coefficients:
Estimate Std. Error t value Pr(>|t|)
domestic 1.0016977 0.0010233 978.9 <2e-16 ***
international 0.9995891 0.0006227 1605.3 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6566 on 28 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 4.147e+07 on 2 and 28 DF, p-value: < 2.2e-16
##### ##### ##### ###
##### Model 5 #######
##### ##### ##### ###
mod <- lm(worldwide ~ budget*critics_score, data = data)
mod |> summary()
Call:
lm(formula = worldwide ~ budget * critics_score, data = data)
Residuals:
Min 1Q Median 3Q Max
-491.98 -169.76 -36.49 153.46 792.52
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1936.38319 1685.07029 1.149 0.2610
budget -12.31026 8.23297 -1.495 0.1469
critics_score -26.74895 19.92735 -1.342 0.1911
budget:critics_score 0.22214 0.09679 2.295 0.0301 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 298.6 on 26 degrees of freedom
Multiple R-squared: 0.7479, Adjusted R-squared: 0.7189
F-statistic: 25.72 on 3 and 26 DF, p-value: 6.07e-08
Building on Section 3.3, here you’ll be carrying out some basic EDA and interpreting hypothesis tests.
In this task you’re going to be using these data collected as part of a (now retracted) study (Shu et al. (2012))2.
The data comes from a study that claims to show that people are less likely to act dishonestly when they sign an honesty pledge at the top of a form rather than at the bottom of a form. Participants received a worksheet with 20 math puzzles and were offered $1 for each puzzle they (reported to have) solved correctly within 5 minutes. After the 5 minutes passed, participants were asked to count how many puzzles they solved correctly and then throw away their worksheets. The goal was to mislead participants into thinking that the experimenter could not observe their true performance, when in fact they could because each worksheet had a unique identifier. Thus, participants could cheat (and earn more money) without fear of being caught, while the researchers could observe how much each participant had cheated. Participants then completed a “tax” form reporting how much money they had earned, and also how much time and money they spent coming to the lab. The experimenters partially compensated participants for those costs.
The paper reported very large effects. Signing at the top vs. the bottom lowered the share of people over-reporting their math puzzle performance from 79% to 37% (p = .0013), and lowered the average amount of over-reporting from 3.94 puzzles to 0.77 puzzles (p < .00001). Similarly, it nearly halved the average amount of claimed commuting expenses, from $9.62 to $5.27 (p = .0014).
Variables of interest
Cond: = 0, No signature; = 1, Signed at the top; and = 2, Signed at the bottom.
CheatedOnMatrixTax: = 0, participants didn’t cheat/over-report their maths puzzle performance; and = 1, participants did cheat/over-report their maths puzzle performance.
OverReport: the number of puzzles participants over-reported having solved.
SumDeduction: the amount of claimed commuting expenses in $US (corrected for the true amount).
As a group:
Discuss what the presented results indicate/claim.
Use the variables listed above and reproduce the results quoted above using the appropriate statistical analysis/test. How easy was it to replicate these results given the way the results were presented? How might you improve the way the results were presented?
There is an additional column of interest, flag; this is a binary variable that indicates whether the observations were considered dodgy (i.e., fraudulent) or not. Create a visualization/carry out some analysis that could be used as evidence for/against this belief.
R code solutions
This section will be updated with my suggested solutions to the class-wide activities after their conclusion. I strongly recommend that you review these outside of class, if only to familiarise yourself with a different approach.
require(tidyverse)
## read in data
data <- readr::read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/BMI.csv")
data
## Mean BMI
data %>%
summarise(mean(bmi)) %>%
round()
## Max BMI males
data %>%
filter(gender == "male") %>%
filter(bmi == max(bmi))
## t-test BMI males vs females
t.test(bmi ~ gender, data = data)
lm(bmi ~ gender, data = data) |> summary()
## t-test steps males vs females
t.test(steps ~ gender, data = data)
lm(steps ~ gender, data = data) |> summary()
## plotting. ALWAYS PLOT YOUR DATA!
ggplot(data, aes(x = steps, y = bmi, col = gender)) +
geom_point()
## data source: https://docs.google.com/spreadsheets/d/1YSJ4ypkYLq6j1mIBJCgUHhHjJZQ0Rkfe1qW2WC5HLiw/edit#gid=748627588
library(tidyverse)
data <- read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/marvel_movies.csv") %>%
mutate(critics_score = as.numeric(str_replace_all(`critics % score`, "%", ""))) %>%
mutate(budget_recovered = as.numeric(str_replace_all(`% budget recovered`, "%", "")))
##### ##### ##### ###
##### Model 1 #######
##### ##### ##### ###
data %>%
lm(budget_recovered ~ critics_score, data = .) |>
summary()
## Ans
data %>%
ggplot(., aes(y = budget_recovered, x = critics_score)) +
geom_smooth(method = "lm", se = FALSE) + ylab(" Estimated budget_recovered")
## w. data
data %>%
ggplot(., aes(y = budget_recovered, x = critics_score)) +
geom_smooth(method = "lm", se = FALSE) + geom_point()
##### ##### ##### ###
##### Model 2 #######
##### ##### ##### ###
names(table(data$category))
mod <- data %>%
lm(budget_recovered ~ category, data = .)
mod |> summary()
## Ans
data$pred_vals <- predict(mod)
## plot
data %>%
ggplot(aes(y = pred_vals, x = category)) +
geom_point(size = 15, pch = "-") + ylab(" Estimated budget_recovered")
##### ##### ##### ###
##### Model 3 #######
##### ##### ##### ###
mod <- data %>%
lm(budget_recovered ~ critics_score + category, data = .)
mod |> summary()
## Ans
data$pred_vals <- predict(mod)
## plot
data %>%
ggplot(aes(y = pred_vals, x = critics_score, color = category)) +
geom_line() + ylab(" Estimated budget_recovered")
##### ##### ##### ###
##### Model 4 #######
##### ##### ##### ###
data <- data %>%
rename(., "worldwide" = `worldwide gross ($m)`, "domestic" = `domestic gross ($m)`,
"international" = `international gross ($m)`)
mod <- lm(worldwide ~ 0 + domestic + international, data = data)
mod |> summary()
## Ans
## plot
require(rsm) ## install if not available
persp(mod, form = ~ 0 + domestic + international)
image(mod, form = ~ 0 + domestic + international)
##### ##### ##### ###
##### Model 5 #######
##### ##### ##### ###
mod <- lm(worldwide ~ budget*critics_score, data = data)
mod |> summary()
## Ans
## plot
persp(mod, form = ~ budget*critics_score)
image(mod, form = ~ budget*critics_score)
The former example is a solution posted on this StackOverflow post asking about email address validation. The latter snippet is my attempt at a long-winded (and far less stable) approach.↩︎
See this Guardian article about the debacle↩︎