The genius of Steven Spilberg

This project was realized in collaboration with my study group at London Business School. We exercised our R skills by looking at the relative ratings of the movies from two famous directors.

A couple weeks ago, we analyzed some data on IMBD ratings. Today, we will explore if there is a clear difference in the ratings received between Tim Burton and Steven Spielberg.

Our Null Hypothesis: The true difference in means is equal to 0.

Alternative hypothesis: The true difference in means is not equal to 0.

To test this hypothesis, we will use both the t.test command and the infer package to simulate from a null distribution, where you assume zero difference between the two.

# Now, we want to select the date that we will use which is the rating and the directors
IMBD <- movies %>% 
  select(rating, director) %>% 
  
  # We also filtered by the two directors in question
  filter(director %in% c("Tim Burton", "Steven Spielberg")) %>% 

  # In order to compare both directors, we group by this variable
  group_by(director) %>% 
  
  # To calculate the necessary information to run a hypothesis test we use summarise
  summarise (
         
          # General Statistics
          mean_rating = 
            mean(rating, na.rm = TRUE),
          sd_rating = 
            sd(rating, na.rm=TRUE),
          count = n(),
         
          # CI Statistics
           se_rating = 
            sd_rating/sqrt(count),
          t_critical = 
            qt(0.975, count-1), 
          margin_of_error = 
            t_critical * se_rating,
          lower = 
            mean_rating - t_critical * se_rating,
          upper = 
            mean_rating + t_critical * se_rating
            )


# This line exhibits our calculations
IMBD %>% tbl_df %>% rmarkdown::paged_table()
# Now we are going to replicate the graph

graph_directors <- 
  
  ggplot(IMBD, 
  aes(
    x= mean_rating,
    y = reorder(director,mean_rating)))+
  geom_point()+
  theme_bw() +
  theme(legend.position = "none")+
  
  # The next tree lines include the mean ratings and the upper and lower boundaries of the CI
  geom_text(aes(label = round(mean_rating, 2)), size = 6, hjust = 0.5,vjust = -1)+
  
  geom_text(aes(label = round(lower, 2)), hjust = 6, vjust = -2) +
  geom_text(aes(label = round(upper, 2)), hjust = -5.5, vjust = -2) +
  
  # This line includes the error bar 
  geom_errorbar(aes(xmin = lower, 
                    xmax = upper, 
                    colour = director, 
                    fill = director), 
                width = 0.1, 
                size = 2) +
  
  # This line includes the shade to the graph in the overlap of the confidence intervals
  geom_rect(xmin = 7.27, xmax = 7.33, ymin = 0, ymax = 10, fill = "grey", aes(alpha = 0.1))+

  labs(
    title = "Do Spielberg and Burton have the same IMDB ratings?",
    subtitle = "95% confidence intervals overlap", 
    x = "Mean IMDB ratings", 
    y = "") +
  theme(plot.title = element_text(face = "bold"))
  
# Saving our graph
ggsave("graph_directors.jpg",
       plot=graph_directors,
       width = 10,height = 6)

# Displaying the graph
knitr::include_graphics("graph_directors.jpg")

#Now, we want to select the date that we will use which is the rating and the directors
steve_tim <- movies %>% 
  select(rating, director) %>% 
  filter(director %in% c("Tim Burton", "Steven Spielberg")) 

# In this line we do the ttest with the previously selected data
t.test(rating ~ director, data = steve_tim)
## 
##  Welch Two Sample t-test
## 
## data:  rating by director
## t = 3, df = 31, p-value = 0.01
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.16 1.13
## sample estimates:
## mean in group Steven Spielberg       mean in group Tim Burton 
##                           7.57                           6.93
# In order to compare the two methods, we will run the test again using the "infer" package
obs_diff_IMBD <- steve_tim %>%
  specify(rating ~ director) %>%
  calculate(
    stat = "diff in means", 
    order = c("Tim Burton", "Steven Spielberg"))


null_dist_IMBD <- steve_tim %>%
  specify(rating ~ director) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(
    stat = "diff in means", 
    order = c("Tim Burton", "Steven Spielberg"))


ggplot(data = null_dist_IMBD, 
       aes(
         x = stat)) +
  geom_histogram()

null_dist_IMBD %>% visualize() +
  shade_p_value(obs_stat = obs_diff_IMBD, direction = "two-sided")

null_dist_IMBD %>%
  get_p_value(obs_stat = obs_diff_IMBD, direction = "two_sided")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.012

Conclusion Following our tests, we reject the null hypothesis as the p-value is under our treshold at 95% confidence interval. We can therefore conclude that there is a clear difference between the average rating between Steven Spielberg and Tim Burton. So clearly, Steven Spielberg’ movies have a higher rating than Tim Burton’s ones.