Some descriptive statistics

Heads up: a very different sort of content. Quite technical. Anyway, now that I’ve got a reasonable set of reviews to work with I thought it might be interesting to do a bit of statistical analysis. I’ve been meaning to try out the new pipe and lambda syntax in R 4.1 too.

Data import

I’ve been hosting this site on GitHub Pages, so no need to scrape using httr or anything.

git clone spiritrecord.github.io

Let’s load in these files, ignore everything that’s not whisky, and parse the HTML. Expand for the gory details…

library(tidyverse)
library(rvest)

whisky_review_files <- list.files("spiritrecord.github.io/", ".html", full.names = TRUE) |>
  str_subset("404|index|legrand|oloroso-north-star", negate = TRUE)

parse_review <- function(review_html) {
  html <- xml2::read_html(review_html, encoding = "utf8")
  list(
    whisky = html |>
      html_element("h2") |>
      html_text(),
    review = html |>
      html_elements(xpath = "//strong[contains(text(), 'Nose') or contains(text(), 'Palate') or contains(text(), 'Finish') or contains(text(), 'Comments')]/parent::p/text()") |>
      html_text() |>
      setNames(c("nose", "palate", "finish", "comments"))
  )
}

parse_whisky_name <- function(whisky_name) 
  str_match(whisky_name, "([A-Za-z &\\(\\)'-\\.]+)( \".+\")?( [0-9]{4})?( [^(0-9)]+)?( [0-9]+\\.[0-9]+)?( [0-9]+ years)?( [0-9\\.]+%)( \\(.*\\))?")

whisky_reviews <- whisky_review_files |>
  map(parse_review)

review_text <- whisky_reviews |>
    map(\(x) x$review) |>
    bind_rows() |>
    mutate_all(trimws)

whiskies <- whisky_reviews |>
  map(\(x) x$whisky) |>
  parse_whisky_name() |>
  as_tibble() |>
  rename(string = 1,
         distillery = 2,
         title = 3,
         vintage = 4,
         bottler = 5,
         smws_code = 6,
         age = 7,
         abv = 8,
         cask_misc = 9) |>
  mutate_all(trimws) |>
  mutate(distillery_corrected = case_when(
    str_detect(distillery, "Loch Lomond") ~ "Loch Lomond",
    str_detect(distillery, "Bruichladdich") ~ "Bruichladdich",
    str_detect(distillery, "Springbank") ~ "Springbank",
    T ~ distillery),
    age = as.numeric(str_extract(age, "\\d+")),
    vintage = as.numeric(vintage),
    abv = as.numeric(str_extract(abv, "[0-9\\.]+"))
    ) |>
  bind_cols(review_text) |>
  mutate(score = str_extract(comments, "([0-9]+)/100."),
         score = as.numeric(str_extract(score, "[0-9]+")))

Distilleries and bottlers

The first thing that I evaluate when I’m looking at a new bottle of whisky is the distillery and the bottler. Useful to start there, then.

whiskies |>
  group_by(distillery_corrected) |>
  count(sort = TRUE)

Ignore the issue with Tormore… Looks like I’m a big fan of Bruichladdich. After that, Linkwood. Seems about right. What this table is really telling us is that the reviews in the dataset right now are sparse across distilleries. In other words, it wouldn’t be advisable to use the distillery as a feature in any model or statistic that we run down the line. We’d be attributing what is mostly bottle character to a distillery.

What about bottler?

whiskies |>
  group_by(bottler) |>
  count(sort = TRUE)

A labelling error in here or two… nothing crazy. We’ve got a lot of NAs, corresponding to OBs in the dataset. After that, it looks like Chorlton, Hidden Spirits, and Rest and Be Thankful are my next most reviewed bottlers. I’ve also got a lot of SMWS in here, but that’s mostly because it’s quite available at the bars I frequent. I’m not actually a member.

What about vintage and age?

whiskies |>
    select(string, Vintage = vintage, Age = age) |>
    pivot_longer(-string) |>
    ggplot(aes(value)) + facet_grid(.~name, scales = "free_x") +
    geom_histogram() + labs(title = "Whisky reviews, age and vintage", y = "Count", x = "") +
    theme_minimal()

Unsurprisingly, we’re clustered mainly around the 2000s and early 2010s. Age tells a similar story. I’m not drinking old dusties.

What about cask? This one will require a bit of work.

whiskies <- whiskies |>
  mutate(cask = case_when(
    str_detect(cask_misc, regex("wine|marsala|madeira|port|muscat|moscatel|rivesaltes|viognier|barossa", ignore_case = T)) ~ "wine",
    str_detect(cask_misc, regex("sherry|px|oloroso|manzanilla", ignore_case = T)) ~ "sherry",
    str_detect(cask_misc, regex("bourbon|hogshead|refill|barrel", ignore_case = T)) ~ "bourbon",
    T ~ "other"
  ))

whiskies |>
  group_by(cask) |>
  count(sort = TRUE)

Mostly ex-bourbon, pleasingly, but plenty of sherry and wine. Some undisclosed or other random casks (milk stout…) I’ve grouped together as “other”. Best to ignore that category. It looks like we’ll be able to use this for any model we might want to run – quite happy here.

I wonder if there’s a relationship between cask and age or vintage?

whiskies |>
  group_by(cask) |>
  summarise(mean_age = mean(age, na.rm = T),
            mean_vintage = mean(vintage, na.rm = T))

Well, perhaps there’s a few more years on my average bourbon over sherry and wine, but I wouldn’t draw too many conclusions here.

How about some final descriptive statistics in the form of word clouds?

library(tidytext)
library(SnowballC)

top_words <- function(whiskies, column)
  whiskies |>
  unnest_tokens(word, column) |>
  anti_join(tidytext::stop_words, by = "word") |>
  count(word, sort = TRUE) |>
  head(10)

nose <- top_words(whiskies, "nose")
palate <- top_words(whiskies, "palate")
finish <- top_words(whiskies, "finish")

nose
palate
finish

Water is big, because I like to add it. Other common words are fruit, oak, malt, textural descriptions, and heat on the finish. Let’s refine this a bit by stemming words (so malt and malty map down into one) and additionally having a look at casks.

Nose…

top_words_by_cask <- function(whiskies, column)
  whiskies |>
    unnest_tokens(word, column) |>
    filter(!word %in% c("water", "little", "bit", "quite", "very")) |>
    mutate(word = wordStem(word)) |>
    anti_join(tidytext::stop_words, by = "word") |>
    group_by(cask) |>
    count(word, sort = TRUE) |>
    slice_head(n = 10) |>
    select(-n) |>
    pivot_wider(names_from = cask, values_from = word, values_fn = list) |>
    unnest(cols = everything()) |>
    select(-other) # Not interested right now
  
top_words_by_cask(whiskies, "nose")

… palate…

top_words_by_cask(whiskies, "palate")

… and finish.

top_words_by_cask(whiskies, "finish")

Worth noting that “dri” is dried (as in “dried fruits”). Unsurprisingly that note appears quite a lot in sherried whiskies. We’ve another issue here: we haven’t tracked whether something is peated or not! Clearly a topic we should add to any future model. Looks like the stemmer is missing malti(ness) and plain old malt. Oh well, reasonable enough picture…

Scores

Always good to start with a plot.

whiskies |>
  ggplot(aes(score)) + geom_histogram() +
  labs(title = "Whisky score distribution", y = "Count", x = "Score") +
  theme_minimal()

Not really normal, but normalish in the 80s-90s if you squint.

Let’s have a look at the mean and standard deviation (let’s assume that the second moment is informative, and that seems reasonable):

whiskies |>
  summarise(mean_score = mean(score), sd_score = sd(score))

Now, I’d personally characterise anything above 80 as quite nice, very drinkable. So either my threshold is below whatever average is, or there’s quite a lot of selection bias operating here. I think both of those explanations seem reasonable and both apply from bottle to bottle. Perhaps we should look at percentiles?

quantile(whiskies$score)
##   0%  25%  50%  75% 100% 
## 64.0 84.0 86.5 88.0 94.0

Median is close to the mean but above. Half of my reviews lie between 84-88. Finally, let’s have a quick look at scores by cask:

whiskies |>
  group_by(cask) |>
  summarise(mean_score = mean(score), score_sd = sd(score)) |>
  arrange(-mean_score)

I’m going to avoid doing any hypothesis testing here. But it doesn’t seem like there’s a meaningful difference between sherry and wine. But it does seem like bourbon takes the cake! That certainly lines up with my priors.

Posted by Dominic on 22 Jun 2021