Deep-diving into Survey Methodology and that new ToM Poll

New Esprimi Poll

Times of Malta/Esprimi came out with a new poll last weekend, with some newer methodology than we’re used to. I have a feeling, though, that it largely made the rounds and captured the imagination of many because it bumped the “40,000 lead” net figure up to “50,000”. Percentage-wise, however, it’s bang in line with other recent polls: in the 41% range for PN and the 58% range for PL.

If you’re really pedantic, it’s actually a 0.5 percentage point improvement for PN over this MT poll from around 3 weeks ago, and the emphasis on “number of votes” as opposed to the % margin makes zero sense, but for many people it’s a bit like money at this stage: the more trailing zeros, the more excited they get.

The Non-Reply Issue

What should really have made the news is how differently it handled “no answer” responses. In European-style polling, we usually readjust the denominator to include only the responses with a stated vote intention, and rework those as percentages out of 100 - implicitly assuming that the “no answers” split along the same lines as the people who were vocal about their choice.

This is a fairly reasonable assumption, unless you have reason to believe that the missingness is not at random but instead driven by some sort of social desirability bias, where supporters of an unorthodox party sense that their support is socially undesirable in the broader population and keep quiet.

(As a slight tangent, I said “European” style because in the US it’s perfectly normal to report a 40%-50% split as a ‘10 point lead’. In many ways this is actually a much more mature way to think about it.)
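To make that denominator readjustment concrete, here’s a minimal sketch (the shares are made up purely for illustration):

# made-up raw shares from a hypothetical survey, including the non-replies
raw_shares <- c(`Party A` = 0.41, `Party B` = 0.44, `No answer` = 0.15)

# drop the non-replies and renormalize so the remaining shares sum to 100%,
# implicitly assuming non-repliers split the same way as everyone else
answered <- raw_shares[names(raw_shares) != "No answer"]
round(answered / sum(answered), 3)
# Party A: 0.482, Party B: 0.518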

What Esprimi and Lobeslab have done is a step further: running a classification model on their responses and predicting the “no answer” category’s likely support.

In theory this approach should give similar results to weighting if the survey is properly designed, but I can see it possibly acting as a safeguard against some sort of survey-wide bias. The catch, however, is that this “bias” needs to show up as some signal in the features (to stick with the example, socially desirable non-responses that are explained by, say, location) for machine learning techniques to pick it up.

The proof, however, is always in practice.

Setting up our Experiment

If we synthetically create a population and simulate sampling it just like a survey might, we can analyze the responses using traditional weighting vs. the machine learning approach. This lets us see whether there is a meaningful difference between the population values estimated from the sample and the true population value from our generated population.

Firstly, let’s write a function to generate our synthetic population:

# tibble, dplyr, ggplot2 etc. all come from the tidyverse
library(tidyverse)

create_population <- function(N = 500000,
                              blueberry_bias_age = 0,
                              blueberry_bias_location = 0){
  # Creates a sample population of blueberry & watermelon support
  # 
  #       :param N(int): population size
  #       :param blueberry_bias_age(float): the bias in blueberry support according to age.
  #       :param blueberry_bias_location(float): bias in blueberry support according to location.

  #       :return (dataframe): composed of the synthetic population

  set.seed(123)

  age_values <- c('18-30', '31-45', 
                  '46-55', '56-65', '65+')
  age_probabilities <- c(0.13, 0.25, 0.15, 0.3, 0.17)
  
  region_value <- c('Gozitan Republic', 'Malta North', 
                    'Malta West', 'Malta East', 'Malta South')
  region_probabilities <- c(0.1, 0.32, 0.1, 0.15, 0.33)
  
  population <- tibble(age = sample(age_values,
                                    N,
                                    prob = age_probabilities,
                                    replace = TRUE),
                       region = sample(region_value, 
                                       N,
                                       prob = region_probabilities,
                                       replace = TRUE)) %>% 
    mutate(preference = runif(nrow(.)),
           # a positive age bias pushes younger groups' scores up and older
           # groups' scores down; since scores below 0.5 become "Blueberry",
           # this makes blueberry less popular with the young and more
           # popular with the old
           vote = case_when(age == '18-30' ~ preference + (blueberry_bias_age * 1.2),
                            age == '31-45' ~ preference + (blueberry_bias_age * 1),
                            age == '46-55' ~ preference,
                            age == '56-65' ~ preference - (blueberry_bias_age * 1),
                            age == '65+' ~ preference - (blueberry_bias_age * 1.5)),
           # the location bias only nudges Malta North (towards blueberry)
           # and Malta South (away from it)
           vote = case_when(region == "Malta North" ~ vote - blueberry_bias_location,
                            region == "Malta South" ~ vote + blueberry_bias_location,
                            TRUE ~ vote),
           vote = if_else(vote < 0.5, "Blueberry", "Watermelon")) %>%
    rowwise() %>% 
    mutate(occupation = if_else(vote == "Blueberry", 
                                sample(c("Farmer", 
                                         "Office Work", 
                                         "Healthcare", 
                                         "Law Enforcment", 
                                         "Bitcoin Trader",
                                         "Other"),
                                       1,
                                       replace = T,
                                       c(0.19, 0.16, 0.16, 0.16, 0.16, 0.17)),
                                sample(c("Farmer", 
                                         "Office Work", 
                                         "Healthcare", 
                                         "Law Enforcment", 
                                         "Bitcoin Trader",
                                         "Other"),
                                       1,
                                       replace = T,
                                       c(0.16, 0.17, 0.16, 0.16, 0.19, 0.16))),
           has_car = sample(c("Yes", "No"), 1, replace = T, c(0.5, 0.5))) %>%
    ungroup() %>% 
    select(c("age", "region", "occupation", "has_car", "vote"))
}

create_population creates a Malta in an alternate reality that’s exactly 500,000 people strong. Because this is an alternate timeline, Gozo has achieved Republic status, but maintains strong social and economic ties with Malta. Malta itself is split into 4 regions, using the actual compass this time: North, West, East and South.

Rather than support for political parties, we’ll be measuring the nation’s favorite fruit: blueberry or watermelon. The blueberry_bias_age argument is important. If left at 0, the function will generate random support for both fruits, which over half a million rows works out to nearly 50-50.

If tweaked with a positive value, however, it applies a bias in support, making blueberry less popular with younger groups and more popular with older groups. If we want it to go the other way, we can just pass a negative value.

blueberry_bias_location works in much the same way, except the bias is only applied to the Malta North and Malta South regions.

To add some more predictive power, I created an occupation variable that skews slightly more towards farmer than chance if you like blueberries, and slightly more towards bitcoin trader if you like watermelons. The other variable, has_car, is just random noise.
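As a quick sanity check (a sketch to see the arguments in action), calling the function with no bias should land very close to an even split, while a positive age bias nudges overall support towards blueberry, since the older groups make up a large chunk of the population:

# no bias: support should come out roughly 50-50 across the 500,000 rows
create_population(blueberry_bias_age = 0) %>% 
  count(vote) %>% 
  mutate(share = n / sum(n))

# positive age bias: blueberry gains with the (larger) older groups,
# so overall support should creep above 50%
create_population(blueberry_bias_age = 0.1) %>% 
  count(vote) %>% 
  mutate(share = n / sum(n))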

Creating Alternate Malta

All that’s left for us to create this alternate reality is to run the function!

set.seed(123)

malta_v2 <- create_population(blueberry_bias_age = 0.1, 
                              blueberry_bias_location = 0.05)

Here, I’ll be biasing blueberry just a nudge with age (blueberry increases in popularity with age) and location (people in the north like to sprinkle them in yogurt for breakfast).

Anyway, the alternate Malta NSO’s latest census shows how the population is composed:

census <- malta_v2 %>% 
  group_by(age, region, occupation, has_car) %>% 
  tally()

census
## # A tibble: 300 x 5
## # Groups:   age, region, occupation [150]
##    age   region           occupation     has_car     n
##    <chr> <chr>            <chr>          <chr>   <int>
##  1 18-30 Gozitan Republic Bitcoin Trader No        612
##  2 18-30 Gozitan Republic Bitcoin Trader Yes       582
##  3 18-30 Gozitan Republic Farmer         No        539
##  4 18-30 Gozitan Republic Farmer         Yes       586
##  5 18-30 Gozitan Republic Healthcare     No        516
##  6 18-30 Gozitan Republic Healthcare     Yes       520
##  7 18-30 Gozitan Republic Law Enforcment No        518
##  8 18-30 Gozitan Republic Law Enforcment Yes       540
##  9 18-30 Gozitan Republic Office Work    No        527
## 10 18-30 Gozitan Republic Office Work    Yes       541
## # ... with 290 more rows

This is important because we’ll use this census to weight our survey. Since we’re on the subject, here’s the true support for blueberry vs. watermelon in the population:

malta_v2 %>% 
  group_by(vote) %>% 
  summarise(n(), percent = n()/500000)
## # A tibble: 2 x 3
##   vote        `n()` percent
##   <chr>       <int>   <dbl>
## 1 Blueberry  257031   0.514
## 2 Watermelon 242969   0.486

And here’s the interaction by age and region:

malta_v2 %>% 
  mutate(likes_blueberry = if_else(vote == "Blueberry", 1, 0)) %>% 
  group_by(age, region) %>% 
  summarise(blueberry_support = mean(likes_blueberry)) %>% 
  ggplot(aes(x = region, y = blueberry_support, fill = age))+
  geom_col(position = "dodge")+
  coord_flip()+ 
  scale_y_continuous(labels = scales::percent)+
  theme_bw()+
  ylab("Blueberry Support")+
  xlab("")

Our Survey

Similarly, we’ll wrap our survey in a function named survey_population.

survey_population <- function(population_df,
                              n = 600,
                              perc_shy = 0.1) {
  
  # Surveys the population randomly just like a phone survey might
  # 
  #       :param population_df(dataframe): a population dataframe created using the create_population function.
  #       :param n(int): sample size
  #       :param perc_shy(float): the percentage of survey responses that will have 'no answer'

  #       :return (dataframe): the survey results
  set.seed(234)
  
  sample_n(population_df, replace = F, size = n) %>%
    mutate(vote = replace(vote, 
                          sample(row_number(),
                                 size = ceiling(perc_shy * n()),
                                 replace = FALSE),
                          NA))
}

dplyr’s sample_n actually does most of the heavy lifting here, but I added a line in a mutate call that randomly replaces the watermelon/blueberry value with an NA to simulate a stipulated percentage of missing values. MaltaToday reports around 12-15% “Don’t Knows”, so we’ll go with 15% and generate a simulated survey of 600 people.

survey <- survey_population(malta_v2, n = 600, perc_shy = 0.15)

This is what we get:

survey %>% head()
##     age      region     occupation has_car       vote
## 1 31-45 Malta North         Farmer      No Watermelon
## 2 18-30  Malta East Bitcoin Trader     Yes  Blueberry
## 3 31-45 Malta South     Healthcare     Yes Watermelon
## 4   65+ Malta North Law Enforcment     Yes       <NA>
## 5 56-65 Malta South          Other      No  Blueberry
## 6 31-45 Malta North Bitcoin Trader      No       <NA>

And this is the picture our raw crosstabs paint:

survey %>% 
  group_by(vote) %>% 
  summarise(n(), n()/600)
## # A tibble: 3 x 3
##   vote       `n()` `n()/600`
##   <chr>      <int>     <dbl>
## 1 Blueberry    254     0.423
## 2 Watermelon   256     0.427
## 3 <NA>          90     0.15

Excitingly, due to the magic of randomness, watermelon is leading!

Traditional Surveys

The survey package is a tremendous resource for this kind of analysis in R. To make our lives easier, let’s join the census data to the survey. The pw column simply contains the number of people from that demographic group present in the population, while fpc is the finite population correction, which we set to our 500,000 inhabitants.

library(survey)

survey_w_weights <- survey %>% 
  inner_join(census) %>% 
  rename(pw = n) %>% 
  mutate(fpc = 500000)

survey_w_weights %>% head()
##     age      region     occupation has_car       vote   pw   fpc
## 1 31-45 Malta North         Farmer      No Watermelon 3422 5e+05
## 2 18-30  Malta East Bitcoin Trader     Yes  Blueberry  852 5e+05
## 3 31-45 Malta South     Healthcare     Yes Watermelon 3327 5e+05
## 4   65+ Malta North Law Enforcment     Yes       <NA> 2097 5e+05
## 5 56-65 Malta South          Other      No  Blueberry 4013 5e+05
## 6 31-45 Malta North Bitcoin Trader      No       <NA> 3553 5e+05

Then we can create our survey design object, and pass it through the svymean function.

design <- svydesign(id = ~1, 
                    weights = ~pw, 
                    data = survey_w_weights, 
                    fpc = ~fpc)

svymean(~vote, design, na.rm = T)
##                   mean     SE
## voteBlueberry  0.50918 0.0246
## voteWatermelon 0.49082 0.0246

And what do you know, already a terrific improvement!

We’ve corrected the swing, and the true population value (51.4%) is within one standard error of the estimate.
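If you want the interval spelled out, something along these lines should work (a quick sketch; confint on the svymean object gives a normal-approximation interval):

# 95% confidence interval around the weighted estimates;
# the true blueberry share of 51.4% should sit comfortably inside it
confint(svymean(~vote, design, na.rm = TRUE))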

Classifier Aided

We can train a classification model on our provided survey responses, and use this to predict on the unknown responses. We’ll try two types of model: XGBoost and logistic regression from the glmnet package.

Julia Silge also introduced the finetune package in this video, so I took this opportunity to give it a try. The pre-processing steps are pretty straightforward (some of it is actually copy-pasted from the above video): we drop the NAs, create our folds, one-hot encode all the predictors and give each of the models a 50-candidate tuning grid, before analyzing performance on the unseen test set using the last_fit function.

XGBoost

library(tidymodels)
library(finetune)

doParallel::registerDoParallel()

set.seed(123)

raw <- survey %>% 
  drop_na() %>% 
  initial_split(0.8)

training <- training(raw)
testing <- testing(raw)

cv_folds <- vfold_cv(training)

classifying_rec <- recipe(vote ~ .,
                          data = training) %>% 
  step_dummy(all_nominal_predictors(), one_hot = TRUE)

xgb_spec <-
  boost_tree(
    trees = 500,
    min_n = tune(),
    mtry = tune(),
    learn_rate = tune()
  ) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_wf <- workflow(classifying_rec, xgb_spec)

xgb_rs <- tune_race_anova(
  xgb_wf,
  resamples = cv_folds,
  grid = 50)

xgb_last <- xgb_wf %>%
  finalize_workflow(select_best(xgb_rs)) %>% 
  last_fit(raw)

xgb_last$.metrics
## [[1]]
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.480 Preprocessor1_Model1
## 2 roc_auc  binary         0.530 Preprocessor1_Model1

GLMNet

set.seed(234)

log_reg_spec <- logistic_reg(
  penalty = tune(),
  mixture = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("glmnet")

lr_wf <- workflow(classifying_rec, log_reg_spec)

lr_rs <- tune_race_anova(
  lr_wf,
  resamples = cv_folds,
  grid = 50)

lr_last <- lr_wf %>%
  finalize_workflow(select_best(lr_rs)) %>% 
  last_fit(raw)

lr_last$.metrics
## [[1]]
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.520 Preprocessor1_Model1
## 2 roc_auc  binary         0.543 Preprocessor1_Model1

Now, to be honest, both of these models have dreadful performance. It may be entirely down to my synthetic data, or something obvious I missed. But since the logistic classifier is a smidge better, let’s use that to predict the “unknowns” present in the survey.

lr_preds <- lr_wf %>%
  finalize_workflow(select_best(lr_rs)) %>% 
  fit(training) %>% 
  predict(survey)

# fill the "no answer" rows with the model's predicted class,
# keeping the stated answers as they are
survey_model_assist <- survey %>% 
  bind_cols(lr_preds) %>% 
  mutate(vote = case_when(is.na(vote) ~ as.character(.pred_class),
                          TRUE ~ vote))

Raw

How much did it improve things over the raw results?

survey_model_assist %>% 
  group_by(vote) %>% 
  summarise(n(), n()/600)
## # A tibble: 2 x 3
##   vote       `n()` `n()/600`
##   <chr>      <int>     <dbl>
## 1 Blueberry    298     0.497
## 2 Watermelon   302     0.503

It swung 0.1 percentage points the wrong way.

As an Input to a Weighted Survey

But if we use that model-assisted survey as the input to the usual weighting process…

survey_model_assist <- survey_model_assist %>% 
  inner_join(census) %>% 
  rename(pw = n) %>% 
  mutate(fpc = 500000) 
## Joining, by = c("age", "region", "occupation", "has_car")
design_2 <- svydesign(id = ~1, 
                    weights = ~pw, 
                    data = survey_model_assist, 
                    fpc = ~fpc)

svymean(~vote, design_2, na.rm = T)
##                  mean     SE
## voteBlueberry  0.5176 0.0227
## voteWatermelon 0.4824 0.0227

That’s actually the closest result of the lot. Obviously, the better work you do on the classification model, the better your result will look.

Summarizing It
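Before moving on, here’s a quick side-by-side of the first experiment’s blueberry estimates (a sketch; the numbers are copied from the outputs above, against a true population value of 51.4%):

# blueberry estimates from the first experiment, gathered for comparison
tibble(method   = c("Raw (responses only)", 
                    "Census weighted", 
                    "Classifier-imputed, raw", 
                    "Classifier-imputed + weighted"),
       estimate = c(254/510, 0.509, 0.497, 0.518))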

What if the Effect was systemic?

There’s one case where the ML approach might actually make a huge difference: if the missingness is not at random. To simulate this, I set the shyness to 0 (so we get no NAs), order our table so the same categories are clustered together, and just chop off a bunch of, let’s say, shy watermelon fans.

survey_biased <- survey_population(malta_v2, n = 600, perc_shy = 0) %>% 
  arrange(vote, region, age) %>% 
  head(510)

survey_biased %>% 
  group_by(vote) %>% 
  summarise(n()/510)
## # A tibble: 2 x 2
##   vote       `n()/510`
##   <chr>          <dbl>
## 1 Blueberry      0.573
## 2 Watermelon     0.427

Yikers. Now let’s run through the same steps as before.

survey_w_weights_2 <- survey_biased %>% 
  inner_join(census) %>% 
  rename(pw = n) %>% 
  mutate(fpc = 500000)
## Joining, by = c("age", "region", "occupation", "has_car")
design <- svydesign(id = ~1, 
                    weights = ~pw, 
                    data = survey_w_weights_2, 
                    fpc = ~fpc)

svymean(~vote, design, na.rm = T)
##                   mean     SE
## voteBlueberry  0.60069 0.0239
## voteWatermelon 0.39931 0.0239

In this case we can see that the weighting is way off. Pollsters will actually set minimum respondent counts for each cell, but if the issue is systemic, some of those cells will stay empty.
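One way to spot the problem (a sketch, reusing the census cells we built earlier) is to count how many demographic cells received no respondents at all in the biased sample:

# census cells with zero respondents in the biased sample;
# weighting has nothing left to reweight in these cells
census %>% 
  anti_join(survey_biased, 
            by = c("age", "region", "occupation", "has_car")) %>% 
  nrow()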

set.seed(123)

raw <- survey_biased %>% 
  drop_na() %>% 
  initial_split(0.8)

training <- training(raw)
testing <- testing(raw)

cv_folds <- vfold_cv(training)

log_reg_spec <- logistic_reg(
  penalty = tune(),
  mixture = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("glmnet")

lr_wf <- workflow(classifying_rec, log_reg_spec)

lr_rs <- tune_race_anova(
  lr_wf,
  resamples = cv_folds,
  grid = 50)

lr_last <- lr_wf %>%
  finalize_workflow(select_best(lr_rs)) %>% 
  last_fit(raw)
## Warning: No value of `metric` was given; metric 'roc_auc' will be used.
lr_last$.metrics
## [[1]]
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.618 Preprocessor1_Model1
## 2 roc_auc  binary         0.674 Preprocessor1_Model1

Interestingly, the logistic regression model’s performance jumps up: the missingness follows a pattern it has learnt well.
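If you’re curious about the signal the model is leaning on, one option is to pull out the coefficients of the fitted glmnet (a sketch; extract_fit_parsnip and tidy should surface the coefficients at the selected penalty):

# peek at which dummy-encoded features the penalised logistic regression
# actually uses; non-zero coefficients carry the signal
lr_wf %>%
  finalize_workflow(select_best(lr_rs)) %>% 
  fit(training) %>% 
  extract_fit_parsnip() %>% 
  tidy() %>% 
  filter(estimate != 0)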

lr_preds <- lr_wf %>%
  finalize_workflow(select_best(lr_rs)) %>% 
  fit(training) %>% 
  predict(survey)
## Warning: No value of `metric` was given; metric 'roc_auc' will be used.
# apply the model back to the full 600-person survey and fill its
# "no answer" rows with the predicted class
survey_model_assist <- survey %>% 
  bind_cols(lr_preds) %>% 
  mutate(vote = case_when(is.na(vote) ~ as.character(.pred_class),
                          TRUE ~ vote))

And in terms of raw and weighted numbers:

survey_model_assist %>% 
  group_by(vote) %>% 
  summarise(n(), n()/600)
## # A tibble: 2 x 3
##   vote       `n()` `n()/600`
##   <chr>      <int>     <dbl>
## 1 Blueberry    313     0.522
## 2 Watermelon   287     0.478
survey_model_assist <- survey_model_assist %>% 
  inner_join(census) %>% 
  rename(pw = n) %>% 
  mutate(fpc = 500000) 
## Joining, by = c("age", "region", "occupation", "has_car")
design_2 <- svydesign(id = ~1, 
                    weights = ~pw, 
                    data = survey_model_assist, 
                    fpc = ~fpc)

svymean(~vote, design_2, na.rm = T)
##                   mean     SE
## voteBlueberry  0.53865 0.0227
## voteWatermelon 0.46135 0.0227

Pretty cool! The model-assisted estimate lands noticeably closer to the true 51.4% than the weighting-only one did.

Obligatory side by side tie-fighter plot for comparison:

library(patchwork)

# plot_1 and plot_2 are assumed to be the ggplot objects of the estimates
# from the two experiments, built outside the snippets shown here
plot_1 / plot_2