---
title: "Crowdsourcing with Difficulty"
subtitle: "A Bayesian Rating Model for Heterogeneous Items"
author:
  - name: Bob Carpenter
    corresponding: true
    email: [email protected]
    url: https://bob-carpenter.github.io/
    orcid: 0000-0002-2433-9688
    affiliations:
      - name: Flatiron Institute
        department: Center for Computational Mathematics
        url: https://www.simonsfoundation.org/flatiron/center-for-computational-mathematics/
date: last-modified
date-modified: last-modified
description: |
abstract: >+
  Dawid and Skene's crowdsourcing model, which adjusts raters for sensitivity and specificity, fails to capture important distributional properties of real rating data. By adding item-level effects for difficulty, discriminativeness, and guessability, the annotation model is able to more closely model the true data generating process. This is illustrated with binary data from dental image classification and natural language inference.
keywords: [rating model, Dawid-Skene, item-response theory, Bayesian, Stan, Python]
citation:
  type: article-journal
  container-title: "Computo"
  doi: "xxxx"
  url: https://computo.sfds.asso.fr/template-computo-quarto
  publisher: "Société Française de Statistique"
  issn: "2824-7795"
bibliography: references.bib
github-user: bob-carpenter
repo: "crowdsource-computo"
draft: true # set to false once the build is running
published: false # will be set to true once accepted
format:
  computo-html: default
  computo-pdf: default
jupyter: python3
---
# Introduction
Crowdsourcing is the process of soliciting ratings from a group of respondents for a set of items and combining them in a useful way for some task, such as training a neural network classifier or a large language model. The use of the term "crowdsourcing" is not meant to imply that the raters are untrained---the crowd may be one of oncologists, lawyers, or professors. This note focuses on binary ratings, which arise in two-way classification problems, such as whether a patient has a specific condition, whether an image contains a traffic light, whether one consumer product is better than another, whether an action is ethical, or whether one output of a large language model is more truthful than another.
This note considers two specific data sets. The first involves dental X-rays rated as positive or negative for caries, a kind of pre-cavity [@espeland1989]. There are thousands of X-rays, each rated by a handful of dentists. The dentists showed surprisingly little agreement and consensus, especially on cases where at least one dentist rated the X-ray positive for caries. The dentists also varied dramatically in the number of cases each rated as having caries, indicating bias toward positive or negative ratings. The second data set consists of pairs of sentences, each rated positive if the first sentence entails the second. @snow2008 collected ratings on Mechanical Turk for thousands of sentence pairs, using dozens of raters, each of whom rated only a subset of the items. Natural language semantics tasks are notoriously difficult, and thus it is not surprising that the rater agreement level is low.
# The Five Tasks of Crowdsourcing
There are several tasks to which crowdsourcing is applied, and a rating model improves performance on all of them over heuristic baselines such as majority voting for category assignment and indirect measurements such as inter-annotator agreement statistics for task difficulty.
1. *Inferring a gold-standard data set.* The first and foremost application of crowdsourcing is to generate a "gold standard" data set, where a single category (or label) is assigned to each item. In terms of generating representative data, it is best to sample data according to its probability (i.e., follow the generative model) rather than to choose the "best" rating for each item according to some metric such as highest probability, which is similar to the commonly used heuristic of majority vote. If the goal of creating a gold-standard data set is to train a machine learning classifier, the second section of this paper shows why it is better for downstream accuracy to train with a probabilistic corpus that retains information about rating uncertainty. Barring that, it is far better to sample labels according to their posterior probability in the rating model than to choose the "best" label. In particular, majority voting among raters will later be shown to be sub-optimal compared to sampling, which is in turn dominated by training with the probabilities (a kind of Rao-Blackwellization). An even better approach is to jointly train a classifier and analyze rating data [@raykar2010].
2. *Inferring population prevalence.* The second most common application of crowdsourcing is to understand the probability of positivity among items in the population represented by the crowdsourcing data. This is particularly common in epidemiology, where the probability of positive outcomes is the prevalence of the disease in the (sub)population. It can also be used to analyze the prevalence of hate speech on a social media site or bias in televised news, the prevalence of volcanos on Venus, or the prevalence of positive reviews for a restaurant.
3. *Understanding and improving the coding standard.* The third most common application of crowdsourcing is to understand the coding standard, which is the rubric under which rating is carried out. Typically this is measured through inter-annotator agreement, but rating models provide finer-grained analysis of sensitivity and specificity of raters.
4. *Understanding and improving raters.* What are the mean sensitivity and specificity, and how do they vary among raters? Are sensitivity and specificity correlated or anticorrelated in the population? This understanding can be fed back to the raters themselves for ongoing training. For example, American baseball umpires receive extensive feedback on how they call balls and strikes as measured against a very accurate machine's call, which has led to much higher accuracy and consistency among umpires [@flannagan2024]. Understanding rater populations, such as those available through Mechanical Turk or Upwork, is important when managing raters for multiple crowdsourcing tasks. For example, it becomes possible to infer the rate of spammers and the rate of high-quality raters. The number of raters required for high-quality joint ratings may also be assessed.
5. *Understanding the items.* A fifth task that is rarely considered in the crowdsourcing literature is to understand the structure of the population of items. For example, which items are difficult to rate and why? Which items simply have too little signal to be consistently rated? Which items lie near the rating decision boundary and which lie far from it? Which items have high discrimination and why? A discriminative item is one that cleanly separates high-ability from low-ability raters in their ability to rate it correctly. This is the primary focus of educational testing, where the items are test questions and the raters are students. A good test will be composed of questions at a variety of difficulty levels with high discrimination.
# Contributions and Previous Work
## Contributions
The primary contribution of this work is the expansion of rating models to account for item variation in the form of difficulty, discriminability, and guessability. A space of models along five dimensions is considered and thoroughly evaluated using two medium-sized rating data sets. Only models with item-level effects for difficulty and rater effects distinguishing sensitivity and specificity are able to pass posterior predictive checks [@rubin1984; @gelman1996]. Posterior predictive checks are the Bayesian analogue of $\chi^2$ goodness-of-fit tests in regression, which test a "null" of the fitted model against the data to evaluate whether the data could reasonably have been generated by the model [@formann2003].
A secondary contribution is a parameterization of sensitivity and specificity on the log odds scale that restricts parameters to a cooperative range by jointly constraining sensitivity and specificity to lie above the spam line. Without this constraint, the rating model likelihoods have two modes, one in which raters are cooperative and one in which prevalence is inverted and the raters are adversarial. An adversarial rater is one who consistently provides the wrong rating (i.e., they know the answer and provide the wrong answer). The spam line is determined by random guessing---spammy annotators have a sensitivity equal to one minus their specificity.
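To make the constraint concrete (this is one possible formulation, not necessarily the exact parameterization used in the implementation), write $\alpha^{\textrm{sens}}_j$ and $\alpha^{\textrm{spec}}_j$ for rater $j$'s sensitivity and specificity on the log odds scale, as defined below. The spam line then has a simple form on that scale:
\begin{equation}
\textrm{logit}^{-1}\!\left(\alpha^{\textrm{sens}}_j\right)
> 1 - \textrm{logit}^{-1}\!\left(\alpha^{\textrm{spec}}_j\right)
= \textrm{logit}^{-1}\!\left(-\alpha^{\textrm{spec}}_j\right)
\quad\Longleftrightarrow\quad
\alpha^{\textrm{sens}}_j + \alpha^{\textrm{spec}}_j > 0,
\end{equation}
so cooperation can be enforced by giving $\alpha^{\textrm{sens}}_j$ a lower bound of $-\alpha^{\textrm{spec}}_j$.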
An implicit contribution is a replicable case study with open-source implementations of all of these models in Stan, a probabilistic programming language [@carpenter2017]. The code lays out a framework for comparing models that form a lattice of continuous expansions.
## Previous work
Rating models show up in multiple fields, including educational testing, from which the model variants introduced here are derived [@lazarsfeld1950; @lord1968; @rasch1960]. Rating models are also applied in epidemiology, both for multiple diagnostic testing [@albert2004] and for extracting health status information from patient records [@dawid1979]. In sociology, rating models were independently developed for cultural consensus theory [@romney1986; @batchelder1988]. More recently, they have become popular for providing human feedback for classification of images [@smyth1994; @raykar2010]; human ratings are the basis of massive data sets of millions of images and thousands of classes like ImageNet [@deng2009].
Rating models have long been popular for natural language tasks [@snow2008; @passonneau2014]. More recently, crowdsourced ratings of language model output are used as a fine-tuning step that adjusts a foundational large language model like GPT-4, which is trained to complete text, into a chatbot like ChatGPT that is (imperfectly) trained to be helpful, truthful, and harmless [@ouyang2022; @rafailov2024].
## Extensions
A natural extension is to $K$-way categorical ratings, such as classifying a dog image by breed, classifying a newspaper article by topic, rating a movie on a one-to-five scale, classifying a doctor's visit with an ICD-10 code, and so on. Most of the work on ratings has been in this more general categorical setting. With more than two categories, sensitivity and specificity are replaced with categorical response distributions based on the latent true category. Discrimination and guessing act the same way, but difficulty must be replaced with a more general notion of a categorical item-level effect, which may represent either confusion with a focused alternative (a border collie is confusable with an Irish shepherd) or diffuse confusion (e.g., it is hard to tell what is in the image).
With enough raters, these models may also be extended hierarchically to make population-level inferences about the distribution of rater abilities or item difficulties [@paun2018]. Several of the crowdsourcing tasks may be combined to select raters and items to rate online with active learning, which is a form of bandit problem often addressed through reinforcement learning. With a hierarchical model, inference may be expanded to new raters.
It is also straightforward to extend a rating model to ordered responses, counts, proportions, distances, or arbitrary real numbers such as geolocations. All that needs to change is the response model and the representation of the latent truth---the idea of collecting noisy ratings and inferring a ground truth remains. As an example, suppose we have images of Venus and markings on an image of where the rater thinks a volcano is located. The true location is represented as a latitude and longitude, and rater responses can be multivariate normal, centered around the true, but unknown, location.
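As a minimal sketch of such an extension (with invented variable names and noise scales, purely for illustration), the continuous response model for volcano locations could be simulated as follows.
```python
import numpy as np

rng = np.random.default_rng(1234)

# latent true location of a volcano (latitude, longitude); hypothetical values
true_location = np.array([-12.3, 164.7])

# per-rater marking noise in degrees; assumed for illustration
rater_sd = np.array([0.05, 0.10, 0.25])

# each rater's mark is multivariate normal around the true, unknown location
marks = np.stack([
    rng.multivariate_normal(true_location, (sd ** 2) * np.eye(2))
    for sd in rater_sd
])
print(marks)
```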
# A General Crowdsourcing Model
This section presents the most general crowdsourcing model, then considers how it may be simplified by removing features. Removing features here amounts to tying parameters, such as assuming all raters have the same accuracy or all items have the same difficulty.
## The rating data
Consider a crowdsourcing problem for which there are $I \in \mathbb{N}$ items to rate and $J \in \mathbb{N}$ raters to do the rating. Long-form data will be used to accommodate a varying number of raters per item and a varying number of items per rater. Let $N \in \mathbb{N}$ be the number of ratings, with $y_n \in \{0, 1\}$ the $n$-th rating, each of which is equipped with a rater $\textrm{rater}_n \in 1:J$ and an item being rated $\textrm{item}_n \in 1:I$.
For each item $i$, suppose there are covariates $x_i \in \mathbb{R}^K$. Then $x \in \mathbb{R}^{I \times K}$ is the full data matrix and $x_i$ is naturally considered a row vector. In the simplest case, $K = 1$ and $x$ is a single column of 1 values representing an intercept. In the more general case, other features may denote information relevant for classifying the item, such as the pixels of an image for image classification, demographic information for predicting a vote in an election, or medical test results for predicting a medical condition.
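To make the long-form layout concrete, here is a minimal Python sketch with invented toy values (three items, two raters, and an intercept-only covariate matrix); it is illustrative only.
```python
import numpy as np

# long-form rating data: one entry per rating (toy values for illustration)
item = np.array([1, 1, 2, 2, 3, 3])    # item_n in 1:I
rater = np.array([1, 2, 1, 2, 1, 2])   # rater_n in 1:J
y = np.array([1, 1, 0, 1, 0, 0])       # rating y_n in {0, 1}

I = item.max()    # number of items
J = rater.max()   # number of raters
N = y.size        # number of ratings

# item-level covariates: simplest case, K = 1 and x is a column of ones
x = np.ones((I, 1))
```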
## The generative model
The generative model derives from combining Dawid and Skene's epidemiological model, with its sensitivity and specificity, and the item-response theory educational testing model, with its item difficulty, discriminativeness, and guessing. The categories for the items are generated independently given the covariates using a logistic regression. The ratings for an item are then generated conditionally based on the item's category and the rater's abilities and biases.
In frequentist terms, this section presents a complete data likelihood for the categories and rating data. The categories are not observed and are thus treated as missing data to allow them to be marginalized to derive the likelihood for the rating data.
### Generating categories
For each item, let $z_i \in \{ 0, 1 \}$ be its (unobserved/latent) category, with a 1 conventionally denoting "success" or a "positive" result.
The complete data likelihood is complete in the sense that it includes the latent category. Marginalizing out this category, the technical details and numerical stability of which are deferred until later, leads to the ordinary likelihood function used in the model, avoiding challenging inference over discrete parameters.
Let $\gamma \in \mathbb{R}^K$ be a vector of (unobserved) regression coefficients, used when item-level covariates are included (see the section on adding predictors). Let $\pi \in (0, 1)$ be the parameter representing the prevalence of positive outcomes.
Categories are generated independently given the prevalence,
\begin{equation}
z_i \sim \textrm{bernoulli}(\pi).
\end{equation}
### Generating ratings
The rating from rater $j$ for item $i$ is generated conditionally given the category $z_i$ of the item. For positive items ($z_i = 1$), sensitivity (i.e., accuracy on positive items) is used, whereas for negative items ($z_i = 0$), specificity (i.e., accuracy on negative items) is used. Thus every rater $j$ has a sensitivity $\alpha^{\textrm{sens}}_j \in \mathbb{R}$ and a specificity $\alpha^{\textrm{spec}}_j \in \mathbb{R}$, both on the log odds scale. If the sensitivity is higher than the specificity, there is a bias toward 1 ratings, whereas if the specificity is higher than the sensitivity, there is a bias toward 0 ratings. If the model has only sensitivity and specificity parameters that vary by rater, it reduces to Dawid and Skene's diagnostic testing model. Fixing $\alpha^{\textrm{sens}}_j = \alpha^{\textrm{spec}}_j$ introduces an unbiasedness assumption whereby a rater has equal sensitivity and specificity.
The items are parameterized with a difficulty $\beta_i \in \mathbb{R}$ on the log odds scale. This difficulty is subtracted from the sensitivity (if $z_i = 1$) or specificity (if $z_i = 0$) as appropriate to give the raw log odds of a correct rating (i.e., a rating matching the true category $z_i$). Fixing $\beta_i = 0$ introduces the assumption that every item is equally difficult.
Each item is further parameterized with a positive-constrained discrimination parameter $\delta_i \in (0, \infty)$. This is multiplied by the raw log odds to give discrimination-adjusted log odds, which determine the probability of correctly rating the item. With high discrimination, it is more likely that a rater with ability greater than the difficulty will get the correct answer and less likely that a rater with ability less than the difficulty will get the correct answer. For educational testing, highly discriminative test questions are preferable, but for rating wild-type data, low-discrimination items are common because of natural variation in the signal (e.g., natural language text or an image). Fixing $\delta_i = 1$ introduces the assumption that the items are equally discriminative.
The final parameter associated with an item is a guessability parameter $\lambda_i \in (0, 1)$, giving the probability that a rater can just "guess" the right answer. The probability that a rater assigns the correct rating will thus be the combination of the probability of guessing correctly and otherwise getting the correct answer in the usual way. Fixing $\lambda_i = 0$ introduces the assumption that the raters never guess an answer.
Without a guessing parameter, as difficulty goes to infinity, the probability a rater provides the correct label for an item goes to zero. With guessing, the probability of a correct label is always at least the probability of guessing.
The full model follows the item-response theory three-parameter logistic (IRT-3PL) model. Letting $c_n \in \{0, 1\}$ indicate whether rating $n$ matches the true category of its item, and writing $i = \textrm{item}_n$ and $j = \textrm{rater}_n$, the probability that rater $j$ assigns the correct rating to item $i$ is given by
\begin{equation}
c_n \sim \textrm{Bernoulli}\!\left(\lambda_i + (1 - \lambda_i) \cdot \textrm{logit}^{-1}\!\left(\delta_i \cdot \left(\alpha^k_j - \beta_i\right)\right)\right),
\end{equation}
where $k = \textrm{sens}$ if $z_i = 1$ and $k = \textrm{spec}$ if $z_i = 0$.
In order to convert this to a distribution over ratings, the probability of a 1 outcome must be flipped when $z_i = 0$, so that, for example, 90% accuracy on a negative item results in a 90% chance of a 0 rating. Thus the rating is given by
\begin{equation}
y_n \sim
\begin{cases}
\textrm{Bernoulli}\!\left(\lambda_i + (1 - \lambda_i) \cdot \textrm{logit}^{-1}\!\left(\delta_i \cdot \left(\alpha^\textrm{sens}_j - \beta_i\right)\right)\right)
& \textrm{if } z_i = 1, \textrm{ and}
\\[4pt]
\textrm{Bernoulli}\!\left(1 - \left( \lambda_i + (1 - \lambda_i) \cdot \textrm{logit}^{-1}\!\left(\delta_i \cdot \left(\alpha^\textrm{spec}_j - \beta_i\right)\right)\right)\right)
& \textrm{if } z_i = 0.
\end{cases}
\end{equation}
The second case ($z_i = 0$) reduces to
\begin{equation}
\textrm{Bernoulli}\!\left( \left( 1 - \lambda_i \right)
\cdot \left( 1 - \textrm{logit}^{-1}\!\left(\delta_i \cdot \left(\alpha^\textrm{spec}_j - \beta_i\right)\right)\right)\right).
\end{equation}
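The following Python sketch implements this rating distribution directly; it is a simulation helper for illustration (with invented toy parameter values), not the Stan code used for inference.
```python
import numpy as np

def inv_logit(u):
    return 1.0 / (1.0 + np.exp(-u))

def prob_rating_one(z, alpha_sens, alpha_spec, beta, delta, lam):
    """Probability of a 1 rating for an item with true category z.

    Pr[correct] = lam + (1 - lam) * inv_logit(delta * (alpha_k - beta)),
    with k = sens if z = 1 and k = spec if z = 0; the correct-rating
    probability is flipped to a 1-rating probability when z = 0.
    """
    alpha = alpha_sens if z == 1 else alpha_spec
    p_correct = lam + (1.0 - lam) * inv_logit(delta * (alpha - beta))
    return p_correct if z == 1 else 1.0 - p_correct

# toy example: an able rater on a moderately difficult positive item
print(prob_rating_one(z=1, alpha_sens=1.5, alpha_spec=1.0,
                      beta=0.5, delta=1.2, lam=0.1))
```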
### Adding predictors
If item-level or rater-level covariates are available, they may be
used to inform the parameters through a regression in the form of a
generalized linear model.
Suppose there are item-level covariates $x_i \in \mathbb{R}^K$. With a parameter vector $\gamma \in \mathbb{R}^K$, the generative model for the category of an item may be extended to a logistic regression,
\begin{equation}
z_i \sim \textrm{Bernoulli}\!\left( \textrm{logit}^{-1}\!\left( x_i \cdot \gamma \right) \right).
\end{equation}
Here, $x_i \in \mathbb{R}^K$ is considered a row vector because it is a row of the data matrix $x \in \mathbb{R}^{I \times K}$. An intercept for the regression may be added by including a column of 1 values in $x$. The resulting coefficients act as a classifier for new items by generating a covariate-specific prevalence. This motivated @raykar2010's combination of Dawid and Skene's rating model and a logistic regression of this form.
Item-level covariates can also be used to generate the item-level parameters of difficulty, discrimination, and guessability. For example, a covariate indicating the number of options in a multiple-choice test would inform the guessability parameter. If the grade level of the textbook from which a problem was drawn were included as a predictor, it could be used to inform the difficulty parameter.
If there are rater-level covariates $u_j \in \mathbb{R}^L$, they may be used in the same way to generate the ability parameters for the raters. For example, there might be an indicator of whether the rater was an undergraduate, a graduate student, a faculty member, or a crowdsourced external agent, which could be used to inform ability.
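As a small sketch of the covariate-specific prevalence (with an invented design matrix and coefficient vector), the logistic regression above amounts to the following computation.
```python
import numpy as np

def inv_logit(u):
    return 1.0 / (1.0 + np.exp(-u))

# toy design matrix: intercept column plus one covariate (invented values)
x = np.array([[1.0, -0.3],
              [1.0,  0.8],
              [1.0,  1.9]])
gamma = np.array([-0.5, 1.2])  # invented regression coefficients

# covariate-specific prevalence: Pr[z_i = 1] = inv_logit(x_i . gamma)
prevalence = inv_logit(x @ gamma)
print(prevalence)
```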
## Marginal likelihood
The generative model leaves us with latent discrete categories $z_i \in \{ 0, 1 \}$ for each item. For both optimization and sampling, it is convenient to marginalize the complete-data likelihood $p(y, z \mid \pi, \alpha, \beta, \delta, \lambda)$ to the rating likelihood $p(y \mid \pi, \alpha, \beta, \delta, \lambda)$. This notation suppresses the conditioning throughout on the design variables $\textrm{rater}$ and $\textrm{item}$. The marginalization is efficient because it factors by item. Letting $\theta = (\pi, \alpha, \beta, \delta, \lambda)$ be the full set of continuous parameters, the trick is to rearrange the long-form data by item, grouping the indices $n$ for which $\textrm{item}_n = i$.
\begin{equation}
\begin{array}{rcl}
p(y, z \mid \theta)
& = &
\prod_{i=1}^I p(z_i \mid \theta)
\cdot
\prod_{n = 1}^N p(y_n \mid z, \theta)
\\[4pt]
& = &
\prod_{i=1}^I
\left( p(z_i \mid \theta)
\cdot
\prod_{n : \textrm{item}_n = i} p(y_n \mid z_i, \theta)
\right).
\end{array}
\end{equation}
On a per item basis, the marginalization is tractable,
\begin{equation}
p(y \mid \theta)
= \prod_{i=1}^I
\sum_{z_i = 0}^1 \,
p(z_i \mid \theta)
\cdot
\prod_{n : \textrm{item}_n = i} p(y_n \mid z_i, \theta).
\end{equation}
Computational inference requires a log likelihood. The log marginal likelihood of the rating data is
\begin{equation}
\begin{array}{rcl}
\log p(y \mid \theta)
& = & \log \prod_{i=1}^I
\sum_{z_i = 0}^1 \,
p(z_i \mid \theta)
\cdot
\prod_{n : \textrm{item}_n = i} p(y_n \mid z_i, \theta)
\\[4pt]
& = &
\sum_{i=1}^I
\textrm{logSumExp}_{z_i = 0}^1 \,
\left(
\log p(z_i \mid \theta)
+
\sum_{n : \textrm{item}_n = i} \log p(y_n \mid z_i, \theta)
\right),
\end{array}
\end{equation}
where
\begin{equation}
\textrm{logSumExp}_{n = 1}^N \, \ell_n
= \log \sum_{n=1}^N \exp(\ell_n)
\end{equation}
is the numerically stable log-scale analogue of addition.
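For concreteness, here is a Python sketch of this per-item marginalization using `scipy.special.logsumexp`; it assumes long-form arrays as sketched earlier and is an illustration of the computation rather than the Stan implementation used for inference.
```python
import numpy as np
from scipy.special import logsumexp

def inv_logit(u):
    return 1.0 / (1.0 + np.exp(-u))

def log_marginal_likelihood(y, item, rater, pi,
                            alpha_sens, alpha_spec, beta, delta, lam):
    """Log p(y | theta): marginalize the latent category z_i of each item.

    y, item, rater   : long-form arrays (items and raters indexed from 1)
    pi               : prevalence of positive items
    alpha_sens/spec  : per-rater log odds abilities (length J)
    beta, delta, lam : per-item difficulty, discrimination, guessing (length I)
    """
    I = item.max()
    total = 0.0
    for i in range(1, I + 1):
        idx = np.where(item == i)[0]       # ratings of item i
        j = rater[idx] - 1                 # 0-based rater indices
        log_terms = []
        for z, log_pz in ((1, np.log(pi)), (0, np.log1p(-pi))):
            alpha = alpha_sens[j] if z == 1 else alpha_spec[j]
            p_correct = lam[i - 1] + (1 - lam[i - 1]) * inv_logit(
                delta[i - 1] * (alpha - beta[i - 1]))
            p_one = p_correct if z == 1 else 1 - p_correct
            log_lik = np.sum(np.where(y[idx] == 1,
                                      np.log(p_one), np.log1p(-p_one)))
            log_terms.append(log_pz + log_lik)
        total += logsumexp(log_terms)      # log-scale sum over z_i in {0, 1}
    return total
```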
## Model reductions
By tying or fixing parameters, the full model may be reduced to define a wide range of natural submodels. Six of these models correspond to item-response theory models of the one-, two-, and three-parameter logistic variety, either with or without a sensitivity/specificity distinction. The model with varying rater sensitivity and specificity and no item effects reduces to Dawid and Skene's model. Other models, such as the model with a single item effect and no rater effects, have been studied in the epidemiology literature.
The table below summarizes the possible model reductions and gives them tags by which they can be abbreviated when evaluating models.
| Reduction | Description | Tag |
|:---------:|:-----------:|:---:|
| $\lambda_i = 0$ | no guessing items | A |
| $\delta_i = 1$ | equal discrimination items | B |
| $\beta_i = 0$ | equal difficulty items | C |
| $\alpha^{\textrm{spec}}_j = \alpha^{\textrm{sens}}_j$ | equal error raters | D |
| $\alpha_j = \alpha_{j'}$ | identical raters | E |
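These reductions amount to fixing or tying parameters in the probability-correct formula. The following Python sketch (an illustration, not the Stan code) makes the correspondence explicit: each keyword default encodes one reduction tag.
```python
import numpy as np

def inv_logit(u):
    return 1.0 / (1.0 + np.exp(-u))

def prob_correct(alpha, beta=0.0, delta=1.0, lam=0.0):
    """Probability of a correct rating under the general formula.

    The keyword defaults encode the reductions:
      lam = 0   -> A (no guessing)
      delta = 1 -> B (equal discrimination)
      beta = 0  -> C (equal difficulty)
    Passing a single accuracy alpha rather than separate sensitivity and
    specificity values corresponds to D; sharing alpha across raters to E.
    """
    return lam + (1.0 - lam) * inv_logit(delta * (alpha - beta))

# e.g., reduction ABC (Dawid and Skene): only the rater's alpha^k matters
print(prob_correct(alpha=1.3))
```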
The primary division among the models is whether they split accuracy into sensitivity and specificity components. Raters with unequal sensitivities and specificities show a bias toward 0 or 1 ratings. Models without a sensitivity versus specificity distinction make sense when the categories are not ordered. For example, when asking consumers which of two brands they prefer, the labels 0 and 1 are arbitrary, and sensitivity should equal specificity. The following models do not distinguish sensitivity and specificity.
\begin{equation}
\begin{array}{ccl}
\textrm{Reductions} & \textrm{Probability Correct} & \textrm{Note} \\ \hline
D
& \lambda_i + (1 - \lambda_i) \cdot \textrm{logit}^{-1}(\delta_i \cdot (\alpha_j - \beta_i)) & \textrm{\small IRT 3PL}
\\
CD
& \lambda_i + (1 - \lambda_i) \cdot \textrm{logit}^{-1}(\delta_i \cdot \alpha_j)
\\
BD
& \lambda_i + (1 - \lambda_i) \cdot \textrm{logit}^{-1}(\alpha_j - \beta_i) & \textrm{\small IRT 2PL}
\\ \hline
BCD
& \lambda_i + (1 - \lambda_i) \cdot \textrm{logit}^{-1}(\alpha_j)
\\
AD
& \textrm{logit}^{-1}(\delta_i \cdot (\alpha_j - \beta_i))
\\
ACD
& \textrm{logit}^{-1}(\delta_i \cdot \alpha_j)
\\ \hline
ABD
& \textrm{logit}^{-1}(\alpha_j - \beta_i) & \textrm{\small IRT 1PL}
\\
ABCD
& \textrm{logit}^{-1}(\alpha_j)
\\
ABCDE
& \textrm{logit}^{-1}(\alpha)
\end{array}
\end{equation}
The final model in the list is the only model that does not distinguish among the raters, using a single accuracy parameter. The following models introduce separate parameters for sensitivity and specificity rather than assuming they are the same.
\begin{equation}
\begin{array}{ccl}
\textrm{Reductions} & \textrm{Probability Correct} & \textrm{Note} \\ \hline
\textrm{(none)}
& \lambda_i + (1 - \lambda_i) \cdot \textrm{logit}^{-1}(\delta_i \cdot (\alpha^k_j - \beta_i)) & \textrm{\small IRT 3PL + sens/spec}
\\
C
& \lambda_i + (1 - \lambda_i) \cdot \textrm{logit}^{-1}(\delta_i \cdot \alpha^k_j)
\\
BC
& \lambda_i + (1 - \lambda_i) \cdot \textrm{logit}^{-1}(\alpha^k_j)
\\ \hline
A
& \textrm{logit}^{-1}(\delta_i \cdot (\alpha^k_j - \beta_i)) & \textrm{\small IRT 2PL + sens/spec}
\\
AC
& \textrm{logit}^{-1}(\delta_i \cdot \alpha^k_j)
\\
AB
& \textrm{logit}^{-1}(\alpha^k_j - \beta_i) & \textrm{\small IRT 1PL + sens/spec}
\\ \hline
ABC
& \textrm{logit}^{-1}(\alpha^k_j) & \textrm{\small Dawid and Skene}
\\
ABCE
& \textrm{logit}^{-1}(\alpha^k)
\end{array}
\end{equation}
The final model in this list has a single sensitivity and specificity for all raters, whereas the other models have varying effects.
There is a single model that does not use any notion of rating accuracy at all, relying solely on item effects.
\begin{equation}
\begin{array}{cc}
\textrm{Reductions} & \textrm{Probability Correct} \\ \hline
ABDE
& \textrm{logit}^{-1}(- \beta_i) \quad
\\ \hline
\end{array}
\end{equation}
The problem with this model is that, with no rater parameters at all, the latent categories are not identified: jointly flipping every $z_i$, negating every $\beta_i$, and replacing $\pi$ with $1 - \pi$ yields exactly the same likelihood.
The remaining models are all redundant in the sense that fixing their non-identifiability issues reduces them to a model with a single item effect.
\begin{equation}
\begin{array}{ccc}
\textrm{Reductions} & \textrm{Probability Correct} \\ \hline
E
& \lambda_i + (1 - \lambda_i) \cdot \textrm{logit}^{-1}(\delta_i \cdot (\alpha^k - \beta_i))
\\
DE
& \lambda_i + (1 - \lambda_i) \cdot \textrm{logit}^{-1}(\delta_i \cdot (\alpha - \beta_i))
\\
CE
& \lambda_i + (1 - \lambda_i) \cdot \textrm{logit}^{-1}(\delta_i \cdot \alpha^k)
\\ \hline
CDE
& \lambda_i + (1 - \lambda_i) \cdot \textrm{logit}^{-1}(\delta_i \cdot \alpha)
\\
BE
& \lambda_i + (1 - \lambda_i) \cdot \textrm{logit}^{-1}(\alpha^k - \beta_i)
\\
BDE
& \lambda_i + (1 - \lambda_i) \cdot \textrm{logit}^{-1}(\alpha - \beta_i)
\\ \hline
BCE
& \lambda_i + (1 - \lambda_i) \cdot \textrm{logit}^{-1}(\alpha^k)
\\
BCDE
& \lambda_i + (1 - \lambda_i) \cdot \textrm{logit}^{-1}(\alpha)
\\
AE
& \textrm{logit}^{-1}(\delta_i \cdot (\alpha^k - \beta_i))
\\ \hline
ADE
& \textrm{logit}^{-1}(\delta_i \cdot (\alpha - \beta_i))
\\
ACE
& \textrm{logit}^{-1}(\delta_i \cdot \alpha^k)
\\
ACDE
& \textrm{logit}^{-1}(\delta_i \cdot \alpha)
\\ \hline
ABE
& \textrm{logit}^{-1}(\alpha^k - \beta_i)
\end{array}
\end{equation}
# Empirical Evaluations
The posteriors of all 18 models introduced above were sampled for two data sets. The first data set consists of 5 dentists rating each of roughly 4000 dental X-rays for caries (a kind of pre-cavity) [@espeland1989]. The second is nearly 200 Mechanical Turkers, each rating a subset of roughly 3000 pairs of sentences for entailment [@snow2008].
The models were coded in Stan (version 2.33) and fit with default sampling settings using CmdStanPy (version 1.20). The default sampler is the multinomial no-U-turn sampler, an adaptive form of Hamiltonian Monte Carlo [@hoffman; @betancourt2017] that adapts a diagonal metric. The default number of chains is four, and each chain runs 1000 warmup iterations (for burn-in and adaptation) and 1000 sampling iterations. All sampling runs ended with split-$\widehat{R}$ values less than 1.01 for all parameters (prevalence, rater parameters, and item parameters), indicating consistency with convergence to approximate stationarity [@gelman2013].
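For reference, a fit with these default settings might be launched as follows; the Stan file name and data file name are placeholders rather than files from this repository.
```python
from cmdstanpy import CmdStanModel

# hypothetical file names; the arguments below match the settings described
# in the text: 4 chains, 1000 warmup and 1000 sampling iterations per chain
model = CmdStanModel(stan_file="crowdsource_3pl.stan")
fit = model.sample(
    data="ratings_data.json",
    chains=4,
    iter_warmup=1000,
    iter_sampling=1000,
    seed=1234,
)
print(fit.summary())  # summary includes split R-hat for each parameter
```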
The table below reports posterior predictive $p$-values for a rater-level test statistic and a ratings-level test statistic for each of the fitted models.
\begin{equation}
\begin{array}{|l||r|r|}
\hline
\text{Model} & \text{Rater \textit{p}-value} & \text{Ratings \textit{p}-value} \\ \hline \hline
\text{abcde} & < 0.001 & < 0.001 \\
\text{abcd} & < 0.001 & < 0.001 \\
\text{abce} & < 0.001 & 0.019 \\ \hline
\text{abde} & < 0.001 & < 0.001 \\
\text{abc} & 0.462 & < 0.001 \\
\text{abd} & 0.074 & < 0.001 \\ \hline
\text{acd} & < 0.001 & < 0.001 \\
\text{bcd} & < 0.001 & < 0.001 \\
\textbf{ab} & 0.468 & 0.218 \\ \hline
\text{ac} & 0.405 & < 0.001 \\
\text{ad} & 0.325 & < 0.001 \\
\text{bc} & 0.102 & 0.001 \\ \hline
\text{bd} & < 0.001 & < 0.001 \\
\text{cd} & < 0.001 & < 0.001 \\
\textbf{a} & 0.046 & 0.289 \\ \hline
\text{c} & 0.014 & 0.001 \\
\text{d} & < 0.001 & < 0.001 \\
\textbf{full} & 0.120 & 0.01 \\ \hline
\end{array}
\end{equation}
Only three models pass the posterior predictive checks: AB (IRT 1PL), A (IRT 2PL), and the full model (IRT 3PL), all with the sensitivity/specificity distinction. Note that model ABC (Dawid and Skene) does not pass. Only models with difficulty parameters that also distinguish sensitivity and specificity on a per-rater basis passed.
::: {#refs}
:::