Overview

Inter-rater reliability (IRR) quantifies how consistently two or more raters, instruments, or algorithms classify or score the same items. It matters because treatment decisions, research inferences, and model deployment often rest on these ratings. If raters disagree, the signal you think you see may be noise.

IRR spans families of statistics chosen by measurement level (nominal, ordinal, continuous), rater count, and study design (crossed versus nested raters). Many metrics also correct for agreement expected by chance.

Chance-corrected agreement (for example, Cohen’s kappa) adjusts raw percent agreement to avoid overestimating reliability when categories are common or imbalanced. Continuous measures usually rely on intraclass correlation coefficients (ICC) or Lin’s concordance correlation. Method-comparison studies use Bland–Altman limits of agreement. For background on the kappa family and interpretation pitfalls, see the NIH/PMC primer on Cohen’s kappa.

Why inter-rater reliability matters and common pitfalls

IRR directly affects validity. Low agreement inflates measurement error, attenuates effect sizes, and can bias diagnostic accuracy and model performance estimates.

In clinical diagnostics, overoptimistic agreement can mask safety risks. In social and behavioral coding, it can undermine construct validity. In ML annotation, poor IRR propagates into mislabeled training data and degraded model generalization.

Three pitfalls recur across studies.

First, base-rate effects and marginal imbalance can depress or inflate chance-corrected statistics. This creates the “kappa paradox” (high raw agreement but low kappa) as documented in the NIH/PMC primer on Cohen’s kappa.

Second, mis-specified models distort inferences. Examples include using unweighted kappa for ordered categories or the wrong ICC model for your design.

Third, operational issues degrade agreement over time. Insufficient training, inadequate codebooks, and rater drift are common causes.

Effect of category prevalence and marginal imbalance on agreement statistics

Prevalence and rater bias shift the expected-by-chance component of chance-corrected metrics. Kappa can drift toward zero even when raw agreement is high.

Extremely common or rare categories make it easy to “agree” by chance. Unequal rater tendencies (marginals) further skew expected agreement. The result, sometimes called the kappa paradox, can mislead in binary diagnostics with low disease prevalence. Raw agreement can exceed 90% while kappa remains modest.
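The paradox is easy to reproduce. A minimal sketch with hypothetical screening data (the labels are invented for illustration):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical screening sample: 100 cases, the positive class is rare.
# Both raters say "negative" on 90 cases and "positive" on 2;
# they split the remaining 8 cases (4 each way).
r1 = np.array([0] * 90 + [1] * 2 + [0] * 4 + [1] * 4)
r2 = np.array([0] * 90 + [1] * 2 + [1] * 4 + [0] * 4)

raw_agreement = (r1 == r2).mean()   # 0.92
kappa = cohen_kappa_score(r1, r2)   # ~0.29: chance agreement is ~0.89 here
```

Because both raters' marginals are 94/6, expected chance agreement is 0.94² + 0.06² = 0.8872, so 92% raw agreement yields a kappa near 0.29.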

Two practical remedies help when base rates are skewed. First, report prevalence and bias indices alongside kappa. Also present raw agreement so readers see both the unadjusted and adjusted pictures.

Second, consider prevalence-adjusted bias-adjusted kappa (PABAK) or estimators like Gwet’s AC1/AC2 that are less sensitive to marginal imbalances. Whichever path you choose, prespecify your primary metric, justify it relative to your data’s distribution, and include sensitivity analyses in your report.
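PABAK itself is a one-line adjustment: it replaces the empirical chance term with the value expected under uniform marginals, giving (k·p_o − 1)/(k − 1) for k categories. A sketch (the function name is ours):

```python
def pabak(p_o: float, k: int = 2) -> float:
    """Prevalence-adjusted bias-adjusted kappa from raw agreement p_o
    over k categories; reduces to 2 * p_o - 1 in the binary case."""
    return (k * p_o - 1) / (k - 1)

# 92% raw agreement in a binary task yields PABAK = 0.84,
# regardless of how skewed the marginals are.
result = pabak(0.92)
```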

Measure-selection framework: map your design to the right IRR statistic

Start by matching your measurement level, rater configuration, and study design to the appropriate statistic. Choose among nominal or ordinal versus continuous scales first. Then decide on two versus multiple raters. Finally, consider crossed (all raters rate all items) versus nested (different items per rater) designs.

Use this quick map:

- Nominal scale, two raters, fully crossed: Cohen’s kappa (or Gwet’s AC1 under skewed prevalence).
- Nominal scale, three or more raters, fully crossed: Fleiss’ kappa.
- Ordinal scale: weighted kappa (linear or quadratic weights) or Gwet’s AC2.
- Categorical or ordinal data with missing ratings or varying raters per item: Krippendorff’s alpha.
- Continuous measurements: ICC (model matched to your design) or Lin’s CCC; Bland–Altman for method comparison.
- Rankings: Kendall’s W.

Once you’ve matched a candidate metric, check assumptions. Are categories ordered? Are all raters evaluating the same items? Is missingness ignorable?

For borderline cases (e.g., partly crossed panels with some missing data), Krippendorff’s alpha is often a robust default for categorical or ordinal data. ICCs with mixed-effects modeling can handle more complex continuous designs.

How do I pick the right ICC model (one-way vs two-way; random vs mixed; absolute vs consistency) for my study design?

Pick the ICC that mirrors your sampling and inference. Use one-way random when each target is rated by a different random sample of raters. Use two-way random when all targets are rated by the same random sample of raters and you want to generalize to other raters. Use two-way mixed when raters are fixed and you care only about these raters.

Choose absolute agreement when exact matching matters. Choose consistency when systematic offsets are acceptable. For thresholds and reporting recommendations, see Koo and Li (2016) ICC guideline.

A practical rule set:

- Each target rated by a different random set of raters: one-way random, ICC(1).
- All targets rated by the same random sample of raters, generalizing beyond the panel: two-way random, ICC(2).
- All targets rated by the same fixed raters, inference limited to those raters: two-way mixed, ICC(3).
- Exact score matching required: absolute agreement. Systematic offsets tolerable: consistency.

If you average multiple ratings per target (k raters), use the “average” form (e.g., ICC(2,k)) for reliability of the mean rating. Use the “single” form for reliability of a single rater.

When in doubt, diagram your sampling and whether you intend to generalize beyond your rater panel. Then match to the model labels above in your software.
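To make the variance-component logic concrete, here is ICC(2,1) computed from two-way ANOVA mean squares, checked against the classic Shrout and Fleiss (1979) example of six targets rated by four judges:

```python
import numpy as np

def icc2_1(x: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    x is an n_targets x k_raters score matrix with no missing cells."""
    n, k = x.shape
    grand = x.mean()
    ms_r = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # targets
    ms_c = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # raters
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0) + grand
    ms_e = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Shrout & Fleiss (1979) example data; they report ICC(2,1) = 0.29.
scores = np.array([[9, 2, 5, 8],
                   [6, 1, 3, 2],
                   [8, 4, 6, 8],
                   [7, 1, 2, 6],
                   [10, 5, 6, 9],
                   [6, 2, 4, 7]], dtype=float)
```

In practice you would use a vetted routine (e.g., psych::ICC in R or pingouin.intraclass_corr in Python); the sketch shows what those routines estimate.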

Which inter-rater reliability statistic should I use for ordinal ratings with three or more raters and some missing data?

Use Krippendorff’s alpha with ordinal distance if rater coverage is incomplete or missingness is present. It accommodates varying numbers of raters per item and handles ties; see the Krippendorff’s alpha documentation.

If all raters scored all items and missingness is minimal, consider Gwet’s AC2 or a weighted multi-rater kappa. Both allow ordinal weights and are less sensitive to prevalence and bias than unweighted kappa.

When categories differ in clinical impact (e.g., “none,” “mild,” “moderate,” “severe”), align weights with those distances, often quadratic. If items are sporadically unrated, avoid listwise deletion that distorts rater marginals. Instead, use an estimator that naturally handles missing cells (Krippendorff’s alpha). Run a sensitivity analysis with imputed labels only if missingness is plausibly at random.

Nominal and ordinal ratings: kappa family, weighted kappa, and Gwet’s AC1/AC2

For nominal categories, kappa-type statistics compare observed agreement to expected-by-chance agreement based on rater marginals. Cohen’s kappa is for two raters. Fleiss’ kappa generalizes to many raters in fully crossed designs.
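Fleiss’ kappa works from an items-by-categories count table rather than raw labels; a compact sketch (the toy counts are invented for illustration):

```python
import numpy as np

def fleiss_kappa(counts) -> float:
    """Fleiss' kappa from an items x categories count matrix;
    every row must sum to the same number of raters (fully crossed)."""
    counts = np.asarray(counts, float)
    n = counts[0].sum()                      # raters per item
    p_j = counts.sum(axis=0) / counts.sum()  # overall category proportions
    p_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    p_e = np.square(p_j).sum()
    return (p_i.mean() - p_e) / (1 - p_e)

# 3 raters, 4 items, 2 categories: each row counts votes per category.
k = fleiss_kappa([[3, 0], [2, 1], [0, 3], [1, 2]])  # 1/3
```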

For ordinal categories, weighted kappa assigns penalties for disagreements proportional to how far apart the categories are. It improves sensitivity to near-misses compared with raw disagreement counting.

Gwet’s AC1 (nominal) and AC2 (ordinal) modify the chance component to stabilize estimates under imbalanced prevalence and marginal distributions. When disease prevalence is very low or one category dominates, AC1/AC2 often reflect perceived agreement better than kappa.

Regardless of the estimator, always report raw percent agreement and the distribution of categories. Readers need the context behind a single summary coefficient.

When is Gwet’s AC1/AC2 preferable to Cohen’s kappa, and how do I compute it?

AC1 and AC2 are preferable when category prevalence is very high or very low, or when rater marginals are unbalanced. In those situations, kappa can be unduly deflated despite high raw agreement.

Use AC1 for nominal data. Use AC2 for ordinal data with appropriate weights to reflect ordered categories. A concise, practitioner-friendly explanation and formulas are available in the Gwet’s AC1/AC2 overview.

To compute AC1/AC2 in R, use irrCAC::gwet.ac1.raw; supplying ordinal weights through its weights argument yields AC2, and the output includes standard errors and confidence intervals. In Stata, the community-contributed kappaetc supports AC1/AC2 for multiple raters. In Python, direct AC1/AC2 implementations are sparse; either port the formula or call R from Python via rpy2.

Always accompany AC estimates with confidence intervals. Obtain them analytically where supported or via nonparametric bootstrap.
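For intuition (not a substitute for the vetted packages above), two-rater nominal AC1 is short enough to sketch directly. On the rare-category pattern discussed earlier it stays high where kappa collapses:

```python
import numpy as np

def gwet_ac1(r1, r2) -> float:
    """Gwet's AC1 for two raters, nominal categories (unweighted)."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    cats = sorted(set(r1) | set(r2))
    po = (r1 == r2).mean()
    # chance term uses the average marginal proportion per category
    pi = np.array([((r1 == c).mean() + (r2 == c).mean()) / 2 for c in cats])
    pe = (pi * (1 - pi)).sum() / (len(cats) - 1)
    return (po - pe) / (1 - pe)

# Hypothetical rare-positive data: 90 joint negatives, 2 joint positives,
# 8 split decisions. Raw agreement 0.92; Cohen's kappa ~0.29; AC1 ~0.91.
r1 = [0] * 90 + [1] * 2 + [0] * 4 + [1] * 4
r2 = [0] * 90 + [1] * 2 + [1] * 4 + [0] * 4
```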

Weighted kappa variants: linear vs quadratic vs custom weights

Choose weights that reflect the real cost of disagreements. Linear weights penalize one-step disagreements proportionally. Quadratic weights penalize distant disagreements more heavily and often approximate ICC behavior for many categories.

Custom weight matrices let you encode clinical distances when some category jumps are riskier than others. As a rule of thumb, quadratic weights are a good default for evenly spaced ordinal scales with four or more categories. Linear weights can be preferred for short scales where each step has similar meaning.

When categories are unevenly spaced (e.g., “none,” “trace,” “significant”), specify a custom weight matrix and justify it in your protocol. Report the exact weights you used. Perform a sensitivity check with an alternative (e.g., linear vs quadratic) to demonstrate robustness.
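The effect of the weighting choice is visible on a small hypothetical example in which every disagreement is a one-step miss:

```python
from sklearn.metrics import cohen_kappa_score

# Two raters on a 4-level ordinal scale; all 4 disagreements are one step apart.
r1 = [0, 1, 2, 3, 1, 2, 3, 0, 2, 1]
r2 = [0, 1, 3, 3, 1, 1, 2, 0, 2, 2]

k_unw = cohen_kappa_score(r1, r2)                        # ~0.46
k_lin = cohen_kappa_score(r1, r2, weights="linear")      # ~0.65
k_qua = cohen_kappa_score(r1, r2, weights="quadratic")   # ~0.81
```

Because every miss is a near-miss, the penalty shrinks as weights grow more forgiving of small distances, and the coefficient rises accordingly.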

Rankings and ordered consensus: Kendall’s W and its relation to Kendall’s tau

When raters rank items rather than assign categories, Kendall’s W measures overall agreement across the full set of raters. It ranges from 0 (no consensus) to 1 (perfect consensus) and accommodates ties with appropriate corrections.

W summarizes concordance at the group level. Kendall’s tau quantifies concordance for a pair of raters. The two are related but answer different questions.

Use Kendall’s W when you want a single number to describe how consistently a panel orders items (e.g., prioritizing cases for review). If you need to identify which raters align or diverge, inspect pairwise Kendall’s tau coefficients in addition to W.

For incomplete rankings, consider rank aggregation methods alongside W to handle missing ranks gracefully. Specify how ties were handled in your analysis plan.
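For complete rankings without ties, W reduces to a short computation on rank sums (a sketch; add the tie correction if your data contain ties):

```python
import numpy as np

def kendalls_w(ranks: np.ndarray) -> float:
    """Kendall's W for complete rankings without ties.
    ranks: m_raters x n_items matrix, each row a permutation of 1..n."""
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Three raters ranking four items; near-perfect consensus.
ranks = np.array([[1, 2, 3, 4],
                  [1, 3, 2, 4],
                  [1, 2, 3, 4]])
w = kendalls_w(ranks)  # ~0.911
```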

Continuous agreement: ICC vs Lin’s concordance correlation vs Bland–Altman

For continuous measurements, ICC quantifies reliability as the ratio of between-target variance to total variance. Different models reflect rater sampling and inference choices.

Lin’s concordance correlation coefficient (CCC) measures agreement by combining precision (correlation) and accuracy (closeness to the 45-degree line). It is ideal when comparing two methods or a method against a gold standard.

Bland–Altman analysis complements both by estimating mean bias and limits of agreement. It reveals whether methods differ systematically or have heteroscedastic error. See Martin Bland’s Bland–Altman notes for worked examples.

Pick ICC when your goal is reliability across raters or sessions within a design. For example, can any trained rater reproduce this score? Pick CCC when you care about method interchangeability. Ask whether the new device is numerically concordant with the reference.

Use Bland–Altman when you must visualize bias and clinically acceptable limits. In regulatory or clinical validation, combine CCC and Bland–Altman to show both concordance and practical interchangeability bounds.
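The core Bland–Altman quantities take only a few lines (a sketch; plotting and judging clinical acceptability of the limits are separate steps):

```python
import numpy as np

def bland_altman_limits(a, b):
    """Mean bias and 95% limits of agreement for paired measurements."""
    d = np.asarray(a, float) - np.asarray(b, float)
    bias = d.mean()
    sd = d.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical paired readings from two devices
dev_a = [12, 11, 14, 10, 15]
dev_b = [11, 12, 12, 10, 12]
bias, lo, hi = bland_altman_limits(dev_a, dev_b)  # bias 1.0, limits ~(-2.1, 4.1)
```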

What is the difference between ICC and Lin’s concordance correlation for continuous measures of agreement?

ICC focuses on reliability, or how well measurements discriminate between subjects. It depends on the design and variance components.

CCC measures absolute agreement against the 45-degree line by integrating correlation and mean bias. ICC can be high even with systematic bias if the rank ordering is preserved. CCC will penalize such bias.

Use ICC for reliability within a rater or method design. Use CCC when you need two methods to be numerically exchangeable. Add Bland–Altman to visualize bias and limits.
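The contrast is mechanical: Pearson correlation ignores a constant offset, but CCC’s mean-difference term penalizes it. A minimal sketch:

```python
import numpy as np

def lins_ccc(x, y) -> float:
    """Lin's concordance correlation using population (n-divisor) variances."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxy = ((x - x.mean()) * (y - y.mean())).mean()
    return 2 * sxy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

x = np.arange(1.0, 11.0)
ccc_same = lins_ccc(x, x)       # 1.0: perfect agreement
ccc_bias = lins_ccc(x, x + 2)   # ~0.80: Pearson r = 1, but the offset is penalized
```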

Multi-rater and complex designs: Fleiss’ kappa, Krippendorff’s alpha, and generalizability theory

With more than two raters, Fleiss’ kappa extends chance-corrected agreement to fully crossed nominal designs. Krippendorff’s alpha flexibly handles different numbers of raters per item, missing data, and different distance functions (nominal, ordinal, interval).

For ordinal multi-rater panels, Gwet’s AC2 can also be attractive in balanced, fully crossed designs. When raters are nested within sites or when designs cross multiple facets (e.g., rater, occasion, site), classical coefficients struggle to decompose error sources.

Generalizability theory (G-theory) addresses multi-facet designs by partitioning variance components attributable to persons, raters, items, and their interactions. You can conduct a G-study to estimate these components and a D-study to forecast reliability under alternative designs (e.g., “What if we add a second rater?”).

This approach is especially useful for credentialing, OSCE/OSPE assessments, and complex observational coding where rater and task facets interact meaningfully.

Generalizability theory (G- and D-studies) for rater facets

G-theory reframes reliability as universe-score generalizability across facets. In a G-study, fit a random-effects model where persons, raters, and items (and interactions) are random effects. Then compute generalizability coefficients from estimated variance components.

In a D-study, vary the number of raters or items to see how reliability improves as you add resources. This clarifies the most efficient pathway to your reliability target.

Practically, start with a fully crossed pilot and estimate components with a mixed-effects model. Identify which facet drives error. For example, a large person-by-rater interaction suggests inconsistent application of the rubric.

Then simulate D-study scenarios, like doubling raters versus extending the rubric, to choose the cost-effective design change. Report your facet specification, variance components, and the D-study scenarios you considered.
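For a simple persons-by-raters design, the D-study projection is a one-line formula over estimated variance components; a sketch (the variance values are hypothetical, and with one observation per person-rater cell the interaction is confounded with residual error):

```python
def g_coefficient(var_person: float, var_rel_error: float, n_raters: int) -> float:
    """D-study generalizability coefficient for a persons x raters design:
    relative error (person-by-rater interaction plus residual) averages
    out as raters are added."""
    return var_person / (var_person + var_rel_error / n_raters)

# Hypothetical G-study estimates: person variance 4.0, relative error 2.0.
for n in (1, 2, 4):
    print(n, round(g_coefficient(4.0, 2.0, n), 3))  # 0.667, 0.8, 0.889
```

The diminishing returns visible here are exactly what a D-study quantifies when weighing extra raters against other design changes.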

Handling missing data, imbalanced categories, and the prevalence/bias paradox

Missing or incomplete ratings are common in real workflows. For categorical or ordinal data with missing cells, prefer estimators that support incomplete matrices (Krippendorff’s alpha). You can also use pairwise approaches with caution.

For continuous data, mixed-effects ICC models can accommodate unbalanced designs. If missingness is informative (e.g., raters skip uncertain cases), consider multiple imputation with sensitivity analyses. Never impute purely to “fix” kappa.

When categories are imbalanced, report prevalence and bias indices. Present raw agreement alongside chance-corrected metrics.

If kappa appears paradoxically low, apply PABAK to contextualize agreement. Consider switching to Gwet’s AC1/AC2, which stabilizes the chance component.

In extreme prevalence settings (e.g., screening rare diseases), prespecify AC1/AC2. You can also plan to stratify by prevalence bands to avoid misleading global summaries.

What strategies mitigate the prevalence and bias paradox in kappa-based agreement?

Use a three-part strategy: report, adjust, and triangulate.

First, always report raw agreement and the distribution of categories so readers see context. Second, compute prevalence and bias indices and consider PABAK as a sensitivity estimator. Third, triangulate with a prevalence-robust coefficient like Gwet’s AC1/AC2 and compare interpretations.

Details and formula behavior are covered in the NIH/PMC kappa primer and Gwet’s overview. Document your primary metric choice and the rationale in the protocol to prevent post hoc metric shopping.

Planning power and sample size for kappa and ICC

Power and sample size for IRR studies should be planned around a target coefficient and an acceptable confidence interval (CI) width. For kappa, planning depends on expected category prevalence and the null versus target kappa values.

Designs often seek to reject a minimally acceptable kappa (e.g., 0.60) in favor of a target (e.g., 0.80) with specified power and alpha. For ICC, planning depends on the ICC model, number of raters, and the distinction between single-measure and average-measure reliability.

Analytic formulas exist but can be fragile when assumptions are violated (e.g., balanced designs, fixed prevalence). A robust approach is to simulate plausible data under your design and compute the target IRR statistic repeatedly. Choose the sample size that achieves your desired power or CI width.
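A skeleton of that simulation approach for two binary raters. The accuracy parameterization is our assumption for generating data; in practice you would calibrate it so the simulated kappas match your null and target values:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(42)

def sample_kappa(n_items: int, prev: float, accuracy: float) -> float:
    """Simulate two raters who each report the true binary label
    correctly with probability `accuracy` (errors independent)."""
    truth = rng.random(n_items) < prev
    r1 = np.where(rng.random(n_items) < accuracy, truth, ~truth)
    r2 = np.where(rng.random(n_items) < accuracy, truth, ~truth)
    return cohen_kappa_score(r1, r2)

def power(n_items, prev, acc_null, acc_target, alpha=0.05, sims=400):
    """Monte Carlo power: critical value from the null scenario,
    rejection rate under the target scenario."""
    null = np.sort([sample_kappa(n_items, prev, acc_null) for _ in range(sims)])
    crit = null[int((1 - alpha) * sims) - 1]
    target = [sample_kappa(n_items, prev, acc_target) for _ in range(sims)]
    return float(np.mean(np.array(target) > crit))

# Power to distinguish 0.97-accurate raters from a 0.80-accurate null at n = 200:
p = power(200, 0.3, 0.8, 0.97)
```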

For kappa, the R package kappaSize offers closed-form and exact methods. For ICC, ICC.Sample.Size or simulation via mixed-effects models is flexible. Always prespecify a minimally acceptable reliability threshold consistent with field standards and justify it with citations (e.g., Koo and Li (2016) ICC guideline).

How do I calculate the required sample size to achieve a target Cohen’s kappa (e.g., 0.80) given expected prevalence?

Start by fixing four elements: the number of raters, the number of categories, the expected prevalence vector, and your null and target kappas (e.g., H0: κ ≤ 0.60 vs H1: κ ≥ 0.80). Then choose your Type I error (e.g., 0.05) and desired power (e.g., 0.80).

Use a planning tool that incorporates prevalence to compute the required number of items. In R, the kappaSize package does this with exact or asymptotic methods (PowerBinary for two categories, with Power3Cats through Power5Cats for more).

When prevalence is highly skewed or uncertain, run a sensitivity analysis by varying the prevalence inputs and observing how required N changes. If your panel is multi-rater or the design is not fully crossed, consider Krippendorff’s alpha planning or simulate data reflecting your actual design. Compute the empirical power to detect your target.

Document your assumed prevalence and the resulting sample size so readers can assess design adequacy.

Rater training, calibration, and drift monitoring

Reliability is designed, not discovered. Start with a clear rubric and exemplars. Conduct calibration sessions with feedback until your raters achieve prespecified agreement on a pilot set.

Lock your codebook, including edge-case handling, before production ratings. During data collection, monitor agreement on a rolling basis and create a remediation plan with retraining triggers.

Operationally, predefine blinded overlap items at a steady cadence (e.g., 10% of cases double-scored each month). Track IRR over time.

When agreement dips below your action threshold, pause, review disagreement patterns, and retrain using recent exemplars. Document drift checks, thresholds, and corrective actions in your methods, as many journals and regulators expect procedural transparency.

How can I monitor and correct rater drift over a multi-month study?

Monitor drift with statistical process control. Maintain a moving-window estimate of your primary IRR metric (e.g., 50 most recent overlaps) and plot it on a control chart with prespecified lower action limits.

For binary or categorical outcomes, you can also track a p-chart of raw agreement as an early-warning signal. For continuous measures, track ICC and within-rater standard deviation.
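For the p-chart variant, the control limits follow directly from the binomial standard error around your baseline agreement rate; a sketch:

```python
def p_chart_limits(p_bar: float, n_window: int):
    """Three-sigma p-chart limits for raw agreement, given a baseline
    rate p_bar and n_window double-scored items per monitoring window."""
    se = (p_bar * (1 - p_bar) / n_window) ** 0.5
    return max(0.0, p_bar - 3 * se), min(1.0, p_bar + 3 * se)

# Baseline 90% agreement with 50 overlap items per month:
lo, hi = p_chart_limits(0.90, 50)  # lower action limit ~0.773
```

A window falling below the lower limit is the signal to trigger the structured review described above.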

When metrics breach limits, trigger a structured review. Diagnose disagreement themes, update or clarify the rubric, and conduct targeted recalibration sessions before resuming ratings.

To avoid overreacting to noise, require consecutive breaches (e.g., two windows below threshold) before pausing. Archive disagreements and resolutions to build a living codebook and reduce recurring ambiguity.

Over long studies, schedule periodic re-baselining sessions even without breaches. This refreshes shared understanding and mitigates slow drift.

Crowdsourcing and ML annotation reliability

In crowdsourced labeling and ML annotation, rater ability and effort vary widely. Majority vote can be suboptimal when worker accuracies differ.

Probabilistic aggregation methods such as Dawid–Skene and GLAD estimate worker confusion matrices and infer latent true labels. These methods improve effective IRR and downstream model performance.

Active learning complicates IRR because the case mix evolves. Monitor agreement by strata (e.g., easy vs hard cases) to ensure apparent IRR is not inflated by an increasingly easy sample.

Decide whether your target is agreement among crowd workers or agreement between aggregated labels and an expert gold standard. If you report crowd-only IRR, ensure overlap among workers so that multi-rater metrics are estimable.

If you report against experts, use the appropriate two-rater or method-comparison metrics and disclose the aggregation method. In all cases, use qualification tests, ongoing honeypots, and feedback loops to sustain label quality.

How should I aggregate labels from crowdsourced raters (majority vote vs Dawid–Skene) and how does that choice impact IRR?

Use majority vote when worker quality is fairly homogeneous and the task is easy. It is simple, fast, and strong under balanced conditions.

Prefer Dawid–Skene or GLAD when worker accuracy varies or when categories are imbalanced. These models weight workers by inferred ability and often improve agreement with expert labels and model performance.

Aggregation affects IRR. Majority vote can understate true consensus if a minority of high-quality workers are drowned out. DS/GLAD can raise observed agreement by amplifying reliable raters.

Report both the aggregation method and post-aggregation IRR against a reference set. If possible, compare multiple aggregation strategies on a held-out gold standard.
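A minimal EM sketch of binary Dawid–Skene (our simplification of the original model, which estimates full per-worker confusion matrices over any number of categories):

```python
import numpy as np

def dawid_skene_binary(labels, n_iter: int = 50) -> np.ndarray:
    """Minimal EM for binary Dawid-Skene. labels: items x workers matrix
    with entries 0/1, or -1 where a worker skipped the item."""
    labels = np.asarray(labels)
    mask = labels >= 0
    # initialize the posterior P(true=1) from per-item vote shares
    q = np.where(mask, labels, 0).sum(1) / mask.sum(1)
    eps = 1e-9
    for _ in range(n_iter):
        # M-step: per-worker accuracy within each latent class
        w1, w0 = q[:, None] * mask, (1 - q)[:, None] * mask
        a1 = (w1 * (labels == 1)).sum(0) / np.maximum(w1.sum(0), eps)
        a0 = (w0 * (labels == 0)).sum(0) / np.maximum(w0.sum(0), eps)
        prior = q.mean()
        # E-step: update the posterior from worker likelihoods
        ll1 = np.log(prior + eps) + np.where(
            mask, np.where(labels == 1, np.log(a1 + eps), np.log(1 - a1 + eps)), 0
        ).sum(1)
        ll0 = np.log(1 - prior + eps) + np.where(
            mask, np.where(labels == 0, np.log(a0 + eps), np.log(1 - a0 + eps)), 0
        ).sum(1)
        q = 1.0 / (1.0 + np.exp(ll0 - ll1))
    return (q > 0.5).astype(int)

# Toy matrix: five items, three workers, one missing vote (-1).
labels = np.array([[1, 1, 1],
                   [0, 0, 0],
                   [1, -1, 1],
                   [1, 1, 1],
                   [0, 0, 0]])
consensus = dawid_skene_binary(labels)
```

On unanimous toy data this trivially matches majority vote; the model pays off when worker accuracies differ and votes conflict.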

Adjudication workflows and their impact on reported IRR

Adjudication resolves disagreements after independent ratings via consensus discussion, tie-breakers, or expert arbitration. While adjudication improves final label quality, it inflates apparent agreement if you compute IRR post-adjudication. Disagreements have been removed by design.

Best practice is to compute IRR on the initial, blinded ratings and to describe adjudication procedures for generating the analysis dataset.

Specify whether your reported IRR is pre- or post-adjudication. Note who served as arbitrators and whether arbitrators were drawn from the same rater pool.

If adjudication systematically favors one rater type (e.g., senior clinicians), discuss the potential for bias. When space permits, present both pre- and post-adjudication IRR. Explain how reconciliation altered category distributions and any downstream analyses.

How do consensus adjudication and tie-breaker procedures affect reported IRR and study conclusions?

Consensus and tie-breakers generally increase final agreement by design. This can lead to overestimation if IRR is computed after reconciliation.

It matters because diagnostic accuracy, algorithm performance, or rubric adequacy may look stronger than the underlying independent agreement warrants. The defensible approach is to report independent IRR as your primary metric. Describe the adjudication process transparently, and use post-adjudication labels for outcome analyses, not for reliability estimation.

If adjudication changes prevalence (e.g., pushing borderline cases into a dominant category), chance-corrected metrics may shift even if raw agreement stays high. To maintain interpretability, provide both raw and adjusted metrics pre-adjudication. Summarize how adjudication altered label distributions.

Interpretation thresholds, discipline standards, and regulatory reporting

Interpreting IRR requires discipline-aware thresholds and transparent reporting. Many fields still cite conventional labels (“moderate,” “substantial”), but their cutoffs vary and should not substitute for context.

For ICC, widely used guidance suggests ICC < 0.5 poor, 0.5–0.75 moderate, 0.75–0.9 good, and > 0.9 excellent. These thresholds must be justified for your use case; see Koo and Li (2016) ICC guideline for cautionary notes.

Diagnostic accuracy studies often follow reporting frameworks such as the STARD guidelines for diagnostic accuracy. Risk-of-bias assessments like QUADAS-2 encourage transparent description of reader studies, blinding, and agreement metrics.

A concise reporting checklist helps:

- Design: number of raters, crossed versus nested, and how items were sampled.
- Primary metric and its justification, prespecified before analysis.
- Raw agreement and category (or score) distributions alongside chance-corrected coefficients.
- Confidence intervals for every reported coefficient.
- Handling of missing ratings and any adjudication, with IRR computed pre-adjudication.

Software how-tos: R, Python, SPSS, and Stata

You can compute most IRR statistics with standard packages.

In R, use irr::kappa2 for Cohen’s kappa and irr::kappam.fleiss for Fleiss’ kappa. Use psych::cohen.kappa for weighted kappa. Use irrCAC::gwet.ac1.raw for AC1 (and, with ordinal weights, AC2). Use irr::kripp.alpha for Krippendorff’s alpha. For ICC, use irr::icc or psych::ICC. For Lin’s CCC, use DescTools::CCC.

Bootstrap CIs via boot::boot with a resampling function that recomputes your IRR on each bootstrap sample. For Bayesian ICC, fit a random-effects model with brms::brm and summarize the intraclass correlation from posterior draws.

In Python, use sklearn.metrics.cohen_kappa_score (supports weights='linear' or 'quadratic') for weighted kappa. Use statsmodels.stats.inter_rater.fleiss_kappa for multi-rater nominal data. Use pingouin.intraclass_corr for ICC and scipy.stats.kendalltau for pairwise tau. The krippendorff package (krippendorff.alpha(reliability_data, level_of_measurement='ordinal')) computes alpha with missing data. For CCC, implement it directly; the formula needs only the two means, variances, and the covariance. For AC1/AC2, either port the formula or call R via rpy2.

In SPSS, compute Cohen’s kappa via Crosstabs (Analyze > Descriptive Statistics > Crosstabs > Statistics > Kappa). Compute Kendall’s W via Nonparametric Tests (K Related Samples). Compute ICC via Reliability Analysis (Analyze > Scale > Reliability Analysis > Model: Intraclass Correlation). Weighted kappa may require an extension command from IBM’s extension hub.

In Stata, use kap for Cohen’s kappa. Use kappaetc (community-contributed) for multi-rater kappa and AC1/AC2. Use icc for intraclass correlations. Use concord (community-contributed) for Lin’s CCC. Use batplot (community-contributed) or your own commands for Bland–Altman plots.

How do I compute Krippendorff’s alpha with incomplete data in R or Python?

Use functions that natively support missing entries.

In R, arrange your ratings matrix with raters in rows and items in columns. Pass it to irr::kripp.alpha with method = "ordinal" or "nominal". The function ignores NAs and computes alpha with the appropriate distance metric.

In Python, install the krippendorff package and call krippendorff.alpha(reliability_data, level_of_measurement='ordinal') on a list-of-lists or NumPy array containing missing values (e.g., np.nan). The package handles these by excluding incomplete pairs.

Whichever language you use, validate your result by spot-checking a subset with complete data using a second estimator (e.g., weighted kappa). Ensure consistent direction and magnitude.

Report how missingness occurred, how much data were incomplete, and why alpha was chosen over alternatives.
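As an independent cross-check of whichever package you use, nominal alpha for small datasets can be computed directly from the coincidence matrix. A sketch tolerating missing entries, coded here as None:

```python
from collections import Counter
from itertools import permutations

def kripp_alpha_nominal(data) -> float:
    """Krippendorff's alpha, nominal distance. data: one list per rater,
    one position per item; None marks a missing rating."""
    o = Counter()  # coincidence matrix of ordered value pairs
    for col in zip(*data):
        vals = [v for v in col if v is not None]
        m = len(vals)
        if m < 2:
            continue  # items with fewer than 2 ratings are not pairable
        for a, b in permutations(vals, 2):
            o[(a, b)] += 1.0 / (m - 1)
    n_c = Counter()
    for (a, _), w in o.items():
        n_c[a] += w
    n = sum(n_c.values())
    d_o = sum(w for (a, b), w in o.items() if a != b)            # observed
    d_e = sum(n_c[a] * n_c[b]                                    # expected
              for a in n_c for b in n_c if a != b) / (n - 1)
    return 1.0 - d_o / d_e

# Two raters, five items, one rating missing:
alpha = kripp_alpha_nominal([[1, 2, 1, 1, 2],
                             [1, 2, 2, 1, None]])  # 8/15 ~ 0.533
```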