Overview

This guide shows you exactly how to choose, compute, and report inter rater agreement for real studies and QA programs. Inter rater agreement quantifies how consistently different people rate the same items; it’s essential whenever human judgment affects outcomes, from clinical imaging reads to essay scoring to content moderation.

We’ll clarify agreement versus reliability and map decisions across data types and study designs. We explain kappa, AC1/AC2, ICC, alpha, CCC, and Bland–Altman in plain language. You’ll get practical steps for confidence intervals, sample size, and reporting.

Along the way, we point to authoritative standards such as the GRRAS checklist for transparent reporting and widely cited guidance for ICC selection and reporting (ICC reporting guideline by Koo and Li). We also link the original method-comparison approach by Bland and Altman for absolute agreement (Bland–Altman method comparison).

Inter rater agreement vs reliability: what’s the difference and why it matters

This section separates two concepts often conflated: agreement and reliability. Agreement asks “do raters give the same answers?” Reliability asks “are measurements consistent enough to rank or discriminate?”

Agreement is about closeness on the same scale (e.g., two radiologists both call “present” or both measure 42 mm). Reliability focuses on variance: how much of the total variability is due to true differences versus error.

For categorical data, percent agreement and chance-corrected indices (kappa family, AC1/AC2) are standard. For continuous or rating-scale data, the intraclass correlation coefficient (ICC) assesses reliability. Bland–Altman or the concordance correlation coefficient (CCC) assesses absolute agreement.

A high Pearson correlation can coexist with poor agreement if one rater is systematically higher. That’s why correlation alone is not sufficient for inter rater agreement. Choose your statistic based on your decision need: do you need exact agreement, or is consistent ranking acceptable?

When to use percent agreement, kappa variants, AC1/AC2, alpha, ICC, or CCC

This section gives you a pragmatic selection framework by data type, number of raters, and goal. The right statistic depends on whether your ratings are categorical (nominal/ordinal) or continuous, how many raters participate, and whether raters are fixed or sampled.

For nominal categories with two raters, start with percent agreement and Cohen’s kappa. If prevalence is extreme or bias exists, add Gwet’s AC1 (more stable under skewed prevalence) or PABAK.

For nominal categories with many raters, use Fleiss’ kappa. If the design is unbalanced or has missing ratings, Krippendorff’s alpha handles the irregularity well.

For ordinal categories, use weighted kappa (linear or quadratic) for two raters. For multiple raters or missingness, use Krippendorff’s alpha with an ordinal distance.

For continuous ratings, use ICC for reliability. Use Bland–Altman or CCC when you need absolute agreement against a method or between raters. Always report uncertainty via confidence intervals.
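As a starting point for the two-rater nominal case above, here is a minimal sketch in Python using scikit-learn; the labels are invented toy data, and percent agreement is computed directly from the arrays.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Toy data: two raters labeling the same 10 items (nominal categories)
rater_a = np.array(["yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes", "yes"])
rater_b = np.array(["yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes", "yes"])

percent_agreement = np.mean(rater_a == rater_b)   # raw proportion of matching labels
kappa = cohen_kappa_score(rater_a, rater_b)       # chance-corrected agreement

print(f"Percent agreement: {percent_agreement:.2f}")
print(f"Cohen's kappa:     {kappa:.2f}")
```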

Decision map by data scale, rater design, and fixed vs random effects

This subsection operationalizes common scenarios so you can pick with confidence. If you have nominal categories and two raters on the same items, compute percent agreement and Cohen’s kappa first.

If category prevalence is skewed or marginal distributions differ, add Gwet’s AC1 to mitigate the kappa paradox. With more than two raters and a balanced design, use Fleiss’ kappa.

If raters vary by item or some items are missing ratings, Krippendorff’s alpha handles the irregularity well. For ordinal scales with two raters, choose weighted kappa.

Use linear weights when each additional step of disagreement should add the same penalty. Use quadratic weights when larger gaps deserve disproportionately stronger penalties and small discrepancies are more tolerable.

For continuous scores, choose an ICC model consistent with your design (one-way, two-way random, or two-way mixed; absolute agreement vs consistency; single vs average-measures). Supplement with Bland–Altman limits if clinical interchangeability matters.

If you need one statistic to handle many raters, multiple scales, and missingness, Krippendorff’s alpha is a robust choice. If you need generalization across random raters and items, consider an ICC from a random-effects model and, for rater severity, a many-facet Rasch approach later in QA.

The kappa prevalence/bias paradox and practical remedies (PABAK, AC1/AC2)

This section explains why Cohen’s kappa can be “low” even when percent agreement looks “high,” and what to do. Kappa adjusts observed agreement by the agreement expected by chance based on each rater’s marginal distributions.

When one category is very common (prevalence) or raters’ tendencies differ (bias), the agreement expected by chance becomes large, pushing kappa down. The result can look paradoxical.

In practice, two raters might agree on 95% of cases because almost all are “negative,” yet kappa reports only “fair” agreement: under such imbalance, chance agreement is already high.

Remedies include PABAK, which rescales percent agreement to a kappa-like metric by the formula in words: twice the observed agreement minus one. It directly addresses prevalence without using marginal probabilities.

Gwet’s AC1 and AC2 redefine the chance-agreement term using an estimate that stays stable when marginals are skewed. AC1 is for nominal categories and AC2 extends to ordinal categories (see an inter-rater reliability overview).

When prevalence is extreme or marginal imbalance is expected, compute and report kappa alongside AC1/AC2 (and PABAK). Reconcile interpretation with the study context.
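To make the paradox and the PABAK remedy concrete, here is a small sketch with deliberately skewed toy data (the counts are invented for illustration):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Skewed toy data: 95 "negative" items, 5 "positive"; rater B misses 4 of the 5 positives
rater_a = np.array(["neg"] * 95 + ["pos"] * 5)
rater_b = np.array(["neg"] * 95 + ["neg"] * 4 + ["pos"] * 1)

p_o = np.mean(rater_a == rater_b)           # observed (percent) agreement
kappa = cohen_kappa_score(rater_a, rater_b)
pabak = 2 * p_o - 1                         # prevalence-adjusted, bias-adjusted kappa

print(f"Percent agreement: {p_o:.2f}")      # ~0.96
print(f"Cohen's kappa:     {kappa:.2f}")    # much lower, despite high raw agreement
print(f"PABAK:             {pabak:.2f}")    # ~0.92
```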

Weighted kappa for ordinal ratings and how it relates to ICC

This section shows how to handle ordered categories without treating all disagreements equally. Weighted kappa assigns partial credit to near-misses based on a weight matrix.

Linear weights penalize one-step disagreements proportionally. Quadratic weights penalize larger gaps more heavily and are more forgiving of small discrepancies.

In many scoring rubrics (e.g., 0–4 severity), quadratic weights better reflect practical closeness. There’s also a conceptual link: for two raters on ordinal scales treated as interval-like, quadratic weighted kappa is closely related to an ICC from a two-way consistency model.

Both emphasize relative closeness rather than absolute levels. In practice, compute weighted kappa with both linear and quadratic weights to check robustness. When appropriate, compute an ICC to triangulate reliability; differences can flag scale properties or rater effects worth investigating.
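A minimal sketch of the robustness check described above, using scikit-learn’s weighted kappa on invented 0–4 severity ratings:

```python
from sklearn.metrics import cohen_kappa_score

# Toy ordinal ratings on a 0-4 severity rubric for 12 items
rater_a = [0, 1, 2, 2, 3, 4, 1, 0, 2, 3, 4, 2]
rater_b = [0, 2, 2, 1, 3, 3, 1, 0, 3, 3, 4, 2]

kappa_linear    = cohen_kappa_score(rater_a, rater_b, weights="linear")
kappa_quadratic = cohen_kappa_score(rater_a, rater_b, weights="quadratic")

print(f"Weighted kappa (linear):    {kappa_linear:.2f}")
print(f"Weighted kappa (quadratic): {kappa_quadratic:.2f}")  # usually higher when most misses are near-misses
```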

Choosing an ICC model: one-way random, two-way random, two-way mixed; single vs average measures

This section helps you match your design to the correct intraclass correlation coefficient. ICCs differ by whether raters are sampled randomly or are the only raters of interest, whether the design is one-way or two-way, and whether you care about absolute agreement or consistency.

Use one-way random effects [ICC(1)] when different random subsets of raters score different items. Examples include crowd annotation where items and raters vary.

Use two-way random effects [ICC(2)] when all items are rated by all raters and you want to generalize to a population of raters. This supports broader inference.

Use two-way mixed effects [ICC(3)] when all items are rated by fixed, specific raters (e.g., your two in-house experts). You don’t intend to generalize to other raters.

Choose “absolute agreement” when exact match matters (e.g., same numeric value). Choose “consistency” when systematic mean differences are tolerable and ranking is primary.

Decide “single-measures” if a single rater’s score will be used operationally. Choose “average-measures” if you will average across k raters; averaging reduces error variance and raises reliability (see the ICC reporting guideline by Koo and Li).

Report the ICC form explicitly (e.g., ICC(2,1), absolute agreement). Provide a confidence interval.
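The sketch below uses the pingouin package (mentioned in the tools section) to compute all ICC forms from long-format data; the scores are toy values, and you would report only the row matching your design.

```python
import pandas as pd
import pingouin as pg

# Long-format toy data: every item scored by the same three raters
scores = pd.DataFrame({
    "item":  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6],
    "rater": ["A", "B", "C"] * 6,
    "score": [41, 43, 40, 55, 57, 54, 62, 60, 63, 38, 40, 37, 50, 49, 51, 45, 47, 44],
})

icc = pg.intraclass_corr(data=scores, targets="item", raters="rater", ratings="score")
# Rows include ICC1/ICC2/ICC3 (single measures) and ICC1k/ICC2k/ICC3k (average measures),
# each with a 95% confidence interval; pick the row that matches your design and report it explicitly.
print(icc[["Type", "Description", "ICC", "CI95%"]])
```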

Krippendorff’s alpha for complex designs and missing data

This section introduces a versatile coefficient that works across scales and messy designs. Krippendorff’s alpha measures agreement corrected for chance across nominal, ordinal, interval, and ratio data.

It allows any number of raters per item and accommodates missing and “uncertain” ratings naturally. Alpha computes observed disagreement versus expected disagreement given the data and scale-specific distances.

When there are missing ratings, it uses only available pairs without ad hoc imputation. In content analysis and ML labeling with variable participation, alpha is often the most straightforward summary.

Conventional benchmarks in that literature view α ≥ 0.80 as acceptable for most uses and 0.67–0.80 as tentatively acceptable depending on stakes (Krippendorff’s alpha). If your design is unbalanced or raters differ by item, alpha provides a principled, chance-corrected summary without discarding data.
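A minimal sketch with the krippendorff Python package (also noted in the tools section), assuming ratings are coded numerically and missing entries are marked with NaN:

```python
import numpy as np
import krippendorff

# Rows = raters, columns = items; np.nan marks items a rater did not score.
# Toy categories coded numerically (e.g., 0 = absent, 1 = present, 2 = uncertain).
ratings = np.array([
    [0,      1, 1, 0, 2,      np.nan, 1, 0],
    [0,      1, 0, 0, 2,      1,      1, np.nan],
    [np.nan, 1, 1, 0, np.nan, 1,      1, 0],
])

alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal")
print(f"Krippendorff's alpha (nominal): {alpha:.2f}")
```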

Bland–Altman limits of agreement vs ICC for continuous measures

This section distinguishes reliability from interchangeability for continuous outcomes. ICC quantifies the proportion of variance due to true differences, so it can be high even if one rater is consistently higher than another.

Bland–Altman directly assesses whether two methods or raters can be used interchangeably. A Bland–Altman analysis plots differences versus means, estimates the mean bias, and gives limits of agreement.

The limits are mean difference ± 1.96 times the standard deviation of differences under normality. If the limits are clinically acceptable, the methods “agree” for practical purposes.

If not, even a high ICC doesn’t imply interchangeability. The CCC combines correlation and bias into a single coefficient to quantify agreement against the 45-degree line (Concordance correlation coefficient).

For method comparison, report Bland–Altman limits and justify acceptability in the study’s units. If needed, add an ICC for reliability; they answer different questions.
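A minimal sketch of the Bland–Altman computation in Python with invented paired measurements; the 1.96 multiplier assumes approximately normal differences, as noted above.

```python
import numpy as np

# Paired continuous measurements (e.g., lesion size in mm) from two raters on the same items
rater_a = np.array([42.0, 55.5, 37.0, 60.2, 48.8, 52.1, 45.3, 58.0])
rater_b = np.array([43.5, 57.0, 38.2, 61.0, 50.1, 53.0, 46.8, 59.5])

diff = rater_a - rater_b
bias = diff.mean()                 # mean difference (systematic bias)
sd   = diff.std(ddof=1)            # SD of the differences
loa_lower, loa_upper = bias - 1.96 * sd, bias + 1.96 * sd

print(f"Mean bias: {bias:.2f} mm")
print(f"95% limits of agreement: {loa_lower:.2f} to {loa_upper:.2f} mm")
# Plot differences vs. means (the Bland-Altman plot) with any plotting library,
# and judge whether the limits are acceptable in the study's units.
```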

Planning your study: sample size, power, and confidence intervals for agreement metrics

This section turns targets into design inputs so you end with interpretable precision. Decide your minimum acceptable agreement (e.g., kappa ≥ 0.60, ICC ≥ 0.80) and your target precision (e.g., 95% CI width ≤ 0.20).

Choose the number of items and raters accordingly. For kappa, the width of the confidence interval depends on the number of items, the number of categories, the prevalence distribution, and the true agreement level.

A rough rule is that skewed prevalence requires more items than balanced prevalence to achieve the same CI width. For ICC, the CI depends on the number of items (subjects), the number of raters, and the variance components.

More raters or items narrow the CI, with the first few additions giving the largest gains. In practice, pilot a small sample to estimate category prevalence or variance components, set your target CI width, then use standard formulas or software to solve for n.

If comparing methods or groups, inflate the sample to maintain precision for subgroup estimates.
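One pragmatic way to check precision from a pilot is a nonparametric bootstrap over items; this sketch uses invented labels and a percentile interval, and the same idea applies to other coefficients.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Pilot data: two raters, 60 items (toy labels)
rater_a = rng.choice(["yes", "no"], size=60, p=[0.3, 0.7])
rater_b = np.where(rng.random(60) < 0.85, rater_a, rng.choice(["yes", "no"], size=60))

# Nonparametric percentile bootstrap over items
n = len(rater_a)
boots = []
for _ in range(2000):
    idx = rng.integers(0, n, n)   # resample items with replacement
    boots.append(cohen_kappa_score(rater_a[idx], rater_b[idx]))
lo, hi = np.percentile(boots, [2.5, 97.5])

print(f"Kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")
print(f"Bootstrap 95% CI: [{lo:.2f}, {hi:.2f}]  (width {hi - lo:.2f})")
# If the width exceeds your target (e.g., 0.20), plan more items before the main study.
```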

Setting target precision (CI width) and minimum acceptable agreement

This subsection helps you set clear, defensible design targets.

Handling missing or ‘uncertain’ ratings and varying-rater designs

This section covers pragmatic strategies when not every rater scores every item or when raters can choose “uncertain.” Missing and uncertain ratings are common and, if mishandled, can bias agreement estimates upward or downward.

For nominal/ordinal data, Krippendorff’s alpha naturally accommodates missingness and uncertain codes as distinct categories with appropriate distance functions. For kappa, if “uncertain” is a meaningful category, include it.

If it reflects low confidence rather than a distinct state, consider sensitivity analyses treating it as missing. For varying-rater designs (e.g., crowd annotation), use alpha or fit a random-effects model to estimate variance components and an ICC that generalizes across raters.

Always perform a sensitivity analysis. Compute agreement with and without uncertain ratings, or using different missing-data treatments, and discuss the impact.
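A minimal sketch of that sensitivity analysis for a two-rater kappa, with invented labels and an “uncertain” code handled two ways:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rater_a = np.array(["present", "absent", "uncertain", "present", "absent",    "present", "uncertain", "absent"])
rater_b = np.array(["present", "absent", "present",   "present", "uncertain", "present", "uncertain", "absent"])

# Treatment 1: "uncertain" is a meaningful third category
kappa_with = cohen_kappa_score(rater_a, rater_b)

# Treatment 2: drop any item where either rater said "uncertain" (treat it as missing)
keep = (rater_a != "uncertain") & (rater_b != "uncertain")
kappa_without = cohen_kappa_score(rater_a[keep], rater_b[keep])

print(f"Kappa, uncertain as a category:      {kappa_with:.2f}")
print(f"Kappa, uncertain treated as missing: {kappa_without:.2f}")
# Report both and discuss whether the choice changes your conclusion.
```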

Comparing agreement between groups or studies

This section explains how to test whether two coefficients differ meaningfully. When comparing kappa across independent groups, you can use an asymptotic z-test based on each estimate’s standard error or use bootstrap methods.

The bootstrap yields a confidence interval for the difference directly. For paired designs (same items, different rater pairs), use methods that account for dependence.

For ICC comparisons, if groups are independent, you can compare confidence intervals or use variance-component modeling with a group factor. Likelihood-ratio tests can assess differences in reliability.

Because Fisher’s z transformation is not directly applicable to ICCs, formal comparisons are best implemented via mixed-effects models. If CIs overlap only marginally, prefer hypothesis testing or equivalence testing with a predefined margin that is practically meaningful.

Regardless of method, report both point differences and uncertainty. Interpret in context (e.g., whether a 0.05 difference in ICC changes decisions).
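For independent groups, a simple percentile bootstrap of the difference in kappa looks like this; the two sites’ labels are invented toy data.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(1)

def bootstrap_kappa_diff(a1, b1, a2, b2, n_boot=2000):
    """Percentile bootstrap CI for kappa(group 1) - kappa(group 2), independent groups."""
    diffs = []
    for _ in range(n_boot):
        i1 = rng.integers(0, len(a1), len(a1))
        i2 = rng.integers(0, len(a2), len(a2))
        diffs.append(cohen_kappa_score(a1[i1], b1[i1]) - cohen_kappa_score(a2[i2], b2[i2]))
    return np.percentile(diffs, [2.5, 97.5])

# Toy data for two independent sites with different agreement levels
a1 = rng.choice(["pos", "neg"], size=80)
b1 = np.where(rng.random(80) < 0.90, a1, rng.choice(["pos", "neg"], size=80))
a2 = rng.choice(["pos", "neg"], size=80)
b2 = np.where(rng.random(80) < 0.75, a2, rng.choice(["pos", "neg"], size=80))

lo, hi = bootstrap_kappa_diff(a1, b1, a2, b2)
print(f"95% CI for the difference in kappa: [{lo:.2f}, {hi:.2f}]")
# If the interval excludes 0 (or your equivalence margin), the groups differ meaningfully.
```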

Rater training, calibration, and drift monitoring over time

This section outlines how to build and keep agreement, not just measure it. High inter rater agreement arises from clear rubrics, examples near thresholds, and regular calibration. It declines with rater turnover, fatigue, and scope creep.

Before live scoring, train with gold-standard items and provide immediate feedback. Set a go/no-go threshold (e.g., weighted kappa ≥ 0.70 or ICC ≥ 0.80) before raters can score operational items.

During production, monitor rolling agreement with control charts (e.g., Shewhart or CUSUM on differences or agreement coefficients) to detect rater drift early. For rater severity/leniency and drift over facets (raters, items, tasks), consider many-facet Rasch modeling.

Estimate and adjust severity while tracking change over time. Trigger retraining or adjudication when agreement drops below predefined limits for consecutive windows, and document corrective actions.
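A minimal sketch of rolling-window monitoring for a two-rater kappa; the window size, threshold, and data are invented and should be set from your own QA plan.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def rolling_kappa(rater_a, rater_b, window=50):
    """Kappa over consecutive windows of items, in scoring order, for drift monitoring."""
    out = []
    for start in range(0, len(rater_a) - window + 1, window):
        sl = slice(start, start + window)
        out.append(cohen_kappa_score(rater_a[sl], rater_b[sl]))
    return out

# Toy production stream: 300 items in scoring order
rng = np.random.default_rng(2)
a = rng.choice(["ok", "flag"], size=300)
b = np.where(rng.random(300) < 0.85, a, rng.choice(["ok", "flag"], size=300))

print(rolling_kappa(a, b, window=50))
# In production: recompute as adjudicated items accrue and alert when the rolling value
# falls below the predefined threshold (e.g., 0.70) for consecutive windows.
```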

Reporting your results: GRRAS-aligned checklists and domain-specific benchmarks

This section gives you a reporting template that stands up to peer review and audit. Follow the GRRAS guidance for transparency: define the target population and raters, detail sampling and training, specify the rating scale and instructions, predefine the agreement statistic and thresholds, handle missing data explicitly, and report estimates with confidence intervals and assumptions.

Benchmarks are domain-specific. In clinical research, ICC ≥ 0.75 is often considered good and ≥ 0.90 excellent for patient-level decisions.

Method interchangeability should be justified by Bland–Altman limits in clinically meaningful units. In content analysis, Krippendorff recommends α ≥ 0.80 for most uses (≥ 0.67 for tentative conclusions).

For categorical screening tests, report percent agreement alongside kappa and AC1/AC2, and interpret disagreements in light of prevalence and the cost of errors.

Always align thresholds with regulatory, clinical, or organizational risk tolerance rather than generic cutoffs.

Tools and workflows: practical steps to compute and report agreement in R, Python, and common stats packages

This section offers tool-agnostic workflows you can reproduce without heavy code. In R, packages like “irr,” “psych,” and “irrCAC” compute Cohen’s/weighted kappa, Fleiss’ kappa, Gwet’s AC1/AC2, and ICC. “DescTools” and “agreement” provide CCC and Bland–Altman utilities.

In Python, scikit-learn offers Cohen’s kappa. “Pingouin” includes ICC and CCC. The “krippendorff” package computes alpha, and plotting Bland–Altman is straightforward with common plotting libraries.

In SPSS, use Analyze > Descriptive Statistics > Crosstabs for Cohen’s kappa. Use Analyze > Scale > Reliability Analysis for ICC (specify model and type) and Graphs for Bland–Altman plots.

In Stata, “kap”/“kappaetc” compute kappa variants and “icc” estimates ICC. In SAS, kappa is available via PROC FREQ and ICC via PROC MIXED/GLM variance components; CCC is in some contributed macros.

Regardless of software, your workflow should define the design and statistic a priori. Compute the point estimate and a 95% CI. Check assumptions (prevalence, bias, normality for LoA), run sensitivity analyses (e.g., AC1 vs kappa, linear vs quadratic weights), and present results with interpretation in the study’s units.

Common pitfalls and misinterpretations to avoid

This section flags mistakes that lower credibility and obscure decision value. The most common error is using Pearson correlation to claim agreement; correlation ignores bias and can be high with poor agreement.

Another is reporting percent agreement alone without chance correction when categories are imbalanced; this inflates perceived agreement. Over-relying on a single statistic without uncertainty leads to false certainty; always add 95% CIs.

Ignoring prevalence and bias (the kappa paradox) can misclassify acceptable performance as “poor.” Add AC1/AC2 or PABAK in skewed settings.

For continuous measures, using only ICC without Bland–Altman can hide unacceptable limits for interchangeability. Finally, not specifying the ICC model or using average-measures ICC to justify single-rater use are common missteps—match the statistic to how scores will be used.
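To make the first pitfall concrete, the short simulation below produces a near-perfect Pearson correlation alongside a large systematic bias (all numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(3)

truth   = rng.normal(50, 10, 100)
rater_a = truth + rng.normal(0, 1, 100)
rater_b = truth + 8 + rng.normal(0, 1, 100)   # rater B reads systematically 8 units higher

r = np.corrcoef(rater_a, rater_b)[0, 1]
bias = np.mean(rater_a - rater_b)

print(f"Pearson r: {r:.2f}")           # very high: the raters rank items almost identically
print(f"Mean bias: {bias:.1f} units")  # about -8: they do not agree in absolute terms
```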

Checklist before you compute

Use this quick checklist to validate design choices and avoid rework: confirm the data scale (nominal, ordinal, or continuous); the number of raters and whether they are fixed or sampled; how missing or “uncertain” ratings will be handled; the agreement statistic and, if relevant, the ICC form chosen a priori; the minimum acceptable agreement and target CI width; and how scores will be used operationally (a single rater or an average across raters).

Appendix: Key formulas explained in plain language

This appendix summarizes core equations conceptually so you can interpret results confidently. Percent agreement is simply the number of times raters agree divided by the total number of items.

Cohen’s kappa equals “observed agreement minus agreement expected by chance” divided by “one minus agreement expected by chance.” Chance is computed from each rater’s marginal category proportions.
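In symbols, with p_o the observed agreement, p_e the chance-expected agreement, and p_1k and p_2k the two raters’ marginal proportions for category k:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_e = \sum_{k} p_{1k}\, p_{2k}
```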

Weighted kappa modifies this by assigning a weight between 0 and 1 to each pair of categories reflecting closeness. Disagreements far apart get lower weights.

PABAK takes observed agreement and rescales it to a kappa-like scale from −1 to 1 by the formula in words: two times observed agreement minus one. Gwet’s AC1/AC2 keep the same basic form as kappa but compute the expected-by-chance term differently, using an average probability of category assignment that stabilizes when prevalence is extreme.
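Written out, with q categories and π_k the average of the two raters’ marginal proportions for category k (one standard way of writing Gwet’s chance term):

```latex
\text{PABAK} = 2p_o - 1,
\qquad
\text{AC1} = \frac{p_o - p_e^{(\gamma)}}{1 - p_e^{(\gamma)}},
\quad
p_e^{(\gamma)} = \frac{1}{q-1}\sum_{k=1}^{q} \pi_k (1 - \pi_k)
```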

ICC is the ratio of “between-item variance” to “total variance,” where total variance is the sum of between-item variance and error components. In two-way models, it also includes rater and interaction variance (see Intraclass correlation).
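In variance-component notation, the single-measures one-way form and the two-way random, absolute-agreement form look like this:

```latex
\text{ICC}(1) = \frac{\sigma^2_{\text{items}}}{\sigma^2_{\text{items}} + \sigma^2_{\text{error}}},
\qquad
\text{ICC}(2,1)_{\text{agreement}} = \frac{\sigma^2_{\text{items}}}{\sigma^2_{\text{items}} + \sigma^2_{\text{raters}} + \sigma^2_{\text{error}}}
```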

The concordance correlation coefficient multiplies the Pearson correlation by a bias-correction factor that penalizes deviations from the 45-degree line. It reaches 1 only when points lie exactly on that line.

Bland–Altman limits of agreement are the average difference between raters plus or minus about two standard deviations of the differences. This captures where most differences will fall and whether that spread is acceptable in your domain.
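In symbols, with ρ the Pearson correlation, μ and σ the per-rater means and standard deviations, d̄ the mean rater difference, and s_d its standard deviation:

```latex
\rho_c = \frac{2\rho\,\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2},
\qquad
\text{LoA} = \bar{d} \pm 1.96\, s_d
```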

Putting it all together, inter rater agreement is not one number but a toolbox. Select the statistic that matches your scale and decision, plan for precision, compute with transparency, and report with context and uncertainty.

When in doubt, triangulate: combine a chance-corrected coefficient (kappa/AC1/alpha), a reliability estimate (ICC), and an absolute agreement analysis (Bland–Altman or CCC) to deliver a complete and defensible story.