Overview

Choosing the right synthesis method is the first decision that determines whether your review will be decision-grade or merely descriptive. Best evidence synthesis (BES)—also written as best-evidence synthesis—offers a rigorous middle path. It helps when heterogeneity or methodological diversity makes a single pooled effect from meta-analysis misleading.

This guide shows when to choose BES over a random-effects meta-analysis or a conventional narrative review. It explains how to execute a stepwise BES protocol and align reporting with PRISMA 2020 and SWiM. It also shows how to translate findings into guidance with GRADE.

It draws on established methods from the Cochrane Handbook, ROB 2, ROBINS‑I, PRISMA 2020, SWiM, MOOSE, PROSPERO, and GRADE to provide an execution-ready playbook.

Definition and core principles of best-evidence synthesis

Best evidence synthesis (BES) is a systematic, transparent approach that integrates quantitative and qualitative evidence without forcing a pooled estimate when studies are too heterogeneous. It prioritizes methodological quality and directness. Each study’s influence on conclusions is explicit, while effect magnitudes are still summarized in comparable ways.

The core principle is simple. Do not let an average obscure important differences. Instead, let the best evidence speak loudest through clearly defined, quality-weighted judgments.

Practically, that means prespecifying eligibility, harmonizing effect metrics, and linking risk-of-bias (ROB) results to how much you rely on each study’s findings. To maintain transparency, protocolize all judgments and weighting rules. Report them in alignment with PRISMA 2020 and SWiM, which cover systematic reviews and synthesis without meta-analysis, respectively.

When to choose best-evidence synthesis vs meta-analysis vs narrative review

Your method choice should be driven by heterogeneity, data structure, and the decision context. BES is ideal when you need the discipline of a systematic review and the nuance of quality-weighted conclusions, but pooling would hide meaningful differences or rest on untenable assumptions.

A random-effects meta-analysis is preferred when outcomes, designs, and effect metrics are sufficiently compatible and heterogeneity is explainable or acceptable (as per the Cochrane Handbook). A traditional narrative review may suffice for exploratory scoping or very early fields, but it lacks the systematic, prespecified decision rules required for decision-grade recommendations. Document this choice in your protocol. Include thresholds that would trigger re-evaluation (for example, if additional comparable data later permit pooling).

Decision criteria and thresholds under high heterogeneity

Under high heterogeneity, the question is not “Can we pool?” but “Should we pool without misleading readers?” Choose BES when one or more of the following conditions apply and cannot be resolved with transformations or justified subgrouping: incompatible outcome definitions or effect metrics, unexplained heterogeneity in the direction or magnitude of effects, mixed designs with fundamentally different bias structures, or strata too sparse to support a stable pooled estimate.

When these conditions are met, a BES lets you articulate direction and strength of evidence by quality tier and context. It avoids an average that masks decision-relevant variation. Record the rationale and thresholds in the protocol and revisit them if new data emerge.

Contraindications for meta-analysis and scenarios favoring BES

Contraindications for pooling are red flags that the average could mislead. In practice, lean toward BES when you see conceptually different outcome constructs measured under one label, effect estimates that cannot be validly converted to a common metric, severe or uneven risk of bias across the corpus, or clear signs of selective reporting in a small evidence base.

State these contraindications in your protocol and describe how BES will handle them. For example, stratify results by design, ROB tier, or context and weight conclusions accordingly.

Step-by-step protocol for conducting a best-evidence synthesis

A strong BES protocol reads like an algorithm: clear rules, decision points, and documentation requirements. The blueprint below can be adapted across topic areas while staying standards-aligned and reproducible.

Define the question, scope, and eligibility criteria

Start by specifying a PICO (for interventions) or PEO (for exposures) question that aligns with the decisions your audience must make. BES requires tighter framing than a narrative review because you are committing to explicit weighting and harmonization decisions.

Define inclusion and exclusion criteria that anticipate heterogeneity: allowable designs, analytic approaches, outcome definitions, time windows, and minimal reporting standards. When heterogeneity is expected, predefine how you will stratify (e.g., by design, risk-of-bias tier, outcome timing) and how those strata influence interpretive weight. Document all criteria in the protocol. Justify them with references to the Cochrane Handbook, which details eligibility specification and handling of heterogeneity.

Search strategy and protocol registration

A comprehensive, reproducible search is non-negotiable, especially when you won’t have a pooled estimate to buffer sparse coverage. Combine database searches (e.g., MEDLINE, Embase, CENTRAL, Web of Science) with trial registries and grey literature to reduce publication bias.

Register your protocol on PROSPERO if your review addresses health-related outcomes. Registration requires a prespecified protocol and should occur before data extraction begins. If the review is not eligible for PROSPERO (e.g., methods-only or non-health topics), preserve transparency by timestamping a protocol on an institutional repository. Plan versioning for any amendments.

Screening, coding, and inter-rater reliability

Plan dual, independent screening at title/abstract and full-text stages to minimize selection bias. Develop a codebook that operationalizes eligibility and data items. Pilot it on a small subset, and refine ambiguous definitions before full extraction.

Set an a priori inter-rater reliability target (e.g., Cohen’s kappa ≥0.70 for key decisions). Track it during calibration, and implement adjudication rules with a third reviewer for conflicts. Record reasons for exclusion at full text to populate the PRISMA flow diagram and inform sensitivity analyses for potentially excluded borderline cases (see PRISMA 2020).
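The inter-rater reliability check is easy to automate during calibration. The sketch below (shown in Python for illustration; the screening decisions and the helper name cohens_kappa are hypothetical) computes Cohen’s kappa from two reviewers’ include/exclude decisions and flags when it falls below the prespecified 0.70 target:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical decisions (illustrative)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's marginal rates
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / n**2
    return (observed - expected) / (1 - expected)

# Calibration round with hypothetical decisions
a = ["include", "exclude", "exclude", "include", "exclude", "exclude"]
b = ["include", "exclude", "include", "include", "exclude", "exclude"]
kappa = cohens_kappa(a, b)
needs_recalibration = kappa < 0.70   # flag against the a priori target
```

In practice you would run this on each calibration batch and refine the codebook whenever the flag trips before proceeding to full screening.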

Data extraction, effect metrics, and audit trails

Design structured extraction forms that capture effect sizes or sufficient statistics (e.g., means/SDs, event counts, hazard ratios), ROB domains, population/context features, and analytic notes. Prespecify a hierarchy of preferred effect metrics by outcome type and how you will convert metrics when needed.

Maintain a full audit trail: document contact with authors, imputation or transformation rules, deviations from the protocol, and all versioned datasets. This level of traceability is critical in BES because interpretive weighting depends on transparent, reproducible judgments aligned with SWiM guidance for non-pooled syntheses.
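An audit trail of this kind can be as simple as an append-only log of timestamped snapshots. The sketch below (Python; the ExtractionRecord fields are a hypothetical schema, not a prescribed one) shows one way to keep every change to an extracted value traceable to a reason:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ExtractionRecord:
    # Field names are illustrative, not a prescribed schema
    study_id: str
    design: str            # e.g., "RCT", "cohort"
    outcome: str
    effect_metric: str     # preferred metric per the prespecified hierarchy
    effect_value: float
    rob_overall: str       # "low", "some concerns", "high"
    notes: str = ""

audit_log: list[dict] = []

def log_change(record: ExtractionRecord, reason: str) -> None:
    """Append-only audit trail: every change is timestamped with a reason."""
    audit_log.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "study_id": record.study_id,
        "reason": reason,
        "snapshot": asdict(record),   # deep copy taken at logging time
    })

rec = ExtractionRecord("smith2021", "RCT", "pain at 12 weeks", "SMD", -0.42, "low")
log_change(rec, "initial extraction")
rec.effect_value = -0.40
log_change(rec, "corrected after author contact")
```

Exporting this log alongside the versioned dataset gives reviewers the traceability that SWiM-style non-pooled syntheses depend on.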

Risk-of-bias assessment and quality-weighting rules

Apply risk-of-bias tools matched to design: ROB 2 for randomized trials and ROBINS‑I for non-randomized studies. ROB 2 evaluates five domains and ROBINS‑I evaluates seven. Each tool yields structured domain-level judgments that can be linked to weights (ROB 2; ROBINS‑I).

Define explicit weighting rules before seeing results. For example, assign preliminary weights such as 1.0 for low ROB, 0.75 for “some concerns”/moderate ROB, and 0.25 for high ROB. Prespecify exclusion when a “critical” domain is high risk (particularly in ROBINS‑I). Document how domain-level judgments will be synthesized into an overall weight. Flag any domain (e.g., selection bias) that carries extra influence in your context.
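Prespecified weighting rules like these are easiest to audit when written as an explicit lookup. A minimal sketch, mirroring the preliminary weights in the text (the function name and interface are illustrative):

```python
# Illustrative mapping from overall ROB judgment to interpretive weight,
# mirroring the preliminary weights suggested in the text.
ROB_WEIGHTS = {
    "low": 1.0,
    "some concerns": 0.75,   # ROB 2 wording
    "moderate": 0.75,        # ROBINS-I wording
    "high": 0.25,
}

def study_weight(overall_rob: str, critical_domain: bool = False) -> float:
    """Return the prespecified weight; a 'critical' ROBINS-I domain excludes."""
    if critical_domain:
        return 0.0
    return ROB_WEIGHTS[overall_rob.lower()]

w = study_weight("Some concerns")   # -> 0.75
```

Committing this mapping to version control before results are seen makes it verifiable that the weights were not tuned after the fact.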

Synthesis procedures and sensitivity analyses

Plan how you will summarize effect direction and magnitude within and across strata without pooling. Options include median effects with interquartile ranges by design/ROB tier, structured narratives anchored to effect sizes, and visual displays such as harvest plots.

Predefine sensitivity analyses that test the robustness of conclusions to analytic choices. These may include excluding high-ROB studies, varying effect-size conversions, focusing on preregistered trials, or emphasizing larger, more precise studies. State the decision rules for resolving discordant findings. Explain how you will communicate uncertainty in the final narrative, consistent with SWiM reporting items.
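A typical robustness check, excluding high-ROB studies and confirming that the direction of the median effect is unchanged, can be sketched as follows (all effect values are hypothetical):

```python
from statistics import median

# Hypothetical harmonized effects (SMD) tagged with overall ROB judgments
studies = [
    ("a", -0.50, "low"), ("b", -0.35, "low"), ("c", -0.45, "some concerns"),
    ("d", -0.90, "high"), ("e", 0.05, "high"), ("f", -0.30, "low"),
]

def summarize(effects):
    return {"n": len(effects), "median": median(effects)}

all_effects = [e for _, e, _ in studies]
low_rob_only = [e for _, e, rob in studies if rob == "low"]

primary = summarize(all_effects)
sensitivity = summarize(low_rob_only)
# Conclusions are direction-robust if both analyses point the same way
robust_direction = (primary["median"] < 0) == (sensitivity["median"] < 0)
```

Reporting both summaries, and whether the direction held, addresses the SWiM requirement to show how study limitations influenced the synthesis.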

Handling mixed study designs and harmonizing effect sizes

Mixed designs are where BES often outperforms “one-number” summaries, but only if you harmonize metrics and expectations clearly. Your goal is comparability without distortion, followed by design-aware interpretation.

When RCTs and observational studies co-exist, predefine whether you will present them separately or in a combined, design-weighted synthesis. If combined, ensure conversions are valid and note that residual confounding in non-randomized studies warrants down-weighting or at least explicit caution.

Effect-size computation across designs

Use effect-size conventions that preserve meaning while allowing structured comparison: mean differences (MD) or standardized mean differences (SMD) for continuous outcomes, risk ratios (RR) or odds ratios (OR) for binary outcomes, and hazard ratios (HR) for time-to-event outcomes.

State the preferred metric per outcome and allowed conversions (e.g., converting OR to RR with baseline risk). Record all transformations in your audit trail so that reviewers can replicate each step.
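For example, converting an OR to an approximate RR requires the baseline (comparator-group) risk. The sketch below uses one commonly cited approximation (Zhang and Yu); treat it as illustrative and record any such conversion in the audit trail:

```python
def or_to_rr(odds_ratio: float, baseline_risk: float) -> float:
    """Approximate RR from an OR, given comparator-group baseline risk
    (Zhang-Yu approximation)."""
    if not 0 < baseline_risk < 1:
        raise ValueError("baseline risk must be in (0, 1)")
    return odds_ratio / (1 - baseline_risk + baseline_risk * odds_ratio)

rr = or_to_rr(2.0, 0.10)        # OR 2.0 at 10% baseline risk -> RR ~1.82
rr_rare = or_to_rr(2.0, 0.01)   # rare outcome: OR and RR nearly coincide
```

The second call illustrates why the rare-outcome assumption matters: as baseline risk shrinks, the RR approaches the OR, so conversions are most consequential when events are common.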

Harmonization and sensitivity to analytic choices

Harmonization involves more than math—it’s about fair comparison. Prioritize MD on a common instrument over SMD when feasible. Emphasize relative measures (RR) when baseline risk varies across settings. Avoid converting when it would obscure clinical meaning.

Perform sensitivity checks for key conversions (e.g., SMD vs rescaled MD) and for analytic choices (e.g., adjusted vs unadjusted estimates in observational studies). Report where conclusions hinge on a conversion or assumption and incorporate that into certainty judgments later.
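When SMDs must be computed from summary statistics, the standard formulas are small enough to keep in version-controlled code. A sketch of Hedges’ g (pooled-SD Cohen’s d with the small-sample correction); the input values are hypothetical:

```python
from math import sqrt

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference with Hedges' small-sample correction."""
    pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / pooled_sd                # Cohen's d
    j = 1 - 3 / (4 * (n1 + n2 - 2) - 1)     # correction factor (df = n1+n2-2)
    return d * j

# Hypothetical trial arms: lower score = better outcome
g = hedges_g(m1=24.0, sd1=8.0, n1=30, m2=28.0, sd2=8.0, n2=30)
```

Keeping the computation in code, rather than a spreadsheet, lets the SMD-versus-rescaled-MD sensitivity check rerun automatically when an input is corrected.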

Integrating risk-of-bias tools into weighting (ROB 2 and ROBINS-I)

ROB is not a footnote in BES; it’s a lever that shapes conclusions. Integrate ROB 2 and ROBINS‑I into your synthesis logic so that high-quality evidence appropriately anchors the takeaways.

Link domain-level judgments to overall weights and to narrative emphasis. For instance, a low-ROB RCT that is directly applicable to your PICO should anchor effect interpretation. High-ROB observational findings should inform hypotheses or contextual factors rather than the main conclusion.

From domain judgments to overall study weights

Translate ROB domains into weights through transparent, prespecified rules. One pragmatic mapping is: low ROB → 1.0; “some concerns” (ROB 2) or moderate (ROBINS‑I) → 0.75; high ROB → 0.25; exclusion when any ROBINS‑I domain is judged critical.

Adjust these weights if a domain is pivotal for your question (e.g., allocation concealment in acute interventions). Always run sensitivity analyses that set high-ROB weights to zero to test whether conclusions change materially.

Reflecting ROB in conclusions and certainty statements

ROB judgments should flow into your certainty language and, later, GRADE ratings. For example: “Across three low-ROB RCTs (n≈1,200), effects were consistently moderate in size; three high-ROB observational studies suggested larger effects but did not change the overall interpretation.”

Make the tie explicit: state which studies carry the most influence and why. Note any residual concerns (e.g., small-study effects). This transparency increases trust and prepares the ground for guideline translation with GRADE.

Reporting standards and checklists mapped to PRISMA, SWiM, and MOOSE

BES is reported under the same umbrella as systematic reviews but requires extra clarity where you choose not to pool. Align your reporting with PRISMA 2020, use SWiM for quantitative summaries without meta-analysis, and apply MOOSE when non-randomized studies dominate.

A simple rule of thumb is: PRISMA structures the whole report. SWiM explains how you summarized and grouped results without pooling. MOOSE adds design-specific details when observational evidence is central.

PRISMA 2020 items tailored to BES

PRISMA 2020 includes a 27-item checklist that applies fully to BES, with emphasis on methods transparency. In BES, pay special attention to prespecified synthesis methods (item 13), risk-of-bias assessment (items 11, 18), certainty of evidence (item 15), and your rationale for not pooling (embedded in items 13 and 20).

Explicitly describe your grouping/stratification rules, effect-size harmonization choices, and the weighting scheme that links ROB to interpretation. Include a PRISMA flow diagram and note any protocol deviations with reasons so readers can trace decisions end to end.

Using SWiM to present quantitative summaries without pooling

SWiM provides nine reporting items designed to improve transparency when no meta-analysis is performed. For BES, use SWiM to define how studies were grouped for synthesis, specify the standardized metrics used, describe how direction and size of effects were summarized, and explain how study limitations influenced synthesis.

Provide a rationale for chosen summary statistics (e.g., medians and IQRs by ROB tier). Describe how you handled inconsistency. This structure prevents “vote counting” by significance and emphasizes effect direction and magnitude aligned to study quality.

MOOSE considerations when observational studies dominate

When your body of evidence is largely observational, MOOSE guides the reporting of confounders, measurement methods, and analytic adjustments. Detail inclusion and exclusion criteria for cohorts, how exposures and outcomes were ascertained, the handling of missing data, and which confounders were adjusted in effect estimates.

Map these features into ROBINS‑I judgments and your weighting rules. Justify any decisions to include unadjusted estimates in sensitivity analyses only. This clarity helps readers interpret residual confounding and external validity (MOOSE).

Publication bias assessment and mitigation without a pooled effect

Without a pooled estimate, you still need to guard against publication bias and selective reporting. The primary strategy in BES is prevention—exhaustive searching, trial registry checks, and protocol comparison—paired with structured diagnostics.

Mitigate bias by searching registries and grey literature, contacting authors about unpublished analyses, and comparing registered outcomes to reported outcomes for selective reporting. Present results stratified by study size, funding source, and registration status. If small studies show systematically larger effects, flag this as a risk signal. When feasible, consider p-curve or related evidential value approaches to assess whether the distribution of significant p-values indicates selective reporting. Report the limitations of these methods in small corpora.
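The small-versus-large stratification can be operationalized as a crude diagnostic: split the corpus at the median sample size and compare median effects. A sketch with hypothetical data (the threshold and the interpretation rule are illustrative, not a validated test):

```python
from statistics import median

# Hypothetical corpus: (sample size, harmonized effect size)
corpus = [(40, 0.80), (55, 0.75), (60, 0.60), (250, 0.30), (400, 0.25), (520, 0.35)]

cutoff = median(n for n, _ in corpus)
small = [e for n, e in corpus if n <= cutoff]
large = [e for n, e in corpus if n > cutoff]

# Crude risk signal: small studies reporting systematically larger effects
small_study_signal = median(small) > median(large)
```

A positive signal here does not prove publication bias; it flags a pattern to report and to probe in sensitivity analyses, as the text advises.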

Software, workflows, and reproducibility for best-evidence synthesis

BES benefits from a reproducible workflow that keeps decisions traceable from search to synthesis. Select tools that enable deduplication, dual screening, structured extraction, ROB assessments, effect-size harmonization, and version-controlled reporting.

Plan a lightweight but robust pipeline. Use a reference manager for deduplication, a screening platform for dual review, structured extraction with validation, ROB tools aligned to design, and an analysis/reporting environment (e.g., R/Quarto) under version control. Document all steps so another team could recreate your results from search strings to figures.

Screening and data management tooling

Use tools that streamline collaboration while preserving an audit trail. Reference managers (e.g., EndNote, Zotero) help deduplicate. Screening platforms (e.g., Rayyan, Covidence, EPPI-Reviewer) support dual screening and conflict resolution. Structured extraction in REDCap, Airtable, or validated spreadsheets maintains consistency.

Whatever stack you choose, standardize variable names, lock codebooks before full extraction, and export machine-readable datasets for analysis. Record tool versions and settings in your methods to support reproducibility.

Effect-size computation and visualization packages

For effect-size computations and narrative-friendly visuals, a modern R stack is often sufficient. Commonly used packages include metafor (effect-size computation and plots), effectsize or esc (to compute standardized effects), and ggplot2 (custom harvest plots or direction-of-effect visuals).

Use robvis to visualize ROB 2 and ROBINS‑I judgments clearly, and consider simple, interpretable charts—like grouped dot plots with medians and IQRs—to implement SWiM-compliant summaries. Keep figure code under version control alongside data.

Reproducible documents, environments, and audit trails

Author the report in R Markdown or Quarto with parameterized documents that regenerate tables and figures from clean data inputs. Snapshot the R environment with renv and manage the pipeline with make-like tools or targets to ensure one-click rebuilds.

Version everything (searches, screening exports, data, code, figures) with git, and tag releases at protocol registration, analysis freeze, and submission. This level of reproducibility strengthens trustworthiness and supports living updates.

Quantitative presentation options without pooling

Quantitative summaries in BES should illuminate, not obscure. Favor displays and statistics that respect heterogeneity while conveying direction and magnitude at a glance.

Useful options include medians and IQRs of effect sizes within design/ROB tiers; harvest plots that encode direction, magnitude, and weight; dot plots grouped by context with study-level weights indicated; and narrative effect-size ranges anchored by high-quality studies. Always explain grouping rationale, chosen statistics, and how ROB informed the visual emphasis to align with SWiM guidance.
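Medians and IQRs by tier are straightforward to compute from harmonized effects. A sketch with hypothetical values, using Python’s statistics.quantiles with its default quartile method:

```python
from statistics import quantiles

# Hypothetical harmonized effects (SMD) grouped by design/ROB tier
tiers = {
    "low-ROB RCTs": [-0.50, -0.42, -0.35, -0.30],
    "high-ROB observational": [-1.10, -0.80, -0.20, 0.10],
}

summaries = {}
for tier, effects in tiers.items():
    q1, q2, q3 = quantiles(effects, n=4)   # quartiles; q2 is the median
    summaries[tier] = {"n": len(effects), "median": q2, "iqr": (q1, q3)}
```

These per-tier summaries map directly onto a grouped dot plot or harvest plot, with the wider spread in the high-ROB tier conveying the extra uncertainty visually.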

Case studies across fields: when BES changes the recommendation

In public health interventions, school-based nutrition programs often show larger effects in uncontrolled before–after studies than in cluster RCTs. A BES that weighted low-ROB cluster RCTs more heavily concluded modest, context-dependent benefits. This led to targeted rather than universal implementation.

In mental health, digital CBT apps display substantial variability across trials and observational deployments. By stratifying by adherence support and ROB, a BES found consistent small-to-moderate effects only in trials with structured onboarding. This steered payers toward programs with defined support models.

In health services, bundled-payment models yield mixed effects on costs and quality across quasi-experiments. A BES that down-weighted studies with high selection bias and emphasized difference-in-differences designs reported cost reductions without harm primarily in systems with mature data infrastructure. That guided staged adoption.

In environmental epidemiology, air filtration interventions for wildfire smoke exposure vary in design and outcome measures. BES harmonized metrics to SMD for lung function and grouped by exposure measurement quality. It found reliable benefits only in studies with direct particulate monitoring, informing procurement standards.

These cases illustrate the central BES advantage. Conclusions are grounded in the best, most applicable evidence rather than an average across incomparable contexts.

From evidence to guidance: applying GRADE and working with guideline panels

BES outputs are compatible with GRADE, which rates certainty across risk of bias, inconsistency, indirectness, imprecision, and publication bias. Even without a pooled effect, you can summarize effect direction and typical magnitude within the highest-quality stratum and then apply downgrades or upgrades as appropriate.

Work with panels to agree on which stratum anchors GRADE judgments (e.g., low-ROB RCTs). Document downgrades for inconsistency when effect directions vary meaningfully across contexts, and for imprecision when confidence intervals (or ranges) cross decision thresholds. Use GRADE Working Group guidance to produce evidence profiles and summary-of-findings tables that transparently reflect BES judgments and uncertainties.
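The GRADE arithmetic implied here (start high for randomized evidence or low for non-randomized evidence, then move down or up the four-level scale) can be sketched as a small helper; the function name and interface are illustrative, and real GRADE judgments involve panel deliberation, not just counting:

```python
LEVELS = ["very low", "low", "moderate", "high"]

def grade_certainty(start: str, downgrades: int, upgrades: int = 0) -> str:
    """Illustrative GRADE arithmetic on the four-level certainty scale."""
    idx = LEVELS.index(start) - downgrades + upgrades
    return LEVELS[max(0, min(idx, len(LEVELS) - 1))]

# Low-ROB RCT stratum, downgraded once for inconsistency across contexts
certainty = grade_certainty("high", downgrades=1)
```

Anchoring the starting level to the agreed stratum (e.g., low-ROB RCTs) and logging each downgrade with its rationale keeps the evidence profile reproducible.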

Resource, time, and cost considerations

BES projects are often comparable in scope to meta-analyses, with added effort in harmonization and weighting. Plan resources around team roles, calibration timelines, and the analysis and visualization workload.

Typical staffing includes a methods lead, information specialist, two screeners/extractors, and at least one content expert. Timelines hinge on volume and complexity. Expect 4–8 weeks for searching and screening, 6–10 weeks for extraction and ROB, and 4–8 weeks for synthesis, visualization, and write-up for a mid-sized corpus (e.g., 60–120 studies). Budget for tool licenses if needed and allocate time for panel engagement when guideline translation is anticipated.

Protocol registration, open-science practices, and living updates

Registration and open practices protect against bias and enable reuse. Register on PROSPERO when eligible and timestamp your protocol publicly even when not. Then share de-identified data and code upon publication where permitted.

For living BES, define surveillance intervals (e.g., quarterly database alerts), triage rules for whether new studies change conclusions, and a lightweight update pipeline that refreshes figures and narratives. Version and date every update, summarize what changed and why, and maintain a clear changelog so users can trust the continuity of your findings.