Quality of Scientific Evidence

When scientific results are reported, a number of mechanisms are used to convey the quality of the resulting evidence.

TL;DR

Uncertainties associated with measurements (or measurement-derived data) are described by error bars. Confounding associated with randomness is mitigated by the use of significance (or hypothesis) testing. Various other possible confounding variables are eliminated by experimental design (studies, RCTs, etc.) and there is a hierarchy of evidence quality associated with these designs. Meta-studies are used to aggregate many existing studies into one piece of composite evidence. Mathematical models are often used as well, and there are both measurement uncertainties and structural uncertainties that must be assessed with them.

Error bars

When the result is a direct measurement (or an average of such measurements, or another datum derived from measurements), standard practice is to report an “error bar” – that is, some indication of the uncertainty in the measurement (or the derived datum). There is a variety of error bar “types” – one standard deviation, one standard error (if the measurement is a mean), a 95% confidence interval, etc. The particular error bar chosen should be described clearly, as these are different from each other, though they all serve essentially the same function, that of approximately describing the probability distribution associated with the measurement (or datum).
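
To make the distinction between these error-bar types concrete, here is a minimal Python sketch (the measurement values are invented purely for illustration) that computes one standard deviation, one standard error of the mean, and a 95% confidence interval for the same set of repeated measurements.

```python
import numpy as np
from scipy import stats

measurements = np.array([9.8, 10.1, 9.9, 10.3, 10.0, 9.7, 10.2])   # invented repeated measurements

mean = measurements.mean()
sd = measurements.std(ddof=1)                  # one standard deviation (spread of the data)
sem = sd / np.sqrt(len(measurements))          # one standard error (uncertainty of the mean)

# 95% confidence interval for the mean, using Student's t distribution.
t_crit = stats.t.ppf(0.975, df=len(measurements) - 1)
ci_low, ci_high = mean - t_crit * sem, mean + t_crit * sem

print(f"mean = {mean:.3f}")
print(f"±1 SD error bar:  ±{sd:.3f}")
print(f"±1 SEM error bar: ±{sem:.3f}")
print(f"95% CI error bar: [{ci_low:.3f}, {ci_high:.3f}]")
```

The three error bars differ in width and in meaning, which is why the type chosen needs to be stated explicitly.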

Statistical significance (hypothesis testing)

Another idea used in many scientific results (especially clinical results) is statistical hypothesis testing, in which a “null hypothesis” (essentially an assumed population distribution) and confidence level are set up in advance, and various test statistics are computed from experimental measurements on a sample. The fundamental notion is to estimate the probability of obtaining the measured results by chance if the null hypothesis is true. A sufficiently low probability leads one to reject the null hypothesis (thus providing evidence for some alternative hypothesis described in the paper). There are a number of approaches to this – one may, for example, alternatively use “decision theory” with two competing hypotheses, choosing the one with the higher probability associated with the observations. Both of these approaches deal with the notion of “statistical significance.”
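
As a concrete (and entirely illustrative) sketch of the basic procedure, the following Python snippet runs a two-sample t-test on simulated control and treatment samples; the null hypothesis is that the two populations share a mean, and the 0.05 significance threshold is chosen in advance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=50)     # simulated control sample
treatment = rng.normal(loc=11.0, scale=2.0, size=50)   # simulated treatment sample

alpha = 0.05                                           # significance level chosen in advance
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis (a 'statistically significant' difference).")
else:
    print("Fail to reject the null hypothesis.")
```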

All of these, though, are susceptible to various experimental design flaws, both intentional (like p-hacking) and inadvertent (like stopping-rule dependence), as well as to misunderstanding. For example, the probability p that is calculated is not the probability that the null hypothesis is true given the measurements, but rather the probability of obtaining the measurements given that the null hypothesis is true; these can be quite different (see here for more discussion). There is also a huge literature criticizing all of these approaches.
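
The following simulation (my own illustration; the 80% base rate of true nulls, the effect size, and the sample size are arbitrary assumptions) shows how different these two probabilities can be: even among results that clear the p < 0.05 bar, the fraction for which the null hypothesis was actually true can be far larger than 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n, effect, frac_null = 10_000, 25, 0.5, 0.8   # all assumed for illustration

null_true = rng.random(n_experiments) < frac_null   # which experiments have no real effect
p_values = np.empty(n_experiments)
for i in range(n_experiments):
    shift = 0.0 if null_true[i] else effect
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(shift, 1.0, n)
    p_values[i] = stats.ttest_ind(treatment, control).pvalue

significant = p_values < 0.05
print("fraction of 'significant' results where the null was actually true:",
      round(null_true[significant].mean(), 2))   # typically around 0.3 here, not 0.05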

While there is no perfect solution here, all of these are solid attempts to remove the confounding associated with randomness (usually, random sampling); randomness itself, meanwhile, is used to avoid other types of confounding (like mistaking correlation for causation), as we will see next.

Experimental design

There is far more to this topic than can be covered here, but suffice it to say that there are other sources of error than either measurement error (direct or propagated) or randomness. The way to deal with these other error sources (or confounders) is to carefully design the experiment. Here is a (very rough) classification of the common types of experimental design, ranked by the quality of evidence produced (from weakest to strongest). Note, however, that there is considerable variability even within categories, so that, for example, a particularly well-designed, large-sample retrospective study could produce stronger evidence than a poorly-designed, small-sample prospective study. I note also that most of this principally applies to clinical research (and often, just to clinical research with human subjects).

  • Retrospective study: data is gleaned from existing records (perhaps clinical treatment records) of various treatments, demographics and outcomes. Correlations (univariate or multivariate) are examined and reported in a variety of ways.
  • Prospective study: demographic variables (questions) are prepared in advance for clinical use. Following the run of the study in clinical practice, the same processes as with retrospective studies are used – treatments, demographics and outcomes are examined for correlations and reported in a variety of ways.
  • Randomized controlled trial (RCT): a trial is set up, in which subjects are randomly assigned to “control” and “treatment” groups (perhaps multiple treatment groups). Again, pre-written demographic variables are used (possibly before randomization to produce a stratified random sample), and treatments, demographics and outcomes are examined for correlations. A minimal randomization sketch appears after this list.
  • Blind randomized controlled trial: same as the RCT, except that the subjects are not aware of their control/treatment status.
  • Double blind RCT: same as blind RCT, except that the researchers conducting the treatment are also unaware of the control/treatment status of the subjects.
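
As a concrete illustration of the randomization step (the subject IDs and the single stratification variable are invented, and real trials use more careful allocation schemes), here is a minimal sketch of stratified random assignment to control and treatment groups.

```python
import random
from collections import defaultdict

random.seed(42)

# Hypothetical subjects with a single pre-specified stratification variable.
subjects = [("S01", "under_40"), ("S02", "under_40"), ("S03", "40_plus"),
            ("S04", "40_plus"), ("S05", "under_40"), ("S06", "40_plus")]

strata = defaultdict(list)
for subject_id, age_group in subjects:
    strata[age_group].append(subject_id)

assignment = {}
for age_group, ids in strata.items():
    random.shuffle(ids)                 # randomize within the stratum
    half = len(ids) // 2                # toy scheme: the extra subject (if any) goes to control
    for sid in ids[:half]:
        assignment[sid] = "treatment"
    for sid in ids[half:]:
        assignment[sid] = "control"

print(assignment)
```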

Successive advantages – Retrospective studies are subject to the “Texas sharpshooter” fallacy, in which the “target” can shift subconsciously based on the appearance of the data (prospective studies deal with this by setting the target beforehand). Prospective studies are still subject to the “spurious correlation” (or “correlation does not imply causation”) fallacy (see here for many amusing examples), in which correlations (even strong ones) have nothing to do with causation, in which causation runs in the reverse direction, or in which both variables are effects of a common cause. RCTs, strong as they are, are still subject to the placebo effect. And even blind RCTs are subject to (a human version of) the Clever Hans effect.

Here is a (rough) table that I use for assessing the quality of clinical data:

Type                    Quality        Advantage
Retrospective studies   poor           none
Prospective studies     weak           immune to Texas Sharpshooter
RCT                     moderate       immune to spurious correlation
Blind RCT               strong         immune to placebo
Double Blind RCT        very strong    immune to Clever Hans

Experimental Evidence Categories

My quality assessments here shouldn’t really be regarded as pejorative – sometimes retrospective studies are all that are possible for time/cost reasons; sometimes RCTs can’t be done at all for ethical reasons (there are no RCTs for smoking or second-hand smoke); sometimes RCTs can’t be blinded for obvious practical reasons (chiropractic treatment and physical therapy); and in any case the cost goes up markedly with each increase in evidentiary quality. This is simply a rough method for assessing the quality of experimental (mostly clinical) evidence and for comparing the relative strength of two contradictory papers. Within each category, evidence quality roughly increases with increasing sample size (though there are other, more subtle, structural choices that can affect quality as well).

This is only a brief introduction to the subject – prospective studies, for example, may use a cohort, cross-sectional, or case-control design – a good overall introduction to all this (in the clinical world) is here. I should also point out that the field of Evidence-Based Medicine has a standard hierarchy of evidence. It is similar to mine, but slightly different in its details, and it includes some even lower-quality evidence, like expert opinion, that I tend to discount completely from a scientific perspective – see, for example, Table 1 here.

Meta-studies

The use of meta-analysis has become quite popular in recent years, as computational techniques for dealing with large datasets have improved sharply. In this approach, a number of separately-run scientific studies are statistically aggregated into one “meta-study.” This approach (at least potentially) sharply increases the power of the available scientific literature at fairly low cost.
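
The paragraph above does not fix a particular aggregation method; as one common possibility, here is a minimal sketch of fixed-effect, inverse-variance pooling of three hypothetical study results (the effect sizes and standard errors are invented).

```python
import numpy as np

# (effect estimate, standard error) from three hypothetical studies.
studies = [(0.30, 0.15), (0.10, 0.10), (0.25, 0.20)]

effects = np.array([e for e, _ in studies])
weights = np.array([1.0 / se**2 for _, se in studies])   # inverse-variance weights

pooled_effect = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

print(f"pooled effect = {pooled_effect:.3f}, 95% CI ± {1.96 * pooled_se:.3f}")
```

More precise studies (smaller standard errors) dominate the pooled estimate, which is the sense in which aggregation can behave like a much larger sample.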

However, a few caveats should be kept in mind:

  • In a categorical sense, the resulting data is no stronger than the weakest data aggregated (though it can correspond to a much higher sample size).
  • There are many ways a meta-study can be done poorly (and many of these are described in the linked article). I will only emphasize here that the method for choosing which studies to include in a meta-study is of critical importance to the quality of the resulting meta-study evidence.
  • There is also a problem of the inherent comparability of the studies being aggregated. Inappropriate aggregation can lead to all sorts of statistical problems, like Simpson’s paradox (a numeric sketch follows this list).
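
Here is the promised numeric sketch of Simpson’s paradox (the counts are invented to show the classic pattern): the treatment has the higher success rate within each study, yet the lower rate once the two studies are naively pooled.

```python
# (successes, total) for each arm of two hypothetical studies.
studies = {
    "study 1": {"treatment": (81, 87),   "control": (234, 270)},
    "study 2": {"treatment": (192, 263), "control": (55, 80)},
}

totals = {"treatment": [0, 0], "control": [0, 0]}
for name, arms in studies.items():
    for arm, (successes, n) in arms.items():
        totals[arm][0] += successes
        totals[arm][1] += n
        print(f"{name} {arm}: {successes}/{n} = {successes / n:.1%}")

# Within each study the treatment arm wins, but the naive pooled comparison
# reverses, because the two studies mix easy and hard cases very differently.
for arm, (successes, n) in totals.items():
    print(f"pooled {arm}: {successes}/{n} = {successes / n:.1%}")
```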

Ultimately, meta-studies are a useful tool when used carefully, but a bit of caution is required in assessing their results.

Mathematical Models

I’ll try not to go overboard with this – it’s rather a pet peeve of mine, since I spent many happy semesters teaching the subject to undergraduates. I will also mention that I am discussing discrete-time models here – continuous-time models do exist, but seem to rarely come up in the context of “scientism.”

The evidentiary quality of the output of a mathematical model has two components:

  • the uncertainty of the input data, as propagated through the model computations (there are standard “error bar” ways of assessing and reporting this; a minimal propagation sketch follows this list)
  • the applicability of the structure of the model itself to reality (this requires substantial work to assess)
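
As a minimal sketch of the first component (the toy model, its inputs, and their assumed uncertainties are all invented for illustration), input error bars can be propagated through a model by Monte Carlo sampling:

```python
import numpy as np

def model(growth_rate, initial_population, years=10):
    """Toy stand-in model: exponential growth over a fixed horizon."""
    return initial_population * (1.0 + growth_rate) ** years

rng = np.random.default_rng(0)
n_samples = 100_000

# Input "error bars" expressed as probability distributions around measured values.
growth_rate = rng.normal(0.03, 0.01, n_samples)     # 3% ± 1% per year (assumed)
initial_pop = rng.normal(1000.0, 50.0, n_samples)   # 1000 ± 50 individuals (assumed)

outputs = model(growth_rate, initial_pop)
low, high = np.percentile(outputs, [2.5, 97.5])
print(f"model output: mean {outputs.mean():.0f}, 95% interval [{low:.0f}, {high:.0f}]")
```

Note that an interval produced this way says nothing about the second component: whether the model structure itself resembles reality.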

Essentially, the model process proceeds by setting a “model structure” that is believed to have some relationship to a situation in reality. Each model has one or more input variables (usually in time series form) and one or more output variables (also in time series form). The output variables may also feed back in a recursive fashion as input variables into the “next” time step. Normally, the structure has several unknown coefficients, which are chosen by “tuning” the model to existing data.

For example, if the model is intended to predict the populations of two species on an island, a 2-step linear recurrence model may be chosen, in which the population of each species next year is expected to be a linear combination of the previous two years of population (of both species). This would give rise to a set of 8 coefficients (usually arranged as a 2-by-4 matrix). Then, 6 years of population data (for each species) would be sufficient to solve for these unknown coefficients (tuning the model to the last 4 years of this 6-year period). At that point, the model could be run as far into the future as desired. If the model is run starting at year 3 of the 6-year tuning window, it will be exact for 4 years (that was the point of the tuning). However, after that point, it may be accurate, or it may be complete gibberish, depending on how much those 2 populations actually depend only on each other, and in a linear fashion (and not on weather, a third species’ population, disease, etc.). Even if the populations do depend only on each other linearly, the model may be useless if, for example, one of the species has a 5-year age of sexual maturity (so that 2 years of population history wouldn’t be enough).
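
Here is a minimal sketch of that tuning step (the population numbers are invented): the 8 coefficients are obtained by solving the 8 linear equations supplied by the last 4 years of the 6-year window, after which the recurrence can be iterated as far forward as one likes.

```python
import numpy as np

# Six years of (species_1, species_2) populations, purely illustrative.
data = np.array([
    [100.0, 40.0],
    [110.0, 38.0],
    [118.0, 45.0],
    [125.0, 41.0],
    [130.0, 49.0],
    [133.0, 46.0],
])

# Tuning equations: each of years 3..6 is predicted from the two prior years,
# giving 8 equations for the 8 coefficients of the 2-by-4 matrix A in
#   p(t) = A @ [p(t-1), p(t-2)].
X = np.array([np.concatenate([data[t - 1], data[t - 2]]) for t in range(2, 6)])  # 4x4
Y = data[2:6]                                                                    # 4x2
A = np.linalg.lstsq(X, Y, rcond=None)[0].T    # exactly determined fit

def step(prev, prev2):
    """One year of the recurrence: next populations from the two prior years."""
    return A @ np.concatenate([prev, prev2])

# The tuned model reproduces years 3..6 exactly; run it 5 years past the data.
# Whether this extrapolation is sensible depends entirely on the structural
# assumption (linear, 2-year memory, no outside influences).
history = list(data)
for _ in range(5):
    history.append(step(history[-1], history[-2]))
print(np.round(np.array(history), 1))
```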

How can one assess the model structure? The answer is fairly simple, though tedious (and a lot of work). First, construct a simple-minded “null model” – in our 2-species population example, one might look at the average annual growth rate of each population over a century, and model each population as a fixed percentage increase over the previous year (exponential growth, no species interaction). Choose (randomly) a number of points in the past (say, 8 points, each of which is more than 10 years in the past). For each of these points, re-tune the model using only the data prior to the test point, then run the model for 10 years beyond that point. Compare each of these 8 ten-year time series of model output to both the “null model” prediction and the actual data (using various “goodness of fit” measures). Does your model accurately reproduce the actual data? Does it do better or worse than the “null model” started at the same 8 points?
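
A minimal sketch of this hindcast procedure follows (the population series is synthetic, and a simple one-species, 2-year recurrence stands in for the model under test): it re-tunes at several random past start points, runs 10 years forward, and compares the RMSE of the model and of the fixed-growth null model against the actual data.

```python
import numpy as np

rng = np.random.default_rng(0)
years, horizon = 60, 10

# Invented "actual" data: roughly 2% annual growth plus noise.
pop = 1000.0 * (1.02 ** np.arange(years)) * (1.0 + rng.normal(0.0, 0.03, years))

def tune_recurrence(series):
    """Least-squares fit of the stand-in model p(t) = a*p(t-1) + b*p(t-2)."""
    X = np.column_stack([series[1:-1], series[:-2]])
    return np.linalg.lstsq(X, series[2:], rcond=None)[0]

def run_recurrence(coeffs, prev2, prev, steps):
    """Iterate the tuned recurrence forward from the last two known years."""
    out = []
    for _ in range(steps):
        nxt = coeffs[0] * prev + coeffs[1] * prev2
        out.append(nxt)
        prev2, prev = prev, nxt
    return np.array(out)

def rmse(forecast, actual):
    """One simple goodness-of-fit measure."""
    return np.sqrt(np.mean((forecast - actual) ** 2))

# Null model: a fixed average growth rate estimated from the whole record.
null_rate = (pop[-1] / pop[0]) ** (1.0 / (years - 1)) - 1.0

# Randomly chosen past test points, each at least `horizon` years before the end.
start_points = rng.choice(np.arange(20, years - horizon), size=8, replace=False)

for start in sorted(map(int, start_points)):
    coeffs = tune_recurrence(pop[:start])          # re-tune on data before the test point
    model_fc = run_recurrence(coeffs, pop[start - 2], pop[start - 1], horizon)
    null_fc = pop[start - 1] * (1.0 + null_rate) ** np.arange(1, horizon + 1)
    actual = pop[start:start + horizon]
    print(f"start year {start:2d}: model RMSE = {rmse(model_fc, actual):7.1f}, "
          f"null RMSE = {rmse(null_fc, actual):7.1f}")
```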

Another test: re-tune your model to successive years in the data (beginning, say, 30 years ago). Look at the time series generated by each coefficient in your model as it is re-tuned over time. Are these coefficients fairly constant? If not, your model is substantially time-dependent (likely because an important input variable is missing in your structure), and stands little chance of accurately predicting the future.
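
A minimal sketch of this stability check (again with a synthetic series and the same stand-in 2-year recurrence) re-tunes the model to data ending at successive years and summarizes how much the fitted coefficients drift.

```python
import numpy as np

rng = np.random.default_rng(2)
years = 60
pop = 1000.0 * (1.02 ** np.arange(years)) * (1.0 + rng.normal(0.0, 0.03, years))  # invented data

def tune_recurrence(series):
    """Least-squares fit of the stand-in model p(t) = a*p(t-1) + b*p(t-2)."""
    X = np.column_stack([series[1:-1], series[:-2]])
    return np.linalg.lstsq(X, series[2:], rcond=None)[0]

# Re-tune to data ending at each of the last 30 years and track the coefficients.
coeff_history = np.array([tune_recurrence(pop[:end]) for end in range(years - 30, years + 1)])

print("coefficient means:   ", np.round(coeff_history.mean(axis=0), 3))
print("coefficient std devs:", np.round(coeff_history.std(axis=0), 3))
# Standard deviations that are large relative to the means suggest a
# time-dependent (structurally incomplete) model.
```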

Scientific papers relying on mathematical models typically do a good job of reporting their error bars, but one should understand that those are only a reflection of input uncertainties, propagated through the model calculations. They rarely (in my experience, at least) report any sort of structural model assessment of the sort I have just described. This is not to say that they didn’t do it “behind the curtain” – only that it’s not reported. If scientists are not going to do the hard work of assessing model structure (and report the assessment results), then they should at least make their model source code available so it can be done by others.

It is also worth mentioning that mathematical models are often run for a variety of “scenarios” – that is, assumptions about the future of the “pure input” variable time series (the inputs that are not also outputs and hence part of the model feedback). These scenarios are also a source of significant model uncertainty – they are normally reported fairly carefully, but the uncertainties there are truly unknown: “it’s tough to make predictions, especially about the future.”