## Friday, April 3, 2020

### New Study Claims SARS-CoV-2 Coronavirus Originated Naturalistically

#### A Peer Review Failure

A recent study by an international collaboration of five scientists, published in the scientific journal Nature Medicine, has concluded that the SARS-CoV-2 coronavirus (also known as HCoV-19), responsible for the current global pandemic, originated naturalistically rather than via laboratory manipulation. Here I review their methods and compare their research with the claims made.

To investigate the origins of the SARS-CoV-2 coronavirus, the team used a Bayesian approach. Bayes’ Theorem, named after the eighteenth-century Reverend Thomas Bayes, computes the probability of an event given some new evidence. While students will recall using Bayes’ Theorem to solve problems involving urns and colored balls, in fact it has a wide range of practical, real-world applications.

Bayes’ Theorem has also been used to evaluate hypotheses and, in particular, theories of origins. Here, the use of Bayes’ Theorem is a bit trickier. For example, one needs to know the prior probability of the hypothesis (that is, the probability of the hypothesis before the new evidence was obtained). One also needs to know the probability of the new evidence. And finally, one needs to know the conditional probability of the new evidence, given that the hypothesis is true.

Given these three quantities, Bayes’ Theorem can then be used to compute the conditional probability of the hypothesis, given the new evidence. This may seem straightforward, and the calculation certainly is, but the required probabilities are anything but. For example, imagine using Bayes’ Theorem to compute the probability that the theory of evolution is true. Where would you find the prior probability of evolution? Or again, say your evidence is some new fossil finding. Where would you find the probability of that fossil, let alone the conditional probability of the fossil, given evolution?
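To make the arithmetic concrete, here is a minimal sketch of the calculation in Python. The function name `posterior` and the numbers fed to it are my own illustrative assumptions; they come from no study. The point is that the formula is one line, while every input has to be supplied from somewhere.

```python
def posterior(prior: float, likelihood: float, evidence: float) -> float:
    """Bayes' Theorem: P(H|E) = P(E|H) * P(H) / P(E).

    prior      -- P(H), probability of the hypothesis before the evidence
    likelihood -- P(E|H), probability of the evidence if the hypothesis is true
    evidence   -- P(E), overall probability of the evidence
    """
    for name, p in (("prior", prior), ("likelihood", likelihood), ("evidence", evidence)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1]")
    if evidence == 0.0:
        raise ValueError("evidence probability must be nonzero")
    result = likelihood * prior / evidence
    # P(E) can never be smaller than P(E|H) * P(H); reject incoherent inputs.
    if result > 1.0 + 1e-12:
        raise ValueError("inputs are incoherent: P(E) < P(E|H) * P(H)")
    return result


# Made-up numbers, purely for illustration.
p = posterior(prior=0.01, likelihood=0.9, evidence=0.05)
print(f"P(H|E) = {p:.2f}")
```

Change any one of the three inputs and the answer moves with it, which is exactly why the source of those inputs matters so much.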

Bayesian approaches are incredibly useful in practical problems in science, engineering, operations, manufacturing, and so forth. But in origins studies, this simple, straightforward approach is suddenly more difficult and fraught with danger. One can easily produce unjustified or otherwise erroneous results.

Various strategies attempt to remedy this problem of using Bayes’ Theorem in theory evaluation and origins studies. One strategy uses a Bayesian approach to compare two opposing theories. In other words, instead of simply evaluating the probability a theory is true given some new evidence, the theory is evaluated given the new evidence and the opposing theory as well. In this approach, the two opposing theories are compared to each other, and fewer probabilities are required at the start.

The cost of this simplification is the assumption that the two opposing theories are complementary. That is, one must be true, and the other must be false. There can be no other possibilities. This is often less straightforward than it seems. For example, imagine using this Bayesian approach to compare the theories of evolution and creation. Surely they are complementary, right? After all, either Darwin was right, or otherwise God created the species.
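A short Python sketch shows where the complementarity assumption enters the two-theory calculation. The function name `posterior_complementary` and the example numbers are hypothetical, chosen only to illustrate the mechanics. Note that the probability of the evidence no longer has to be supplied; it is built from the two likelihoods, but only because the second prior is forced to be one minus the first.

```python
def posterior_complementary(prior_a: float, likelihood_a: float, likelihood_b: float) -> float:
    """P(A|E) for two theories A and B assumed to be complementary.

    prior_a      -- P(A)
    likelihood_a -- P(E|A)
    likelihood_b -- P(E|B)
    """
    for name, p in (("prior_a", prior_a), ("likelihood_a", likelihood_a),
                    ("likelihood_b", likelihood_b)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1]")
    # The complementarity assumption: exactly one of A or B is true.
    prior_b = 1.0 - prior_a
    # Total probability of the evidence, valid only under that assumption.
    evidence = likelihood_a * prior_a + likelihood_b * prior_b
    if evidence == 0.0:
        raise ValueError("the evidence is impossible under both theories")
    return likelihood_a * prior_a / evidence


# Made-up numbers, purely for illustration.
p_a = posterior_complementary(prior_a=0.5, likelihood_a=0.8, likelihood_b=0.2)
print(f"P(A|E) = {p_a:.2f}")
```

Everything the result says about theory A is relative to theory B. If B is not the only alternative, the number does not mean what it appears to mean.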

Such naiveté is, unfortunately, too common. Hybrid theories, involving some combination of evolution and creation, could be physically possible. Or some other theory, that we haven’t even thought of yet, could be possible. In fact, an unfortunate but very real problem is the use of Bayesian approaches to rig the calculation in one’s favor. The probabilities can be arbitrarily adjusted, and a strawman can be used for the opposing theory.
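The sensitivity is easy to demonstrate. The sketch below, again in Python with invented labels and invented numbers, normalizes over however many theories are put on the table. The helper `posteriors` is hypothetical. It compares a two-theory contest, the same contest with a strawman opponent, and the same contest once a third theory is admitted.

```python
def posteriors(priors: dict, likelihoods: dict) -> dict:
    """Posterior for each theory, assuming the listed theories are exhaustive.

    priors      -- maps theory label to P(theory); must sum to 1
    likelihoods -- maps theory label to P(E|theory)
    """
    if set(priors) != set(likelihoods):
        raise ValueError("priors and likelihoods must cover the same theories")
    if abs(sum(priors.values()) - 1.0) > 1e-9:
        raise ValueError("priors must sum to 1")
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError("the evidence is impossible under every listed theory")
    return {h: j / total for h, j in joint.items()}


# All numbers below are made up for illustration.
fair = posteriors({"A": 0.5, "B": 0.5}, {"A": 0.8, "B": 0.2})
# A strawman opponent: B is defined so that the evidence is nearly impossible under it.
strawman = posteriors({"A": 0.5, "B": 0.5}, {"A": 0.8, "B": 0.01})
# A third theory C is admitted, and it explains the evidence well.
three_way = posteriors({"A": 0.4, "B": 0.4, "C": 0.2}, {"A": 0.8, "B": 0.2, "C": 0.9})

for label, result in (("fair", fair), ("strawman", strawman), ("three-way", three_way)):
    print(label, {h: round(p, 3) for h, p in result.items()})
```

The evidence and theory A are identical in all three runs. Only the framing of the alternatives changes, and the posterior for A changes with it.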

Thus Bayesian approaches are fraught with various pitfalls, including unjustified prior probabilities, unjustified opposing theories, and unjustified assumptions about complementarity. With these issues in mind I am always interested to see how scientists use Bayesian approaches for theory evaluation. Will they avoid the pitfalls, or fall prey to them?

This brings us to the SARS-CoV-2 study, whose lead author is Kristian G. Andersen, Associate Professor and Director of Infectious Disease Genomics at the Scripps Research Translational Institute. The paper is not very long, and the section on their investigation of the opposing theory of the virus origins is even shorter. Nonetheless their theory evaluation method (i.e., their method for determining the probability of the theory that the virus originated naturalistically) was less than straightforward because nowhere did they provide an overview of their method. In other words, nowhere does the paper mention Bayes’ Theorem or that they use a Bayesian approach. It is implicit.

Well perhaps that can be forgiven. After all, Bayesian approaches are rather obvious once one sees the equations in use.

But here we find another concern: there are no such equations in the paper. Again, it is implicit.

Well again, perhaps it was an innocent oversight, or required redaction due to space limitations. After all, the prior probabilities themselves contain the important information, and given them one can probably reconstruct the approach and equations. As the old joke goes, “it is left up to the student.”

But here again, we find yet another concern. Not only is Bayes nowhere mentioned, the particular Bayesian approach nowhere described, and no equations given, but in fact no probabilities are given anywhere.

In fact, the rationale and explanation for their rather important conclusion is remarkably terse. The paper uses two pieces of evidence to argue against the theory that the virus arose via laboratory manipulation. And their rationale amounts only to a few sparse sentences, which I quote here. First, we have:

> While the analyses above suggest that SARS-CoV-2 may bind human ACE2 with high affinity, computational analyses predict that the interaction is not ideal and that the RBD [receptor-binding domain] sequence is different from those shown in SARS-CoV to be optimal for receptor binding.

The argument boils down to this: the receptor-binding domain (RBD) in the SARS-CoV-2 spike protein binds with high affinity to the human ACE2 receptor, and to ACE2 in other species with high receptor similarity (the paper erroneously refers to such high similarity as “high homology”), but computational analyses fail to predict this and instead predict “that the interaction is not ideal.” Therefore, they reason, the laboratory manipulation hypothesis is less likely because, under that hypothesis, a designer would not have chosen the observed SARS-CoV-2 RBD sequence. Instead, a designer would have selected a sequence with stronger predicted binding.

As you can see, this reasoning is fraught with unjustified assumptions about how a designer would have acted. The second argument is equally weak:

> The second notable feature of SARS-CoV-2 is a polybasic cleavage site (RRAR) at the junction of S1 and S2, the two subunits of the spike. … Polybasic cleavage sites have not been observed in related “lineage B” betacoronaviruses, although other human betacoronaviruses, including HKU1 (lineage A), have those sites and predicted O-linked glycans. … if genetic manipulation had been performed, one of the several reverse-genetic systems available for betacoronaviruses would probably have been used. However, the genetic data irrefutably show that SARS-CoV-2 is not derived from any previously used virus backbone.

The fact that “the genetic data irrefutably show that SARS-CoV-2 is not derived from any previously used virus backbone” means very little. The importance of this evidence is contingent on the assumption that a designer would have used such a “previously used virus backbone.” The authors assert this is probable, but without justification. The assessment of what an agent would do is notoriously difficult and prone to bias.

Neither of the two pieces of evidence is particularly compelling. This weakness is, to a certain extent, reflected in the tentative language used, such as “most likely,” “It is improbable,” and “would probably.”

But when we reach the Conclusions section this tentative language gives way to a new-found confidence: “the evidence shows that SARS-CoV-2 is not a purposefully manipulated virus,” and “we do not believe that any type of laboratory-based scenario is plausible.”

Finally all shadow of doubt is removed in the Abstract: “Our analyses clearly show that SARS-CoV-2 is not a laboratory construct or a purposefully manipulated virus.” This claim is entirely unsubstantiated and without merit.

This wrapping of two weak, subjective arguments with ersatz certainty and authority paves the way for a triumphant press release where Andersen announces, without justification, that “we can firmly determine that SARS-CoV-2 originated through natural processes.”

To summarize, the authors make ever-escalating claims of certainty from a rather modest beginning of two interesting but frankly weak observations. The paper oversteps its bounds and should not have passed peer review.