Introduction
The ongoing pandemic caused by SARS-CoV-2 has seen the progressive emergence of different virus variants. The Centers for Disease Control and Prevention (CDC) has classified existing SARS-CoV-2 lineages into neutral variants, variants of interest (VOI) and variants of concern (VOC).1
VOIs are variants with specific genetic markers that have been associated with changes in receptor binding, reduced neutralisation by antibodies, reduced efficacy of treatments, potential diagnostic impact, or a predicted increase in transmissibility or disease severity. VOCs, on the other hand, are variants that, in addition to the possible attributes of a VOI, show an impact on diagnostics, treatments or vaccines, interference with diagnostic test targets, substantially decreased susceptibility to therapies and neutralisation by antibodies, reduced vaccine-induced protection from severe disease, or increased transmissibility or disease severity. A fourth classification, variants of high impact, is dedicated to variants more dangerous than VOCs, but none of the existing variants has been classified as such so far. Examples of VOCs are Alpha, Beta, Delta and Omicron.
Virus variants are classified after being isolated, and after their characteristics (for example, enhanced transmissibility) have emerged in a public health context. For this reason, countermeasures are always implemented after a variant is known; that is, the virus always has the upper hand in the arms race. Consequently, recognising a VOI or VOC as early as possible is crucial to curb its damage and, ultimately, to save lives.
Virus protein sequences collected around the world are continuously deposited in the GISAID database, which was created in 2008 to promote influenza data sharing.2 GISAID is an example of informatics infrastructure implemented before the COVID-19 pandemic, and it has been of great importance in managing and monitoring the emergence of COVID-19 over the last 2 years.
Along with suitable strategies to collect and store the data, machine learning (ML) techniques have been extensively applied to analyse COVID-19 data.
Few studies focus on variant-related predictions, for example, isolating critical amino acid (AA) positions (or patterns) in the spike protein,3 or forecasting potential waves of novel variants.4 Importantly, these studies need input genomes that have already been isolated; that is, they do not provide a viable method to generate novel genomes that could carry unknown but potentially dangerous variants. Of note, the Pango lineages framework has a specific ML module (PangoLearn).5 This module implements two simple ML approaches, decision trees and logistic regression, to classify unknown viral genomes into Pango lineages. The models are based on positional (alignment-dependent) features and can only predict known lineages.
To detect VOCs from their competition with other variants, Zhao et al developed VOC-alarm, a statistical method based on the concept of mutational entropy.6 The authors defined the mutational entropy of a variant as a measure of the change in the number of mutations across the globe for a lineage in a specific time period. In their analysis, Zhao et al noticed that some VOCs, such as Alpha, Delta and Omicron, grew from a small population and that, as VOCs emerge, competing variants in preceding lineages decrease in population size.6 The concept of spreading mutations within a time window was also studied by Maher et al.7 Makowski et al proposed a combined methodology to evaluate single mutations in the spike protein, based on two ML models: one predicting the impact of receptor binding domain mutations on ACE2 affinity and the other predicting human serum antibody affinity.8
Unlike the aforementioned approaches, here we propose an ML method to predict variants of concern in a timely manner, as they are sequenced, without relying on information that needs to be collected over a period of time, such as changes in population size. That is, we develop an algorithm that predicts whether each variant is an ‘anomaly’ or not, using only the spike protein sequence, and ideally before the variant spreads enough to manifest its related phenotypes, in other words ahead of its official classification. In recent work, we simulated the implementation of a pandemic surveillance classifier that predicts new non-neutral variants (VOCs and VOIs) monthly. That system simulates a monthly update of a binary classifier with the newly detected variants, using supervised incremental learning.9 Incremental learning algorithms are able to incorporate new knowledge without a complete retraining of model parameters.10 For this reason, they can help in evolving situations, such as a pandemic. Yet, our incremental learning system assumes that the ground-truth class (neutral or non-neutral) of each variant is available at the end of the month. In practice, this assumption does not always hold: for instance, the first Alpha sequence later labelled as VOC was deposited in GISAID in late July 2020, while the Alpha variant was officially recognised as a VOC by the CDC only in late December of the same year.1
Here, we simulate the implementation of a pandemic surveillance classifier based on anomaly detection. Viruses continuously replicate, and during replication new types of variants that differ from the underlying population can arise. Detected anomalies can be new non-neutral variants. Briefly, we assume that we are in a peak state (in the space of spike protein sequences) when a specific variant is dominating the landscape, and that the arrival of a new variant can be an anomaly that changes the state. Details of our proposed methodology can be found in the ‘Methods’ section. We then evaluate the performance of our approach by comparing the date on which our classifier predicts a known VOC/VOI as an anomaly with the date of its designation as VOC by WHO, as reported by the CDC. By classifying new virus sequences as they are collected over time, the proposed approach can raise a flag before variants are officially recognised as VOC/VOI by authorities.
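To make the idea of flagging a sequence as an anomaly relative to a dominant population concrete, the following is a minimal illustrative sketch in Python. It is not the method described in the ‘Methods’ section: the k-mer frequency representation, the centroid-distance rule, the threshold choice, the function names and the short toy strings (which are not real spike sequences) are all assumptions made purely for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k=3):
    """Return the relative frequency of each k-mer in an amino acid string."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def distance(p, q):
    """Euclidean distance between two sparse k-mer profiles."""
    keys = sorted(set(p) | set(q))
    return sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in keys))


def fit_baseline(seqs, k=3):
    """Summarise the dominant population (hypothetical rule):
    centroid profile plus the largest distance of any member to it."""
    profiles = [kmer_profile(s, k) for s in seqs]
    centroid = Counter()
    for p in profiles:
        for kmer, f in p.items():
            centroid[kmer] += f / len(profiles)
    centroid = dict(centroid)
    threshold = max(distance(p, centroid) for p in profiles)
    return centroid, threshold


def is_anomaly(seq, centroid, threshold, k=3):
    """Flag a sequence that lies farther from the centroid than any baseline member."""
    return distance(kmer_profile(seq, k), centroid) > threshold


# Toy strings standing in for the currently dominant population (assumption).
baseline = [
    "MFVFLVLLPLVSSQCVNLT",
    "MFVFLVLLPLVSSQCVNLT",
    "MFVFLVLLPLVSSQCVNFT",  # one substitution
]
centroid, threshold = fit_baseline(baseline)

# A toy candidate carrying several substitutions.
candidate = "MFVFLVKKPLVSSRRVNLT"
print(is_anomaly(baseline[0], centroid, threshold))
print(is_anomaly(candidate, centroid, threshold))
```

In this sketch the threshold is simply the largest distance observed within the baseline, so a sequence is flagged only when it is more unusual than anything already seen. A real surveillance setting would need a more robust model and a threshold calibrated on real data.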