Authors: Anne Trafton MIT News First Published May 17, 2021
In early 2020, a few months after the Covid-19 pandemic began, scientists were able to sequence the full genome of SARS-CoV-2, the virus that causes the Covid-19 infection. While many of its genes were already known at that point, the full complement of protein-coding genes was unresolved.
Now, after performing an extensive comparative genomics study, MIT researchers have generated what they describe as the most accurate and complete gene annotation of the SARS-CoV-2 genome. In their study, which appears today in Nature Communications, they confirmed several protein-coding genes and found that a few others that had been suggested as genes do not code for any proteins.
“We were able to use this powerful comparative genomics approach for evolutionary signatures to discover the true functional protein-coding content of this enormously important genome,” says Manolis Kellis, who is the senior author of the study and a professor of computer science in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) as well as a member of the Broad Institute of MIT and Harvard.
The research team also analyzed nearly 2,000 mutations that have arisen in different SARS-CoV-2 isolates since it began infecting humans, allowing them to rate how important those mutations may be in changing the virus’ ability to evade the immune system or become more infectious.
The SARS-CoV-2 genome consists of nearly 30,000 RNA bases. Scientists have identified several regions known to encode protein-coding genes, based on their similarity to protein-coding genes found in related viruses. A few other regions were suspected to encode proteins, but they had not been definitively classified as protein-coding genes.
To nail down which parts of the SARS-CoV-2 genome actually contain genes, the researchers performed a type of study known as comparative genomics, in which they compare the genomes of similar viruses. The SARS-CoV-2 virus belongs to a subgenus of viruses called Sarbecovirus, most of which infect bats. The researchers performed their analysis on SARS-CoV-2, SARS-CoV (which caused the 2003 SARS outbreak), and 42 strains of bat sarbecoviruses.
Kellis has previously developed computational techniques for doing this type of analysis, which his team has also used to compare the human genome with genomes of other mammals. The techniques are based on analyzing whether certain DNA or RNA bases are conserved between species, and comparing their patterns of evolution over time.
Using these techniques, the researchers confirmed six protein-coding genes in the SARS-CoV-2 genome in addition to the five that are well established in all coronaviruses. They also determined that the region that encodes a gene called ORF3a also encodes an additional gene, which they name ORF3c. The gene has RNA bases that overlap with ORF3a but occur in a different reading frame. This gene-within-a-gene is rare in large genomes, but common in many viruses, whose genomes are under selective pressure to stay compact. The role for this new gene, as well as several other SARS-CoV-2 genes, is not known yet.
The researchers also showed that five other regions that had been proposed as possible genes do not encode functional proteins, and they also ruled out the possibility that there are any more conserved protein-coding genes yet to be discovered.
“We analyzed the entire genome and are very confident that there are no other conserved protein-coding genes,” says Irwin Jungreis, lead author of the study and a CSAIL research scientist. “Experimental studies are needed to figure out the functions of the uncharacterized genes, and by determining which ones are real, we allow other researchers to focus their attention on those genes rather than spend their time on something that doesn’t even get translated into protein.”
The researchers also recognized that many previous papers used not only incorrect gene sets, but sometimes also conflicting gene names. To remedy the situation, they brought together the SARS-CoV-2 community and presented a set of recommendations for naming SARS-CoV-2 genes, in a separate paper published a few weeks ago in Virology.