A new software algorithm developed at Caltech lets researchers easily search for viruses in RNA sequence data, making it possible to detect viruses in samples and study their effects on biological processes.
The number of distinct viruses on Earth is nearly incomprehensible: It is estimated that there are 10 million individual viruses for every star in the universe. Viruses are everywhere, whether or not they cause disease, and many questions remain unanswered about how they affect our daily lives. For instance, it has been hypothesized that certain neurodegenerative conditions, including Alzheimer’s and Parkinson’s, could stem from viral infections. The new algorithm, built on an existing software tool called kallisto, can now shed light on this previously hidden viral world.
The investigation was carried out in the lab of Lior Pachter (BS ’94), Bren Professor of Computational Biology and Computing and Mathematical Sciences. A manuscript detailing the research was published on DATE in the journal Nature Biotechnology.
“When analyzing RNA from a human lung specimen, for instance, you capture all RNA: mostly human, but also RNA from any viruses that may be infecting human cells,” says former graduate student Laura Luebbert (PhD ’24), the first author of the study. “With conventional analysis methods, this information about viral presence is usually discarded. Our tool, on the other hand, enables researchers to preserve and quantify this information, even for unexpected or novel viruses.”
Modern transcriptomic tools measure which genes are expressed in cells and have generated enormous amounts of sequence data. Approaches such as single-cell RNA sequencing can identify the transcripts present in individual cells, allowing researchers to understand the functions of the various cell types within a sample. In principle, these data also offer a way to study the viruses in those samples, and the new tool makes that practical.
Kallisto is a computational tool that can identify viral genetic material in sequence data. The vast majority of viruses that cause common infectious diseases are RNA viruses (which use RNA, rather than DNA, as their genetic material), and they share a vital piece of protein machinery called the RNA-dependent RNA polymerase (RdRp). By searching for the genetic sequence that encodes this protein, kallisto can identify more than 100,000 species of viruses at minimal computational cost.
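To make the idea of searching reads for a shared protein concrete, here is a minimal toy sketch in Python. It is not kallisto's implementation or interface. It assumes a simple strategy: translate each read in all six reading frames and look for short amino-acid k-mers that also occur in a reference protein. The function names, the reference entry `toy_rdrp`, and its sequence are hypothetical and made up for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(read):
    """Yield the amino-acid translation of a read in all six reading frames."""
    read = read.upper().replace("U", "T")  # accept RNA or DNA letters
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            yield translate(strand[offset:])


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(proteins, k):
    """Map each amino-acid k-mer to the reference names that contain it."""
    index = {}
    for name, seq in proteins.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def match_read(read, index, k):
    """Return the reference names sharing at least one k-mer with the read."""
    hits = set()
    for frame in six_frames(read):
        for kmer in kmers(frame, k):
            hits |= index.get(kmer, set())
    return hits


# Hypothetical reference protein and a read that encodes it (illustration only).
toy_reference = {"toy_rdrp": "MKTAYIAKQR"}
toy_index = build_index(toy_reference, k=5)
toy_read = "ATGAAAACTGCTTATATTGCTAAACAACGT"
print(match_read(toy_read, toy_index, k=5))
```

Matching at the protein level rather than the nucleotide level is the point of the sketch: different nucleotide sequences can encode the same amino acids, so a protein-level comparison can tolerate some divergence between a read and the reference.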
Luebbert and her colleagues expect the tool to be applied widely to datasets, both to track emerging diseases and to explore the vast viral ecosystem that surrounds us.
“The product is a software tool designed to be approachable for any biologist,” Pachter says. “We built on a database called PalmDB, originally created by researchers Robert C. Edgar and Artem Babaian, and introduced our own novel algorithmic ideas. Any researcher with sequence data can use kallisto to discover which viruses are present in their samples and which cells they are in.”
The paper is titled “Detection of viral sequences at single-cell resolution identifies novel viruses associated with host gene expression changes.” Alongside Luebbert and Pachter, co-authors include Caltech graduate students Delaney K. Sullivan and Maria Carilli, as well as former graduate students Kristján Eldjárn Hjörleifsson (PhD ’23), Tara Chari (PhD ’24), and Alexander Viloria Winnett (PhD ’24, now a postdoctoral researcher at Caltech). Funding was provided by Caltech, the UCLA–Caltech Medical Scientist Training Program, the National Science Foundation, the National Institutes of Health, and the Gates Foundation. Lior Pachter holds a position as an affiliated faculty member with the Tianqiao and Chrissy Chen Institute for Neuroscience at Caltech.