GISAID started in 2008, after researchers around the world expressed some reticence at putting sequence data from their surveillance of bird flu into public domain databases. Under-resourced scientists didn’t want to drop a new sequence but then get scooped on the analysis by some other researcher with a zillion-dollar lab. And as GISAID got more and more data, the people who ran it had to come up with a way to identify each sequence and put them all into context with one another. Now it’s the main data repository for SARS-CoV-2 genomes.
But the world of Covid nomenclature has two more great and noble houses. Nextstrain, based at the Fred Hutchinson Cancer Research Center and the University of Basel, is one. Its organization revolves around clades, big branches on the virus’ phylogenetic tree. (Nextstrain started out doing the same job for influenza.) Its names have a cheat code: clades are labeled by the year they’re discovered and a letter of the alphabet, and then according to specific mutations of interest. The de Oliveira team’s variant had a bunch of mutations, but the N501Y was important. (The mutation changes an asparagine, abbreviated with the letter N, to a tyrosine, abbreviated with a Y, at the 501st amino acid on the virus’ spike protein, in the RBD (that’s Receptor Binding Domain) that attaches to the human ACE2 receptor (that’s Angiotensin-Converting Enzyme 2).)
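For readers who think better in code, the shorthand behind a label like N501Y can be unpacked mechanically. What follows is only a rough sketch: the function names `parse_substitution` and `describe` are made up for illustration, and it handles nothing beyond simple one-for-one amino acid swaps (no deletions, insertions, or gene prefixes).

```python
import re

# Standard one-letter abbreviations for the 20 amino acids.
AMINO_ACIDS = {
    "A": "alanine", "R": "arginine", "N": "asparagine", "D": "aspartic acid",
    "C": "cysteine", "Q": "glutamine", "E": "glutamic acid", "G": "glycine",
    "H": "histidine", "I": "isoleucine", "L": "leucine", "K": "lysine",
    "M": "methionine", "F": "phenylalanine", "P": "proline", "S": "serine",
    "T": "threonine", "W": "tryptophan", "Y": "tyrosine", "V": "valine",
}


def parse_substitution(label):
    """Split a label such as 'N501Y' into (original, position, replacement).

    Hypothetical helper: it only understands the simple
    <letter><number><letter> pattern described in the text.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
    if match is None:
        raise ValueError(f"not a simple substitution label: {label!r}")
    original, position, replacement = match.groups()
    if original not in AMINO_ACIDS or replacement not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid letter in {label!r}")
    return original, int(position), replacement


def describe(label):
    """Spell the label out in words."""
    original, position, replacement = parse_substitution(label)
    return (
        f"{AMINO_ACIDS[original]} to {AMINO_ACIDS[replacement]} "
        f"at position {position}"
    )


print(describe("N501Y"))
```

The label says nothing about which protein it refers to; that context (here, the spike protein) has to come from the surrounding sentence.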
Easy, right? (Ahem.) But then things got even more complicated. The one the UK researchers were seeing had the same mutation, among many others. To distinguish it from de Oliveira’s, each got a new designation, with “V1” appended to the one from the UK and “V2” to the other. Another similar variant that led back to Manaus, in Brazil, came to be “V3.”
“We’re not trying to name everything. In fact, we’re really explicitly trying not to have more than 10 or 20 names a year, and we’re interested in picking out the most important things,” Hodcroft says. “That’s, like, big changes in the tree. When we see groups that are different in their genetics and they spread, even if it takes a while, in a region or around the world, we give those a Nextstrain clade.”
That’s not what the other bigwig in the space does, though. It’s analytical software called Pangolin—“Phylogenetic Assignment of Named Global Outbreak LINeages.” So-called Pango lineages start with a letter, initially A or B, designating the first two diverging SARS-CoV-2 sequences that emerged from China in late 2019 and early 2020. Each generation gets a number, and its descendants get an additional number, preceded by a period—but only for three generations. Four or more, and the whole lineage gets assigned to a new letter. Imagine an Obed-begat-Jesse-and-Jesse-begat-David vibe, but with diagrams and genomic receipts. “Lineages are operating on a different resolution. You can have very big ones and small ones, but the idea is to capture the emerging edge of the pandemic,” says Áine O’Toole, an evolutionary biologist at the University of Edinburgh who created Pangolin and is now one of its main developers. “The idea is to have a cluster of sequences that is linked to some sort of epidemiological piece of information.”
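The naming rule described above (a letter, then up to three dot-separated numbers, then a fresh letter for anything deeper) can be sketched in a few lines. This is an illustration of the rule as the article states it, not the Pangolin software itself: the helper names are invented, and the one-entry alias table is included purely as an example of how a new letter can stand in for a longer name.

```python
import re

MAX_NUMERIC_LEVELS = 3  # the "only for three generations" rule in the text

# Illustrative alias table with a single example entry; a real table
# would be far larger and maintained by the lineage team.
ALIASES = {"P": "B.1.1.28"}

_NAME_PATTERN = re.compile(r"[A-Z]+(\.\d+)*")


def split_lineage(name):
    """Return (letter_prefix, [numbers]) for a name such as 'B.1.1.7'."""
    if not _NAME_PATTERN.fullmatch(name):
        raise ValueError(f"not a lineage-style name: {name!r}")
    prefix, *numbers = name.split(".")
    if len(numbers) > MAX_NUMERIC_LEVELS:
        raise ValueError(f"{name!r} has more than {MAX_NUMERIC_LEVELS} numeric levels")
    return prefix, [int(n) for n in numbers]


def needs_new_letter(parent):
    """True if a descendant of `parent` would be a fourth numeric level."""
    _, numbers = split_lineage(parent)
    return len(numbers) == MAX_NUMERIC_LEVELS


def expand(name, aliases=ALIASES):
    """Replace an aliased letter with the longer name it stands for.

    Only one level of aliasing is resolved in this sketch.
    """
    prefix, numbers = split_lineage(name)
    root = aliases.get(prefix, prefix)
    return ".".join([root] + [str(n) for n in numbers])


print(split_lineage("B.1.1.7"))      # letter B, three numeric levels
print(needs_new_letter("B.1.1.28"))  # already three levels deep
print(expand("P.1"))                 # spelled out in full
```

Expanding an alias produces a name with four numeric levels, which is exactly why the shorter letter exists: `split_lineage` would reject the expanded form as too deep.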
(After publication, O’Toole emailed me to note that while she had created the Pangolin software, she didn’t come up with the Pango notation used in the nomenclature—that was a bigger team. It’s an important distinction that also proves my point about how hard it is to name things, including the people who name things.)
Pangolin has a tricky bit. Anyone working on a viral genome can use the software to try to figure out whether they have something new, and where it might fit with all the known lineages (with data pulled from GISAID, just as Nextstrain does). But making a final call on whether a strain is indeed new, and deserves a different spot in the heuristic—its Pango lineage—is up to actual living people on the team and suggestions from scientists in the field. “I think maybe it’s something we need to work harder on, to try to convey there’s a difference between lineage designation and lineage assignment,” O’Toole says. “When we designate lineages, that’s just based on what we know. If you’ve got a new lineage and we haven’t seen it, Pangolin won’t be able to assign it, because it can’t predict lineages that will arise in the future. So there is a lag.”