Gene Knowledgebase Expands Across Tree of Life

The Language of Life Just Got a Lot More Complete

Imagine trying to understand the plot of a novel, but you only have the vocabulary list for a few chapters. You know the words, but not how they fit together, not what they mean in context, not which characters are doing what. That was the state of genetics for decades. We had the genes. We had the proteins. But we lacked a unified dictionary for what they actually do.

In February 2023, a consortium of scientists published the latest version of a project that has been quietly changing that for over two decades. The Gene Ontology knowledgebase, or GO, now covers genes from organisms across the entire tree of life, including viruses (Aleksander et al., 2023). The paper, published in Genetics, is not a flashy discovery. It is an infrastructure report. But infrastructure is what makes everything else possible.

The authors, including Suzi Aleksander, James P. Balhoff, Seth Carbon, and J. Michael Cherry, describe a system that has grown from a simple vocabulary list into a sophisticated computational model of how genes work. It is the difference between having a map with street names and having a map with traffic patterns, construction zones, and shortcuts.

What Is a Gene Ontology, Actually?

The term sounds dry. "Ontology" is a philosopher's word. But the idea is simple: it is a controlled vocabulary for describing what genes and their products do. Instead of one scientist saying a gene "participates in cell death" and another saying it "causes apoptosis," the ontology forces everyone to use the same precise term.

The GO knowledgebase has three components, each more ambitious than the last.

The Ontology Itself

This is the dictionary. It is a computational knowledge structure describing the functional characteristics of genes (Aleksander et al., 2023). It organizes terms into three domains:

▸Molecular function: What a gene product does at the molecular level. Binds ATP. Catalyzes a reaction. Transports ions.
▸Biological process: The larger program the gene product participates in. Cell division. Immune response. Circadian rhythm.
▸Cellular component: Where the gene product lives. The nucleus. The mitochondrion. The plasma membrane.

Each term is linked to others through defined relationships. "Protein kinase activity" is a type of "catalytic activity." "DNA repair" is a part of "cellular response to DNA damage stimulus." This structure lets computers reason about genes in ways that simple keyword searches cannot.
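To make the idea of "reasoning" concrete, here is a minimal sketch of how a program might walk such a structure. It is a toy, not the real ontology: the dictionary TOY_ONTOLOGY and the helper names ancestors and is_kind_of are invented for illustration, and the chain of terms is shortened compared with what an actual GO release contains.

```python
from collections import deque

# Toy fragment of an ontology: child term -> list of (relation, parent term).
# The term names echo the examples in the text; the edges are simplified for
# illustration and are not copied from a GO release.
TOY_ONTOLOGY = {
    "protein kinase activity": [("is_a", "kinase activity")],
    "kinase activity": [("is_a", "catalytic activity")],
    "catalytic activity": [("is_a", "molecular_function")],
    "molecular_function": [],
}


def ancestors(term, ontology):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for _relation, parent in ontology.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def is_kind_of(term, candidate, ontology):
    """True if `candidate` is `term` itself or one of its ancestors."""
    return candidate == term or candidate in ancestors(term, ontology)


print(sorted(ancestors("protein kinase activity", TOY_ONTOLOGY)))
print(is_kind_of("protein kinase activity", "catalytic activity", TOY_ONTOLOGY))
```

A keyword search for "catalytic activity" would miss a gene annotated only as a protein kinase. A walk up the graph finds it, because the relationship is stored in the structure rather than in the wording.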

The Annotations

This is where the rubber meets the road. An annotation is an evidence-supported statement that a specific gene product has a particular functional characteristic (Aleksander et al., 2023). It is not a guess. It is a claim backed by experimental evidence, computational analysis, or author statements from published papers.

The authors report that GO annotations now cover genes from organisms across the tree of life. But here is the catch: most gene function knowledge currently derives from experiments carried out in a relatively small number of model organisms (Aleksander et al., 2023). This means we know a lot about mouse genes, fruit fly genes, and yeast genes. We know far less about the genes of, say, a deep-sea sponge or a rare Amazonian tree.

GO Causal Activity Models

This is the newest and most powerful component. GO Causal Activity Models, or GO-CAMs, are mechanistic models of molecular pathways (Aleksander et al., 2023). They link multiple GO annotations together using defined relations to show how genes work together in biological processes.

Think of it this way. The ontology gives you the vocabulary. The annotations give you the sentences. GO-CAMs give you the paragraphs and chapters. They show causation. They show sequence. They show that protein A activates protein B, which then represses gene C, leading to cell differentiation.

The Scale of the Thing

The numbers are staggering. The GO knowledgebase contains hundreds of thousands of annotations for tens of thousands of genes. Each annotation receives extensive quality assurance checks, reviews, and user feedback (Aleksander et al., 2023). The consortium includes scientists from institutions around the world.

But scale alone is not impressive. What matters is that the system is alive. It is continually expanded, revised, and updated in response to newly published discoveries (Aleksander et al., 2023). When a lab publishes a paper showing that a previously uncharacterized protein actually binds RNA, that finding can be curated into the knowledgebase as a new annotation and reach users in a later release.

This is the opposite of a static encyclopedia. It is a living document that evolves as science evolves.

Why This Matters More Than You Think

Here is the thing about gene function: it is not obvious from the sequence. You can sequence a genome and get a list of genes, but that is like having a parts list for a car without knowing what each part does. Is that gene a kinase? A transcription factor? A structural protein? The sequence alone rarely settles the question. Similarity to a known gene can suggest an answer, but it takes evidence to confirm it.

The GO knowledgebase solves this by connecting sequence to function through evidence. And because it is standardized, a researcher studying a gene in zebrafish can immediately see what is known about that same gene in humans, mice, and yeast. This cross-species comparison is where the magic happens.

The authors describe how GO annotations are used by nearly every major bioinformatics tool. When you search a gene in a database, the GO terms are what tell you what that gene does. When you do a gene enrichment analysis, GO terms are what tell you which biological processes are overrepresented in your data.

Without GO, modern genomics would be chaos. Every lab would use its own vocabulary. Every paper would describe function in its own way. Comparisons would be impossible. The GO knowledgebase is the invisible infrastructure that makes genome-wide analysis possible.

The Model Organism Problem

Here is the uncomfortable truth the paper does not hide. Most of our gene function knowledge comes from a handful of lab favorites. Mouse, rat, zebrafish, fruit fly, nematode, yeast, and a few plants and bacteria. The authors acknowledge this directly: most gene function knowledge currently derives from experiments carried out in a relatively small number of model organisms (Aleksander et al., 2023).

This creates a bias. We know a lot about genes that are conserved across these species. We know far less about genes that are unique to non-model organisms. A gene that helps a desert plant survive drought might have no known function because nobody has studied it in a lab.

The GO knowledgebase can only include what has been published. It cannot invent knowledge. So the annotations reflect the priorities of the scientific community. If nobody has studied a gene, it gets no annotations. This is not a flaw in the ontology. It is a reflection of our collective ignorance.

How Annotations Are Made

The authors describe a rigorous process. Every annotation must be supported by evidence. The evidence can be experimental, such as a direct assay showing that a protein binds DNA. It can be computational, such as a sequence similarity search suggesting that a gene has kinase activity. It can be from an author statement in a published paper.

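Each annotation includes an evidence code that records what kind of evidence supports it, such as a direct assay (IDA), sequence similarity (ISS), or an automated electronic inference (IEA). That tells users how much confidence the annotation deserves. Experimental evidence codes carry more weight than computational predictions. But even computational annotations are valuable. They provide hypotheses that can be tested.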
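A short sketch shows how a record like this might look in code, and how a user could filter by evidence. The evidence codes are standard GO codes, but the grouping below is deliberately partial, and everything else (the Annotation class, the gene names, the reference strings) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Standard GO evidence codes, grouped into broad categories. The grouping is
# partial and simplified: it is an illustration, not the full official list.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
COMPUTATIONAL = {"ISS", "ISO", "ISA", "ISM", "IBA", "RCA", "IEA"}
AUTHOR_STATEMENT = {"TAS", "NAS"}


@dataclass(frozen=True)
class Annotation:
    gene_product: str   # hypothetical gene name
    go_term: str        # term label, as in the examples in the text
    evidence_code: str  # one of the GO evidence codes
    reference: str      # placeholder, not a real citation


def evidence_category(code):
    """Map an evidence code to a broad category."""
    if code in EXPERIMENTAL:
        return "experimental"
    if code in COMPUTATIONAL:
        return "computational"
    if code in AUTHOR_STATEMENT:
        return "author statement"
    return "other"


def experimental_only(annotations):
    """Keep only annotations backed by experimental evidence."""
    return [a for a in annotations if a.evidence_code in EXPERIMENTAL]


# Invented example records.
ANNOTATIONS = [
    Annotation("geneA", "DNA binding", "IDA", "REF:example-1"),
    Annotation("geneB", "protein kinase activity", "ISS", "REF:example-2"),
    Annotation("geneC", "nucleus", "IEA", "REF:example-3"),
]

for a in ANNOTATIONS:
    print(a.gene_product, a.go_term, evidence_category(a.evidence_code))
```

Filtering with experimental_only is the cautious choice when a conclusion has to rest on bench evidence. Keeping the computational records is the better choice when the goal is to find hypotheses worth testing.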

The quality assurance process is extensive. Annotations are reviewed by curators. User feedback is incorporated. When a new paper contradicts an existing annotation, the annotation is updated or removed. The system is designed to be self-correcting.

GO Causal Activity Models: The Next Frontier

The most exciting development described in the paper is the expansion of GO-CAMs. These are not just lists of annotations. They are models of causation and regulation.

A GO-CAM might show that a receptor protein, when activated by a ligand, phosphorylates a kinase. That kinase then enters the nucleus and activates a transcription factor. The transcription factor then binds to a promoter and turns on a set of genes involved in cell growth.

Each step in this model is supported by evidence. The relations between steps are defined. The model can be queried, compared, and updated. This is a huge step beyond simple annotations.
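The receptor-to-growth-genes example can be sketched as a small graph. This is not the GO-CAM data format, only a rough picture of the idea: the names ReceptorR, KinaseK, FactorT, and GrowthGenes are made up, and the relation string is used here as an illustrative label for a causal link.

```python
from collections import defaultdict

# What each (hypothetical) gene product does, described with a term label.
ACTIVITIES = {
    "ReceptorR": "signaling receptor activity",
    "KinaseK": "protein kinase activity",
    "FactorT": "DNA-binding transcription factor activity",
}

# Causal edges: (upstream, relation, downstream). The relation label is
# illustrative; this toy list is an assumption, not a curated model.
EDGES = [
    ("ReceptorR", "directly positively regulates", "KinaseK"),
    ("KinaseK", "directly positively regulates", "FactorT"),
    ("FactorT", "directly positively regulates", "GrowthGenes"),
]


def build_index(edges):
    """Group outgoing edges by their upstream node."""
    index = defaultdict(list)
    for source, relation, target in edges:
        index[source].append((relation, target))
    return index


def downstream_steps(start, edges):
    """Follow causal edges from `start` and return each step that is reached."""
    index = build_index(edges)
    steps = []
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for relation, target in index.get(node, []):
            steps.append((node, relation, target))
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return steps


for source, relation, target in downstream_steps("ReceptorR", EDGES):
    print(f"{source} ({ACTIVITIES.get(source, 'n/a')}) {relation} {target}")
```

Asking "what happens downstream of the receptor?" becomes a graph query. A flat list of annotations can say that each protein has an activity, but it cannot answer that question.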

The authors report that GO-CAMs are being created for an increasing number of biological processes. They are particularly useful for complex pathways that involve multiple cell types, multiple tissues, and multiple time points. A simple annotation cannot capture that a protein acts at one step in a pathway but not at another. A GO-CAM can.

What the Paper Does Not Prove

The GO knowledgebase is powerful, but it has limitations. The paper is honest about these.

First, the knowledgebase is only as good as the evidence behind it. If a published paper contains an error, that error can propagate into the annotations. The QA process catches many of these, but not all.

Second, the ontology is biased toward well-studied organisms and processes. If you are studying a rare disease gene that has only been investigated in one paper, the annotations will be sparse. If you are studying a gene in a non-model organism, the annotations may rely heavily on computational predictions.

Third, the knowledgebase cannot capture everything. Some gene functions are context-dependent. A protein might act as a kinase in one cell type and a scaffold in another. The annotations can represent this, but only if the evidence supports it.

Fourth, the GO-CAMs are still relatively new. The authors describe them as a work in progress. They are not yet comprehensive for all biological processes. Building a complete GO-CAM for every pathway in every organism is a monumental task.

The Practical Impact

For working scientists, the GO knowledgebase is an essential tool. Here is how it changes the game.

Gene Discovery

When you sequence a new genome, the first thing you do is annotate the genes. You search for sequence similarity to known genes. Then you use GO to infer function. Without GO, you would have no way to organize that information.

Enrichment Analysis

When you do an experiment that identifies a list of differentially expressed genes, you need to know what those genes do. GO enrichment analysis tells you if your gene list is enriched for particular biological processes, molecular functions, or cellular components. This is how you go from a list of gene names to a biological insight.
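One common way to ask "is my list enriched for this term?" is a one-sided hypergeometric test, sketched below. This is a generic statistical approach under simple assumptions, not the method of any particular GO tool, and the function name and the example numbers are invented for illustration.

```python
from math import comb


def enrichment_p_value(population_size, annotated_in_population,
                       sample_size, annotated_in_sample):
    """Probability of seeing at least `annotated_in_sample` annotated genes
    when drawing `sample_size` genes at random, without replacement, from a
    population in which `annotated_in_population` genes carry the term."""
    upper = min(annotated_in_population, sample_size)
    total = comb(population_size, sample_size)
    tail = sum(
        comb(annotated_in_population, k)
        * comb(population_size - annotated_in_population, sample_size - k)
        for k in range(annotated_in_sample, upper + 1)
    )
    return tail / total


# Invented numbers: 20,000 genes in the genome, 200 annotated to the term,
# 100 genes in the experimental list, 10 of which carry the term.
p = enrichment_p_value(20000, 200, 100, 10)
print(f"p = {p:.3g}")
```

With these numbers, chance alone would put about one annotated gene in the list, so finding ten is a strong signal. In practice many terms are tested at once, so real analyses also correct for multiple testing.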

Cross-Species Comparison

When you find a gene in zebrafish that is involved in heart development, you can immediately see what is known about the human version. If the human gene is associated with congenital heart defects, you have a hypothesis. This cross-species comparison is only possible because of the standardized GO vocabulary.

Hypothesis Generation

When you have a gene of unknown function, you can look at its GO annotations. Even if the annotations are based on computational predictions, they give you a starting point. You can test whether the gene really has kinase activity or really localizes to the nucleus.

What This Actually Means

The GO knowledgebase is not a flashy discovery. It is not a cure for a disease or a new technology. It is infrastructure. But infrastructure is what makes everything else possible.

▸If you work with genes, you use GO whether you know it or not. Every major bioinformatics database, every genome browser, every enrichment analysis tool depends on GO annotations. The knowledgebase is the foundation.

▸The bias toward model organisms is real and problematic. If you study a non-model organism, your gene annotations will be sparse. This is not a failure of GO. It is a reflection of where research funding has gone. The solution is more research on diverse organisms.

▸GO-CAMs represent a fundamental shift in how we represent gene function. Instead of static annotations, we get dynamic models of causation and regulation. This is the future of functional genomics.

▸The quality assurance process matters. The GO consortium does not just dump annotations into a database. They review, update, and correct. This is why the knowledgebase is trusted.

▸The system is designed to be self-correcting. When new evidence contradicts old annotations, the annotations change. This is how science should work. The GO knowledgebase is a model for how to build a living, evolving resource.

The Gene Ontology knowledgebase is one of the most important scientific projects you have never heard of. It is the dictionary that makes the language of life legible. And it just got a lot more complete.

References

[1] Suzi Aleksander, James P. Balhoff, Seth Carbon, J. Michael Cherry, et al. (2023). The Gene Ontology knowledgebase in 2023. Genetics.