Gene Ontology Database Maps Functions Across All Life

The Universal Language of What Genes Actually Do

In 1998, a small group of biologists sat down in a room and started arguing about what words mean. Not in the philosophical sense. They were trying to solve a practical crisis. Scientists studying fruit flies, yeast, and mice were all discovering similar genes doing similar things, but they called them by different names. A gene that helps a fruit fly build an eye might be called "eyeless" in one paper and something completely unpronounceable in another. The same protein could be described as "involved in cell division" by one lab and "regulates mitosis" by another. The field was drowning in synonyms.

So they built a dictionary. Not for words, but for functions. They called it the Gene Ontology, or GO, and it became one of the most quietly revolutionary tools in modern biology. In the 25 years since, that dictionary has grown into a sprawling knowledgebase that now catalogs the functions of genes across every domain of life, from bacteria to blue whales to the viruses that infect them. The latest update, published in Genetics by Suzi Aleksander, James P. Balhoff, Seth Carbon, J. Michael Cherry, and the entire Gene Ontology Consortium, reveals just how far the project has come and where it is headed next (Aleksander et al., 2023).

This is not a database that just stores sequences. This is a database that stores what genes do.

The Three-Layer Cake of Biological Meaning

The Gene Ontology is not one thing. It is three things stacked on top of each other, each layer adding more resolution to the picture of what a gene does inside a living cell.

Layer One: The Controlled Vocabulary

The foundation is a structured vocabulary of about 45,000 terms. These terms are organized into three categories that together aim to describe what a gene product does, what larger program it contributes to, and where it acts.

The first category is molecular function. This is what a gene product does at the chemical level. Does it cut DNA? Does it bind to calcium? Does it transfer a phosphate group from one molecule to another? These are atomic actions, the smallest unit of biological work.

The second category is biological process. This is the larger program that the molecular function participates in. Cutting DNA might be part of DNA repair, or it might be part of programmed cell death. The same molecular function can serve different processes depending on context.

The third category is cellular component. This is where the action happens. Is the protein floating in the cytoplasm? Embedded in the membrane? Sitting inside the nucleus? Location matters because it constrains what a protein can actually interact with.

Each term is linked to others in a hierarchy. If a gene is annotated as "DNA repair," it is automatically understood to be part of the broader category "response to DNA damage stimulus," which is part of "response to stress." The structure is a directed acyclic graph, which is a fancy way of saying that a term can have multiple parents and multiple children, but it never loops back on itself. This prevents the kind of circular logic that would make the ontology useless.
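The "automatically understood" part can be sketched in a few lines of Python. This is a minimal illustration, not the consortium's own tooling: the `PARENTS` table is a hand-written toy fragment whose term names follow the example above, and the `ancestors` helper is a hypothetical name chosen for this sketch. Real GO terms carry identifiers and more relation types than the simple parent link assumed here.

```python
# Toy fragment of the hierarchy, stored as child -> list of parents.
# Assumption: hand-picked, simplified terms for illustration only.
PARENTS = {
    "DNA repair": ["response to DNA damage stimulus", "DNA metabolic process"],
    "response to DNA damage stimulus": ["response to stress"],
    "DNA metabolic process": ["metabolic process"],
    "response to stress": ["biological process"],
    "metabolic process": ["biological process"],
    "biological process": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:  # a term with two children is visited once
                seen.add(parent)
                stack.append(parent)
    return seen


# A gene annotated to "DNA repair" is implicitly annotated to all of these.
implied = ancestors("DNA repair")
print(sorted(implied))
```

Because the graph has no cycles, the walk always terminates at the root. The `seen` set only prevents visiting a shared ancestor twice, which happens whenever a term has more than one parent.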

Layer Two: The Annotations

The vocabulary alone is just a list. The real power comes from the annotations, which are evidence-supported statements that link a specific gene in a specific organism to a specific GO term.

As of 2023, the GO knowledgebase contains over 8 million annotations covering genes from more than 150,000 species (Aleksander et al., 2023). That number is staggering until you realize that the vast majority of those annotations come from a handful of model organisms. Humans, mice, fruit flies, brewer's yeast, the mustard plant Arabidopsis, the roundworm C. elegans, and the bacterium E. coli account for the bulk of experimentally derived knowledge.

Every annotation comes with an evidence code that tells you how reliable it is. Experimental evidence codes mean a scientist actually ran an experiment. Computational evidence codes mean the annotation was inferred from sequence similarity or other computational methods. Author statements mean the annotation came from a paper's text. There is a big difference between "we watched this protein cut DNA in a test tube" and "this gene looks kind of like a gene that cuts DNA in mice."
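In practice, that difference shows up as a short code attached to each annotation. The sketch below groups a subset of the real GO evidence codes into the three buckets described above and filters a handful of made-up records. The grouping is deliberately simplified (GO's own documentation uses finer categories and lists IEA, the fully automated code, separately), and the gene names and the helper names `evidence_category` and `experimental_only` are hypothetical.

```python
# Simplified grouping of a subset of GO evidence codes.
# Assumption: IEA is folded into "computational" to match the three
# buckets used in the text.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
COMPUTATIONAL = {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA", "IEA"}
AUTHOR_STATEMENT = {"TAS", "NAS"}

# Hypothetical annotation records; gene names are placeholders.
annotations = [
    {"gene": "geneA", "term": "DNA repair", "evidence": "IDA"},  # direct assay
    {"gene": "geneB", "term": "DNA repair", "evidence": "ISS"},  # sequence similarity
    {"gene": "geneC", "term": "DNA repair", "evidence": "TAS"},  # author statement
    {"gene": "geneD", "term": "DNA repair", "evidence": "IEA"},  # electronic
]


def evidence_category(code):
    """Map an evidence code to one of the coarse buckets used here."""
    if code in EXPERIMENTAL:
        return "experimental"
    if code in COMPUTATIONAL:
        return "computational"
    if code in AUTHOR_STATEMENT:
        return "author statement"
    return "other"


def experimental_only(records):
    """Keep only annotations backed by an experimental evidence code."""
    return [r for r in records if evidence_category(r["evidence"]) == "experimental"]


for record in annotations:
    print(record["gene"], record["evidence"], evidence_category(record["evidence"]))
```

Filtering with `experimental_only(annotations)` keeps only the first record. Dropping inferred annotations by default is too strict for most uses, but making the category explicit helps avoid treating a similarity-based guess as a measured result.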

Layer Three: The Causal Models

The newest and most ambitious layer is GO Causal Activity Models, or GO-CAMs. These are mechanistic models of biological processes built by linking multiple GO annotations together with defined relations.

Think of it this way. A standard GO annotation tells you that protein A has molecular function X and participates in biological process Y. A GO-CAM tells you that protein A performs function X, which activates protein B, which performs function Z, which then represses protein C, and so on. It is a map of causality, not just a list of parts.

GO-CAMs are still in their early stages. As of the 2023 update, the knowledgebase contains about 1,500 of these models (Aleksander et al., 2023). But they represent a shift from describing what genes do to describing how they do it together.
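The protein A, B, C chain described above can be written down as a tiny data structure. This is only a sketch of the idea under stated assumptions, not the format the consortium uses: the `Activity` and `CausalEdge` classes, the two relation labels, and the placeholder function for protein C are all invented for illustration, and the helpers assume a simple linear chain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    protein: str
    function: str


@dataclass(frozen=True)
class CausalEdge:
    source: Activity
    relation: str  # simplified labels: "activates" or "represses"
    target: Activity


# The chain from the text. Protein C's function is not named there,
# so "function W" is a placeholder.
a = Activity("protein A", "function X")
b = Activity("protein B", "function Z")
c = Activity("protein C", "function W")
model = [CausalEdge(a, "activates", b), CausalEdge(b, "represses", c)]

SIGN = {"activates": 1, "represses": -1}


def describe_chain(edges, start):
    """Follow a linear chain of edges from `start`, one readable step each."""
    by_source = {edge.source: edge for edge in edges}
    steps, current, visited = [], start, set()
    while current in by_source and current not in visited:
        visited.add(current)  # guard against accidental loops
        edge = by_source[current]
        steps.append(f"{edge.source.protein} {edge.relation} {edge.target.protein}")
        current = edge.target
    return steps


def net_effect(edges, start):
    """Multiply the signs along the chain: +1 net activation, -1 net repression."""
    by_source = {edge.source: edge for edge in edges}
    sign, current, visited = 1, start, set()
    while current in by_source and current not in visited:
        visited.add(current)
        sign *= SIGN[by_source[current].relation]
        current = by_source[current].target
    return sign


print(describe_chain(model, a))
print(net_effect(model, a))
```

A flat list of annotations cannot answer the question `net_effect` answers here: whether, at the end of the chain, protein A's activity ends up switching protein C on or off.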

The Elephant in the Room: Most Life Is Unknown

Here is the uncomfortable truth that the GO knowledgebase makes visible. We know a lot about a few species and almost nothing about the rest.

The authors are refreshingly honest about this. They write that "most gene function knowledge currently derives from experiments carried out in a relatively small number of model organisms" (Aleksander et al., 2023). This is not a criticism of the database. It is a reflection of how biology works. We cannot do experiments on every species. We cannot even do experiments on most species. So we rely on inference.

This creates a bias problem. If you are studying a gene in an obscure deep-sea worm and you want to know what it does, the GO database will tell you what its cousin in a fruit fly does. That is useful. But it is also a guess. The worm might have evolved a completely different function for that gene. The database does not hide this uncertainty. It encodes it in the evidence codes. But users who do not look carefully might assume the annotation is as solid as the one for the fruit fly.

The authors address this by emphasizing the quality assurance processes that the consortium uses. Every annotation goes through automated checks and manual reviews. The consortium actively solicits user feedback and corrects errors. But the fundamental asymmetry remains. We know more about a fruit fly's genes than we do about 99 percent of the species on Earth.

How the Ontology Keeps Up With Discovery

Science moves fast. The GO knowledgebase has to move with it. The 2023 update reveals how the consortium manages this ongoing challenge.

New Terms for New Discoveries

Every year, biologists discover new molecular functions and new biological processes. The GO consortium adds new terms to accommodate them. Between 2022 and 2023, they added terms related to the biology of extracellular vesicles, which are tiny bubbles that cells release to communicate with each other. They added terms for new types of post-translational modifications, the chemical tags that cells attach to proteins to change their behavior. They added terms for viral processes that were not previously captured.

The process for adding a term is surprisingly democratic. Anyone can suggest a new term through the consortium's GitHub repository. The request is reviewed by domain experts and ontology editors. If it passes, it gets added to the next release. The ontology is not a static monument. It is a living document that changes every few weeks.

Keeping Annotations Current

Annotations are not permanent. A paper from 2005 might have claimed that a particular gene is involved in a particular process. A paper from 2023 might show that the original claim was wrong or incomplete. The consortium actively updates annotations to reflect the latest evidence.

This is harder than it sounds. Older papers might have used different gene names or different experimental techniques. The consortium has a team of biocurators who read new papers and update the database accordingly. As of 2023, the team includes curators at multiple institutions around the world, each specializing in different organisms or biological domains (Aleksander et al., 2023).

Handling the Virus Problem

Viruses are a special challenge. They are not alive in the conventional sense, but they have genes that do things. The GO knowledgebase includes annotations for viral genes, but the ontology terms were originally designed for cellular organisms. A virus does not "reproduce" the same way a bacterium does. It hijacks a host cell's machinery.

The consortium has been working to make the ontology more virus-friendly. They added terms specifically for viral processes and modified existing terms to accommodate the unique biology of viruses. This is not just an academic exercise. Understanding viral gene functions is critical for developing antiviral drugs and vaccines.

What the Research Does NOT Prove

The GO knowledgebase is a powerful tool, but it has limits. The authors do not hide them, but readers who are not careful might overinterpret what the database can do.

The database does not prove that a gene actually has a function in every context. An annotation might say that a gene is involved in "DNA repair" based on an experiment in a lab under specific conditions. That same gene might do something completely different in a different tissue or under different environmental conditions. The ontology captures what a gene can do, not necessarily what it always does.

The database does not capture the quantitative aspects of gene function. It tells you that a protein binds to DNA. It does not tell you how tightly it binds, how fast the binding happens, or how many molecules are involved. These details matter for building accurate models of cellular behavior.

The database does not resolve the tension between evolution and function. Two genes that look similar might have diverged in function millions of years ago. An annotation inferred from sequence similarity assumes that the two genes still share a function. That assumption can be wrong, which is why the evidence code attached to the annotation matters.

The authors acknowledge that "GO annotations cover genes from organisms across the tree of life as well as viruses, though most gene function knowledge currently derives from experiments carried out in a relatively small number of model organisms" (Aleksander et al., 2023). This is not a flaw in the database. It is a feature of the scientific literature that the database faithfully reflects.

The Future: From Static Database to Dynamic Model

The GO consortium is not resting. The 2023 paper outlines several directions for future development.

One major push is toward better integration with other biological databases. The GO knowledgebase is part of a larger ecosystem that includes protein sequence databases, structural biology databases, and pathway databases. The consortium is working to make it easier to move data between these resources.

Another push is toward more automated annotation. Manual curation is slow and expensive. A single biocurator might annotate a few hundred genes per year. There are millions of genes waiting to be annotated. Machine learning and natural language processing could help automate the process, but the consortium is cautious. Automated annotations are less reliable than experimental ones, and the consortium wants to maintain quality.

The most ambitious direction is the expansion of GO-CAMs. These causal models represent a shift from describing parts to describing systems. If the consortium can build enough of them, biologists might one day be able to simulate a biological process in silico before running an experiment in the lab.

What This Actually Means

The Gene Ontology knowledgebase is not a research paper you read once and forget. It is a tool that changes how biology gets done. Here is what it means for working scientists and for anyone who cares about understanding life.

▸ If you are a biologist studying a gene in a non-model organism, the GO database is your first stop. It will tell you what the gene's relatives do in better-studied species. But check the evidence codes. An annotation based on sequence similarity is a hypothesis, not a fact. Treat it accordingly.

▸ If you are a bioinformatician building a machine learning model, the GO database is a standard source of training labels. The structured hierarchy of terms gives you a way to measure how similar two genes are in function, not just in sequence. The 2023 update makes this data easier to access programmatically through improved APIs.

▸ If you are a science communicator, the GO database is a reminder that most biological knowledge is concentrated in a handful of species. When you write about a gene "causing" something in humans, remember that the annotation might have been inferred from a mouse or a yeast experiment. The uncertainty is real.

▸ If you are a student learning molecular biology, the GO database is a map of the territory. Use it to understand how functions connect to processes and how processes connect to cellular locations. It is the closest thing biology has to a periodic table of gene functions.

▸ If you are a funder or a policy maker, the GO database is evidence that basic research on model organisms pays off. Every annotation in that database traces back, directly or by inference, to an experiment, often on a fruit fly or a worm or a weed. That knowledge is now being used to understand human diseases, design new drugs, and engineer new organisms. The investment in basic biology is not a luxury. It is the foundation.

The Gene Ontology started as a solution to a vocabulary problem. It has become something more. It is a record of everything we know about what genes do, organized in a way that a computer can read and a human can understand. It is incomplete. It is biased toward a few species. It is constantly changing. But it is the best map we have.

And it is getting better every day.

References

[1] Suzi Aleksander, James P. Balhoff, Seth Carbon, J. Michael Cherry, and the Gene Ontology Consortium (2023). The Gene Ontology knowledgebase in 2023. Genetics.