How Does the Genome Work? – Part 1 of 5

The genome as a computational system. Five parts: DNA as executable code, the cell as hardware, the epigenome as runtime environment, the bootstrap problem, and what the architecture predicts.


How Does the Genome Work?

The Cell as a Computational System

Meaning Books, 2026

Standalone Foundation Paper

Disclaimer: This paper was developed collaboratively between D. L. White and Claude (Anthropic). White directed the inquiry and posed the core questions. Claude provided technical reasoning and co-developed the argument chain. The intent is to examine the genome as an information system using the language and concepts of computing architecture. The analogy is not metaphorical. It is structural.

Part 1 — DNA as Executable Code

The Alphabet

Every known information processing system begins with an encoding scheme — a way to represent information in discrete, copyable units. English uses 26 letters. Binary computing uses two symbols: 0 and 1. Morse code uses dots and dashes. The specific symbols do not matter. What matters is that they are discrete (clearly distinguishable from one another), combinatorial (arrangeable in sequences that carry meaning), and copyable (reproducible without loss of information).

DNA uses four chemical bases: adenine (A), thymine (T), cytosine (C), and guanine (G). These are the alphabet. Each base is a nucleotide — a molecule consisting of a sugar, a phosphate group, and the nitrogenous base that gives it its identity. The bases are strung along a sugar-phosphate backbone in a linear sequence, and it is the order of the bases — not their chemistry per se — that carries the information. A strand of DNA is, in the most literal sense, a written sequence in a four-letter alphabet.

The double-helix structure discovered by Watson and Crick in 1953 adds a critical feature: complementary base pairing. A always pairs with T. C always pairs with G. This means every strand of DNA carries its own backup copy in the opposing strand. The information is stored twice, in complementary form, enabling both error detection and faithful replication. This is not a biological quirk. It is a redundancy scheme, closer to mirrored storage than to a single parity bit, and it does the same job as the error-detection systems used in digital data storage.
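The pairing rule can be sketched in a few lines of Python. This is a minimal illustration, not a model of any repair enzyme: the function names and example sequences are assumptions made for this sketch. It derives the opposing strand from a given strand and reports positions where two strands break the pairing rule.

```python
# Watson-Crick pairing rule: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement_strand(strand):
    """Return the opposing strand, written in its own 5'->3' reading direction."""
    return "".join(PAIR[base] for base in reversed(strand))

def mismatches(strand, opposing):
    """Return 0-based positions in `strand` whose partner breaks the pairing rule."""
    partner = opposing[::-1]  # the strands run antiparallel, so align by reversing one
    return [i for i, (a, b) in enumerate(zip(strand, partner)) if PAIR[a] != b]

top = "ATGGCA"                        # hypothetical sequence
bottom = complement_strand(top)       # "TGCCAT"
damaged = "TGCTAT"                    # same strand with one base altered

print(bottom, mismatches(top, bottom), mismatches(top, damaged))
```

Note what the check does and does not deliver: a mismatch shows that the two strands disagree, but the pairing rule alone does not say which strand holds the original base.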

The human genome contains approximately 3.2 billion base pairs. In information-theoretic terms, each base pair encodes 2 bits of information (four possible states, so log2(4) = 2). The total information content is therefore approximately 6.4 billion bits, or roughly 800 megabytes (about 763 MiB if a megabyte is counted as 2^20 bytes). This is comparable to the contents of a modest software library, but with compression and contextual encoding that make the effective information content substantially higher than the raw bit count suggests.
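The arithmetic is short enough to check directly. The sketch below assumes only the figures in the paragraph above (3.2 billion base pairs, four equiprobable bases); the variable names are illustrative. The result depends on whether a megabyte is taken as 10^6 bytes or 2^20 bytes, which is why figures near both 800 and 760 appear in discussions of genome size.

```python
import math

BASE_PAIRS = 3.2e9             # approximate human genome size, from the text
BITS_PER_BASE = math.log2(4)   # four equiprobable symbols give 2 bits each

total_bits = BASE_PAIRS * BITS_PER_BASE
total_bytes = total_bits / 8

megabytes = total_bytes / 1e6      # decimal megabytes (10^6 bytes)
mebibytes = total_bytes / 2**20    # binary mebibytes (2^20 bytes)

print(f"{total_bits:.2e} bits = {megabytes:.0f} MB = {mebibytes:.0f} MiB")
# prints: 6.40e+09 bits = 800 MB = 763 MiB
```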

The Code

In computing, raw data becomes functional only when it is organized into executable instructions. A hard drive full of random bytes does nothing. The same bytes arranged into a program — with defined entry points, instruction sequences, and termination signals — can operate a machine.

DNA works the same way. The four-base alphabet is organized into functional units at multiple scales.

Codons are the basic instruction words. A codon is a sequence of three consecutive bases, for example ATG or GCA. With four possible bases at each of three positions, there are 64 possible codons (4 x 4 x 4 = 64). Of these, 61 specify the 20 amino acids and 3 are stop signals; ATG, which encodes methionine, doubles as the start signal. The mapping from codons to amino acids is called the genetic code, and it is essentially universal across all known life, from bacteria to humans (a few lineages and organelles, such as mitochondria, use minor variants). This universality is itself remarkable: nearly every organism on Earth uses the same encoding table, much as networked computers rely on a shared character encoding such as ASCII.

The codon table includes built-in redundancy. Most amino acids are specified by more than one codon — leucine, for example, is encoded by six different codons (TTA, TTG, CTT, CTC, CTA, CTG). This is called degeneracy, and it serves the same function as redundant encoding in telecommunications: it provides error tolerance. A single-base mutation in the third position of a leucine codon often produces another leucine codon, leaving the protein unchanged. The code is structured to minimize the damage from copying errors. That is not a property of random sequences. It is a property of engineered communication protocols.
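The codon table and its degeneracy can be inspected directly. The sketch below builds the standard genetic code from its conventional compact form (bases in T, C, A, G order, one letter per amino acid, "*" for stop) and counts how many third-position changes to a leucine codon still yield leucine. The helper name `third_position_outcomes` is an assumption made for this illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in the conventional TCAG ordering; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
leucine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "L")

def third_position_outcomes(amino_acid):
    """Count (synonymous, total) single-base changes at codon position 3."""
    synonymous = total = 0
    for codon, aa in CODON_TABLE.items():
        if aa != amino_acid:
            continue
        for base in BASES:
            if base == codon[2]:
                continue  # not a change
            total += 1
            if CODON_TABLE[codon[:2] + base] == amino_acid:
                synonymous += 1
    return synonymous, total

print(len(CODON_TABLE), len(sense_codons), stop_codons)
print(leucine_codons, third_position_outcomes("L"))
```

For leucine, 14 of the 18 possible third-position changes are silent. The same function shows that the tolerance is uneven: tryptophan and methionine, each with a single codon, have no silent third-position changes at all.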

Genes are the functional subroutines. A gene is a defined sequence of DNA that encodes a specific protein (or functional RNA molecule). It has a start signal (the ATG codon, encoding methionine), a body of instruction codons, and a stop signal (TAA, TAG, or TGA). When the cell needs a particular protein, it locates the gene, copies its sequence into a messenger molecule (messenger RNA, or mRNA), and sends that message to the protein-assembly machinery for execution.

In computing terms, a gene is a callable function — a named, bounded unit of code with a defined input (the start codon and regulatory signals), a defined process (the coding sequence), and a defined output (the protein product). The genome contains approximately 20,000 protein-coding genes in humans. This is the callable function library.
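The start, body, and stop structure described above can be sketched as a small translation routine. This is a deliberately simplified illustration: it reads DNA letters directly, skips the mRNA step, and ignores introns and regulatory signals. The function name `express` and the input sequence are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in the conventional TCAG ordering; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def express(dna):
    """Translate from the first ATG to the first in-frame stop codon."""
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start signal, nothing is produced
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break  # stop signal ends the product
        protein.append(amino_acid)
    return "".join(protein)

print(express("CCATGGCATTATAAGG"))  # prints: MAL
```

The two leading bases and everything after TAA are ignored, which is the sense in which the gene is a bounded unit with a defined entry point and a defined end.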

What Kind of Information Is This?

The preceding sections describe the genome's structure — its alphabet, its encoding, and its functional organization. But structure alone does not answer the deeper question: what kind of information does the genome contain, and what does that tell us about how it came to exist?

Information theory, formalized by Claude Shannon in 1948, provides the framework for answering this question rigorously.

Shannon information measures the minimum number of bits required to encode a message given the probabilities of its symbols. A DNA sequence of 3.2 billion base pairs, with four equiprobable bases at each position, carries approximately 6.4 billion bits of Shannon information. This is a measure of capacity — how much data the sequence contains — but it makes no distinction between meaningful code and random noise. A randomly generated sequence of 3.2 billion bases would carry the same Shannon information as the human genome. Shannon information tells you how much is there. It does not tell you whether it means anything.

This distinction matters. A hard drive filled with random bits has maximum Shannon entropy. A hard drive running an operating system has lower Shannon entropy (the code has structure, patterns, and redundancy that make it compressible). Yet the operating system is the one carrying functional information. Shannon entropy alone cannot identify functional content. A different measure is needed.
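The limitation is easy to demonstrate. The sketch below computes per-symbol Shannon entropy from base frequencies, then shuffles a sequence and computes it again. Because this measure depends only on how often each symbol occurs, the ordered and shuffled versions score the same. The function name and the example sequence are assumptions made for this illustration.

```python
import math
import random
from collections import Counter

def entropy_bits_per_symbol(sequence):
    """Zero-order Shannon entropy, estimated from symbol frequencies alone."""
    counts = Counter(sequence)
    n = len(sequence)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

uniform = "ACGT" * 1000                  # four equally frequent bases
ordered = "ATGGCATTATAA" * 100           # hypothetical structured sequence
shuffled = "".join(random.Random(0).sample(ordered, len(ordered)))

print(entropy_bits_per_symbol(uniform))  # 2.0 bits per base
print(entropy_bits_per_symbol(ordered), entropy_bits_per_symbol(shuffled))
```

Shuffling destroys whatever the order encoded, yet the entropy figure does not move. Measures that account for order, such as compressibility, capture more structure but still say nothing about function.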

Functional information — also called specified complexity — is the concept that separates meaningful sequences from random ones. A sequence has functional information when it is simultaneously complex (not reducible to a simple repeating pattern) and specified (matching an independent functional requirement). Complexity without specification is noise. Specification without complexity is a crystal. The genome is both complex and specified — and this combination has distinctive origin implications.

Consider the three categories:

Repetitive order. A DNA sequence of ATATATATAT is simple, predictable, and compressible to a short formula ("repeat AT"). It carries low functional information. Salt crystals, standing waves, and periodic chemical reactions produce this kind of order. Natural processes generate it routinely.

Random disorder. A DNA sequence of randomly assembled bases is complex — incompressible, high Shannon entropy — but unspecified. It does not encode a functional protein. It does not match any independent requirement. Thermal noise, radioactive decay, and Brownian motion produce this kind of complexity. Natural processes generate it routinely.

Specified complexity. The beta-globin gene, which encodes one of the two chain types of hemoglobin, is complex: its 444-nucleotide coding sequence (147 codons plus a stop codon) cannot be reduced to a simple formula. And it is specified: together with its alpha-globin partner it yields a protein that folds into a precise three-dimensional structure, binds oxygen with specific affinity (P50 = 26.6 mmHg), releases it cooperatively under pH-dependent allosteric control, and interfaces with dozens of other molecular systems. The sequence is simultaneously improbable and functional. In every observed domain of human experience (language, software, engineering blueprints, communication protocols) this combination has one observed source: intelligent agency. No demonstrated natural process produces it.
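The first two categories can be told apart mechanically by compressibility. The sketch below uses Python's zlib as a rough stand-in for "reducible to a short formula"; this is an assumption of the illustration, since a general-purpose compressor only approximates that idea. It is not a measure of specification: a compressor cannot tell a functional sequence from a random one of similar statistics.

```python
import random
import zlib

def compressed_size(sequence):
    """Bytes needed to store the sequence after zlib compression at level 9."""
    return len(zlib.compress(sequence.encode("ascii"), 9))

LENGTH = 10_000
rng = random.Random(0)  # fixed seed so the example is repeatable

repetitive = "AT" * (LENGTH // 2)
random_seq = "".join(rng.choice("ACGT") for _ in range(LENGTH))

print("repetitive:", compressed_size(repetitive), "bytes")
print("random:    ", compressed_size(random_seq), "bytes")
```

The repetitive sequence shrinks to a tiny fraction of its length, while the random one cannot fall much below the 2 bits per base implied by its four equiprobable symbols. Deciding whether an incompressible sequence is also specified requires an independent functional test, which is the point of the third category.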

The quantitative argument. Consider a single gene encoding a modestly sized protein of 300 amino acids. Each position can hold any of 20 amino acids. The total sequence space is 20^300, approximately 10^390. What fraction of that space encodes a stable, functional protein?
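The size of that sequence space can be verified with a logarithm, or exactly with Python's arbitrary-precision integers. The variable names below are illustrative.

```python
import math

LENGTH = 300      # amino acids in the protein
ALPHABET = 20     # possible amino acids at each position

log10_space = LENGTH * math.log10(ALPHABET)
exact_space = ALPHABET ** LENGTH          # exact integer value of 20^300
digits = len(str(exact_space))

print(f"20^300 is about 10^{log10_space:.1f} ({digits} decimal digits)")
# prints: 20^300 is about 10^390.3 (391 decimal digits)
```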

Estimates vary across the literature and the range should be stated honestly. Doug Axe, working with beta-lactamase variants, estimated roughly 1 in 10^77 sequences in a local region of sequence space retained function. This is the most restrictive published estimate. More recent work by Thornton, Gaucher, and others exploring protein evolution through ancestral reconstruction and directed evolution experiments suggests functional sequences may be more connected in sequence space than Axe's estimate implies — with neutral networks and promiscuous functions providing navigable pathways between functional islands. Their estimates of functional fraction are more generous, ranging from 1 in 10^20 to 1 in 10^40 for individual folds.

The range matters, but the conclusion does not change across it. Even at the most generous estimate (1 in 10^20), producing a single functional 300-amino-acid protein by random search requires on the order of 10^20 independent trials. The total number of molecular events that have occurred in the observable universe since its origin — every particle, every interaction, every Planck time for 13.8 billion years — is estimated at approximately 10^139. One protein is feasible under this budget at the generous end. But the genome does not contain one protein. It contains 20,000 protein-coding genes, approximately 400,000 regulatory elements, and a splicing system that generates over 100,000 distinct protein variants — all of which must be mutually compatible and functionally coordinated. The system-level improbability is not the sum but the product of the individual improbabilities, because the components must work together.
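The product rule in the paragraph above is easiest to follow in logarithms, where multiplying probabilities becomes adding exponents. The sketch below uses only figures quoted in the text and makes one assumption that should be stated loudly: each component is treated as an independent random draw. That is the random-search model the argument itself adopts; it is not a model of stepwise or selective processes. The function and constant names are illustrative.

```python
# All quantities are base-10 exponents, so products become sums.
LOG10_EVENT_BUDGET = 139           # event count quoted in the text
LOG10_FRACTION_GENEROUS = -20      # most generous functional fraction quoted
LOG10_FRACTION_RESTRICTIVE = -77   # most restrictive functional fraction quoted
GENES = 20_000

def log10_joint_probability(log10_fraction, n_components):
    """Exponent of the joint probability of n independent components."""
    return log10_fraction * n_components

def max_components_within_budget(log10_fraction, log10_budget):
    """Largest n whose joint improbability stays within the event budget."""
    return log10_budget // -log10_fraction

print(log10_joint_probability(LOG10_FRACTION_GENEROUS, GENES))
print(max_components_within_budget(LOG10_FRACTION_GENEROUS, LOG10_EVENT_BUDGET))
print(max_components_within_budget(LOG10_FRACTION_RESTRICTIVE, LOG10_EVENT_BUDGET))
```

Under the independence assumption, the generous estimate gives a joint exponent of -400,000 for 20,000 genes, and the quoted budget covers about six simultaneous components at the generous end and one at the restrictive end. The result is only as strong as the independence assumption behind it.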

This is not an argument from incredulity. It is an argument from combinatorics. The search space is measured. The target fraction is estimated (with acknowledged uncertainty). The probabilistic resources are calculated. At every published estimate of functional fraction, the coordinated system exceeds what undirected search can produce.

The direction of information flow. There is a second principle from information theory that bears directly on the genome. Information degrades under transmission and copying. This is a consequence of the second law of thermodynamics applied to information systems: errors accumulate, signal degrades, entropy increases. Shannon proved that reliable communication over a noisy channel requires error-correcting codes — and the genome has them (codon degeneracy, mismatch repair, proofreading polymerases, double-strand break repair). But even with error correction, the direction of information flow without intelligent input is always downhill. Copies are worse than originals. Mutations degrade information content. Selection can preserve functional sequences against degradation, but it cannot generate new specified complexity — it can only choose among variants that already exist.

The observed mutational load in every studied genome is consistent with this directional prediction: all populations carry a burden of mildly deleterious mutations that accumulate faster than selection can remove them, a phenomenon documented by Kondrashov, Lynch, and others. The genome is degrading from a delivered state. The information is running downhill.

The Regulatory Architecture

A program consisting only of subroutines, with no control logic to determine which subroutines run, when, and in what order, is not a program. It is a library — useful only if something else decides what to call and when. The difference between a library and a program is the control logic.

The genome has extensive control logic. In fact, the majority of functional DNA is regulatory rather than protein-coding. Only approximately 1.5% of the human genome directly encodes proteins. The remaining 98.5% was once dismissed as "junk DNA" — non-functional evolutionary debris accumulated over millions of years. This assessment is being revised, though the extent of revision is actively debated.

The ENCODE project (Encyclopedia of DNA Elements), published in 2012, reported that approximately 80% of the genome shows biochemical activity — transcription, protein binding, or chromatin modification. This claim generated significant scientific pushback. Critics (notably Dan Graur and others) argued that biochemical activity is not the same as biological function — a transcription factor may bind a DNA sequence non-specifically, or a region may be transcribed at negligible levels, without either event carrying functional significance. Current estimates of the functionally constrained fraction of the genome range from approximately 8-10% (based on evolutionary conservation) to 20-40% (based on broader definitions of regulatory function), with the true answer likely dependent on how "function" is defined.

For the purposes of this paper, the debate does not need to be resolved. What is not debated is the existence, complexity, and functional importance of the regulatory architecture itself. The following elements are well-established, independently characterized, and functionally demonstrated:

Promoters are the on switches. A promoter is a DNA sequence immediately upstream of a gene (before its start) that serves as the binding site for RNA polymerase, the enzyme that copies DNA into mRNA. Without a functional promoter, the gene cannot be read. The promoter determines whether a gene is expressed. In computing terms, the promoter is the function call: the instruction that tells the processor to load and execute a specific subroutine.

Enhancers and silencers are conditional modifiers. An enhancer is a DNA sequence — sometimes located thousands of base pairs away from the gene it regulates — that increases the rate of transcription when activated. A silencer does the opposite. These elements respond to specific proteins called transcription factors, which bind to them based on signals from the cell's environment. An enhancer activated by a heat-sensitive transcription factor, for example, will upregulate its target gene when the cell detects elevated temperature — executing a pre-loaded subroutine in response to an environmental input. This is conditional execution: if temperature > threshold, then activate gene X. The logic is explicit, the trigger is environmental, and the response is pre-programmed.
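The conditional described above can be written out literally. The sketch below is a toy model: the threshold, the fold change, and the rate units are hypothetical values chosen for illustration, and real enhancer responses are graded, not a single on/off step.

```python
def transcription_rate(promoter_functional, basal_rate, temperature_c,
                       threshold_c=40.0, enhancer_fold=8.0):
    """Return a transcription rate in arbitrary units.

    threshold_c and enhancer_fold are hypothetical illustration values.
    """
    if not promoter_functional:
        return 0.0                          # no promoter, the gene cannot be read
    if temperature_c > threshold_c:
        return basal_rate * enhancer_fold   # heat-sensitive enhancer is active
    return basal_rate                       # enhancer inactive, basal expression

for temperature in (37.0, 43.0):
    print(temperature, transcription_rate(True, 1.0, temperature))
```

The promoter check comes first because it gates everything else: an active enhancer cannot raise the output of a gene that cannot be read.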

Transcription factors are the variables in the control logic. A transcription factor is a protein that binds to specific DNA sequences (enhancers, silencers, or promoters) and modulates gene expression. The human genome encodes approximately 1,600 transcription factors. Each one recognizes a specific DNA motif. Many respond to environmental signals — hormone levels, nutrient availability, oxygen concentration, mechanical stress, temperature, light exposure. The transcription factor network constitutes a massively parallel conditional execution system: thousands of environmental variables simultaneously modulating thousands of genes through a web of regulatory interactions.

Splicing is runtime code editing. After a gene is transcribed into pre-mRNA, the transcript is edited before translation. Segments called introns are removed, and the remaining segments — exons — are joined together to form the mature mRNA. But the splicing is not fixed. The same gene can be spliced in different ways, producing different proteins from the same DNA sequence. This is called alternative splicing, and it occurs in approximately 95% of human multi-exon genes. A single gene can produce dozens of different protein variants depending on which exons are included.

In computing terms, this is polymorphism in the software sense (not the genetic sense of sequence variation between individuals): a single function name that produces different outputs depending on the calling context. The code is not rewritten. It is read differently based on the cell's state. The instructions for all possible outputs are present in the same gene. The decision about which output to produce is made at runtime, not at the source-code level.
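A small sketch makes the point concrete: one unchanged transcript, two outputs, selected by cell state. The segment sequences, the cell-state names, and the exon-skipping rules are all hypothetical. Sequences are written in DNA letters for consistency with the rest of this paper; real mRNA uses U in place of T.

```python
# One pre-mRNA, as an ordered list of (kind, name, sequence) segments.
PRE_MRNA = [
    ("exon", "E1", "ATGGCA"),
    ("intron", "I1", "GTAAGTCTAG"),
    ("exon", "E2", "TTAGGC"),
    ("intron", "I2", "GTGAGTTCAG"),
    ("exon", "E3", "CCGTAA"),
]

# Hypothetical cell states and the exons each one skips.
SKIPPED_EXONS = {"state_a": set(), "state_b": {"E2"}}

def splice(segments, skipped):
    """Remove all introns and any skipped exons, then join what remains."""
    return "".join(
        seq for kind, name, seq in segments
        if kind == "exon" and name not in skipped
    )

for state, skipped in SKIPPED_EXONS.items():
    print(state, splice(PRE_MRNA, skipped))
```

PRE_MRNA is never modified. The variation lives entirely in which segments are kept at read time.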

Non-coding RNAs are regulatory scripts. The genome produces thousands of RNA molecules that are never translated into protein. These include microRNAs (miRNAs), long non-coding RNAs (lncRNAs), and small interfering RNAs (siRNAs). Their functions are regulatory: miRNAs and siRNAs bind to mRNA transcripts and either mark them for degradation (preventing translation) or modulate their stability and translation rate, while many lncRNAs act earlier, at the level of chromatin and transcription. In computing terms, these are runtime scripts: small executable modules that monitor and adjust the output of other programs without modifying the underlying source code.
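Target recognition by a microRNA rests on base pairing between a short "seed" region of the miRNA (nucleotides 2 through 8) and a complementary site on the mRNA. The sketch below finds such sites by exact matching. It is a simplification under stated assumptions: the sequences are illustrative, the function names are invented for this example, and real targeting also depends on site context and tolerates imperfect pairing.

```python
RNA_PAIR = {"A": "U", "U": "A", "C": "G", "G": "C"}

def reverse_complement(rna):
    """Return the strand that would pair with `rna`, read 5'->3'."""
    return "".join(RNA_PAIR[base] for base in reversed(rna))

def seed_sites(mirna, mrna):
    """Return 0-based mRNA positions that pair exactly with the miRNA seed."""
    seed = mirna[1:8]                    # nucleotides 2-8 of the miRNA
    site = reverse_complement(seed)      # what the mRNA must contain to pair
    return [i for i in range(len(mrna) - len(site) + 1)
            if mrna.startswith(site, i)]

mirna = "UGAGGUAGUAGGUUGUAUAGUU"         # illustrative miRNA sequence
target = "AAGCUACCUCAUUUGC"              # illustrative mRNA fragment with a site
non_target = "AAGCUAGGUCAUUUGC"          # same fragment with the site disrupted

print(seed_sites(mirna, target), seed_sites(mirna, non_target))
# prints: [3] []
```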

Origin of This Architecture

The genome's computational architecture — its encoding system, executable code, regulatory logic, and information content — is not in dispute. How such a system came into existence, however, remains an open question in science.

The prevailing view holds that this architecture arose through undirected chemical and evolutionary processes over deep time. The proposition developed in this paper offers a different explanation: the observed features are best accounted for by an intelligent engineering process capable of producing highly integrated, specified, and functionally coherent information systems.

This paper does not aim to refute alternative explanations. Its goal is more modest: to describe the genome's architecture as precisely as possible using the language and concepts of information processing, then test how well the engineering proposition coheres with the observable facts — from the structure of the code to its regulatory sophistication and directional information flow. Subsequent parts examine the physical execution machinery, runtime dynamics, and bootstrap dependencies to further assess the strength of this framework.

The reader is invited to weigh the proposition on its merits.

The Architecture Is Not an Analogy

The preceding sections describe an information processing system with an encoding alphabet, a universal instruction set, callable functional units, a regulatory architecture incorporating conditional execution and polymorphism, an information content whose specified complexity challenges natural probabilistic resources at every published estimate, and a documented direction of information flow that is consistently downhill.

These comparisons — alphabet, code, subroutines, control logic, conditional execution, polymorphism, runtime scripts — are not metaphors imposed on biology by an outside observer. They are structural descriptions of what the genome demonstrably does, using the most precise language available. The genome does not merely resemble an information processing system. It is one, operating on principles that are functionally identical to those used in human-engineered computing systems.

The difference is scale, integration, and information density. A modern operating system kernel contains more than 30 million lines of code (Linux kernel as of 2023). The human genome has 3.2 billion base pairs organized into roughly 20,000 protein-coding genes, regulated by hundreds of thousands of regulatory elements and about 1,600 transcription factors, with alternative splicing across roughly 95% of multi-exon genes. It is an information system of a qualitatively different order: not merely larger, but more deeply integrated, more conditionally responsive, and more error-tolerant than any system human engineering has produced.

Information theory adds the quantitative spine to this structural observation. The genome contains specified complexity at a level that challenges natural probabilistic resources at every published estimate. Its information content is degrading, not improving, consistent with a system subject to ongoing entropy. Its error-correction mechanisms slow the degradation but cannot reverse it and are themselves encoded in the system they protect — a circularity examined in Part 4.

Part 2 examines the physical machinery that reads and executes the code. Part 3 examines the runtime environment that determines which code is executed under which conditions. Part 4 examines the bootstrap problem — the circular dependency between the code and the machinery it specifies. Part 5 connects these observations to the broader framework of the project.

The reader is invited to follow the evidence and draw conclusions.

