Chapter 3: Key aspects of how modern general AI systems are made

Most of the world's cutting-edge AI systems are built using surprisingly similar methods. Here are the basics.

To really understand a human you need to know something about biology, evolution, child-rearing, and more; to understand an AI system you likewise need to know how it is made. Over the past five years, AI systems have grown tremendously in both capability and complexity. A key enabling factor has been the availability of very large amounts of computation (or, colloquially, “compute” when applied to AI).

The numbers are staggering. About 10²⁵ – 10²⁶ “floating-point operations” (FLOP)¹⁶ are used in the training of models like the GPT series, Claude, Gemini, etc.¹⁷ (For comparison, if every human on Earth worked non-stop doing one calculation every five seconds, it would take around a billion years to accomplish this.) This huge amount of computation enables the training of models with up to trillions of model weights on terabytes of data – a large fraction of all of the quality text that has ever been written, alongside large libraries of sounds, images, and video. Complemented by extensive further training that reinforces human preferences and good task performance, models trained in this way exhibit human-competitive performance across a significant span of basic intellectual tasks, including reasoning and problem solving.
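The billion-years comparison can be checked with a quick back-of-the-envelope calculation. The population figure and the number of seconds per year below are assumptions not stated in the text:

```python
# Rough check: how long would ~10^25-10^26 FLOP take if every human on
# Earth did one calculation every five seconds, non-stop?
population = 8e9                 # approximate world population (assumption)
ops_per_person_per_sec = 1 / 5   # one calculation every five seconds
seconds_per_year = 3.15e7

def years_to_compute(total_flop):
    """Years for all of humanity to perform `total_flop` operations."""
    return total_flop / (population * ops_per_person_per_sec * seconds_per_year)

print(f"{years_to_compute(1e25):.1e} years")  # ~2e8 years
print(f"{years_to_compute(1e26):.1e} years")  # ~2e9 years
```

The result lands between a few hundred million and a few billion years across the stated training-compute range, consistent with the “around a billion years” figure.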

We also know (very, very roughly) how much computation speed, in operations per second, is sufficient for the inference speed¹⁸ of such a system to match the speed of human text processing. It is about 10¹⁵ – 10¹⁶ FLOP per second.¹⁹
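This estimate follows from per-token arithmetic; a minimal sketch, using the per-token and tokens-per-second figures given in the footnotes:

```python
# Estimate of the inference speed needed to match human text processing:
# FLOP per generated token, times tokens generated per second. Figures
# are the ones quoted in the footnotes (original GPT-4, ~560 TFLOP/token).
flop_per_token = 5.6e14   # ~560 TFLOP per token
tokens_per_second = 7     # rough pace of human thought in text

required_flop_per_sec = flop_per_token * tokens_per_second
print(f"{required_flop_per_sec:.1e} FLOP/s")  # ~3.9e15, within 10^15-10^16
```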

While powerful, these models are by their nature limited in key ways, quite analogous to how an individual human would be limited if forced to simply output text at a fixed rate of words per minute, without stopping to think or using any additional tools. More recent AI systems address these limitations through a more complex process and architecture combining several key elements:

  • One or more neural networks, with one model providing the core cognitive capacity, and up to several others performing other, narrower tasks;
  • Tooling provided to and usable by the model – for example, the ability to search the web, create or edit documents, execute programs, etc.;
  • Scaffolding that connects the inputs and outputs of neural networks. A very simple scaffold might just allow two “instances” of an AI model to converse with each other, or one to check the work of another;²⁰
  • Chain-of-thought and related prompting techniques, which do something similar, causing a model to, for example, generate many approaches to a problem, then process those approaches into an aggregate answer;
  • Retraining models to make better use of tools, scaffolding, and chain-of-thought.
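The scaffolding idea in the list above can be sketched in a few lines. `query_model` is a hypothetical stand-in for any chat-model API call; here it returns canned text so the sketch is runnable:

```python
# A minimal "scaffold": one model instance drafts an answer and a second
# instance checks the first one's work. `query_model` is a hypothetical
# placeholder for a real LLM API call.
def query_model(prompt: str) -> str:
    if prompt.startswith("Check"):
        return "The draft looks correct. Final answer: 4"
    return "Draft answer: 2 + 2 = 4"

def draft_and_check(question: str) -> str:
    draft = query_model(f"Answer this question: {question}")
    review = query_model(
        "Check this proposed answer for errors and give a final answer.\n"
        f"Question: {question}\nProposed answer: {draft}"
    )
    return review

print(draft_and_check("What is 2 + 2?"))
```

A real scaffold would route these calls to one or more actual models, but the structure – plain code passing text between model instances – is the same.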

Because these extensions can be very powerful (and include AI systems themselves), these composite systems can be quite sophisticated and dramatically enhance AI capabilities.²¹ And recently, techniques in scaffolding and especially chain-of-thought prompting (and folding the results back into retraining models to use these better) have been developed and employed in o1, o3, and DeepSeek R1 to do many passes of inference in response to a given query.²² This in effect allows the model to “think about” its response, and dramatically boosts these models’ ability to do high-caliber reasoning in science, math, and programming tasks.²³
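The “many passes of inference” pattern can be illustrated with the simplest possible aggregation scheme: majority voting over several sampled answers. A real system would draw stochastic samples from a model; here a fixed toy list stands in for the samples:

```python
from collections import Counter

# Sketch of aggregating many inference passes: generate several candidate
# answers to the same query, then take a majority vote. The canned list
# below is a toy stand-in for stochastic model samples.
def sample_answer(question: str, i: int) -> str:
    canned = ["42", "42", "41", "42", "43"]
    return canned[i % len(canned)]

def majority_vote(question: str, n_samples: int = 10) -> str:
    answers = [sample_answer(question, i) for i in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(majority_vote("What is 6 * 7?"))  # "42" wins the vote
```

Spending more inference compute here just means raising `n_samples`; more elaborate schemes check, rank, or synthesize the candidates rather than merely counting them.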

For a given AI architecture, increases in training computation can be reliably translated into improvements in a set of clearly-defined metrics. For less crisply defined general capabilities (such as those discussed below), the translation is less clear and predictive, but it is almost certain that larger models with more training computation will have new and better capabilities, even if it is hard to predict what those will be.
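The “clearly-defined metrics” mentioned above, such as training loss, are typically observed to follow smooth power laws in training compute. A toy illustration follows; all constants are invented for the sketch and not fit to any real model – only the shape (a smooth, predictable decline) matters:

```python
# Illustrative power-law scaling: loss(C) = a * C**(-alpha) + b, where C
# is training compute in FLOP. Constants are made up for illustration.
def predicted_loss(compute_flop: float, a: float = 10.0,
                   alpha: float = 0.05, b: float = 1.7) -> float:
    return a * compute_flop ** (-alpha) + b

for c in (1e22, 1e24, 1e26):
    print(f"{c:.0e} FLOP -> predicted loss {predicted_loss(c):.2f}")
```

The smoothness of such curves is what makes metric improvements predictable, even while the emergence of specific new capabilities remains hard to forecast.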

Similarly, composite systems and especially advances in “chain of thought” (and training of models that work well with it) have unlocked scaling in inference computation: for a given trained core model, at least some AI system capabilities increase as more computation is applied that allows them to “think harder and longer” about complex problems. This comes at a steep cost in computing speed, requiring hundreds or thousands of times more FLOP/s to match human performance.²⁴

While only a part of what is leading to rapid AI progress,²⁵ the role of computation and the possibility of composite systems will prove crucial to both preventing uncontrollable AGI and developing safer alternatives.


  16. 10²⁵ means 1 followed by 25 zeros, or ten trillion trillion. A FLOP is just an arithmetic addition or multiplication of numbers with some precision. Note that AI hardware performance can vary by a factor of ten or more depending upon the precision of the arithmetic and the architecture of the computer. Counting logic-gate operations (ANDs, ORs, and NOTs) would be more fundamental, but these are not commonly available or benchmarked; for present purposes it is useful to standardize on 16-bit operations (FP16), though appropriate conversion factors should be established.↩︎
  17. A collection of estimates and hard data is available from Epoch AI and indicates about 2 × 10²⁵ 16-bit FLOP for GPT-4; this roughly matches numbers that were leaked for GPT-4. Estimates for other mid-2024 models are all within a factor of a few of GPT-4.↩︎
  18. Inference is simply the process of generating an output from a neural network. Training can be considered a succession of many inferences and model-weight tweaks.↩︎
  19. For text production, the original GPT-4 required 560 TFLOP per token generated. Around 7 tokens/s is needed to keep up with human thought, so this gives ≈ 3 × 10¹⁵ FLOP/s. But efficiencies have driven this down; this NVIDIA brochure, for example, indicates as little as 3 × 10¹⁴ FLOP/s for a comparably-performing Llama 405B model.↩︎
  20. As a slightly more complex example, an AI system might first generate several possible solutions to a math problem, then use another instance to check each solution, and finally use a third to synthesize the results into a clear explanation. This allows for more thorough and reliable problem-solving than a single pass.↩︎
  21. See, for example, details on OpenAI’s “Operator”, Claude’s tool capabilities, and AutoGPT. OpenAI’s Deep Research probably has a quite sophisticated architecture, but details are not available.↩︎
  22. DeepSeek R1 relies on iteratively training and prompting the model so that the final trained model creates extensive chain-of-thought reasoning. Architectural details are not available for o1 or o3; however, DeepSeek has revealed that no particular “special sauce” is required to unlock capability scaling with inference. Despite receiving a great deal of press as upending the “status quo” in AI, R1 does not affect the core claims of this essay.↩︎
  23. These models significantly outperform standard models on reasoning benchmarks. For instance, on the GPQA Diamond benchmark – a rigorous test of PhD-level science questions – GPT-4o scored 56%, while o1 and o3 achieved 78% and 88%, respectively, far exceeding the 70% average score of human experts.↩︎
  24. OpenAI’s o3 probably expended ∼ 10²¹ – 10²² FLOP to complete each of the ARC-AGI challenge questions, which competent humans can do in (say) 10–100 seconds, giving a figure more like ∼ 10²⁰ FLOP/s.↩︎
  25. While computation is a key measure of AI system capability, it interacts with both data quality and algorithmic improvements. Better data or algorithms can reduce computational requirements, while more computation can sometimes compensate for weaker data or algorithms.↩︎

Please submit feedback and corrections to taylor@futureoflife.org
Keep The Future Human
Learn how we can keep the future human and deliver the extraordinary benefits of AI – without the unacceptable risk.
by Anthony Aguirre
© Future of Life Institute 2025