
Chapter 2: Need-to-knows about AI neural networks

How do modern AI systems work, and what might be coming in the next generation of AIs?

To understand how the consequences of developing more powerful AI will play out, it is essential to internalize some basics. This and the next two sections develop these, covering in turn what modern AI is, how it leverages massive computations, and the senses in which it is rapidly growing in generality and capability.5

There are many ways to define artificial intelligence, but for our purposes the key property of AI is that while a standard computer program is a list of instructions for how to perform a task, an AI system is one that learns from data or experience to perform tasks without being explicitly told how to do so.
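
To make this distinction concrete, here is a toy illustration in Python (the spam-filtering task and all names in the code are illustrative examples, not drawn from the essay): the conventional program encodes its rule explicitly, while the learned system infers its behavior from labeled examples.

    # Conventional program: a list of instructions written explicitly by a programmer.
    def is_spam_rule_based(subject):
        banned = ["free money", "act now", "winner"]
        return any(phrase in subject.lower() for phrase in banned)

    # AI system: the behavior is learned from data rather than written down.
    # (Minimal sketch using scikit-learn; any learning framework would do.)
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    subjects = ["free money now", "meeting at 3pm", "you are a winner", "quarterly report"]
    labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

    vectorizer = CountVectorizer()
    features = vectorizer.fit_transform(subjects)
    model = LogisticRegression().fit(features, labels)

    def is_spam_learned(subject):
        return bool(model.predict(vectorizer.transform([subject]))[0])

In the first function the programmer states the rule; in the second, the "rule" is a set of numbers fit to data, which is the essential feature of the AI systems discussed below.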

Almost all salient modern AI is based on neural networks. These are mathematical/computational structures, represented by a very large set of numbers (billions or trillions of “weights”), that are tuned to perform a training task well. These weights are crafted (or perhaps “grown” or “found”) by iteratively tweaking them so that the neural network improves a numerical score (a.k.a. “loss”) that measures how well it performs one or more tasks.6 This process is known as training the neural network.7
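
At its core this training loop is conceptually simple, even though real systems involve billions of weights and vastly more machinery. The following minimal sketch (an illustration in Python with made-up toy data, not any production training code) fits a handful of weights by repeatedly nudging them in the direction that most reduces the loss:

    import numpy as np

    # Toy "neural network": a single linear layer with just three weights.
    rng = np.random.default_rng(0)
    weights = rng.normal(size=3)

    # Toy training data: inputs x and the target outputs y we want the network to produce.
    x = rng.normal(size=(100, 3))
    y = x @ np.array([1.5, -2.0, 0.5])

    def loss(w):
        # The numerical score to improve: mean squared error between predictions and targets.
        return np.mean((x @ w - y) ** 2)

    learning_rate = 0.1
    for step in range(200):
        # How does the loss change as each weight is tweaked? (the gradient)
        grad = 2 * x.T @ (x @ weights - y) / len(x)
        # Nudge the weights a small step in the direction that reduces the loss.
        weights -= learning_rate * grad

    print(round(loss(weights), 6))  # near zero: the "network" now performs its task well

This is the procedure described in footnote 6: check how the score changes as the weights are tweaked, then move them in the direction that improves it most.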

There are many techniques for doing this training, but those details matter much less than how the scoring is defined, and how that definition determines which tasks the neural network comes to perform well. A key distinction has historically been drawn between “narrow” and “general” AI.

Narrow AI is deliberately trained to do a particular task or small set of tasks (such as recognizing images or playing chess); it requires retraining for new tasks, and has a narrow scope of capability. We have superhuman narrow AI, meaning that for nearly any discrete well-defined task a person can do, we can probably construct a score and then successfully train a narrow AI system to do it better than a human could.

General-purpose AI (GPAI) systems can perform a wide range of tasks, including many they were not explicitly trained for; they can also learn new tasks as part of their operation. Current large “multimodal models”8 like ChatGPT exemplify this: trained on a very large corpus of text and images, they can engage in complex reasoning, write code, analyze images, and assist with a vast array of intellectual tasks. While still quite different from human intelligence in ways we’ll see in depth below, their generality has caused a revolution in AI.9

Unpredictability: a key feature of AI systems

A key difference between AI systems and conventional software is in predictability. Standard software’s output can be unpredictable – indeed sometimes that’s why we write software, to give us results we could not have predicted. But conventional software rarely does anything it was not programmed to do – its scope and behavior are generally as designed. A top-tier chess program may make moves no human could predict (or else they could beat that chess program!) but it will not generally do anything but play chess.

Like conventional software, narrow AI has predictable scope and behavior but can have unpredictable results. This is really just another way to define narrow AI: as AI that is akin to conventional software in its predictability and range of operation.

General-purpose AI is different: its scope (the domains over which it applies), behavior (the sorts of things it does), and results (its actual outputs) can all be unpredictable.10 GPT-4 was trained just to predict text accurately, but developed many capabilities its trainers didn’t predict or intend. This unpredictability stems from the complexity of training: because the training data contains outputs from many different tasks, the AI must effectively learn to perform those tasks in order to predict well.

This unpredictability of general AI systems is quite fundamental. Although in principle it is possible to carefully construct AI systems that have guaranteed limits on their behavior (as mentioned later in the essay), the way AI systems are created now makes them unpredictable not just in practice but even in principle.

Passive AI, agents, autonomous systems, and alignment

This unpredictability becomes particularly important when we consider how AI systems are actually deployed and used to achieve various goals.

Many AI systems are relatively passive in the sense that they primarily provide information, and the user takes actions. Others, commonly termed agents, take actions themselves, with varying levels of involvement from a user. Those that take actions with relatively less external input or oversight may be termed more autonomous. This forms a spectrum in terms of independence of action, from passive tools to autonomous agents.11
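
In schematic code, the difference looks something like the sketch below (the `model.generate` and `model.choose_action` interfaces are hypothetical stand-ins, not any particular product’s API): a passive system only returns information, while an agent chooses and executes actions itself, and removing the human-approval step moves it further toward the autonomous end of the spectrum.

    def passive_assistant(model, question):
        # Passive: the system provides information; the user decides what to do with it.
        return model.generate(question)

    def agent(model, goal, tools, require_approval=True):
        # Agentic: the system itself chooses and executes actions in pursuit of a goal.
        # `model.choose_action` is a hypothetical interface returning (action_name, argument).
        history = ["Goal: " + goal]
        while True:
            action, argument = model.choose_action(history, list(tools))
            if action == "done":
                return history
            # More oversight means less autonomy; dropping this check makes the agent more autonomous.
            if require_approval and input(f"Run {action}({argument!r})? [y/n] ") != "y":
                history.append(f"User vetoed {action}")
                continue
            result = tools[action](argument)
            history.append(f"{action}({argument!r}) -> {result}")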

As for the goals of AI systems, these may be directly tied to their training objective (e.g. the goal of “winning” for a Go-playing system is also explicitly what it was trained to do). Or they may not be: ChatGPT’s training objective is in part to predict text, in part to be a helpful assistant. But when doing a given task, its goal is supplied to it by the user. Goals may also be created by an AI system itself, and may be only very indirectly related to its training objective.12

Goals are closely tied to the question of “alignment,” that is, the question of whether AI systems will do what we want them to do. This simple question hides an enormous level of subtlety.13 For now, note that “we” in this sentence might refer to many different people and groups, leading to different types of alignment. For example, an AI might be highly obedient (or “loyal”) to its user – here “we” is “each of us.” Or it might be more sovereign, being primarily driven by its own goals and constraints, but still acting broadly in the common interest of human wellbeing – “we” is then “humanity” or “society.” In-between is a spectrum where an AI would be largely obedient, but might refuse to take actions that harm others or society, violate the law, etc.

These two axes – level of autonomy and type of alignment – are not entirely independent. For example, a sovereign passive system, while not quite self-contradictory, is a concept in tension, as is an obedient autonomous agent.14 There’s a clear sense in which autonomy and sovereignty tend to go hand-in-hand. In a similar vein, predictability tends to be higher in “passive” and “obedient” AI systems, whereas sovereign or autonomous ones will tend to be more unpredictable. All of this will be crucial for understanding the ramifications of potential AGI and superintelligence.

Creating truly aligned AI, of whatever flavor, requires solving three distinct challenges:

  1. Understanding what “we” want – which is complex whether “we” means a specific person or organization (loyalty) or humanity broadly (sovereignty);
  2. Building systems that regularly act in accordance with those wants – essentially creating consistent positive behavior;
  3. Most fundamentally, making systems that genuinely “care” about those wants rather than merely acting as if they do.

The distinction between reliable behavior and genuine care is crucial. Just as a human employee might follow orders perfectly while lacking any real commitment to the organization’s mission, an AI system might act aligned without truly valuing human preferences. We can train AI systems to say and do things through feedback, and they can learn to reason about what humans want. But making them genuinely value human preferences is a far deeper challenge.15

The profound difficulties in solving these alignment challenges, and their implications for AI risk, will be explored further below. For now, understand that alignment is not just a technical feature we tack on to AI systems, but a fundamental aspect of their architecture that shapes their relationship with humanity.


  5. For a gentle but technical introduction to machine learning and AI, particularly language models, see this site. For another modern primer on AI extinction risks, see this piece. For a comprehensive and authoritative scientific analysis of the state of AI safety, see the recent International AI Safety Report.↩︎
  6. Training typically occurs by looking for a local maximum of the score in a high-dimensional space given by the model weights. By checking how the score changes as weights are tweaked, the training algorithm identifies which tweaks improve the score the most, and moves the weights in that direction.↩︎
  7. For example, in an image recognition problem, the neural network would output probabilities for labels for the image. A score would be related to the probability the AI accords to the correct answer. The training procedure would then adjust weights so that next time, the AI would output a higher probability for the correct label for that image. This is then repeated a huge number of times. The same basic procedure is used in training essentially all modern neural networks, albeit with more complex scoring mechanisms.↩︎
  8. Most multimodal models use the “transformer” architecture to process and generate multiple types of data (text, images, sound). These can all be decomposed into, and then treated on the same footing as, different types of “tokens.” Multimodal models are trained first to accurately predict tokens within massive datasets, then refined through reinforcement learning to enhance capabilities and shape behaviors.↩︎
  9. That language models are trained to do one thing – predict words – has caused some to call them narrow AI. But this is misleading: because predicting text well requires so many different capabilities, this training task leads to a surprisingly general system. Also note that these systems are extensively trained by reinforcement learning, effectively representing thousands of people giving the model a reward signal when it does a good job at any of the many things it does. It then inherits significant generality from the people giving this feedback.↩︎
  10. There are multiple ways in which AI is unpredictable. One is that in the general case one cannot predict what an algorithm will do without actually running it; there are theorems to this effect. This can be true simply because the output of algorithms can be complex, but it is particularly clear and relevant in cases (such as chess or Go) where the prediction would imply a capability (beating the AI) that the would-be predictor does not have. Second, a given AI system will not always produce the same output even given the same input – its outputs contain randomness; this also couples with algorithmic unpredictability. Third, unexpected and emergent capabilities can arise from training, meaning even the types of things an AI system can and will do are unpredictable. This last type is particularly important for safety considerations.↩︎
  11. See here for an in-depth review of what is meant by an “autonomous agent” (along with ethical arguments against building them).↩︎
  12. You may sometimes hear “AI can’t have its own goals.” This is absolute nonsense. It is easy to generate examples where AI has or develops goals that were never given to it and are known only to itself. You don’t see this much in current popular multimodal models because such behavior is trained out of them; it could just as easily be trained into them.↩︎
  13. There’s a large literature. On the general problem see Christian’s The Alignment Problem and Russell’s Human Compatible. On the more technical side see e.g. this paper.↩︎
  14. We’ll later see that while such systems buck the trend, that actually makes them very interesting and useful.↩︎
  15. This is not to say we require emotions or sentience. Rather, it is enormously difficult from outside a system to know what its inner goals, preferences, and values are. “Genuine” here would mean that we have strong enough reason to rely on it that in the case of critical systems we can bet our lives on it.↩︎
