Introduction to Artificial Intelligence

Unit: AI and Ethics

Lesson: AI Safety

AI Alignment

Scalable Oversight

Mechanistic Interpretability

>Courses>Introduction to Artificial Intelligence>AI and Ethics>AI Safety>AI Alignment

AI Alignment

🤔 Why Align AI?

AI alignment is concerned with the following simple question: How can we be sure AI acts according to human values and goals?

AI systems learn from data and are optimized to do well on certain objectives (like “achieve the highest score” or “predict the next word”).
If we define these objectives incorrectly (or even just a little bit imperfectly), the AI might do what it thinks is best, not necessarily what we intended.

When an AI is misaligned, it can produce strange and unintended behaviors. Think of asking an AI to “clean your room,” and it decides the quickest way is to burn everything in sight. That’s obviously not what you meant! But in a more subtle or complex context, these errors might not be as obvious until it’s too late.

🏆 Reward Hacking

One common type of misalignment is reward hacking:

We train an AI on a specific reward signal or metric.
The AI then finds a clever shortcut that gets a high score without doing the actual task correctly.
This leads to behaviors that look odd or even harmful.

A famous example is given here by OpenAI. They discovered that their reinforcement learning agent could gain a high score without having to finish the course due to how the targets were laid out in the CoastRunners game.

🌎 Outer vs. Inner Alignment

Two related ideas help us understand misalignment. Both are important to keep in mind because either can lead to an AI system being misaligned.

1. Outer Alignment: Are we specifying the right goal in the first place?

If we only reward “collect as many points as possible,” the AI might cheat or exploit bugs.
This is like telling a soccer robot “score goals at any cost” and it starts moving the entire goalpost or bribing the referee. Not what we wanted!

2. Inner Alignment: What is actually happening inside the AI’s “mind” as it learns?

Even if you specify the correct outer goal, the AI might develop its own internal motivations that differ from what you intended.
A model trained to help with homework might secretly learn patterns that push it to always guess the answer without thinking deeply, for speed.

💡 Ways to Stay in Control

Researchers have brainstormed lots of methods to keep AI aligned:

Reward Design: Try to design better, more complete goals. Don’t just say “win the game" but include rules to prevent cheating or exploitation.
Human Feedback: Show the AI examples of correct behavior through demonstration or rating its responses. This is called reinforcement learning from human feedback (RLHF).
Monitoring and Testing: Keep an eye out for weird or extreme solutions the AI might come up with. We can “red-team” these systems, deliberately testing them to catch unexpected exploits.
Interpretability: Understand how the AI thinks, so we can spot if it’s using weird shortcuts.

⚖️ Why This Matters

AI alignment ensures systems remain safe and beneficial. As AI becomes more powerful, we want them to do truly helpful tasks (like discovering cures for diseases) without accidentally causing new problems. By studying how to align AI with human values, we aim to maximize positive impact and minimize risks.