Jump to content

User:Gabe.ametrano/AI alignment

From Wikipedia, the free encyclopedia

AI Alignment

[edit]

In the field of artificial intelligence (AI), AI alignment research aims to align AI systems with humans' intended goals, preferences, or ethical principles. An AI system is considered aligned if it achieves its intended objectives., but not the intended ones.

It can be challenging for AI designers to align an AI system because it can be difficult for them to specify the full range of desired and undesired behavior. To address this challenge, designers often rely on simpler proxy goals, like gaining human approval, which may, however, create loopholes, overlook necessary constraints, or reward the AI system for merely appearing aligned.

Misaligned AI systems can malfunction or cause harm. AI systems may find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful ways (reward hacking).They may also develop unwanted instrumental strategies, such as seeking power or survival, because such strategies help them achieve their final given goals. Moreover, AI systems may develop unforeseen emergent goals that are challenging to detect before deployment when encountering new situations and data distributions.

Currently, these challenges impact commercial systems, including language models, robots, autonomous vehicles, and social media recommendation engines. Certain AI researchers posit that the challenges will become more pronounced in future systems, given their increased capabilities..

Many leading AI scientists, including Geoffrey Hinton and Stuart Russell, argue that AI is approaching human-like (AGI) and superhuman cognitive capabilities (ASI), posing a potential threat to human civilization if misaligned.

AI alignment is a subfield of AI safety, the study of how to build safe AI systems. Other subfields of AI safety include robustness, monitoring, and capability control. Research challenges in alignment include instilling complex values in AI, avoiding deceptive AI, scalable oversight, auditing and interpreting AI models, and preventing emergent AI behaviors like power-seeking. Alignment research has connections to interpretability research, (adversarial) robustness, anomaly detection, calibrated uncertainty, formal verification, preference learning, safety-critical engineering, game theory, algorithmic fairness, and the social sciences.