User:Gabe.ametrano/AI alignment
AI Alignment
In the field of artificial intelligence (AI), AI alignment research aims to align AI systems with humans' intended goals, preferences, or ethical principles. An AI system is considered aligned if it advances its intended objectives; a misaligned system may pursue some objectives, but not the intended ones.
It can be challenging for AI designers to align an AI system because it is difficult to specify the full range of desired and undesired behavior. To address this challenge, designers often rely on simpler proxy goals, such as gaining human approval. However, proxy goals can create loopholes, overlook necessary constraints, or reward the AI system for merely appearing aligned.
Misaligned AI systems can malfunction or cause harm. AI systems may find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful ways (reward hacking). They may also develop unwanted instrumental strategies, such as seeking power or survival, because such strategies help them achieve their given final goals. Moreover, when AI systems encounter new situations and data distributions, they may develop unforeseen emergent goals that are difficult to detect before deployment.
Currently, these challenges affect commercial systems including language models, robots, autonomous vehicles, and social media recommendation engines. Some AI researchers argue that these challenges will become more pronounced in future systems as their capabilities increase.
Many leading AI scientists, including Geoffrey Hinton and Stuart Russell, argue that AI is approaching human-level cognitive capabilities (artificial general intelligence, AGI) and may eventually surpass them (artificial superintelligence, ASI), and that such systems could pose a threat to human civilization if misaligned.
AI alignment is a subfield of AI safety, the study of how to build safe AI systems. Other subfields of AI safety include robustness, monitoring, and capability control. Research challenges in alignment include instilling complex values in AI, avoiding deceptive AI, scalable oversight, auditing and interpreting AI models, and preventing emergent AI behaviors like power-seeking. Alignment research has connections to interpretability research, (adversarial) robustness, anomaly detection, calibrated uncertainty, formal verification, preference learning, safety-critical engineering, game theory, algorithmic fairness, and the social sciences.