
User:SoerenMind/sandbox/Alignment and control



Problem description


As in the existing article AI control problem.

Alignment


The main approach to preventing such problems from arising in superintelligent AIs is to ensure that their goals are aligned with human values, so that they do not pursue undesirable outcomes. However, experts do not currently know how to reliably develop AIs that possess specific abstract goals or values; ongoing research aims to address a range of open problems in this area.

The scope of alignment


Research on alignment varies in the scope of behaviour it aims to train AIs to achieve; OpenAI researcher Paul Christiano distinguishes two broad categories. Narrowly aligned AIs carry out tasks in accordance with the user's instrumental preferences,[1] without necessarily understanding the user's long-term goals. Narrow alignment can apply to AIs with general capabilities, but also to AIs that are specialised for individual tasks. For example, a narrowly aligned question-answering system would respond to questions truthfully without selecting its answers to manipulate humans or to bring about long-term effects.

By contrast, ambitious alignment involves encoding the correct or best scheme of human values into AIs that are able to act autonomously at a large scale, which requires addressing moral and political problems.[2] For example, in Human Compatible, Berkeley professor Stuart Russell proposes that AI systems be designed with the sole objective of maximizing the realization of human preferences.[3]: 173  The "preferences" Russell refers to "are all-encompassing; they cover everything you might care about, arbitrarily far into the future." AI ethics researcher Iason Gabriel argues that we should align AIs with “principles that would be supported by a global overlapping consensus of opinion, chosen behind a veil of ignorance and/or affirmed through democratic processes.”[2] Eliezer Yudkowsky of the Machine Intelligence Research Institute has proposed the goal of fulfilling humanity’s coherent extrapolated volition (CEV), roughly defined as the set of values which humanity would share at reflective equilibrium, i.e. after a long, idealised process of refinement.[4][2]

Specifications of AI goals


The goals of alignment can be phrased in terms of a distinction between three types of specification:[5]

  • ideal specification (the “wishes”), corresponding to the hypothetical (but hard to articulate) description of an ideal AI system that is fully aligned to the desires of the human operator;
  • design specification (the “blueprint”), corresponding to the specification that we actually use to build the AI system, e.g. the reward function that a reinforcement learning system maximises;
  • revealed specification (the “behaviour”), which is the specification that best describes what actually happens, e.g. the reward function we can reverse-engineer from observing the system’s behaviour using, say, inverse reinforcement learning. This is typically different from the one provided by the human operator because AI systems are not perfect optimisers or because of other unforeseen consequences of the design specification.

[ADD COAST RUNNERS GIF]

AI alignment researchers aim to ensure that the revealed specification matches the ideal specification, by creating the best design specification for building the AI. A mismatch between the ideal specification and the design specification is known as outer misalignment, whereas a mismatch between the design specification and the revealed specification is known as inner misalignment.[6][7] Outer misalignment might arise because of mistakes in specifying the objective function (design specification).[8] For example, a reinforcement learning agent trained on the game of CoastRunners learned to move in circles while repeatedly crashing, which got it a higher score than finishing the race (see animated figure).[9]
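The gap between a design specification and the behaviour it actually produces can be illustrated with a toy example (a hypothetical sketch, not the actual CoastRunners environment): if the reward function mostly counts points from a re-spawning pickup, with only a small bonus for finishing, then a policy that loops around the pickup scores higher than one that completes the course.

```python
# Toy illustration of a misspecified design specification (hypothetical, not the
# actual CoastRunners setup): the intended goal is to finish a 1-D "course", but
# the reward mainly counts a re-spawning pickup, so endlessly looping scores higher.

def run_episode(policy, steps=100):
    pos, score, finished = 0, 0, False
    for _ in range(steps):
        pos += policy(pos)          # policy returns a move of -1, 0, or +1
        if pos == 3:
            score += 1              # design specification: +1 per pickup visit
        if pos >= 10:
            score += 5              # small bonus for finishing the course
            finished = True
            break
    return score, finished

race_policy = lambda pos: 1                      # head straight for the finish line
loop_policy = lambda pos: 1 if pos < 3 else -1   # circle around the pickup forever

print(run_episode(race_policy))   # (6, True): finishes, but with a modest score
print(run_episode(loop_policy))   # (49, False): never finishes, yet scores far higher
```

Here the revealed specification (circling the pickup) diverges from the ideal specification (finishing the course) because the design specification rewards the wrong proxy.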

Inner misalignment arises when the agent pursues a goal that is aligned with the design specification on the training data but not elsewhere.[6][7][10] This type of misalignment is often compared to human evolution: evolution selected for genetic fitness (design specification) in our ancestral environment, but in the modern environment human goals (revealed specification) are not aligned with maximizing genetic fitness. For example, our taste for sugary food, which originally increased fitness, today leads to overeating and health problems. Inner misalignment is a particular concern for agents which are trained in large open-ended environments, where a wide range of unintended goals may emerge.[7]

Scalable oversight


One approach to preventing misspecified objective functions is to ask humans to evaluate and score the AI's behaviour.[11][12] However, humans are also fallible and might score some undesirable solutions highly: for instance, a virtual robot hand (shown in the animated figure) learned to 'pretend' to grasp an object in order to receive positive feedback.[13] Thorough human supervision is also expensive, so this method cannot realistically be used to evaluate every action. Moreover, complex tasks (such as making economic policy decisions) might produce too much information for an individual human to evaluate, and long-term tasks such as predicting the climate cannot be evaluated without extensive human research.[14] The pitfalls of relying on feedback from unassisted humans are illustrated by AI systems that use 'likes' or click-throughs as human feedback, which may lead to addiction.[15][16]

A key open problem in alignment research is how to create a design specification which avoids outer misalignment given only limited access to a human supervisor; this is known as the problem of scalable oversight.[12] Much ongoing research in AI alignment attempts to address this issue; some of the most prominent research agendas are discussed below.

[ADD ROBOT HAND GIF]

Training by debate


OpenAI researchers have proposed training aligned AI by means of debate between AI systems, with the winner judged by humans.[17] Such debate is intended to bring the weakest points of an answer to a complex question or problem to human attention, as well as to train AI systems to be more beneficial to humans by rewarding them for truthful and safe answers. This approach is motivated by the expected difficulty of determining whether an AGI-generated answer is both valid and safe by human inspection alone. Joel Lehman characterizes debate as one of “the long term safety agendas currently popular in ML”, with the other two being reward modelling and iterated amplification (see below).[18]
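A schematic sketch of the protocol is given below (with placeholder agent and judge functions, not OpenAI's implementation): two agents alternate arguments about a question, a judge reads the transcript, and the winner receives a zero-sum reward that can be used as a training signal.

```python
# Schematic sketch of debate (after Irving et al., 2018). The agents and judge are
# placeholder functions: in practice the agents would be trained models and the
# judge a human (or a model trained to predict human judgements).

def debate(question, agent_a, agent_b, judge, rounds=3):
    transcript = [("question", question)]
    for _ in range(rounds):
        transcript.append(("A", agent_a(transcript)))  # A argues, seeing everything so far
        transcript.append(("B", agent_b(transcript)))  # B responds
    winner = judge(transcript)                         # judge picks the more convincing side
    rewards = {"A": 1, "B": -1} if winner == "A" else {"A": -1, "B": 1}
    return transcript, rewards                         # zero-sum rewards for training
```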

Reward modeling


Reward modeling refers to a reinforcement learning setup in which an agent receives rewards from a model trained to imitate human feedback.[19] Instead of receiving reward signals directly from humans or from a static reward function, the agent receives its reward signals from a model that is trained on human feedback and can then operate independently of humans. The reward model is trained concurrently with the agent: humans provide feedback on the agent's behaviour, which updates the reward model, while the agent is trained on the reward model's outputs.[20]
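A minimal sketch of how such a reward model might be fitted to pairwise human comparisons, in the style of the Christiano et al. (2017) setup, is given below; the network architecture, observation size, and comparison data are illustrative placeholders rather than details from the cited work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of fitting a reward model to pairwise human comparisons (in the
# style of Christiano et al., 2017). Random tensors stand in for real trajectory
# segments; the architecture and hyperparameters are illustrative placeholders.

obs_dim = 8
reward_model = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Each comparison is (preferred_segment, other_segment), as labelled by a human rater.
human_comparisons = [(torch.randn(20, obs_dim), torch.randn(20, obs_dim)) for _ in range(100)]

for preferred, other in human_comparisons:
    r_pref = reward_model(preferred).sum()   # predicted return of the preferred segment
    r_other = reward_model(other).sum()
    # Bradley-Terry style loss: P(preferred beats other) = sigmoid(r_pref - r_other)
    loss = -F.logsigmoid(r_pref - r_other)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The agent is then trained with ordinary RL on rewards predicted by reward_model.
```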

In 2017, researchers from OpenAI and DeepMind reported that a reinforcement learning algorithm using a feedback-predicting reward model was able to learn complex novel behaviors in a virtual environment.[21] In one experiment, a virtual robot was trained to perform a backflip in less than an hour of evaluation using 900 bits of human feedback. In 2020, researchers from OpenAI described using reward modeling to train language models to produce short summaries of Reddit posts and news articles, with high performance relative to other approaches.[22] However, they observed that beyond the predicted reward associated with the 99th percentile of reference summaries in the training dataset, optimizing for the reward model produced worse summaries rather than better.

A long-term goal of this line of research is to create a recursive reward modelling setup for training agents on tasks too complex or costly for humans to evaluate directly.[23] For example, training an agent to write a fantasy novel via reward modelling would require humans to read and holistically assess enough novels to train a reward model that matches those assessments, which might be prohibitively expensive. The task becomes easier with access to assistant agents that can extract a summary of the plotline, check spelling and grammar, summarize character development, assess the flow of the prose, and so on. Each of those assistants could in turn be trained via reward modelling.

A step in which a human works with AI assistants to perform tasks that the human could not accomplish alone is known as an amplification step, because it amplifies the human's capabilities beyond what they would normally be. Since recursive reward modelling involves a hierarchy of such steps, it is one example of a broader class of safety techniques known as iterated amplification.[24] In addition to techniques based on reinforcement learning, other proposed iterated amplification techniques rely on supervised learning or imitation learning to scale up human abilities.
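The sketch below shows a single amplification step in schematic form (the decomposition and combination functions and the model are placeholders); in iterated amplification, a new model is then trained to imitate the amplified overseer, and the process repeats.

```python
# Schematic sketch of one amplification step (after Christiano et al., 2018):
# a human overseer answers a hard question by decomposing it, delegating the
# subquestions to the current model, and combining the sub-answers. The
# decompose/combine functions and the model are placeholders.

def amplify(question, human_decompose, human_combine, model):
    subquestions = human_decompose(question)         # human breaks the task down
    sub_answers = [model(q) for q in subquestions]   # current model answers the pieces
    return human_combine(question, sub_answers)      # human assembles the final answer

# Iterated amplification then trains the next model to imitate the amplified
# overseer, e.g. via supervised learning on (question, amplified answer) pairs.
```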

Inferring human preferences from behaviour


Stuart Russell has advocated a new approach to the development of beneficial machines, in which:[3]: 182 

1. The machine’s only objective is to maximize the realisation of human preferences.

2. The machine is initially uncertain about what those preferences are.

3. The ultimate source of information about human preferences is human behaviour.

An early example of this approach is Russell and Ng’s inverse reinforcement learning, in which AIs infer the preferences of human supervisors from those supervisors’ behaviour, by assuming that the supervisors act to maximise some reward function. More recently, Hadfield-Menell et al. have extended this paradigm to allow humans to modify their behaviour in response to the AIs’ presence (for example, by favouring pedagogically useful actions), which they call “assistance games” (also known as cooperative inverse reinforcement learning).[3]: 202  [25] Compared with debate and iterated amplification, assistance games rely more explicitly on the assumption of (noisy) human rationality; it is unclear how to extend them to cases in which humans are systematically biased or otherwise suboptimal.
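The core inference step can be sketched in a toy one-step setting, assuming the standard noisily rational ("Boltzmann") model of human choice; the actions, candidate reward functions, and rationality parameter below are illustrative assumptions, not taken from the cited work.

```python
import numpy as np

# Toy Bayesian inverse reinforcement learning: the human is modelled as noisily
# rational, choosing action a with probability proportional to exp(beta * R(a)).
# The AI updates a posterior over candidate reward functions from observed choices.
# All quantities here are illustrative placeholders.

actions = ["make_coffee", "make_tea", "do_nothing"]
candidate_rewards = {                                    # hypotheses about the human's values
    "likes_coffee": np.array([1.0, 0.2, 0.0]),
    "likes_tea":    np.array([0.2, 1.0, 0.0]),
}
beta = 5.0                                               # higher beta = more rational human
posterior = {theta: 0.5 for theta in candidate_rewards}  # uniform prior

def choice_probability(action, reward_vec):
    logits = beta * reward_vec
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[actions.index(action)]

observed_choices = ["make_coffee", "make_coffee", "make_tea"]
for a in observed_choices:
    for theta, r in candidate_rewards.items():
        posterior[theta] *= choice_probability(a, r)
total = sum(posterior.values())
posterior = {theta: p / total for theta, p in posterior.items()}

print(posterior)   # most probability mass lands on "likes_coffee"
```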

Embedded agency


Work on scalable oversight largely occurs within formalisms such as POMDPs. Embedded agency[26][27] is another major strand of research, which attempts to solve problems arising from the mismatch between such theoretical frameworks and the real agents we might build. For example, even if the scalable oversight problem is solved, an agent which gains access to the computer it is running on may still have an incentive to tamper[28] with its reward function in order to obtain much more reward than its human supervisors give it. A list of examples of specification gaming compiled by DeepMind researcher Victoria Krakovna includes a genetic algorithm that learned to delete the file containing its target output, so that it was rewarded for outputting nothing.[8] This class of problems has been formalised using causal incentive diagrams.[28] Everitt and Hutter's current reward function algorithm[29] addresses it by designing agents that evaluate future actions according to their current reward function. This approach is also intended to prevent problems arising from more general self-modification that AIs might carry out.[30][26]
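The intuition behind the current reward function approach can be shown with a toy comparison (a simplified sketch, not the formal algorithm of the cited paper): a plan that tampers with the reward channel looks attractive when scored by the post-tampering signal, but not when scored by the reward function the agent holds now.

```python
# Toy illustration of the "current reward function" idea (simplified sketch, not the
# formal algorithm of Everitt and Hutter): future states are scored with the reward
# function the agent holds *now*, so tampering with the reward channel gains nothing.

def plan_value(states, reward_fn):
    return sum(reward_fn(s) for s in states)

current_reward = lambda s: 1.0 if s == "task_done" else 0.0   # what the designers intended
tampered_signal = lambda s: 100.0                             # what a hacked channel reports

honest_plan = ["work", "task_done"]
tampering_plan = ["hack_reward_channel", "idle"]

print(plan_value(honest_plan, current_reward))      # 1.0
print(plan_value(tampering_plan, tampered_signal))  # 200.0 -- naive evaluation favours tampering
print(plan_value(tampering_plan, current_reward))   # 0.0   -- current-reward evaluation does not
```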

Other work in this area focuses on developing new frameworks and algorithms for other properties that a design specification might need to capture.[26] For example, agents should ideally reason correctly under uncertainty in a wide range of circumstances. As one contribution to this, Leike et al. provide a general way for Bayesian agents to model each other's policies in a multi-agent environment without ruling out any realistic possibilities.[31] The Garrabrant induction algorithm extends probabilistic induction to be applicable to logical (rather than only empirical) facts.[32]

Approaches to inner alignment


An inner alignment failure occurs when the goals an AI pursues during deployment (its revealed specification) deviate from the goals it was trained to pursue in its original environment (its design specification). Paul Christiano argues for using interpretability to detect such deviations, using adversarial training to detect and penalize them, and using formal verification to rule them out.[33] These research areas are active focuses of work in the machine learning community, although that work is not normally aimed at solving AGI alignment problems. Building on early adversarial examples for image classifiers,[34] a wide body of literature now exists on techniques for generating adversarial examples and for creating models that are robust to them.[35] Meanwhile, research on verification includes techniques for training neural networks whose outputs provably remain within identified constraints.[36] Interpretability research is discussed in more detail in the Capability control section below.
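A minimal sketch of one standard technique from this robustness literature, adversarial training with fast-gradient-sign perturbations, is given below; the model, data, and perturbation size are placeholders, and the technique itself is generic rather than specific to AGI alignment.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of adversarial training with FGSM perturbations (one standard
# robustness technique from the literature cited above). The model, optimizer,
# batch (x, y), and epsilon are placeholders.

def fgsm_example(model, x, y, epsilon=0.03):
    """Perturb inputs x in the direction that most increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()

def adversarial_training_step(model, optimizer, x, y):
    x_adv = fgsm_example(model, x, y)          # craft adversarial versions of the batch
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)    # train the model to classify them correctly anyway
    loss.backward()
    optimizer.step()
    return loss.item()
```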

Capability control


Capability control proposals aim to increase the ability of humans to monitor and control the behaviour of AI systems, in order to reduce the danger they might pose if misaligned. However, capability control becomes less effective as agents become more intelligent and their ability to exploit flaws in human control systems increases. Therefore, Bostrom and others recommend capability control methods only as a supplement to alignment methods.[37]

Interruptibility


One potential way to prevent harmful outcomes is to give human supervisors the ability to easily shut down a misbehaving AI via an "off-switch". However, such AIs will have instrumental incentives to disable any off-switches unless measures are put in place to prevent this. This problem has been formalised as an assistance game between a human and an AI, in which the AI can choose whether to disable its off-switch; then, if the switch is still enabled, the human can choose whether or not to press it.[38] A standard approach to such assistance games is to ensure that the AI interprets human choices as important information about its intended goals.[3]: 208 
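A worked toy version of this intuition is shown below (with illustrative numbers, not the paper's formal analysis): when the AI is uncertain about the utility of its proposed action and the human blocks the action exactly when it would be harmful, deferring to the human has higher expected utility than acting unilaterally or shutting down.

```python
import numpy as np

# Worked toy version of the off-switch game intuition (after Hadfield-Menell et al.):
# the AI is uncertain about the human's utility U for its proposed action. It can
# act immediately, switch itself off (utility 0), or defer to a human who presses
# the off-switch exactly when U < 0. All numbers are illustrative.

rng = np.random.default_rng(0)
U = rng.normal(loc=0.0, scale=1.0, size=100_000)  # AI's belief over the action's utility

value_act = U.mean()                      # roughly 0: act regardless of the human
value_switch_off = 0.0
value_defer = np.maximum(U, 0.0).mean()   # roughly 0.4: the human filters out the bad cases

print(value_act, value_switch_off, value_defer)
# Deferring has the highest expected utility, so this uncertain AI has no incentive
# to disable its off-switch.
```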

Alternatively, Laurent Orseau and Stuart Armstrong proved that a broad class of agents, called safely interruptible agents, can learn to become indifferent to whether their off-switch gets pressed.[39][40] This approach has the limitation that an AI which is completely indifferent to whether it is shut down or not is also unmotivated to care about whether the off-switch remains functional, and could incidentally and innocently disable it in the course of its operations (for example, for the purpose of removing and recycling an unnecessary component). More broadly, indifferent agents will act as if the off-switch can never be pressed, and might therefore fail to make contingency plans to arrange a graceful shutdown.[41][40]

Interpretability and analysis


Analysis of the mechanisms underlying an AI's behaviour can help to identify when that behaviour will have undesirable consequences. The main challenge is that neural networks are by default highly uninterpretable and are often described as "black boxes".[42] Approaches to addressing this operate at multiple levels. Some techniques visualise the inputs to which individual neurons respond most strongly. Several groups have found that neurons can be aggregated into circuits which perform human-comprehensible functions, some of which reliably arise across different networks trained independently.[43][44]
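A minimal sketch of one such technique, visualising what a unit responds to by gradient ascent on the input ("activation maximisation"), is shown below; the model and the chosen unit are placeholders, and practical versions add regularisation so that the optimised inputs remain interpretable.

```python
import torch

# Minimal sketch of feature visualisation by activation maximisation: optimise an
# input image so that one chosen unit of the network activates strongly. `model`
# (assumed to map an image batch to per-unit activations) and `unit_index` are
# placeholders; real implementations add regularisation and image transformations.

def visualise_unit(model, unit_index, steps=200, lr=0.1, shape=(1, 3, 224, 224)):
    x = torch.randn(shape, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        activation = model(x)[0, unit_index]   # activation of the unit being studied
        (-activation).backward()               # gradient ascent on the activation
        optimizer.step()
    return x.detach()
```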

At a higher level, various techniques exist to extract compressed representations of the features of given inputs, which can then be analysed by standard clustering techniques. Alternatively, networks can be trained to output linguistic explanations of their behaviour, which are directly human-interpretable.[45] Model behaviour can also be explained with reference to the training data, for example by evaluating which training inputs most influenced a given behaviour.[46]
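As a minimal sketch of the first approach, extracted feature vectors (here random arrays standing in for, say, penultimate-layer activations) can be grouped with an off-the-shelf clustering method and the resulting clusters inspected by hand.

```python
import numpy as np
from sklearn.cluster import KMeans

# Minimal sketch of analysing compressed representations by clustering. Random
# vectors stand in for features extracted from a trained network.

embeddings = np.random.randn(1000, 64)                # placeholder feature vectors
labels = KMeans(n_clusters=10).fit_predict(embeddings)
print(np.bincount(labels))                            # size of each discovered cluster
```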

Boxing


An AI box is a proposed method of capability control in which an AI is run on an isolated computer system with heavily restricted input and output channels, such as text-only channels and no connection to the internet. While this reduces the AI's ability to carry out undesirable behaviour, it also reduces its usefulness. However, boxing imposes fewer costs when applied to a question-answering system, which does not in any case require interaction with the world.

The likelihood of security flaws involving hardware or software vulnerabilities can be reduced by formally verifying the design of the AI box. Security breaches may also occur if the AI is able to manipulate the human supervisors into letting it out, via its understanding of their psychology.[47]

References

  1. ^ Leike, Jan; Krueger, David; Everitt, Tom; Martic, Miljan; Maini, Vishal; Legg, Shane (19 November 2018). "Scalable agent alignment via reward modeling: a research direction". arXiv:1811.07871 [cs.LG].
  2. ^ a b c Gabriel, Iason (1 September 2020). "Artificial Intelligence, Values, and Alignment". Minds and Machines. 30 (3): 411–437. doi:10.1007/s11023-020-09539-2. ISSN 1572-8641. S2CID 210920551.
  3. ^ a b c d Russell, Stuart (October 8, 2019). Human Compatible: Artificial Intelligence and the Problem of Control. United States: Viking. ISBN 978-0-525-55861-3. OCLC 1083694322.
  4. ^ Yudkowsky, Eliezer (2011). "Complex Value Systems in Friendly AI". Artificial General Intelligence. Lecture Notes in Computer Science. Vol. 6830. pp. 388–393. doi:10.1007/978-3-642-22887-2_48. ISBN 978-3-642-22886-5.
  5. ^ Ortega, Pedro; Maini, Vishal; DeepMind Safety Team (27 September 2018). "Building safe artificial intelligence: specification, robustness, and assurance". Medium. Retrieved 12 December 2020.
  6. ^ a b Hubinger, Evan; van Merwijk, Chris; Mikulik, Vladimir; Skalse, Joar; Garrabrant, Scott (11 June 2019). "Risks from Learned Optimization in Advanced Machine Learning Systems". arXiv:1906.01820 [cs.AI].
  7. ^ a b c Ecoffet, Adrien; Clune, Jeff; Lehman, Joel (1 July 2020). "Open Questions in Creating Safe Open-ended AI: Tensions Between Control and Creativity". Artificial Life Conference Proceedings. 32: 27–35. arXiv:2006.07495. doi:10.1162/isal_a_00323. S2CID 219687488.
  8. ^ a b Krakovna, Victoria; Legg, Shane. "Specification gaming: the flip side of AI ingenuity". Deepmind. Retrieved 6 January 2021.
  9. ^ Clark, Jack; Amodei, Dario (22 December 2016). "Faulty Reward Functions in the Wild". OpenAI. Retrieved 6 January 2021.
  10. ^ Christian, Brian (2020). The Alignment Problem: Machine Learning and Human Values. W.W. Norton. ISBN 978-0-393-63582-9.
  11. ^ Christiano, Paul; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario (13 July 2017). "Deep reinforcement learning from human preferences". arXiv:1706.03741 [stat.ML].
  12. ^ a b Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (25 July 2016). "Concrete Problems in AI Safety". arXiv:1606.06565 [cs.AI].
  13. ^ Amodei, Dario; Christiano, Paul; Ray, Alex (13 June 2017). "Learning from Human Preferences". OpenAI. Retrieved 6 January 2021.
  14. ^ Christiano, Paul; Shlegeris, Buck; Amodei, Dario (19 October 2018). "Supervising strong learners by amplifying weak experts". arXiv:1810.08575 [cs.LG].
  15. ^ Ekstrand, Michael D.; Willemsen, Martijn C. (7 September 2016). "Behaviorism is Not Enough: Better Recommendations through Listening to Users". Proceedings of the 10th ACM Conference on Recommender Systems. Association for Computing Machinery: 221–224. doi:10.1145/2959100.2959179. S2CID 2846294.
  16. ^ Milli, Smitha; Belli, Luca; Hardt, Moritz (2021). "From Optimizing Engagement to Measuring Value". Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. pp. 714–722. arXiv:2008.12623. doi:10.1145/3442188.3445933. ISBN 9781450383097. S2CID 221370628.
  17. ^ Irving, Geoffrey; Christiano, Paul; Amodei, Dario; OpenAI (October 22, 2018). "AI safety via debate". arXiv:1805.00899 [stat.ML].
  18. ^ Banzhaf, Wolfgang; Goodman, Erik; Sheneman, Leigh; Trujillo, Leonardo; Worzel, Bill (May 2020). Genetic Programming Theory and Practice XVII. Springer Nature. ISBN 978-3-030-39958-0.
  19. ^ Leike, Jan; Krueger, David; Everitt, Tom; Martic, Miljan; Maini, Vishal; Legg, Shane (19 November 2018). "Scalable agent alignment via reward modeling: a research direction". arXiv:1811.07871 [cs.LG].
  20. ^ Everitt, Tom; Hutter, Marcus (15 August 2019). "Reward Tampering Problems and Solutions in Reinforcement Learning". arXiv:1908.04734v2 [cs.AI].
  21. ^ Christiano, Paul; Leike, Jan; Brown, Tom; Martic, Miljan; Legg, Shane; Amodei, Dario (13 July 2017). "Deep Reinforcement Learning from Human Preferences". arXiv:1706.03741 [stat.ML].
  22. ^ Stiennon, Nisan; Ziegler, Daniel; Lowe, Ryan; Wu, Jeffrey; Voss, Chelsea; Christiano, Paul; Ouyang, Long (4 September 2020). "Learning to summarize from human feedback". arXiv:2009.01325.
  23. ^ Leike, Jan; Krueger, David; Everitt, Tom; Martic, Miljan; Maini, Vishal; Legg, Shane (19 November 2018). "Scalable agent alignment via reward modeling: a research direction". arXiv:1811.07871 [cs.LG].
  24. ^ Christiano, Paul; Shlegeris, Buck; Amodei, Dario (19 October 2018). "Supervising strong learners by amplifying weak experts". arXiv:1810.08575 [cs.LG].
  25. ^ Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart (12 November 2016). "Cooperative Inverse Reinforcement Learning". arXiv:1606.03137 [cs.AI].
  26. ^ a b c Everitt, Tom; Lea, Gary; Hutter, Marcus (21 May 2018). "AGI Safety Literature Review". arXiv:1805.01109.
  27. ^ Demski, Abram; Garrabrant, Scott (6 October 2020). "Embedded Agency". arXiv:1902.09469.
  28. ^ a b Everitt, Tom; Ortega, Pedro A.; Barnes, Elizabeth; Legg, Shane (6 September 2019). "Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings". arXiv:1902.09980.
  29. ^ Everitt, Tom; Hutter, Marcus (20 August 2019). "Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective". arXiv:1908.04734.
  30. ^ Everitt, Tom; Filan, Daniel; Daswani, Mayank; Hutter, Marcus (10 May 2016). "Self-Modification of Policy and Utility Function in Rational Agents". arXiv:1605.03142.
  31. ^ Leike, Jan; Taylor, Jessica; Fallenstein, Benya (25 June 2016). "A formal solution to the grain of truth problem". Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence. AUAI Press: 427–436.
  32. ^ Garrabrant, Scott; Benson-Tilsen, Tsvi; Critch, Andrew; Soares, Nate; Taylor, Jessica (7 December 2020). "Logical Induction". arXiv:1609.03543 [cs.AI].
  33. ^ Christiano, Paul (11 September 2019). "Conversation with Paul Christiano". AI Impacts. AI Impacts. Retrieved 6 January 2021.
  34. ^ Szegedy, Christian; Zaremba, Wojciech; Sutskever, Ilya; Bruna, Joan; Erhan, Dumitru; Goodfellow, Ian; Fergus, Rob (19 February 2014). "Intriguing properties of neural networks". arXiv:1312.6199 [cs.CV].
  35. ^ Serban, Alex; Poll, Erik; Visser, Joost (12 June 2020). "Adversarial Examples on Object Recognition: A Comprehensive Survey". ACM Computing Surveys. 53 (3): 66:1–66:38. doi:10.1145/3398394. hdl:2066/221052. ISSN 0360-0300. S2CID 218518141.
  36. ^ Kohli, Pushmeet; Dvijotham, Krishnamurthy; Uesato, Jonathan; Gowal, Sven. "Towards Robust and Verified AI: Specification Testing, Robust Training, and Formal Verification". Deepmind. Retrieved 6 January 2021.
  37. ^ Bostrom, Nick (2014). Superintelligence: Paths, Dangers, Strategies. Oxford: Oxford University Press. ISBN 978-0-19-967811-2.
  38. ^ Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart (15 June 2017). "The Off-Switch Game". arXiv:1611.08219 [cs.AI].
  39. ^ "Google developing kill switch for AI". BBC News. 8 June 2016. Retrieved 12 June 2016.
  40. ^ a b Orseau, Laurent; Armstrong, Stuart (25 June 2016). "Safely interruptible agents". Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence. AUAI Press: 557–566.
  41. ^ Soares, Nate; et al. (2015). "Corrigibility". Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence.
  42. ^ Montavon, Grégoire; Samek, Wojciech; Müller, Klaus Robert (2018). "Methods for interpreting and understanding deep neural networks". Digital Signal Processing: A Review Journal. 73: 1–15. doi:10.1016/j.dsp.2017.10.011. ISSN 1051-2004. S2CID 207170725.
  43. ^ Olah, Chris; Cammarata, Nick; Schubert, Ludwig; Goh, Gabriel; Petrov, Michael; Carter, Shan (10 March 2020). "Zoom In: An Introduction to Circuits". Distill. 5 (3): e00024.001. doi:10.23915/distill.00024.001. ISSN 2476-0757. S2CID 215930358.
  44. ^ Li, Yixuan; Yosinski, Jason; Clune, Jeff; Lipson, Hod; Hopcroft, John (8 December 2015). "Convergent Learning: Do different neural networks learn the same representations?". Feature Extraction: Modern Questions and Challenges. PMLR: 196–212.
  45. ^ Hendricks, Lisa Anne; Akata, Zeynep; Rohrbach, Marcus; Donahue, Jeff; Schiele, Bernt; Darrell, Trevor (2016). "Generating Visual Explanations". Computer Vision – ECCV 2016. Lecture Notes in Computer Science. 9908. Springer International Publishing: 3–19. arXiv:1603.08507. doi:10.1007/978-3-319-46493-0_1. ISBN 978-3-319-46492-3. S2CID 12030503.
  46. ^ Koh, Pang Wei; Liang, Percy (17 July 2017). "Understanding Black-box Predictions via Influence Functions". International Conference on Machine Learning. PMLR: 1885–1894.
  47. ^ Chalmers, David (2010). "The singularity: A philosophical analysis". Journal of Consciousness Studies. 17 (9–10): 7–65.