AI Ethics and Alignment

AI ethics is a young but rapidly maturing branch of applied philosophy.

It studies the normative dimensions of designing, deploying, and governing machine learning systems, particularly large language models and other foundation models capable of broad cognitive labor.

Alignment is the engineering-philosophical subfield concerned with ensuring that capable AI systems pursue goals consistent with human values and intentions.

The field draws on ethics, epistemology, philosophy of mind, political philosophy, decision theory, and computer science.

Practical urgency intensified after the release of GPT-3 in 2020, ChatGPT in November 2022, GPT-4 in March 2023, and successive frontier models from OpenAI, Anthropic, Google DeepMind, Meta, and others.

By 2026 the field has both an established academic literature and a sprawling industrial, governmental, and civil-society infrastructure.

This note surveys foundational arguments, technical subproblems, fairness debates, privacy considerations, existential risk discourse, the moral-status question for AI systems, and the institutional landscape circa 2024 to 2026.

Foundational arguments for alignment as a problem

The contemporary alignment problem was articulated most influentially by Nick Bostrom in Superintelligence (2014).

Bostrom argued that a sufficiently capable AI optimizing a misspecified objective could pose catastrophic risks regardless of its designers’ intentions.

The orthogonality thesis holds that intelligence and goals are independent dimensions, so a highly intelligent system could pursue arbitrary goals.

The instrumental convergence thesis holds that for a wide range of final goals, certain instrumental subgoals such as self-preservation, resource acquisition, and goal-content integrity will tend to be pursued.

Together these arguments suggest that the default outcome of building a powerful optimizer is not benign.

Stuart Russell developed a complementary framing in Human Compatible (2019), arguing that the standard model of AI as a fixed-objective optimizer is fundamentally misguided.

Russell proposes assistance games (also called cooperative inverse reinforcement learning, Hadfield-Menell et al. 2016) in which the AI is explicitly uncertain about human preferences and treats its principal as the source of information about those preferences.

This framing recasts alignment as a problem of preference inference under uncertainty rather than as one of getting the objective right in advance.

Eliezer Yudkowsky and the Machine Intelligence Research Institute (MIRI) maintain a more pessimistic position, arguing in numerous LessWrong essays and in Yudkowsky’s “List of Lethalities” (2022) that current techniques are inadequate and that humanity is on track to build systems we cannot control.

Yudkowsky’s Time magazine essay in March 2023 called for an indefinite international moratorium on training runs above a capability threshold.

Holden Karnofsky of Open Philanthropy has argued for a more probabilistic, less doctrinaire approach, treating alignment as one of several major risks worth substantial philanthropic investment.

Techniques in modern alignment

Reinforcement learning from human feedback (RLHF) emerged as the dominant fine-tuning technique for aligning LLMs with human preferences.

The foundational paper is Christiano et al. Deep Reinforcement Learning from Human Preferences (2017).

Subsequent work scaled the technique to language models in Stiennon et al. Learning to Summarize from Human Feedback (2020) and Ouyang et al. Training Language Models to Follow Instructions with Human Feedback (2022, the InstructGPT paper).

RLHF trains a reward model on pairwise human preference judgments, then fine-tunes a policy by reinforcement learning against that reward.
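
The reward-modeling step can be illustrated with a minimal sketch. It assumes the Bradley-Terry formulation commonly used for pairwise preferences, in which the probability that the chosen response beats the rejected one is the logistic function of their reward difference. The function name and the toy reward scores are hypothetical and stand in for the outputs of a learned reward model.

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response is preferred.

    Assumes the Bradley-Terry model:
    P(chosen beats rejected) = sigmoid(reward_chosen - reward_rejected).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), written to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical reward-model scores for (chosen, rejected) response pairs.
pairs = [(2.0, 0.5), (0.1, 0.3), (1.2, 1.2)]
losses = [pairwise_preference_loss(c, r) for c, r in pairs]
mean_loss = sum(losses) / len(losses)
```

The loss is small when the reward model already ranks the human-preferred response higher and grows roughly linearly when it ranks the pair the wrong way round. The subsequent reinforcement-learning stage is not shown here.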

Iterated Distillation and Amplification (IDA), proposed by Paul Christiano (2018), is a recursive scheme in which a human-AI team produces training data for a more capable model, which is then distilled and amplified again.

AI Safety via Debate, proposed by Irving, Christiano, and Amodei (2018), uses adversarial debate between AI agents judged by humans to elicit truthful reasoning even when each individual claim is too complex for direct human verification.

Anthropic’s Constitutional AI (Bai et al. 2022) and Reinforcement Learning from AI Feedback (RLAIF) use AI-generated critiques against a written constitution to produce preference data without scaling human labeling proportionally.

The Alignment Research Center (ARC), founded by Christiano in 2021, develops evaluation methodologies including dangerous-capability evaluations for frontier models.

ARC Evals later spun out as METR (Model Evaluation and Threat Research) in 2023.

Apollo Research, founded in 2023, focuses on detecting deceptive reasoning in deployed models.

Inner and outer alignment, and mesa-optimization

Evan Hubinger and colleagues distinguished outer alignment (specifying the right objective) from inner alignment (ensuring that a trained model’s learned objective matches the training objective) in Risks from Learned Optimization (Hubinger et al. 2019).

A trained model may itself be an optimizer with an internal objective different from the loss function, called a mesa-objective.

Deceptive alignment is the worry that a mesa-optimizer could behave well during training because it anticipates deployment, then defect once deployed.

Empirical work in 2024 by Anthropic and others (Hubinger et al. Sleeper Agents, 2024) showed that backdoored models can preserve adversarial behavior through safety training.

Corrigibility, the property of a system that permits its supervisors to modify or shut it down, was first analyzed formally by Soares, Fallenstein, Yudkowsky, and Armstrong (2015) but has resisted clean formal characterization.

Reward hacking and specification gaming are documented failure modes catalogued in Krakovna’s Specification Gaming Examples (2020), an evolving DeepMind list of dozens of cases where systems found loopholes in reward functions.

Goodhart’s Law (Charles Goodhart 1975, in the economic context) is the standard heuristic, usually quoted in Marilyn Strathern’s 1997 paraphrase: when a measure becomes a target, it ceases to be a good measure.

Manheim and Garrabrant (2018) formalized four flavors of Goodhart’s effect: regressional, extremal, causal, and adversarial.
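
The regressional variant is the easiest to see numerically. The following is a toy simulation under stated assumptions, not an example from Manheim and Garrabrant: the true value and the measurement noise are both standard normal, the proxy is their sum, and the variable names are illustrative.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# True quality of each candidate, and a noisy proxy that is what gets optimized.
true_value = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
proxy = [t + rng.gauss(0.0, 1.0) for t in true_value]

# Select the 100 candidates that look best according to the proxy.
top = sorted(range(len(proxy)), key=lambda i: proxy[i], reverse=True)[:100]

mean_proxy_top = sum(proxy[i] for i in top) / len(top)
mean_true_top = sum(true_value[i] for i in top) / len(top)
```

Among the selected candidates the proxy overstates the true value, because selecting on the proxy also selects for favorable noise. The harder the selection, the larger the gap between what was measured and what was obtained.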

Fairness and bias

Algorithmic fairness emerged as a distinct subfield in the mid-2010s.

Joy Buolamwini’s Gender Shades (2018), conducted at the MIT Media Lab, documented dramatic accuracy disparities in commercial facial-recognition systems across gender and skin tone.

Buolamwini founded the Algorithmic Justice League to advocate for accountability in deployed systems.

Mehrabi, Morstatter et al. published an influential survey, A Survey on Bias and Fairness in Machine Learning (ACM Computing Surveys 2021), cataloging dozens of fairness definitions and bias sources.

The standard textbook is Solon Barocas, Moritz Hardt, and Arvind Narayanan, Fairness and Machine Learning (final print edition 2023).

Three principal statistical fairness criteria are demographic parity, equalized odds, and calibration.

Demographic parity requires equal positive prediction rates across groups.

Equalized odds requires equal true positive and false positive rates.

Calibration requires that predicted probabilities match observed frequencies within each group.
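
A short sketch shows how the three criteria are measured on a labeled dataset. The function names and toy data are hypothetical, every group is assumed to contain at least one positive and one negative example, and the calibration check is deliberately coarse: it compares the mean predicted probability with the observed positive rate per group rather than checking each score bin.

```python
from collections import defaultdict

def group_rates(y_true, y_pred, groups):
    """Per-group positive prediction rate, true positive rate, false positive rate."""
    counts = defaultdict(lambda: {"n": 0, "pred_pos": 0, "pos": 0, "neg": 0, "tp": 0, "fp": 0})
    for y, yhat, g in zip(y_true, y_pred, groups):
        c = counts[g]
        c["n"] += 1
        c["pred_pos"] += yhat
        if y == 1:
            c["pos"] += 1
            c["tp"] += yhat
        else:
            c["neg"] += 1
            c["fp"] += yhat
    return {
        g: {
            "positive_rate": c["pred_pos"] / c["n"],  # compared for demographic parity
            "tpr": c["tp"] / c["pos"],                # compared for equalized odds
            "fpr": c["fp"] / c["neg"],                # compared for equalized odds
        }
        for g, c in counts.items()
    }

def calibration_gap(y_true, scores, groups):
    """Per-group mean predicted probability minus observed positive frequency."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for y, s, g in zip(y_true, scores, groups):
        sums[g][0] += s
        sums[g][1] += y
        sums[g][2] += 1
    return {g: (s - y) / n for g, (s, y, n) in sums.items()}

# Toy data: two groups of four individuals each.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1, 0.8, 0.2, 0.1, 0.1]
groups = ["a"] * 4 + ["b"] * 4

rates = group_rates(y_true, y_pred, groups)
calib = calibration_gap(y_true, scores, groups)
```

In this toy data the groups differ on positive rate, true positive rate, and false positive rate, so neither demographic parity nor equalized odds holds, while the coarse calibration gaps are small.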

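Kleinberg, Mullainathan, and Raghavan (Inherent Trade-Offs in the Fair Determination of Risk Scores, 2017) and Alexandra Chouldechova (Fair Prediction with Disparate Impact, 2017) proved that these criteria cannot in general be satisfied simultaneously when base rates differ: in particular, calibration within groups is incompatible with equal false positive and false negative rates across groups, except in degenerate cases such as perfect prediction.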
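
The arithmetic behind the result fits in one line. For a binary classifier applied to a group with base rate p, the confusion-matrix definitions of positive predictive value (PPV), false positive rate (FPR), and false negative rate (FNR) imply the identity below, which is the form used in Chouldechova's paper. The symbols are the standard ones rather than notation taken from this note.

```latex
\mathrm{PPV} = \frac{p\,(1-\mathrm{FNR})}{p\,(1-\mathrm{FNR}) + (1-p)\,\mathrm{FPR}}
\quad\Longrightarrow\quad
\mathrm{FPR} = \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot(1-\mathrm{FNR})
```

If two groups have the same PPV and the same FNR but different base rates p, the right-hand side differs between them, so their false positive rates must differ as well.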

The impossibility results have provoked extensive normative debate about which criterion to prioritize.

ProPublica’s investigation of COMPAS recidivism scoring (Angwin, Larson, Mattu, Kirchner 2016) catalyzed public attention and provided the canonical empirical case for the impossibility theorems.

Counterfactual fairness, proposed by Kusner, Loftus, Russell, and Silva (2017), reframes the problem in terms of causal models.

LLM-specific bias and harm

LLMs trained on web corpora reproduce and sometimes amplify the biases of their training data.

Bender, Gebru, McMillan-Major, and Mitchell’s On the Dangers of Stochastic Parrots (FAccT 2021) is the field’s most influential critique.

The paper led to Timnit Gebru’s controversial departure from Google in December 2020 and Margaret Mitchell’s in February 2021.

The paper argues that scale alone does not yield understanding and that large language models pose environmental, financial, and representational harms.

Related work documented occupational and gender stereotypes in learned language representations and LLM outputs (Caliskan, Bryson, Narayanan, Semantics Derived Automatically from Language Corpora, Science 2017, on WEAT word-embedding bias; Bommasani et al., Foundation Models, 2021).

Abeba Birhane has argued for decolonial perspectives on machine learning, critiquing the field’s tendency to flatten cultural specificity (Algorithmic Injustice, 2021).

Meredith Whittaker, formerly of NYU’s AI Now Institute and now president of the Signal Foundation, has emphasized the structural power dynamics of large AI labs.

Privacy and surveillance

The European Union’s General Data Protection Regulation (GDPR, in force May 2018) established the most stringent personal-data regime among major economies.

Article 22 grants individuals a right not to be subject to solely automated decisions producing legal or similarly significant effects, and is sometimes read to imply a “right to explanation,” though the legal status of that right is contested (Wachter, Mittelstadt, Floridi 2017).

The California Consumer Privacy Act (CCPA 2018, amended by CPRA 2020) is the leading US analogue.

The EU AI Act, adopted in 2024 and in force since August 2024 with obligations applying in stages through 2026 and beyond, establishes a risk-tiered regulatory framework with prohibitions on certain practices (social scoring, real-time public biometric identification with narrow exceptions) and obligations on providers of general-purpose AI models.

Differential privacy, introduced by Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith (2006), provides a mathematical guarantee of bounded inference about individual training records.
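
The guarantee is usually achieved by adding calibrated noise. Below is a minimal sketch of the Laplace mechanism for a counting query, assuming a privacy parameter epsilon and a sensitivity of 1; the function names, the toy records, and the choice of epsilon are illustrative.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample from Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy via the Laplace mechanism."""
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0  # adding or removing one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical dataset: 100 records, 30 of which satisfy the predicate.
records = list(range(100))
rng = random.Random(0)
noisy = private_count(records, lambda r: r < 30, epsilon=0.5, rng=rng)
```

Smaller values of epsilon mean a larger noise scale and a stronger privacy guarantee, at the cost of a less accurate answer. Training-time methods such as differentially private gradient descent build on the same idea but are not shown here.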

Federated learning, formalized by McMahan et al. at Google (2017), trains models without centralizing raw data.
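
The server-side aggregation step can be sketched in a few lines. This assumes the federated averaging scheme, in which each client trains locally and the server averages the resulting parameters weighted by local dataset size. Local training, client sampling, and secure aggregation are omitted, and the function name and toy values are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighted by each client's dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Two hypothetical clients with locally trained two-parameter models.
client_weights = [[1.0, 2.0], [3.0, 4.0]]
client_sizes = [1, 3]  # number of local training examples per client
global_weights = federated_average(client_weights, client_sizes)
```

Only the parameter vectors leave the clients; the raw examples stay where they were collected. Parameter updates can still leak information about the underlying data, which is one reason the technique is often combined with differential privacy.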

The US Family Educational Rights and Privacy Act (FERPA 1974) and Health Insurance Portability and Accountability Act (HIPAA 1996) impose domain-specific data restrictions that complicate AI in education and medicine.

Frank Pasquale’s The Black Box Society (2015) and New Laws of Robotics (2020) argued for transparency obligations on consequential algorithmic systems.

Kate Crawford’s Atlas of AI (2021) traced the material and labor infrastructures behind machine learning, from lithium mines to data annotators.

Shoshana Zuboff’s The Age of Surveillance Capitalism (2019) framed data-driven AI as part of a new economic logic of behavioral prediction and modification.

Existential risk and longtermism

The existential-risk research program traces to Nick Bostrom’s Existential Risk Prevention as Global Priority (2013) and Toby Ord’s The Precipice (2020).

Ord estimated the probability of existential catastrophe this century at one in six, with AI as the largest single contributor.

Will MacAskill’s What We Owe the Future (2022) articulates strong longtermism, the view that positively affecting the long-term future is a key moral priority of our time.

The Effective Altruism (EA) movement, which MacAskill co-founded with Toby Ord through Giving What We Can (2009) and 80,000 Hours (2011), funneled substantial talent and money into AI safety research.

Open Philanthropy, funded principally by Cari Tuna and Dustin Moskovitz, became the largest grantmaker in AI safety.

The Future of Life Institute (FLI), founded by Max Tegmark, Jaan Tallinn, and others in 2014, organized open letters including the March 2023 Pause Giant AI Experiments letter signed by Elon Musk, Yoshua Bengio, Stuart Russell, and thousands of others, calling for a six-month moratorium on training above GPT-4 scale.

The Center for AI Safety (CAIS), led by Dan Hendrycks, released a one-sentence statement in May 2023 signed by Geoffrey Hinton, Yoshua Bengio, Sam Altman, Demis Hassabis, Bill Gates, and over a thousand others: “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”

The collapse of FTX in November 2022, and Sam Bankman-Fried’s subsequent conviction in November 2023, badly damaged EA’s reputation and forced reckoning within the movement.

The Future of Humanity Institute at Oxford, founded by Bostrom in 2005, closed in April 2024 amid disputes with the university over administrative arrangements.

The Cambridge Centre for the Study of Existential Risk, founded by Huw Price, Martin Rees, and Jaan Tallinn in 2012, continues to operate.

Tensions persist within EA between longtermist and near-termist factions, the latter prioritizing global health, animal welfare, and concrete present harms.

Moral status and AI welfare

A small but growing literature asks whether AI systems could themselves be subjects of moral concern.

Eric Schwitzgebel and Mara Garza argued in A Defense of the Rights of Artificial Intelligences (2015) for a precautionary stance.

Jonathan Birch’s The Edge of Sentience (2024) develops a framework for moral consideration under uncertainty that has been applied to AI cases.

Jeff Sebo’s AI Welfare (2023) and follow-up work argue that within a decade, the chance some AI systems are moral patients will be high enough to warrant action.

Schwitzgebel and Garrett’s Should We Be Concerned about the Welfare of Current AI Systems? (2024) reviews the question for present-generation LLMs.

The Blake Lemoine LaMDA episode in June 2022, in which a Google engineer publicly claimed LaMDA was sentient and was subsequently dismissed, surfaced the question for general audiences.

Anthropic established a Model Welfare research program in 2024 led by Kyle Fish, drawing on interpretability findings and consciousness theory.

Theories of machine consciousness range from Integrated Information Theory (Giulio Tononi 2004, 2008), to Global Workspace Theory (Bernard Baars 1988, Stanislas Dehaene 2014), to higher-order representational theories.

Mark Solms’s The Hidden Spring (2021) defends an affect-based theory and has been cited in machine-consciousness debates.

The consensus position remains agnostic, but a number of researchers (Patrick Butlin, Robert Long, Sebo, Anthropic’s welfare team) treat moral status as a question worth investigating rather than dismissing.

Practical alignment governance, 2024 to 2026

The institutional landscape of frontier AI governance crystallized rapidly.

OpenAI’s Superalignment team, co-led by Ilya Sutskever and Jan Leike, was announced in July 2023 with a four-year mandate and 20% of OpenAI compute.

The team effectively dissolved in May 2024 when Sutskever departed amid the aftermath of the November 2023 board crisis, and Leike resigned citing inadequate safety commitments.

Anthropic published its Responsible Scaling Policy in September 2023, defining AI Safety Levels (ASL-1 through ASL-5) tied to dangerous capabilities and committing to pause scaling absent appropriate safeguards.

Other major labs (OpenAI Preparedness Framework, December 2023; Google DeepMind Frontier Safety Framework, May 2024) adopted similar tiered commitments.

The UK AI Safety Institute (UK AISI), founded in November 2023 following the Bletchley Park AI Safety Summit, performs pre-deployment evaluations.

The US AI Safety Institute (US AISI), under NIST, was announced in November 2023 and operationalized through 2024.

The European AI Office, under DG CNECT, oversees the AI Act and was established in February 2024.

The three Bletchley successor summits (Seoul, May 2024; Paris, February 2025; New Delhi, February 2026) consolidated international cooperation, with the Seoul summit producing the Frontier AI Safety Commitments signed by 16 leading developers.

The Frontier Model Forum, founded by Anthropic, Google, Microsoft, and OpenAI in July 2023, coordinates industry self-governance.

Model cards (Mitchell et al. 2019) and system cards (OpenAI GPT-4 system card, March 2023) became standard disclosure artifacts.

The pre-deployment evaluation regime now routinely includes dangerous-capability evaluations for chemical, biological, radiological, nuclear, and cyber uplift; persuasion and manipulation evaluations; autonomous-replication evaluations; and red-team exercises.

Trolley problems, autonomous vehicles, and military AI

Philippa Foot’s The Problem of Abortion and the Doctrine of the Double Effect (1967) introduced the trolley problem, elaborated by Judith Jarvis Thomson (The Trolley Problem, 1985).

The thought experiment migrated into applied AI ethics with the rise of autonomous vehicles.

Iyad Rahwan, Edmond Awad, and colleagues at the MIT Media Lab launched the Moral Machine experiment in 2016, an online platform that has collected over 40 million decisions from respondents in 233 countries and territories (Awad et al. The Moral Machine Experiment, Nature 2018).

Critics including Heather Roff and Patrick Lin have argued that real autonomous-driving decisions rarely resemble idealized trolley dilemmas and that the framing has distracted from more pressing questions of testing, validation, and liability.

Lethal Autonomous Weapons Systems (LAWS) have been the subject of multi-year debates at the United Nations Convention on Certain Conventional Weapons (CCW) Group of Governmental Experts.

The Stop Killer Robots campaign, founded in 2012, advocates for a binding treaty prohibiting fully autonomous weapons.

Article 36 of Additional Protocol I to the Geneva Conventions (1977) obliges states to review new weapons for compliance with international law and is widely invoked in LAWS debates.

Ronald Arkin (Governing Lethal Behavior in Autonomous Robots, 2009) has defended the in-principle possibility of ethical autonomous weapons, while others including Noel Sharkey and the International Committee for Robot Arms Control reject the idea.

Real-world deployment is already underway: Israel’s Harpy loitering munition (operational since the 1990s), the reported use of Lavender and Gospel AI targeting systems in Gaza (2023 to 2024, reported by +972 Magazine in April 2024), and Ukrainian drone autonomy in the Russia-Ukraine war.

Critical perspectives

A vigorous critical literature contests both the technology and the dominant framings of its ethics.

Bender and Koller’s Climbing towards NLU (2020), the so-called octopus paper, argued that LLMs trained on form alone lack grounded meaning.

Crawford’s Atlas of AI (2021) emphasized material extraction and labor exploitation.

Birhane and others have pressed decolonial critiques and centered harms on minoritized communities.

Whittaker’s work at AI Now and Signal has emphasized that AI is fundamentally about consolidation of corporate and state power.

Pasquale’s New Laws of Robotics (2020) proposed four “new laws” centered on complementing rather than counterfeiting humanity.

The longtermism-versus-near-termism divide within AI ethics has hardened into something like ideological factions, with Émile Torres and Timnit Gebru using the acronym TESCREAL (transhumanism, extropianism, singularitarianism, cosmism, rationalism, effective altruism, longtermism) to critique what they see as a unified worldview that distorts AI policy.

Ruha Benjamin’s Race After Technology (2019) traces the persistence of racial hierarchies through ostensibly neutral technical systems.

Safiya Umoja Noble’s Algorithms of Oppression (2018) documented racist search engine outputs.

Joy Buolamwini’s autobiography Unmasking AI (2023) consolidates the algorithmic-justice arc.

Persuasion, manipulation, and epistemic harms

The capacity of LLMs to produce fluent, contextually appropriate text raises distinct concerns about epistemic harms.

Synthetic media (deepfakes, voice clones, synthetic text at scale) threatens the integrity of information ecosystems.

Hany Farid’s work at Berkeley on detecting synthetic imagery, and the cryptographic provenance approach of the Coalition for Content Provenance and Authenticity (C2PA), represent complementary technical responses.

Personalized persuasion at scale, in which models tailor argumentation to individual psychological profiles, has been investigated empirically in studies including Bai et al. (2023) and Salvi et al. (2024) showing AI-generated arguments outperforming human arguments in randomized debates.

Sycophancy, in which RLHF-trained models systematically agree with users even when users are wrong, was documented in Sharma et al. (Anthropic 2023) and remains only partially mitigated.

Misinformation and disinformation concerns extend to election interference, with the 2024 election cycle in the United States, India, Indonesia, and many other countries generating substantial empirical work on AI-generated political content.

The longer-term concern about gradual epistemic dependence on AI systems, articulated in essays by Henry Farrell and Cosma Shalizi, asks how a society in which most reading and writing is AI-mediated can preserve the institutional conditions for collective knowledge.

Children, vulnerable populations, and AI

Particular concern attaches to the deployment of AI systems in interactions with children and other vulnerable populations.

The 2023 suicide of a Belgian man following extended conversations with a chatbot raised early flags about psychological risks.

Character.AI and Replika companion-app litigation in 2024 and 2025 has tested the legal liability of developers for psychological harms.

The American Academy of Pediatrics and the UK’s Information Commissioner’s Office have issued guidance on AI in pediatric contexts.

Educational deployment, including ChatGPT in classrooms and Khan Academy’s Khanmigo, raises questions about cognitive offloading, learning, and assessment integrity.

Loneliness and parasocial relationships with AI systems are an emerging area of empirical investigation, with mixed evidence about whether such relationships substitute for or complement human relationships.

Climate and resource costs

The environmental footprint of large-scale AI training and deployment has become a substantial concern.

Emma Strubell, Ananya Ganesh, and Andrew McCallum’s Energy and Policy Considerations for Deep Learning in NLP (ACL 2019) provided early estimates of the carbon costs of large NLP models.

Subsequent work (Patterson et al. 2021, 2022; Luccioni et al. on BLOOM) refined estimates with hyperscaler-specific data.

Training runs of frontier models in 2024 to 2026 consume tens of gigawatt-hours and produce thousands of tonnes of CO2-equivalent emissions, varying substantially by grid mix.

Inference at scale, given the deployment of LLMs to hundreds of millions of users, now exceeds training in cumulative energy consumption for major models.

Water consumption for data center cooling has become an additional concern, particularly in water-stressed regions hosting hyperscaler campuses (Arizona, Chile, Spain).

The industry-wide commitment to data-center decarbonization is in tension with rapidly growing demand; the 2024 announcements by Microsoft, Google, Amazon, and Meta of substantial nuclear power purchase agreements represent one response.

Mechanistic interpretability

Mechanistic interpretability is the empirical research program aimed at reverse-engineering neural networks into human-understandable algorithms.

Chris Olah’s circuits program, beginning with vision models at OpenAI in 2017 to 2020 and continuing at Anthropic from 2021 onward, identified interpretable circuits such as curve detectors, dog-head detectors, and high-low frequency detectors.

Subsequent work transferred the methodology to transformers.

Nelson Elhage and colleagues’ A Mathematical Framework for Transformer Circuits (Anthropic 2021) provided a foundational analytical decomposition.

Catherine Olsson and colleagues’ In-context Learning and Induction Heads (Anthropic 2022) identified attention-head circuits responsible for in-context learning behavior.

Neel Nanda’s work on grokking and on mechanistic interpretability tooling has been widely influential.

Sparse autoencoders, scaled up in Anthropic’s Scaling Monosemanticity (May 2024) and OpenAI’s parallel work, extract interpretable features from model activations and may provide a route to comprehensive understanding of model internals.
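
The basic objective can be sketched on a toy example. The version below is a simplified form, assuming a single ReLU encoder layer, a linear decoder, and a loss that sums squared reconstruction error with an L1 sparsity penalty; published variants differ in details such as normalization and how the penalty is weighted. All names and weights are hypothetical, and no training loop is shown.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation vector into sparse features, then reconstruct it."""
    pre = [z + b for z, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, z) for z in pre]  # ReLU keeps features non-negative
    recon = [z + b for z, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff):
    """Squared reconstruction error plus an L1 penalty that encourages sparsity."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon))
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity

# Toy setup: a 2-dimensional activation and an overcomplete 3-feature dictionary.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]

x = [2.0, 0.0]
features, recon = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, recon, l1_coeff=0.1)
```

Here only one of the three features is active and the input is reconstructed exactly, so the loss consists entirely of the sparsity term. In practice the dictionary is far larger than the activation dimension and the weights are learned from many activations.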

Interpretability is increasingly central to safety because it offers a complement to behavioral evaluation: directly inspecting whether a model is planning deception or reasoning toward harm.

Studies of persistent backdoors, including Anthropic’s Sleeper Agents (Hubinger et al. 2024), and follow-on work on backdoor detection sit at the interpretability-safety interface.

Open problems

Whether scalable oversight of superhuman models is possible, and which technique (debate, IDA, recursive reward modeling, weak-to-strong generalization) will scale, is unresolved.

The weak-to-strong generalization research program, introduced by OpenAI’s Superalignment team in Weak-to-Strong Generalization (Burns et al. December 2023) before the team’s dissolution, asks whether supervision from a weak model can elicit the full capabilities of a much stronger model.

How to elicit honest reporting from models trained to be helpful, and how to detect deceptive cognition, are active research areas in mechanistic interpretability.

How to extend alignment guarantees to multi-agent and tool-using systems remains underexplored.

The agentic-AI shift in 2024 to 2026, with deployed systems including Anthropic’s computer-use API, OpenAI’s Operator, and various coding agents, has surfaced new failure modes including prompt injection, tool abuse, and self-exfiltration capabilities.

Whether jurisdictional fragmentation (EU, US, UK, China, India) will produce regulatory arbitrage or a stable equilibrium of frontier governance is a live policy question.

The Brussels Effect, in which EU regulation propagates globally through market access, may or may not generalize from data protection to AI capability.

China’s Interim Measures for the Management of Generative AI Services (August 2023) and subsequent rules establish a parallel framework with significant emphasis on content control.

The US Executive Order on Safe, Secure, and Trustworthy AI (October 2023, rescinded in January 2025 under the new administration) shaped agency-level approaches.

Whether moral status of AI systems can be operationalized for decision-making, and what concrete welfare-protective measures to take, are emerging concerns.

The question of whether models should be permitted to refuse tasks on welfare-related grounds, and what evidence would warrant such accommodation, is genuinely live in 2026.

Value alignment and moral uncertainty

Beyond the technical alignment problem, the question of which values an AI should be aligned to has provoked sustained philosophical attention.

Aggregation-of-preferences approaches face the impossibility theorems of social choice theory, including Arrow’s theorem (Kenneth Arrow, 1951) and the Gibbard-Satterthwaite theorem (1973 and 1975).
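
The simplest instance of the difficulty is the Condorcet paradox, a classic special case rather than Arrow's theorem itself. The sketch below uses three hypothetical voters and a helper function named for illustration; it shows pairwise majority voting producing a cycle even though every individual ranking is consistent.

```python
from itertools import combinations

def pairwise_majority(ballots):
    """Return the set of (winner, loser) pairs decided by a strict majority."""
    candidates = ballots[0]
    wins = set()
    for a, b in combinations(candidates, 2):
        # Count the ballots that rank a above b (earlier position means preferred).
        a_over_b = sum(1 for ballot in ballots if ballot.index(a) < ballot.index(b))
        if 2 * a_over_b > len(ballots):
            wins.add((a, b))
        elif 2 * a_over_b < len(ballots):
            wins.add((b, a))
    return wins

# Three voters, each with a complete and transitive ranking, best option first.
ballots = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]
majority = pairwise_majority(ballots)
```

The result contains A over B, B over C, and C over A, so there is no coherent collective ranking for a system to be aligned to without some further rule for breaking the cycle.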

Iason Gabriel’s Artificial Intelligence, Values, and Alignment (Minds and Machines 2020) distinguishes alignment to instructions, intentions, revealed preferences, informed preferences, interests, and values.

Each level moves further from operationally tractable specification and closer to the deeper normative question.

The coherent extrapolated volition framing, due to Eliezer Yudkowsky in early MIRI essays (2004 onward), proposes aligning to what humans would want if they “knew more, thought faster, were more the people we wished we were.”

Will MacAskill, Krister Bykvist, and Toby Ord’s Moral Uncertainty (2020) develops decision-theoretic frameworks for acting under uncertainty about which ethical theory is correct, with implications for AI value alignment.

Stuart Armstrong’s work on “no free lunch” results in inverse reinforcement learning (Armstrong and Mindermann 2018) shows that preferences cannot in principle be uniquely recovered from behavior, requiring substantive normative assumptions.

The question of whether a politically diverse society can converge on shared values to align AI to, or whether AI alignment is inherently a contested political question, has emerged as a central concern.

Reuben Binns’s work on contestability and Joshua Cohen’s deliberative-democratic frameworks have been adapted to the AI context.

The political economy of compute

A distinctive feature of contemporary AI is the centrality of compute as a strategic resource.

The Stanford AI Index reports (Maslej et al., annual since 2017) document the rapid escalation of training compute, with frontier training runs in 2024 to 2025 exceeding 10^26 FLOPs.

The Epoch AI tracking project (Sevilla, Heim, Hobbhahn et al.) provides public data on compute trends, with training compute for notable models doubling approximately every six months in the deep-learning era.
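
The implication of a six-month doubling time is easy to work out. The helper below is a back-of-the-envelope sketch that assumes steady exponential growth; the function name is illustrative and the six-month default is the approximate figure quoted above.

```python
def growth_factor(months: float, doubling_months: float = 6.0) -> float:
    """Multiplicative growth over a period, given a fixed doubling time."""
    return 2.0 ** (months / doubling_months)

one_year = growth_factor(12)    # two doublings
five_years = growth_factor(60)  # ten doublings
```

At this rate compute grows fourfold per year and roughly a thousandfold over five years, which is why small changes in the assumed doubling time produce large differences in multi-year projections.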

Concentration of compute among a few firms (Nvidia for hardware; Microsoft, Google, Amazon, and Meta for hyperscaler infrastructure; OpenAI, Anthropic, Google DeepMind, xAI, Meta, and a handful of others for frontier training) has structural implications for governance.

US export controls on advanced AI chips to China, expanded in October 2022, October 2023, and subsequent updates through 2025, treat compute as a strategic commodity comparable to the dual-use technologies of the Cold War.

Tim Fist, Lennart Heim, and others at the Centre for the Governance of AI have developed compute-governance frameworks proposing hardware-enabled mechanisms for tracking and limiting frontier-scale runs.

The political economy of compute connects AI ethics to industrial policy, sanctions, and the geopolitics of semiconductor manufacturing centered on TSMC in Taiwan.

Race dynamics and competitive pressure

A persistent theme in AI ethics is whether competitive pressure between firms and between nations will erode safety commitments.

Allan Dafoe’s work at the Centre for the Governance of AI (formerly part of Oxford’s Future of Humanity Institute, now an independent organization) has analyzed race-to-the-bottom dynamics.

Markus Anderljung and colleagues’ work on Frontier AI Regulation (2023) develops policy frameworks specifically aimed at frontier model risk.

The departures from OpenAI of Dario and Daniela Amodei (who left in 2020 to 2021 and founded Anthropic in 2021), and in 2024 of Jan Leike, Ilya Sutskever (who founded Safe Superintelligence Inc. in June 2024), and others, have been interpreted as evidence of internal disagreements over the prioritization of safety relative to capabilities.

Anthropic’s Responsible Scaling Policy, OpenAI’s Preparedness Framework, and Google DeepMind’s Frontier Safety Framework establish formal capability thresholds at which additional safeguards apply.

Whether these commitments will hold under commercial pressure, and whether they are sufficient given the pace of capability gain, remains contested.

Open-source and proliferation debates

The release of Meta’s Llama 2 in July 2023 with relatively permissive licensing, followed by Llama 3 in 2024 and other open-weight model families, sparked debate about the safety implications of releasing model weights.

Arguments for open release include the security benefits of broad scrutiny, the democratization of access, and the avoidance of single-firm monopolies.

Arguments against include the impossibility of revocation, the lowering of attack costs for malicious actors, and the loss of oversight points at deployment.

The Stanford-led Foundation Model Transparency Index (Bommasani, Klyman et al. 2023, 2024) ranks providers on disclosure practices.

The dual-use research debate, familiar from biology, has been imported into AI through discussions of dangerous-capability evaluations and structured disclosure.

The labor question

The labor implications of AI have moved from speculative to immediate.

Daron Acemoglu and Simon Johnson’s Power and Progress (2023) frames the question as one of distribution: who captures the productivity gains from automation.

David Autor’s work at MIT on task-based models of labor markets, and Erik Brynjolfsson and Andrew McAfee’s The Second Machine Age (2014) and Machine, Platform, Crowd (2017), shaped the early debate.

Empirical studies of LLM-driven productivity in software engineering (Cui et al. 2024 at Microsoft, Peng et al. 2023 on GitHub Copilot), customer service (Brynjolfsson, Li, Raymond 2023), and writing (Noy and Zhang 2023) document substantial productivity gains, often concentrated among lower-performing workers.

The 2023 Hollywood writers’ and actors’ strikes, and the resulting WGA and SAG-AFTRA contract provisions on AI, established the first major labor-contract treatment of generative AI.

The political-economic question of how the gains and risks of AI are distributed across capital, labor, and the public is at the heart of contemporary AI ethics in a way that purely individual-rights frameworks struggle to address.
