LESSWRONG
Reward Functions
• Applied to Reward hacking behavior can generalize across tasks by Kei Nishimura-Gasparian 1mo ago
• Applied to Speedrun ruiner research idea by lukehmiles 2mo ago
• Applied to Utility ≠ Reward by Oliver Sourbut 6mo ago
• Applied to Intrinsic Drives and Extrinsic Misuse: Two Intertwined Risks of AI by jacobjacob 8mo ago
• Applied to VLM-RM: Specifying Rewards with Natural Language by ChengCheng 8mo ago
• Applied to Some alignment ideas by SelonNerias 10mo ago
• Applied to self-improvement-executors are not goal-maximizers by bhauth 1y ago
• Applied to Shutdown-Seeking AI by Simon Goldstein 1y ago
• Applied to Language Agents Reduce the Risk of Existential Catastrophe by cdkg 1y ago
• Applied to A Short Dialogue on the Meaning of Reward Functions by Leon Lang 2y ago
• Applied to Learning societal values from law as part of an AGI alignment strategy by John Nay 2y ago
• Applied to Scaling Laws for Reward Model Overoptimization by David Gross 2y ago
• Applied to Four usages of "loss" in AI by TurnTrout 2y ago
• Applied to Reward IS the Optimization Target by RobertM 2y ago
• Applied to Leveraging Legal Informatics to Align AI by John Nay 2y ago
• Applied to An investigation into when agents may be incentivized to manipulate our beliefs. by RobertM 2y ago
• Applied to Seriously, what goes wrong with "reward the agent when it makes you smile"? by TurnTrout 2y ago
• Applied to Reward is not the optimization target by TurnTrout 2y ago