LESSWRONG
Reward Functions
• Applied to Reward hacking behavior can generalize across tasks by Kei Nishimura-Gasparian 1mo ago
• Applied to Speedrun ruiner research idea by lukehmiles 2mo ago
• Applied to Utility ≠ Reward by Oliver Sourbut 6mo ago
• Applied to Intrinsic Drives and Extrinsic Misuse: Two Intertwined Risks of AI by jacobjacob 8mo ago
• Applied to VLM-RM: Specifying Rewards with Natural Language by ChengCheng 8mo ago
• Applied to Some alignment ideas by SelonNerias 10mo ago
• Applied to self-improvement-executors are not goal-maximizers by bhauth 1y ago
• Applied to Shutdown-Seeking AI by Simon Goldstein 1y ago
• Applied to Language Agents Reduce the Risk of Existential Catastrophe by cdkg 1y ago
• Applied to A Short Dialogue on the Meaning of Reward Functions by Leon Lang 2y ago
• Applied to Learning societal values from law as part of an AGI alignment strategy by John Nay 2y ago
• Applied to Scaling Laws for Reward Model Overoptimization by David Gross 2y ago
• Applied to Four usages of "loss" in AI by TurnTrout 2y ago
• Applied to Reward IS the Optimization Target by RobertM 2y ago
• Applied to Leveraging Legal Informatics to Align AI by John Nay 2y ago
• Applied to An investigation into when agents may be incentivized to manipulate our beliefs. by RobertM 2y ago
• Applied to Seriously, what goes wrong with "reward the agent when it makes you smile"? by TurnTrout 2y ago
• Applied to Reward is not the optimization target by TurnTrout 2y ago