Open Question: Working with concepts that the human can’t understand
Question: When we need to assemble complex concepts by learning from and interacting with the environment, rather than using H's concepts directly, and when those concepts influence reasoning in subtle/abstract ways, how do we retain corrigibility/alignment?
Paul: I don't have any general answer to this; it seems like we should probably choose some example cases. I'm probably going to be advocating something like "Search over a bunch of possible concepts and find one that does what you want / has the desired properties."
E.g. for elegant proofs, you want a heuristic that gives successful lines of inquiry higher scores. You can explore a bunch of concepts that do that, evaluate each one according to how well it discriminates good from bad lines of inquiry, and also evaluate other stuff like "What would I infer from learning that a proof is `elegant` other than that it will work" and make sure that you are OK with that.
Andreas: Suppose you don't have the concepts of "proof" and "inquiry", but learned them (or some more sophisticated analogs) using the sort of procedure you outlined above. I guess I'm trying to see in more detail that you can do a good job at "making sure you're OK with reasoning in ways X" in cases where X is far removed from H's concepts. (Unfortunately, it seems to be difficult to make progress on this by discussing particular examples, since examples are necessarily about concepts we know pretty well.)
This may be related to the more general question of what sorts of instructions you'd give H to ensure that if they follow the instructions, the overall process remains corrigible/aligned.
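To make the kind of search Paul describes above a bit more concrete, here is a minimal sketch (my illustration, not something from the transcript). The two functions passed in, `discrimination_score` and `side_inferences_ok`, are hypothetical stand-ins for judgments that would actually be made by the (amplified) overseer.

```python
# Hypothetical sketch of "search over possible concepts": score each candidate
# by how well it discriminates good from bad lines of inquiry, and reject any
# candidate whose other implications we wouldn't endorse.
def search_for_concept(candidates, good_examples, bad_examples,
                       discrimination_score, side_inferences_ok):
    best, best_score = None, float("-inf")
    for concept in candidates:
        # e.g. "what would I infer from a proof being `elegant`,
        # other than that it will work, and am I OK with that?"
        if not side_inferences_ok(concept):
            continue
        # How well does this concept separate successful lines of
        # inquiry from unsuccessful ones?
        score = discrimination_score(concept, good_examples, bad_examples)
        if score > best_score:
            best, best_score = concept, score
    return best
```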
Open Question: Severity of “Honest Mistakes”
In the discussion about creative problem solving, Paul said that he was concerned about problems arising when the solution generator was deliberately searching for a solution with harmful side effects. Other failures could occur where the solution generator finds a solution with harmful side effects without "deliberately searching" for it. The question is how bad these "honest mistakes" would end up being.
Paul: I also want to make the further claim that such failures are much less concerning than what-I'm-calling-alignment failures, which is a possible disagreement we could dig into (I think Wei Dai disagrees or is very unsure).
This is one of my main cruxes. I have 2 main concerns about honest mistakes:
1) Compounding errors: IIUC, Paul thinks we can find a basin of attraction for alignment (or at least corrigibility...) so that an AI can help us correct errors online before they compound. This seems plausible, but I don't see any strong reasons to believe it will happen, or that we'll be able to recognize whether or not it is happening.
2) The "progeny alignment problem" (PAP): An honest mistake could result in the creation an unaligned progeny. I think we should expect that to happen quickly if we don't have a good reason to believe it won't. You could argue that humans recognize this problem, so an AGI should as well (and if it's aligned, it should handle the situation appropriately), but that begs the question of how we got an aligned AGI in the first place. There are basically 3 subconcerns here (call the AI we're building "R"):
2a) R can make an unaligned progeny before it's "smart enough" to realize it needs to exercise care to avoid doing so.
2b) R gets smart enough to realize that solving PAP (e.g. by doing something like MIRI's Agent Foundations work) is necessary in order to develop further capabilities safely, and that ends up being a huge roadblock that makes R uncompetitive with less safe approaches.
2c) If R has a discount factor gamma < 1, it could knowingly and rationally decide to build a progeny that is useful through R's effective horizon, but that will take over and optimize a different objective after that (see the note below).
2b and 2c are *arguably* "non-problems" (although they're at least worth taking into consideration). 2a seems like a more serious problem that needs to be addressed.
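To make the "effective horizon" in 2c concrete (my gloss, not something from the discussion): with a discount factor gamma < 1, a payoff t steps in the future is weighted by gamma^t, so the total weight R places on the entire future is 1 + gamma + gamma^2 + ... = 1/(1 - gamma). R therefore effectively cares about roughly the next 1/(1 - gamma) steps (about 100 steps for gamma = 0.99). A progeny that behaves well within that window but pursues a different objective afterwards costs R almost nothing by its own lights, which is why 2c can be a knowing, "rational" choice.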
Paul Christiano, Wei Dai, Andreas Stuhlmüller and I had an online chat discussion recently; the transcript of the discussion is available here. (Disclaimer: it's a nonstandard format, and we weren't optimizing for the transcript being easy to follow.) This discussion was primarily focused on amplification of humans (not later amplification steps in IDA). Below are some highlights from the discussion, along with some questions that were raised that might merit further discussion in the comments.
Highlights
Strategies for sampling from a human distribution of solutions
Dealing with unknown concepts
Limits on what amplification can accomplish
Alignment search for creative solutions
Consider the task of generating a solution to a problem that requires creativity; it can be decomposed into:
Generate solutions
Evaluate those solutions
For solution generation, one idea is to shape the distribution of proposals so you are less likely to get malign answers (i.e. sample from the distribution of answers a human would give, which would hopefully be safer and easier to evaluate than samples from some arbitrary distribution).
I asked Paul whether he thought that safe creative solution generation would require sampling from a less malign distribution, or whether we could solve evaluation ("secure-X-evaluation", i.e. testing whether the solution fulfills property X) well enough to use an arbitrary distribution / brute-force search.
We then dived into how well we could solve secure-X-evaluation. I was particularly interested in questions like how we could evaluate whether a design had potentially harmful side effects.
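To make the generate/evaluate split above concrete, here is a rough sketch (my illustration with hypothetical placeholder functions, not a proposal from the chat): proposals are sampled from something meant to imitate the human answer distribution, and each one must pass an X-evaluator before it is returned.

```python
# Rough generate-then-evaluate loop. `sample_humanlike_answer` stands in for
# sampling from (an approximation of) the distribution of answers a human
# would give (the "less malign" distribution), and `passes_x_evaluation`
# stands in for secure-X-evaluation, i.e. checking that a proposal satisfies
# property X (including not having harmful side effects).
def solve_creatively(problem, sample_humanlike_answer, passes_x_evaluation,
                     max_tries=1000):
    for _ in range(max_tries):
        proposal = sample_humanlike_answer(problem)  # generation step
        if passes_x_evaluation(problem, proposal):   # evaluation step
            return proposal
    return None  # give up rather than return an unvetted proposal
```

The disagreement is roughly about how much of the safety burden each half can carry: with a strong enough evaluator, the sampler could be an arbitrary or brute-force search; with a safer sampling distribution, the evaluator has less to catch.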