Seth Herd

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making. I've focused on the emergent interactions that are needed to explain complex thought. I was increasingly concerned with AGI applications of the research, and reluctant to publish my best ideas. I'm incredibly excited to now be working directly on alignment, currently with generous funding from the Astera Institute. More info and publication list here.


Comments

The important thing for alignment work isn't the median prediction; if we only had an alignment solution by then, we'd have a 50% chance of AGI arriving first, and so of dying from that lack.

I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.

I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his estimate the most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development if not compute progress.

Estimates in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's have much wider distributions.

That all makes sense. To expand a little more on some of the logic:

It seems like the outcome of a partial pause rests in part on whether that would tend to put people in the lead of the AGI race who are more or less safety-concerned.

I think it's nontrivial that we currently have three teams in the lead who all appear to honestly take the risks very seriously, and changing that might be a very bad idea.

On the other hand, the argument for alignment risks is quite strong, and we might expect more people to take the risks more seriously as those arguments diffuse. This might not happen if polarization becomes a large factor in beliefs on AGI risk. The evidence for climate change was also pretty strong, but we saw half of America believe in it less, not more, as evidence mounted. The lines of polarization would be different in this case, but I'm afraid it could happen. I outlined that case a little in AI scares and changing public beliefs.

In that case, I think a partial pause would have a negative expected value, as the current lead decayed, and more people who believe in risks less get into the lead by circumventing the pause.

This makes me highly unsure whether a pause would be net-positive. Having alignment solutions won't help if they're not implemented because the alignment taxes are too high.

The creation of compute overhang is another reason to worry about a pause. It's highly uncertain how far we are from making adequate compute for AGI affordable to individuals. Algorithms and compute will keep getting better during a pause. So will theory of AGI, along with theory of alignment.

This puts me, and I think the alignment community at large, in a very uncomfortable position of not knowing whether a realistic pause would be helpful.

It does seem clear that creating mechanisms and political will for a pause is a good idea.

Advocating for more safety work also seems clear cut.

To this end, I think it's true that you create more political capital by successfully pushing for policy.

A pause now would create even more capital, but it's also less likely to be a win, and it could wind up creating polarization and so costing rather than creating capital. It's harder to argue for a pause now, when even most alignment folks think we're years from AGI.

So perhaps the low-hanging fruit is pushing for voluntary RSPs and government funding for safety work. These are clear improvements, and likely to be wins that create capital for a pause as we get closer to AGI.

There's a lot of uncertainty here, and that's uncomfortable. More discussion like this should help resolve that uncertainty, and thereby help clarify and unify the collective will of the safety community.

That is indeed a lot of points. Let me try to parse them and respond, because I think this discussion is critically important.

Point 1: overhang.

Your first two paragraphs seem to be pointing to downsides of progress, and saying that it would be better if nobody made that progress. I agree. We don't have guaranteed methods of alignment, and I think our odds of survival would be much better if everyone went way slower on developing AGI.

The standard thinking, which could use more inspection but which I agree with, is that this is simply not going to happen. Individuals who decide to step aside slow progress only slightly. This leaves a compute overhang that someone else is going to take advantage of, with nearly the same competence, and only slightly slower. The individuals who pick up the banner and create AGI will not be infinitely reckless, but the faster progress from that overhang will make whatever level of caution they have less effective.

This is a separate argument from regulation. Adequate regulation will slow progress universally, rather than leaving it up to the wisdom and conscience of every individual who might decide to develop AGI.

I don't think it's impossible to slow and meter progress so that overhang isn't an issue. But I think it is effectively even harder than alignment. We have decent suggestions on the table for alignment now, and as far as I know, no equally promising suggestions for getting everyone (and it does take almost everyone coordinating) to pass up the immense opportunities offered by capabilities overhangs.

Point 2: Are LLMs safer than other approaches?

I agree that this is a questionable proposition. I think it's worth questioning. Aiming progress at easier-to-align approaches seems highly worthwhile.

I agree that an LLM may have something like a mind inside. I think current versions are almost certainly too dumb to be existentially dangerous (at least directly - if a facebook algorithm can nearly cause an insurrection, who knows what dangerous side effects any AI can have).

I'm less worried about GPT10 playing a superintelligent, Waluigi-collapsed villain than I am about a GPT6 that has been amplified to agency, situational awareness, and weak superintelligence by scaffolding it into something like a cognitive architecture. I think this type of advance is inevitable. ChatGPT extensions and Bing Chat both use internal prompting to boost intelligence, and approaches like SmartGPT and Tree of Thoughts massively improve benchmark results over the base LLM.

Fortunately, this direction also has huge advantages for alignment. It has a very low alignment tax, since you can give such agents additional goals in natural language, like "support human empowerment" or whatever the SOTA alignment goal is. And they have vastly better interpretability, since they're at least summarizing their thoughts in natural language.

Here's where your skepticism that they're being honest about summarizing those thoughts comes into full force. I agree that it's not reliable; for instance, changing the intermediate answer in chain of thought prompting often doesn't change the final output, indicating that that output was for show.

However, a safer setup is to never use the same model instance twice. When you use chain-of-thought reasoning, construct a new context with the relevant information from memory; don't just let the context window accrue, since that allows fake chains-of-thought and the collapse of the simulator into a waluigi.

Scaffolding should not turn a single LLM into an agent, but rather create a committee of LLM calls, each answering the individual questions needed to accomplish the committee's goals.
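
To make the pattern concrete, here's a minimal sketch of the "fresh context per question" committee idea, assuming a generic call_llm(prompt) helper and a toy Memory store; both are hypothetical stand-ins, not any particular framework's API:

```python
# Hypothetical sketch of the "fresh context per question" committee pattern.
# call_llm() and Memory are illustrative stand-ins, not a real library.

def call_llm(prompt: str) -> str:
    """Stand-in for a single, stateless LLM call."""
    raise NotImplementedError

class Memory:
    """Stand-in for external storage of intermediate results."""
    def __init__(self):
        self.notes = []

    def add(self, note: str):
        self.notes.append(note)

    def relevant_to(self, question: str) -> str:
        # A real system would search; here we just return everything saved so far.
        return "\n".join(self.notes)

def answer_subquestion(question: str, memory: Memory) -> str:
    # Build a brand-new context from stored facts instead of letting one chat
    # context accrue; no call sees another call's chain of thought directly.
    prompt = f"Relevant notes:\n{memory.relevant_to(question)}\n\nQuestion: {question}"
    answer = call_llm(prompt)
    memory.add(f"Q: {question}\nA: {answer}")
    return answer

def committee_run(goal: str, subquestions: list[str]) -> str:
    memory = Memory()
    memory.add(f"Overall goal: {goal}")
    for q in subquestions:
        answer_subquestion(q, memory)
    return answer_subquestion("Given the notes above, what should be done next?", memory)
```

The point of the design is that each call starts from a curated summary of what matters, so a persona collapse or fake chain-of-thought in one call doesn't silently propagate to the next.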

This isn't remotely a solution to the alignment problem, but it really seems to have massive upsides, and only the same downsides as other practically viable approaches to AGI.

To be clear, I only see some form of RL agents as the other practical possibility, and I like our odds much less with those.

I think there are other, even more readily alignable approaches to AGI. But they all seem wildly impractical. I think we need to get ready to align the AGI we get, rather than just preparing to say I-told-you-so after the world refuses to forego massive incentives to take a much slower but safer route to AGI.

To paraphrase, we need to go to the alignment war with the AGI we get, not the AGI we want.

I really like your recent series of posts that succinctly address common objections/questions/suggestions about alignment concerns. I'm making a list to show my favorite skeptics (all ML/AI people; nontechnical people, as Connor Leahy puts it, tend to respond "You fucking what? Oh hell no!" or similar when informed that we are going to make genuinely smarter-than-us AI soonish).

We do have ways to get an AI to do what we want. The hardcoded algorithmic maximizer approach seems to be utterly impractical at this point. That leaves us with approaches that don't obviously do a good job of preserving their own goals as they learn and evolve:

  1. Training a system to pursue things we like, as in shard theory and similar approaches.
  2. Training or hand-coding a critic system, as in the approaches outlined by Steve Byrnes and me, as well as many others, nicely summarized as a Steering systems approach. This seems a bit less sketchy than training in our goals and hoping they generalize adequately, but still pretty sketchy. 
  3. Telling the agent what to do in a Natural language alignment approach. This seems absurdly naive. However, I'm starting to think our first human-plus AGIs will be wrapped or scaffolded LLMs, and they actually think in natural language to a nontrivial degree. People are right now specifying goals in natural language, and those can include alignment goals (or destroying humanity, haha). I just wrote an in-depth post on the potential Capabilities and alignment of LLM cognitive architectures, but I don't have a lot to say about stability in that post.

 

None of these directly address what I'm calling The alignment stability problem, to give a name to what you're addressing here. I think addressing it will work very differently in each of the three approaches listed above, and might well come down to implementational details within each approach. I think we should be turning our attention to this problem along with the initial alignment problems, because some of the optimism in the field stems from thinking about initial alignment and not long-term stability.

Edit: I left out Ozyrus's posts on approach 3. He's the first person I know of to see agentized LLMs coming, outside of David Shapiro's 2021 book. His post was written a year ago and posted two weeks ago to avoid infohazards. I'm sure there are others who saw this coming more clearly than I did, but I thought I'd try to give credit where it's due.

Great analysis. I'm impressed by how thoroughly you've thought this through in the last week or so. I hadn't gotten as far. I concur with your projected timeline, including the difficulty of putting time units onto it. Of course, we'll probably both be wrong in important ways, but I think it's important to at least try to do semi-accurate prediction if we want to be useful.

I have only one substantive addition to your projected timeline, but I think it's important for the alignment implications.

LLM-bots are inherently easy to align. At least for surface-level alignment. You can tell them "make me a lot of money selling shoes, but also make the world a better place" and they will try to do both. Yes, there are still tons of ways this can go off the rails. It doesn't solve outer alignment or alignment stability, for a start. But GPT4's ability to balance several goals, including ethical ones, and to reason about ethics, is impressive.[1] You can easily make agents that both try to make money and think about not harming people.

In short, the fact that you can do this is going to seep into the public consciousness, and we may see regulations and will definitely see social pressure to do this.

I think the agent disasters you describe will occur, but they will happen to people who don't put safeguards into their bots, like "track how much of my money you're spending, stop if it hits $X, and check with me". When agent disasters affect other people, the media will blow them sky high, and everyone will say "why the hell didn't you have your bot worry about wrecking things for others?". Those who do put additional ethical goals into their agents will crow about it. There will be pressure to conform and run safe bots. As bot disasters get more clever, people will take the possibility of a big bot disaster more seriously.
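
For concreteness, here's a minimal sketch of that kind of spending safeguard; SpendingGuard, ask_human_for_approval, and the dollar figures are hypothetical illustrations, not any real bot framework:

```python
# Hypothetical sketch of a spending cap with a "check with me" step.
class SpendingGuard:
    def __init__(self, limit_dollars: float):
        self.limit = limit_dollars
        self.spent = 0.0

    def authorize(self, cost: float) -> bool:
        """Approve a purchase under the cap; otherwise defer to the human."""
        if self.spent + cost > self.limit:
            if not ask_human_for_approval(cost):
                return False
        self.spent += cost
        return True

def ask_human_for_approval(cost: float) -> bool:
    """The 'check with me' step: block and ask before exceeding the budget."""
    answer = input(f"Agent wants to spend ${cost:.2f}, exceeding its budget. Allow? (y/n) ")
    return answer.strip().lower() == "y"
```

The agent's scaffold would call guard.authorize(cost) before every purchase; nothing clever, but it's exactly the kind of cheap safeguard I expect social pressure to push people toward.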

Will all of that matter? I don't know. But predicting the social and economic backdrop for alignment work is worth trying.

Edit: I finished my own followup post on the topic, Capabilities and alignment of LLM cognitive architectures. It's a cognitive psychology/neuroscience perspective on why these things might work better, and faster, than you'd intuitively think. Improvements to the executive function (the outer script code) and episodic memory (Pinecone or other vector search over saved text files) will interact, so that improvements in each make the rest of the system work better and easier to improve.
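
As a toy sketch of how those two pieces interact, here's a minimal executive loop plus an episodic memory built as vector search over saved text snippets. The names (embed, VectorMemory, executive_loop) and the toy embedding are illustrative assumptions; a real system would call an embedding model and a store like Pinecone rather than this stand-in:

```python
# Toy sketch: an outer "executive" script loop plus episodic memory as
# vector search over saved text. Illustrative only, not a real library's API.
import math

def embed(text: str) -> list[float]:
    """Stand-in embedding; a real system would call an embedding model."""
    vec = [0.0] * 64
    for i, ch in enumerate(text.lower()):
        vec[i % 64] += ord(ch) / 1000.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class VectorMemory:
    """Episodic memory: saved text snippets retrieved by embedding similarity."""
    def __init__(self):
        self.items = []  # (embedding, text) pairs from past steps

    def save(self, text: str):
        self.items.append((embed(text), text))

    def recall(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [text for _, text in ranked[:k]]

def executive_loop(task: str, memory: VectorMemory, llm, max_steps: int = 5):
    """Outer script: recall relevant episodes, take a step, store the result."""
    for _ in range(max_steps):
        context = "\n".join(memory.recall(task))
        step_result = llm(f"Task: {task}\nRelevant memories:\n{context}\nNext step?")
        memory.save(step_result)
```

The interaction I have in mind is visible even in this toy: a better executive script asks better queries, which makes the memory more useful, which in turn makes the script's next step easier to get right.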

 

 

  1. ^

    I did a little informal testing, asking for responses in hypothetical situations where ethical and financial goals collide, and it did a remarkably good job, including coming up with win/win solutions that would've taken me a while to find. It looked like the ethical and financial reasoning of a pretty intelligent, and fairly ethical, person.

I read your linked shortform thread. I agreed with most of your arguments against some common AGI takeover arguments. I agree that they won't coordinate against us and won't have "collective grudges" against us.

But I don't think the arguments for continued stability are very thorough, either. I think we just don't know how it will play out. And I think there's a reason to be concerned that takeover will be rational for AGIs, where it's not for humans.

The central difference in logic is the capacity for self-improvement. In your post, you addressed self-improvement by linking a Christiano piece on slow takeoff. But he noted at the start that he wasn't arguing against self-improvement, only that the pace of self-improvement would be more modest. The potential implications for a balance of power in the world remain.

Humans are all locked to a similar level of cognitive and physical capabilities. That has implications for game theory where all of the competitors are humans. Cooperation often makes more sense for humans. But the same isn't necessarily true of AGI. Their cognitive and physical capacities can potentially be expanded on. So it's (very loosely) like the difference between game theory in chess, and chess where one of the moves is to add new capabilities to your pieces. We can't learn much about the new game from theory of the old, particularly if we don't even know all of the capabilities that a player might add to their pieces.

More concretely: it may be quite rational for a human controlling an AGI to tell it to try to self-improve and develop new capacities, strategies and technologies to potentially take over the world. With a first-mover advantage, such a takeover might be entirely possible. Its capacities might remain ahead of the rest of the world's AI/AGIs if they hadn't started to aggressively self-improve and develop the capacities to win conflicts. This would be particularly true if the aggressor AGI was willing to cause global catastrophe (e.g., EMPs, bringing down power grids).

The assumption of a stable balance of power in the face of competitors that can improve their capacities in dramatic ways seems unlikely to be true by default, and at the least, worthy of close inspection. Yet I'm afraid it's the default assumption for many.

Your shortform post is more on-topic for this part of the discussion, so I'm copying this comment there and will continue there if you want. It's worth more posts; I hope to write one myself if time allows.

Edit: It looks like there's an extensive discussion there, including my points here, so I won't bother copying this over. As far as I could tell, neither you nor anyone else had really addressed the destabilizing effect of potential AGI self-improvement. So I continue to think that a massively multipolar AGI scenario probably results fairly quickly in conflict and potential catastrophe.

In the near term, AI and search are blurred, but that's a separate topic. This post was about AGI as distinct from AI. There's no sharp line between them, but there are important distinctions, and I'm afraid we're confused as a group because of that blurring. More above, and it's worth its own post and some sort of new clarifying terminology. The term AGI has been watered down to include LLMs that are fairly general, rather than the original and important meaning of AI that can think about anything, implying the ability to learn, and therefore almost necessarily to have explicit goals and agency. This was about that type of "real" AGI, which is still hypothetical even though increasingly plausible in the near term.

Yes, we do see such "values" now, but that's a separate issue IMO.

There's an interesting thing happening in which we're mixing discussions of AI safety and AGI x-risk. There's no sharp line, but I think they are two importantly different things. This post was intended to be about AGI, as distinct from AI. Most of the economic and other concerns about the "alignment" of AI are not relevant to the alignment of AGI.

This thesis could be right or wrong, but let's keep it distinct from theories about AI in the present and near future. My thesis here (and a common thesis) is that we should be most concerned about AGI that is an entity with agency and goals, like humans have. AI as a tool is a separate thing. It's very real and we should be concerned with it, but not let it blur into categorically distinct, goal-directed, self-aware AGI.

Whether or not we actually get such AGI is an open question that should be debated, not assumed. I think the answer is very clearly that we will, and soon; as soon as tool AI is smart enough, someone will make it agentic, because agents can do useful work, and they're interesting. So I think we'll get AGI with real goals, distinct from the pseudo-goals implicit in current LLMs behavior.

The post addresses such "real" AGI that is self-aware and agentic. AGI whose sole goal is doing what people want is pretty much a third thing, and a somewhat counterintuitive one.

Thanks for engaging. I did read your linked post. I think you're actually in the majority in your opinion on AI leading to a continuation and expansion of business as usual. I've long been curious about this line of thinking; while it makes a good bit of sense to me for the near future, I become confused at the "indefinite" part of your prediction.

When you say that AI continues from the first step indefinitely, it seems to me that you must believe one or more of the following:

  • No one would ever tell their arbitrarily powerful AI to take over the world
    • Even if it might succeed
  • No arbitrarily powerful AI could succeed at taking over the world
    • Even if it was willing to do terrible damage in the process
  • We'll have a limited number of humans controlling arbitrarily powerful AI
    • And an indefinitely stable balance-of-power agreement among them
  • By "indefinitely" you mean only until we create and proliferate really powerful AI

If I believed in any of those, I'd agree with you. 

Or perhaps I'm missing some other belief we don't share that leads to your conclusions.

Care to share?

 

Separately, in response to that post: the post you linked was titled AI values will be shaped by a variety of forces, not just the values of AI developers. In my prediction here, AI and AGI will not have values in any important sense; they will merely carry out the values of their principals (their creators, or the government that shows up to take control). This might be just a terminological distinction, except for the following bit of implied logic: I don't think AI needs to share clients' values to be of immense economic and practical advantage to them. When (if) someone creates a highly capable AI system, they will instruct it to serve customers' needs in certain ways, including following their requests within certain limits; they won't need to change the A(G)I's core values (if it has any) to make enormous profits licensing it to clients. To the extent this is correct, we should go on assuming that AI will share, or at least follow, its creators' values (or, IMO more likely, take orders and values from the government that takes control, citing security concerns).

I'm sure it's not the same, particularly since neither one has really been fully fleshed out and thought through. In particular, Yudkowsky doesn't focus on the advantages of instructing the AGI to tell you the truth, and interacting with it as it gets smarter. I'd guess that's because he was still anticipating a faster takeoff than network-based AGI affords. 

But to give credit where it's due, I think that literal instruction-following was probably part of (but not the whole of) his conception of task-based AGI. From the discussion thread with Paul Christiano following the task-directed AGI article on Greater Wrong:

The AI is getting short-term objectives from humans and carrying them out under some general imperative to do things conservatively or with ‘low unnecessary impact’ in some sense of that, and describes plans and probable consequences that are subject to further human checking, and then does them, and then the humans observe the results and file more requests.

And the first line of that article:

A task-based AGI is an AGI intended to follow a series of human-originated orders, with these orders each being of limited scope [...]

These sections, together with the lack of reference to instructions and checking in most of the presentation, suggest to me that he was probably thinking both of things like hard-coding the AGI to design nanotech, melt down GPUs (or whatever), and then delete itself, and of more online, continuous instruction-following AGI closer to my conception of likely AGI projects. Bensinger may have been pursuing one part of that broader conception.
