ryan_greenblatt

I work at Redwood Research.


Comments

I agree with 1 and think that race dynamics make the situation considerably worse when we only have access to prosaic approaches. (Though I don't think this is the biggest issue with these approaches.)

I expect a period substantially longer than several months by default, due to slower takeoff than this. (More like 2 years than 2 months.)

Insofar as the hope was for governments to step in at some point, I think the best and easiest point for them to step in is actually when AIs are already becoming very powerful:

  • Prior to this point, we don't get substantial value from pausing, especially if we're pausing/dismantling all of semiconductor R&D globally.
  • Prior to this point, AI won't be concerning enough for governments to take aggressive action.
  • At this point, additional time is extremely useful due to access to powerful AIs.
  • The main counterargument is that at this point more powerful AI will also look very attractive. So, it will seem too expensive to stop.

So, I don't really see very compelling alternatives to push on at the margin as far as "metastrategy" goes (though I'm not sure I know exactly what you're pointing at here). Pushing for bigger asks seems fine, but probably less leveraged.

I actually don't think control is a great meme for the interests of labs that purely optimize for power, as it is a relatively legible ask which is potentially considerably more expensive than just "our model looks aligned because we red-teamed it", which is more like the default IMO.

The same way "secure these model weights from China" isn't a great meme for these interests IMO.

I think literal extinction is unlikely even conditional on misaligned AI takeover due to:

  • The potential for the AI to be at least a tiny bit "kind" (in the same way that humans probably wouldn't kill all aliens). [1]
  • Decision theory/trade reasons

This is discussed in more detail here and here.

Insofar as humans and/or aliens care about nature, similar arguments apply there too, though this is mostly beside the point: if humans survive and have (even a tiny bit of) resources, they can easily preserve some of nature.

I find it annoying how confident this article is without really bothering to engage with the relevant arguments here.

(Same goes for many other posts asserting that AIs will disassemble humans for their atoms.)


  1. This includes the potential for the AI to generally have preferences that are morally valuable from a typical human perspective. ↩︎

The Internet seems to agree with you. I wonder why I remember "edit time addition".

ETA = edit time addition

I should probably not use this term, I think I picked up this habit from some other people on LW.

I interpreted the comment as being more general than this. (As in, if someone does something that works out very badly, they should be forced to resign.)

Upon rereading the comment, it reads as less generic than my original interpretation. I'm not sure if I just misread the comment or if it was edited. (Would be nice to see the original version if actually edited.)

(Edit: Also, you shouldn't interpret my comment as an endorsement of or agreement with the rest of the content of Ben's comment.)

I don't see how this is relevant to my comment.

By "positive EV bets" I meant positive EV with respect to shared values, not with respect to personal gain.

Edit: Maybe your view is that leaders should take these bets anyway even though they know they are likely to result in a forced retirement (i.e., ignoring the disincentive). I was actually thinking of the disincentive effect as: you are actually a good leader, so you remaining in power would be good, therefore you should avoid actions that result in you losing power for unjustified reasons. Therefore you should avoid making positive EV bets (as making these bets is now overall negative EV, since it will result in a forced leadership transition, which is bad). More minimally, you strongly select for leaders who don't make such bets.

Do you think that whenever anyone makes a decision that ends up being bad ex-post they should be forced to retire?

Doesn't this strongly disincentivize making positive EV bets which are likely to fail?

Edit: I interpreted this comment as a generic claim about how the EA community should relate to things which went poorly ex-post, I now think this comment was intended to be less generic.

In particular, I don't expect either (any?) lab to be able to resist the temptation to internally deploy models with autonomous persuasion capabilities or autonomous AI R&D capabilities

I agree with this as stated, but don't think that avoiding deploying such models is needed to mitigate risk.

I think various labs are to some extent in denial of this because massively deploying possibly misaligned systems sounds crazy (and is somewhat crazy), but I would prefer if various people realized this was likely the default outcome and prepared accordingly.

More strongly, I think most of the relevant bit of the safety usefulness trade-off curve involves deploying such models. (With countermeasures.)

or is seriously entertaining the idea that we might need to do a lot (>1 year) of dedicated safety work (that potentially involves coming up with major theoretical insights, as opposed to a "we will just solve it with empiricism" perspective) before we are confident that we can control such systems.

I think this is a real possibility, but unlikely to be necessary, depending on the risk target. E.g., I think you can deploy ASL-4 models with <5% risk without theoretical insights, instead just via being very careful with various prosaic countermeasures (mostly control).

<1% risk probably requires stronger stuff, though it will depend on the architecture and various other random details.

(That said, I'm pretty sure that these labs aren't making decisions based on carefully analyzing the situation and are instead just operating like "idk, human-level models don't seem that bad, we'll probably be able to figure it out, humans can solve most problems with empiricism on priors". But this prior seems more right than overwhelming pessimism IMO.)

Also, I think you should seriously entertain the idea that just trying quite hard with various prosaic countermeasures might suffice for reasonably high levels of safety. And thus pushing on this could potentially be very leveraged relative to trying to hit a higher target.

I mostly agree with premises 1, 2, and 3, but I don't see how the conclusion follows.

It is possible for things to be hard to influence and yet still worth it to try to influence them.

(Note that the $30 million grant was not an endorsement and was instead a partnership (e.g. it came with a board seat), see Buck's comment.)

(Ex-post, I think this endeavour was probably net negative, though I'm pretty unsure, and ex-ante it currently seems great to me.)

Why focus on the $30 million grant?

What about large numbers of people working at OpenAI directly on capabilities for many years? (Which is surely worth far more than $30 million.)

Separately, this grant seems to have been done to influence the governance of OpenAI, not to make OpenAI go faster. (Directly working on capabilities seems modestly more accelerating and risky than granting money in exchange for a partnership.)

(ETA: TBC, there is a relationship between the grant and people working at OpenAI on capabilities: the grant was associated with a general vague endorsement of trying to play the inside game at OpenAI.)
