ryan_greenblatt

I work at Redwood Research.


Comments

Instead someone might, for example...

Isn't the central one "you want to spend money to make a better long-term future more likely, e.g. by donating it to fund AI safety work now"?

Fair enough if you think the marginal value of money is negligible, but this isn't exactly obvious.

Thanks, this is clarifying from my perspective.

My remaining uncertainty is why you think AIs are so unlikely to keep humans around and treat them reasonably well (e.g. let them live out full lives).

From my perspective the argument that it is plausible that humans are treated well [even if misaligned AIs end up taking over the world and gaining absolute power] goes something like this:

  • If it only costs <1/million of overall resources to keep a reasonable fraction of humans alive and happy, it's reasonably likely that misaligned AIs with full control would keep humans alive and happy due to either:
    • Acausal trade/decision theory
    • The AI terminally caring at least a bit about being nice to humans (perhaps because it cares a bit about respecting existing nearby agents or perhaps because it has a bit of human-like values).
  • It is pretty likely that it costs <1/million of overall resources (from the AI's perspective) to keep a reasonable fraction of humans alive and happy. Humans are extremely cheap to keep around asymptotically, and I think it can be pretty cheap even initially, especially if you're a very smart AI.

(See links in my prior comment for more discussion.)

(I also think the argument goes through for 1/billion, but I thought I would focus on the higher value for now.)

Where do you disagree with this argument?

upvoted for being an exquisite proof by absurdity about what's productive

I don't think you should generally upvote things on the basis of indirectly explaining things via being unproductive lol.

Hmm, I agree that ARA is not that compelling on its own (as a threat model). However, it seems to me like ruling out ARA is a relatively natural way to mostly rule out relatively direct danger. And, once you do have ARA ability, you just need some moderately potent self-improvement ability (including training successor models) for the situation to look reasonably scary. Further, it seems somewhat hard for capability evaluations to rule out this self-improvement if models are ARA-capable, given that there are so many possible routes.

So, I think I basically agree with where you're at overall, but I'd go further than "it's something that roughly correlates with other threat models, but is easier and more concrete to measure" and say "it's a reasonable threshold to use to (somewhat) bound danger", which seems worth noting.

While doing all that, in order to stay relevant, they'll need to recursively self-improve at the same rate at which leading AI labs are making progress, but with far fewer computational resources

I agree this is probably an issue for the rogue AIs. But, we might want to retain the ability to slow down if misalignment seems to be a huge risk, and rogue AIs could make this considerably harder. (The existence of serious rogue AIs is surely also correlated with misalignment being a big risk.)

While it's hard to coordinate to slow down human AI development even if huge risks are clearly demonstrated, there are ways in which it could be particularly hard to prevent mostly autonomous AIs from self-improving. In particular, other AI projects require human employees, which could make them easier to track and shut down. Further, AI projects are generally limited by not having a vast supply of intellectual labor, which would change in a regime where there are rogue AIs with reasonable ML abilities.

This is mostly an argument that we should be very careful with AIs which have a reasonable chance of being capable of substantial self-improvement, but ARA feels quite related to me.

The difference between killing everyone and killing almost everyone while keeping a few alive for arcane purposes does not matter to most people, nor should it.

I basically agree with this as stated, but think these arguments also imply that it is reasonably likely that the vast majority of people will survive misaligned AI takeover (perhaps 50% likely).

I also don't think this is very well described as "arcane purposes":

  • Kindness is pretty normal.
  • The decision theory motivation is actually also pretty normal from some perspective: it's just the generalization of the relatively normal "if you wouldn't have screwed me over and it's cheap for me, I won't screw you over". (Of course, people typically don't motivate this sort of thing in terms of decision theory, so there is a bit of a midwit meme here.)

Unfortunately, if the AI really barely cares (e.g. <1/billion caring), it might only need to be barely useful.

I agree it is unlikely to be very useful.

I basically agree with your overall comment, but I'd like to push back in one spot:

If your model of reality has the power to make these sweeping claims with high confidence

From my understanding, Nate Soares at least claims that his internal case for >80% doom is disjunctive and doesn't route entirely through 1, 2, 3, and 4.

I don't really know exactly what the disjuncts are, so this doesn't really help, and I overall agree that MIRI does make "sweeping claims with high confidence".

Withholding information because you don't trust your audience to reason validly (!!) is not at all the behavior of a "straight shooter".

Hmm, I'm not sure I exactly buy this. I think you should probably follow something like onion honesty, which can involve intentionally simplifying your message to something you expect will give the audience more accurate views. I think you should err on the side of stating things, but still, sometimes stating a true thing can be clearly distracting and confusing, and thus you shouldn't.

though I think they're way above the bar for "worthwhile to report"

Yeah, maybe I'm pretty off base in what the meta-level policy should be like. I don't feel very strongly about how to manage this.

I also now realize that some of the language was stronger than I think I intended, and I've edited the original comment; sorry about that.

Ryan is arguing more that something like "humans will get a solar system or two and basically get to have decent lives".

Yep, this is an accurate description, but it is worth emphasizing that I think that horrible violent conflict and other bad outcomes for currently alive humans are reasonably likely.
