Rob Bensinger

Communications lead at MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer's.

Sequences

2022 MIRI Alignment Discussion
2021 MIRI Conversations
Naturalized Induction

Comments

Two things:

  • For myself, I would not feel comfortable using language as confident-sounding as "on the default trajectory, AI is going to kill everyone" if I assigned (e.g.) 10% probability to "humanity [gets] a small future on a spare asteroid-turned-computer or an alien zoo or maybe even star". I just think that scenario's way, way less likely than that.
    • I'd be surprised if Nate assigns 10+% probability to scenarios like that, but he can speak for himself. 🤷‍♂️
    • I think some people at MIRI have significantly lower p(doom)? And I don't expect those people to use language like "on the default trajectory, AI is going to kill everyone".
  • I agree with you that there's something weird about making lots of human-extinction-focused arguments when the thing we care more about is "does the cosmic endowment get turned into paperclips"? I do care about both of those things, an enormous amount; and I plan to talk about both of those things to some degree in public communications, rather than treating it as some kind of poorly-kept secret that MIRI folks care about whether flourishing interstellar civilizations get a chance to exist down the line. But I have this whole topic mentally flagged as a thing to be thoughtful and careful about, because it at least seems like an area that contains risk factors for future deceptive comms. E.g., if we update later to expecting the cosmic endowment to be wasted but all humans not dying, I would want us to adjust our messaging even if that means sacrificing some punchiness in our policy outreach.
    • Currently, however, I think the particular scenario "AI keeps a few flourishing humans around forever" is incredibly unlikely, and I don't think Eliezer, Nate, etc. would say things like "this has a double-digit probability of happening in real life"? And, to be honest, the idea of myself and my family and friends and every other human being all dying in the near future really fucks me up and does not seem in any sense OK, even if (with my philosopher-hat on) I think this isn't as big of a deal as "the cosmic endowment gets wasted".
    • So I don't currently feel bad about emphasizing a true prediction ("extremely likely that literally all humans literally nonconsensually die by violent means"), even though the philosophy-hat version of me thinks that the separate true prediction "extremely likely 99+% of the potential value of the long-term future is lost" is more morally important than that. Though I do feel obliged to semi-regularly mention the whole "cosmic endowment" thing in my public communication too, even if it doesn't make it into various versions of my general-audience 60-second AI risk elevator pitch.

Note that "everyone will be killed (or worse)" is a different claim from "everyone will be killed"! (And see Oliver's point that Ryan isn't talking about mistreated brain scans.)

Some of the other things you suggest, like future systems keeping humans physically alive, do not seem plausible to me.

I agree with Gretta here, and I think this is a crux. If MIRI folks thought it likely that AI would leave a few humans biologically alive (as opposed to information-theoretically revivable), I don't think we'd be comfortable saying "AI is going to kill everyone". (I encourage other MIRI folks to chime in if they disagree with me about the counterfactual.)

I also personally have maybe half my probability mass on "the AI just doesn't store any human brain-states long-term", and I have less than 1% probability on "conditional on the AI storing human brain-states for future trade, the AI does in fact encounter aliens that want to trade and this trade results in a flourishing human civilization".

FWIW I do think "don't trust this guy" is warranted; I don't know that he's malicious, but I think he's just exceptionally incompetent relative to the average tech reporter you're likely to see stories from.

Like, in 2018 Metz wrote a full-length article on smarter-than-human AI that included the following frankly incredible sentence:

During a recent Tesla earnings call, Mr. Musk, who has struggled with questions about his company’s financial losses and concerns about the quality of its vehicles, chastised the news media for not focusing on the deaths that autonomous technology could prevent — a remarkable stance from someone who has repeatedly warned the world that A.I. is a danger to humanity.

FWIW, Cade Metz was reaching out to MIRI and some other folks in the x-risk space back in January 2020. I went and read some of his articles then, and came to the conclusion that he's one of the least competent journalists -- like, most likely to misunderstand his beat and emit obvious howlers -- that I'd ever encountered. I told folks as much at the time, and advised against talking to him, just on the basis that a lot of his journalism is comically bad and you'll risk looking foolish if you tap him.

This was six months before Metz caused SSC to shut down and more than a year before his hit piece on Scott came out, so it wasn't in any way based on 'Metz has been mean to my friends' or anything like that. (At the time he wasn't even asking around about SSC or Scott, AFAIK.)

(I don't think this is an idiosyncratic opinion of mine, either; I've seen other non-rationalists I take seriously flag Metz as someone unusually out of his depth and error-prone for a NYT reporter, for reporting unrelated to SSC stuff.)

Sounds like a lot of political alliances! (And "these two political actors are aligned" is arguably an even weaker condition than "these two political actors are allies".)

At the end of the day, of course, all of these analogies are going to be flawed. AI is genuinely a different beast.

It's pretty sad to call all of these end states you describe "alignment", as "alignment" is an extremely natural word for "actually terminally has good intentions".

Aren't there a lot of clearer words for this? "Well-intentioned", "nice", "benevolent", etc.

(And a lot of terms, like "value loading" and "value learning", that are pointing at the research project of getting good intentions into the AI.)

To my ear, "aligned person" sounds less like "this person wishes the best for me", and more like "this person will behave in the right ways".

If I hear that Russia and China are "aligned", I do assume that their intentions play a big role in that, but I also assume that their circumstances, capabilities, etc. matter too. Alignment in geopolitics can be temporary or situational, and it almost never means that Russia cares about China as much as China cares about itself, or vice versa.

And if we step back from the human realm, an engineered system can be "aligned" in contexts that have nothing to do with goal-oriented behavior, but are just about ensuring components are in the right place relative to each other.

Cf. the history of the term "AI alignment". From my perspective, a big part of why MIRI coordinated with Stuart Russell to introduce the term "AI alignment" was that we wanted to switch away from "Friendly AI" to a term that sounded more neutral. "Friendly AI research" had always been intended to subsume the full technical problem of making powerful AI systems safe and aimable; but emphasizing "Friendliness" made it sound like the problem was purely about value loading, so a more generic and low-content word seemed desirable.

But Stuart Russell (and later Paul Christiano) had a different vision in mind for what they wanted "alignment" to be, and MIRI apparently failed to communicate and coordinate with Russell to avoid a namespace collision. So we ended up with a messy patchwork of different definitions.

I've basically given up on trying to achieve uniformity on what "AI alignment" is; the best we can do, I think, is clarify whether we're talking about "intent alignment" vs. "outcome alignment" when the distinction matters.

But I do want to push back against those who think outcome alignment is just an unhelpful concept — on the contrary, if we didn't have a word for this idea I think it would be very important to invent one. 

IMO it matters more that we keep our eye on the ball (i.e., think about the actual outcomes we want and keep researchers' focus on how to achieve those outcomes) than that we define an extremely crisp, easily-packaged technical concept (that is at best a loose proxy for what we actually want). Especially now that ASI seems nearer at hand (so the need for this "keep our eye on the ball" skill is becoming less and less theoretical), and especially now that ASI disaster concerns have hit the mainstream (so the need to "sell" AI risk has diminished somewhat, and the need to direct research talent at the most important problems has increased).

And I also want to push back against the idea that a priori, before we got stuck with the current terminology mess, it should have been obvious that "alignment" is about AI systems' goals and/or intentions, rather than about their behavior or overall designs. I think intent alignment took off because Stuart Russell and Paul Christiano advocated for that usage and encouraged others to use the term that way, not because this was the only option available to us.

"Should" in order to achieve a certain end? To meet some criterion? To boost a term in your utility function?

In the OP: "Should" in order to have more accurate beliefs/expectations. E.g., I should anticipate (with high probability) that the Sun will rise tomorrow in my part of the world, rather than it remaining night.

Why would the laws of physics conspire to vindicate a random human intuition that arose for unrelated reasons?

We do agree that the intuition arose for unrelated reasons, right? There's nothing in our evolutionary history, and no empirical observation, that causally connects the mechanism you're positing and the widespread human hunch "you can't copy me".

If the intuition is right, we agree that it's only right by coincidence. So why are we desperately searching for ways to try to make the intuition right?

It also doesn't force us to believe that a bunch of water pipes or gears functioning as a classical computer can ever have our own first person experience.

Why is this an advantage of a theory? Are you under the misapprehension that "hypothesis H allows humans to hold on to assumption A" is a Bayesian update in favor of H even when we already know that humans had no reason to believe A? This is another case where your theory seems to require that we only be coincidentally correct about A ("sufficiently complex arrangements of water pipes can't ever be conscious"), if we're correct about A at all.
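
To spell the Bayesian point out in symbols (my own gloss of the argument, with H and E as placeholder labels rather than anything from the original exchange): let H be the proposed theory and let E be the observation "humans widely hold intuition A". If the intuition arose for reasons causally unrelated to whether A (or H) is true, then P(E | H) ≈ P(E | ¬H), and Bayes' rule gives

\[
\frac{P(H \mid E)}{P(\lnot H \mid E)}
  = \frac{P(E \mid H)}{P(E \mid \lnot H)} \cdot \frac{P(H)}{P(\lnot H)}
  \approx \frac{P(H)}{P(\lnot H)},
\]

i.e. "H lets us keep believing A" moves the posterior essentially not at all, because the likelihood ratio is ~1.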

One way to rescue this argument is by adding in an anthropic claim, like: "If water pipes could be conscious, then nearly all conscious minds would be instantiated in random dust clouds and the like, not in biological brains. So given that we're not Boltzmann brains briefly coalescing from space dust, we should update that giant clouds of space dust can't be conscious."

But is this argument actually correct? There's an awful lot of complex machinery in a human brain. (And the same anthropic argument seems to suggest that some of the human-specific machinery is essential, else we'd expect to be some far-more-numerous observer, like an insect.) Is it actually that common for a random brew of space dust to coalesce into exactly the right shape, even briefly?
