Eliezer explores a dichotomy between "thinking in toolboxes" and "thinking in laws". 
Toolbox thinkers are oriented around a "big bag of tools that you adapt to your circumstances." Law thinkers are oriented around universal laws, which might or might not be useful tools, but which help us model the world and scope out problem-spaces. There seems to be confusion when toolbox and law thinkers talk to each other.

William_S:
I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to manage a team of 4 people which worked on trying to understand language model features in context, leading to the release of an open source "transformer debugger" tool. I resigned from OpenAI on February 15, 2024.
I wish there were more discussion posts on LessWrong. Right now it feels like it weakly if not moderately violates some sort of cultural norm to publish a discussion post (similarly, but to a lesser extent, on the Shortform). Something low effort of the form "X is a topic I'd like to discuss. A, B and C are a few initial thoughts I have about it. What do you guys think?" It seems to me like something we should encourage though. Here's how I'm thinking about it.

Such "discussion posts" currently happen informally in social circles. Maybe you'll text a friend. Maybe you'll bring it up at a meetup. Maybe you'll post about it in a private Slack group. But if it's appropriate in those contexts, why shouldn't it be appropriate on LessWrong? Why not benefit from having it be visible to more people? The more eyes you get on it, the better the chance someone has something helpful, insightful, or just generally useful to contribute.

The big downside I see is that it would clutter the post feed. When you go to lesswrong.com and see the list of posts, you don't want that list to have a bunch of low quality discussion posts you're not interested in; you don't want to spend time and energy sifting through the noise to find the signal. But this is easily solved with filters: authors could mark/categorize/tag their posts as low-effort discussion posts, and people who don't want to see such posts in their feed can filter them out.

Context: I was listening to the Bayesian Conspiracy podcast's episode on LessOnline. Hearing them talk about the sorts of discussions they envision happening there made me think about why that sort of thing doesn't happen more on LessWrong. Whatever you'd say to the group of people you're hanging out with at LessOnline, why not publish a quick discussion post about it on LessWrong?
habryka:
Does anyone have any takes on the two Boeing whistleblowers who died under somewhat suspicious circumstances? I haven't followed this in detail, and my guess is it is basically just random chance, but it sure would be a huge deal if a publicly traded company was now performing assassinations of U.S. citizens. Curious whether anyone has looked into this, or has thought much about baseline risk of assassinations or other forms of violence from economic actors.
Does the possibility of China or Russia being able to steal advanced AI from labs increase or decrease the chances of great power conflict?

An argument that it counter-intuitively decreases them: for the same reason that a functioning US ICBM defense system would be a destabilizing influence on the MAD equilibrium. Once such a shield were up, America's enemies would have no credible threat of retaliation if the US launched a first strike. So there would be nothing (geopolitically) stopping America from launching a first strike, and there would be quite a reason to launch one: the shield definitely works against the present crop of ICBMs, but may not work against future ICBMs. America's enemies will therefore assume that after the shield is up, America will launch a first strike, and will seek to gain the advantage while they still can by launching a pre-emptive first strike of their own. The same logic works in reverse: if Russia were building an ICBM defense shield and would likely complete it within the year, we would feel very scared about what would happen once that shield is up.

The same logic applies to other irrecoverably large technological leaps in war. If the US is on the brink of developing highly militarily capable AIs, China will fear what the US will do with them (imagine the tables were turned: would you feel safe with Anthropic & OpenAI in China, and DeepMind in Russia?). So if they don't get their own versions, they'll feel mounting pressure to secure their geopolitical objectives while they still can, or otherwise make themselves less subject to the threat of AI (would you not wish the US to sabotage the Chinese Anthropic & OpenAI by whatever means, if China seemed on the brink?). The faster the development, the quicker that pressure mounts, and the sloppier and rasher China's responses will be. If it's easy for China to copy our AI technology, the pressure mounts much more slowly.
Something I'm confused about: what is the threshold that needs to be met for the majority of people in the EA community to say something like "it would be better if EAs didn't work at OpenAI"? Imagining the following hypothetical scenarios over 2024/25, I can't confidently predict whether any of them would individually cause that response within EA:

1. Ten to fifteen more OpenAI staff quit for varied and unclear reasons. No public info is gained beyond rumours.
2. There is another board shakeup because senior leaders seem worried about Altman. Altman stays on.
3. The Superalignment team is disbanded.
4. OpenAI doesn't let the UK or US AISIs safety-test GPT-5/6 before release.
5. There are strong rumours they've achieved weakly general AGI internally at the end of 2025.


Recent Discussion

I hear a lot of different arguments floating around for exactly how mechanistic interpretability research will reduce x-risk. As an interpretability researcher, forming clearer thoughts on this is pretty important to me! As a preliminary step, I've compiled a longlist of 19 different arguments I've heard for why interpretability matters. These are pretty scattered and early-stage thoughts (and emphatically my personal opinion rather than the official opinion of Anthropic!), but I'm sharing them in the hope that they're interesting to people.

(Note: I have not thought hard about this categorisation! Some of these overlap substantially, but feel subtly different in my head. I was not optimising for concision and having few categories, and expect I could cut this down substantially with effort)

Credit to Evan...

hmys:

I feel like the biggest issue with aligning powerful AI systems is that nearly all the features we'd like these systems to have, like being corrigible, not being deceptive, having values aligned with ours, etc., are properties we are currently unable to state formally. They are clearly real properties, in that humans can agree on examples of non-corrigibility, misalignment, and dishonesty when shown examples of actions AIs could take. But we can't put them in code or a program specification, and consequently can't reason about them very precisely, test whether sys...

If it’s worth saying, but not worth its own post, here's a place to put it.

If you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are invited. This is also the place to discuss feature requests and other ideas you have for the site, if you don't want to write a full top-level post.

If you're new to the community, you can start reading the Highlights from the Sequences, a collection of posts about the core ideas of LessWrong.

If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the Concepts section.

The Open Thread tag is here. The Open Thread sequence is here.

nim:

More concrete than your actual question, but there are a couple of options you can take:

  • acknowledge that there's a form of social truth whereby the things people insist upon believing are functionally true. For instance, there may be no absolute moral value to criticism of a particular leader, but in certain countries the social system creates a very unambiguous negative value to it. Stick to the observable -- if he does an experiment, replicate that experiment for yourself and share the results. If you get different results, examine why. IMO, attempting in

...
nim:
Welcome! If you have the emotional capacity to happily tolerate being disagreed with or ignored, you should absolutely participate in discussions. In the best case, you teach others something they didn't know before, or get a misconception of your own corrected. In the worst case, your remarks are downvoted or ignored.

Your question on games would do well fleshed out into at least a quick take, if not a whole post, answering:

* What games you've ruled out for this and why
* What games in other genres you've found to capture the "truly simulation-like" aspect that you're seeking
* Examples of game experiences that you experience as narrative railroading
* Examples of ways that games that get mostly there do a "hard science/AI/transhumanist theme" in the way that you're looking for
* Perhaps what you get from it being a game that you miss if it's a book, movie, or show?

If you've tried a lot of things and disliked most, then good clear descriptions of what you dislike about them can actually function as helpful positive recommendations for people with different preferences.
nim:
Can random people donate images for the sequence-items that are missing them, or can images only be provided by the authors? I notice that I am surprised that some sequences are missing out on being listed just because images weren't uploaded, considering that I don't recall having experienced other sequences' art as particularly transformative or essential.
nim:
Congratulations! I'm in today's lucky 10,000 for learning that Asymptote exists. Perhaps due to my not being much of a mathematician, I didn't understand it very clearly from the README... but the examples comparing code to its output make sense! Comparing your examples to the kind of things Asymptote likes to show off (https://asymptote.sourceforge.io/gallery/), I see why you might have needed to build the additional tooling. I don't think you necessarily have to compare smoothmanifold to a JavaScript framework to get the point across -- it seems to be an abstraction layer that allows one to describe a drawn image in slightly more general terms than Asymptote supports. I admire how you're investing so much effort to use your talents to help others.


I'm excited to share a project I've been working on that I think many in the LessWrong community will appreciate: converting some rational fiction into high-quality audiobooks using cutting-edge AI voice technology from ElevenLabs, under the name "Askwho Casts AI".

The keystone of this project is an audiobook version of Planecrash (AKA Project Lawful), the epic glowfic authored by Eliezer Yudkowsky and Lintamande. Given the scope and scale of this work, with its large cast of characters, I'm using ElevenLabs to give each character their own distinct voice. Converting this story into an audiobook has been a labor of love, and I hope that if anyone has bounced off it before, this...

Mir:
I gave it a try two years ago, and I really liked the logic lectures early on (basically a narrativization of HAE101 (for beginners)), but gave up soon after. Here are some other parts I learned valuable things from:

* When Keltham said "I do not aspire to be weak."
* And from an excerpt he tweeted (I don't know the context): "if at any point you're calculating how to pessimize a utility function, you're doing it wrong."
* Keltham briefly talks about the danger of (what I call) "proportional rewards". I seem to not have noted down where in the book I read it, but it inspired this note:
  * If you're evaluated for whether you're doing your best, you have an incentive to (subconsciously or otherwise) be weaker so you can fake doing your best with less effort. Never encourage people with "you did your best!". An objective output metric may be fairer all things considered.
  * It furthermore caused me to try harder to eliminate internal excusification-loops in my head. "Never make excuses for myself" is my ~3rd Law, and Keltham helped me be hyperaware of it.
    * (Unrelatedly, my 1st Law is "never make decisions, only ever execute strategies" (origin).)
  * I already had extensive notes on this theme, originally inspired by "Stuck In The Middle With Bruce" (JF Rizzo), but Keltham made me revisit it and update my behaviour further.
    * Re "handicap incentives", "moralization of effort", "excuses to lose", "incentive to hedge your bets".
* I also have this quoted in my notes, though only to use as diversity/spice for explaining stuff I already had in there (I've placed it under the idionym "tilling the epistemic soil"):
  * Keltham: "I'm - actually running into a small stumbling block about trying to explain mentally why it's better to give wrong answers than no answers? It feels too obvious to explain? I mean, I vaguely remember being told about experiments where, if you don't do that, people sort of revise history inside their own heads, and aren't aware of the processes

Yep, I can do that legwork!

I'll add some commentary, but I'll "spoiler" it in case people don't wanna see my takes ahead of forming their own, or just general "don't spoil the intended payoffs" stuff.

  1. https://www.projectlawful.com/replies/1743791#reply-1743791
  2. https://www.projectlawful.com/posts/6334 (Contains infohazards for people with certain psychologies; do not twist yourself into a weird and uncomfortable condition contemplating "Greater Reality" - notice confusion about it quickly and refocus on ideas for which you can more easily update your ex
...

Produced as part of the MATS Winter 2024 program, under the mentorship of Alex Turner (TurnTrout).

TL;DR: I introduce a method for eliciting latent behaviors in language models by learning unsupervised perturbations of an early layer of an LLM. These perturbations are trained to maximize changes in downstream activations. The method discovers diverse and meaningful behaviors with just one prompt, including perturbations overriding safety training, eliciting backdoored behaviors, and uncovering latent capabilities.

Summary: In the simplest case, the unsupervised perturbations I learn are given by unsupervised steering vectors - vectors added to the residual stream as a bias term in the MLP outputs of a given layer. I also report preliminary results on unsupervised steering adapters - these are LoRA adapters of the MLP output weights of a given...
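To make the steering-vector case concrete, here is a minimal sketch of the idea as described above: a fixed-norm vector is added as a bias to one layer's MLP output and optimized to maximize the change in a later layer's activations. This is my own illustration rather than the post's code; it assumes a HuggingFace Llama/Qwen-style module layout, and the function and parameter names (train_unsupervised_steering_vector, source_layer, radius, ...) are hypothetical.

```python
# Minimal sketch (assumed API, not the post's implementation): learn a vector
# added to the MLP output of an early layer, trained to maximize the change in
# a later layer's activations, with the vector constrained to a hypersphere.
import torch

def train_unsupervised_steering_vector(model, input_ids, source_layer, target_layer,
                                        radius=4.0, steps=200, lr=1e-2):
    model.requires_grad_(False)  # only the steering vector is trained
    d_model = model.config.hidden_size
    theta = torch.randn(d_model, device=model.device, requires_grad=True)

    # Baseline activations at the target layer with no perturbation.
    with torch.no_grad():
        baseline = model(input_ids, output_hidden_states=True).hidden_states[target_layer]

    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        # Project onto a fixed-radius hypersphere so the optimizer can't "win"
        # just by making the vector arbitrarily large.
        vec = radius * theta / theta.norm()

        # Add the vector as a bias on the source layer's MLP output
        # (Llama/Qwen-style layout: model.model.layers[i].mlp).
        handle = model.model.layers[source_layer].mlp.register_forward_hook(
            lambda module, inputs, output: output + vec
        )
        steered = model(input_ids, output_hidden_states=True).hidden_states[target_layer]
        handle.remove()

        # Maximize the downstream change (minimize its negative).
        loss = -(steered - baseline).pow(2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

    return (radius * theta / theta.norm()).detach()
```

On this reading of the method, different random initializations of theta land on different downstream behaviors, which is what makes the procedure unsupervised.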

gate:

This is really cool! Exciting to see that it's possible to explore the space of possible steering vectors without having to know what to look for a priori. I'm new to this field, so I had a few questions; I'm not sure if they've been answered elsewhere.

  1. Is there a reason to use Qwen as opposed to other models? Curious if this model has any differences in behavior when you do this sort of stuff.
  2. It looks like the hypersphere constraint is so that the optimizer doesn't select something far away due to being large. Is there any reason to use this sort of constrai
...
gate:
Why do you guys think this is happening? It sounds to me like one possibility is that the model might have some amount of ensembling (thinking back to The Clock and The Pizza, where ensembling happened in a toy setting). W.r.t. "across all steering vectors", that's pretty mysterious, but at least in the specific examples in the post even 9 was semi-fantasy.

Also, what are y'all's intuitions on picking layers for this stuff? I understand that you describe in the post that you control early layers because we suppose they might be acting something like switches to different classes of functionality. However, implicit in the choice of layer 10 is that you probably don't want to go too early, because maybe in the very early layers it's unembedding and learning basic concepts like whether a word is a noun or whatever. Do you choose layers based on experience tinkering in Jupyter notebooks and the like, or have you run some sort of grid to get a notion of what the effects elsewhere are? If the latter, it would be nice to know, to aid in hypothesis formation and the like.
gate:
Maybe a dumb question, but (1) how can we know for sure if we are on-manifold, and (2) why is it so important to stay on-manifold? I'm guessing you mean that, vaguely, we want to stay within the space of possible activations induced by inputs from data that is in some sense "real-world." However, there appear to be a couple of complications: (1) measuring distributional properties of later layers from small-to-medium-sized datasets doesn't seem like a realistic estimate of what should be expected of an on-manifold vector, since later layers are likely more semantically/high-level focused and sparse; (2) what people put into the inputs does change in small ways simply due to new things happening in the world, but there are also prompt-engineering attacks that people use that are likely in some sense "off-distribution" yet still in the real world, and I don't think we should ignore these fully.

Is this notion of a manifold a good way to think about getting indicative information about real-world behavior? Probably, but I'm not sure, so I thought I might ask; I am new to this field. I do think at the end of the day we want indicative information, so somewhat artificial environments might at times have a certain usefulness.

Also, one convoluted (perhaps inefficient) idea, which felt kind of fun, for staying on-manifold is the following: (1) train your batch of steering vectors, (2) optimize in token space to elicit those steering vectors (i.e. by regularizing the vectors to be close to one of the token vectors, or by using an algorithm that operates on text), (3) check those tokens to make sure that they continue to elicit the behavior and are not totally wacky. If you cannot generate that steer from something that is close to a prompt, surely it's not on-manifold, right? You might be able to automate this by looking at perplexity, or by training a small model to estimate whether an input prompt is a "realistic" sentence or whatever. Curious to hear thoughts.
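As one way to cash out the automation idea at the end of that comment, here is a rough sketch (my own illustration, not from the thread) of scoring candidate prompts by their perplexity under a small reference language model, with a purely illustrative cutoff for flagging text that looks off-manifold:

```python
# Rough sketch: use a small LM's perplexity as a cheap proxy for whether an
# elicited prompt still looks like "realistic" text. The gpt2 scorer and the
# cutoff value are placeholders, not choices made anywhere in the thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
scorer = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def prompt_perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids returns the mean next-token cross-entropy.
        loss = scorer(input_ids=ids, labels=ids).loss
    return float(torch.exp(loss))

candidates = ["The weather in Paris has been unusually warm.",
              "zxqv ##steer loss9 vector!!"]
for c in candidates:
    ppl = prompt_perplexity(c)
    flag = "likely off-manifold" if ppl > 200 else "plausible"
    print(f"{ppl:10.1f}  {flag}  {c!r}")
```

A small trained classifier over prompts (the "small model" the comment mentions) could replace the perplexity cutoff; the point is just that the check is cheap to automate.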
Bogdan Ionut Cirstea:
Probably even better to use interpretability agents (e.g. MAIA, AIA) for this, especially since they can do (iterative) hypothesis testing. 

A couple years ago, I had a great conversation at a research retreat about the cool things we could do if only we had safe, reliable amnestic drugs - i.e. drugs which would allow us to act more-or-less normally for some time, but not remember it at all later on. And then nothing came of that conversation, because as far as any of us knew such drugs were science fiction.

… so yesterday when I read Eric Neyman’s fun post My hour of memoryless lucidity, I was pretty surprised to learn that what sounded like a pretty ideal amnestic drug was used in routine surgery. A little googling suggested that the drug was probably a benzodiazepine (think valium). Which means it’s not only a great amnestic, it’s also apparently one...

EJT:
But Eric Neyman's post suggests that benzos don't significantly reduce performance on some cognitive tasks (e.g. Spelling Bee).

Yeah there are definitely tasks that depressants would be expected to leave intact. I'd guess it's correlated strongly with degree of working memory required.

JenniferRM:
There was an era in a scientific community where they were interested in the "kinds of learning and memory that could happen in de-corticated animals" and they sort of homed in on the basal ganglia (which, to a first approximation "implements habits" (including bad ones like tooth grinding)) as the locus of this "ability to learn despite the absence of stuff you'd think was necessary for your naive theory of first-order subjectively-vivid learning". (The cerebellum also probably has some "learning contribution" specifically for fine motor skills, but it is somewhat selectively disrupted just by alcohol: hence the stumbling and slurring. I don't know if anyone yet has a clean theory for how the cerebellum's full update loop works. I learned about alcohol/cerebellum interactions because I once taught a friend to juggle at a party, and she learned it, but apparently only because she was drunk. She lost the skill when sober.)
the gears to ascension:
Yeah, agreed - benzos are on my list of drugs to never take if I can possibly avoid it, along with opiates. By "temporary", I just mean "recoverably". Many drugs society considers sus or terrible I consider mostly fine if the risks are managed, but that generally involves knowing how to avoid addiction, and means using things at non-recreational-dose levels. Benzos are hard to do that with because, to my cached understanding, the margin between therapeutic and addictive doses is very small.

Hello, friends.

This is my first post on LW, but I have been a "lurker" here for years and have learned a lot from this community that I value.

I hope this isn't pestilent, especially for a first-time post, but I am requesting information/advice/non-obvious strategies for coming up with emergency money.

I wouldn't ask except that I'm in a severe financial emergency and I can't seem to find a solution. I feel like every minute of the day I'm butting my head against a brick wall trying and failing to figure this out.

I live in a very small town in rural Arizona. The local economy is sustained by fast food restaurants, pawn shops, payday lenders, and some huge factories/plants that are only ever hiring engineers and other highly specialized personnel.

I...

Answer by Hastings:
If you can get to Seattle for your partner's career, you can likely get a job nannying during the day, which will pay $25 to $30 an hour and doesn't require a car. This time last summer I was an incoming intern in Seattle, and I was unable to pay less than $30 an hour for childcare during working hours, hiring by combing through Facebook groups for nannies and sending many messages. At this price, one of the nannies we worked with had a car and the other did not. I do not know what the childcare market is like near your current location.

It is not that high here, but this is something I will look into if we can get to Seattle. But does this not require a license?

nim:
I hear you, describing how weird social norms in the world can be. I hear you describing how you followed those norms to show consideration for readers by dressing up a very terrible situation as a slightly less bad one. In social settings where people both know who you are and are compelled by the circumstances to listen to what you say, that's still the right way to go about it.

The rudeness of taking peoples' time is very real in person, where a listener is socially "forced" to invest time in listening or effort in escaping the conversation. But posts online are different: especially when you lack the social capital of "this post is by someone I know I often like reading, so I should read it to see what they say", readers should feel no obligation to read your whole post, nor to reply, if they don't want to. When you're brand new to a community, readers can easily dismiss your post as a bot or scammer and simply ignore it, so you have done them no harm in the way that consuming someone's time in person harms them. A few trolls may choose to read your post and then pretend you forced them to do so, but anyone who behaves like that is inherently outing themself as someone whose opinions about you don't deserve much regard. (and then you get some randos who like how you write and decide to be micro-penpals... hi there!)

However, there's another option for how to approach this kind of thing online. You can spin up an anonymous throwaway and play the "asking for a friend" game -- take the option of direct help or directly contacting the "actual person" off the table, and you've ruled out being a gofundme scam. Sometimes asking on behalf of a fictional person whose circumstances happen to be more like the specifics of your own than you would disclose in public gets far better answers. For instance, if the fictional person had a car problem involving a specific model year of vehicle and a specific insurance company, the internet may point out that there's a recall on

Join us at Segundo Coffee Lab at 7pm for Houston Rationalists' weekly social meetup.

(inside the IRONWORKS through the big orange door)

711 Milby St, Houston, TX 77023

Go to the #monday-meetups channel on our Discord server for additional details and discussion:

https://discord.gg/DzmEPAscpS


LessOnline

A Festival of Writers Who are Wrong on the Internet

May 31 - Jun 2, Berkeley, CA