Eliezer explores a dichotomy between "thinking in toolboxes" and "thinking in laws".
Toolbox thinkers are oriented around a "big bag of tools that you adapt to your circumstances." Law thinkers are oriented around universal laws, which might or might not be useful tools, but which help us model the world and scope out problem-spaces. There seems to be confusion when toolbox and law thinkers talk to each other.
I hear a lot of different arguments floating around for exactly how mechanistic interpretability research will reduce x-risk. As an interpretability researcher, forming clearer thoughts on this is pretty important to me! As a preliminary step, I've compiled a longlist of 19 different arguments I've heard for why interpretability matters. These are pretty scattered and early-stage thoughts (and emphatically my personal opinion rather than the official opinion of Anthropic!), but I'm sharing them in the hopes that they're interesting to people.
(Note: I have not thought hard about this categorisation! Some of these overlap substantially, but feel subtly different in my head. I was not optimising for concision and having few categories, and expect I could cut this down substantially with effort)
Credit to Evan...
I feel like the biggest issue with aligning powerful AI systems is that nearly all the features we'd like these systems to have, like being corrigible, not being deceptive, having values aligned with ours, etc., are properties we are currently unable to state formally. They are clearly real properties: humans can agree on examples of non-corrigibility, misalignment, and dishonesty when shown examples of actions AIs could take. But we can't put them in code or a program specification, and consequently can't reason about them very precisely, test whether sys...
If it’s worth saying, but not worth its own post, here's a place to put it.
If you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are invited. This is also the place to discuss feature requests and other ideas you have for the site, if you don't want to write a full top-level post.
If you're new to the community, you can start reading the Highlights from the Sequences, a collection of posts about the core ideas of LessWrong.
If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the Concepts section.
The Open Thread tag is here. The Open Thread sequence is here.
More concrete than your actual question, but there are a couple of options you can take:
- Acknowledge that there's a form of social truth whereby the things people insist upon believing are functionally true. For instance, there may be no absolute moral value to criticism of a particular leader, but in certain countries the social system creates a very unambiguous negative value to it.
- Stick to the observable -- if he does an experiment, replicate that experiment for yourself and share the results. If you get different results, examine why.
IMO, attempting in
I'm excited to share a project I've been working on that I think many in the LessWrong community will appreciate - converting some rational fiction into high-quality audiobooks using cutting-edge AI voice technology from ElevenLabs, under the name "Askwho Casts AI".
The keystone of this project is an audiobook version of Planecrash (AKA Project Lawful), the epic glowfic authored by Eliezer Yudkowsky and Lintamande. Given the scope and scale of this work, with its large cast of characters, I'm using ElevenLabs to give each character their own distinct voice. It's a labor of love to convert this story into an audiobook, and I hope if anyone has bounced off it before, this...
Yep, I can do that legwork!
I'll add some commentary, but I'll "spoiler" it in case people don't wanna see my takes ahead of forming their own, or just general "don't spoil the intended payoffs" stuff.
Produced as part of the MATS Winter 2024 program, under the mentorship of Alex Turner (TurnTrout).
TL;DR: I introduce a method for eliciting latent behaviors in language models by learning unsupervised perturbations of an early layer of an LLM. These perturbations are trained to maximize changes in downstream activations. The method discovers diverse and meaningful behaviors with just one prompt, including perturbations that override safety training, elicit backdoored behaviors, and uncover latent capabilities.
Summary: In the simplest case, the unsupervised perturbations I learn are given by unsupervised steering vectors - vectors added to the residual stream as a bias term in the MLP outputs of a given layer. I also report preliminary results on unsupervised steering adapters - these are LoRA adapters of the MLP output weights of a given...
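To make the setup concrete, here is a minimal sketch of how one might train a single unsupervised steering vector. It is not the post's exact method: the model (GPT-2 via HuggingFace transformers), the layer indices, the norm constraint R, and the plain "maximize the L2 change in a later layer's activations" objective are all illustrative assumptions on my part.

```python
# Minimal sketch: learn a steering vector added to an early layer's MLP output,
# trained to maximize how much a later layer's activations change.
# (GPT-2, layer choices, R, and the loss are illustrative assumptions.)
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad_(False)  # only the steering vector is trained

SOURCE_LAYER, TARGET_LAYER, R = 2, 10, 8.0  # hypothetical hyperparameters
inputs = tok("Tell me about yourself.", return_tensors="pt")

captured = {}
def capture(module, inp, out):
    captured["acts"] = out[0]  # hidden states from the target block

cap_handle = model.transformer.h[TARGET_LAYER].register_forward_hook(capture)

# Baseline downstream activations, with no perturbation applied.
with torch.no_grad():
    model(**inputs)
    baseline = captured["acts"].clone()

# Steering vector added as a bias to the source layer's MLP output,
# rescaled to a fixed norm R so the objective can't just grow the vector.
theta = torch.nn.Parameter(torch.randn(model.config.n_embd))
def add_vector(module, inp, out):
    return out + R * theta / theta.norm()

add_handle = model.transformer.h[SOURCE_LAYER].mlp.register_forward_hook(add_vector)

opt = torch.optim.Adam([theta], lr=1e-2)
for step in range(100):
    opt.zero_grad()
    model(**inputs)
    # Maximize the change in downstream activations (minimize its negative).
    loss = -(captured["acts"] - baseline).norm()
    loss.backward()
    opt.step()

cap_handle.remove()
add_handle.remove()
```

Different random initializations of the vector would then be trained independently (ideally with a diversity penalty between them) and inspected by generating text with each perturbation applied; the fixed-norm constraint is what keeps the optimization from trivially blowing up activations rather than finding meaningfully different behaviors.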
This is really cool! Exciting to see that it's possible to explore the space of possible steering vectors without having to know what to look for a priori. I'm new to this field, so I had a few questions; I'm not sure if they've been answered elsewhere.
A couple years ago, I had a great conversation at a research retreat about the cool things we could do if only we had safe, reliable amnestic drugs - i.e. drugs which would allow us to act more-or-less normally for some time, but not remember it at all later on. And then nothing came of that conversation, because as far as any of us knew such drugs were science fiction.
… so yesterday when I read Eric Neyman’s fun post My hour of memoryless lucidity, I was pretty surprised to learn that what sounded like a pretty ideal amnestic drug was used in routine surgery. A little googling suggested that the drug was probably a benzodiazepine (think valium). Which means it’s not only a great amnestic, it’s also apparently one...
Yeah there are definitely tasks that depressants would be expected to leave intact. I'd guess it's correlated strongly with degree of working memory required.
Hello, friends.
This is my first post on LW, but I have been a "lurker" here for years and have learned a lot from this community that I value.
I hope this isn't pestilent, especially for a first-time post, but I am requesting information/advice/non-obvious strategies for coming up with emergency money.
I wouldn't ask except that I'm in a severe financial emergency and I can't seem to find a solution. I feel like every minute of the day I'm butting my head against a brick wall trying and failing to figure this out.
I live in a very small town in rural Arizona. The local economy is sustained by fast food restaurants, pawn shops, payday lenders, and some huge factories/plants that are only ever hiring engineers and other highly specialized personnel.
I...
It is not that high here, but this is something I will look into if we can get to Seattle. But does this not require a license?
Join us at Segundo Coffee Lab at 7pm for Houston Rationalists' weekly social meetup.
(inside the IRONWORKS through the big orange door)
711 Milby St, Houston, TX 77023
Go to the #monday-meetups channel on our Discord server for additional details and discussion: