Eliezer explores a dichotomy between "thinking in toolboxes" and "thinking in laws". 
Toolbox thinkers are oriented around a "big bag of tools that you adapt to your circumstances." Law thinkers are oriented around universal laws, which might or might not be useful tools, but which help us model the world and scope out problem-spaces. There seems to be confusion when toolbox and law thinkers talk to each other.

William_S:
I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to manage a team of 4 people which worked on trying to understand language model features in context, leading to the release of an open source "transformer debugger" tool. I resigned from OpenAI on February 15, 2024.
I wish there were more discussion posts on LessWrong. Right now it feels like it weakly if not moderately violates some sort of cultural norm to publish a discussion post (similarly, but to a lesser extent, on the Shortform). Something low effort of the form "X is a topic I'd like to discuss. A, B and C are a few initial thoughts I have about it. What do you guys think?" It seems to me like something we should encourage though. Here's how I'm thinking about it.

Such "discussion posts" currently happen informally in social circles. Maybe you'll text a friend. Maybe you'll bring it up at a meetup. Maybe you'll post about it in a private Slack group. But if it's appropriate in those contexts, why shouldn't it be appropriate on LessWrong? Why not benefit from having it be visible to more people? The more eyes you get on it, the better the chance someone has something helpful, insightful, or just generally useful to contribute.

The big downside I see is that it would clutter the post feed. When you go to lesswrong.com and see the list of posts, you don't want that list to have a bunch of low quality discussion posts you're not interested in; you don't want to spend time and energy sifting through the noise to find the signal. But this is easily solved with filters: authors could mark/categorize/tag their posts as low-effort discussion posts, and people who don't want to see such posts in their feed can filter them out.

Context: I was listening to the Bayesian Conspiracy podcast's episode on LessOnline. Hearing them talk about the sorts of discussions they envision happening there made me think about why that sort of thing doesn't happen more on LessWrong. Whatever you'd say to the group of people you're hanging out with at LessOnline, why not publish a quick discussion post about it on LessWrong?
habryka:
Does anyone have any takes on the two Boeing whistleblowers who died under somewhat suspicious circumstances? I haven't followed this in detail, and my guess is it is basically just random chance, but it sure would be a huge deal if a publicly traded company was now performing assassinations of U.S. citizens. Curious whether anyone has looked into this, or has thought much about baseline risk of assassinations or other forms of violence from economic actors.
Does the possibility of China or Russia being able to steal advanced AI from labs increase or decrease the chances of great power conflict?

An argument that it counter-intuitively decreases them: for the same reason that a functioning US ICBM defense system would be a destabilizing influence on the MAD equilibrium. Once such a shield were up, America's enemies would have no credible threat of retaliation if the US launched a first strike. So there would be nothing (geopolitically) stopping America from launching a first strike, and there would be quite a reason to launch one: the shield definitely works against the present crop of ICBMs, but may not work against future ICBMs. America's enemies will therefore assume that after the shield is up, America will launch a first strike, and will seek to gain the advantage while they still can by launching a pre-emptive first strike of their own. The same logic works in reverse: if Russia were building an ICBM defense shield and would likely complete it within the year, we would feel very scared about what would happen once that shield is up.

The same logic applies to other irrecoverably large technological leaps in war. If the US is on the brink of developing highly militarily capable AIs, China will fear what the US will do with them (imagine the tables were turned: would you feel safe with Anthropic & OpenAI in China, and DeepMind in Russia?). So if they don't get their own versions, they'll feel mounting pressure to secure their geopolitical objectives while they still can, or otherwise make themselves less subject to the threat of AI (would you not wish the US to sabotage the Chinese Anthropic & OpenAI by whatever means, if China seemed on the brink?). The faster the development, the quicker that pressure mounts, and the sloppier and rasher China's responses will be. If it's easy for China to copy our AI technology, the pressure mounts much more slowly.
Something I'm confused about: what is the threshold that needs to be met for the majority of people in the EA community to say something like "it would be better if EAs didn't work at OpenAI"? Imagining the following hypothetical scenarios over 2024/25, I can't confidently predict whether any of them would individually cause that response within EA:

1. Ten to fifteen more OpenAI staff quit for varied and unclear reasons. No public info is gained beyond rumours.
2. There is another board shakeup because senior leaders seem worried about Altman. Altman stays on.
3. The Superalignment team is disbanded.
4. OpenAI doesn't let the UK or US AISIs safety-test GPT-5/6 before release.
5. There are strong rumours they've achieved weakly general AGI internally at the end of 2025.


Recent Discussion

I hear a lot of different arguments floating around for exactly how mechanistic interpretability research will reduce x-risk. As an interpretability researcher, forming clearer thoughts on this is pretty important to me! As a preliminary step, I've compiled a longlist of 19 different arguments I've heard for why interpretability matters. These are pretty scattered and early-stage thoughts (and emphatically my personal opinion rather than the official opinion of Anthropic!), but I'm sharing them in the hope that they're interesting to people.

(Note: I have not thought hard about this categorisation! Some of these overlap substantially, but feel subtly different in my head. I was not optimising for concision and having few categories, and expect I could cut this down substantially with effort)

Credit to Evan...

hmys:

I feel like the biggest issue with aligning powerful AI systems is that nearly all the features we'd like these systems to have, like being corrigible, not being deceptive, having values aligned with ours, etc., are properties we are currently unable to state formally. They are clearly real properties, in that humans can agree on examples of non-corrigibility, misalignment, and dishonesty when shown examples of actions AIs could take. But we can't put them in code or a program specification, and consequently can't reason about them very precisely, test whether sys...

If it’s worth saying, but not worth its own post, here's a place to put it.

If you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are invited. This is also the place to discuss feature requests and other ideas you have for the site, if you don't want to write a full top-level post.

If you're new to the community, you can start reading the Highlights from the Sequences, a collection of posts about the core ideas of LessWrong.

If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the Concepts section.

The Open Thread tag is here. The Open Thread sequence is here.

nim:

More concrete than your actual question, but there are a couple of options you can take:

  • acknowledge that there's a form of social truth whereby the things people insist upon believing are functionally true. For instance, there may be no absolute moral value to criticism of a particular leader, but in certain countries the social system creates a very unambiguous negative value to it. Stick to the observable -- if he does an experiment, replicate that experiment for yourself and share the results. If you get different results, examine why. IMO, attempting in

...
nim:
Welcome! If you have the emotional capacity to happily tolerate being disagreed with or ignored, you should absolutely participate in discussions. In the best case, you teach others something they didn't know before, or get a misconception of your own corrected. In the worst case, your remarks are downvoted or ignored.

Your question on games would do well fleshed out into at least a quick take, if not a whole post, answering:

* What games you've ruled out for this and why
* What games in other genres you've found to capture the "truly simulation-like" aspect that you're seeking
* Examples of game experiences that you experience as narrative railroading
* Examples of ways that games that get mostly there do a "hard science/AI/transhumanist theme" in the way that you're looking for
* Perhaps what you get from it being a game that you miss if it's a book, movie, or show?

If you've tried a lot of things and disliked most, then good clear descriptions of what you dislike about them can actually function as helpful positive recommendations for people with different preferences.
nim:
Can random people donate images for the sequence-items that are missing them, or can images only be provided by the authors? I notice that I am surprised that some sequences are missing out on being listed just because images weren't uploaded, considering that I don't recall having experienced other sequences' art as particularly transformative or essential.
nim:
Congratulations! I'm in today's lucky 10,000 for learning that Asymptote exists. Perhaps due to my not being much of a mathematician, I didn't understand it very clearly from the README... but the examples comparing code to its output make sense! Comparing your examples to the kind of things Asymptote likes to show off (https://asymptote.sourceforge.io/gallery/), I see why you might have needed to build the additional tooling. I don't think you necessarily have to compare smoothmanifold to a JavaScript framework to get the point across -- it seems to be an abstraction layer that allows one to describe a drawn image in slightly more general terms than Asymptote supports. I admire how you're investing so much effort to use your talents to help others.


I'm excited to share a project I've been working on that I think many in the LessWrong community will appreciate: converting some rational fiction into high-quality audiobooks using cutting-edge AI voice technology from ElevenLabs, under the name "Askwho Casts AI".

The keystone of this project is an audiobook version of Planecrash (AKA Project Lawful), the epic glowfic authored by Eliezer Yudkowsky and Lintamande. Given the scope and scale of this work, with its large cast of characters, I'm using ElevenLabs to give each character their own distinct voice. Converting this story into an audiobook has been a labor of love, and I hope that if anyone has bounced off it before, this...

Mir:
I gave it a try two years ago, and I really liked the logic lectures early on (basically a narrativization of HAE101 (for beginners)), but gave up soon after. Here are some other parts I learned valuable things from:

* When Keltham said "I do not aspire to be weak."
* And from an excerpt he tweeted (I don't know the context): "if at any point you're calculating how to pessimize a utility function, you're doing it wrong."
* Keltham briefly talks about the danger of (what I call) "proportional rewards". I seem to not have noted down where in the book I read it, but it inspired this note:
  * If you're evaluated for whether you're doing your best, you have an incentive to (subconsciously or otherwise) be weaker so you can fake doing your best with less effort. Never encourage people with "you did your best!". An objective output metric may be fairer all things considered.
  * It furthermore caused me to try harder to eliminate internal excusification-loops in my head. "Never make excuses for myself" is my ~3rd Law, and Keltham helped me be hyperaware of it.
    * (Unrelatedly, my 1st Law is "never make decisions, only ever execute strategies" (origin).)
  * I already had extensive notes on this theme, originally inspired by "Stuck In The Middle With Bruce" (JF Rizzo), but Keltham made me revisit it and update my behaviour further.
    * Re "handicap incentives", "moralization of effort", "excuses to lose", "incentive to hedge your bets".
* I also have this quoted in my notes, though only to use as diversity/spice for explaining stuff I already had in there (I've placed it under the idionym "tilling the epistemic soil"):
  * Keltham: "I'm - actually running into a small stumbling block about trying to explain mentally why it's better to give wrong answers than no answers? It feels too obvious to explain? I mean, I vaguely remember being told about experiments where, if you don't do that, people sort of revise history inside their own heads, and aren't aware of the processes

Yep, I can do that legwork!

I'll add some commentary, but I'll "spoiler" it in case people don't wanna see my takes ahead of forming their own, or just general "don't spoil the intended payoffs" stuff.

  1. https://www.projectlawful.com/replies/1743791#reply-1743791
  2. https://www.projectlawful.com/posts/6334 (Contains infohazards for people with certain psychologies; do not twist yourself into a weird and uncomfortable condition contemplating "Greater Reality" - notice confusion about it quickly and refocus on ideas for which you can more easily update your ex
...

Produced as part of the MATS Winter 2024 program, under the mentorship of Alex Turner (TurnTrout).

TL;DR: I introduce a method for eliciting latent behaviors in language models by learning unsupervised perturbations of an early layer of an LLM. These perturbations are trained to maximize changes in downstream activations. The method discovers diverse and meaningful behaviors with just one prompt, including perturbations overriding safety training, eliciting backdoored behaviors, and uncovering latent capabilities.

Summary: In the simplest case, the unsupervised perturbations I learn are given by unsupervised steering vectors - vectors added to the residual stream as a bias term in the MLP outputs of a given layer. I also report preliminary results on unsupervised steering adapters - these are LoRA adapters of the MLP output weights of a given...
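To make the steering-vector case concrete, here is a minimal sketch of the idea as described above: a fixed-norm vector is added as a bias to one layer's MLP output and optimized to maximize the change in a later layer's activations. This is my own illustration rather than the post's code; it assumes a HuggingFace Llama/Qwen-style module layout, and the function and parameter names (train_unsupervised_steering_vector, source_layer, radius, ...) are hypothetical.

```python
# Minimal sketch (assumed API, not the post's implementation): learn a vector
# added to the MLP output of an early layer, trained to maximize the change in
# a later layer's activations, with the vector constrained to a hypersphere.
import torch

def train_unsupervised_steering_vector(model, input_ids, source_layer, target_layer,
                                        radius=4.0, steps=200, lr=1e-2):
    model.requires_grad_(False)  # only the steering vector is trained
    d_model = model.config.hidden_size
    theta = torch.randn(d_model, device=model.device, requires_grad=True)

    # Baseline activations at the target layer with no perturbation.
    with torch.no_grad():
        baseline = model(input_ids, output_hidden_states=True).hidden_states[target_layer]

    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        # Project onto a fixed-radius hypersphere so the optimizer can't "win"
        # just by making the vector arbitrarily large.
        vec = radius * theta / theta.norm()

        # Add the vector as a bias on the source layer's MLP output
        # (Llama/Qwen-style layout: model.model.layers[i].mlp).
        handle = model.model.layers[source_layer].mlp.register_forward_hook(
            lambda module, inputs, output: output + vec
        )
        steered = model(input_ids, output_hidden_states=True).hidden_states[target_layer]
        handle.remove()

        # Maximize the downstream change (minimize its negative).
        loss = -(steered - baseline).pow(2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

    return (radius * theta / theta.norm()).detach()
```

On this reading of the method, different random initializations of theta land on different downstream behaviors, which is what makes the procedure unsupervised.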

gate:

This is really cool! Exciting to see that it's possible to explore the space of possible steering vectors without having to know what to look for a priori. I'm new to this field, so I had a few questions; I'm not sure if they've been answered elsewhere.

  1. Is there a reason to use Qwen as opposed to other models? Curious if this model has any differences in behavior when you do this sort of stuff.
  2. It looks like the hypersphere constraint is so that the optimizer doesn't select something far away due to being large. Is there any reason to use this sort of constrai
...
gate:
Why do you guys think this is happening? It sounds to me like one possibility is that the model might have some amount of ensembling (thinking back to The Clock and The Pizza, where ensembling happened in a toy setting). W.r.t. "across all steering vectors", that's pretty mysterious, but at least in the specific examples in the post even 9 was semi-fantasy.

Also, what are y'all's intuitions on picking layers for this stuff? I understand that you describe in the post that you control early layers because we suppose they might be acting something like switches to different classes of functionality. However, implicit in the choice of layer 10 is that you probably don't want to go too early, because maybe in the very early layers it's unembedding and learning basic concepts like whether a word is a noun or whatever. Do you choose layers based on experience tinkering in Jupyter notebooks and the like, or have you run some sort of grid to get a notion of what the effects elsewhere are? If the latter, it would be nice to know, to aid in hypothesis formation and the like.
gate:
Maybe a dumb question, but (1) how can we know for sure if we are on-manifold, and (2) why is it so important to stay on-manifold? I'm guessing you mean that, vaguely, we want to stay within the space of possible activations induced by inputs from data that is in some sense "real-world." However, there appear to be a couple of complications: (1) measuring distributional properties of later layers from small-to-medium-sized datasets doesn't seem like a realistic estimate of what should be expected of an on-manifold vector, since later layers are likely more semantically/high-level focused and sparse; (2) what people put into the inputs does change in small ways simply due to new things happening in the world, but there are also prompt-engineering attacks that people use that are likely in some sense "off-distribution" yet still in the real world, and I don't think we should ignore these fully.

Is this notion of a manifold a good way to think about getting indicative information about real-world behavior? Probably, but I'm not sure, so I thought I might ask; I am new to this field. I do think at the end of the day we want indicative information, so somewhat artificial environments might at times have a certain usefulness.

Also, one convoluted (perhaps inefficient) idea, which felt kind of fun, for staying on-manifold is the following: (1) train your batch of steering vectors, (2) optimize in token space to elicit those steering vectors (i.e. by regularizing the vectors to be close to one of the token vectors, or by using an algorithm that operates on text), (3) check those tokens to make sure that they continue to elicit the behavior and are not totally wacky. If you cannot generate that steer from something that is close to a prompt, surely it's not on-manifold, right? You might be able to automate this by looking at perplexity, or by training a small model to estimate whether an input prompt is a "realistic" sentence or whatever. Curious to hear thoughts.
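As one way to cash out the automation idea at the end of that comment, here is a rough sketch (my own illustration, not from the thread) of scoring candidate prompts by their perplexity under a small reference language model, with a purely illustrative cutoff for flagging text that looks off-manifold:

```python
# Rough sketch: use a small LM's perplexity as a cheap proxy for whether an
# elicited prompt still looks like "realistic" text. The gpt2 scorer and the
# cutoff value are placeholders, not choices made anywhere in the thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
scorer = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def prompt_perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids returns the mean next-token cross-entropy.
        loss = scorer(input_ids=ids, labels=ids).loss
    return float(torch.exp(loss))

candidates = ["The weather in Paris has been unusually warm.",
              "zxqv ##steer loss9 vector!!"]
for c in candidates:
    ppl = prompt_perplexity(c)
    flag = "likely off-manifold" if ppl > 200 else "plausible"
    print(f"{ppl:10.1f}  {flag}  {c!r}")
```

A small trained classifier over prompts (the "small model" the comment mentions) could replace the perplexity cutoff; the point is just that the check is cheap to automate.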
Bogdan Ionut Cirstea:
Probably even better to use interpretability agents (e.g. MAIA, AIA) for this, especially since they can do (iterative) hypothesis testing. 

A couple years ago, I had a great conversation at a research retreat about the cool things we could do if only we had safe, reliable amnestic drugs - i.e. drugs which would allow us to act more-or-less normally for some time, but not remember it at all later on. And then nothing came of that conversation, because as far as any of us knew such drugs were science fiction.

… so yesterday when I read Eric Neyman’s fun post My hour of memoryless lucidity, I was pretty surprised to learn that what sounded like a pretty ideal amnestic drug was used in routine surgery. A little googling suggested that the drug was probably a benzodiazepine (think valium). Which means it’s not only a great amnestic, it’s also apparently one...

EJT:
But Eric Neyman's post suggests that benzos don't significantly reduce performance on some cognitive tasks (e.g. Spelling Bee).

Yeah there are definitely tasks that depressants would be expected to leave intact. I'd guess it's correlated strongly with degree of working memory required.

JenniferRM:
There was an era in a scientific community where they were interested in the "kinds of learning and memory that could happen in de-corticated animals" and they sort of homed in on the basal ganglia (which, to a first approximation "implements habits" (including bad ones like tooth grinding)) as the locus of this "ability to learn despite the absence of stuff you'd think was necessary for your naive theory of first-order subjectively-vivid learning". (The cerebellum also probably has some "learning contribution" specifically for fine motor skills, but it is somewhat selectively disrupted just by alcohol: hence the stumbling and slurring. I don't know if anyone yet has a clean theory for how the cerebellum's full update loop works. I learned about alcohol/cerebellum interactions because I once taught a friend to juggle at a party, and she learned it, but apparently only because she was drunk. She lost the skill when sober.)
the gears to ascension:
Yeah, agreed - benzos are on my list of drugs to never take if I can possibly avoid it, along with opiates. By "temporary", I just mean "recoverably". Many drugs society considers sus or terrible I consider mostly fine if the risks are managed, but that generally involves knowing how to avoid addiction, and means using things at non-recreational-dose levels. Benzos are hard to do that with because, to my cached understanding, the margin between therapeutic and addictive doses is very small.

Hello, friends.

This is my first post on LW, but I have been a "lurker" here for years and have learned a lot from this community that I value.

I hope this isn't pestilent, especially for a first-time post, but I am requesting information/advice/non-obvious strategies for coming up with emergency money.

I wouldn't ask except that I'm in a severe financial emergency and I can't seem to find a solution. I feel like every minute of the day I'm butting my head against a brick wall trying and failing to figure this out.

I live in a very small town in rural Arizona. The local economy is sustained by fast food restaurants, pawn shops, payday lenders, and some huge factories/plants that are only ever hiring engineers and other highly specialized personnel.

I...

Answer by Hastings:
If you can get to Seattle for your partner's career, you can likely get a job nannying during the day, which will pay $25 to $30 an hour and doesn't require a car. This time last summer I was an incoming intern in Seattle, and I was unable to pay less than $30 an hour for childcare during working hours, hiring by combing through Facebook groups for nannies and sending many messages. At this price, one of the nannies we worked with had a car and the other did not. I do not know what the childcare market is like near your current location.

It is not that high here, but this is something I will look into if we can get to Seattle. But does this not require a license?

nim:
I hear you, describing how weird social norms in the world can be. I hear you describing how you followed those norms to show consideration for readers by dressing up a very terrible situation as a slightly less bad one. In social settings where people both know who you are and are compelled by the circumstances to listen to what you say, that's still the right way to go about it.

The rudeness of taking peoples' time is very real in person, where a listener is socially "forced" to invest time in listening or effort in escaping the conversation. But posts online are different: especially when you lack the social capital of "this post is by someone I know I often like reading, so I should read it to see what they say", readers should feel no obligation to read your whole post, nor to reply, if they don't want to. When you're brand new to a community, readers can easily dismiss your post as a bot or scammer and simply ignore it, so you have done them no harm in the way that consuming someone's time in person harms them. A few trolls may choose to read your post and then pretend you forced them to do so, but anyone who behaves like that is inherently outing themself as someone whose opinions about you don't deserve much regard. (and then you get some randos who like how you write and decide to be micro-penpals... hi there!)

However, there's another option for how to approach this kind of thing online. You can spin up an anonymous throwaway and play the "asking for a friend" game -- take the option of direct help or directly contacting the "actual person" off the table, and you've ruled out being a gofundme scam. Sometimes asking on behalf of a fictional person whose circumstances happen to be more like the specifics of your own than you would disclose in public gets far better answers. For instance, if the fictional person had a car problem involving a specific model year of vehicle and a specific insurance company, the internet may point out that there's a recall on

Join us at Segundo Coffee Lab at 7pm for Houston Rationalists' weekly social meetup.

(inside the IRONWORKS through the big orange door)

711 Milby St, Houston, TX 77023

Go to the #monday-meetups channel on our Discord server for additional details and discussion:

https://discord.gg/DzmEPAscpS


LessOnline

A Festival of Writers Who are Wrong on the Internet

May 31 - Jun 2, Berkeley, CA