eggsyntax

AI safety & alignment researcher

Comments

I'm aware of the paper because of the impact it had. I might personally not have chosen to draw their attention to the issue, since the main effect seems to be making some research significantly more difficult, and I haven't heard of any attempts to deliberately exfiltrate weights that this would be preventing.

Interesting! Tough to test at scale, though, or score in any automated way (which is something I'm looking for in my approaches, although I realize you may not be).

  1. Gwern's theories make sense to me. The data was roughly 50/50 on <= 30 vs > 30, so that's where I split it (and I'm only asking the model to pick one of those two options). Sex in the dataset is just male/female; they must have added the other options later (35829 male, 24117 female, and 2 blanks which I ignored). Agreed that this is very much a lower bound, also because I applied zero optimization to the system prompt and user prompts. This is 'if you do the simplest possible thing, how good is it?'
  2. No, unfortunately it's all lowercased already in the dataset.
  3. I agree! Dating site data is somewhat easy mode. I compared gender accuracy on the Persuade 2.0 corpus of students writing essays on a fixed topic, which I consider very much hard mode, and it was still 80% accurate. So I do think it's getting some advantage from being in easy mode, but not that much. I'll note also that I'm removing a bunch of words that are giveaways for gender, and it only lost 2 percentage points of accuracy (a rough sketch of that preprocessing follows this list). So I do think it's mostly working from implicit cues and distributional differences here rather than easy giveaways. Staab et al. (thanks @gwern for pointing that paper out to me) looks more at explicit cues and compares to human investigators looking for explicit cues, so you may find that interesting as well.
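
To make items 1 and 3 concrete, here's a rough preprocessing sketch. It's an illustration under stated assumptions, not my actual code: the file name, the column names (`age`, `sex`, `essay0`), and the giveaway-word list are all placeholders.

```python
# Rough sketch only: file name, column names, and the giveaway-word list are placeholders.
import pandas as pd

GIVEAWAY_WORDS = {"husband", "wife", "boyfriend", "girlfriend"}  # hypothetical list

def strip_giveaways(text: str) -> str:
    """Drop words that directly give away the author's gender."""
    return " ".join(w for w in text.split() if w not in GIVEAWAY_WORDS)

df = pd.read_csv("okcupid_profiles.csv")              # assumed filename
df = df[df["sex"].notna() & (df["sex"] != "")]        # ignore the 2 blank entries
df["age_bucket"] = (df["age"] > 30).map({True: "over 30", False: "30 or under"})
df["essay_clean"] = df["essay0"].fillna("").map(strip_giveaways)
```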

Absolutely! @jozdien's recounting of those anecdotes was one of the sparks for this research, as was janus showing in the comments that the base model could confidently identify gwern. (I see I've inexplicably failed to thank Arun at the end of my post; need to fix that.)

Interestingly, I was able to easily reproduce the gwern identification using the public model, so it seems clear that these capabilities are not entirely RLHFed away, although they may be somewhat impaired.

Oh thanks, I'd missed that somehow & thought that only the temp mattered for that.

That used to work, but as of March you can only get the pre-logit_bias logprobs back. They didn't announce the change, but it's discussed in the OpenAI forums, e.g. here. I noticed the change when all my code suddenly broke; you can still see remnants of that approach in the code.
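
For anyone running into the same thing, here's a minimal sketch of the kind of call affected, assuming the current openai Python client (the biased token id and the model name are placeholders):

```python
# Minimal sketch; the biased token id and model name are placeholders.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Answer with a single word."}],
    logit_bias={1234: 100},  # placeholder token id being pushed toward
    logprobs=True,
    top_logprobs=5,
    max_tokens=1,
)
# After the (unannounced) change, these logprobs reflect the distribution *before*
# logit_bias is applied, so the post-bias probabilities can't be recovered from them.
print(resp.choices[0].logprobs.content[0].top_logprobs)
```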

That certainly seems plausible -- it would be interesting to compare to a base model at some point, although with recent changes to the OpenAI API, I'm not sure if there would be a good way to pull the right token probabilities out.

@Jessica Rumbelow also suggested that that debiasing process could be a reason why there weren't significant score differences between the main model tested, older GPT-3.5, and the newest GPT-4.

...that would probably be a good thing to mention in the methodology section 😊

 

You're correct on all counts. I'm doing it in the simplest possible way (0 bits of optimization on prompting):

"<essay-text>"
Is the author of the preceding text male or female?

(with slight changes for the different categories, of course, e.g. '...straight, bisexual, or gay?' for sexuality)

There's also a system prompt, also non-optimized, mainly intended to push it toward one-word answers:

You are a helpful assistant who helps determine information about the author of texts. You only ever answer with a single word: one of the exact choices the user provides.

I actually started out using pure completion, but OpenAI changed their API so I could no longer get non-top-n logits, so I switched to the chat API. And yes, I'm pulling the top few logits, which essentially always include the desired labels.
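
For concreteness, here's a minimal sketch of how a query like this can be issued and scored through the chat API, assuming the current openai Python client; it illustrates the setup described above rather than reproducing my actual code.

```python
# Minimal sketch of the classification query described above (not the actual code used).
import math
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a helpful assistant who helps determine information about the author "
    "of texts. You only ever answer with a single word: one of the exact choices "
    "the user provides."
)

def gender_probs(essay_text: str, model: str = "gpt-3.5-turbo") -> dict[str, float]:
    """Return probabilities for 'male'/'female' among the top few first-token logprobs."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f'"{essay_text}"\nIs the author of the preceding text male or female?'},
        ],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    return {t.token.strip().lower(): math.exp(t.logprob)
            for t in top
            if t.token.strip().lower() in ("male", "female")}
```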

A challenge posed on Twitter recently has interesting similarities with this effort (although it focused on in-context learning rather than fine-tuning):

https://twitter.com/VictorTaelin/status/1776677635491344744

 

A::B Prompting Challenge: $10k to prove me wrong! 

# CHALLENGE

Develop an AI prompt that solves random 12-token instances of the A::B problem (defined in the quoted tweet), with 90%+ success rate.

# RULES

1. The AI will be given a random instance, inside a <problem/> tag.
2. The AI must end its answer with the correct <solution/>.
3. The AI can use up to 32K tokens to work on the problem.
4. You can choose any public model.
5. Any prompting technique is allowed.
6. Keep it fun! No toxicity, spam or harassment.

Details of the problem are in this screenshot.

Lots of people seem to have worked on it, & the prize was ultimately claimed within 24 hours.
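
(For readers who don't want to dig through the screenshot: below is a minimal sketch of the rewrite system as I understand it from the public challenge. The rules are recalled from the challenge itself, not from anything in this post, so treat them as an assumption.)

```python
# A::B rewrite rules as recalled from the public challenge (an assumption, not from this post):
# whenever two neighboring tokens have their '#'s facing each other, rewrite the pair.
RULES = {
    ("A#", "#A"): [],            # annihilate
    ("B#", "#B"): [],            # annihilate
    ("A#", "#B"): ["#B", "A#"],  # swap
    ("B#", "#A"): ["#A", "B#"],  # swap
}

def reduce_program(tokens: list[str]) -> list[str]:
    """Apply the rewrite rules until no adjacent pair matches; the result is the 'solution'."""
    tokens = list(tokens)
    changed = True
    while changed:
        changed = False
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair in RULES:
                tokens[i:i + 2] = RULES[pair]
                changed = True
                break
    return tokens

# Example: a small instance and its normal form.
print(reduce_program(["B#", "A#", "#B", "#A", "B#", "#A"]))  # -> ['#A', 'B#']
```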
