It’s easier to mess up an eval than to make a good one. Most unsuccessful evals make at least one of the following mistakes.

  1. If an eval doesn’t have enough examples, it will be noisy and a bad experience for researchers. ... It’s good to have at least 1,000 examples for your eval, and perhaps more if it’s a multiple-choice eval (the first sketch after this list shows how eval size drives this noise). Even though GPQA is a good eval, the fact that it fluctuates based on the prompt makes it hard to use.
  2. ... If there are a lot of mistakes in your eval, people won’t trust it. For example, I used Natural Questions (NQ) for a long time. But GPT-4 crossed the threshold where, if GPT-4 got a test example incorrect, it was more likely that the ground-truth answer provided by the eval was wrong. So I stopped using NQ.
  3. If your eval is too complicated, it will be hard for people to understand it and it will simply be used less. ... It’s critical to have a single-number metric—I can’t think of any great evals that don’t have a single-number metric.
  4. If your eval takes too much work to run, it won’t gain traction even if everything else is good. BIG-Bench is one of my favorite evals, but it was a great pain to run. There were both log-prob evals and generation evals, which required different infra (the second sketch after this list contrasts the two scoring modes) ... BIG-Bench didn’t gain much traction, even though it provided a lot of signal.
  5. If an eval is not on a meaningful task, AI researchers won’t deeply care about it. For example, in BIG-Bench Hard we had tasks like recommending movies or closing parentheses properly ... Successful evals often measure things central to intelligence, like language understanding, exam problems, or math.
  6. The grading in your eval should be extremely correct. If someone is debugging why their model got graded incorrectly and they disagree with the grading, that’s a quick way for them to write off your eval immediately. It’s worth spending the time to minimize errors due to parsing, or to have the best autograder prompt possible (the last sketch after this list shows how easily naive parsing mis-grades a correct answer).
  7. For the eval to stand the test of time, performance must not become saturated too quickly. For example, GLUE/SuperGLUE saturated so quickly that it was hard to show big gains, and people stopped using them. Language models also got good at tasks like summarization and translation faster than we could develop good evals for them, and so we stopped measuring those tasks.
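
A rough way to see the noise problem in (1): for an eval scored pass/fail per example, the 95% confidence interval on accuracy shrinks roughly with the square root of the number of examples. The sketch below is just that binomial arithmetic, with an assumed ~70% accuracy; the numbers are illustrative, not taken from any particular eval.

```python
import math

def ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence-interval half-width for an accuracy of p
    measured on n independently, pass/fail-scored examples."""
    return z * math.sqrt(p * (1 - p) / n)

# Assumed accuracy of ~70%, purely for illustration.
for n in (100, 200, 500, 1000, 5000):
    print(f"n={n:5d}  ~±{100 * ci_halfwidth(0.7, n):.1f} points")
```

With a few hundred examples the interval is roughly ±4–6 points, so a 2–3 point "gain" is easily noise; at 1,000 examples it tightens to about ±3 points.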
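
On (4), the infrastructure burden comes from needing two separate scoring paths. The sketch below is a minimal illustration of why; `model.logprob` and `model.generate` are hypothetical placeholders, not the API of any particular library or of BIG-Bench itself.

```python
def score_logprob_example(model, prompt: str, choices: list[str], answer_idx: int) -> bool:
    # Log-prob (multiple-choice) eval: score each candidate continuation
    # and check whether the gold choice gets the highest log-probability.
    scores = [model.logprob(prompt, choice) for choice in choices]
    return scores.index(max(scores)) == answer_idx

def score_generation_example(model, prompt: str, gold_answer: str) -> bool:
    # Generation eval: sample free-form text, then normalize and compare it
    # to the reference answer (this is where parsing errors creep in).
    completion = model.generate(prompt)
    return completion.strip().lower() == gold_answer.strip().lower()
```

Supporting both paths means maintaining two harnesses, which is a big part of what makes an eval painful to run.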
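
On (6), parsing is where many silent grading errors come from. The sketch below is one illustrative answer-extraction function for a multiple-choice eval; the patterns are assumptions about how models phrase their answers, not a complete solution.

```python
import re

def extract_choice(completion: str) -> str | None:
    """Pull a multiple-choice letter (A-D) out of a free-form completion.
    Naive parsing such as completion[0] would mis-grade a correct answer
    like "The answer is (B).", which is exactly the kind of error that
    makes researchers write off an eval."""
    text = completion.strip()
    # Prefer an explicit "answer is X" / "answer: X" statement.
    match = re.search(r"answer\s*(?:is|:)?\s*\(?([A-D])\)?\b", text, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    # Fall back to a completion that is just the bare letter, e.g. "B" or "(C).".
    match = re.fullmatch(r"\(?([A-D])\)?\.?", text)
    return match.group(1).upper() if match else None

# Example: extract_choice("I think the answer is (B).") returns "B".
```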
     

See also "Devising ML Metrics" from CAIS.