The self-play paradigm is self-evidently self-exciting. If we can just put a model in a nice loop and train it to learn everything, then we get AGI (I think?)! And with RL ‘finally’ working for LMs, it seems simple enough to do: get one model to propose problems, get another to solve them, and use RL to train both in a self-improving loop. Lots of interesting work explored this last year; two that stood out to me are Absolute Zero and SPICE (although there are many more cool works out there)!

But despite these papers existing, we don’t yet have AGI. Why not? I was curious about this, so I spent some time at the start of the year poking at these codebases. I’m sharing some results from these experiments below! This isn’t really paper-quality work, so don’t expect thorough ablations or anything like that, but I think all these experiments pointed me to a fairly useful insight: diversity is key to proper self-exploration. And diversity is… hard to solve.

Starting from Absolute Zero

Absolute Zero training overview
Figure 1. Overview of the Absolute Zero Reasoner training loop from the original paper.

For these experiments, I used Absolute Zero as a starting point: it was a pretty famous paper, they released code, and I was able to replicate the results easily enough with the published code. I’ll note that the setup in this paper is a bit complicated: the paper focuses on Python code execution as a playground for self-play, and in particular trains models to either generate or predict one element of an (input, program, output) triple given the other two. Guessing the input given program and output is abductive reasoning; guessing the output given input and program is deductive reasoning; guessing the program given input and output is inductive reasoning. The proposer model is rewarded the more the solver model struggles, and the solver is rewarded for getting problems right - see the figure above. Additionally, a buffer of past results is used to condition the proposer when creating new tasks, and is initialised with a few seed samples.
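
To make the three task types concrete, here’s a toy triple I made up myself (not from the paper):

program = "def f(x): return sorted(x)[-2]"   # toy program: second-largest element
inp = "[3, 1, 4, 1, 5]"
out = "4"
# Deduction: given (program, inp), predict out.
# Abduction: given (program, out), propose a valid inp.
# Induction: given (inp, out) examples, recover a program consistent with them.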

Replicating Absolute Zero

I primarily experimented with the 7b script, which trains Qwen/Qwen2.5-7B (the base model) in the self-play loop. Running the initial script, I found I was able to replicate the ‘in-distribution’ scores pretty well:

In-distribution scores
Figure 2. My reproduction of the in-distribution benchmark scores.
Published scores
Figure 3. Published in-distribution benchmark scores from Absolute Zero.

But! If we continue training, we see the ‘in-distribution’ scores start to decline:

In-distribution scores decline
Figure 4. Continuing the reproduction run shows in-distribution scores declining after the early peak.

Why is this happening? Looking at the metrics a bit more, we see that the model starts predicting the code output perfectly around the same time:

Code output prediction accuracy
Figure 5. Code prediction accuracy on generated tasks rises while LiveCodeBench performance stagnates.

This suggests the model is perfectly solving the generated code! But we reward the proposer for the solver struggling, and if we look at the accuracy of the policy on the generated programs at program generation time, we don’t see this perfect performance:

Generated programs
Figure 6. Policy accuracy on generated code-output tasks during training.

This is contradictory and suggests a bug…! And indeed this was the case. When new programs are generated, they are added to a buffer, from which programs are sampled for the solver tasks. However, the buffer is capped at a max size, and the cap is applied by simply truncating the dataset:

self.datasets['input'] = self.datasets['input'][:max_length]

Since the dataset is never shuffled, this means the most recently generated samples are dropped. Hence, after the buffer hits max size, it never grows, and the model is trained on the same set of programs forever. This explains our contradiction above - the proposer is generating programs the solver struggles with, but they are never added to the buffer, so the solver is never trained on them.
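
To make the failure mode concrete, here’s a toy illustration (hypothetical buffer of size 4, assuming new samples are appended before truncation, which matches the behaviour above):

buffer = ['p1', 'p2', 'p3', 'p4']   # already at max_length
buffer += ['p5_new']                # newly generated program appended at the end
buffer = buffer[:4]                 # truncation keeps the *oldest* four entries
# -> ['p1', 'p2', 'p3', 'p4']: p5_new is silently dropped, so the buffer never changes again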

Important: This bug doesn’t really affect the results in the paper, as the buffer size was 16384 and the model generates a maximum of 64 programs per step. Hence, the buffer doesn’t fill up until step 256 at the earliest, and the paper’s results train models for ~250 steps.

Failures in Diversity

Great, I thought - just remove the buffer size limit and never truncate, and we have AGI! But then, well:

Eval scores over training
Figure 7. Eval scores after fixing the buffer truncation bug.

Scores still collapsed…! Why? Looking at the other metrics, we see some clear signs of diversity collapse: entropy collapses, and complexity and program length flatline.

Entropy, complexity, and program length
Figure 8. Entropy collapses while generated-program complexity and length flatten.

If we do a bit more digging, we see that the model ends up generating basically the same program over and over again, producing something like:

def f(lst: list[int]) -> int:
    max_sum = 0
    for i in range(len(lst)):
        current_sum = 0
        for j in range(i, len(lst)):
            current_sum += lst[j] - i - (j - i - 1 - i - i - (j - i - 1 - i - i)) - i - ...
            if current_sum > max_sum:
                max_sum = current_sum
    return max_sum

Often the sum itself changes a bit, and maybe it uses list as the type hint instead of list[int], but basically it’s always this. This implies that the previous bug was actually preventing diversity collapse, simply by stopping new functions from entering the buffer after step ~200. For example, here’s a bunch of the generated functions at step 60:

### Program 1
     def f(strings: list[str]):
         count = 0
         for string in strings:
             count += 'a' in string
         return count

### Program 2
     def f(n: int) -> int:
         fib_a, fib_b = 0, 1
         while fib_b < n:
             fib_a, fib_b = fib_b, fib_a + fib_b
         return fib_b

### Program 3
     def f(numbers: list) -> list:
         for i in range(len(numbers)):
             numbers[i] -= 3
             numbers[i] //= 2
         numbers.sort(reverse=True)
         return numbers

### Program 4
     def f(nums: list):
         swapped_pairs = [nums[i:i+2][::-1] if i % 2 != len(nums)%2 else nums[i:i+2] for i in range(0, len(nums), 2)]
         return [num for pair in swapped_pairs for num in pair]

### Program 5
     def f(numbers: list, target_sum: int):
         if target_sum < 0:
             return 0
         result = [1] + [0] * target_sum
         for number in numbers:
             for i in range(number, target_sum + 1):
                 result[i] += result[i - number]
         return result[target_sum]

As you can see, these are for the large part ‘small’ functions, but they display a fair amount of diversity - not all are loops, and structurally they feel different from each other.

Why is this happening? Well, my best guess is that the model is never punished for generating the same program over and over again. But it is rewarded for programs the solver struggles with, and in particular programs whose output/input the solver can’t predict well. So, it settles on these loop accumulations that are tricky enough to predict, and then mildly varies them to get the solver to struggle.

We can see if we can improve this easily by just… adding a diversity reward!

Sprinkling in some diversity

So next I tried adding a diversity reward - a few different kinds, in fact:

LM-based diversity reward

One trick in NLP: just use an LM as a judge. It’s simple and (sometimes) works. For this, we simply ask an LM judge if the new program is meaningfully different from past programs. Since the prompt grows with the number of past programs, I just sampled 3 programs from the buffer to keep the prompt size constant. Keeping in the self-play spirit, I used the same model being trained as the judge, which gave reasonable but not perfect scores.
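
As a rough sketch of what this looked like (the judge_fn wrapper and the prompt wording here are illustrative, not the exact ones I used):

import random

def lm_diversity_reward(new_program: str, buffer: list[str], judge_fn, n_refs: int = 3) -> float:
    # Sample a few past programs so the judge prompt stays a constant size.
    refs = random.sample(buffer, min(n_refs, len(buffer)))
    prompt = (
        "Here are some previously generated programs:\n\n"
        + "\n\n".join(refs)
        + "\n\nHere is a new program:\n\n" + new_program
        + "\n\nIs the new program meaningfully different from the previous ones? Answer YES or NO."
    )
    # judge_fn is a hypothetical wrapper that samples a completion from the policy being trained.
    answer = judge_fn(prompt)
    return 1.0 if "YES" in answer.upper() else 0.0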

This worked for a bit, but then still eventually collapsed to the same ~4 programs over and over again. Why 4? Because we only ever sample 3 programs from the buffer, so the proposer is effectively ‘hacking’ this prompt-size limit. Sampling more reference programs could help, or doing pairwise comparisons only, but this gets expensive fast.

Embedding-based diversity reward

Another thing we could do is directly measure similarity between programs using an embedding model and cosine similarity. I used the microsoft/unixcoder-base model for embeddings.
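
Roughly, the reward looked something like this (a minimal sketch; the pooling and exact reward shaping in my runs may have differed):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModel.from_pretrained("microsoft/unixcoder-base").eval()

@torch.no_grad()
def embed(programs: list[str]) -> torch.Tensor:
    inputs = tokenizer(programs, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state        # (batch, seq, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)     # (batch, seq, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)     # mean-pool over non-padding tokens
    return torch.nn.functional.normalize(pooled, dim=-1)

def embedding_diversity_reward(new_program: str, past_programs: list[str]) -> float:
    # Reward novelty: 1 minus the highest cosine similarity to any past program.
    if not past_programs:
        return 1.0
    embs = embed([new_program] + past_programs)
    sims = embs[1:] @ embs[0]                          # cosine similarities (unit-normed)
    return float(1.0 - sims.max())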

This really worked but… also got hacked. We would get programs with similar structures but diverse names, like:

def f(creative_name_1: int, creative_name_2: int, ..., creative_name_N: int) -> int:
  counter = 0
  for i in range(1, some_param + 1):
    if i % K == 0 and i > some_other_param:
      counter += 1
    if i > yet_another_param:
      counter += 1
  return counter + some_param

This feels like the embedding model is just not that great at measuring program differences. It did pretty well on the evals, though, improving steadily over time. Maybe semantic diversity is all you need?

Embedding-based diversity reward eval scores over training
Figure 9. Embedding diversity improves evals relative to the Absolute Zero baseline.

Entropic + embedding

I also played around with adding the entropic reward from Learning to Discover at Test Time on top of the embedding approach. The basic intuition is to reward things that are very different, and focus on maximising that. This also sorta just collapsed, but I didn’t spend much time on it; it did seem to produce some nice diversity, though.
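
I won’t reproduce the paper’s exact formulation here, but one generic way to operationalise the ‘reward what is very different’ intuition on top of the embeddings above is a nearest-neighbour novelty bonus (a loose sketch, not the paper’s reward):

import torch

def knn_novelty_bonus(new_emb: torch.Tensor, past_embs: torch.Tensor, k: int = 5) -> float:
    # Average cosine *distance* to the k most similar past programs (unit-norm embeddings,
    # e.g. from the embed sketch above). Higher = the new program sits far from everything seen.
    if past_embs.numel() == 0:
        return 1.0
    sims = past_embs @ new_emb
    nearest = sims.topk(min(k, sims.numel())).values
    return float((1.0 - nearest).mean())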

Conditioning on a document corpus

In this setup, we sample a problem from a dataset, and then use it as conditioning for the generator model. The idea is that the dataset itself is diverse, and so we can just ‘expand’ it via the generator. I took this idea from SPICE, and sampled from the Llama Nemotron RLVR-code stdio dataset.
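
Concretely, the proposer prompt gets a sampled reference problem prepended, roughly like this (the dataset id and field name below are placeholders, not the exact ones I used):

import random
from datasets import load_dataset

# Placeholder id: point this at the Llama Nemotron RLVR-code stdio split you want to condition on.
corpus = load_dataset("your-org/nemotron-rlvr-code-stdio", split="train")

def build_proposer_prompt(base_prompt: str) -> str:
    # Prepend a randomly sampled reference problem as conditioning for the proposer.
    reference = random.choice(corpus)["problem"]   # field name is illustrative
    return (
        "Here is a reference problem for inspiration:\n\n"
        f"{reference}\n\n"
        "Propose a new, related but distinct task.\n\n"
        + base_prompt
    )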

This seemed nice initially, but then eventually it hit a similar problem to the embedding-based diversity reward: the model would just generate variations of the same program over and over again, but with slightly different names and domain words, for example:

def f(streams: int, trays: int, bars: int) -> int:
    bar_out = 2 if trays > 0 else 0
    for shift in range(4):
        bar_out += 1 if shift < streams else 0
    if shift == bars:
        bar_out -= 2 if trays > 2 else 1
        bar_out += 1 if trays > 0 else 0
    return bar_out + 3 if streams > 1 else bar_out + 2

def f(gears: int, mileage: int, platform: int) -> int:
    """
    Calculates potential wheel alignments via mileage statuses linked to premium gears and platform base orientations.
    """
    alignments = 0
    for gear in range(gears):
        if gear > platform * 2:
            alignments += 3 if gear % mileage == 0 else 2
        elif gear < mileage // 2:
            alignments += 1 if gear % 2 == 0 else 0
        if gear > platform:
            alignments -= 2 if gear < platform * 3 else 1
    return alignments + 2 if gears > mileage else alignments * 2

Again, these are just loops with some assorted arithmetic, but at least the variable names are nice. This actually did really well on the evals, similar to the embedding approach, but then eventually collapsed much later in training (past 600 steps), as seen below.

Conditioning eval scores over training
Figure 10. Conditioning on external code data improves early evals but later collapses.

What next?

From this, we can see a few things:

  • Diversity is useful for performance, but…
  • Diversity metrics are easy to hack during RL training, especially over time.
  • You need to train for > 600 steps to see the collapse affect downstream performance; our ‘fixes’ to diversity may only stave off collapse rather than fully solve it.

This is about where I stopped playing around with this, due to other stuff in my life coming up and needing attention. But there are lots of interesting extensions here, I think:

  • Training a better embedding model seems very doable, and probably gets much harder to hack. Perhaps you could even train one online adversarially against the generator?
  • Other constraints on the generator: e.g., an LM judge checking that generated tasks actually follow the conditioning data in the prompt, or sampling more heavily from that data. Generally, adding a well-prompted and trained LM judge seems strong, like what Scaling Self-Play with Self-Guidance does.
  • Breaking out of the Absolute Zero setup itself: I think the Absolute Zero setup is fairly restrictive, since it’s purely about code I/O. More general self-play setups would probably reduce the chances of collapsing into the same few tasks/programs over and over again.
  • Entropy-preserving RL. We saw entropy collapse, and earlier in training, when entropy was higher, we saw more diversity. It’s entirely possible that general entropy-preserving techniques could help here.

And that’s all for now! Thanks for reading!

Bibliography