68 points by kcorbitt 20 hours ago | 8 comments
sadiq 16 hours ago
Excellent, look forward to giving this a go.

I was looking at: https://arxiv.org/abs/2506.18254 but your approach is even more general.

kcorbitt 14 hours ago
I really like RLPR for when you have a known-good answer to compare to as well!
spmurrayzzz 16 hours ago
Might end up being some confusion with the RULER benchmark from NVIDIA given the (somewhat shared) domain: https://github.com/NVIDIA/RULER

EDIT: by shared I only mean the adjacency to LLMs/AI/ML, RL is a pretty big differentiator though and project looks great

kcorbitt 14 hours ago
Dang, hadn't seen that. Namespace collision strikes again.
swyx 9 hours ago
yeah unfortunately for you this is one of the well-known long-context benchmarks. too late tho, soldier on.
maxrmk 14 hours ago
Very cool. Do you do anything to mitigate ordering bias in the evaluation function, or do you just expect it to average out over time?
kcorbitt 14 hours ago
No, we don't do anything right now. Theoretically we could judge several times with different orderings.

We could measure order bias really easily though; we just need to look at the average score by rollout position across many runs. I'll add that to my list of experiments!
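Roughly, a minimal sketch of both ideas (hypothetical judge_fn standing in for the LLM judge, not the project's actual API): score the same group of rollouts under several shuffled orderings, average the per-rollout scores, and also track the mean score at each presented position to quantify any order bias.

    import random
    from collections import defaultdict

    def score_with_shuffles(rollouts, judge_fn, n_orderings=4, seed=0):
        # judge_fn takes a list of rollouts (in presentation order) and
        # returns one score per rollout; it stands in for the LLM judge.
        rng = random.Random(seed)
        n = len(rollouts)
        avg_by_rollout = [0.0] * n            # order-averaged score per rollout
        avg_by_position = defaultdict(float)  # mean score at each presented position
        for _ in range(n_orderings):
            order = list(range(n))
            rng.shuffle(order)
            scores = judge_fn([rollouts[i] for i in order])
            for pos, idx in enumerate(order):
                avg_by_rollout[idx] += scores[pos] / n_orderings
                avg_by_position[pos] += scores[pos] / n_orderings
        return avg_by_rollout, dict(avg_by_position)

    # toy usage with a dummy judge that scores by length
    rollouts = ["a", "bb", "ccc", "dddd"]
    per_rollout, per_position = score_with_shuffles(
        rollouts, lambda rs: [float(len(r)) for r in rs]
    )

With a real judge, a flat per-position profile across many runs would suggest little order bias; a systematic tilt toward earlier positions would mean the shuffled averaging is doing real work.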

swyx 9 hours ago
how does o3 so dreadfully underperform qwen on the customer support agent task?
someoneontenet 17 hours ago
Love these write ups!
kcorbitt 16 hours ago
Thanks! If there are any topics that you'd find particularly interesting, let me know and I can try to find time. :)
ndgold 15 hours ago
Dope