Test Time Scaling with Human Feedback
Integrating Human Feedback into AI Reasoning Pipelines and the Implications for UX
I doubt I'm the only person wondering about this - but when will the test time scaling paradigm incorporate human feedback throughout the reasoning pipeline?
For those of you who aren't familiar with test time scaling, the latest reasoning systems, including DeepSeek R1, OpenAI o3, and Stanford's recent s1, are all built on this architecture. In this framework, a prompt doesn't just generate a single response immediately; instead it invokes a series of models to respond, and each output is then evaluated and reasoned over by yet another model, which chooses the best output before proceeding to the next step. These intermediate reasoning steps run in parallel or sequentially, sometimes a combination of both. At each reasoning step, a method such as majority voting decides which outputs propagate to the next step. The steps are typically evaluated by models sometimes known as a "Model as a Judge", and more advanced techniques build reinforcement learning reward functions to provide feedback to the models in the previous step. Check out this article for a simplified breakdown if you want to go deeper.
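To make the shape of one of these steps concrete, here's a minimal Python sketch. The `generate` and `judge` callables are hypothetical stand-ins for the underlying models, not any specific system's API: one step samples several candidates, tries majority voting, and falls back to a judge score to decide what propagates forward.

```python
from collections import Counter
from typing import Callable, List

def reasoning_step(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: one call to a generator model
    judge: Callable[[str, str], float],  # hypothetical: judge model scores a candidate
    n_samples: int = 8,
) -> str:
    """One test-time scaling step: sample candidates, then select one to propagate."""
    # Sample several candidate continuations (in practice, in parallel).
    candidates: List[str] = [generate(prompt) for _ in range(n_samples)]

    # Majority voting: if most samples agree, prefer the most common answer.
    top_answer, top_count = Counter(candidates).most_common(1)[0]
    if top_count > n_samples // 2:
        return top_answer

    # Otherwise fall back to the judge model scoring each candidate.
    return max(candidates, key=lambda c: judge(prompt, c))
```

Real pipelines chain many of these steps together and vary the selection rule, but the sample-evaluate-select loop is the core idea.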
In the math domain, these reasoning steps are easily verifiable and proceed down logical paths. This is why so many of the reasoning benchmarks evaluate how well a model does on very difficult math problems like FrontierMath. In other domains like coding, reasoning steps are verifiable as well, e.g. did the code compile. However, I'd argue coding is also subjective: there are many ways to solve a problem depending on the programmer's style, context, or intent. For generalized questions and research, which are frankly more practical for most applications than PhD-level math, reasoning may not be verifiable at all and is even more subject to user preference.
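To illustrate what a verifiable check in coding can and can't tell you, here's a toy example of a cheap programmatic verifier (does the candidate even parse?) that a pipeline could use to filter outputs. It's purely illustrative, and it captures exactly the limitation above: passing the check says nothing about style, context, or intent.

```python
import ast

def parses(candidate_code: str) -> bool:
    """Cheap, objective verifier: is the candidate syntactically valid Python?"""
    try:
        ast.parse(candidate_code)
        return True
    except SyntaxError:
        return False

print(parses("def add(a, b):\n    return a + b"))  # True
print(parses("def add(a, b) return a + b"))        # False, missing colon
```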
I'm not an ML researcher, but I'd suspect it's possible for these systems to solicit human feedback on their intermediate reasoning steps in real time. Of course the evaluator model needs to be able to decide when to ask for human feedback, but assuming that's possible, this should produce better results. In my experience, the o family of models does a decent job of asking for feedback on its reasoning up front, but imagine a world where the model pauses and checks its work with you. A lot of people say AI is like a junior employee, and when you work with junior employees you give them feedback on work in progress, not just their final output.
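As a rough sketch of what that could look like inside a reasoning step, assuming a hypothetical `judge` model that returns a 0-1 confidence score, the pipeline could defer to the human only when the judge is unsure:

```python
from typing import Callable, List

def select_with_human_feedback(
    prompt: str,
    candidates: List[str],
    judge: Callable[[str, str], float],  # hypothetical judge returning a 0-1 score
    confidence_threshold: float = 0.6,
) -> str:
    """Pick an intermediate output, deferring to a human when the judge is unsure."""
    scores = {c: judge(prompt, c) for c in candidates}
    best = max(scores, key=scores.get)

    # If the judge's top score is low, pause the pipeline and ask the human.
    if scores[best] < confidence_threshold:
        print(f"Prompt: {prompt}\n\nProposed reasoning step:\n{best}\n")
        answer = input("Does this look right? (y/n or type a correction): ").strip()
        if answer.lower() not in ("y", "yes", ""):
            # Treat anything else as a correction or steering instruction.
            return answer
    return best
```

The interesting design question is the threshold: ask too often and you're back to babysitting the model, ask too rarely and you only find out about a bad reasoning path at the end.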
I also want to explore the UX implications of test time scaling with human feedback. Expectations around the human-computer interface have trended toward faster and faster interactions since the origins of "instant message" chat. In some cases, like real-time voice, companies like Cartesia are pushing the boundaries of physics, because latency truly matters for a good UX there. In many other use cases, especially with text, I don't think low latency and complete autonomy are as important as people say they are; we're just biased to expect them. In my experience, users are actually eager to provide feedback and steer AI systems rather than get frustrated by low-quality outputs. Looking at products like Cursor, part of its success stems from people being able to specify the context the model should use, rather than just hoping the backend vector search system can figure out which files to use.
As test time scaling advances, these workloads are going to become increasingly asynchronous, going from seconds to hours or even days for a single query. If we shift the AI UX paradigm toward how we delegate asynchronous workflows today, we can easily picture an AI system Slacking or emailing you to ask for feedback on its reasoning. Taking this further, an AI workflow could even set the status of a Linear ticket to "needs review" when it needs code feedback before going back to work. Designing the UX for asynchronous workloads also helps bring down the cost of AI by allowing cheaper or time-shared computational resources to run these reasoning pipelines. This greatly extends budgets, allowing models to take more steps to reason on a problem.
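A hand-wavy sketch of that asynchronous loop might look like the following, where `reason_step`, `notify_reviewer`, and `wait_for_feedback` are hypothetical hooks (the latter two could be backed by Slack, email, or Linear) rather than real APIs:

```python
from typing import Awaitable, Callable

async def run_async_reasoning(
    task: str,
    reason_step: Callable[[str], Awaitable[str]],       # hypothetical long-running model call
    notify_reviewer: Callable[[str], Awaitable[None]],  # hypothetical Slack/email/Linear hook
    wait_for_feedback: Callable[[], Awaitable[str]],    # resolves when a human responds
    max_steps: int = 5,
) -> str:
    """Run a multi-step reasoning job that checkpoints with a human between steps."""
    state = task
    for step in range(max_steps):
        state = await reason_step(state)

        # Flag the intermediate result as needing review, then go idle. While the job
        # waits, its compute can be released or time-shared, which is where the cost
        # savings of asynchronous workloads come from.
        await notify_reviewer(f"Step {step + 1} needs review:\n{state}")
        feedback = await wait_for_feedback()
        if feedback:
            state = f"{state}\n\nReviewer feedback: {feedback}"
    return state
```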
Looking forward to seeing how UX for AI evolves as we delegate more complex tasks to test time scaling systems. If anyone is working on this, I'd love to chat!