<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Aaron's Insights]]></title><description><![CDATA[AI Startup Advisor at Amazon. Early-Stage Product & GTM. Free Forever. Opinions are my own. ]]></description><link>https://blog.aaronamelgar.me</link><image><url>https://substackcdn.com/image/fetch/$s_!g-1X!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45443bcb-1dd2-479f-b378-ae6c98655e77_758x758.png</url><title>Aaron&apos;s Insights</title><link>https://blog.aaronamelgar.me</link></image><generator>Substack</generator><lastBuildDate>Sun, 03 May 2026 01:34:28 GMT</lastBuildDate><atom:link href="https://blog.aaronamelgar.me/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Aaron Melgar]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[aaronamelgar@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[aaronamelgar@substack.com]]></itunes:email><itunes:name><![CDATA[Aaron Melgar]]></itunes:name></itunes:owner><itunes:author><![CDATA[Aaron Melgar]]></itunes:author><googleplay:owner><![CDATA[aaronamelgar@substack.com]]></googleplay:owner><googleplay:email><![CDATA[aaronamelgar@substack.com]]></googleplay:email><googleplay:author><![CDATA[Aaron Melgar]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Test Time Scaling with Human Feedback]]></title><description><![CDATA[Integrating Human Feedback into AI Reasoning Pipelines and implications on UX]]></description><link>https://blog.aaronamelgar.me/p/test-time-scaling-with-human-feedback</link><guid 
isPermaLink="false">https://blog.aaronamelgar.me/p/test-time-scaling-with-human-feedback</guid><dc:creator><![CDATA[Aaron Melgar]]></dc:creator><pubDate>Wed, 05 Mar 2025 20:48:03 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e041bdac-d28f-4fce-b5ae-4bfe9bf8516b_800x800.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I doubt I'm the only person wondering about this - but when will the test time scaling paradigm incorporate human feedback throughout the reasoning pipeline?</p><p>For those of you who aren't familiar with test time scaling, the latest reasoning systems, including DeepSeek R1, OpenAI o3, and Stanford's recent <a href="https://arxiv.org/pdf/2501.19393">s1</a>, are all built on this architecture. In this framework, a prompt doesn't just generate a single response immediately; instead it invokes a series of models to respond, and each output is then evaluated and reasoned on by yet another model to choose the best output before proceeding to the next step. These intermediate reasoning steps are performed in parallel, sequentially, or sometimes a combination of both. At each reasoning step, a method such as majority voting decides which outputs to propagate to the next step. These reasoning steps are typically evaluated by models in a pattern sometimes known as "LLM-as-a-judge", and more advanced techniques include building reinforcement learning reward functions to provide feedback to the models in the previous step. 
Check out this <a href="https://venturebeat.com/ai/how-test-time-scaling-unlocks-hidden-reasoning-abilities-in-small-language-models-and-allows-them-to-outperform-llms/">article</a> for a simplified breakdown if you want to go deeper.</p><p>In the math domain, these reasoning steps are easily verifiable and proceed down logical paths. This is why so many of the benchmarks for reasoning evaluate how well the model does on very difficult math problems like <a href="https://epoch.ai/frontiermath">FrontierMath</a>. In other domains like coding, reasoning steps are verifiable as well, e.g. did the code compile? However, I'd argue coding is subjective: there are many ways to solve a problem depending on the programmer's style, context, or intent. For generalized questions and research, which are frankly more practical for most applications than PhD-level math, reasoning may not be verifiable at all and is even more subject to user preference.</p><p>I'm not an ML researcher, but I'd suspect that it's possible for these systems to solicit human feedback on their intermediate reasoning steps in real time. 
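As a minimal sketch of what such a hook might look like (all names here are illustrative, not taken from any real system): a single reasoning step generates candidate outputs, picks a winner by majority vote, and pauses to solicit human feedback when the vote is too close to call.

```python
from collections import Counter

def majority_vote_step(candidates, confidence_threshold=0.5, ask_human=None):
    """Pick the most common candidate; defer to a human if no clear majority."""
    counts = Counter(candidates)
    winner, votes = counts.most_common(1)[0]
    if votes / len(candidates) <= confidence_threshold and ask_human is not None:
        # The evaluator judges the vote too ambiguous, so it pauses and
        # solicits human feedback before propagating an output onward.
        return ask_human(counts)
    return winner

# Five parallel model outputs for one intermediate reasoning step:
outputs = ["x = 4", "x = 4", "x = 4", "x = -4", "x = 2"]
print(majority_vote_step(outputs))  # clear 3/5 majority, prints "x = 4"

# A tie triggers the human-feedback path:
tie = ["x = 4", "x = -4"]
print(majority_vote_step(tie, ask_human=lambda counts: "x = 4 (human picked)"))
```

In a real pipeline the `ask_human` callback could just as well post to Slack or email and await an asynchronous reply rather than block inline.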
Of course the evaluator model needs to be able to decide when to ask for human feedback - but assuming that's possible, this should provide better results. In my experience, the o family of models does a decent job of asking for reasoning feedback up front, but imagine a world where the model pauses and checks its work with you. A lot of people say AI is like a junior employee, and when you work with junior employees you give them feedback on work in progress, not just their output.</p><p>I also want to explore the UX implications of test time scaling with human feedback. The pace of the human-computer interface has trended faster and faster since the origins of "instant message" chat. In some cases, like real-time voice, companies like Cartesia are pushing the boundaries of physics, where latency truly matters for a good UX. In many other use cases, especially with text, I don't think low latency and complete autonomy are as important as people say; we're just biased to expect them. In my experience, users are actually eager to provide feedback and steer AI systems rather than getting frustrated by low-quality outputs. Looking at products like Cursor, part of its success stems from people being able to specify the context the model should use, rather than just hoping the backend vector search system can figure out which files to use.</p><p>As test time scaling advances, these workloads are going to become increasingly asynchronous, going from seconds to hours or even days for a single query. If we shift the AI UX paradigm to how we delegate asynchronous workflows today, we can easily picture an AI system Slacking or emailing you to ask for feedback on its reasoning. Taking this further, an AI workflow could even set the status of a Linear ticket to 'needs review' when people should share code feedback before it goes back to work. 
Designing the UX for asynchronous workloads also helps bring down the cost of AI by allowing cheaper or time-shared computational resources to run these reasoning pipelines. This greatly extends budgets, allowing models to take more steps to reason on a problem.</p><p>Looking forward to seeing how UX for AI evolves as we delegate more complex tasks to test time scaling systems. If anyone is working on this, I'd love to chat!</p>]]></content:encoded></item><item><title><![CDATA[IQh - a new way of pricing AI systems]]></title><description><![CDATA[Exploring a new unit of measure for pricing AI reasoning, inspired by physics.]]></description><link>https://blog.aaronamelgar.me/p/iqh-a-new-way-of-pricing-ai-systems</link><guid isPermaLink="false">https://blog.aaronamelgar.me/p/iqh-a-new-way-of-pricing-ai-systems</guid><dc:creator><![CDATA[Aaron Melgar]]></dc:creator><pubDate>Thu, 30 Jan 2025 00:17:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ab6b80b5-f679-4458-9a4b-6d6883970b42_800x800.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As a physics major, I distinctly recall my professors stressing the importance of getting the units of measure 
correct. Even if the final number was off by a factor of pi, oftentimes having the appropriate units and being within the right order of magnitude earned me at least partial credit on exams. Looking ahead to the future of AI has me pondering the question: are tokens the right units to price AI models?</p><p>Take a simple example:</p><p>&#8220;Is 7428958698238956023987723967931 prime?&#8221; - roughly 20 tokens</p><p>&#8220;Take this hour-long call transcript, plus this 30-page compliance guide, and tell me if the salesperson asked about pre-existing conditions&#8221; - roughly 30,000 tokens.&nbsp;</p><p>Should these input and output tokens be priced the same? Clearly a model system will apply more effort answering one prompt over the other. </p><p>In these simple examples, high-level model tiers like 2.5, Pro, or Lite may suffice. However, this starts to break down as model systems become increasingly dynamic and involve many inference steps to generate answers. While most are touting 2025 as the year of Agents, one of the most important trends is the rise of chain-of-thought reasoning systems, which involve multiple parallel model inferences to generate, critique, refine, and reason about a prompt. DeepSeek&#8217;s <a href="https://github.com/deepseek-ai/DeepSeek-R1">R1 reasoning model</a> took the world by storm, open-sourcing the architecture and model - making reasoning capabilities significantly cheaper. They weren&#8217;t the first to build an architecture in this direction though; OpenAI&#8217;s <a href="https://www.forrester.com/blogs/openais-o3-hype-or-a-real-step-toward-agi/">o3</a> unveiling at the end of last year sparked a huge debate around the future of AI system design striving toward general intelligence. Meta also just released a paper outlining a reasoning system called <a href="https://arxiv.org/html/2412.06769v1">COCONUT</a>, which uses embeddings rather than text at intermediate steps to improve performance and efficiency. 
Even early-stage startups like Fireworks have been building model systems for <a href="https://fireworks.ai/blog/fireworks-compound-ai-system-f1">complex reasoning</a>, and no doubt this system design pattern will continue to evolve well into 2025 and beyond.</p><p>One could argue that chain-of-thought reasoning is itself an Agentic pattern, but I&#8217;m not here to play semantics. Autonomously applying reinforcement learning and supervised fine-tuning on systems of models at scale seems beyond the scope of the database queries and customer service workflows that most people refer to as Agents, but I digress. Pricing these reasoning systems requires a slew of new economic considerations that encapsulate how much time and effort went into generating an answer:</p><ul><li><p>Number of refinement steps generating the answer</p></li><li><p>Number of refinement steps occurring in parallel</p></li><li><p>Size of the models at each step</p></li><li><p>Latency requirements</p></li><li><p>Quantity of reasoning tokens exchanged between steps</p></li><li><p>Amount of context pulled from retrieval or search systems</p></li><li><p>So many more&#8230;&nbsp;</p></li></ul><p>As the levers of control with these systems become increasingly granular, I would bet that engineering teams will be able to provision reasoning capabilities with a certain degree of accuracy and precision. &#8220;I want a system with an IQ of 100 to work on this problem for up to 4 hours&#8221; more accurately represents how people think about labor economics. When you think about it, this is actually how people price consulting services - a higher hourly rate for a more capable person.&nbsp;</p><p>These reasoning systems can even dynamically route a prompt to an appropriately powerful model, and generate reasoning plans within budget constraints. The reasoning capability of a model combined with the amount of effort to arrive at an answer seems to be the best way to price these systems. 
</p><p>This led me to the notion of the IQ-hour, or IQh, which succinctly captures the interplay between reasoning capability and effort applied. Of course the measure of IQ doesn&#8217;t come without its limitations and caveats. The relationship between IQ and socioeconomic status is highly contested, and IQ doesn&#8217;t capture domain-specific performance. In my head, IQ is distinct from domain knowledge, which comes in the form of context, like pulling medical research into a prompt. The model system then reasons on the body of knowledge with varying degrees of IQ. All of this said, IQ is the most widely accepted unit for quantifying intelligence and reasoning, and makes for a good shorthand approximation here.&nbsp;</p><p>An analogous unit to IQh in physics is the kWh, the unit of energy by which your electricity usage is billed, where power in kW is multiplied by time consumed in hours. As a simple example, with 10 Wh you can turn on a 1 W light for 10 hours, or turn on a more powerful 10 W light for 1 hour. In this case, IQ represents the reasoning power of the model, multiplied by the time it gets used. A powerful model would solve a task in less time but could still cost the same as a less powerful model taking longer. After all, Andrew Ng famously proclaimed that &#8220;<a href="https://www.gsb.stanford.edu/insights/andrew-ng-why-ai-new-electricity">AI is the new electricity</a>&#8221; back in 2017; it&#8217;s great to see this coming to fruition.&nbsp;</p><p>I could see an argument that IQh is simply an extension of the GPU-hour metric, and that each step in the chain-of-thought reasoning can be priced by the amount of time the computer is being utilized. However, I tend to favor output pricing over input pricing; your utility doesn&#8217;t bill you for the natural gas consumed or the cost of solar installations, it just charges you for the kWh you use. 
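To make the kWh analogy concrete, here is a toy calculation (the IQ figures are made up for illustration, not a proposed rate card):

```python
# IQ-hours consumed: reasoning capability multiplied by hours applied,
# by direct analogy with kWh = power in kW times hours of use.
def iqh(iq, hours):
    """Return the IQ-hours for a system of the given capability and runtime."""
    return iq * hours

# A more capable system finishing faster can cost the same IQh as a less
# capable one running longer, just as 1 W for 10 h equals 10 W for 1 h.
print(iqh(150, 2))  # prints 300
print(iqh(100, 3))  # prints 300
```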
If you have two competing reasoning offerings, each could price at the same GPU-hour level but deliver different outcomes based on each system&#8217;s design. True outcome-based pricing in these systems is intractable because of their general nature, and determining the value of outcomes is often an arduous task. Just ask an engineer about the value of addressing technical debt and you&#8217;re in for hours of debate.&nbsp;</p><p>IQ is also more fungible as a potential standardized definition of reasoning, whereas it goes without saying that not all high-performance compute clusters are created equal. As Nvidia releases new chipsets every year, AMD and others race to compete, and many cloud providers are <a href="https://semianalysis.com/2024/12/03/amazons-ai-self-sufficiency-trainium2-architecture-networking/">provisioning their own chips</a>. It&#8217;s easier to consider lowering the cost of an IQh than increasing the reasoning capability of a GPU-hour, because compute clusters are so heterogeneous and sensitive to the economics of purchasing the hardware. Each cloud provider procures and provisions with its own methodology, but the output unit of IQ can be applied ubiquitously across any application.</p><p>I think the IQh concept is well aligned with the economics and heuristics of building AI systems. Even if we don&#8217;t get down to IQ-level precision in measuring reasoning, some combination of reasoning power and time is the right direction. In the spirit of strong opinions, weakly held - I&#8217;d love to hear how other people are thinking about pricing AI reasoning systems.</p>]]></content:encoded></item></channel></rss>