Evals and scorers are the backbone of how Braintrust enables optimization for AI agents. When can we expect a user feedback mechanism for the Replit agent that ties directly into the eval process? This would allow real-world usage insights from Replit users to feed into the scoring and training loops, ultimately helping the Replit team refine the agent more effectively.
It’s surprising that, as a Braintrust customer, Replit doesn’t already have this in place. A feedback-to-eval pipeline seems like a critical feature for quickly identifying and addressing the recurring concerns that many of us have raised here in the forum. Not only would it make it easier for the Replit team to prioritize fixes and improvements, but it would also give users confidence that their input has a tangible impact on the product’s evolution.
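To make the ask concrete, here's a rough sketch of what the capture side of such a pipeline could look like with Braintrust's Python SDK. The project name, score name, and agent stub are placeholders, and I'm going off the documented `logger.log` / `log_feedback` pattern, not anything Replit actually runs:

```python
import uuid
import braintrust

# Hypothetical project name.
logger = braintrust.init_logger(project="replit-agent")

def call_agent(prompt: str) -> str:
    # Stand-in for the real Replit agent call.
    return f"(agent response to: {prompt})"

def run_agent(prompt: str) -> tuple[str, str]:
    """Run the agent, log the interaction, and return (output, request_id)."""
    request_id = str(uuid.uuid4())
    output = call_agent(prompt)
    logger.log(id=request_id, input=prompt, output=output)
    return output, request_id

def record_feedback(request_id: str, thumbs_up: bool, comment: str) -> None:
    """Attach the end user's verdict to the previously logged interaction."""
    logger.log_feedback(
        id=request_id,
        scores={"user_satisfaction": 1 if thumbs_up else 0},
        comment=comment,
        source="external",  # assumption: marks this as end-user feedback
    )
```

Every thumbs-up or thumbs-down in the Replit UI would then land next to the exact interaction that produced it, ready to be scored and triaged.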
Direct, structured feedback integrated into evals would be a win-win for both the Replit team and its user community. Is there a timeline or plan for rolling this out?
Very tricky, because many problems are caused by bad prompting.
Garbage in, garbage out, as the old techie saying goes.
If Replit’s big backend training system had to deal with angry feedback every time the agent did something wrong [because it wasn’t instructed clearly by the user], then I think this would lead to a lot of false positives, where they’d end up modifying the AI rules for the wrong reasons.
But this is not to say you are wrong, @Marshal_Thompson - I just think it is extremely tricky to find the right model for taking, evaluating, and using feedback from users who all have very different degrees of expertise and understanding.
@Gipity-Steve
Interestingly, Braintrust’s entire business model is built around solving exactly the challenge you described.
They position themselves as an observability platform for fine-tuning LLM agents — and, according to their presentation earlier this week, Replit is already a customer.
If you haven’t explored it yet: Braintrust’s solution focuses on improving agent inputs and outputs through curated datasets and automated scorers. Paired with their new Loop feature, the platform should, in theory, give the Replit team the observability needed to pinpoint and address exactly the type of issue you’re talking about, or at the very least clear visibility into it.
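For anyone who hasn't looked at it: a Braintrust eval is essentially a dataset plus a task plus scorers. Here's a minimal sketch; the project name, dataset rows, and the exact-match scorer are invented for illustration, while the `Eval(data, task, scores)` shape follows their SDK:

```python
from braintrust import Eval

def run_agent(prompt: str) -> str:
    # Stand-in for the agent under test.
    return "route registered"

def exact_match(input, output, expected):
    # Trivial automated scorer: 1 if the output matches the curated
    # expectation, else 0. Real scorers would be fuzzier (LLM-judged, etc.).
    return 1 if output == expected else 0

Eval(
    "replit-agent",  # hypothetical project name
    data=lambda: [
        {"input": "Add a /health endpoint", "expected": "route registered"},
    ],
    task=lambda input: run_agent(input),
    scores=[exact_match],
)
```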
That said, without a robust user feedback loop feeding into those evaluations, I’m struggling to see how the Replit team can meaningfully move the needle in this problem space. The observability is only as good as the real-world signals it’s measuring.
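To sketch what "feeding into those evaluations" could mean in practice: promote thumbs-down interactions into a regression dataset that future evals run against. The fetch step is stubbed and the dataset name is made up; `init_dataset` and `insert` follow Braintrust's SDK:

```python
import braintrust

def fetch_thumbs_down():
    # Stand-in for pulling interactions whose user_satisfaction score was 0
    # (e.g. via Braintrust's API or an export job).
    return [{"input": "Fix the login bug", "output": "broke signup instead"}]

# "user-regressions" is a made-up dataset name.
dataset = braintrust.init_dataset(project="replit-agent", name="user-regressions")

for row in fetch_thumbs_down():
    dataset.insert(
        input=row["input"],
        expected=None,  # a human curator fills in the desired behavior later
        metadata={"source": "user_feedback", "bad_output": row["output"]},
    )
```

Without that promotion step, the evals only ever test what the team already thought to test.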