

If you wire the LLM directly into a proof-checker (like with AlphaGeometry) or an evaluation function (like with AlphaEvolve), so that raw LLM outputs aren’t allowed to do anything on their own, you can get reliability. So you can hope for better; it just requires a narrow domain and a much more thorough approach than slapping some extra-firm instructions, written in an unholy blend of markup languages, into the prompt.
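Roughly, the pattern is generate-and-verify: the model proposes, a deterministic checker disposes. Here’s a minimal sketch; `propose_candidate` and `check_proof` are made-up stand-ins for the LLM call and the proof checker, not anything AlphaGeometry actually exposes:

```python
from typing import Optional

def propose_candidate(problem: str, attempt: int) -> str:
    # Stand-in for the LLM: in an AlphaGeometry-style setup this would return
    # a candidate construction / proof attempt.
    return f"candidate-{attempt} for {problem}"

def check_proof(candidate: str) -> bool:
    # Stand-in for the symbolic proof checker / evaluation function.
    # Nothing reaches the user unless this deterministic check passes.
    return "candidate-3" in candidate  # toy acceptance rule

def solve(problem: str, max_attempts: int = 100) -> Optional[str]:
    for attempt in range(max_attempts):
        candidate = propose_candidate(problem, attempt)
        if check_proof(candidate):
            return candidate  # verified result
    return None  # fail closed: no verified answer beats a confident wrong one

print(solve("some geometry problem"))
```

The point of the pattern is that the checker, not the model, defines what counts as an answer, so the LLM’s unreliability only costs you wasted attempts, not wrong results.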
In this case, solving math problems is actually something Google search could previously do (before dumping AI into it) and Wolfram Alpha can do, so it really seems like Google should be able to offer a product that gets math problems right. Of course, that solution would probably involve bypassing the LLM altogether through preprocessing and post-processing.
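If you wanted to sketch what that preprocessing could look like: spot the queries that are just arithmetic/algebra and hand them straight to a symbolic engine, no LLM involved. Purely illustrative (the regex and function names are made up), using sympy as the stand-in for Wolfram-Alpha-style evaluation:

```python
import re
import sympy

MATH_QUERY = re.compile(r"^[\d\s+\-*/^().x=]+$")  # crude "is this just math?" test

def answer_query(query: str) -> str:
    if MATH_QUERY.match(query):
        try:
            # Caret means exponentiation to humans, ** to sympy.
            expr = sympy.sympify(query.replace("^", "**"))
            return str(sympy.simplify(expr))
        except sympy.SympifyError:
            pass  # not actually parseable math; fall through
    return "route to the normal search / LLM pipeline"

print(answer_query("2^10 + 7*6"))          # -> 1066
print(answer_query("why is the sky blue")) # -> route to the normal pipeline
```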
Also, btw, LLMs can be (technically speaking) deterministic if the temperature is turned all the way down to zero; it’s just that this doesn’t actually improve their performance at math or anything else. And the output would still be “random” in the sense that minor variations in the prompt or preceding context can induce seemingly arbitrary changes.
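For the curious, “temperature all the way down” just means greedy decoding. A sketch with the OpenAI Python client (model name illustrative; and in practice even this isn’t perfectly deterministic, thanks to floating-point and batching quirks on the serving side):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def greedy_completion(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # greedy decoding: no sampling randomness
    )
    return resp.choices[0].message.content

# Same prompt twice -> (in principle) the same output...
a = greedy_completion("What is 17 * 243?")
b = greedy_completion("What is 17 * 243?")
# ...but a trivially different prompt can flip the whole completion.
c = greedy_completion("What is 17*243?")
print(a == b, a == c)
```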
So we sneerclubbers correctly dismissed AI 2027 as bad sci-fi with a forecasting model basically amounting to “line goes up”, but if you end up in any discussions with people who want more detail, titotal did a really thorough breakdown of why their model is bad even on its own terms, i.e. even granting their assumptions and their goal of modeling “line goes up”: https://www.lesswrong.com/posts/PAYfmG2aRbdb74mEp/a-deep-critique-of-ai-2027-s-bad-timeline-models
tl;dr: the AI 2027 model, regardless of inputs or current state, has task time horizons basically going to infinity at some near-future date because of how they set it up. The authors also make a lot of other questionable choices, and there are plenty of other red flags in their modeling. And the plot of task-time-horizon fits on their fancy graphical interactive webpage is unrelated to the model they actually used, and it omits some earlier data points that would make the fit look worse.
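To make the “goes to infinity at a finite date” point concrete (numbers purely illustrative, read titotal for the actual parameters): if each successive doubling of the task horizon is assumed to take a fixed fraction less time than the previous one, the doubling times form a geometric series with a finite sum, so the curve predicts an infinite horizon on a specific calendar date no matter where it starts.

```python
# Toy illustration of a finite-time blowup from "doublings keep speeding up".
first_doubling_years = 0.5   # illustrative starting doubling time
shrink_per_doubling = 0.9    # each doubling takes 90% as long as the last

# Geometric series sum: total time until infinitely many doublings have happened.
blowup_in_years = first_doubling_years / (1 - shrink_per_doubling)
print(f"horizon diverges after {blowup_in_years:.1f} years")  # 5.0 years

# Equivalently, watch the partial sums converge to the same date:
t, dt = 0.0, first_doubling_years
for _ in range(200):
    t += dt
    dt *= shrink_per_doubling
print(f"after 200 doublings: {t:.4f} years elapsed")  # ~5.0000
```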