Compare our EBM reasoning model against the latest frontier AI models. Enter a Sudoku or load a random hard puzzle.
*To ensure we are testing the AI models' ability to actually reason and self-align, we disabled code execution for both the EBM and the LLMs. If you run these tests on public LLMs with code execution enabled, they will run a brute-force search in Python to "cheat" rather than reason through the puzzles themselves. Kona reasons through the Sudoku without access to code execution.
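
For readers unfamiliar with what a brute-force search looks like here, the following is a minimal sketch of a backtracking Sudoku solver in Python. It is an illustration only, not the code any particular LLM generates and not part of the EBM; the function names (`find_empty`, `is_allowed`, `solve`) and the grid encoding (a 9x9 list of lists with 0 for an empty cell) are assumptions made for this example.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]  # assumed encoding: 9x9 grid, 0 marks an empty cell


def find_empty(grid: Grid) -> Optional[Tuple[int, int]]:
    """Return the (row, col) of the first empty cell, or None if the grid is full."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                return r, c
    return None


def is_allowed(grid: Grid, r: int, c: int, d: int) -> bool:
    """Check that digit d does not already appear in row r, column c, or the 3x3 box."""
    if d in grid[r]:
        return False
    if any(grid[i][c] == d for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)  # top-left corner of the 3x3 box
    return all(
        grid[i][j] != d
        for i in range(br, br + 3)
        for j in range(bc, bc + 3)
    )


def solve(grid: Grid) -> bool:
    """Fill the grid in place by trying digits 1-9 in each empty cell and backtracking.

    Returns True if a complete valid assignment was found, False otherwise.
    """
    cell = find_empty(grid)
    if cell is None:
        return True  # no empty cells left: solved
    r, c = cell
    for d in range(1, 10):
        if is_allowed(grid, r, c, d):
            grid[r][c] = d
            if solve(grid):
                return True
            grid[r][c] = 0  # undo the guess and try the next digit
    return False


if __name__ == "__main__":
    # Example puzzle, one string per row, with 0 for blanks.
    rows = [
        "530070000", "600195000", "098000060",
        "800060003", "400803001", "700020006",
        "060000280", "000419005", "000080079",
    ]
    puzzle = [[int(ch) for ch in row] for row in rows]
    if solve(puzzle):
        for row in puzzle:
            print("".join(str(d) for d in row))
```

A search like this tries candidate digits mechanically and undoes them on conflict, so it can finish a puzzle without any step-by-step deduction, which is why code execution is disabled in the comparison.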