Our verification approach significantly improves the accuracy of reasoning chains but does not improve final-answer accuracy, which shows that verifying the reasoning and improving the final answer are distinct outcomes.
On the GSM8K dataset, for 91.6% of problems the final answer is unlikely to change under our deductive verification, because the vote margin between candidate answers is already large.
We found that 46.2% of the reasoning chains that reach the correct final answer are nevertheless filtered out by our verification, indicating that they contain errors in their intermediate reasoning.
It appears that many reasoning chains judged correct by their final answer alone still contain significant flaws, suggesting that verification matters for the quality of the reasoning, not just for the final outcome.
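The interaction between filtering and voting can be illustrated with a minimal sketch. It assumes that final answers are chosen by majority vote over sampled chains and that each chain carries a pass/fail verification flag. The `chains` data and the `majority_vote` helper are hypothetical and invented for illustration; they are not taken from our experiments.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer, or None if the list is empty."""
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

# Hypothetical toy data: (final_answer, passed_verification) per sampled chain.
chains = [
    ("18", True), ("18", False), ("18", True), ("18", False),
    ("18", True), ("20", False), ("18", True), ("16", False),
]

# Vote over all chains versus only the chains that pass verification.
vote_all = majority_vote([answer for answer, _ in chains])
vote_verified = majority_vote([answer for answer, ok in chains if ok])

# Fraction of chains with the correct final answer that verification rejects.
correct = "18"  # assumed ground-truth answer for this toy problem
flags = [ok for answer, ok in chains if answer == correct]
filtered_fraction = flags.count(False) / len(flags)

print(vote_all, vote_verified, round(filtered_fraction, 3))
```

In this toy case the vote is unchanged by filtering because one answer dominates, even though a third of the chains reaching that answer are rejected for flawed reasoning.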