The article describes an experimental setup for evaluating the step-by-step reasoning capabilities of large language models (LLMs) on seven algorithmic reasoning tasks from Big-Bench Hard. The authors compare several prompting baselines, including direct prompting and execution-based strategies such as Think-and-Execute. The models assessed are GPT-3.5-Turbo, a strong performer on reasoning benchmarks, and CodeLlama. All evaluations are zero-shot, so the study examines how well these models carry out complex reasoning without in-context demonstrations, highlighting both the potential and the limitations of LLMs on such tasks.
To evaluate LLMs' reasoning capabilities, we curate seven algorithmic reasoning tasks from Big-Bench Hard that are designed to measure step-by-step reasoning in zero-shot settings.
We compare against several baselines: Direct prompting, Zero-shot Chain-of-Thought, Zero-shot Program-of-Thought, and a natural language variant of Think-and-Execute.
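To make the comparison concrete, the following is a minimal sketch of how these zero-shot prompt variants might be assembled. The function names, exact prompt wordings, the `call_model` stub, and the bracket-matching example are illustrative assumptions, not the authors' released prompts or code.

```python
# Hedged sketch of the zero-shot prompting baselines compared in the setup.
# All wordings below are assumptions for illustration only.

def direct_prompt(question: str) -> str:
    # Direct prompting: ask for the answer with no reasoning instruction.
    return f"{question}\nAnswer:"

def zero_shot_cot_prompt(question: str) -> str:
    # Zero-shot Chain-of-Thought: append a generic reasoning trigger.
    return f"{question}\nLet's think step by step."

def zero_shot_pot_prompt(question: str) -> str:
    # Zero-shot Program-of-Thought: ask the model to write code whose
    # execution produces the answer.
    return (
        f"{question}\n"
        "Write a Python program that computes the answer, "
        "then state the final answer."
    )

def think_and_execute_prompt(pseudocode: str, question: str) -> str:
    # Think-and-Execute-style prompt: supply task-level pseudocode and ask the
    # model to simulate its execution on the given instance (the natural
    # language variant would phrase the logic in prose instead of pseudocode).
    return (
        "Here is pseudocode that solves this type of task:\n"
        f"{pseudocode}\n\n"
        "Now execute it step by step on the following input:\n"
        f"{question}"
    )

def call_model(prompt: str) -> str:
    """Placeholder for a chat-completion call to GPT-3.5-Turbo or CodeLlama."""
    raise NotImplementedError

if __name__ == "__main__":
    # Hypothetical instance in the style of a Big-Bench Hard bracket task.
    question = "Is the following sequence of brackets balanced? ( [ ] ) {"
    for build in (direct_prompt, zero_shot_cot_prompt, zero_shot_pot_prompt):
        print(build(question))
        print("-" * 40)
```

In this sketch, only the prompt text changes across baselines; the model, decoding settings, and answer extraction would be held fixed so that differences in accuracy can be attributed to the prompting strategy.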
Collection