This article describes an experimental study of several large language models (LLMs), namely GPT-3.5-Turbo, GPT-4, and CodeLlama, evaluating how effectively they generate answers under the 'Think-and-Execute' prompting framework. The setup covers the models, inference methodology, and evaluation criteria, reporting performance metrics such as inference time and accuracy. It compares outputs produced under a specific prompt formulation, examines the datasets in detail, and discusses limitations along with related work that situates the study within the broader field of LLM research.
We employ several LLMs in our experiments: the proprietary GPT-3.5-Turbo and GPT-4, alongside the open-source CodeLlama, assigning different models to different tasks.
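As a minimal sketch of how these models might be queried, the snippet below pairs the OpenAI chat API with a Hugging Face CodeLlama checkpoint. The helper names, the specific checkpoint size (`CodeLlama-7b-Instruct-hf`), and decoding parameters such as `temperature` and `max_new_tokens` are illustrative assumptions, since the section does not specify them.

```python
# Sketch only: exact model versions, prompts, and decoding settings
# used in the study are not stated in the text and are assumed here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def query_openai(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Send one prompt to an OpenAI chat model and return the reply."""
    response = client.chat.completions.create(
        model=model,  # e.g. "gpt-3.5-turbo" or "gpt-4"
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # assumed: greedy decoding for reproducibility
    )
    return response.choices[0].message.content


def query_codellama(prompt: str) -> str:
    """Run the same prompt through an open-source CodeLlama checkpoint."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "codellama/CodeLlama-7b-Instruct-hf"  # assumed checkpoint size
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```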
In our study, inference times varied substantially across models: GPT-3.5-Turbo completed tasks in about 30 seconds, while the CodeLlama models required 2 to 5 minutes.
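One way such per-task inference times could be measured is with a simple wall-clock wrapper around each model call, as sketched below; the exact measurement procedure used in the study is not specified, so this is an assumption.

```python
import time
from typing import Callable, Tuple


def timed_inference(query_fn: Callable[[str], str], prompt: str) -> Tuple[str, float]:
    """Run one model call and return (answer, elapsed seconds)."""
    start = time.perf_counter()
    answer = query_fn(prompt)
    elapsed = time.perf_counter() - start
    return answer, elapsed


# Example usage with the (hypothetical) helpers defined earlier:
# answer, seconds = timed_inference(query_openai, "Sort the list [3, 1, 2].")
# print(f"completed in {seconds:.1f}s")
```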