I retested GPT-5's coding skills using OpenAI's guidance - and now I trust it even less
Briefly

"Do AIs get headaches? Because GPT-5 has certainly been giving me one. This article was going to be so easy. OpenAI came out with a list of best practices for GPT-5 coding. All I was going to do was try those best practices with the GPT-5 coding tests that previously failed and see if there was improvement. It seemed so simple."
"I re-ran the first failed test. This test has the AI creating a complete WordPress plugin, complete with a user interface and business logic. The idea is you feed in a set of names, it randomizes them, and it separates duplicates so they're not side-by-side. When I ran this test on GPT-5 originally, it failed. Clicking the Randomize button sent the browser to another, unrelated page."
"This time, I ran the exact same test with the exact same prompt again. This time, it worked perfectly. Wow, I thought. GPT-5 has improved in the past week. If only I had left well enough alone. But no. I had to try again. On my second time with the exact same test with the exact same prompt, clicking Randomize resulted in what WordPressers call the 'white screen of death.'"
GPT-5's responses to identical coding prompts can vary widely: the same prompt produced a working WordPress plugin on one run, a redirect to an unrelated page on another, and a "white screen of death" site crash on a third. The model sometimes needs extra prompting to fix errors, and its prompt-optimization tools can improve outcomes while introducing new quirks. The model also tends to add details that were not requested, which complicates debugging and reduces reliability. Re-running identical tests can establish baselines, but fluctuating results undermine repeatability. The model may be coaxed into fixes through additional instructions, but initial failures still matter for evaluation.
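For readers unfamiliar with the test, the business logic the plugin is asked to implement (randomize a list of names, then keep duplicates from landing side-by-side) can be sketched roughly as follows. This is only an illustration in Python, not the WordPress/PHP code GPT-5 produced; the function name `randomize_separated` and the `rng` parameter are hypothetical, and the greedy strategy is an assumption about one reasonable way to do it. The ordering is random but not uniformly so, since the most frequent names are placed first.

```python
import heapq
import random
from collections import Counter


def randomize_separated(names, rng=None):
    """Return names in a random order, keeping equal names apart where possible.

    Hypothetical helper for illustration; not taken from the article's plugin.
    """
    rng = rng or random.Random()
    counts = Counter(names)
    # Max-heap via negated counts; the random key breaks ties unpredictably.
    heap = [(-count, rng.random(), name) for name, count in counts.items()]
    heapq.heapify(heap)

    result = []
    held = None  # the name just placed sits out of the heap for one turn
    while heap:
        neg_count, _, name = heapq.heappop(heap)
        result.append(name)
        if held is not None:
            heapq.heappush(heap, held)
        remaining = -neg_count - 1
        held = (-remaining, rng.random(), name) if remaining else None

    # If one name makes up more than half the list, full separation is
    # impossible; append the leftover copies so no input is dropped.
    if held is not None:
        result.extend([held[2]] * -held[0])
    return result


if __name__ == "__main__":
    print(randomize_separated(["Ann", "Bob", "Ann", "Cy", "Bob", "Ann"]))
```

Passing a seeded `random.Random` as `rng` makes a run repeatable, which is the property the retests above found missing in GPT-5's own output.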
Read at ZDNET