OpenAI has unveiled new speech-to-text and text-to-speech models that let developers build sophisticated, expressive speech agents. The latest models, gpt-4o-transcribe and gpt-4o-mini-transcribe, show marked improvements over the existing Whisper models in word error rate and accuracy, thanks to reinforcement learning and training on high-quality audio data. The gpt-4o-mini-tts model offers finer control over pronunciation and delivery but is currently limited to preset voices. The models are priced to be competitive with other audio offerings.
OpenAI is launching new speech-to-text and text-to-speech models through its API, allowing developers to create expressive and customizable speech agents.
The new gpt-4o transcription models significantly improve word error rates and accuracy, using reinforcement learning and diverse audio datasets for training.
These models are adept at recognizing speech nuances, handling accents, background noise, and varying speaking speeds better than their predecessors, which leads to fewer transcription errors.
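As a concrete illustration, a minimal sketch of calling the new transcription model through OpenAI's Python SDK might look like the following. The audio file name is a placeholder, the helper function is hypothetical, and the network call is guarded so the snippet only contacts the API when credentials are available:

```python
import os

def build_transcription_params(audio_path, model="gpt-4o-transcribe"):
    """Assemble keyword arguments for the audio transcriptions endpoint.

    `audio_path` is a placeholder; substitute a real recording.
    """
    return {"model": model, "file": audio_path}

params = build_transcription_params("speech.mp3")
print(params["model"])  # → gpt-4o-transcribe

# Only attempt the real API call when an API key is configured.
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()
    with open(params["file"], "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model=params["model"], file=audio_file
        )
    print(transcript.text)
```

Swapping `model` to `"gpt-4o-mini-transcribe"` would target the smaller, cheaper variant; everything else stays the same.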
The latest pricing for OpenAI's audio models is competitive, with gpt-4o-transcribe priced at $6 per million tokens for audio input.