Performance of the OpenVINO Whisper transcription models

TTMO · April 23, 2025, 6:56pm

I ran a study comparing four of the transcription models (base, small, medium and large-v3). My findings are:

The base model is the fastest but also the least accurate.
The small model offers the best balance: slightly better accuracy than large-v3 while being significantly faster.
Medium and large-v3 demonstrate diminishing returns: higher processing time but only modest accuracy gains.
Certain audio tracks proved challenging for all models.
Combining models (e.g., small for most tracks, large-v3 for specific cases) could optimise accuracy. This approach could be implemented as an enhancement in Audacity.
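
To make the last point more concrete, here is a minimal sketch of how such a two-model fallback could look. It is only an illustration under assumptions of mine, not how Audacity or the OpenVINO plugin works: the `Transcript` type, the `confidence` score, the 0.6 threshold and the two stand-in transcriber functions are all hypothetical, and the study itself does not say how the "specific cases" would be detected.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Transcript:
    text: str
    confidence: float  # hypothetical score in [0, 1]; not a real plugin field
    model: str


# A transcriber takes a track name and returns a Transcript.
Transcriber = Callable[[str], Transcript]


def transcribe_with_fallback(
    track: str,
    fast: Transcriber,
    accurate: Transcriber,
    min_confidence: float = 0.6,  # arbitrary threshold, chosen for illustration
) -> Transcript:
    """Run the fast model first; rerun with the slower model only if needed."""
    first = fast(track)
    if first.confidence >= min_confidence:
        return first
    second = accurate(track)
    # Keep whichever result the (assumed) confidence score prefers.
    return second if second.confidence > first.confidence else first


# Stand-in transcribers so the sketch runs without any models installed.
def fake_small(track: str) -> Transcript:
    confidence = 0.4 if "noisy" in track else 0.9
    return Transcript(f"small:{track}", confidence, "small")


def fake_large(track: str) -> Transcript:
    return Transcript(f"large:{track}", 0.8, "large-v3")


if __name__ == "__main__":
    for name in ("clean.wav", "noisy.wav"):
        result = transcribe_with_fallback(name, fake_small, fake_large)
        print(name, "->", result.model)
```

With this layout most tracks only pay the cost of the small model, and the large-v3 run is spent on the tracks where the first pass looks unreliable. Tracks that are hard for every model would still come out poorly, so a confidence signal like this would not help there.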