I performed a study that compares four transcription models (base, small, medium and large-v3). My findings are:
- Base model is fastest but also least accurate.
- Small model offers the best balance: slightly better accuracy than large-v3 while being significantly faster.
- Medium and large-v3 demonstrate diminishing returns: higher processing time but only modest accuracy gains.
- Certain audio tracks proved challenging for all models.
- Combining models (e.g., small for most tracks, large-v3 for specific cases) could optimise accuracy. This approach could be implemented as an enhancement in Audacity.
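
As a rough illustration of the combined approach in the last point, here is a minimal Python sketch. It is not Audacity or OpenVINO code: the `Transcript` type, the `confidence` score, the `min_confidence` threshold and the `transcribe_with_fallback` function are all hypothetical names, and the study does not prescribe how the "specific cases" should be detected. The sketch assumes each model can be wrapped as a callable that returns text plus some confidence value between 0 and 1.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Transcript:
    text: str
    confidence: float  # hypothetical score in [0, 1]; not a real Audacity/OpenVINO field


# A transcriber takes a track path and returns a Transcript (assumed interface).
Transcriber = Callable[[str], Transcript]


def transcribe_with_fallback(
    track: str,
    models: Dict[str, Transcriber],
    default: str = "small",
    fallback: str = "large-v3",
    min_confidence: float = 0.8,  # assumed threshold, would need tuning
) -> Tuple[str, Transcript]:
    """Run the default model first; retry with the fallback only if needed."""
    first = models[default](track)
    if first.confidence >= min_confidence:
        return default, first

    # The default model struggled, so pay the extra processing time once.
    second = models[fallback](track)
    if second.confidence > first.confidence:
        return fallback, second
    return default, first
```

With this shape, most tracks only pay the cost of the small model, and the slower model runs only on tracks where the first pass looks unreliable. Any other signal, such as a manual per-track override, could replace the confidence check.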
The detailed study and its data are available at https://www.alanbonnici.com/2025/04/comparing-audacitys-openvino-whisper.html.