Mi-Go: tool which uses YouTube as data source for evaluating general-purpose speech recognition machine learning models

This article introduces Mi-Go, a tool aimed at evaluating the performance and adaptability of general-purpose speech recognition machine learning models across diverse real-world scenarios. The tool leverages YouTube as a rich and continuously updated data source, accounting for multiple languages, accents, dialects, speaking styles, and audio quality levels. To demonstrate the effectiveness of the tool, an experiment was conducted in which Mi-Go was used to evaluate state-of-the-art automatic speech recognition machine learning models. The evaluation involved a total of 141 randomly selected YouTube videos. The results underscore the utility of YouTube as a valuable data source for the evaluation of speech recognition models, ensuring their robustness, accuracy, and adaptability to diverse languages and acoustic conditions. Additionally, by contrasting the machine-generated transcriptions against human-made subtitles, the Mi-Go tool can help pinpoint potential misuse of YouTube subtitles, such as for search engine optimization.


Introduction
Speech recognition has become a critical component in numerous applications, ranging from virtual assistants and transcription services to voice-controlled devices and accessibility tools. The increasing reliance on speech recognition machine learning models necessitates robust and comprehensive evaluation methodologies to ensure their performance, reliability, and adaptability across diverse scenarios.
Existing speech recognition model evaluations often rely on curated datasets, such as LibriSpeech [25], CommonVoice [4], and TIMIT [32]. While these datasets provide a controlled environment for evaluation, they may not capture the full spectrum of real-world scenarios, potentially limiting the models' generalizability. Additionally, these datasets may not be updated frequently, resulting in potential stagnation in performance evaluation.
In this article, we introduce Mi-Go (the name will be explained further), a tool designed to evaluate the prediction performance of general-purpose speech recognition machine learning models. Mi-Go harnesses the power of YouTube as a data source, providing access to a virtually unlimited repository of diverse audio-visual content. YouTube offers a rich and continuously updated collection of spoken language data, encompassing various languages, accents, dialects, speaking styles, and audio quality levels. This makes it an ideal source of data which can be used to evaluate the adaptability and performance of speech recognition models in real-world situations.
In recent years, there has been a growing interest in harnessing the vast amount of data available on platforms such as YouTube for machine learning tasks. Various approaches have been proposed to collect and process data from YouTube, including YouTube-8M [1], AudioSet [11], and GigaSpeech [6]. However, these methods primarily focus on video and audio classification tasks rather than the evaluation of speech recognition models.
The landscape of speech recognition technology has witnessed a paradigm shift, driven by rapid advancements in deep learning and artificial intelligence. Groundbreaking architectures, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and, more recently, transformer-based models, have revolutionized this domain, offering unprecedented accuracy in transcribing human speech. These models, trained on vast datasets, have demonstrated remarkable proficiency in navigating the complexities of language, including accents, dialects, and noise interference. The emergence of these models not only underscores the accelerated pace of development in this field but also leads one to believe that in the near future seamless human-computer interaction will become the norm. It should be noted that while these advancements present exciting prospects, they also raise compelling questions concerning data privacy, algorithmic bias, and the digital divide.
In our study, we address this need by proposing, and then empirically investigating the prediction performance of, a speech recognition model-evaluation tool which utilizes YouTube as a data source, providing access to an extensive and diverse collection of audio samples for evaluation purposes. This approach ensures that the performance assessment remains up-to-date and relevant, capturing the nuances of real-world speech more accurately than curated datasets. To the best of our knowledge, there is little or even no research on using YouTube and the video subtitles provided by YouTube users for speech recognition evaluation. Considering all the above, our goal is to answer the following research question: (RQ) Will evaluation of the selected speech recognition machine learning model using YouTube as a data source, as made possible by Mi-Go, produce similar results (measured using the same metric) as the evaluation conducted by the model creators?
Mi-Go automates the process of data extraction, annotation, and evaluation from YouTube, ensuring an up-to-date and representative sample for evaluation purposes. By leveraging algorithms for data filtering and annotation, Mi-Go facilitates a thorough and unbiased evaluation of speech recognition models. Moreover, Mi-Go is designed to be easily adaptable, allowing for seamless integration with a variety of speech recognition solutions, making it a versatile and valuable tool in the speech recognition research community.
The primary motivation behind the development of the Mi-Go tool stems from the recognition of several limitations in existing approaches to evaluating speech recognition models. As speech recognition technology continues to play a critical role in various applications, including voice assistants, transcription services, and accessibility tools, ensuring the robustness and accuracy of these models is crucial.
Other speech recognition model evaluation methods often rely on static, curated datasets which, while useful for establishing a controlled environment, may not fully represent the diversity and complexity of real-world speech scenarios. This can lead to overfitting and limit the model's generalizability, ultimately affecting its performance in real-world applications.
Additionally, as the field of speech recognition rapidly advances, existing evaluation methods may struggle to keep pace with new developments and challenges, potentially hindering the progress of these models. By utilizing YouTube as a data source, Mi-Go aims to overcome these limitations and to offer a more comprehensive and dynamic evaluation environment.
Another motivation for the development of Mi-Go is the need for a flexible and adaptable tool capable of accommodating a variety of speech recognition models. This adaptability allows researchers and developers to compare and contrast the performance of various models, facilitating the continuous improvement and refinement of speech recognition systems.
By addressing these limitations and providing a dynamic, diverse, and adaptable evaluation tool, Mi-Go aspires to contribute significantly to the field of speech recognition research, driving innovation and fostering the development of highly accurate and robust models for various applications.
In summary, the Mi-Go tool is a contribution to the scientific and speech recognition community for the following reasons:
- Rich and diverse test data source. Mi-Go leverages YouTube, a platform with vast and continuously updated content, to provide a rich source of diverse audio-visual content. This includes various languages, accents, dialects, speaking styles, and audio quality levels. Such diversity is ideal for evaluating the adaptability and performance of speech recognition models in real-world situations, ensuring robustness, accuracy, and adaptability to diverse languages and acoustic conditions.
- Dynamic evaluation environment. By using YouTube as a data source, Mi-Go addresses limitations of previous approaches that often relied on static and potentially outdated datasets. It offers a more comprehensive and dynamic evaluation environment that reflects current real-world scenarios. This adaptability allows for the comparison of various models and facilitates the continuous improvement and refinement of speech recognition systems.
- Practical and theoretical contributions. The experimental results obtained through Mi-Go highlight the utility of YouTube as a valuable data source for the evaluation of speech recognition models. This not only underscores the platform's potential in enhancing model robustness and adaptability but also contributes to the academic discourse by providing a novel methodology for speech recognition research. Additionally, Mi-Go's approach of contrasting machine-generated transcriptions against human-made subtitles offers insights into potential misuse of subtitles, such as for search engine optimization purposes, thereby adding a layer of practical utility in detecting transcription anomalies.

YouTube as a data source for speech recognition model evaluation
With over 2 billion monthly active users and a diverse array of content uploaded every day, YouTube offers a rich resource for researchers and developers working on speech recognition technology. By tapping into this wealth of multilingual and multi-genre content, it is possible to evaluate and refine speech recognition models across various languages, dialects, and acoustic environments.
A vast digital archive. YouTube stands as a colossal repository of digital content, presenting an unparalleled resource for research across various disciplines. As the world's largest video sharing platform, it hosts billions of videos, a number that continues to grow with about 500 hours of new content uploaded every minute. The exact number of hosted videos is not known, but it is estimated at no less than 2.5 billion [3]. The number of YouTube "Shorts" videos alone, identified through the usage of the hashtag #shorts, reached approximately 828 million in February 2024.
Diversity of content. YouTube's vast library of user-generated content covers an extensive range of topics, languages, and styles. This diversity enables the evaluation of speech recognition models in real-world scenarios, such as noisy environments, various accents, and even low-quality audio recordings. By evaluating models on such a diverse dataset, researchers can identify potential weaknesses and areas for improvement, ultimately resulting in more robust and accurate speech recognition systems.
Multilingual corpus. One of the key advantages of using YouTube for speech recognition model evaluation is the platform's multilingual nature. Videos on the site are available in numerous languages, allowing for the assessment of models' performance across different linguistic settings. This multilingual corpus is invaluable for developing models that can handle a variety of languages, accents, and dialects, thereby expanding their utility and applicability.
Availability of human-generated transcripts. Many YouTube videos come with human-generated subtitles, either provided by content creators or contributed by users through the platform's community contributions feature. These transcripts serve as valuable ground-truth data for evaluating speech recognition models, as they offer a reliable source of comparison for the models' output. By comparing model-generated transcriptions with human-generated ones, researchers can assess the accuracy and performance of their models, identifying areas where improvements are needed.
Potential for continuous model improvement. The ever-growing volume of content on YouTube presents an opportunity for continuous improvement and adaptation of speech recognition models. As new videos are uploaded, models can be re-evaluated and fine-tuned to ensure they remain up-to-date and effective in an ever-changing linguistic landscape. This continuous feedback loop helps researchers identify trends, challenges, and emerging language patterns, which can be incorporated into model updates.
YouTube is an invaluable platform for speech recognition model evaluation due to its diverse, multilingual content and the availability of human-generated transcripts. By leveraging this vast resource, researchers and developers can evaluate and refine their models, ensuring they are robust, accurate, and adaptable to a variety of languages and acoustic conditions.

Related work
Studies leveraging YouTube in the area of automatic speech recognition have made significant strides across various facets of the field. These investigations utilize YouTube's extensive library of videos to create datasets, improve speech recognition systems, and explore new approaches to automatic speech recognition, showcasing the platform's value in advancing speech recognition technology research. Key insights from these works include:
- Datasets for automatic speech recognition model creation. Researchers have developed methodologies for creating databases for audio/visual speech recognition using YouTube videos, such as the comprehensive Spanish dataset by Córdova Esparza et al. [7]. In their work, the researchers presented a novel approach for creating an audio/visual speech recognition database, particularly addressing the scarcity of datasets in languages other than English, with a focus on Spanish. By selecting hundreds of YouTube videos, the researchers were able to extract facial features and align voice with text with millisecond accuracy, creating a dataset of over 100,000 samples. That methodology not only facilitated the development of automatic speech recognition systems in underrepresented languages but also provided a blueprint for creating datasets in any language by selecting appropriate YouTube content. Takamichi et al. [29] contributed to the diversification of automatic speech recognition research resources through the JTubeSpeech corpus, which consists of Japanese speech collected from YouTube. This corpus was designed for both speech recognition and speaker verification tasks, addressing the need for comprehensive datasets in Japanese for training and evaluating automatic speech recognition systems. The corpus's creation from YouTube videos ensured a variety of speech contexts and speaker demographics, enhancing the robustness of automatic speech recognition models trained on it. Lakomkin et al. [20] developed the KT-speech-crawler, an automated tool for constructing speech recognition datasets from YouTube videos. This tool leveraged automatic captioning provided by YouTube to generate datasets, significantly reducing the manual effort required in dataset creation and enabling researchers to easily compile large-scale datasets tailored to specific speech recognition research needs. The latest work in the field, the creation of Yodas, a YouTube-derived dataset, by Li et al. [22], showcases the ongoing efforts to harness YouTube content as a diverse and comprehensive training data resource for developing new, robust speech recognition models. By compiling a diverse set of audio and speech samples from YouTube, Yodas aims to provide a versatile dataset that supports a wide range of automatic speech recognition tasks, including dialect and accent recognition, speech-to-text conversion, and speaker verification.
- Improvement of automatic speech recognition systems. Liao et al. [23], from Google, explored the usage of new large-scale deep neural network acoustic modeling for YouTube video transcription. By leveraging the massive amount of unlabeled audiovisual content on YouTube, the researchers were able to enhance the modeling process using video transcripts uploaded by YouTube users, thus demonstrating the potential of semi-supervised learning approaches in improving automatic speech recognition systems' performance, especially in noisy and challenging acoustic environments. Their findings were then used in actual YouTube automatic speech transcription improvements.
- Audio-visual speech recognition. In their work, Serdyuk et al. [28] delved into the enhancement of automatic speech recognition by incorporating video content from YouTube, a novel approach that significantly improved speech recognition accuracy. That study leveraged a large corpus of YouTube videos to train models, focusing on how the visual modality, particularly the movement of the speaker's mouth, could augment audio features for speech recognition tasks. By replacing traditional 3D convolutional neural networks with a video transformer to extract visual features, Serdyuk and his team demonstrated a substantial improvement in word error rates on both a labeled subset of YouTube videos and the LRS3-TED public corpus (described in [2]). Their methodology highlighted the potential of utilizing video content alongside audio data to advance the capabilities of automatic speech recognition systems. This research not only showcased the importance of YouTube as a rich data source for speech recognition technologies but also opened new pathways for enhancing speech recognition accuracy by integrating audio-visual data, paving the way for more sophisticated and efficient automatic speech recognition systems.
- Bias and inclusivity in automatic speech recognition.
Koenecke et al. [18] uncovered significant racial disparities in the performance of commercial automatic speech recognition systems, including those developed by major tech companies. By analyzing speech from white and African American speakers, the study revealed a higher word error rate for African American speakers, highlighting a critical area for improvement in making automatic speech recognition technologies more inclusive and equitable. Tatman and Kasten [30] investigated the effects of talker dialect, gender, and race on the accuracy of Bing Speech and YouTube automatic captions. Their findings emphasized the impact of sociolinguistic factors on automatic speech recognition accuracy, urging the development of more sophisticated models that could better accommodate the diversity of human speech.
- Utilizing YouTube as an automatic speech recognition tool. Kim et al. [17] embarked on an insightful exploration into the capabilities of automatic speech recognition tools by utilizing YouTube's automatic transcription service as a benchmark for automatic speech recognition accuracy. In their study, they meticulously compared manual transcriptions with those generated automatically by YouTube, alongside other leading speech recognition platforms such as Google Cloud, IBM Watson, Microsoft Azure, and Trint. Their analysis provided a comprehensive evaluation of the relative performance of these services, with a particular focus on YouTube's efficacy in providing accurate transcriptions. This approach not only highlighted YouTube's potential as an accessible and effective tool for automatic speech recognition but also contributed to the broader discourse on the reliability and accuracy of free, platform-based speech recognition services.
These studies illustrate the extensive use of YouTube as a rich data source for automatic speech recognition research, ranging from training dataset creation to addressing biases and inclusivity in speech technologies. However, to the best of our knowledge, there is no work describing the direct use of YouTube to evaluate the functional performance of existing machine learning models used for automatic speech recognition.

Mi-Go Tool
Mi-Go was written in the Python programming language. Its source code is available for download under the Apache-2.0 license at the following address: https://github.com/Kowalski1024/Mi-Go. In the following, we describe the tool by walking through its subsequent operations, from launching it to saving the evaluation results of the selected speech recognition model.

Test Plan preparation
To start working with the tool, we need a file in JSON format, called a Test Plan. This is illustrated as number 1 in Fig. 1. In special circumstances, the Test Plan file can be written manually, but it is more efficient to generate it using an additional script named the Test Plan Generator. This script queries YouTube's API to compile a random list of videos, based on command-line parameters specifying the category of the videos, language, duration, and desired quantity of list items (details can be found in Appendix 1). It is essential that YouTube clearly indicates that a video has human-made subtitles; only such videos are considered. To query the API, the Test Plan Generator uses an external Python library called youtube-transcript-api. After querying the API, the Test Plan file contains all the necessary metadata about the videos to be used in further evaluation; it also stores information about the selected parameters and a token for the YouTube Data API, which can be reused in subsequent test iterations, if needed.
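The exact Test Plan schema is internal to Mi-Go; the following minimal Python sketch only illustrates the idea of assembling such a plan from video metadata, and all field names (videoId, hasManualSubtitles, etc.) are illustrative assumptions rather than the tool's actual format.

```python
import json

def build_test_plan(videos, category, language, api_token):
    """Assemble a minimal Test Plan dictionary.

    Illustrative schema only; the real Mi-Go format differs.
    Videos without human-made subtitles are filtered out, mirroring
    the requirement described above.
    """
    return {
        "args": {"category": category, "language": language},
        "token": api_token,  # reusable in subsequent test iterations
        "videos": [
            {"videoId": v["id"], "title": v["title"]}
            for v in videos
            if v.get("hasManualSubtitles")  # keep only human-subtitled videos
        ],
    }

plan = build_test_plan(
    [{"id": "abc123", "title": "Example", "hasManualSubtitles": True},
     {"id": "def456", "title": "No subs", "hasManualSubtitles": False}],
    category="Education", language="en", api_token="YOUR_TOKEN",
)
print(json.dumps(plan, indent=2))
```

In the real tool, the video list would come from the YouTube Data API rather than being passed in by hand.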

Data extraction and transcription
In the next step, marked with number 2 in Fig. 1, Mi-Go reads the Test Plan and, based on that plan, downloads from YouTube the audio track of each video from the plan and the subtitles for that video. Thus, for each video, we have a pair consisting of an audio file (marked as 2a) and human-generated subtitles (marked as 2b).
In the next step (number 3 in Fig. 1), a speech recognition model is employed to convert the downloaded audio into a textual transcript. This is done by the TranscriptTest component, which executes the speech recognition machine learning model against audio data collected from YouTube. The component can be adapted to a specific speech recognition model by extending it with model-specific code. This allows the use of different models from the popular "Hugging Face" machine learning model repository as well as models dedicated to toolkits such as ESPnet or NeMo.
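The extension idea can be sketched with a small adapter interface; the actual TranscriptTest component in Mi-Go differs, and the class and method names below are illustrative assumptions, not the tool's API.

```python
from abc import ABC, abstractmethod

class TranscriberAdapter(ABC):
    """Illustrative adapter interface for plugging in a model.

    A real subclass would wrap Whisper, a NeMo or ESPnet2 model,
    or any Hugging Face speech recognition pipeline.
    """

    @abstractmethod
    def transcribe(self, audio_path: str) -> str:
        """Return the transcript for one audio file."""

class DummyTranscriber(TranscriberAdapter):
    # A stand-in "model" used here only to show the plug-in pattern.
    def transcribe(self, audio_path: str) -> str:
        return f"transcript of {audio_path}"

adapter = DummyTranscriber()
print(adapter.transcribe("video_abc123.mp3"))
```

The evaluation loop then only depends on the `transcribe` interface, so swapping models requires no changes elsewhere.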
To eliminate inconsequential textual differences, both the subtitles downloaded from YouTube (number 2b in Fig. 1) and those generated by the speech recognition model (4) undergo a normalization process (5a and 5b) using OpenAI's normalization function.
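OpenAI's normalizer additionally expands abbreviations and standardizes numbers and spellings; the simplified stand-in below only illustrates why normalization matters before comparing transcripts, and is not the function Mi-Go actually uses.

```python
import re
import string

def normalize(text: str) -> str:
    """Simplified text normalization: lowercase, drop punctuation,
    collapse whitespace. OpenAI's actual normalizer does much more."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

# Without normalization these two strings would count as errors;
# after normalization they compare equal.
print(normalize("Hello,   World!"))   # -> hello world
print(normalize("hello world"))       # -> hello world
```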

Evaluation and metrics
Speech recognition model evaluation involves comparing the human-made subtitles downloaded from YouTube and those generated by the model (number 6 in Fig. 1). For that evaluation, the Mi-Go tool uses the open-source JiWER library to calculate the Word Error Rate (WER) measure [27]. WER is a common metric used to assess the performance of speech recognition systems, automatic translation systems, and other tasks involving transcription or translation. It is calculated by determining the minimum number of operations needed to transform the system output into the correct output. These operations include (see Eq. 1): word insertions I, word deletions D, and word substitutions S. To compute the WER, the total number of these operations is divided by the total number of words N in the correct output (in our case: the total number of words in the subtitles attached to a particular YouTube video), yielding a ratio that represents the rate of errors per word: WER = (S + D + I) / N. The lower the WER, the better the performance of the system, as it means fewer errors were made.
The concept of WER has been part of the field of automatic speech recognition and computational linguistics for many years. It is based on the Levenshtein distance, or edit distance, a string metric for measuring the difference between two sequences, introduced by Vladimir Levenshtein in 1965 [21]. The exact individual or group that first applied this concept specifically as Word Error Rate in speech recognition or translation systems is not clearly documented; it likely emerged from the academic and industry communities working on speech and language processing technologies. WER has since become a standard measure in these fields. In some cases, WER is expressed as a percentage (by multiplying the original formula by 100%), especially when ease of interpretation is a main concern.
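The WER computation described above can be reproduced with a short word-level Levenshtein routine. The experiment itself relies on the JiWER library; this stdlib-only sketch is for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N via word-level edit distance.

    Minimal illustrative version; Mi-Go uses the JiWER library.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                       # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                       # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 1 (100%) when the hypothesis contains many insertions, which explains the very large percentages reported later in the Results section.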
The comparison results are stored both in an SQLite database (7b in Fig. 1) and directly in the previously used Test Plan file (7a). Such a Test Plan file, with its evaluation results recorded, can be reused for subsequent evaluation iterations, for instance, to add results not previously obtained, or to retest the same videos specified (1) within it. This dual-storage approach (database and Test Plan file) facilitates simple access, filtering, and analysis of the evaluation results.
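The dual-storage step might be sketched as follows; the table schema, column names, and Test Plan fields below are assumptions for illustration, not Mi-Go's actual layout.

```python
import json
import sqlite3

# Store one evaluation result in SQLite (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (video_id TEXT, model TEXT, wer REAL)")
results = [("abc123", "whisper-large-v1", 0.074)]
conn.executemany("INSERT INTO results VALUES (?, ?, ?)", results)
conn.commit()

# Mirror the same results back into the Test Plan structure,
# so the plan file can be reused in later iterations.
plan = {"videos": [{"videoId": "abc123"}]}
for video in plan["videos"]:
    row = conn.execute(
        "SELECT wer FROM results WHERE video_id = ?",
        (video["videoId"],)).fetchone()
    video["wer"] = row[0]

print(json.dumps(plan))
```

Keeping both copies in sync means the database supports ad hoc SQL filtering while the annotated plan file remains a self-contained, rerunnable artifact.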

Experimental setup
Here, we describe an experimental setup that leverages the Mi-Go tool to use YouTube videos, across all categories, as data to evaluate speech recognition models by comparing their output with human-made transcripts.
The purpose of the experiment is to confirm whether the following setup (Mi-Go with YouTube as the evaluation data source) allows us to evaluate speech recognition models and obtain evaluation results similar to those obtained by the model creators.

OpenAI's Whisper
OpenAI, a company most notably recognized for its contribution to the field of artificial intelligence through the development of advanced large language models like GPT-3 and GPT-4, also developed Whisper, a family of state-of-the-art, general-purpose speech recognition models which demonstrate exceptional performance in various applications [27].
Due to the proven outstanding performance of that model family, as well as the fact that it has been made available under the open-source MIT License, we decided to focus our experiment mainly on the evaluation of the Whisper models. At this point, we should explain that the name "Mi-Go" comes from a novella by H.P. Lovecraft called "The Whisperer in Darkness"; thus, in our opinion, it makes a good name for a tool initially created to evaluate the Whisper models.
The model is based on a Transformer sequence-to-sequence architecture and is trained on a range of speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. These tasks are collectively represented as a sequence of tokens to be predicted by the decoder, enabling a single model to supplant multiple stages of a conventional speech processing pipeline. The multitask training approach employs a series of unique tokens that act as task specifiers or classification targets [27].
The Whisper model is available in five different sizes. Four of them (tiny, base, small, medium) have additional English-only versions which, according to the creators, perform better when used in English-only applications [16]. Thus, in our research, we decided to use the English-only model versions. The "large" model was improved twice; therefore, our experiment used two versions of the "large" model: the initial version, marked as "Whisper large-v1," and the latest version, marked as "Whisper large-v3." Each model offers a balance between speed and accuracy. The names of the used models, their approximate memory requirements, and relative speeds are provided in Table 1.

NVIDIA's Conformer-Transducer X-Large
To demonstrate that Mi-Go can be used to evaluate different speech recognition models, apart from OpenAI's Whisper, our experiment also included models provided by other companies, such as the one developed by NVIDIA, built upon the Conformer-Transducer architecture, which blends the strengths of transformer and convolutional neural network architectures [13]. The "X-Large" variant of this model signifies its substantial size and capacity, enabling it to process and understand complex audio inputs with higher accuracy compared to its predecessors. It is distributed under the Creative Commons BY 4.0 license [24]. When comparing the Conformer-Transducer X-Large model to OpenAI's Whisper model, there are several key points of differentiation. The Whisper model, as stated before, is based on a different architectural approach, primarily leveraging transformer neural networks. While both models aim to provide high accuracy in speech-to-text conversion, the NVIDIA model's use of the Conformer-Transducer architecture may offer advantages in handling real-time or streaming audio applications. Additionally, the specific design choices in the NVIDIA model might result in better performance in certain scenarios, such as dealing with background noise or low-quality audio inputs [8].
The Conformer-Transducer X-Large model is primarily used by NVIDIA in their open-source NeMo toolkit, designed to simplify the process of building, training, and fine-tuning complex neural network models, particularly for speech and natural language processing tasks [19]. To indicate this fact, as well as to use a shorter name, in the following text we will refer to the model as "NeMo Transducer Xlarge."

ESPnet2 model
Similarly to NeMo, ESPnet2 (End-to-End Speech Processing Toolkit, version 2) is an open-source (using the Apache 2.0 license) software toolkit designed for speech processing tasks, including automatic speech recognition, text-to-speech, and language modeling. Key features of ESPnet2 include its support for state-of-the-art machine learning models, its flexibility in handling different types of neural network architectures, and its comprehensive set of tools for training, evaluating, and fine-tuning models. ESPnet2 is widely used in the academic and research community for experimenting with novel ideas in speech processing and for developing systems that are more efficient and accurate in real-world applications [15].
Among the different speech recognition models available for the ESPnet2 toolkit, we chose one of the models trained by Shinji Watanabe, called in this work "ESPnet2 Conformer" for short, to use as a reference point for the Whisper model evaluation in our experiment. The selection of this particular model was motivated by the fact that it was used successfully in official ESPnet2 demonstration material [31].

Facebook's wav2vec2-base-960h
Facebook's Wav2Vec 2.0 is an advanced neural network-based framework for speech recognition developed by Facebook AI researchers. It employs a self-supervised learning approach in which the model is initially trained on 53,000 hours of unlabeled audio [5]. This pre-training allows the model to learn representations of speech from the raw audio itself. Once pre-trained, Wav2Vec 2.0-derived models can be fine-tuned with a smaller amount of labeled data to achieve high performance in transcribing speech. The model selected for our experiment, "wav2vec2-base-960h," was fine-tuned on 960 hours of the LibriSpeech [25] dataset, on 16 kHz sampled speech audio.

Data collection and preparation
To begin the experiment, we instruct the Test Plan Generator, a component of the Mi-Go tool, via its command-line interface, to randomly fetch 7-10 videos per category. These videos are randomly selected, but guided by factors such as popularity, relevance, and the presence of human-generated subtitles, ensuring a diverse and high-quality dataset. The YouTube Data API is used to acquire the videos, while the youtube-transcript-api library retrieves their corresponding transcripts. Once fetched, the same set of videos is used to evaluate the selected automatic speech recognition models (presented in Section 5.1). The full list of 141 videos used in the experiment is provided in Appendix 4.

Results
To answer the research question, we used the proposed Mi-Go tool with 141 YouTube videos, representing all categories listed in Table 2, to evaluate the selected automatic speech recognition models (presented in Section 5.1) and to collect Word Error Rate (WER) metrics.
Statistics for the collected Word Error Rate values for all evaluated models are presented in Table 3. The Whisper model characteristics published by its authors [27] concern only the "large-v1" model; thus, in Table 3, we present the WER statistics for that model in bold font.
As we can see, the median of the "large-v1" model evaluation results is WER = 7.4%. The worst median reported for the Whisper "large-v1" model by its creators was 19.6% (see Table 4 in Appendix 2); that result was achieved using the CORAAL speech recording dataset. The new version of the Whisper model, "large-v3," resulted in a worse WER median than the "large-v1" version. However, at the same time, "large-v3" produced a much lower maximum WER value and standard deviation compared to "large-v1". Thus, we can interpret this result as an indication of the higher stability of "large-v3" outcomes compared to the older Whisper model version.
One can find large WER values among selected results, significantly different from the median. However, by reviewing the YouTube videos behind the tests that ended with high WER values, we can conclude that the reason is not a malfunction of the Mi-Go tool or of the speech recognition model. Instead, the high WER values are due to actual discrepancies between the human-made subtitles attached to the video and those generated by the model. We found that such discrepancies occur for several reasons:

1. Transcription errors. Humans, despite their proficiency, are not infallible and may make mistakes when transcribing speech to text. This could involve mishearing words or phrases, particularly in a noisy environment, during rapid speech, or when dealing with dialectal variations or accents. On the other hand, automatic speech recognition models can "hallucinate" under certain conditions, causing high WER values. For example, in our experiment, for one video containing little speech⁸, the Whisper "large-v1" model returned the following transcription: "not a dog. I'm a cat. I'm a cat. I'm a cat. (...)"

2. Interpretation differences. Subtitling is not always a direct one-to-one transcription process. The transcriber's understanding and interpretation of the speech can influence the outcome. Homonyms, idiomatic expressions, cultural references, or ambiguous statements can all be interpreted differently depending on the transcriber's knowledge and perspective.

3. Contextual adaptations. Subtitle makers often make deliberate changes to the text for various reasons. They may simplify or clarify speech to make it more accessible to the audience, especially if the speech is complex or jargon-filled. They may also modify the text to match reading speed constraints, given that text must be readable within the time it is displayed. Cultural adaptations may also be made to make the content more comprehensible to a specific audience (as a form of the video's localization).

4. Descriptive transcriptions. Some transcriptions go beyond the spoken content and provide descriptions of the visual elements in the video. These are often intended for visually impaired or blind viewers, to provide them with a more comprehensive understanding of the video content. Such a case occurred with the video that produced the second highest WER value in our experiment (WER = 12,650%). While that video consists only of animal sounds, the actual subtitles are as follows (original spelling)⁹: "Cats Cats are very cute animals Animals that are close and affectionate with people Cat breed is a species with relatively high fertility, giving birth to 2-3 litters of kittens a year New born kittens only weighs about 100g and fits easily in the palm of your hand Horses are smart, wise animals Mother horses as young as 3 years old can start breeding (...)"

5. Search engine optimization (SEO). Some subtitles may be created or modified with the goal of improving the video's visibility in search engine results. The inclusion of relevant keywords and phrases can make

Conclusions and future work
In this paper, we have introduced Mi-Go, a lightweight and flexible tool for evaluating general-purpose speech recognition models using YouTube's vast and diverse content. Traditional evaluation methods, which employ curated datasets, may not capture the broad array of real-world scenarios, hence potentially limiting a model's generalizability. Mi-Go, by leveraging YouTube's dynamic content, offers an enriched platform for evaluating such models. An experiment was conducted using 141 randomly fetched YouTube videos, demonstrating the usefulness of the Mi-Go tool in evaluating model prediction performance and identifying discrepancies between model-generated transcriptions and human-made subtitles. The results underscore the necessity of human oversight in rectifying inaccuracies and the potential of the Mi-Go tool for enhancing speech recognition models' robustness and adaptability. While the Mi-Go tool demonstrates promising results in evaluating speech recognition models, several avenues for future work can further enhance its capabilities:

1. Expanding the tool to accommodate other data sources (such as non-English YouTube videos or video hosting services other than YouTube), providing an even more diverse and representative set of audio samples for evaluation

2. Incorporating advanced techniques for data preprocessing and augmentation, which can help in simulating various real-world challenges, such as background noise and audio distortions

3. Developing a graphical user interface and API, making it easier for researchers and developers to integrate and utilize the Mi-Go tool in their projects

4. Extending the tool to support other tasks, such as speaker identification evaluation and language identification evaluation, in addition to automatic speech recognition evaluation

An important area for further work is the tool's inability to handle audio characteristics such as noise, the number of speakers, accents, and the distance of the speaker. This limitation stems from the tool's foundational approach, which uses a straightforward comparison between human-made YouTube subtitles and those generated by a speech recognition model. This approach inherently focuses on textual alignment without delving into the nuances of audio quality or speaker attributes.
To address the handling of the audio characteristics mentioned above, an advanced feature could be integrated into the Mi-Go tool, employing audio analysis techniques to evaluate and adjust for different audio characteristics before the transcription process. This enhancement could involve the implementation of pre-processing algorithms capable of detecting and compensating for noise levels, identifying speaker count and accents, and adjusting for recording distance. Such improvements would not aim to refine the accuracy of the speech recognition, as that is not the tool's purpose, but to enrich Mi-Go's evaluation results by attaching possible root causes (such as high levels of noise or far-field speech) of potentially poor model performance.
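As a minimal illustration of the kind of pre-processing metadata such a feature could attach to evaluation results, one might estimate the signal level of raw audio frames via a simple RMS computation. The threshold and function names below are illustrative assumptions, not part of Mi-Go:

```python
import math

def rms_level(samples: list[float]) -> float:
    """Root-mean-square level of an audio frame (samples normalized to [-1, 1])."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def annotate_frame(samples: list[float], quiet_threshold: float = 0.01) -> dict:
    """Attach a possible root cause for poor recognition to an audio frame.

    A very low RMS level suggests silence or far-field speech; such a flag
    could accompany a high-WER result as a candidate explanation.
    """
    level = rms_level(samples)
    return {
        "rms": level,
        "possible_issue": "very low signal (far-field or silence?)"
                          if level < quiet_threshold else None,
    }

print(annotate_frame([0.0] * 100))              # silent frame → flagged
print(annotate_frame([0.5, -0.5] * 50)["rms"])  # → 0.5
```

Analogous annotations could be produced for estimated noise floor or speaker count, feeding the root-cause analysis described above.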
Currently, the Mi-Go tool is undergoing a rigorous and comprehensive testing process, embodying high standards of software quality assurance [10]. This testing is crucial not only to ensure the tool's reliability and accuracy in evaluating speech-to-text models but also to guarantee an optimal user experience, free from technical glitches and usability hurdles. By subjecting Mi-Go to such thorough scrutiny, we aim to provide users with a seamless and efficient tool, facilitating effective and user-friendly interactions in the nuanced field of speech-to-text system evaluation.
¹⁰ https://www.youtube.com/watch?v=Jk83I-z6C98, accessed 2024-03-14.

We hope that the Mi-Go tool will find wide application both in speech recognition machine learning model evaluation and in the detection of anomalies in existing video transcriptions.

-q, --queryTerm <term>: Query term for filtering the videos.

Examples
Generate a testplan using 100 random videos:

python testplan_generator.py 100

Generate a testplan using 50 videos, output to the specified directory, filtered by English language:

python testplan_generator.py 50 -o /path/to/directory -l en

Note: Replace /path/to/directory with the actual directory path where you want the testplan to be saved.

Fig. 1
Fig. 1 Mi-Go and speech recognition model evaluation phases (described in the text)

The experiment results are also illustrated in Fig. 2. Detailed statistics of the WER value for each model by category are presented in Appendix 3. Results for different datasets compared to our YouTube-based results are gathered in Appendix 2.
Excerpt of the hallucinated Whisper "large-v1" transcription for the video containing little speech (see the discussion of transcription errors): "not a dog. I'm a cat. I'm a cat. I'm a cat. I'm a cat. I'm a cat. I'm a cat. (...)"
The number of randomly fetched videos planned for use in the model's evaluation. This argument is required.

Optional arguments:

-o, --outputDirectory <directory>: Destination directory for the testplan files. Defaults to ./testplans/.
-l, --relevanceLanguage <ISO 639-1 language code>: Preferred language for the video's content. Defaults to en.
-c, --videoCategoryId <video category ID>: Use videos from a specific YouTube category, characterized by the YouTube API's category ID.
-t, --topicId <topic ID>: Use videos about a specific topic, characterized by the YouTube API's topic ID.
-r, --regionCode <region code>: Use videos targeted to a specific region. Defaults to US.
-d, --videoDuration <duration>: Video duration filter. Possible values are any, long, medium, and short. Defaults to medium.
-lc, --videoLicense <license>: Video license filter. Possible values are any, creativeCommon, and youtube. Defaults to creativeCommon.
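A sketch of how such a command-line interface could be defined with Python's argparse is shown below. This is a reconstruction from the documented options, not Mi-Go's actual source; details such as exact flag spellings may differ.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Reconstructed from the documented TestPlan Generator options;
    # the real parser in testplan_generator.py may differ in details.
    p = argparse.ArgumentParser(prog="testplan_generator.py")
    p.add_argument("videos", type=int, help="number of randomly fetched videos")
    p.add_argument("-o", "--outputDirectory", default="./testplans/")
    p.add_argument("-l", "--relevanceLanguage", default="en")
    p.add_argument("-c", "--videoCategoryId")
    p.add_argument("-t", "--topicId")
    p.add_argument("-r", "--regionCode", default="US")
    p.add_argument("-d", "--videoDuration", default="medium",
                   choices=["any", "long", "medium", "short"])
    p.add_argument("-lc", "--videoLicense", default="creativeCommon",
                   choices=["any", "creativeCommon", "youtube"])
    p.add_argument("-q", "--queryTerm")
    return p

args = build_parser().parse_args(["50", "-o", "/tmp/plans", "-l", "en"])
print(args.videos, args.outputDirectory, args.videoDuration)  # → 50 /tmp/plans medium
```

Parsing the second example from the Examples section ("50 -o /path/to/directory -l en") with this parser reproduces the documented defaults for all omitted options.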

Table 2.
Importantly, we decided on this number of videos based only on the available computing resources; the number of videos used for evaluation is not restricted and can be freely set by other Mi-Go users.

Table 2
YouTube videos categories considered in the experiment

Table 3
Word Error Rate [%] value statistics for all evaluated model versions

Whisper large-v1: min 0.7, mean 24.7, median 7.4, max 614.4, std 57.8
Fig. 2 Box plot of experiment results. Note the logarithmic scale.

The CORAAL dataset was popularized by Gunter et al. [14]. Other datasets used by the models' creators for validation were as follows: recordings of earnings calls by Del Rio et al. [9], sets of recordings of online blogs and podcasts, and a dataset containing recordings of The Late Show (sic!). Whisper "large-v1" model evaluation results from [27] compared to our results are presented in Table 4 in Appendix 2. By making this comparison, we can conclude that the Whisper model evaluation described in this work produces similar results to the tests conducted by the Whisper model creators using different data. Similarly, our results for the ESPnet2 Conformer and wav2vec models are similar to those of other authors, achieved using different datasets (Tables 5 and 7 in Appendix 2). The low WER median of the YouTube-based results for the Conformer-Transducer model compared to the results of other authors (Table 6 in Appendix 2) can be explained by the occurrence of the highest WER value for this model (18,250%), due to the fact that the model refused to transcribe the music video "All I Want For Christmas Is You" by Mariah Carey (other models did fine), possibly because of a model failure⁷ (another explanation is that the model simply does not feel the so-called "Christmas spirit").
video more likely to appear in search results related to those terms, hence enhancing the video's discoverability. An example of such subtitles comes from one of the fetched videos¹⁰. By comparing a model-made transcription to the existing human-made subtitles, discrepancies can be identified. Factors such as background noise, speaker accents, or low-quality audio can impact the model's performance. Hence, although speech recognition models can help identify potential inaccuracies in subtitles, a degree of human oversight and validation is typically necessary to confirm and rectify them. From a different perspective, an automated setup that combines Mi-Go with a selected speech recognition model can significantly help in detecting video subtitle misuse.
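The oversight workflow described above can be partially automated: extreme WER values are a cheap signal that subtitles deserve human review. A minimal sketch follows; the threshold and function name are illustrative assumptions, not part of Mi-Go.

```python
def flag_suspicious_subtitles(wer_value: float, threshold: float = 1.0) -> str:
    """Classify a video's subtitles based on its measured WER (as a fraction).

    WER above the threshold (here 100%) cannot be explained by ordinary
    recognition errors alone and suggests descriptive, adapted, or
    SEO-oriented subtitles that warrant human review.
    """
    if wer_value > threshold:
        return "review: subtitles diverge strongly from the audio"
    return "ok: subtitles broadly match the audio"

print(flag_suspicious_subtitles(0.074))   # typical "large-v1" median result
print(flag_suspicious_subtitles(126.50))  # the 12,650% descriptive-subtitles case
```

A human reviewer would then inspect only the flagged videos, which in our experiment correspond to the descriptive, hallucinated, and SEO-oriented cases discussed above.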

Table 4
Comparison of WER values for the Whisper large-v1 model presented in [27] and our results (highlighted)

Table 5
Comparison of WER values for the wav2vec 2.0 Large model presented in [27] and our results (highlighted)

Table 6
Comparison of WER values for the Conformer-Transducer model presented in [12] and our results (highlighted)

Table 7
Comparison of WER values for the ESPnet2 Conformer model presented in [26] and our results (highlighted)