Introduction
Voice recognition technology refers to systems that convert human speech into text: the system takes a person's voice as input and translates it into computerized text. Its use has recently widened due to advances in artificial intelligence and natural language processing, the field concerned with enabling computers to understand and process human language. As a result, computer systems can now process human languages and take voice commands better than before. Voice recognition systems apply various techniques to process voice signals, including linear predictive coding, perceptual linear prediction, mel-frequency cepstral coefficients, and zero crossing with peak amplitudes (Goyal & Batra, 2017). Linear predictive coding processes voice signals and produces an output informed by past features. In contrast, perceptual linear prediction analyzes the signals, removes redundant data, and provides an output signal at high speed. Likewise, the mel-frequency cepstral coefficient technique detects voice signals and their frequencies and transforms them into digital form. Finally, the zero crossing with peak amplitudes technique mimics the human auditory system and identifies features such as the pitch of spoken words. These techniques enable voice recognition systems to extract various speech features and differentiate voice signals. This literature review gives insight into the significant challenges of voice recognition systems and their solutions.
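As an illustration of the feature extraction described above, the following minimal sketch computes two of the named features. This is a hedged example, not part of the cited techniques' formulations: the librosa audio library and the input file name "speech.wav" are assumptions made for the illustration.

```python
# A minimal sketch of extracting two of the features named above, assuming the
# librosa audio library is installed and "speech.wav" is a hypothetical input file.
import librosa

# Load the recording as a mono waveform sampled at 16 kHz, a common rate for speech.
waveform, sample_rate = librosa.load("speech.wav", sr=16000, mono=True)

# Mel-frequency cepstral coefficients: a compact spectral representation that
# approximates how the human ear perceives frequency.
mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=13)

# Zero-crossing rate: how often the signal changes sign, a rough cue for
# distinguishing voiced from unvoiced speech.
zcr = librosa.feature.zero_crossing_rate(waveform)

print(mfcc.shape)  # (13, number_of_frames)
print(zcr.shape)   # (1, number_of_frames)
```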
Literature Review
Challenges Associated with Voice Recognition Systems
Chen et al. (2020) identify cognitive load as one of the drawbacks of using voice recognition systems. The challenge arises because the user's brain struggles to multitask between speaking, recalling, and solving problems. Speaking overloads a person's working memory, impairing interaction with voice recognition technology. The user cannot operate voice recognition systems efficiently for prolonged periods because doing so consumes the cognitive resources needed to perform other tasks simultaneously. Zhu and Aryadoust (2022) further studied cognitive load in distance interpreting with speech recognition software. Distance interpreting describes translation services where the speaker and translator are in different geographic locations. It requires the translator to listen to and perceive speech and then provide audio signals to the voice recognition system, which converts them to text. The translator's working memory is overworked and easily depleted, so the input provided to the voice recognition system contains errors that carry through to the output. Beyond distance interpreting, cognitive load also negatively affects training sessions that teach human operators to interact with artificial intelligence agents. Vukovic et al. (2019) describe how people operating applications with speech-enabled user interfaces experience cognitive load that eventually impairs their performance. The research reports that the mental effort of speaking while listening causes human errors that significantly affect the operated system's output.
Another challenge with voice recognition systems is poor accuracy during real-time dictation, which stems from fundamental limitations in how these systems process speech. Karbasi et al. (2019) highlight limitations such as inadequate vocabulary banks. An inadequate vocabulary bank, which the voice recognition system references when translating speech to text, produces inconsistencies in the output. In real-time dictation, where the output feeds into another function, these inconsistencies in the computerized text cause errors in subsequent systems. A study by Onitilo et al. (2023) further explains how limited vocabulary makes voice recognition systems inefficient. The study reports that voice recognition systems are typically trained on the speech of native speakers of a particular language. When a second-language user speaks with a different accent and dialect, exhibiting features that vary from native speech, the system cannot extract accurate information, and the errors appear in the output. Another source of inaccuracy during real-time dictation is low processing speed, which reduces the speech recognition rate. Real-time dictation involves rapid, continuous input of audio signals; poor processing speed means the system misses some incoming voice signals, leaving gaps in the output text. On top of that, most voice recognition systems lack the capability for continuous speech processing, making them ineffective in real-time dictation. These limitations make institutions where accurate documentation is crucial, such as courtrooms and hospitals, reluctant to adopt the technology.
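The dictation inaccuracy discussed above is conventionally quantified as word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the system's transcript into the reference text, divided by the reference length. A minimal sketch in plain Python follows; the example sentences are hypothetical.

```python
# A minimal sketch of the word error rate (WER), a standard measure of
# dictation accuracy. WER counts the substitutions, insertions, and deletions
# needed to turn the system's transcript into the reference text.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein edit distance over words, computed with dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical example: one substitution and one deletion out of five words.
print(word_error_rate("the patient denies chest pain",
                      "the patient denies chess"))  # 0.4
```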
Implementing voice recognition technology in organizations also requires heavy investment in financial and human resources. Cherkassky (2022) explains that acquiring and installing artificial intelligence-driven devices for speech recognition is costly. Voice recognition systems are adopted in various settings, including cars, schools, and offices, and adoption depends on the individual's or institution's financial capacity; smaller firms have therefore been unable to invest in the technology. Moreover, consumers of voice recognition systems encounter frustration with the technology. Because the systems are not designed with consumer-friendly features, end users without background knowledge in computer science or information technology struggle to use them (Saxena et al., 2018). Likewise, in organizations, human operators interacting with devices that integrate voice recognition technology require computing competence to navigate it. As a result, organizations must hire experts with specialized knowledge and skills for interacting with voice recognition systems. Acquiring a workforce with adequate expertise and purchasing the technology leaves a dent in an organization's resources.
Voice recognition technology is also largely limited to indoor environments because outdoor environments are prone to noise interference. Interference results from obstruction by objects and from the multitude of transmitted signals, which distort the input to the voice recognition software in frequency and amplitude (Zhang, 2020). Achieving high accuracy in such environments is problematic. Moreover, outdoor settings where the speaker is far from the voice recognition system complicate voice signal perception (Zhang, 2020). Sound propagating from the source is heavily attenuated, which decreases the signal-to-noise ratio. The low ratio causes poor sensitivity to incoming signals, leading to inaccurate voice command processing and erroneous output. Thus, voice recognition systems' poor far-field pick-up makes them inefficient in open-field environments.
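The far-field effect Zhang (2020) describes can be illustrated numerically: when the speech signal is attenuated by distance while the ambient noise floor stays fixed, the signal-to-noise ratio (SNR) drops. The following is a minimal sketch, assuming NumPy; the synthetic tone, noise level, and attenuation factor are hypothetical stand-ins for real speech and ambient noise.

```python
# A minimal sketch of the signal-to-noise ratio (SNR) effect described above.
# Attenuating the speech signal while the noise floor stays fixed lowers the
# SNR, which is one reason far-field pick-up degrades.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000)               # one second at 16 kHz
speech = np.sin(2 * np.pi * 220 * t)       # stand-in for a voiced speech signal
noise = 0.1 * rng.standard_normal(t.size)  # fixed ambient noise floor

def snr_db(signal: np.ndarray, noise: np.ndarray) -> float:
    # SNR in decibels: 10 * log10(signal power / noise power).
    return 10 * np.log10(np.mean(signal**2) / np.mean(noise**2))

print(f"near field: {snr_db(speech, noise):.1f} dB")         # strong signal
print(f"far field:  {snr_db(0.05 * speech, noise):.1f} dB")  # attenuated signal
```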
A Company That Deals with Problems in Voice Recognition Technology
Apple is among the biggest users of voice recognition technology, having incorporated it in many of its products, including the Mac, iPhone, and iPad. Specifically, Apple has integrated Voice Control in macOS, iOS 13, and iPadOS, which makes its products consumer-friendly and highly marketable. Among the defining features of Voice Control are machine learning algorithms that transcribe speech into computerized text (Tech Brief, 2019). The algorithms have been developed to understand natural human language, including various dialects and accents; as a result, a user does not need to pronounce words perfectly for the device to transcribe them quickly and accurately. In addition, Voice Control has customization settings that enable the user to dictate specialized content, such as scientific terms. Text-editing commands can enhance dictation accuracy (Basak et al., 2023). For instance, a voice command lets the user select particular phrases or insert words at specific positions in a text. The editing features also provide word autocorrection and emoji insertion, enhancing the user's communication.
Additionally, Voice Control in Apple products allows easy navigation through audio signals. For instance, a phrase like 'open document' commands the user's device to open a file without requiring any manual input. Voice recognition technology also gives Apple products intelligence and awareness of the user (Tech Brief, 2019). iPhones and iPads detect when the user is facing the screen, scan their face, and activate Voice Control, allowing the user to issue voice commands. Voice Control also works without a cellular data connection, giving the user flexibility in interacting with the device (Basak et al., 2023). Overall, integrating voice recognition technology into the operating systems of Apple products creates user-friendliness and convenience.
Apple has also implemented voice recognition technology in its virtual assistant, Siri. Siri is powered by artificial intelligence and voice recognition technology on macOS, tvOS, and iOS devices (Tech Brief, 2019). It uses voice recognition to process voice commands through artificial intelligence and returns audio output in human language (Basak et al., 2023), enabling users of Apple devices to get immediate feedback on everyday queries.
Possible Solutions to Challenges with Voice Recognition Technology
Chen et al. (2020) highlight various solutions to the challenges of voice recognition technology. One solution is for developers to reevaluate the impact of cognitive load on users. Doing so will allow them to revise the source code, integrate algorithms attuned to a user's specific characteristics, and make informed predictions during speech-to-text translation. Cherkassky (2022) supports this strategy, hypothesizing that if voice recognition system developers and voice-enabled device makers make revisions guided by cognitive load assessment, the system's performance will improve. Combining voice control assistance with cognitive assistance in voice-enabled devices will also lower the cognitive load on users. Speech recognition algorithms with a low Wake Word False Rejection Rate (WWFRR) and a high Response Accuracy Rate (RAR) will enable voice recognition systems to capture voice commands accurately in the far field (Basak et al., 2023). Another potential solution is integrating deep learning into voice recognition systems, combining speech separation with advanced acoustic models such as the Residual Neural Network (ResNet) to increase the accuracy of the translated text. Likewise, integrating speech emotion recognition capabilities will increase the reliability of the results.
Recommendation on the Best Solution for Implementation
Integrating deep learning into voice recognition systems is one of the best strategies for overcoming these issues. Deep learning equips a voice recognition system with neural networks that learn to interpret acoustic signals rather than following fixed rules (Basak et al., 2023). Instead of being programmed to respond to voice inputs in a particular way, the system learns independently: it is trained on large data sets, which it analyzes to establish the relations and patterns that shape its output (Cherkassky, 2022). Deep learning will also strengthen the system's capability to process conversational speech and provide output at a high accuracy rate, lowering the word error rate substantially. Overall, deep learning will sharpen the system's sensitivity to acoustic signals and give it greater robustness to noise interference.
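The residual connection that defines the ResNet-style acoustic models mentioned above can be sketched briefly. The following is a minimal illustration of the idea, not an implementation from the cited sources; PyTorch and the tensor dimensions are assumptions made for the example.

```python
# A minimal sketch of the residual connection at the heart of ResNet-style
# acoustic models (PyTorch is an assumption; the sources do not prescribe a
# framework). The skip connection adds the block's input to its output, which
# helps very deep acoustic models train without vanishing gradients.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: add the input back in

# Hypothetical usage: a batch of 8 spectrogram patches with 32 channels,
# 40 mel bands, and 100 time frames.
block = ResidualBlock(channels=32)
features = torch.randn(8, 32, 40, 100)
print(block(features).shape)  # torch.Size([8, 32, 40, 100])
```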
Deep learning is integrated into a voice recognition system by upgrading the system's source code and incorporating advanced acoustic models (Basak et al., 2023). The upgrade is necessary to adjust the system architecture so that it can accommodate the required algorithms and acoustic models without compromising functionality (Cherkassky, 2022). Additionally, training data must be incorporated into the system. Once these features are in place, developers can deploy a prototype for testing and relaunch the improved system when it meets all the upgrade objectives.
Conclusion
Voice recognition technology has revolutionized speech-to-text translation. The technology informs the development of speech-enabled devices that perceive commands and give the desired output. However, it encounters various drawbacks. For instance, it imposes cognitive load on users, leaving them unable to attend to other tasks. Voice recognition systems are also costly to implement, and because they are programmed on native-speaker language, they cannot accurately translate accented speech. These challenges can be overcome by redesigning systems with consideration of the cognitive load they impose on users. Incorporating deep learning is another strategy, employing acoustic models and algorithms to improve the accuracy and efficiency of voice recognition systems.
References
Basak, S., Agrawal, H., Jena, S., Gite, S., Bachute, M., Pradhan, B., & Assiri, M. (2023). Challenges and limitations in speech recognition technology: A critical review of speech signal processing algorithms, tools, and systems. Computer Modeling in Engineering & Sciences, 135(2), 1053–1089. https://doi.org/10.32604/cmes.2022.021755
Chen, J., Lyell, D., Laranjo, L., & Magrabi, F. (2020). Effect of speech recognition on problem solving and recall in consumer digital health tasks: Controlled laboratory experiment. Journal of Medical Internet Research, 22(6). https://doi.org/10.2196/14827
Cherkassky, D. (2022). The problem with current speech recognition technology. LinkedIn. https://www.linkedin.com/pulse/problem-current-speech-recognition-technology-dani-cherkassky
Goyal, S., & Batra, N. (2017). Issues and challenges of voice recognition in a pervasive environment. Indian Journal of Science and Technology, 10(30), 1–4. https://doi.org/10.17485/ijst/2017/v10i19/115518
Karbasi, Z., Bahaadinbeigy, K., Ahmadian, L., Khajouei, R., & Mirzaee, M. (2019). Accuracy of speech recognition system's medical report and physicians' experience in hospitals. Frontiers in Health Informatics, 8(1), 19. https://doi.org/10.30699/fhi.v8i1.199
Onitilo, A. A., Shour, A. R., Puthoff, D. S., Tanimu, Y., Joseph, A., & Sheehan, M. T. (2023). Evaluating the adoption of voice recognition technology for real-time dictation in rural healthcare: A retrospective analysis of Dragon Medical One. PLOS ONE, 18(3). https://doi.org/10.1371/journal.pone.0272545
Saxena, K., Diamond, R., Conant, R. F., Mitchell, T. H., Gallopyn, I. G., & Yakimow, K. E. (2018, May 18). Provider adoption of speech recognition and its impact on satisfaction, documentation quality, efficiency, and cost in an inpatient EHR. AMIA Joint Summits on Translational Science Proceedings. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5961784/
Tech Brief. (2019, September). Voice Control tech brief. Apple. https://www.apple.com/macos/big-sur/docs/Voice_Control_Tech_Brief_Sept_2019.pdf
Vukovic, M., Sethu, V., Parker, J., Cavedon, L., Lech, M., & Thangarajah, J. (2019). Estimating cognitive load from speech gathered in a complex real-life training exercise. International Journal of Human-Computer Studies, 124, 116–133. https://doi.org/10.1016/j.ijhcs.2018.12.003
Zhang, S. (2020). Development status, problems, and solutions of speech recognition technology. Journal of Physics: Conference Series, 1693(1), 012137. https://doi.org/10.1088/1742-6596/1693/1/012137
Zhu, X., & Aryadoust, V. (2022). A synthetic review of cognitive load in distance interpreting: Toward an explanatory model. Frontiers in Psychology, 13. https://doi.org/10.3389/fpsyg.2022.899718