Will Google ever be able to create an index of audio content that users can search like web pages?

Early test results, published by Google in a blog post, show that audio search is harder than it sounds.

Details on these tests are published in an article by Tim Olson, SVP for Digital Strategic Partnerships at KQED.

Google is working with KQED to improve audio discoverability.

With the help of KUNGFU.AI, an AI service provider, Google and KQED conducted tests to determine how audio could be transcribed quickly and accurately.

They discovered the following:

The difficulties of audio search

The biggest barrier to audio search capability is the fact that audio must be converted to text before it can be searched and sorted.
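To illustrate why transcription is the prerequisite: search only becomes possible once each audio file has a text transcript that can be indexed. A minimal sketch of that idea, using a standard inverted index over hypothetical transcripts (the file names and text below are invented examples, not real KQED data):

```python
# Minimal sketch: searching audio is really searching its transcripts.
# The transcripts below are hypothetical examples, not real KQED data.
from collections import defaultdict

transcripts = {
    "episode_1.mp3": "traffic delays reported on the peninsula this morning",
    "episode_2.mp3": "the city council approved new housing funds",
}

# Build an inverted index: word -> set of audio files whose transcript contains it.
index = defaultdict(set)
for audio_file, text in transcripts.items():
    for word in text.lower().split():
        index[word].add(audio_file)

def search(query):
    """Return audio files whose transcript contains every query word."""
    words = query.lower().split()
    results = [index.get(w, set()) for w in words]
    return set.intersection(*results) if results else set()

print(search("peninsula traffic"))  # {'episode_1.mp3'}
```

The index is only as good as the transcripts feeding it, which is exactly why transcription accuracy becomes the bottleneck.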



There is currently no automated way to transcribe audio accurately enough for it to be reliably indexed and retrieved.

The only way audio search could ever work at web scale is with automated transcription; manual transcription would cost publishers too much time and effort.

KQED's Olson notes that the bar for transcription accuracy must be high, especially when it comes to indexing news audio. Current speech-to-text technology does not yet meet that standard.

Limitations of the current speech-to-text technology

Google, KQED, and KUNGFU.AI ran tests applying the latest speech-to-text tools to a collection of KQED's news audio.

The tests revealed limitations in the AI's ability to identify proper names (also known as named entities).



Named entities often require context to be identified correctly, and the AI doesn't always have that context.

Olson gives an example of KQED's audio messages, which contain a language full of named entities that are contextual for the Bay Area region:

“KQED's local news audio is rich in named entity references that relate to subjects, people, locations, and organizations that are in the context of the Bay Area region. Speakers use acronyms such as ‘CHP’ for California Highway Patrol and ‘The Peninsula’ for the San Francisco to San Jose area. These are more difficult for artificial intelligence to identify.”

When a named entity is not recognized, the AI makes its best guess at what was said. For web search, that is unacceptable: an incorrect transcription can change the entire meaning of a passage.
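One way to see the problem: without regional context, a model has nothing to tell it that "CHP" in a Bay Area traffic story means the California Highway Patrol rather than some other expansion. A toy sketch of context-based disambiguation, using an invented acronym glossary and keyword sets (none of this is from the article or a real system):

```python
# Toy sketch of context-based entity disambiguation.
# The glossary and context keywords are invented for illustration.
GLOSSARY = {
    "chp": {
        "California Highway Patrol": {"traffic", "highway", "crash"},
        "Certified Health Physicist": {"radiation", "safety", "lab"},
    },
}

def expand_entity(acronym, transcript_words):
    """Pick the expansion whose context keywords best match the transcript."""
    candidates = GLOSSARY.get(acronym.lower(), {})
    best, best_score = acronym, 0
    for expansion, keywords in candidates.items():
        score = len(keywords & set(transcript_words))
        if score > best_score:
            best, best_score = expansion, score
    return best  # falls back to the raw acronym if nothing matches

words = "chp responded to a crash on the highway".split()
print(expand_entity("CHP", words))  # California Highway Patrol
```

Real speech-to-text systems face a harder version of this: the context must be learned rather than hand-listed, which is precisely where the tests found the models falling short.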

What's next?

Work on audio search continues, with the goal of making the technology widely available as it matures.

David Stoller, Partner Lead for News & Publishing at Google, says the technology will be openly shared when work on this project is complete.

“One of the pillars of the Google News Initiative is to develop new approaches to difficult problems. Once complete, this technology and associated best practices will be openly shared, which greatly expands the expected impact.”

Today's machine learning models don't learn from their mistakes, says KQED's Olson. This is where people may need to intervene.

The next step is to test a feedback loop where newsrooms can help improve machine learning models by identifying common transcription errors.
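The article doesn't describe how such a feedback loop would be implemented. One plausible sketch, with invented example corrections, is a newsroom-maintained table of common mis-transcriptions applied as a post-processing pass over raw transcripts:

```python
# Sketch of a human-in-the-loop correction pass over raw transcripts.
# The corrections table stands in for errors flagged by newsroom staff;
# the entries here are invented examples.
corrections = {
    "sea h p": "CHP",
    "the peninsula": "the Peninsula",
}

def apply_corrections(raw_transcript, corrections):
    """Replace known transcription errors identified by editors."""
    fixed = raw_transcript
    for wrong, right in corrections.items():
        fixed = fixed.replace(wrong, right)
    return fixed

raw = "sea h p closed lanes on the peninsula"
print(apply_corrections(raw, corrections))  # CHP closed lanes on the Peninsula
```

In a full loop, the same flagged errors would also be fed back as training signal so the models stop making them, rather than being patched after the fact.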



"We are confident that improvements to these speech-to-text models in the near future will help convert audio to text more quickly, which will ultimately help people find news audio more effectively."

Source: Google

