How AI Transcription Fails and What to Do About It

Published in

Becoming Human: Artificial Intelligence Magazine

7 min read · Feb 26, 2020

By Tat Banerjee

AI transcription has been all the rage recently. Several companies have been coming out with new products to meet this demand.

What does it mean? Does it work? How do you evaluate these products? We take a look and try to see behind the hype.

First we look at some of the incoming offerings. Then we look at some issues with AI transcription and translation.

Incoming Offerings

A very large number of AI products are hitting the marketplace. Here are a few examples of what is happening.

Google Translate to Add Real-Time Transcription Feature

Earlier this month SiliconANGLE posted an article about how Google is going to add a real-time transcription feature.

Here is the money quote: “‘your mobile phone effectively turns into a language translator for long-form speech,’ Sami Iqram…product manager for Google Translate told SiliconANGLE in an interview at a press event Wednesday in San Francisco.”

Importantly, also from the same article: “the feature sounds a bit similar to the Live Transcribe app on Android, which is aimed at the hearing-impaired and allows them to see a transcription of people’s speech on their smartphones. However, that app cannot translate speech into another language.”

Cisco Integrates Voicea Transcription AI into Voice Assistant

From the VoiceBot website, which you should totally check out for very interesting AI news, came this article.

Here is what Cisco says about their new product: “‘Voicea users have reported saving more than six hours per week per user with more actionable and efficient meetings — and we believe Webex users will experience similar results,’ Cisco senior vice president Sri Srinivasan said in a statement. ‘We’re excited to bring this and other cognitive features to the 300 million users we already serve with Cisco Collaboration. This technology will fundamentally change how we are able to deliver massively personalised experiences and transform the way we work.’”

This is an interesting offering in a couple of different ways. Cisco is marketing this as an AI Secretarial Service.

So the idea here is:

Use the AI to transcribe the content
Users can tag key moments, and there is some kind of mechanism to follow up
Using another mechanism, further conversation points can be followed up on

Not having seen or worked with the product, we have no idea what kind of traction it will get. But there are lots of interesting ideas here from Cisco.

Microsoft’s AI Automatically Comments on Video Clips

VentureBeat has another bit of news from earlier this month. The basic idea is, “generating live video captions with AI could bolster engagement on social media.”

Ok — that kind of makes sense. But it seems an awful lot like, let’s get AIs commenting on our content, and maybe someone will be fooled and start commenting on it too.

That may be a bit of a cynical viewpoint but that is what it sounds like to me.

But giving these folks the benefit of the doubt, the team from Microsoft Research Asia and Harbin Institute of Technology came up with a new model which, “iteratively learns to capture the representations among comments, video, and audio, and they say that in experiments, it outperforms state-of-the-art methods.”

In fairness, the code is available on GitHub, so it might be worth checking out.

We have not yet done this, so please be aware of that when reading our take on it. Why have we not done this? Because the last commit is from December 2018, so…

There is a pretty cool picture though. The below is just a screenshot from the same article.

**AI generated comments from MS Research**

But we should not be too cynical here, as the researchers make a case for why this is an interesting project: “‘[W]e believe the multimodal pre-training will be a promising direction to explore, where tasks like image captioning and video captioning will benefit from pre-trained models,’ wrote the researchers. ‘For future research, we will further investigate the multimodal interactions among vision, audio, and text in…real-world applications.’”

Cool, even if the use case is a bit of a mystery and the project seems to be dead on GitHub.

What Does It Mean? And Does It Work?

AI transcription is obviously a hot space for new products and research, and here at VideoTranslator our product is in the same space.

Also, in the world of AI, never say never, because technology is moving very fast. That being said, we feel there are two major issues with the majority of AI transcription products available today.

This is not to say AI transcription and translation do not work. They absolutely do, when used properly.

Issue 1: AI Transcription Fails Because of the Differences in Languages vs Dialects

The first issue is language vs dialects. People speak in different dialects. What are you going to do?

In the VideoTranslator app we have an explicit language-dialect mapping, which lets the user specify the dialect the content should be transcribed with. English, for example, comes with several AI dialect options.

Now this is not to say a more general AI (in terms of its training corpus) cannot get a really good transcript.

However, we find that using a dialect-specific AI results in higher accuracy. This is not surprising: an AI trained on one dialect has heard more of that dialect's accent and vocabulary than a general one has.
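To make the idea of an explicit language-dialect mapping concrete, here is a minimal sketch in Python. It is not VideoTranslator's actual implementation: the `DIALECTS` table, the dialect tags in it, and the `resolve_dialect` function are all hypothetical names made up for illustration.

```python
from typing import Optional

# Hypothetical language -> dialect table (illustrative only, not the
# VideoTranslator app's real list). Tags follow the common "language-REGION" style.
DIALECTS = {
    "en": ["en-AU", "en-GB", "en-IN", "en-US"],
    "es": ["es-ES", "es-MX"],
}


def resolve_dialect(language: str, requested: Optional[str] = None) -> str:
    """Return the dialect tag whose model should transcribe the content."""
    options = DIALECTS.get(language)
    if not options:
        raise ValueError(f"no dialects registered for language {language!r}")
    if requested is None:
        # No preference given: fall back to the first listed dialect.
        return options[0]
    if requested not in options:
        raise ValueError(f"{requested!r} is not a known dialect of {language!r}")
    return requested


# The user explicitly picks British English rather than a generic "English" model.
print(resolve_dialect("en", "en-GB"))
```

The point of the design is that the dialect is an explicit choice made by the user, not something the system guesses, so an unknown dialect is rejected instead of silently falling back to a general model.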

Issue 2: AI Transcription Fails As We Move Up Information Complexity

This is a bit hard to explain. AIs are trained on a corpus (data set). A transcription AI is trained with lots of sound files, and a translation AI with lots of text files.

This is the problem: high information complexity content is, by definition, not that common. Hence the more complex a subject is, the less likely it is that such content appears in the corpus used to train the AI.

Think about it this way: for a gossip column the AI will probably be pretty good; for medical information, less good; for quantum physics, yeah…nah.
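One rough way to picture this is to count how many words in a piece of content never appeared in the training corpus. The sketch below is a toy illustration only: the one-sentence "corpus", the two sample sentences, and the `tokenize` and `oov_rate` helpers are all made up for this example. Real transcription AIs are trained on audio and are far more sophisticated, so treat this as a text-only stand-in for the idea.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def oov_rate(text, vocabulary):
    """Fraction of tokens in `text` that are absent from `vocabulary`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    unseen = sum(1 for token in tokens if token not in vocabulary)
    return unseen / len(tokens)


# A made-up, tiny "training corpus" that looks like everyday gossip content.
training_corpus = "the star said she was happy with the new film and the new house"
vocabulary = set(tokenize(training_corpus))

gossip = "the star said she was happy with the film"
physics = "the hamiltonian was diagonalised with the eigenstates"

print(f"gossip:  {oov_rate(gossip, vocabulary):.2f}")
print(f"physics: {oov_rate(physics, vocabulary):.2f}")
```

In this toy setup every word of the gossip sentence is already in the corpus, while three of the seven words in the physics sentence are not. The less of your content the AI has seen before, the more it has to guess.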

But higher information complexity content is precisely what is valuable. In fact, generally speaking, the higher the complexity, the more valuable the information.

This is why your doctor can charge you lots of money, or your lawyer, or accountant. Because they have higher information complexity knowledge in their heads. You are paying for the time they spent learning the subject matter.

This is why many of these AIs don’t work so well.

At VideoTranslator, we explicitly sell our product as an efficiency win, not a cost saving. You really want a human subject matter expert (i.e. good old human judgement!) to finesse your transcript or translation.

AI Transcription is very capable, and with the right use-cases, can lead to really big productivity wins. This means you can increase the quality of the transcription or translation depending upon what you are trying to do. We hope this post shows you some of the things to think about when choosing your preferred AI toolkit.

Remember: Use the VideoTranslator AIs for the heavy lifting and your own staff to do the high value tasks. Practically this means your staff can think more about engagement and converting clients, as opposed to manually doing the first pass transcription or translation.

Curious about how we could help your business? Check out our managed service, or try our app for free! Alternatively, drop us an email at hello@videotranslator.ai.
