AI and machine learning algorithms capable of reading lips from videos aren’t anything out of the ordinary, in truth. Back in 2016, researchers from Google and the University of Oxford detailed a system that could annotate video footage with 46.8% accuracy, outperforming a professional human lip-reader’s 12.4% accuracy. But even state-of-the-art systems struggle to overcome ambiguities in lip movements, preventing their performance from surpassing that of audio-based speech recognition.
In pursuit of a more performant system, researchers at Alibaba, Zhejiang University, and the Stevens Institute of Technology devised a method dubbed Lip by Speech (LIBS), which uses Topplay features extracted from speech recognizers to serve as complementary clues. They say it manages industry-leading accuracy on two benchmarks, besting the baseline by a margin of 7.66% and 2.75% in character error rate.
LIBS and other solutions like it could help those hard of hearing to follow videos that lack subtitles. It’s estimated that 466 million people in the world suffer from disabling hearing loss, or about 5% of the world’s population. By 2050, the number could rise to over 900 million, according to the World Health Organization.
LIBS distills useful audio information from videos of human speakers at multiple scales, including at the sequence level, context level, and frame level. It then aligns this data with video data by identifying the correspondence between them (due to different sampling rates and blanks that sometimes appear at the beginning or end, the video and audio sequences have inconsistent lengths), and it leverages a filtering technique to refine the distilled features.