Artificial Intelligence Extracts Sound from Photos
Led by Kevin Fu, Professor of Electrical and Computer Engineering at Northeastern University, researchers have pioneered an AI-driven tool capable of drawing sound from still images and muted videos.
There's a common notion that deactivating microphones or avoiding camera-based lip reading ensures spoken words and nearby sounds remain private. However, with AI's advancements, these methods are becoming less fail-safe. The newest technology can interpret sound frequencies from still images or muted videos by conducting a detailed visual analysis.
It might sound like something straight out of a sci-fi novel, but Kevin Fu asserts he's made it a reality. The spark of inspiration came when a movie critic commented that the idea of extracting sound from heated glass, as depicted in the TV series “Fringe,” was mere fictional pseudoscience. This notion piqued the professor's interest, especially since his lab often dives into endeavors many consider unachievable.
“Imagine someone is doing a TikTok video and they mute it and dub music. Have you ever been curious about what they’re really saying? Was somebody speaking behind them? You can actually pick up what is being spoken off camera,” says Kevin Fu.
To bring this concept to life, the research team introduced an innovative AI tool called Side Eye. It meticulously analyzes photographs, detecting almost invisible distortions of light. These distortions occur during conversations due to the optical stabilization technologies present in modern smartphones and cameras. The AI subsequently deciphers these minute changes, translating them into audible sound.
With an advanced global shutter, which exposes every pixel at the same instant, the extracted sound is often quiet and indistinct. With a standard rolling shutter, however, the sensor reads pixels row by row or column by column, so each line records the distortion at a slightly different moment. The effect builds up across the frame and greatly enhances the sound's clarity. If provided with a sequence of photos, the AI can reconstruct an entire conversation.
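To see why the shutter type matters, here is a minimal Python sketch of the sampling argument. It is not Side Eye's code: the function names, the 30 fps frame rate, and the 1080-row sensor are illustrative assumptions, and it assumes the row readout fills the whole frame period, which real sensors only approximate. A global shutter yields one sample of the vibration per frame, while a rolling shutter yields one per row.

```python
import math

def global_shutter_samples(signal, fps, duration):
    """One sample per frame: every pixel is exposed at the same instant."""
    n_frames = int(fps * duration)
    return [signal(i / fps) for i in range(n_frames)]

def rolling_shutter_samples(signal, fps, rows, duration):
    """One sample per row: row r of frame i is read at i/fps + r/(fps*rows).

    Assumption: readout is spread evenly over the full frame period,
    with no idle gap between frames.
    """
    n_frames = int(fps * duration)
    row_time = 1.0 / (fps * rows)
    return [
        signal(i / fps + r * row_time)
        for i in range(n_frames)
        for r in range(rows)
    ]

def effective_sample_rate(fps, rows=1):
    """Samples of the vibration captured per second of video."""
    return fps * rows

def tone(t):
    """A 440 Hz test tone standing in for speech-induced vibration."""
    return math.sin(2 * math.pi * 440 * t)

if __name__ == "__main__":
    fps, rows = 30, 1080  # hypothetical camera settings
    print("global shutter :", effective_sample_rate(fps), "samples/s")
    print("rolling shutter:", effective_sample_rate(fps, rows), "samples/s")
    print(len(global_shutter_samples(tone, fps, 1.0)), "vs",
          len(rolling_shutter_samples(tone, fps, rows, 1.0)),
          "samples in one second")
```

Under these assumptions the rolling shutter samples the vibration over a thousand times more often than the global shutter at the same frame rate, which is consistent with the clearer audio described above. Thirty samples per second is far too few to represent speech frequencies.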
The primary limitation of this novel technique is that it requires at least minimal lighting and a large volume of files for analysis, though the latter is a relatively minor constraint. The tool also has the potential to identify conversation participants if it has prior samples of their voices. However, its accuracy in this function is still at an early stage, especially for broader applications.
Kevin Fu envisions this technology reshaping the digital landscape positively, even if it poses notable challenges for cybersecurity professionals and opens up avenues for nefarious activities. For example, legal professionals, law enforcement agencies, or defense attorneys might use Side Eye in criminal cases where hard evidence is limited but there is an ample set of photographs or video recordings with which to verify an alibi.