New Algorithm Adds Sound to Silent Video By Observing the Tiny Vibrations of Objects (Video)

Researchers at MIT, working with Microsoft and Adobe, have created an algorithm with an impressive function: recovering sound from silent video.

The algorithm analyzes the tiny vibrations of objects in a video and works out what sound must have caused them.

These vibrations are almost imperceptible, as small as one thousandth of a pixel in some cases. Abe Davis, a graduate student in electrical engineering and computer science at MIT, is the first author of a new paper describing the algorithm. He explains:

“When sound hits an object, it causes the object to vibrate. The motion of this vibration creates a very subtle visual signal that’s usually invisible to the naked eye. People didn’t realize that this information was there.”
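To make the idea of reading a signal out of sub-pixel motion concrete, here is a toy sketch in Python. It is not the researchers' method: it assumes a one-dimensional synthetic "image" (a smooth blob of brightness) and uses a simple gradient-based shift estimate, and every name and number in it (`make_frame`, `estimate_shift`, the 440 Hz tone, the 0.001-pixel amplitude) is hypothetical and chosen only for illustration.

```python
import math

def make_frame(shift, width=64, sigma=4.0):
    """Return a 1-D brightness profile (a smooth blob) displaced by `shift` pixels."""
    center = width / 2 + shift
    return [math.exp(-((x - center) ** 2) / (2 * sigma ** 2)) for x in range(width)]

def estimate_shift(reference, frame):
    """Gradient-based estimate of a small displacement, in pixels.

    For a tiny shift d, frame(x) is roughly reference(x) - d * slope(x),
    so d can be found by a least-squares fit over all pixels.
    """
    num = den = 0.0
    for x in range(1, len(reference) - 1):
        grad = (reference[x + 1] - reference[x - 1]) / 2.0  # local slope
        diff = frame[x] - reference[x]                      # brightness change
        num += diff * grad
        den += grad * grad
    return -num / den

# Assumed values for the demonstration only.
FRAME_RATE = 2000      # frames per second
TONE_HZ = 440          # frequency of the simulated sound
AMPLITUDE_PX = 0.001   # the object moves by one thousandth of a pixel

reference = make_frame(0.0)
true_shifts = [AMPLITUDE_PX * math.sin(2 * math.pi * TONE_HZ * n / FRAME_RATE)
               for n in range(200)]
recovered = [estimate_shift(reference, make_frame(s)) for s in true_shifts]

worst_error = max(abs(r - t) for r, t in zip(recovered, true_shifts))
print(f"largest error: {worst_error:.2e} pixels")
```

The list `recovered` is, in effect, a tiny audio waveform: one displacement value per video frame. Real footage is noisy and two-dimensional, so a practical system needs far more care than this sketch suggests.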

Check out the video below to learn more and to see the algorithm in action:

In their experiments, the MIT researchers used ultra high-speed cameras, shooting video at 2,000 to 6,000 frames per second (movie film is typically shot at 24 fps).

This high-speed videography was necessary because, to capture the minute details of each vibration, the camera's frame rate (i.e., how many pictures it takes per second) had to be higher than the frequency of the vibrations themselves; strictly, the sampling theorem requires more than twice the highest frequency to be recovered.
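A short Python sketch shows why the frame rate matters. It relies on the standard sampling-theorem result that a pure tone sampled too slowly "folds" down to a lower, false frequency (aliasing). The function names and the example tones are assumptions made for illustration and do not come from the MIT paper.

```python
def nyquist_limit(frame_rate):
    """Highest frequency that can be captured without aliasing."""
    return frame_rate / 2

def apparent_frequency(tone_hz, frame_rate):
    """Frequency at which a pure tone appears after sampling at frame_rate."""
    folded = tone_hz % frame_rate
    return min(folded, frame_rate - folded)

# A 440 Hz tone (the A above middle C) seen at different frame rates.
print(apparent_frequency(440, 2000))  # 440: captured correctly
print(apparent_frequency(440, 60))    # 20: aliased to a false low tone
print(apparent_frequency(440, 24))    # 8: aliased even further

print(nyquist_limit(2000))  # 1000.0
print(nyquist_limit(6000))  # 3000.0
```

Under these assumptions, a 2,000 fps camera can faithfully capture vibrations up to 1,000 Hz and a 6,000 fps camera up to 3,000 Hz, while an ordinary 24 or 60 fps camera misrepresents almost everything in the audible range.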

In their testing, the scientists used the nursery rhyme Mary Had a Little Lamb, a nod to Thomas Edison, who recited it into his newly invented phonograph back in 1877.

Thomas Edison with his phonograph (Courtesy of National Park Service)

While the new algorithm is certainly impressive from a tech lover's standpoint, it also raises questions about privacy: many people worry that technology like this will inevitably be used to eavesdrop on private conversations.

But Davis has other ideas for the new technology, describing it as a “new kind of imaging” that could reveal details in images that we never even knew were there:

“We’re recovering sounds from objects. That gives us a lot of information about the sound that’s going on around the object, but it also gives us a lot of information about the object itself, because different objects are going to respond to sound in different ways.”

 

The algorithm is so impressive that it was even able to capture audio by recording vibrations on a chip bag through a thick glass door (via YouTube)

Alexei Efros is an associate professor of electrical engineering and computer science at the University of California at Berkeley. He had this to say about the recent innovation:

“This is new and refreshing. It’s the kind of stuff that no other group would do right now. We’re scientists, and sometimes we watch these movies, like James Bond, and we think, ‘This is Hollywood theatrics. It’s not possible to do that. This is ridiculous.’ And suddenly, there you have it. This is totally out of some Hollywood thriller. You know that the killer has admitted his guilt because there’s surveillance footage of his potato chip bag vibrating.”

Efros agreed with Davis that the algorithm will prove valuable for characterizing the material properties of different objects, but added:

“I’m sure there will be applications that nobody will expect. I think the hallmark of good science is when you do something just because it’s cool and then somebody turns around and uses it for something you never imagined. It’s really nice to have this type of creative stuff.”

Read the original press release from MIT.

BONUS: Apparently you don’t need a high-speed camera to pull this off yourself. The Skeptic’s Guide explains how the rolling shutter found in many camera sensors makes this feat possible on smartphones whose cameras record at only 60 fps:

“When you take a selfie, the resulting image (that no one wants to see btw) is built up over a brief period of time as a series of lines, one on top another. Each line is taken at a slightly different slice of time. This, in essence, is encoding information at a much higher frequency than 60 fps. Fast enough to pick up the tiny vibrations needed to reconstruct the audio like I’ve been discussing. The quality isn’t as good as the high speed cameras but it can do the job.”
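The arithmetic behind that claim can be sketched in a few lines of Python. The sensor below is hypothetical: 1,080 rows per frame at 60 fps, with a made-up `readout_fraction` parameter describing how much of each frame period is spent reading rows. Real phone sensors differ, and the figures are illustrative only.

```python
def row_sample_rate(frame_rate, rows_per_frame, readout_fraction=1.0):
    """Rows read per second while the sensor is scanning a frame.

    readout_fraction is the share of each frame period spent reading rows;
    1.0 means rows are spread evenly over the whole period (an idealization).
    """
    readout_time = readout_fraction / frame_rate  # seconds spent scanning one frame
    return rows_per_frame / readout_time

# Hypothetical sensor: 1,080 rows per frame at 60 frames per second.
ideal = row_sample_rate(frame_rate=60, rows_per_frame=1080)
bursty = row_sample_rate(frame_rate=60, rows_per_frame=1080, readout_fraction=0.5)

print(f"ideal: about {ideal:,.0f} row samples per second")
print(f"half-period readout: about {bursty:,.0f} per second, but blind half the time")
```

In the idealized case each row acts as its own time sample, giving tens of thousands of samples per second from a 60 fps camera. If the sensor reads its rows in a burst and then idles until the next frame, there are gaps in the recording, which is one plausible reason the quality falls short of a true high-speed camera.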
