Danish Kapoor

Google DeepMind creates sound effects and dialogue for videos with its new technology

Google’s artificial intelligence laboratory DeepMind announced that it has developed a new technology that can generate sound effects and dialogue for videos. The system creates soundtracks that match visual scenes using a video’s raw pixels and optional text prompts. The DeepMind team calls the project “video-to-audio” (V2A) technology, and it can be used in conjunction with video generation tools such as Google Veo and OpenAI’s Sora.

Google DeepMind’s video-to-audio technology

In a blog post, the DeepMind team gave detailed information about how the technology works. The system analyzes a video’s raw pixels and combines this visual data with text prompts to generate sound effects that match what is happening on screen. The approach can be applied to different kinds of footage, from conventional sound video to silent film.

The system was trained on video and audio paired with AI-generated annotations containing detailed descriptions of sounds and transcripts of spoken dialogue. In this way, the technology learned to associate visual scenes with specific sounds. This sets V2A apart from existing video-to-audio solutions: the system can understand raw pixels, and adding a text prompt is optional.

Although a text prompt is optional, users can supply one to further shape the final output and produce more realistic, accurate sound effects. Positive prompts steer the model toward desired sounds, while negative prompts steer it away from unwanted ones. For example, a prompt such as “cinematic, thriller, horror movie, music, tension, footsteps on concrete” guides the system to generate audio that fits that description.
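V2A has no public API, so as a purely illustrative sketch, the positive/negative prompting described above might be expressed as a small prompt-building step like the following. Every name here (`build_prompt`, the `positive`/`negative` fields) is an invented assumption, not DeepMind's actual interface.

```python
# Hypothetical sketch only: V2A is not publicly available, so this
# merely illustrates combining desired (positive) and unwanted
# (negative) sound descriptions into one prompt specification.

def build_prompt(positive, negative=None):
    """Join positive and optional negative sound descriptors into a prompt spec."""
    spec = {"positive": ", ".join(positive)}
    if negative:
        spec["negative"] = ", ".join(negative)
    return spec

# The article's example prompt, steering toward tense horror audio
# while (hypothetically) steering away from spoken dialogue.
spec = build_prompt(
    positive=["cinematic", "thriller", "horror movie", "music",
              "tension", "footsteps on concrete"],
    negative=["dialogue"],
)
print(spec["positive"])
# → cinematic, thriller, horror movie, music, tension, footsteps on concrete
```

A real system would pass such a specification alongside the video's pixels; here the dictionary simply makes the positive/negative split from the article concrete.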

The researchers acknowledge the current limitations of V2A technology and say they are working to address them. For example, distortion in the source video can degrade the quality of the generated audio, and lip syncing for generated dialogue still needs improvement. The DeepMind team also promises that the technology will undergo rigorous safety assessments and testing before being released.

All in all, this new technology from DeepMind could be a significant step forward in creating sound effects and dialogue for videos. It could make video production more efficient and creative, while also breathing new life into silent films and other archival footage. This work once again demonstrates the potential of artificial intelligence in the media and entertainment industry.
