Sound to Visual Scene Generation
by Audio-to-Visual Latent Alignment

1POSTECH,  2KAIST,  3University of Michigan

Sound2Scene synthesizes images of natural scenes from input sound.


How does audio describe the world around us? In this paper, we explore the task of generating an image of the visual scene that a sound comes from. This task has inherent challenges: there is a significant modality gap between audio and visual signals, and audio carries no explicit visual information. We propose Sound2Scene, a model that works by scheduling the learning procedure of each model component to associate the audio and visual modalities despite their information gap. The key idea is to enrich the audio features with visual information by learning to align audio to the visual latent space. We thereby translate the input audio into a visual feature, which a powerful pre-trained generator then turns into an image. We further incorporate a highly correlated audio-visual pair selection method to stabilize training. As a result, our method demonstrates substantially better quality across a large number of categories on the VEGAS and VGGSound datasets, compared to prior work on sound-to-image generation. In addition, we show the spontaneously learned output controllability of our method by applying simple manipulations to the input in the waveform space or the latent space.



We propose Sound2Scene and its training procedure for generating images from the input sound. First, given an image encoder pre-trained in a self-supervised way, we train a conditional generative adversarial network to generate images from visual embedding vectors of the image encoder. We then train an audio encoder to translate an input sound to its corresponding visual embedding vector, by aligning the audio to the visual space. Afterwards, we can generate diverse images from sound by translating from audio to visual embeddings and synthesizing an image. Since Sound2Scene must be capable of learning from challenging in-the-wild videos, we use sound source localization to select moments in time that have strong cross-modal associations.
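The audio-to-visual alignment step can be illustrated with a small sketch. The description above does not state which loss is used, so the contrastive (InfoNCE-style) objective below is an assumption, as are the function names, the temperature value, and the use of NumPy arrays in place of real encoder outputs. Each audio embedding is pulled toward the visual embedding of its own video frame and pushed away from the other frames in the batch.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce_alignment_loss(audio_emb, visual_emb, temperature=0.1):
    """Hypothetical alignment loss (an assumption, not the confirmed objective).

    audio_emb, visual_emb: arrays of shape (batch, dim); row i of each
    comes from the same video clip and forms the positive pair.
    """
    a = l2_normalize(audio_emb)
    v = l2_normalize(visual_emb)
    logits = a @ v.T / temperature                       # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair for audio i is visual i, i.e. the diagonal.
    return float(-np.mean(np.diag(log_prob)))
```

In this sketch the visual embeddings would come from the frozen image encoder, so only the audio encoder would receive gradients in a real training loop.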

Waveform Manipulation for Image Generation

Single Waveform

Sound2Scene generates diverse images in a wide variety of categories from generic input sounds.


Church Bell Ringing




Cow Lowing


Owl Hooting


Lawn Mowing


Tractor Digging


Volcano Explosion


Fire Truck


Scuba Diving


Stream Burbling


Snake Hissing



Volume Changes

Just as humans can roughly estimate the distance or size of an object from how loud it sounds, Sound2Scene captures the relation between the volume of the audio and the resulting visual changes.
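A volume change of this kind is a plain gain applied to the waveform before it is fed to the audio encoder. The sketch below is illustrative only: the function name, the decibel parameterization, and the assumption that samples lie in [-1, 1] are not taken from the project.

```python
import numpy as np

def scale_volume(waveform, gain_db):
    """Apply a gain in decibels to a mono waveform with samples in [-1, 1].

    Positive gain_db makes the sound louder, negative makes it quieter.
    Samples are clipped to the valid range after scaling.
    """
    gain = 10.0 ** (gain_db / 20.0)  # convert dB to a linear amplitude factor
    return np.clip(waveform * gain, -1.0, 1.0)
```

For example, `scale_volume(w, -20.0)` reduces the amplitude of `w` to one tenth, and the scaled waveform can then be passed through the same audio-to-image pipeline as the original.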

⚠ Please pause the video and adjust your volume before playing. Some of the sounds are loud.



Mixing Waveforms

Sound2Scene can capture the existence of multiple sound sources and reflect them in the generated images.
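Mixing two sources in the waveform space can be as simple as a weighted sum. The following is a minimal sketch under assumed conventions (mono waveforms at the same sample rate, samples in [-1, 1]); the function and parameter names are hypothetical.

```python
import numpy as np

def mix_waveforms(wave_a, wave_b, weight_a=0.5):
    """Mix two mono waveforms by a weighted sum.

    The longer waveform is truncated to the length of the shorter one.
    If the mixture exceeds the [-1, 1] range, it is rescaled by its peak.
    """
    n = min(len(wave_a), len(wave_b))
    mixed = weight_a * wave_a[:n] + (1.0 - weight_a) * wave_b[:n]
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed
```

A mixture such as "Dog+Water Flowing" would correspond to `mix_waveforms(dog, water)`, with the result used as the single input sound.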


Dog+Water Flowing


Baby+Water Flowing









Mixing Waveforms and Changing the Volume

Sound2Scene can mimic camera movement by placing the object farther away as the wind sound gets louder.
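One way to produce such a sequence of inputs is to keep the object sound fixed and sweep the gain of the wind sound. The sketch below assumes mono waveforms in [-1, 1] and a gain schedule in decibels; all names are hypothetical and the schedule itself is not taken from the project.

```python
import numpy as np

def rising_wind_mixtures(object_wave, wind_wave, wind_gains_db):
    """Return one mixture per gain value, with the wind sound scaled by that gain.

    Each mixture is rescaled by its peak only if it would otherwise clip,
    which preserves the wind-to-object loudness ratio.
    """
    n = min(len(object_wave), len(wind_wave))
    mixtures = []
    for gain_db in wind_gains_db:
        gain = 10.0 ** (gain_db / 20.0)  # dB to linear amplitude
        mixed = object_wave[:n] + gain * wind_wave[:n]
        peak = np.max(np.abs(mixed))
        if peak > 1.0:
            mixed = mixed / peak
        mixtures.append(mixed)
    return mixtures
```

Feeding the mixtures to the model in order would then give one generated image per step of the sweep.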


Latent Manipulation for Image Generation

Image and Audio Conditioned Image Generation

With a simple latent interpolation between audio and visual features, Sound2Scene can generate novel images by conditioning on both audio and visual signals.
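Because the audio feature is aligned to the visual latent space, the two features can be blended directly. The sketch below assumes plain linear interpolation between two vectors of the same dimension; the function name and the mixing coefficient `alpha` are illustrative, and the resulting vector would be passed to the generator in place of a single-modality embedding.

```python
import numpy as np

def interpolate_latents(z_audio, z_image, alpha):
    """Linearly blend an audio-derived latent with an image-derived latent.

    alpha = 0 returns the audio latent, alpha = 1 returns the image latent.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return (1.0 - alpha) * np.asarray(z_audio) + alpha * np.asarray(z_image)
```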



Generated Images



Image Editing with Paired Sound

By moving the visual feature along the direction given by the difference between two audio features of different volume, Sound2Scene can edit the original image with its paired sound.
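This edit amounts to vector arithmetic in the shared latent space. The sketch below is a minimal illustration, assuming that the two audio features come from the same sound at a lower and a higher volume; the function name and the `strength` parameter are hypothetical.

```python
import numpy as np

def edit_with_volume_direction(z_image, z_audio_quiet, z_audio_loud, strength=1.0):
    """Shift an image latent along the quiet-to-loud audio direction.

    A positive strength moves toward the "volume increase" direction,
    a negative strength toward the "volume decrease" direction.
    """
    direction = np.asarray(z_audio_loud) - np.asarray(z_audio_quiet)
    return np.asarray(z_image) + strength * direction
```

The edited latent would then be decoded by the generator to obtain the edited image.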


Volume Decrease Direction

Volume Increase Direction


  author    = {Sung-Bin, Kim and Senocak, Arda and Ha, Hyunwoo and Owens, Andrew and Oh, Tae-Hyun},
  title     = {Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2023}


This work was supported by an IITP grant funded by the Korea government (MSIT) (No.2021-0-02068, Artificial Intelligence Innovation Hub; No.2022-0-00124, Development of Artificial Intelligence Technology for Self-Improving Competency-Aware Learning Capabilities). GPU resources were supported by the HPC Support Project, MSIT and NIPA.