Paper Detail

Paper ID: SMR-4.11
Paper Title: DEEP AUDIO-VISUAL FUSION NEURAL NETWORK FOR SALIENCY ESTIMATION
Authors: Shunyu Yao, Xiongkuo Min, Guangtao Zhai, Shanghai Jiao Tong University, China
Session: SMR-4: Image and Video Sensing, Modeling, and Representation
Location: Area F
Session Time: Wednesday, 22 September, 08:00 - 09:30
Presentation Time: Wednesday, 22 September, 08:00 - 09:30
Presentation: Poster
Topic: Image and Video Sensing, Modeling, and Representation: Image & video representation
IEEE Xplore Open Preview: View in IEEE Xplore
Abstract: In this work, we propose a deep audio-visual fusion model to estimate the saliency of videos. The model extracts visual and audio features with two separate branches and fuses them to generate the saliency map. We design a novel temporal attention module to utilize temporal information and a spatial feature pyramid module to fuse spatial information. A multi-scale audio-visual fusion method then integrates the two modalities. Furthermore, we propose a new dataset for audio-visual saliency estimation, consisting of 202 high-quality video sequences with a large range of motions, scenes, and object types; many of the videos have high audio-visual correspondence. Experiments on several datasets demonstrate that our model outperforms previous state-of-the-art methods by a large margin and that the proposed dataset can serve as a new benchmark for the audio-visual saliency estimation task.
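
To make the two-branch design concrete, the sketch below shows a minimal audio-visual saliency network of the general kind the abstract describes: a visual branch over a short clip of frames, an audio branch over a spectrogram, and a channel-wise fusion followed by a saliency decoder. This is an illustrative assumption written in PyTorch; the class name AudioVisualSaliencyNet, the layer sizes, the global-average audio descriptor, and the single-scale concatenation fusion are placeholders, not the authors' temporal attention, spatial feature pyramid, or multi-scale fusion modules.

# Minimal two-branch audio-visual saliency sketch (illustrative only; the
# module names, channel sizes, and fusion strategy are assumptions, not the
# paper's published architecture).
import torch
import torch.nn as nn

class AudioVisualSaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Visual branch: 3D convolutions over a short clip of RGB frames.
        self.visual = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Audio branch: 2D convolutions over a spectrogram, collapsed to a
        # single global audio descriptor.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Fusion + decoder: concatenate the broadcast audio descriptor with
        # the temporally pooled visual features, then predict a 1-channel map.
        self.decoder = nn.Sequential(
            nn.Conv2d(64 + 64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=1),
        )

    def forward(self, frames, spectrogram):
        # frames: (B, 3, T, H, W); spectrogram: (B, 1, F, T_audio)
        v = self.visual(frames).mean(dim=2)            # temporal pooling -> (B, 64, H, W)
        a = self.audio(spectrogram)                    # (B, 64, 1, 1)
        a = a.expand(-1, -1, v.size(2), v.size(3))     # broadcast over space
        fused = torch.cat([v, a], dim=1)               # channel-wise fusion
        return torch.sigmoid(self.decoder(fused))     # saliency map in [0, 1]

if __name__ == "__main__":
    model = AudioVisualSaliencyNet()
    frames = torch.randn(2, 3, 8, 64, 64)    # batch of 8-frame clips
    spec = torch.randn(2, 1, 64, 100)         # batch of spectrograms
    print(model(frames, spec).shape)          # torch.Size([2, 1, 64, 64])

A real system in this family would typically replace the temporal mean with an attention module, fuse at several feature scales rather than one, and train against ground-truth fixation maps with a saliency loss such as KL divergence.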