Mixed-View Panorama Synthesis using Geospatially Guided Diffusion

1Washington University in St. Louis, 2University of Nebraska Omaha, 3DZYNE Technologies
Transactions on Machine Learning Research (TMLR), 2025
Mixed-view panorama synthesis framework

We propose a new task, mixed-view panorama synthesis, in which a satellite image and a set of nearby panoramas (blue, yellow, and green) are used to render a panorama at a novel location (red). Our approach uses diffusion-based modeling and attention to enable flexible, multimodal control.

Abstract

We introduce the task of mixed-view panorama synthesis, where the goal is to synthesize a novel panorama given a small set of input panoramas and a satellite image of the area. This contrasts with previous work, which uses either only input panoramas (same-view synthesis) or only an input satellite image (cross-view synthesis). We argue that the mixed-view setting is the most natural way to support panorama synthesis for arbitrary locations worldwide. A critical challenge is that the spatial coverage of panoramas is uneven, with few panoramas available in many regions of the world. We introduce an approach that uses diffusion-based modeling and an attention-based architecture to extract information from all available input imagery. Experimental results demonstrate the effectiveness of our proposed method. In particular, our model can handle scenarios in which the available panoramas are sparse or far from the location of the panorama we are attempting to synthesize.

Method

Our mixed-view panorama synthesis approach consists of three main components: (1) A satellite image encoder that extracts global spatial features, (2) A panorama encoder that processes nearby ground-level views, and (3) A diffusion-based synthesis module that generates novel panoramas by combining both sources of information.
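The three components above can be sketched as a conditional diffusion pipeline: encode both modalities into conditioning tokens, then run a reverse-diffusion loop that attends to them. This is an illustrative NumPy sketch, not the paper's architecture; the encoder functions, dimensions, and the attention-based noise predictor are stand-in assumptions (the real model uses trained networks).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions; the paper's encoders are learned networks.
D = 16          # feature dimension
H, W = 8, 32    # target panorama feature-map size (equirectangular)

def encode_satellite(sat_img):
    """Stand-in for the satellite encoder: global spatial feature tokens."""
    return rng.standard_normal((H * W, D))

def encode_panoramas(panos):
    """Stand-in for the panorama encoder: one token set per nearby view."""
    return np.concatenate([rng.standard_normal((H * W, D)) for _ in panos])

def cross_attention(queries, keys_values):
    """Scaled dot-product attention pulling conditioning info into the target."""
    scores = queries @ keys_values.T / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ keys_values

def denoise_step(x_t, cond, t, betas):
    """One toy DDPM reverse step; eps_hat would come from the trained network."""
    eps_hat = cross_attention(x_t, cond)          # stand-in noise prediction
    alpha = 1.0 - betas[t]
    alpha_bar = np.prod(1.0 - betas[: t + 1])
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar) * eps_hat) / np.sqrt(alpha)
    return mean if t == 0 else mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

# Conditioning tokens from both modalities, concatenated for attention.
cond = np.concatenate([encode_satellite(None), encode_panoramas([None] * 3)])
betas = np.linspace(1e-4, 0.02, 10)
x = rng.standard_normal((H * W, D))               # start from pure noise
for t in reversed(range(len(betas))):
    x = denoise_step(x, cond, t, betas)
```

The key design point the sketch captures is that a single attention interface lets the denoiser draw on a variable number of nearby panoramas plus one satellite image, so the same model handles sparse and dense input coverage.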

Overall framework

Overview of our mixed-view panorama synthesis framework. The model takes as input a satellite image and a set of nearby panoramas, and generates a novel panorama at the target location using diffusion-based synthesis.

Local attention details

Visualization of local geospatial attention. The target location is represented by a green square in the satellite image. The nearby street-level panoramas (color-coded borders) are represented by same-colored circles in the satellite image.

Global attention mechanism

Visualization of global geospatial attention. The color-coded attention maps for two target locations are shown, corresponding to the same-colored dots in the satellite image. Darker colors represent more salient regions.
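The local and global attention visualized above differ mainly in the support of the attention weights over the satellite feature grid. A minimal sketch of that distinction, under assumed toy dimensions (the radius-masked "local" variant and unmasked "global" variant are illustrative, not the paper's exact layers):

```python
import numpy as np

# Toy satellite feature grid; coordinates are cell indices (assumed units).
G = 16                    # satellite grid is G x G cells
D = 8                     # feature dimension
rng = np.random.default_rng(1)
sat_feats = rng.standard_normal((G * G, D))
ys, xs = np.meshgrid(np.arange(G), np.arange(G), indexing="ij")
coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)

def geospatial_attention(query, target_xy, radius=None):
    """Attend over satellite cells from one target location's query.

    A finite radius gives the local variant (weights are zero outside the
    neighborhood of the target); radius=None gives the global variant.
    """
    scores = sat_feats @ query / np.sqrt(D)
    if radius is not None:
        dist = np.linalg.norm(coords - np.asarray(target_xy, float), axis=1)
        scores = np.where(dist <= radius, scores, -np.inf)
    w = np.exp(scores - scores.max())
    w /= w.sum()                                  # softmax attention map
    return w @ sat_feats, w

query = rng.standard_normal(D)
local_out, local_w = geospatial_attention(query, target_xy=(8, 8), radius=3.0)
global_out, global_w = geospatial_attention(query, target_xy=(8, 8))
```

Here `local_w` is exactly the kind of map the local-attention figure shows (nonzero only near the target cell), while `global_w` spreads over the whole satellite image with darker, more salient regions getting larger weights.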

Experimental Results

We evaluate our approach on diverse locations worldwide, demonstrating its ability to synthesize realistic panoramas even when input panoramas are sparse or distant from the target location. Our method significantly outperforms baseline approaches that use only satellite imagery or only ground-level panoramas.

Comparison with Baselines

Satellite Input

Pix2Pix

PanoGAN

Sat2Density

Ours

Ground Truth

Comparison with baseline methods. The cross-view synthesis methods we compare against are trained on our collected center-aligned satellite images. Our approach, which integrates nearby street-level panoramas, not only generates more realistic results than the baselines but also produces results that are more semantically and geometrically accurate with respect to the ground truth.

Ablation Study

Ablation study results

Ablation study showing the effectiveness of different components in our mixed-view panorama synthesis approach. Our method effectively combines information from both satellite imagery and ground-level panoramas to generate realistic novel views.

BibTeX

@article{xiong2024mixed,
  title={Mixed-View Panorama Synthesis using Geospatially Guided Diffusion},
  author={Xiong, Zhexiao and Xing, Xin and Workman, Scott and Khanal, Subash and Jacobs, Nathan},
  journal={arXiv preprint arXiv:2407.09672},
  year={2024}
}