X-JEPA: A Novel Self-Supervised Framework for Cross-Modal Remote Sensing Retrieval via Predictive Semantic Alignment
Published in Proceedings of the Winter Conference on Applications of Computer Vision (WACV) 2026, 2025
We propose X-JEPA, a predictive self-supervised joint-embedding architecture for cross-modal remote sensing image retrieval (RS-CMIR). Instead of reconstructing pixels or relying on contrastive pairs, X-JEPA learns by forecasting semantic embeddings across modalities, enforcing modality-invariant alignment through a geometry-aware Prediction Space Alignment (PSA) loss that preserves latent-space structure without requiring paired inputs. Evaluated on the large-scale BEN-14K (Sentinel-1/Sentinel-2) and fMoW (RGB/Sentinel) benchmarks, X-JEPA achieves up to an 11.0% F1 improvement in cross-modal retrieval and 9.8% in unimodal settings over MAE, SatMAE, CrossMAE, CSMAE-SESD, CROMA, SkySense, DeCUR, and REJEPA, while remaining comparatively lightweight and parameter-efficient.
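To make the predictive objective concrete, here is a minimal NumPy sketch of what a cross-modal embedding-prediction loss with a geometry-preserving term could look like. This is an illustrative assumption, not the paper's implementation: the function name `psa_loss`, the linear predictor, and the choice of a pairwise cosine-similarity term for the geometry-aware component are all hypothetical, inferred only from the abstract's high-level description.

```python
import numpy as np

def psa_loss(z_a: np.ndarray, z_b: np.ndarray, predictor) -> float:
    """Hypothetical sketch: forecast modality-B embeddings from modality-A
    embeddings, scoring prediction error plus a geometry-preservation term."""
    z_pred = predictor(z_a)                  # predicted B-space embeddings
    pred_err = np.mean((z_pred - z_b) ** 2)  # embedding prediction error

    # Geometry-aware term (assumption): match the pairwise
    # cosine-similarity structure of predicted and target latents.
    def sim(z):
        z = z / np.linalg.norm(z, axis=1, keepdims=True)
        return z @ z.T

    geo_err = np.mean((sim(z_pred) - sim(z_b)) ** 2)
    return float(pred_err + geo_err)

# Toy batch: 8 scenes, 16-dim embeddings per modality (random stand-ins
# for Sentinel-1 and Sentinel-2 encoder outputs).
rng = np.random.default_rng(0)
z_s1 = rng.standard_normal((8, 16))
z_s2 = rng.standard_normal((8, 16))
W = rng.standard_normal((16, 16)) * 0.1      # toy linear predictor weights
loss = psa_loss(z_s1, z_s2, predictor=lambda z: z @ W)
print(loss)
```

Note that no paired pixel reconstruction or negative sampling appears here: the objective operates entirely in latent space, which is consistent with the abstract's claim of avoiding both pixel reconstruction and contrastive pairs.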
