
GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning

Mustansar Fiaz1, Hiyam Debary1, Paolo Fraccaro1, Danda Paudel2, Luc Van Gool2,3, Fahad Shahbaz Khan4,5, Salman Khan4,6

1IBM Research, 2INSAIT, 3ETH Zürich, 4Mohamed bin Zayed University of Artificial Intelligence, 5Linköping University, 6Australian National University

News

[Oct-07-2025]: 📂 The GeoVLM-R1 dataset will be released on HuggingFace.
[Oct-07-2025]: 🚀 GeoVLM-R1 training and fine-tuning code is coming soon on GitHub.
[Oct-07-2025]: 🔥 Our model checkpoints will be released on HuggingFace.
[Sep-30-2025]: 📜 The technical report of GeoVLM-R1 is released on arXiv.
[Sep-30-2025]: 🔥 The GeoVLM-R1 project page is live.

🏆 Contributions

  1. GeoVLM-R1: A specialized VLM for high-resolution remote sensing image reasoning. We propose GeoVLM-R1, a reinforcement learning framework that strengthens a VLM's reasoning capabilities and is designed with flexibility, scalability, and ease of experimentation in mind for diverse EO tasks.

  2. Reward Mechanism. We design a reward mechanism that enables effective RL in EO reasoning contexts. To produce structurally coherent and semantically accurate reasoning outputs, we introduce format and task-aware accuracy rewards that better guide reasoning optimization.

  3. Comprehensive evaluation benchmark for RS VLMs. Experiments on 28 downstream benchmarks spanning multiple challenging EO tasks show that GeoVLM-R1 consistently outperforms existing VLMs, demonstrating its merits.

GeoVLM-R1: RL Training Paradigm

Illustration of the overall proposed training paradigm for GeoVLM-R1. The model is first initialized via supervised fine-tuning on diverse earth observation tasks. It is then successively optimized with GRPO-based reinforcement learning (RL) for each task. GeoVLM-R1 processes queries and outputs a structured response comprising an interpretable reasoning trace (<think> ... </think>) and a final prediction (<answer> ... </answer>).
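To make the output contract concrete, here is a minimal parsing sketch for the <think>/<answer> format described above. This is not the released GeoVLM-R1 code; the helper name `parse_response` is ours.

```python
import re

# Minimal sketch (not the released implementation) for extracting the
# reasoning trace and final prediction from the structured output:
# <think> ... </think> followed by <answer> ... </answer>.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def parse_response(text):
    """Return (reasoning_trace, final_answer), with None for a missing tag."""
    think = THINK_RE.search(text)
    answer = ANSWER_RE.search(text)
    return (think.group(1).strip() if think else None,
            answer.group(1).strip() if answer else None)

demo = "<think>Runways and parked aircraft are visible.</think><answer>airport</answer>"
print(parse_response(demo))  # ('Runways and parked aircraft are visible.', 'airport')
```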

GeoVLM-R1: RL Policy Update Mechanism

Overall pipeline of the GeoVLM-R1 policy update mechanism (left). During fine-tuning, the GRPO module generates multiple candidate responses. Each response is evaluated and assigned a distinct reward by our reward mechanism, which comprises (i) a format reward to enforce structural compliance and (ii) a task-aware accuracy reward to enforce prediction correctness. We also present examples showing how GeoVLM-R1's task-aware accuracy reward functions lead to better performance (right).
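As an illustration of how such rewards could be scored, the sketch below combines a format reward with a simple task-aware accuracy term and computes GRPO-style group-relative advantages. The specific reward definitions, weights, and the token-overlap stand-in for captioning metrics are assumptions for exposition, not the paper's exact formulas.

```python
import re
import statistics

# Illustrative sketch only: simple stand-ins for the format and
# task-aware accuracy rewards described in the figure caption above.

def format_reward(text):
    """1.0 if the response matches <think>...</think><answer>...</answer>."""
    ok = re.fullmatch(r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*",
                      text, re.DOTALL)
    return 1.0 if ok else 0.0

def accuracy_reward(pred, target, task):
    """Task-aware accuracy term: exact match for classification, a crude
    token-overlap F1 as a stand-in for captioning-style metrics."""
    if task == "classification":
        return 1.0 if pred.strip().lower() == target.strip().lower() else 0.0
    p, t = set(pred.lower().split()), set(target.lower().split())
    if not p or not t:
        return 0.0
    prec, rec = len(p & t) / len(p), len(p & t) / len(t)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def group_advantages(rewards):
    """GRPO-style advantages: normalize each candidate's reward by the
    mean and standard deviation of its sampled group."""
    mu, sigma = statistics.mean(rewards), statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# Score a group of sampled candidates for one classification query.
candidates = ["<think>Dense buildings.</think><answer>urban</answer>",
              "urban",  # malformed: no tags, so no format reward
              "<think>Fields.</think><answer>farmland</answer>"]
rewards = [format_reward(c) +
           accuracy_reward(c.split("<answer>")[-1].split("</answer>")[0],
                           "urban", "classification")
           for c in candidates]
print(group_advantages(rewards))  # well-formed, correct answer gets the largest advantage
```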

State-of-the-art Comparison across EO Tasks

Comparison of recent generic and specialized VLMs across diverse EO tasks. GeoVLM-R1 shows favorable improvements across classification, detection, and captioning tasks.

Image Classification Task

GeoVLM-R1 shows consistent improvements over existing VLMs on zero-shot (ZS), multi-label (BigEarthNet), and temporal classification datasets.
| Model | AID (ZS) | UCMerced (ZS) | WHU-19 (ZS) | BigEarthNet | xBD Set 1 (Temporal) | FMoW (Temporal) |
|---|---|---|---|---|---|---|
| GPT-4o | 74.73 | 88.76 | 91.14 | 49.00 | 67.95 | 21.43 |
| InternVL-8B | 60.40 | 58.23 | 79.30 | 19.73 | 51.44 | 21.04 |
| Qwen2.5-VL-3B | 58.27 | 60.86 | 78.21 | 24.75 | 51.44 | 34.36 |
| GeoChat | 72.03 | 84.43 | 80.09 | 20.35 | 53.32 | 59.20 |
| EarthDial | 88.76 | 92.42 | 96.21 | 73.03 | 96.37 | 70.03 |
| GeoVLM-R1 | 88.46 | 97.81 | 97.91 | 80.91 | 98.93 | 76.93 |

Referred Object Detection, Region-Captioning, Grounding Description Tasks

GeoVLM-R1 shows consistent performance gains across the referred object detection, region-captioning, and grounding description tasks.
Referred object detection (left block: GeoChat-Instruct; right block: NWPU VHR-10, zero-shot):

| Model | Small | Med. | Large | Single | Mult. | Small (ZS) | Med. (ZS) | Large (ZS) | Single (ZS) | Mult. (ZS) |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | - | - | - | - | - | - | - | - | - | - |
| InternVL2-4B | 6.30 | 24.37 | 37.38 | 24.96 | 11.72 | 7.10 | 12.68 | 25.48 | 22.96 | 8.10 |
| InternVL2-8B | 7.20 | 23.76 | 31.99 | 25.77 | 9.30 | 4.26 | 11.85 | 20.72 | 21.66 | 5.86 |
| GeoChat | 2.90 | 13.60 | 21.70 | 16.00 | 4.30 | 2.50 | 3.20 | 14.70 | 13.23 | 1.90 |
| EarthDial | 11.43 | 31.76 | 39.07 | 34.29 | 13.41 | 11.66 | 14.21 | 23.12 | 25.37 | 8.90 |
| GeoVLM-R1 | 36.02 | 54.72 | 55.03 | 57.10 | 35.04 | 34.44 | 48.76 | 64.91 | 55.97 | 41.45 |

Region captioning (each cell: Rouge-1 / Rouge-L / Meteor):

| Model | GeoChat-Instruct | NWPU VHR-10 (ZS) |
|---|---|---|
| GPT-4o | 9.41 / 7.60 / 8.02 | 17.68 / 11.81 / 9.63 |
| InternVL2-4B | - | - |
| InternVL2-8B | 10.58 / 9.06 / 8.50 | 11.88 / 9.63 / 7.70 |
| GeoChat | 72.77 / 72.74 / 61.90 | 62.02 / 62.02 / 53.31 |
| EarthDial | 73.38 / 73.34 / 62.72 | 72.14 / 72.14 / 60.01 |
| GeoVLM-R1 | 75.92 / 75.90 / 66.43 | 72.10 / 72.10 / 55.49 |

Grounding description (NWPU VHR-10, zero-shot):

| Model | Acc@0.5 | Acc@0.25 | Rouge-1 | Rouge-L | Meteor |
|---|---|---|---|---|---|
| GPT-4o | 0.70 | 6.10 | 14.72 | 10.82 | 9.41 |
| InternVL2-4B | 10.60 | 29.87 | 30.67 | 29.09 | 21.92 |
| InternVL2-8B | - | - | - | - | - |
| GeoChat | 2.20 | 15.27 | 21.46 | 20.74 | 21.38 |
| EarthDial | 17.07 | 41.00 | 27.05 | 26.35 | 23.12 |
| GeoVLM-R1 | 38.74 | 61.45 | 31.31 | 30.08 | 26.10 |

Change Detection (CD) and Image Captioning (IC) Tasks

Comparison of GeoVLM-R1 on change detection (CD) and image captioning (IC) datasets. The results indicate that our method generates better captions than existing VLMs on both temporal CD and image-captioning datasets. ZS denotes zero-shot evaluation.
Each cell reports Rouge-1 / Rouge-L / Meteor:

| Model | CD Dubai-CC | CD LEVIR-MCI | CD MUDS | CD SYSU (ZS) | IC NWPU-Captions | IC RSICD-Captions | IC RSITMD-Captions (ZS) |
|---|---|---|---|---|---|---|---|
| GPT-4o | 8.81 / 7.45 / 18.68 | 10.33 / 8.40 / 22.05 | 14.18 / 11.02 / 20.92 | 16.48 / 12.32 / 17.49 | 19.43 / 14.86 / 28.16 | 20.53 / 15.59 / 26.03 | 18.31 / 14.22 / 24.83 |
| InternVL2-4B | 7.31 / 6.38 / 21.12 | 8.88 / 7.43 / 22.14 | 10.25 / 7.90 / 17.73 | 13.27 / 9.98 / 14.36 | - | - | - |
| InternVL2-8B | - | - | - | - | 20.69 / 15.64 / 30.18 | 21.59 / 16.13 / 28.17 | 18.91 / 14.65 / 26.02 |
| Qwen2.5-VL-3B | 14.41 / 13.62 / 27.59 | 12.27 / 10.11 / 26.11 | 12.13 / 9.30 / 18.22 | 13.61 / 10.34 / 16.06 | 18.82 / 14.72 / 26.79 | 21.37 / 16.42 / 26.53 | 18.79 / 15.02 / 25.05 |
| GeoChat | 14.21 / 14.19 / 28.91 | 17.15 / 35.42 / 12.35 | 12.28 / 12.23 / 15.98 | 13.45 / 12.02 / 13.96 | 14.86 / 12.54 / 15.21 | 13.48 / 11.59 / 12.39 | 13.41 / 11.50 / 12.33 |
| EarthDial | 31.94 / 30.66 / 55.83 | 33.78 / 30.47 / 74.80 | 28.16 / 24.03 / 33.56 | 18.03 / 17.42 / 14.98 | 45.84 / 39.96 / 80.61 | 33.77 / 27.61 / 56.18 | 26.74 / 21.72 / 34.06 |
| GeoVLM-R1 | 36.60 / 34.15 / 61.22 | 37.85 / 34.02 / 73.56 | 34.07 / 27.65 / 45.94 | 19.64 / 18.46 / 15.45 | 46.94 / 40.96 / 82.00 | 34.64 / 28.63 / 56.54 | 30.62 / 25.39 / 39.07 |

Temporal Damage Assessment Tasks

Comparison of GeoVLM-R1 on the xBD dataset across eight diverse task settings, including temporal image captioning, region classification, image classification, object detection, and referred object detection. Our method exhibits substantial gains across the tasks; in particular, it shows notable improvements on object detection and referred object detection compared to other VLMs.
| Model | Image Captioning (Rouge-1 / Rouge-L / Meteor) | Region Cls. (Test Set-1) | Region Cls. (Test Set-2) | Image Cls. (Test Set-1) | Image Cls. (Test Set-2) | Image Cls. (Test Set-3) | OD mAP@0.5 | OD mAP@0.25 | Ref. OD mAP@0.5 | Ref. OD mAP@0.25 |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 14.21 / 10.35 / 19.52 | 51.68 | 71.62 | 67.95 | 75.45 | 70.41 | 0.20 | 2.15 | - | - |
| InternVL2-8B | 13.89 / 10.37 / 14.92 | 14.39 | 58.33 | 51.44 | 61.52 | 51.12 | 0.60 | 1.07 | - | 0.70 |
| Qwen2.5-VL-3B | 11.98 / 8.12 / 19.94 | 71.19 | 59.69 | 51.44 | 56.16 | 41.26 | - | - | - | - |
| GeoChat | 14.18 / 10.67 / 12.20 | 25.30 | 57.65 | 53.32 | 52.19 | 49.51 | 1.15 | 7.20 | 0.20 | 3.09 |
| EarthDial | 87.26 / 87.26 / 88.53 | 53.70 | 83.09 | 96.37 | 82.85 | 54.01 | 7.60 | 21.11 | 5.10 | 13.09 |
| GeoVLM-R1 | 92.26 / 92.26 / 93.37 | 81.36 | 83.55 | 98.93 | 86.39 | 68.60 | 38.15 | 48.13 | 24.52 | 34.52 |

Visual Question Answer Task

GeoVLM-R1 performs better than existing VLMs in the Comp and R/U categories of RSVQA-LRBEN (left) and obtains a better average score on RSVQA-HRBEN (right). Comp: Comparison, R/U: Rural/Urban.
RSVQA-LRBEN:

| Model | Presence | Comp | R/U | Avg. |
|---|---|---|---|---|
| MiniGPTv2 | 55.16 | 55.22 | 39.00 | 54.96 |
| Qwen2-VL | 38.57 | 67.59 | 61.00 | 55.35 |
| InternVL2-8B | 58.54 | 72.28 | 71.00 | 66.51 |
| Qwen2.5-VL-3B | 59.59 | 75.04 | 63.00 | 68.40 |
| GeoChat | 91.09 | 90.33 | 94.00 | 90.70 |
| LHRS-Bot | 88.51 | 90.00 | 89.07 | 89.19 |
| TeoChat | 91.70 | 92.70 | 94.00 | 92.29 |
| EarthDial | 92.58 | 92.75 | 94.00 | 92.70 |
| GeoVLM-R1 | 91.81 | 93.20 | 96.00 | 92.66 |

RSVQA-HRBEN (zero-shot):

| Model | Presence | Comp | Avg. |
|---|---|---|---|
| MiniGPTv2 | 40.79 | 50.91 | 46.46 |
| Qwen2-VL | 66.44 | 60.41 | 63.06 |
| InternVL2-8B | 67.35 | 76.91 | 72.70 |
| Qwen2.5-VL-3B | 59.89 | 72.26 | 66.81 |
| GeoChat | 58.45 | 83.19 | 72.30 |
| EarthGPT | 62.77 | 79.53 | 72.06 |
| TeoChat | 67.50 | 81.10 | 75.04 |
| EarthDial | 58.89 | 83.11 | 72.45 |
| GeoVLM-R1 | 66.38 | 82.26 | 75.27 |

Multi-temporal FMoW Task

Comparison on the multi-temporal FMoW dataset, where the model is fine-tuned and tested on TeoChat-Instruct.
| Model | FMoW-High-Res | FMoW-Low-Res |
|---|---|---|
| Video-LLaVA | 16.60 | 4.90 |
| Qwen2.5-VL-3B | 20.34 | 5.45 |
| GeoChat | 59.20 | 26.30 |
| TeoChat | 75.11 | 45.50 |
| GeoVLM-R1 | 78.53 | 53.00 |

BibTeX


        @article{fiaz2025geovlmr1,
          title={GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning},
          author={Mustansar Fiaz and Hiyam Debary and Paolo Fraccaro and Danda Paudel and Luc Van Gool and Fahad Shahbaz Khan and Salman Khan},
          journal={arXiv preprint arXiv:2509.25026},
          year={2025},
          url={https://arxiv.org/pdf/2509.25026}
        }

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.