FLAME: One-Step Active Learning for Lightning-Fast Remote Sensing Specialization
Why open-vocabulary detectors struggle in remote sensing
Open-vocabulary detectors like OWL ViT v2 are trained on massive collections of web image-text pairs and generalize well to natural images. In remote sensing they face two core problems: many categories are fine-grained (for example chimney versus storage tank), and the imaging geometry is unusual (nadir aerial tiles, rotated objects, small object scales). These factors cause the text and visual embeddings of look-alike categories to overlap, which reduces precision even when recall remains high.
FLAME: a one-step active learning cascade
FLAME is designed to combine the broad coverage of an open-vocabulary detector with the precision of a lightweight, task-specific refiner — without hours of GPU fine-tuning or thousands of labels. The idea is to keep the base open-vocabulary detector frozen to preserve recall and generalization, then add a tiny classifier that learns the exact semantics a user needs from a handful of targeted labels.
Pipeline in detail
The FLAME pipeline is a sequence of steps that finds the most informative samples for quick specialization:
- Run a zero-shot open-vocabulary detector to generate many candidate boxes for a text query (for example chimney).
- Represent every candidate with its visual features and its similarity to the text query.
- Identify marginal samples near the decision boundary by projecting features to a low-dimensional space (PCA), estimating density, and selecting the uncertain band (see the selection sketch after this list).
- Enforce diversity by clustering that uncertain band and picking one item per cluster.
- Ask a user to label roughly 30 crops as positive or negative.
- Optionally rebalance skewed labels using SMOTE or SVM-SMOTE.
- Train a small refiner (for example an RBF SVM or a two-layer MLP) to accept or reject the original proposals.
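To make the selection step concrete, here is a minimal sketch in Python with scikit-learn. The function name, the similarity band, and the kernel-density bandwidth are illustrative assumptions rather than the paper’s exact settings; the sketch assumes the candidate features and text-query similarities have already been extracted from the frozen detector.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity

def select_marginal_diverse(features, text_sims, n_labels=30,
                            band=(0.35, 0.65), n_components=8):
    """Pick a small, diverse set of uncertain candidates to label.

    features:  (N, D) visual embeddings of candidate boxes.
    text_sims: (N,) similarity of each box to the text query, in [0, 1].
    band:      similarity range treated as "near the decision boundary"
               (hypothetical thresholds, chosen for illustration).
    """
    # 1. Keep only candidates whose query similarity is ambiguous.
    uncertain = np.where((text_sims >= band[0]) & (text_sims <= band[1]))[0]

    # 2. Project their features to a low-dimensional space with PCA.
    z = PCA(n_components=min(n_components, len(uncertain))).fit_transform(
        features[uncertain])

    # 3. Estimate density so candidates from well-populated regions are
    #    preferred over isolated outliers.
    log_density = KernelDensity(bandwidth=1.0).fit(z).score_samples(z)

    # 4. Cluster the uncertain band and take the densest point per cluster,
    #    which enforces diversity across the selected crops.
    k = min(n_labels, len(uncertain))
    clusters = KMeans(n_clusters=k, n_init=10).fit_predict(z)
    picks = [uncertain[np.argmax(np.where(clusters == c, log_density, -np.inf))]
             for c in range(k)]
    return np.asarray(picks)
```

The indices returned by `select_marginal_diverse` point at the crops to show the user, one per cluster, which is how a budget of roughly 30 labels can cover the whole uncertain region.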
Because the base detector stays frozen, FLAME preserves the high-recall proposals of the open-vocabulary model while the refiner, sketched below, removes false positives and captures the user’s intended semantics.
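The refiner itself can be equally small. Below is a sketch of the final steps under the same assumptions: optional SVM-SMOTE rebalancing when the roughly 30 labels are skewed, an RBF SVM trained on the labeled crops, and a cascade step that keeps only the proposals the refiner accepts. The imbalance heuristic and the 0.5 acceptance threshold are hypothetical choices for illustration.

```python
import numpy as np
from imblearn.over_sampling import SVMSMOTE
from sklearn.svm import SVC

def train_refiner(crop_feats, labels):
    """Fit the accept/reject classifier on ~30 user-labeled crops.

    crop_feats: (n, D) embeddings of the labeled crops.
    labels:     (n,) 1 for positive, 0 for negative.
    """
    # Rebalance only when the labels are noticeably skewed (hypothetical
    # threshold; the paper leaves rebalancing optional).
    pos = int(np.sum(labels))
    if min(pos, len(labels) - pos) < len(labels) // 3:
        crop_feats, labels = SVMSMOTE(k_neighbors=3).fit_resample(crop_feats, labels)

    # An RBF SVM is tiny and trains in seconds on a CPU; no GPU fine-tuning.
    return SVC(kernel="rbf", probability=True).fit(crop_feats, labels)

def refine_proposals(refiner, proposal_feats, boxes, threshold=0.5):
    """Cascade step: keep only the proposals the refiner accepts.

    proposal_feats: (m, D) embeddings of all base-detector proposals.
    boxes:          (m, 4) array of the corresponding proposal boxes.
    """
    keep = refiner.predict_proba(proposal_feats)[:, 1] >= threshold
    return boxes[keep]
```

In a full pipeline, `proposal_feats` would be the same frozen-detector embeddings used during selection, so the cascade adds essentially no overhead on top of the base model’s forward pass.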
Datasets, base models, and experimental setup
Evaluation was performed on two remote sensing detection benchmarks: DOTA (oriented boxes, 15 categories, high-resolution aerial imagery) and DIOR (23,463 images, 192,472 instances, 20 categories). Comparisons include a zero-shot OWL ViT v2 baseline, RS OWL ViT v2 (fine-tuned on RS WebLI), and several few-shot baselines.
RS OWL ViT v2 raises the zero-shot mean AP to 31.827% on DOTA and 29.387% on DIOR, and this model serves as the starting point for FLAME.
Performance highlights
Zero-shot OWL ViT v2 starts at 13.774% mean AP on DOTA and 14.982% on DIOR; RS OWL ViT v2 roughly doubles those scores, and FLAME then delivers the large precision gains on top. With about 30 labeled shots per category and no base-model fine-tuning, FLAME cascaded on RS OWL ViT v2 reaches 53.96% AP on DOTA and 53.21% AP on DIOR, surpassing the prior few-shot methods listed by the authors. A striking example: on DIOR, the chimney class AP jumps from 0.11 zero-shot to 0.94 after FLAME, showing strong suppression of look-alike false positives.
Adaptation runs in about one minute per label on a standard CPU, enabling near-real-time, user-in-the-loop specialization.
What this means in practice
FLAME provides a practical path to quickly specialize open-vocabulary detectors for remote sensing tasks where categories are fine-grained or imaging conditions differ from web-scale training data. By selecting marginal and diverse samples, collecting only a few dozen labels, and training a lightweight refiner, users can reach state-of-the-art few-shot performance without long GPU runs or massive labeling efforts.
For further technical details and experimental tables, see the linked paper and project resources.