A ROS 2 Wrapper for Florence-2: Multi-Mode Local Vision-Language Inference for Robotic Systems

J. E. Domínguez-Vidal
4/1/2026
cs.RO, cs.AI, cs.CV

Abstract

Foundation vision-language models are becoming increasingly relevant to robotics because they can provide richer semantic perception than narrow, task-specific pipelines. However, their practical adoption in robot software stacks still depends on reproducible middleware integrations rather than on model quality alone. Florence-2 is especially attractive in this regard because it unifies captioning, optical character recognition, open-vocabulary detection, grounding, and related vision-language tasks within a comparatively manageable model size. This article presents a ROS 2 wrapper for Florence-2 that exposes the model through three complementary interaction modes: continuous topic-driven processing, synchronous service calls, and asynchronous actions. The wrapper is designed for local execution and supports both native installation and Docker container deployment. It also combines generic JSON outputs with standard ROS 2 message bindings for detection-oriented tasks. A functional validation is reported together with a throughput study on several GPUs, showing that local deployment is feasible on consumer-grade hardware. The repository is publicly available at: https://github.com/JEDominguezVidal/florence2_ros2_wrapper
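To illustrate how a consumer node might map the wrapper's generic JSON output for detection-oriented tasks into structured bounding-box records, the sketch below parses a Florence-2-style detection payload. This is a minimal illustration, not the wrapper's actual implementation: the field names (`bboxes`, `labels`) and the flat `[x_min, y_min, x_max, y_max]` box layout are assumptions, and a real node would instead fill standard ROS 2 detection messages (e.g., from `vision_msgs`) inside a subscriber callback.

```python
import json
from dataclasses import dataclass


@dataclass
class Detection2D:
    """Plain-Python stand-in for a ROS 2 detection message."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def parse_florence_json(payload: str) -> list[Detection2D]:
    """Convert a detection payload into bounding-box records.

    Assumes a hypothetical schema with parallel 'bboxes' and 'labels'
    lists; adapt the keys to the wrapper's real JSON output.
    """
    data = json.loads(payload)
    return [
        Detection2D(label, *bbox)
        for bbox, label in zip(data["bboxes"], data["labels"])
    ]


# Example payload in the assumed schema.
sample = '{"bboxes": [[10.0, 20.0, 110.0, 220.0]], "labels": ["person"]}'
for det in parse_florence_json(sample):
    print(f"{det.label}: ({det.x_min}, {det.y_min}) -> ({det.x_max}, {det.y_max})")
```

In a real subscriber, the same parsing step would sit inside the callback that receives the wrapper's JSON topic, with the records republished as typed detection messages for downstream consumers.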


Code Implementations (5)

1. Florence-2 wrapper for ROS 2. CMake, Shell; 20 stars; last updated Feb 21, 2026; MIT license.

2. Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models". 13,195 stars, 881 forks; created Jun 18, 2021; last updated 1 year ago; MIT license. Topics: adaptation, deberta, deep-learning, gpt-2, gpt-3, +5 more.

3. This repository contains the official implementation of "FastVLM: Efficient Vision Encoding for Vision Language Models" (CVPR 2025). 7,164 stars, 537 forks; created May 1, 2025; last updated 11 months ago; no license assertion.

4. An AI-powered research assistant that performs iterative, deep research on any topic by combining search engines, web scraping, and large language models. The goal of this repo is to provide the simplest implementation of a deep research agent, i.e. an agent that can refine its research direction over time and dive deep into a topic. 18,367 stars, 1,895 forks; created Feb 4, 2025; last updated 7 months ago; MIT license. Topics: agent, ai, gpt, o3-mini, research.

5. Integrates an object recognition application for vision-based object picking, based on a YOLO-v3 model with inference on a GPU-based HPE server, to detect objects (mAP@50 of 57.92%) in the workspace and reach them via ROS commands. The application is also adaptable to whichever predicted detection classes are available. 51 stars; created Jul 31, 2019; last updated 6 years ago; MIT license.
