λ: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile Manipulation Robotics

1Brown University, 2Rutgers University, 3University of Pennsylvania

Abstract

Efficiently learning and executing long-horizon mobile manipulation (MoMa) tasks is crucial for advancing robotics in household and workplace settings. However, current MoMa models are data-inefficient, and the realistically sized benchmarks needed to evaluate the data efficiency of improved models do not yet exist. To address this, we introduce the LAMBDA (λ) benchmark (Long-horizon Actions for Mobile-manipulation Benchmarking of Directed Activities), which evaluates the data efficiency of models on language-conditioned, long-horizon, multi-room, multi-floor, pick-and-place tasks using a dataset of manageable size that is more feasible to collect. The benchmark includes 571 human-collected demonstrations that provide realism and diversity in simulated and real-world settings. Unlike planner-generated data, these trajectories offer natural variability and replay-verifiability, ensuring robust learning and evaluation. We benchmark several models, including learning-based models and a neuro-symbolic modular approach that combines foundation models with task and motion planning. Learning-based models show suboptimal success rates, even when leveraging pretrained weights, underscoring significant data inefficiencies. However, the neuro-symbolic approach performs significantly better while being more data-efficient. These findings highlight the need for more data-efficient learning-based MoMa approaches. λ addresses this gap by serving as a key benchmark for evaluating the data efficiency of such future models on household robotics tasks.

Overview

Overview figure

Improving data efficiency for long-horizon mobile manipulation (MoMa) tasks is critical for practical robotic deployment in human environments. Current approaches demand large-scale datasets, which are costly and resource-intensive. To bridge this gap, we introduce the LAMBDA (λ) benchmark, a dataset with 571 language-conditioned, human-collected demonstrations covering diverse indoor multi-room, multi-floor pick-and-place tasks. Unlike existing large datasets, λ emphasizes data efficiency, realistic variability, and replay-verifiability, serving as a valuable testbed for developing robust, practical MoMa models.


Related works figure

Existing benchmarks for MoMa robotics often lack critical elements necessary for real-world applicability, such as natural language conditioning, long-horizon tasks, human-collected demonstrations, and real-world validation. Most benchmarks rely on planner-generated data or templated commands, or are restricted to tabletop manipulation, limiting their realism and utility. Very few benchmarks offer free-form natural language instructions, quadruped robot data, or multi-room/floor navigation. The λ benchmark uniquely integrates all these elements, addressing significant gaps and providing a comprehensive evaluation framework for realistic, long-horizon MoMa tasks.




Demonstrations

Trajectory figure

Second demonstration figure

The λ benchmark assesses models' data efficiency specifically for long-horizon mobile manipulation tasks involving language-conditioned, room-to-room, and floor-to-floor pick-and-place activities. It comprises 571 expert human-collected demonstrations, blending simulated data (521 trajectories) with real-world data (50 trajectories), reflecting diverse objects, instructions, and realistic complexity. Tasks are specified via crowdsourced free-form natural language instructions, challenging models to generalize robustly across varied linguistic expressions. The use of both simulated environments and real-world data ensures comprehensive and practical evaluation.
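
To make the data format concrete, below is a minimal Python sketch of how a single language-conditioned demonstration might be represented and iterated for behavior cloning. The field names (instruction, rgb_path, base_action, arm_action) and the example values are illustrative assumptions for this page, not the released dataset's actual schema.

    from dataclasses import dataclass
    from typing import List

    # Hypothetical schema for a LAMBDA-style demonstration; the released
    # dataset's actual field names and file format may differ.
    @dataclass
    class Step:
        rgb_path: str              # egocentric camera frame for this timestep
        base_action: List[float]   # e.g., (dx, dy, dtheta) for the mobile base
        arm_action: List[float]    # e.g., end-effector delta pose + gripper state

    @dataclass
    class Demonstration:
        instruction: str           # free-form, crowdsourced natural language command
        scene_id: str              # simulated scene or real-world environment tag
        steps: List[Step]

    def iterate_language_conditioned_pairs(demo: Demonstration):
        """Yield (instruction, observation, action) tuples for behavior cloning."""
        for step in demo.steps:
            yield demo.instruction, step.rgb_path, (step.base_action, step.arm_action)

    # Illustrative usage with made-up values.
    demo = Demonstration(
        instruction="Take the mug from the kitchen counter upstairs to the bedroom desk.",
        scene_id="sim_scene_01",
        steps=[Step("frame_0000.png", [0.1, 0.0, 0.05], [0.0] * 7)],
    )
    for lang, obs, action in iterate_language_conditioned_pairs(demo):
        print(lang, obs, action)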



Results

To establish baseline performance for λ, we benchmarked two behavior cloning (BC) models: RT-1, a transformer-based MoMa model originally trained on large-scale robot data, and MotionGlot-MoMa (MG-MoMa), an adapted version of MotionGlot designed for multi-embodiment action generation. Both models were evaluated when trained from scratch and when fine-tuned from their pretrained parameters. Additionally, we evaluated LIMP, a zero-shot neuro-symbolic system that integrates large multimodal foundation models with task and motion planning and requires no training on robot demonstrations. Models were assessed with a success rate metric over long-horizon tasks, where each task comprises sequential subtasks: navigating to the object, grasping it, transporting it, and placing it at the goal location.
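
As a rough illustration of this metric, the sketch below scores an episode as successful only if every subtask in the navigate-grasp-transport-place sequence succeeds, and averages over evaluation episodes. It assumes all-or-nothing scoring; the benchmark's exact scoring may also credit partial progress, so treat this as a minimal sketch rather than the benchmark's evaluation code.

    from typing import Dict, Sequence

    # Subtask names mirror the pipeline described above; they are labels for
    # this sketch, not identifiers from the benchmark's codebase.
    SUBTASKS = ("navigate_to_object", "grasp", "transport", "place_at_goal")

    def episode_success(subtask_outcomes: Dict[str, bool]) -> bool:
        """An episode counts as a success only if every subtask succeeded."""
        return all(subtask_outcomes.get(name, False) for name in SUBTASKS)

    def success_rate(episodes: Sequence[Dict[str, bool]]) -> float:
        """Fraction of evaluation episodes that fully completed the task."""
        if not episodes:
            return 0.0
        return sum(episode_success(ep) for ep in episodes) / len(episodes)

    # Example: one full success and one episode that failed at placement.
    rate = success_rate([
        {"navigate_to_object": True, "grasp": True, "transport": True, "place_at_goal": True},
        {"navigate_to_object": True, "grasp": True, "transport": True, "place_at_goal": False},
    ])
    print(rate)  # 0.5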


Scene generalization experiment figure

RT-1 and MG-MoMa achieved low success rates averaging 2.7% and 2.4%, respectively, indicating significant difficulty generalizing to unseen environments with novel room layouts and object placements. Both models only marginally surpassed a random baseline, confirming minimal learning. Performance varied across scenes with their complexity and spatial constraints, with simpler tasks scoring somewhat higher. Overall, these results underscore the current limitations of end-to-end models for scene generalization in long-horizon MoMa tasks.


Task generalization experiment figure

In task generalization, RT-1 and MG-MoMa exhibited slightly improved but still low success rates, around 5%. Fine-tuning these models with pretrained parameters did not substantially enhance performance, indicating limited benefits from pretraining on external data. Conversely, LIMP, the zero-shot neuro-symbolic system, significantly outperformed end-to-end models with a 44.4% success rate. This result highlights the promise of neuro-symbolic approaches in generalizing to novel tasks without robot-specific demonstration training.





Ablations

Ablations figure

We investigated how varying dataset sizes and model architectures impact data efficiency in MoMa tasks. Increasing the dataset size from 25% to 100% led to modest performance improvements for RT-1, indicating data inefficiency. Architectural comparisons revealed that replacing RT-1's transformer with a Mamba architecture (RM-1) resulted in consistently better performance across all dataset sizes, outperforming both the transformer and LSTM alternatives. These findings suggest that architectural choices significantly influence data efficiency, with Mamba-based models demonstrating superior generalization capabilities for long-horizon MoMa tasks.
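
The sketch below illustrates the shape of such a data-efficiency ablation: the same architecture is trained from scratch on growing fractions of the demonstrations and evaluated on held-out tasks. The train_fn and eval_fn callables are placeholders for whatever training and evaluation pipeline is used (e.g., RT-1, RM-1, or an LSTM baseline); they are assumptions for illustration, not the benchmark's actual API.

    import random
    from typing import Callable, Sequence

    def data_efficiency_curve(
        demos: Sequence,
        train_fn: Callable[[Sequence], object],
        eval_fn: Callable[[object], float],
        fractions=(0.25, 0.5, 0.75, 1.0),
        seed=0,
    ):
        """Train on growing fractions of the demos and record held-out performance."""
        rng = random.Random(seed)
        shuffled = list(demos)
        rng.shuffle(shuffled)
        curve = {}
        for frac in fractions:
            subset = shuffled[: max(1, int(len(shuffled) * frac))]
            model = train_fn(subset)       # train from scratch on this fraction
            curve[frac] = eval_fn(model)   # e.g., success rate on held-out tasks
        return curve

    # Toy usage with dummy stand-ins, just to show the shape of the output.
    curve = data_efficiency_curve(
        demos=list(range(521)),                 # stand-in for 521 simulated trajectories
        train_fn=lambda subset: len(subset),    # dummy "model"
        eval_fn=lambda model: model / 521.0,    # dummy "success rate"
    )
    print(curve)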




Acknowledgements

This work is supported by ONR under grant award numbers N00014-22-1-2592 and N00014-23-1-2794, NSF under grant award number CNS-2150184, and with support from Amazon Robotics. We also thank Aryan Singh, George Chemmala, Ziyi Yang, David Paulius, Ivy He, Lakshita Dodeja, Mingxi Jia, Benned Hedegaard, Thao Nguyen, Selena Williams, Tuluhan Akbulut, and George Konidaris for their help in various phases of work.





BibTeX


    @misc{lambdabenchmark,
      title={{\lambda}: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile Manipulation Robotics}, 
      author={Ahmed Jaafar and Shreyas Sundara Raman and Yichen Wei and Sofia Juliani and Anneke Wernerfelt and Benedict Quartey and Ifrah Idrees and Jason Xinyu Liu and Stefanie Tellex},
      year={2025},
      eprint={2412.05313},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2412.05313}, 
    }