TRANSFORMER-BASED MODELS FOR VISUOMOTOR PLANNING
At the end of my graduate studies, it became clear to me that the future of embodied AI was unlikely to rest on algorithmic task and motion planning (TAMP). While incredibly capable, these approaches to long-horizon robotic tasks require enormous hand-engineering for practical deployment. Moreover, because they rely on known symbolic representations of their environments and static domain definitions, their ability to generalize to out-of-distribution tasks is limited. In settings with known environments, objects, semi-static obstacles, and well-defined interactions, integrated TAMP methods are reliable, provable, and precisely what is needed. However, most applications one could imagine robots revolutionizing involve systems operating in unstructured, partly unknown, and constantly changing scenes.
illustration of various policy types for RL [1]
This discrepancy prompted me to alter the course of my research and familiarize myself with the recent groundbreaking publications introducing transformer and diffusion models into reinforcement learning [1-12]. Pre-existing approaches map observations to actions by crafting different action representations. These mappings can be made explicitly, by learning a mixture of Gaussians or a categorical distribution over a finite set of actions, or implicitly, by learning an energy function conditioned on observations and actions and selecting the actions that minimize the energy landscape.
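To make the distinction concrete, here is a minimal PyTorch sketch of both families. The layer sizes, the uniform candidate sampler, and all names are my own illustrative choices, not taken from any of the cited papers.

```python
import torch
import torch.nn as nn

class GMMActionHead(nn.Module):
    """Explicit representation: model p(action | obs) as a mixture of Gaussians."""
    def __init__(self, obs_dim: int, act_dim: int, n_modes: int = 5):
        super().__init__()
        self.n_modes, self.act_dim = n_modes, act_dim
        self.net = nn.Linear(obs_dim, n_modes * (1 + 2 * act_dim))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Distribution:
        out = self.net(obs)
        logits = out[..., : self.n_modes]                       # mixture weights
        mu, log_std = out[..., self.n_modes :].chunk(2, dim=-1)
        comp = torch.distributions.Independent(
            torch.distributions.Normal(
                mu.view(-1, self.n_modes, self.act_dim),
                log_std.view(-1, self.n_modes, self.act_dim).exp(),
            ),
            1,
        )
        mix = torch.distributions.Categorical(logits=logits)
        return torch.distributions.MixtureSameFamily(mix, comp)

class EnergyActionHead(nn.Module):
    """Implicit representation: scalar energy E(obs, action); act = argmin over actions."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.act_dim = act_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    @torch.no_grad()
    def act(self, obs: torch.Tensor, n_samples: int = 1024) -> torch.Tensor:
        # Derivative-free inference: sample candidate actions in [-1, 1] and
        # keep the one with minimum energy (in the spirit of implicit BC).
        cand = torch.rand(n_samples, self.act_dim) * 2 - 1
        energy = self.net(torch.cat([obs.expand(n_samples, -1), cand], dim=-1))
        return cand[energy.squeeze(-1).argmin()]

obs = torch.randn(16, 32)
action = GMMActionHead(32, 7)(obs).sample()        # explicit: sample from p(a|o)
best = EnergyActionHead(32, 7).act(obs[0])         # implicit: minimize the energy
```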
These newer approaches, particularly Diffusion Policy by Chi et al. [1], change this action representation entirely by introducing a transformer-based diffusion model that gradually refines noise into actions by learning a gradient field over the space of robot actions. Furthermore, these models accept vision observations directly and operate on closed-loop action sequences, which allows failed actions to be replanned. The resulting pipeline exhibited an enormous improvement over prior state-of-the-art approaches and opened a new direction for robotics.
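The sampling side of this idea is easy to sketch. Below is a generic DDPM-style reverse loop over an action sequence, conditioned on an observation embedding; the `NoisePredictor` MLP is a stand-in for the transformer backbone of [1], and the noise schedule is a textbook default rather than the paper's.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Stand-in for eps(noisy_actions, t, obs); Diffusion Policy uses a transformer here."""
    def __init__(self, horizon: int = 16, act_dim: int = 7, obs_dim: int = 64, hidden: int = 256):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        self.net = nn.Sequential(
            nn.Linear(horizon * act_dim + obs_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, horizon * act_dim),
        )

    def forward(self, noisy_actions, t, obs):
        x = torch.cat([noisy_actions.flatten(1), obs, t.float().unsqueeze(-1)], dim=-1)
        return self.net(x).view(-1, self.horizon, self.act_dim)

@torch.no_grad()
def sample_actions(model: NoisePredictor, obs: torch.Tensor, n_steps: int = 50):
    """DDPM reverse process: refine Gaussian noise into an action sequence."""
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    a = torch.randn(obs.shape[0], model.horizon, model.act_dim)   # start from pure noise
    for t in reversed(range(n_steps)):
        eps = model(a, torch.full((obs.shape[0],), t), obs)       # predicted noise
        a = (a - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:                                                 # re-noise except at the last step
            a = a + betas[t].sqrt() * torch.randn_like(a)
    return a

actions = sample_actions(NoisePredictor(), torch.randn(4, 64))    # shape (4, 16, 7)
```

In closed-loop operation, only the first few steps of the sampled horizon are executed before re-sampling from the new observation, which is what makes replanning after a failed action cheap.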
As recent advances in image generation demonstrate, diffusion models can operate stably over high-dimensional output spaces [13]. This property appears to carry over to robotics, where, in the case of manipulation and long-horizon task planning, we often deal with robots that have many degrees of freedom and highly non-linear dynamics.
transformer-based RL policy Octo [8]
video of diffusion policy in action [5]
Given this, I took it upon myself to train my own versions of these models so I could experiment with modifying the transformer backbone. The recent explosion in large language models has prompted enormous interest in improving transformers' capacity, flexibility, and inference speed [13-20]. We have even seen transformer-based models succeed at traditionally intractable state-space problems [14], [21-23]. These advancements prompted me to investigate whether similar improvements could be applied to the work mentioned above.
environments used in real-world experiments [1]
Currently, I am investigating methods to increase the capacity of these visuomotor models by introducing the mixture-of-experts [24] transformer blocks presented by Jiang et al. in their work titled Mixtral of Experts [25]. In that work, the authors built a sparse mixture-of-experts model on top of the open-source LLM Mistral [26] that achieved performance similar to or better than the 70-billion-parameter LLaMA 2 [27] and GPT-3.5-Turbo. Jiang et al. hypothesized that different "experts" in their model would specialize in particular forms or topics of speech. Additionally, Lin et al. demonstrated a similar improvement in large vision-language models [28]. Qualitatively, a similar advantage may be gained in robotics applications, as there are often many modes of interaction that a robot must navigate. A simplified sketch of such a block appears below.
comparison of Mixtral with LLaMA 2 [25]
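The top-k routing and the softmax over the selected experts' scores follow the general Mixtral recipe [25], but the dimensions and the per-expert loop here are my own illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEBlock(nn.Module):
    """Sparse MoE feed-forward layer: a router sends each token to its top-k experts."""
    def __init__(self, d_model: int = 256, d_ff: int = 1024, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, seq, d_model)
        tokens = x.view(-1, x.shape[-1])                    # flatten to (n_tokens, d_model)
        scores, idx = self.router(tokens).topk(self.top_k, dim=-1)
        weights = F.softmax(scores, dim=-1)                 # renormalize the selected experts
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)   # tokens routed to expert e
            if token_ids.numel():
                out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
        return out.view_as(x)
```

In my case, the idea is to swap this block in for the dense feed-forward sub-layer of each transformer block in the visuomotor policy backbone, the hope being that different experts specialize in different interaction modes (free-space motion, contact-rich manipulation, and so on).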
Additionally, open-source ML model repositories have made accessing high-quality pre-trained transformer models easier than ever. One can imagine the power a visuomotor policy that accepts textual instructions would possess. I am investigating training these higher-capacity visuomotor policies with text labels for each task, conditioning the model to understand task instructions in plain text, similar to [8], [10], [12] and as sketched below. Doing so might require a model trained on many such tasks, so my available GPU compute will be a limitation.
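A minimal sketch of such conditioning, assuming the Hugging Face `transformers` library and a frozen CLIP text encoder: the class, layer sizes, and simple concatenation are hypothetical illustrations, and systems like [8] instead fuse language at the token level inside the transformer.

```python
import torch
import torch.nn as nn
from transformers import CLIPTextModel, CLIPTokenizer

class TextConditionedPolicy(nn.Module):
    """Hypothetical policy head conditioned on a frozen CLIP text embedding."""
    def __init__(self, obs_dim: int = 64, act_dim: int = 7, text_dim: int = 512):
        super().__init__()
        name = "openai/clip-vit-base-patch32"
        self.tokenizer = CLIPTokenizer.from_pretrained(name)
        self.text_encoder = CLIPTextModel.from_pretrained(name).requires_grad_(False)
        self.head = nn.Sequential(
            nn.Linear(obs_dim + text_dim, 256), nn.ReLU(), nn.Linear(256, act_dim)
        )

    def forward(self, obs_features: torch.Tensor, instructions: list[str]) -> torch.Tensor:
        tokens = self.tokenizer(instructions, padding=True, return_tensors="pt")
        with torch.no_grad():                                  # the encoder stays frozen
            text_emb = self.text_encoder(**tokens).pooler_output   # (batch, 512)
        return self.head(torch.cat([obs_features, text_emb], dim=-1))

# policy = TextConditionedPolicy()
# actions = policy(torch.randn(2, 64), ["pick up the red block", "open the drawer"])
```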
As of January 2024, this is an ongoing investigation. I hope to demonstrate notable improvements over the current state of the art, evaluate them on RL benchmarks [29-30], and publish this work.
[1] C. Chi et al., “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.” arXiv, Mar. 07, 2023. doi: 10.48550/arXiv.2303.04137.
[2] J. Luo et al., “SERL: A Software Suite for Sample-Efficient Robotic Reinforcement Learning.” arXiv, Jan. 31, 2024. doi: 10.48550/arXiv.2401.16013.
[3] P. Wu et al., “Masked Trajectory Models for Prediction, Representation, and Control.” arXiv, May 04, 2023. doi: 10.48550/arXiv.2305.02968.
[4] D. Shah, A. Sridhar, A. Bhorkar, N. Hirose, and S. Levine, “GNM: A General Navigation Model to Drive Any Robot.” arXiv, May 22, 2023. doi: 10.48550/arXiv.2210.03370.
[5] Z. Fu, T. Z. Zhao, and C. Finn, “Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation.” arXiv, Jan. 04, 2024. doi: 10.48550/arXiv.2401.02117.
[6] I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell, “Real-World Robot Learning with Masked Visual Pre-training.” arXiv, Oct. 06, 2022. doi: 10.48550/arXiv.2210.03109.
[7] I. Radosavovic, B. Shi, L. Fu, K. Goldberg, T. Darrell, and J. Malik, “Robot Learning with Sensorimotor Pre-training.” arXiv, Dec. 14, 2023. doi: 10.48550/arXiv.2306.10007.
[8] D. Ghosh et al., “Octo: An Open-Source Generalist Robot Policy”.
[9] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn, “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.” arXiv, Apr. 23, 2023. doi: 10.48550/arXiv.2304.13705.
[10] M. Ahn et al., “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances.” arXiv, Aug. 16, 2022. doi: 10.48550/arXiv.2204.01691.
[11] J. Yang, D. Sadigh, and C. Finn, “Polybot: Training One Policy Across Robots While Embracing Variability.” arXiv, Jul. 07, 2023. doi: 10.48550/arXiv.2307.03719.
[12] T. Xiao et al., “Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models.” arXiv, Jul. 01, 2023. doi: 10.48550/arXiv.2211.11736.
[13] P. Marion et al., “Implicit Diffusion: Efficient Optimization through Stochastic Sampling.” arXiv, Feb. 08, 2024. doi: 10.48550/arXiv.2402.05468.
[14] D. Ulmer, E. Mansimov, K. Lin, J. Sun, X. Gao, and Y. Zhang, “Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk.” arXiv, Jan. 10, 2024. doi: 10.48550/arXiv.2401.05033.
[15] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” arXiv, Dec. 13, 2023. doi: 10.48550/arXiv.2305.18290.
[16] V. Dwaracherla, S. M. Asghari, B. Hao, and B. Van Roy, “Efficient Exploration for LLMs.” arXiv, Feb. 01, 2024. doi: 10.48550/arXiv.2402.00396.
[17] N. Shazeer, “GLU Variants Improve Transformer.” arXiv, Feb. 12, 2020. doi: 10.48550/arXiv.2002.05202.
[18] P. Lu, T. Jiang, Y. Li, X. Li, K. Chen, and W. Yang, “RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation.” arXiv, Dec. 12, 2023. doi: 10.48550/arXiv.2312.07526.
[19] V. Micheli, E. Alonso, and F. Fleuret, “Transformers are Sample Efficient World Models.” Sep. 01, 2022. Accessed: Oct. 04, 2022. [Online]. Available: http://arxiv.org/abs/2209.00588
[20] M. Oren, M. Hassid, Y. Adi, and R. Schwartz, “Transformers are Multi-State RNNs.” arXiv, Jan. 11, 2024. doi: 10.48550/arXiv.2401.06104.
[21] Q. Anthony, Y. Tokpanov, P. Glorioso, and B. Millidge, “BlackMamba: Mixture of Experts for State-Space Models.” arXiv, Feb. 01, 2024. doi: 10.48550/arXiv.2402.01771.
[22] A. Ruoss et al., “Grandmaster-Level Chess Without Search.” arXiv, Feb. 06, 2024. doi: 10.48550/arXiv.2402.04494.
[23] M.-C. Dinu, C. Leoveanu-Condrei, M. Holzleitner, W. Zellinger, and S. Hochreiter, “SymbolicAI: A framework for logic-based approaches combining generative models and solvers.” arXiv, Feb. 05, 2024. doi: 10.48550/arXiv.2402.00854.
[24] N. Shazeer et al., “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.” arXiv, Jan. 23, 2017. doi: 10.48550/arXiv.1701.06538.
[25] A. Q. Jiang et al., “Mixtral of Experts.” arXiv, Jan. 08, 2024. Accessed: Jan. 16, 2024. [Online]. Available: http://arxiv.org/abs/2401.04088
[26] A. Q. Jiang et al., “Mistral 7B.” arXiv, Oct. 10, 2023. doi: 10.48550/arXiv.2310.06825.
[27] C. Wu et al., “LLaMA Pro: Progressive LLaMA with Block Expansion.” arXiv, Jan. 04, 2024. doi: 10.48550/arXiv.2401.02415.
[28] B. Lin et al., “MoE-LLaVA: Mixture of Experts for Large Vision-Language Models.” arXiv, Feb. 04, 2024. doi: 10.48550/arXiv.2401.15947.
[29] O. X.-E. Collaboration et al., “Open X-Embodiment: Robotic Learning Datasets and RT-X Models.” arXiv, Dec. 17, 2023. doi: 10.48550/arXiv.2310.08864.
[30] J. Luo et al., “FMB: a Functional Manipulation Benchmark for Generalizable Robotic Learning.” arXiv, Jan. 16, 2024. Accessed: Jan. 17, 2024. [Online]. Available: http://arxiv.org/abs/2401.08553