LaNMP is a mobile manipulation robot dataset comprising Natural Language, Navigation, Manipulation, and Perception (LaNMP) data. The dataset is collected in both simulated and real-world environments. The environments are multi-room, which makes the tasks long-horizon in nature. The tasks are pick-and-place tasks described by humans to a robot in natural language. The trajectories, which are collected from robots via human teleoperation, contain LaNMP data at every time step. There are 524 simulated and 50 real trajectories, totaling 574 trajectories.
- Name, Team: Ahmed Jaafar (Owner)
Brown University, Rutgers University, University of Pennsylvania
- Academic - Tech
- Publishing POC: Ahmed Jaafar
- Affiliation: Brown University
- Contact: [email protected]
- Website:
- Ahmed Jaafar, Brown University
- Shreyas Sundara Raman, Brown University
- Yichen Wei, Brown University
- Sofia Juliani, Rutgers University
- Anneke Wernerfelt, University of Pennsylvania
- Ifrah Idrees, Brown University
- Jason Xinyu Liu, Brown University
- Stefanie Tellex, Associate Professor, Brown University
- Office of Naval Research (ONR)
- National Science Foundation (NSF)
- Amazon Robotics
This work is supported by ONR under grant award numbers N00014-22-1-2592 and N00014-23-1-2794, NSF under grant award number CNS-2150184, and with support from Amazon Robotics.
- Data about places and objects
- Synthetically generated data
- Data about systems or products and their behaviors
- Others (Language data provided by humans, robot movement and visual data)
Category | Data |
Size of Dataset | 288400 MB |
Number of Instances | 574 |
Human Labels | 574 |
Capabilities | 4 |
Avg. Trajectory Length | 247 |
Number of environments | 8 |
Number of rooms | 30 |
Number of actions | 12 |
Number of robots | 2 |
Above: The numbers combine the simulated and real datasets. "Capabilities" refers to the high-level aspects/modalities this dataset covers: Natural language, Navigation, Manipulation, and Perception. "Human Labels" refers to the natural language commands of robot tasks provided by humans. "Number of actions" refers to the high-level discrete actions in simulation only.
Additional Notes: The robots used are mobile manipulators. The simulation robot is from ManipulaTHOR and the real robot is a quadruped with an arm, a Boston Dynamics Spot.
Every data point in simulation (trajectory time step) contains these important aspects: natural language command, egocentric RGB-D, instance segmentations, bounding boxes, robot body pose, robot end-effector pose, and grasped object poses.
Every data point in real (trajectory time step) contains, at a high level: natural language command, egocentric RGB-D, gripper RGB-D, gripper instance segmentations, robot body pose, robot arm pose, feet positions, joint angles, robot body velocity, robot arm velocity, gripper open percentage, object held boolean.
Statistic | Simulation Trajectories | Real Trajectories |
count | 524 | 50 |
mean | 172 | 323 |
std | 71 | 187 |
min | 52 | 123 |
max | 594 | 733 |
Above: The mean, std, min, and max of the trajectories refer to their lengths.
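As a small worked check on the table above, the sketch below combines the per-split counts and mean lengths in two ways. It only uses the rounded values printed in this card, so the results are approximate. The "Avg. Trajectory Length" of 247 reported earlier appears consistent with the unweighted average of the two per-split means rather than a count-weighted mean; that reading is an inference from the arithmetic, not something the card states.

```python
# Per-split trajectory counts and mean lengths, copied from the table above.
counts = {"simulation": 524, "real": 50}
mean_lengths = {"simulation": 172, "real": 323}

total = sum(counts.values())

# Unweighted average of the two per-split means.
unweighted_mean = sum(mean_lengths.values()) / len(mean_lengths)

# Count-weighted mean length over all trajectories (uses rounded inputs).
weighted_mean = sum(counts[k] * mean_lengths[k] for k in counts) / total

print(total)                    # 574
print(unweighted_mean)          # 247.5
print(round(weighted_mean, 1))  # 185.2
```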
- User Content
- Anonymous Data
- Others (Robot movement and visual data)
- No Known Risks
Actively Maintained - No new versions will be made available, but this dataset will be actively maintained, including but not limited to updates to the data.
Current Version: 1.0
Last Updated: 06/2024
Release Date: 06/2024
Ahmed Jaafar will be maintaining this dataset and resolving dataset issues brought up by the community.
- Multimodal (Natural Language, Vision, Navigation, Manipulation)
Simulation | Value | Description |
Natural Language | "Go pick up the apple and put it on the couch." | The command the human tells the robot for completing a certain task |
Scene | "FloorPlan_Train8_1" | The simulation environment in AI2THOR |
Sim time | 0.19645 | The simulation time |
Wall clock time | 14:49:37 | The real-world time |
Body state | [4.0, 6.2, 7.5, 226] | The global state of the robot, [x, y, z, yaw] |
End-effector state | [2.59, 0.89, -4.17, -1.94, -1.27, 1.94] | The global state of the robot's end-effector, [x, y, z, roll, pitch, yaw] |
Hand sphere radius | 0.059 | The radius of the hand grasp field |
Held objects | [Apple] | A list of objects currently held by the robot |
Held object state | [4.4, 2.3, 5.1] | The global state of the currently held objects, [x, y, z] |
Bounding boxes | {"keys": [Apple], "values":[418, 42, 23, 321]} | The objects detected with bounding boxes and the coordinates of those boxes |
RGB | ./rgb_0.npy | The path to the RGB npy egocentric image of the time step |
Depth | ./depth_0.npy | The path to the depth npy egocentric image of the time step |
Instance segmentations | ./inst_seg_0.npy | The path to the instance segmentations npy egocentric image of the time step |
Real-world | Value | Description |
Natural Language | "Go pick up the apple and put it on the couch." | The command the human tells the robot for completing a certain task |
Scene | "FloorPlan_Train8_1" | The simulation environment in AI2THOR |
Wall clock time | 14:49:37 | The real-world time |
Body state | [4.0, 6.2, 7.5] | The global euclidean state of the robot, [x, y, z] |
Body state quaternion | [0.04, 0, 0, 0.99] | The global quaternion state of the robot body, [w, x, y, z] |
Body orientation | [0, 0.17, 3.05] | The global rotation of the robot body, [roll, pitch, yaw] |
Body linear velocity | [0, 0.5, 0.1] | The linear velocity of the robot body, [x, y, z] |
Body angular velocity | [0, 0.5, 0.1] | The angular velocity of the robot body, [x, y, z] |
Arm state | [0.5, 0, 0.26] | The robot arm state relative to the body, [x, y, z] |
Arm quaternion state | [0.99, 0, 0.7, 0.008] | The quaternion robot arm state relative to the body, [w, x, y, z] |
Arm state global | [1.9, 0.5, 0] | The global robot arm state, [x, y, z] |
Arm quaternion state global | [0.04, 0, 0, 0.99] | The global quaternion robot arm state, [w, x, y, z] |
Arm linear velocity | [0.2, 0.04, 0] | The linear velocity of the robot arm, [x, y, z] |
Arm angular velocity | [0.1, 0.4, 0.008] | The angular velocity of the robot arm, [x, y, z] |
Arm stowed | 1 | Boolean of if the arm is stowed or not |
Gripper open | 0.512 | The percentage of how open the gripper is |
Object held | 1 | Boolean of if an object is currently held by the gripper |
Feet state | [0.32, 0.17, 0], ... | The state of the four quadruped feet relative to the body, [x, y, z] |
Feet state global | [-0.21, 0.05, 0], ... | The global state of the four quadruped feet |
Joint angles | {fl.hx: -0.05, fl.hy: 0.79, -1.57, ...} | The angles of all the quadruped's joints |
Joint velocities | {fl.hx: 0.004, fl.hy: 0.01, 0.57, ...} | The velocities of all the quadruped's joints |
Left RGB | ./left_fisheye_image_0.npy | The path of the left eye RGB egocentric image, which captures the right side of the view |
Right RGB | ./right_fisheye_image_0.npy | The path of the right eye RGB egocentric image, which captures the left side of the view |
Left Depth | ./left_fisheye_depth_0.npy | The path of the left eye depth egocentric image, which captures the right side of the view |
Right Depth | ./right_fisheye_depth_0.npy | The path of the right eye depth egocentric image, which captures the left side of the view |
Left instance segmentations | ./left_fisheye_image_instance_seg_0.npy | The path of the left eye instance segmentations egocentric image, which captures the right side of the view |
Right instance segmentations | ./right_fisheye_image_instance_seg_0.npy | The path of the right eye instance segmentations egocentric image, which captures the left side of the view |
Gripper RGB | ./gripper_image_0.npy | The path of the gripper RGB image |
Gripper depth | ./gripper_depth_0.npy | The path of the gripper depth image |
Gripper instance segmentations | ./gripper_image_instance_seg_0.npy | The path of the gripper instance segmentations image |
{
    "nl_command": "Go to the table and pick up the salt and place it in the white bin in the living room.",
    "scene": "FloorPlan_Train8_1",
    "steps": [
        {
            "sim_time": 0.1852477639913559,
            "wall-clock_time": "15:10:47.900",
            "action": "Initialize",
            "state_body": [3.0, 0.9009992480278015, -4.5, 269.9995422363281],
            "state_ee": [2.5999975204467773, 0.8979992270469666, -4.171003341674805, -1.9440563492718068e-07, -1.2731799533306385, 1.9440386333307377e-07],
            "hand_sphere_radius": 0.05999999865889549,
            "held_objs": [],
            "held_objs_state": {},
            "inst_det2D": {
                "keys": [...],
                "values": [
                    [418, 43, 1139, 220], [315, 0, 417, 113], ...
                ]
            },
            "rgb": "./rgb_0.npy",
            "depth": "./depth_0.npy",
            "inst_seg": "./inst_seg_0.npy"
        },
        ...
    ]
}
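The simulation sample above can be read with standard JSON tooling. The following is a minimal sketch, assuming a trajectory is available as a JSON document with a `steps` list and that the `.npy` paths are relative to a per-trajectory directory. The helper `resolve_step_files` and the directory name `sim_traj_0` are hypothetical, and the released files may be packaged differently.

```python
import json
from pathlib import Path

# A trimmed stand-in for one trajectory, using only fields shown in the sample above.
raw = """
{
  "nl_command": "Go to the table and pick up the salt and place it in the white bin in the living room.",
  "scene": "FloorPlan_Train8_1",
  "steps": [
    {
      "action": "Initialize",
      "state_body": [3.0, 0.9009992480278015, -4.5, 269.9995422363281],
      "held_objs": [],
      "rgb": "./rgb_0.npy",
      "depth": "./depth_0.npy",
      "inst_seg": "./inst_seg_0.npy"
    }
  ]
}
"""


def resolve_step_files(step, traj_dir):
    """Map each image field of a step to a path under traj_dir (hypothetical layout)."""
    return {key: Path(traj_dir) / step[key] for key in ("rgb", "depth", "inst_seg")}


trajectory = json.loads(raw)
first_step = trajectory["steps"][0]

# Body state is [x, y, z, yaw], as described in the field table.
x, y, z, yaw = first_step["state_body"]

# "sim_traj_0" is a made-up directory name used only for illustration.
files = resolve_step_files(first_step, "sim_traj_0")
# Each resolved path could then be read with numpy.load(path).
```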
"language_command": "Go pick up Hershey's syrup in the room with the big window and bring it to the room with the other Spot.",
"scene_name": "",
"wall_clock_time": "12:50:10.923",
"left_fisheye_rgb": "./Trajectories/trajectories/data_3/",
"left_fisheye_depth": "./Trajectories/trajectories/data_3/",
"right_fisheye_rgb": "./Trajectories/trajectories/data_3/",
"right_fisheye_depth": "./Trajectories/trajectories/data_3/",
"gripper_rgb": "./Trajectories/trajectories/data_3/",
"gripper_depth": "./Trajectories/trajectories/data_3/",
"left_fisheye_instance_seg": "./Trajectories/trajectories/data_3/",
"right_fisheye_instance_seg": "./Trajectories/trajectories/data_3/",
"gripper_fisheye_instance_seg": "./Trajectories/trajectories/data_3/",
"body_state": {"x": 1.7732375781707208, "y": -0.2649551302417769, "z": 0.04729541059536978},
"body_quaternion": {"w": 0.11121513326494507, "x": 0.00003060940357089109, "y": 0.0006936040684443222, "z": 0.9937961119411372},
"body_orientation": {"r": 0.0017760928400286857, "p": 0.016947586302323542, "y": 2.919693676695565},
"body_linear_velocity": {"x": 0.0007985030885781894, "y": 0.0007107887103978708, "z": -0.00001997174236456424},
"body_angular_velocity": {"x": -0.002894917543479851, "y": -0.0017834609980581554, "z": 0.00032649917985633773},
"arm_state_rel_body": {"x": 0.5536401271820068, "y": 0.0001991107128560543, "z": 0.2607555091381073},
"arm_quaternion_rel_body": {"w": 0.9999642968177795, "x": 0.00019104218517895788, "y": 0.008427758701145649, "z": 0.008427758701145649},
"arm_orientation_rel_body": {"x": 0.0003903917486135314, "y": 0.016855526363847233, "z":0.0009807885066525242},
"arm_state_global": {"x": 1.233305266138133, "y": 0.0001991107128560543, "z": 0.2607555091381073},
"arm_quaternion_global": {"w": 0.11071797661404018, "x": -0.0083232786094425, "y": 0.0018207155823512953, "z": 0.9938152930378756},
"arm_orientation_global": {"x": 0.0017760928400286857, "y": 0.016947586302323542, "z": 2.919693676695565},
"arm_linear_velocity": {"x": -0.00015927483240388228, "y": 0.00006229256340773636, "z": -0.003934306244239418},
"arm_angular_velocity": {"x": 0.02912604479413378, "y": -0.012041083915871545, "z": 0.009199674753842119},
"arm_stowed": 1,
"gripper_open_percentage": 0.521618127822876,
"object_held": 0,
"feet_state_rel_body": [
    {"x": 0.32068437337875366, "y": 0.17303785681724548, "z": -0.5148577690124512},
    {"x": 0.32222312688827515, "y": -0.17367061972618103, "z": -0.5163648128509521},
    ...
],
"feet_state_global": [
    {"x": -0.35111223090819643, "y": -0.0985760241189894, "z": -0.5146475087953596},
    {"x": -0.27597323368156573, "y": 0.239893453842677, "z": -0.5166350285289446},
    ...
],
"all_joint_angles": {"fl.hx": 0.013755097053945065, "fl.hy": 0.7961212992668152, "": -1.5724135637283325, ...},
"all_joint_velocities": {"fl.hx": -0.007001522462815046, "fl.hy": 0.0006701984675601125, "": 0.00015050712681841105, ...}
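The real-world sample stores the body pose both as a quaternion and as roll/pitch/yaw. As an illustration of how the two relate, the sketch below recovers the heading (yaw) from the `body_quaternion` values shown above using the common ZYX Euler formula. The conversion convention used when the data was recorded is not stated in this card, so treat the formula choice as an assumption; the function name `yaw_from_quaternion` is hypothetical. Only yaw is computed here, and roll and pitch are not compared.

```python
import math


def yaw_from_quaternion(w, x, y, z):
    """Heading in radians from a unit quaternion [w, x, y, z] (ZYX Euler convention)."""
    return math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))


# Values copied from the "body_quaternion" field of the sample above.
body_quaternion = {
    "w": 0.11121513326494507,
    "x": 0.00003060940357089109,
    "y": 0.0006936040684443222,
    "z": 0.9937961119411372,
}

yaw = yaw_from_quaternion(**body_quaternion)

# Stored yaw from the "body_orientation" field of the same sample.
stored_yaw = 2.919693676695565
yaw_gap = abs(yaw - stored_yaw)
```

Under this assumption the computed heading lands within a few milliradians of the stored yaw, which suggests, but does not confirm, that the two fields describe the same body rotation.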
- Research
- Imitation Learning, Behavior Cloning, Reinforcement Learning, Machine Learning
Robotic mobile manipulation has advanced recently, but the field as a whole is still lagging behind. We believe one reason is the lack of useful and difficult benchmarks for mobile manipulation models. In particular, no existing benchmark provides data for long-horizon, room-to-room pick-and-place tasks comprising natural language, navigation, manipulation, and perception in both simulation and the real world, including a quadruped.
- Safe for research use
Suitable Use Case: Training and testing behavior cloning models.
Suitable Use Case: Learning reward functions via inverse reinforcement learning.
Suitable Use Case: Robot skill learning.
Suitable Use Case: Providing in-context examples for robot planning.
This dataset is intended to serve as a benchmark that addresses the lack of integration of natural language, navigation, manipulation, and perception for pick-and-place mobile manipulation tasks that span room-to-room and floor-to-floor in both simulated and real environments. Mobile manipulation is lagging behind overall, and we believe one reason is the lack of difficult, comprehensive benchmarks that models in development can be tested against. LaNMP is here to fill this gap.
Guidelines & Steps: As simple as referencing the BibTeX below.
Coming soon!
- External - Open Access
- Dataset Website URL:
- GitHub URL:
- Crowdsourced - Paid
- Crowdsourced - Volunteer
- Survey, forms, or polls
- Others (Keyboard teleoperated, tablet-controller teleoperated)
Collection Type
Source: Prolific.
Platform: Prolific, a crowdsourcing platform for researchers to collect data.
Is this source considered sensitive or high-risk? No
Dates of Collection: [03 2024 - 04 2024]
Primary modality of collection data:
- Text Data
Update Frequency for collected data:
- Static
Additional Notes: Used to collect the natural language commands. Crowdsourced humans explore the simulated environments and come up with commands for tasks the robot can do in those environments.
Collection Type
Source: Human teleoperation
Platform: AI2THOR simulator
Is this source considered sensitive or high-risk? No
Dates of Collection: [03 2024 - 04 2024]
Primary modality of collection data:
- Multimodal (Navigation, Manipulation, Vision)
Update Frequency for collected data:
- Static
Additional Notes: Humans teleoperate a simulated robot via keyboard to collect the robot trajectory data.
Collection Type
Source: Human speech
Platform: N/A
Is this source considered sensitive or high-risk? No
Dates of Collection: [05 2024]
Primary modality of collection data:
- Text Data
Update Frequency for collected data:
- Static
Additional Notes: Used to collect the natural language commands. Humans explore the real-world environments and come up with commands for tasks the robot can do in those environments.
Collection Type
Source: Human teleoperation
Platform: Boston Dynamics Spot
Is this source considered sensitive or high-risk? No
Dates of Collection: [05 2024]
Primary modality of collection data:
- Multimodal (Navigation, Manipulation, Vision)
Update Frequency for collected data:
- Static
Additional Notes: A human teleoperates a real quadruped robot via a tablet/joystick controller to collect the robot trajectory data.
Static: Data was collected once from single or multiple sources.
Collection Method or Source
Description: Natural language commands
Methods employed: Utilized other humans to manually correct grammatical mistakes in the given textual natural language commands. The humans deleted the commands that were not possible for the robot to execute or did not match the desired research goal.
Tools or libraries: N/A
Collection Method or Source
Description: Robot trajectories
Methods employed: Utilized other humans to manually delete incomplete collected trajectories.
Tools or libraries: N/A
- Natural language commands: The selection criteria included commands that mention a pick-and-place task where the robot picks up an object and places it somewhere else, and that have the robot go from room to room.
- Trajectories: The selection criteria included trajectories that execute the commands in the most efficient manner, ones that minimize robot lag, and ones that don't collide with objects in the environment.
- Combines natural language, navigation, manipulation, and perception robot data
- Mobile manipulation pick-and-place tasks that are room-to-room, some of them cross-floor, making them long-horizon
- Utilizing a quadruped which can handle terrains that other robots can't, such as stairs, enabling cross-floor tasks
- Diverse environments and objects
- Only pick-and-place tasks
- No ground-truth goal position of the target object
- Size
- Language
Intentionally Collected Attributes
Human attributes were labeled or collected as a part of the dataset creation process.
Field Name | Description |
nl_command | Natural language commands given by humans telling the robot what task to do in the simulator |
language_command | Natural language commands given by humans telling the robot what task to do in the real-world |
Unintentionally Collected Attributes
Human attributes were not explicitly collected as a part of the dataset creation process but can be inferred using additional methods.
We wanted to capture a natural distribution of commands that humans would give a household robot to complete long-horizon mobile manipulation tasks. Rather than automatically generating the commands using tools such as LLMs, we wanted to capture what humans really want done in households by assistant robots, so we used humans to provide the commands. Since the ultimate goal is to one day have assistive robots in the home and workplace, capturing now the commands that humans would eventually give them is crucial for research and development toward that goal.
- Human Attribute:
- Human Attribute: In-person humans
- Safe to use with other data
- Make sure the datasets are both in the same format
- Do not mix at the time step level, only at the trajectory level, e.g. Other dataset trajectory Y can come after LaNMP trajectory X, but X and Y's time steps should not be mixed
- Safe to form and/or sample
- Cluster Sampling
- Haphazard Sampling
- Multi-stage sampling
- Random Sampling
- Stratified Sampling
- Systematic Sampling
- Weighted Sampling
Do not sample at the time step level, only at the trajectory level, e.g. sample trajectories 4-15 but not the timesteps of those trajectories.
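The trajectory-level rule above can be sketched as follows. This is an illustrative example only: the in-memory representation (a list of dicts with an ordered `steps` list) and the helper name `sample_trajectories` are assumptions, not part of the released tooling. The point is that whole trajectories are drawn and each one keeps all of its time steps in order.

```python
import random


def sample_trajectories(trajectories, k, seed=0):
    """Return k whole trajectories chosen uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(trajectories, k)


# Hypothetical stand-in data: each trajectory holds its ordered time steps.
trajectories = [{"id": i, "steps": list(range(5))} for i in range(20)]

batch = sample_trajectories(trajectories, k=4)

# Every sampled trajectory still contains all of its steps, in the original order.
intact = all(traj["steps"] == list(range(5)) for traj in batch)
```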
- Training
- Testing
- Validation
- Fine Tuning
Exploration Demo: Google Colab notebook
Set | Number of data points |
Train | 446 |
Test | 78 |
Above: We don't tune hyperparameters, so we only use train and test splits (85% and 15%, respectively). This is only for the simulation data.
Additional Notes: This split was only used during the task generalization experiment. More details in the paper.
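For readers who want a comparable split, here is a minimal sketch of an 85/15 split over the 524 simulated trajectories, done at the trajectory level. The authors' actual procedure, ordering, and random seed are not given in this card; rounding the test size down is an assumption that happens to reproduce the 446/78 counts, and `split_trajectories` is a hypothetical helper.

```python
import random


def split_trajectories(trajectory_ids, test_fraction=0.15, seed=0):
    """Shuffle trajectory IDs and split them into (train, test) lists."""
    ids = list(trajectory_ids)
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_fraction)  # 524 * 0.15 = 78.6, rounded down to 78
    return ids[n_test:], ids[:n_test]


train_ids, test_ids = split_trajectories(range(524))
print(len(train_ids), len(test_ids))  # 446 78
```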
Statistic | Train | Test |
Count | 446 | 78 |
Above: We don't tune hyperparameters, so we only use train and test splits (85% and 15%, respectively). This is only for the simulation data.
- Other (Fixing natural language command grammatical mistakes)
Transformation Type
Field Name | Description |
nl_command | Natural language commands given by humans telling the robot what task to do in the simulator |
language_command | Natural language commands given by humans telling the robot what task to do in the real-world |
Additional Notes: Fixing grammatical mistakes of the commands or deleting trajectories where the commands are incomplete.
Transformation Type
Method: Manually fixing grammatically incorrect natural language commands and injecting them into their respective trajectories to replace the already saved wrong commands. Also deleting trajectories that have incomplete commands, e.g. "Pick up the blue".
Transformation Results: Trajectories with the fixed commands, and fewer trajectories overall due to the deletion of the ones that had incomplete commands.
- Human Annotations (Expert)
- Human Annotations (Non-Expert)
- Human Annotations (Employees)
- Human Annotations (Crowdsourcing)
Expert | Number |
Number of unique annotations | 50 |
Total number of annotations | 50 |
Average annotations per example | 1 |
Number of annotators | 1 |
Number of annotators per example | 1 |
Above: The real-world robot trajectory execution (teleoperation) data collection done by one of the authors.
Non-Expert | Number |
Number of unique annotations | 50 |
Total number of annotations | 50 |
Average annotations per example | 1 |
Number of annotators | 7 |
Number of annotators per example | 1 |
Above: Humans that gave natural language commands of tasks for the real-world robot to execute.
Employees | Number |
Number of unique annotations | 524 |
Total number of annotations | 524 |
Average annotations per example | 1 |
Number of annotators | 15 |
Number of annotators per example | 1 |
Above: Humans that executed the trajectories in the simulator.
Crowdsourcing | Number |
Number of unique annotations | 524 |
Total number of annotations | 524 |
Average annotations per example | 1 |
Number of annotators | 41 |
Number of annotators per example | 1 |
Above: Humans that gave natural language commands of tasks for the simulated robot to execute.
Description: The real-world robot trajectory execution (teleoperation) data collection done by one of the authors.
Link: N/A
Platforms, tools, or libraries:
- Boston Dynamics Spot
Description: Humans that gave natural language commands of tasks for the real-world robot to execute.
Link: N/A
Platforms, tools, or libraries:
- N/A
Description: Humans that executed the trajectories in the simulator.
Platforms, tools, or libraries:
Description: Humans that gave natural language commands of tasks for the simulated robot to execute.
Platforms, tools, or libraries:
- Prolific
Expert Real-Robot Trajectory Collection
Task type: The real-world robot trajectory execution (teleoperation) data collection done by one of the authors
Number of unique annotators: 1
Expertise of annotators: Expert
Description of annotators: An author.
Language distribution of annotators: English
Geographic distribution of annotators: United States
Annotation platforms: Boston Dynamics Spot
Non-Expert Real-Robot Command Collection
Task type: Humans that gave natural language commands of tasks for the real-world robot to execute
Number of unique annotators: 7
Expertise of annotators: Non-Expert
Description of annotators: Students
Language distribution of annotators: English
Geographic distribution of annotators: United States
Annotation platforms: N/A
Employed Simulator Command Collection
Task type: Humans that executed the trajectories in the simulator
Number of unique annotators: 7
Expertise of annotators: Non-Expert
Description of annotators: General adults
Language distribution of annotators: English
Geographic distribution of annotators: United States and United Kingdom
Annotation platforms:
- English [100%]
Above: All the natural language commands.
- Unsampled
Classification, Regression, Supervised Learning, Imitation Learning
Model Card: On page 21 of the paper.
Model Card: No card available. Please refer to the GitHub repo instead.
Evaluation Results
Model | SR | Length | Grasp SR | RMSE vs. GT | Weighted | CLIP EMA Score | End Goal Dist | CE Loss |
Cross-Scene | ||||||||
--- ALFRED Seq2Seq | 0.0 | 655.09 ± 450.52 | 0.0 | 3.11 ± 0.63 | 0.0026 ± 0.0035 | 0.1614 ± 0.0120 | 12.42 ± 5.44 | 286.77 ± 20.31 |
--- RT-1 | 0.0 | 205.03 ± 27.36 | 0.0 | 9.50 ± 0.27 | 1.3423 ± 0.1133 | 0.1521 ± 0.0065 | 12.56 ± 6.67 | 80.98 ± 4.68 |
Task Generalization | ||||||||
--- ALFRED Seq2Seq | 0.0 | 501.60 ± 578.62 | 0.0 | 3.01 ± 1.18 | 0.0008 ± 0.0014 | 0.1681 ± 0.0327 | 12.83 ± 11.12 | 286.66 ± 398.80 |
--- RT-1 | 0.0 | 199.56 ± 106.11 | 0.0 | 9.74 ± 1.67 | 1.3980 ± 0.5834 | 0.1488 ± 0.0243 | 12.40 ± 12.20 | 82.61 ± 1.81 |
Ground Truth | 1.0 | 171.69 ± 70.80 | 1.0 | --- | 0.5576 ± 0.1751 | 0.2067 ± 0.0311 | --- | --- |
Additional Notes: These results are from the simulation data only.
[Metrics used]:
- Task Success (GTR): a binary value measuring whether an agent achieves the goal/completes the task specified in the command.
- Distance From Goal (GTR): the spatial distance between the agent's final position after executing a learned trajectory and the designated gold goal state.
$$d = \frac{1}{2}\left(\sqrt{x_{\mathrm{gt\_body},n}^2 - x_{\mathrm{eval\_body},n}^2} + \sqrt{x_{\mathrm{gt\_ee},n}^2 - x_{\mathrm{eval\_ee},n}^2}\right)$$
- Grasp Success Rate (GTR): the efficacy of the agent's attempts to grasp objects in the scene. Specifically, the percentage of attempts that result in successful object acquisition.
- Average RMSE (GTR): the average root-mean-square error of the agent's body and end-effector coordinates between the generated trajectory and the ground truth. It reports a weighted average between body and end-effector errors normalized across the maximum length of both trajectories.
$$\mathrm{RMSE} = \sum_{i=0}^{n} \frac{1}{2}\left(\sqrt{x_{\mathrm{gt\_body},i}^2 - x_{\mathrm{eval\_body},i}^2} + \sqrt{x_{\mathrm{gt\_ee},i}^2 - x_{\mathrm{eval\_ee},i}^2}\right)$$
- Average Number of Steps (GTR): the total number of actions an agent takes. It serves to evaluate a model's ability to replicate efficient human navigation.
- Mean and Standard Deviation in State Differences (GTI): the standard deviation in positional differences between successive timesteps in a trajectory. It assesses the control smoothness exhibited by the agent to compare learned trajectories against the fluidity and naturalness of the ground-truth trajectories.
$$\Delta = \sum_{i=1}^{n} \frac{1}{2}\left(\sqrt{x_{\mathrm{eval\_body},i}^2 - x_{\mathrm{eval\_body},(i-1)}^2} + \sqrt{x_{\mathrm{eval\_ee},i}^2 - x_{\mathrm{eval\_ee},(i-1)}^2}\right)$$
- CLIP Embedding Reward (GTI): the exponential moving average of CLIP text-image correlation scores for all steps of a trajectory. Natural language task specification can be ambiguous and difficult to formulate into a structured goal condition. Inspired by previous works using CLIP for RL rewards, we propose this metric to capture complex semantic correlations between the trajectory and the task specification, that is, the understanding, reasoning, and grounding of a task in the CLIP embedding space. This provides a measure of the agent's task comprehension and execution fidelity.
$$\mathrm{EMA}_i = \alpha\,\mathrm{EMA}_{i-1} + (1-\alpha)\,r_i$$
$$r_i := \mathrm{CLIP}(\mathrm{task}, \mathrm{img}_i)$$
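The exponential moving average in the last metric can be sketched in a few lines. This example assumes the per-step CLIP scores r_i have already been computed elsewhere and are passed in as plain floats. The smoothing factor `alpha` and the initialization EMA_0 = r_0 are not specified in this card and are assumptions; `clip_ema` is a hypothetical helper name.

```python
def clip_ema(scores, alpha=0.9):
    """Apply EMA_i = alpha * EMA_{i-1} + (1 - alpha) * r_i, seeded with EMA_0 = r_0."""
    if not scores:
        raise ValueError("scores must be non-empty")
    ema = scores[0]
    for r in scores[1:]:
        ema = alpha * ema + (1.0 - alpha) * r
    return ema


# Made-up per-step scores, used only to exercise the recurrence.
step_scores = [0.20, 0.22, 0.18, 0.25]
final_score = clip_ema(step_scores, alpha=0.5)
```

A larger `alpha` weights the history more heavily, so late steps move the score less.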
Additional Notes: For robust evaluation, we consider two categories of metrics for cross-scene and task generalization experiments: "ground truth relative" (GTR) metrics that compare against trajectories in LaNMP as standards and "ground truth independent" (GTI) metrics that evaluate a trajectory (ground-truth or generated) on task understanding or smoothness.
Model Card: On page 21 of the paper.
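To make the position-based metrics above concrete, here is a minimal sketch of one GTR metric (distance from goal) and one GTI metric (state differences). It reads each square-root term in the formulas as the Euclidean distance between two positions, which is an interpretation rather than something the card states. The function names and the toy trajectories are hypothetical, and this is not the authors' evaluation code.

```python
import math


def distance_from_goal(gt_body, eval_body, gt_ee, eval_ee):
    """Average of the body and end-effector distances at the final step of each trajectory."""
    return 0.5 * (math.dist(gt_body[-1], eval_body[-1]) + math.dist(gt_ee[-1], eval_ee[-1]))


def state_differences(body, ee):
    """Sum over steps of the averaged body and end-effector displacement between successive steps."""
    return sum(
        0.5 * (math.dist(body[i], body[i - 1]) + math.dist(ee[i], ee[i - 1]))
        for i in range(1, len(body))
    )


# Toy two-step trajectories of (x, y, z) positions, for illustration only.
gt_body = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt_ee = [(0.0, 0.0, 1.0), (3.0, 4.0, 1.0)]
eval_body = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]  # evaluated body never moved
eval_ee = [(0.0, 0.0, 1.0), (3.0, 4.0, 1.0)]    # evaluated end-effector matches ground truth

goal_gap = distance_from_goal(gt_body, eval_body, gt_ee, eval_ee)
gt_motion = state_differences(gt_body, gt_ee)
```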
Model Description: Robotics Transformer 1 (RT-1) is a model designed for generalizing across large-scale, multi-task datasets with real-time inference capabilities. RT-1 leverages a Transformer architecture to process images and natural language instructions to generate discretized actions for mobile manipulation. RT-1 is trained on a diverse dataset of approximately 130K episodes across more than 700 tasks collected using 13 robots. This enables RT-1 to learn through BC from human demonstrations annotated with detailed instructions.
- Model Size: 35M (params)
Model Card: No card available. Please refer to the GitHub repo instead.
Model Description: The ALFRED paper introduces a Sequence-to-Sequence model leveraging a CNN-LSTM architecture with an attention mechanism for task execution. It encodes visual inputs via ResNet-18 and processes language through a bidirectional LSTM. A decoder leverages these multimodal inputs along with historical action data to iteratively predict subsequent actions and generate pixelwise interaction masks, enhancing precise object manipulation capabilities within the given environment.
- Model Size: 35M (params)
Expected Performance: We expected RT-1 to perform better than ALFRED Seq2Seq due to it being more recent and more advanced. We expected both models to perform poorly, especially on the Task Success metric.
Known Caveats: The model architectures had to be modified to make them work for LaNMP. RT-1 had to be pretrained by us instead of using the provided pretrained checkpoint. There were some simulator issues during real-time evaluation.