LaNMP is a mobile manipulation robot dataset comprising Natural Language, Navigation, Manipulation, and Perception (LaNMP) data. The dataset is collected in both simulated and real-world environments. The environments are multi-room, making the tasks long-horizon in nature. The tasks are pick-and-place tasks described to a robot by humans in natural language. The trajectories, collected from robots via human teleoperation, contain LaNMP data at every time step. There are 524 simulated and 50 real trajectories, totaling 574.
- Name, Team: Ahmed Jaafar (Owner)
Brown University, Rutgers University, University of Pennsylvania
- Academic - Tech
- Publishing POC: Ahmed Jaafar
- Affiliation: Brown University
- Contact: [email protected]
- Website: https://lanmpdataset.github.io/
- Ahmed Jaafar, Brown University
- Shreyas Sundara Raman, Brown University
- Yichen Wei, Brown University
- Sofia Juliani, Rutgers University
- Anneke Wernerfelt, University of Pennsylvania
- Ifrah Idrees, Brown University
- Jason Xinyu Liu, Brown University
- Stefanie Tellex, Associate Professor, Brown University
- Office of Naval Research (ONR)
- National Science Foundation (NSF)
- Amazon Robotics
This work is supported by ONR under grant award numbers N00014-22-1-2592 and N00014-23-1-2794, NSF under grant award number CNS-2150184, and with support from Amazon Robotics.
- Data about places and objects
- Synthetically generated data
- Data about systems or products and their behaviors
- Others (Language data provided by humans, robot movement and visual data)
Category | Data |
---|---|
Size of Dataset | 288,400 MB (~288 GB) |
Number of Instances | 574 |
Human Labels | 574 |
Capabilities | 4 |
Avg. Trajectory Length | 247 (time steps) |
Number of environments | 8 |
Number of rooms | 30 |
Number of actions | 12 |
Number of robots | 2 |
Above: The numbers combine the simulated and real datasets. "Capabilities" refers to the high-level aspects/modalities this dataset covers: Natural Language, Navigation, Manipulation, and Perception. "Human Labels" refers to the natural language commands of robot tasks provided by humans. "Number of actions" refers to the high-level discrete actions in simulation only.
Additional Notes: The robots used are mobile manipulators. The simulation robot is from ManipulaTHOR, and the real robot is a quadruped with an arm, a Boston Dynamics Spot.
Every data point in simulation (trajectory time step) contains these important aspects: natural language command, egocentric RGB-D, instance segmentations, bounding boxes, robot body pose, robot end-effector pose, and grasped object poses.
Every data point in real (trajectory time step) contains, at a high level: natural language command, egocentric RGB-D (left and right fisheye), gripper RGB-D, instance segmentations (fisheye and gripper), robot body pose, robot arm pose, feet positions, joint angles, robot body velocity, robot arm velocity, gripper open percentage, and an object-held boolean.
Statistic | Simulation Trajectories | Real Trajectories |
---|---|---|
count | 524 | 50 |
mean | 172 | 323 |
std | 71 | 187 |
min | 52 | 123 |
max | 594 | 733 |
Above: The mean, std, min, and max of the trajectories refer to their lengths in time steps.
- User Content
- Anonymous Data
- Others (Robot movement and visual data)
- No Known Risks
Actively Maintained - No new versions will be made available, but this dataset will be actively maintained, including but not limited to updates to the data.
Current Version: 1.0
Last Updated: 06/2024
Release Date: 06/2024
Ahmed Jaafar will be maintaining this dataset and resolving dataset issues brought up by the community.
- Multimodal (Natural Language, Vision, Navigation, Manipulation)
Simulation | Value | Description |
---|---|---|
Natural Language | "Go pick up the apple and put it on the couch." | The command the human tells the robot for completing a certain task |
Scene | "FloorPlan_Train8_1" | The simulation environment in AI2THOR |
Sim time | 0.19645 | The simulation time |
Wall clock time | 14:49:37 | The real-world time |
Body state | [4.0, 6.2, 7.5, 226] | The global state of the robot, [x, y, z, yaw] |
End-effector state | [2.59, 0.89, -4.17, -1.94, -1.27, 1.94] | The global state of the robot's end-effector, [x, y, z, roll, pitch, yaw] |
Hand sphere radius | 0.059 | The radius of the hand grasp field |
Held objects | [Apple] | A list of objects currently held by the robot |
Held object state | [4.4, 2.3, 5.1] | The global state of the currently held objects, [x, y, z] |
Bounding boxes | {"keys": [Apple], "values":[418, 42, 23, 321]} | The objects detected with bounding boxes and the coordinates of those boxes |
RGB | ./rgb_0.npy | The path to the RGB npy egocentric image of the time step |
Depth | ./depth_0.npy | The path to the depth npy egocentric image of the time step |
Instance segmentations | ./inst_seg_0.npy | The path to the instance segmentations npy egocentric image of the time step |
Real-world | Value | Description |
---|---|---|
Natural Language | "Go pick up the apple and put it on the couch." | The command the human tells the robot for completing a certain task |
Scene | "FloorPlan_Train8_1" | The simulation environment in AI2THOR |
Wall clock time | 14:49:37 | The real-world time |
Body state | [4.0, 6.2, 7.5] | The global Euclidean position of the robot body, [x, y, z] |
Body state quaternion | [0.04, 0, 0, 0.99] | The global quaternion state of the robot body, [w, x, y, z] |
Body orientation | [0, 0.17, 3.05] | The global rotation of the robot body, [roll, pitch, yaw] |
Body linear velocity | [0, 0.5, 0.1] | The linear velocity of the robot body, [x, y, z] |
Body angular velocity | [0, 0.5, 0.1] | The angular velocity of the robot body, [x, y, z] |
Arm state | [0.5, 0, 0.26] | The robot arm state relative to the body, [x, y, z] |
Arm quaternion state | [0.99, 0, 0.7, 0.008] | The quaternion robot arm state relative to the body, [w, x, y, z] |
Arm state global | [1.9, 0.5, 0] | The global robot arm state, [x, y, z] |
Arm quaternion state global | [0.04, 0, 0, 0.99] | The global quaternion robot arm state, [w, x, y, z] |
Arm linear velocity | [0.2, 0.04, 0] | The linear velocity of the robot arm, [x, y, z] |
Arm angular velocity | [0.1, 0.4, 0.008] | The angular velocity of the robot arm, [x, y, z] |
Arm stowed | 1 | Boolean indicating whether the arm is stowed |
Gripper open | 0.512 | How open the gripper is, as a percentage |
Object held | 1 | Boolean indicating whether an object is currently held by the gripper |
Feet state | [0.32, 0.17, 0], ... | The state of the four quadruped feet relative to the body, [x, y, z] |
Feet state global | [-0.21, 0.05, 0], ... | The global state of the four quadruped feet, [x, y, z] |
Joint angles | {fl.hx: -0.05, fl.hy: 0.79, fl.kn: -1.57, ...} | The angles of all the quadruped's joints |
Joint velocities | {fl.hx: 0.004, fl.hy: 0.01, fl.kn: 0.57, ...} | The velocities of all the quadruped's joints |
Left RGB | ./left_fisheye_image_0.npy | The path of the left eye RGB egocentric image, which captures the right side of the view |
Right RGB | ./right_fisheye_image_0.npy | The path of the right eye RGB egocentric image, which captures the left side of the view |
Left Depth | ./left_fisheye_depth_0.npy | The path of the left eye depth egocentric image, which captures the right side of the view |
Right Depth | ./right_fisheye_depth_0.npy | The path of the right eye depth egocentric image, which captures the left side of the view |
Left instance segmentations | ./left_fisheye_image_instance_seg_0.npy | The path of the left eye instance segmentations egocentric image, which captures the right side of the view |
Right instance segmentations | ./right_fisheye_image_instance_seg_0.npy | The path of the right eye instance segmentations egocentric image, which captures the left side of the view |
Gripper RGB | ./gripper_image_0.npy | The path of the gripper RGB image |
Gripper depth | ./gripper_depth_0.npy | The path of the gripper depth image |
Gripper instance segmentations | ./gripper_image_instance_seg_0.npy | The path of the gripper instance segmentations image |
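The real-world schema stores the body and arm orientations both as quaternions ([w, x, y, z]) and as Euler angles. As a minimal sketch of how the two representations relate (assuming SciPy is available; the numeric values are illustrative, not taken from the dataset):

```python
# Minimal sketch: convert a body quaternion stored in the dataset's
# [w, x, y, z] order into roll/pitch/yaw. SciPy expects scalar-last
# [x, y, z, w], so the components are reordered first.
from scipy.spatial.transform import Rotation as R

w, x, y, z = 0.111, 0.00003, 0.0007, 0.9938  # illustrative values

roll, pitch, yaw = R.from_quat([x, y, z, w]).as_euler("xyz")
print(roll, pitch, yaw)  # radians, comparable to body_orientation [r, p, y]
```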
Simulation:
```json
{
  "nl_command": "Go to the table and pick up the salt and place it in the white bin in the living room.",
  "scene": "FloorPlan_Train8_1",
  "steps": [
    {
      "sim_time": 0.1852477639913559,
      "wall-clock_time": "15:10:47.900",
      "action": "Initialize",
      "state_body": [3.0, 0.9009992480278015, -4.5, 269.9995422363281],
      "state_ee": [2.5999975204467773, 0.8979992270469666, -4.171003341674805, -1.9440563492718068e-07, -1.2731799533306385, 1.9440386333307377e-07],
      "hand_sphere_radius": 0.05999999865889549,
      "held_objs": [],
      "held_objs_state": {},
      "inst_det2D": {
        "keys": [
          "Wall_4|0.98|1.298|-2.63",
          "RemoteControl|+01.15|+00.48|-04.24",
          ...
        ],
        "values": [
          [418, 43, 1139, 220], [315, 0, 417, 113], ...
        ]
      },
      "rgb": "./rgb_0.npy",
      "depth": "./depth_0.npy",
      "inst_seg": "./inst_seg_0.npy"
    }
  ]
}
```
Real-world:
```json
{
  "language_command": "Go pick up Hershey's syrup in the room with the big window and bring it to the room with the other Spot.",
  "scene_name": "",
  "wall_clock_time": "12:50:10.923",
  "left_fisheye_rgb": "./Trajectories/trajectories/data_3/folder_0.zip/left_fisheye_image_0.npy",
  "left_fisheye_depth": "./Trajectories/trajectories/data_3/folder_0.zip/left_fisheye_depth_0.npy",
  "right_fisheye_rgb": "./Trajectories/trajectories/data_3/folder_0.zip/right_fisheye_image_0.npy",
  "right_fisheye_depth": "./Trajectories/trajectories/data_3/folder_0.zip/right_fisheye_depth_0.npy",
  "gripper_rgb": "./Trajectories/trajectories/data_3/folder_0.zip/gripper_image_0.npy",
  "gripper_depth": "./Trajectories/trajectories/data_3/folder_0.zip/gripper_depth_0.npy",
  "left_fisheye_instance_seg": "./Trajectories/trajectories/data_3/folder_0.zip/left_fisheye_image_instance_seg_0.npy",
  "right_fisheye_instance_seg": "./Trajectories/trajectories/data_3/folder_0.zip/right_fisheye_image_instance_seg_0.npy",
  "gripper_fisheye_instance_seg": "./Trajectories/trajectories/data_3/folder_0.zip/gripper_image_instance_seg_0.npy",
  "body_state": {"x": 1.7732375781707208, "y": -0.2649551302417769, "z": 0.04729541059536978},
  "body_quaternion": {"w": 0.11121513326494507, "x": 0.00003060940357089109, "y": 0.0006936040684443222, "z": 0.9937961119411372},
  "body_orientation": {"r": 0.0017760928400286857, "p": 0.016947586302323542, "y": 2.919693676695565},
  "body_linear_velocity": {"x": 0.0007985030885781894, "y": 0.0007107887103978708, "z": -0.00001997174236456424},
  "body_angular_velocity": {"x": -0.002894917543479851, "y": -0.0017834609980581554, "z": 0.00032649917985633773},
  "arm_state_rel_body": {"x": 0.5536401271820068, "y": 0.0001991107128560543, "z": 0.2607555091381073},
  "arm_quaternion_rel_body": {"w": 0.9999642968177795, "x": 0.00019104218517895788, "y": 0.008427758701145649, "z": 0.008427758701145649},
  "arm_orientation_rel_body": {"x": 0.0003903917486135314, "y": 0.016855526363847233, "z": 0.0009807885066525242},
  "arm_state_global": {"x": 1.233305266138133, "y": 0.0001991107128560543, "z": 0.2607555091381073},
  "arm_quaternion_global": {"w": 0.11071797661404018, "x": -0.0083232786094425, "y": 0.0018207155823512953, "z": 0.9938152930378756},
  "arm_orientation_global": {"x": 0.0017760928400286857, "y": 0.016947586302323542, "z": 2.919693676695565},
  "arm_linear_velocity": {"x": -0.00015927483240388228, "y": 0.00006229256340773636, "z": -0.003934306244239418},
  "arm_angular_velocity": {"x": 0.02912604479413378, "y": -0.012041083915871545, "z": 0.009199674753842119},
  "arm_stowed": 1,
  "gripper_open_percentage": 0.521618127822876,
  "object_held": 0,
  "feet_state_rel_body": [
    {"x": 0.32068437337875366, "y": 0.17303785681724548, "z": -0.5148577690124512},
    {"x": 0.32222312688827515, "y": -0.17367061972618103, "z": -0.5163648128509521},
    ...
  ],
  "feet_state_global": [
    {"x": -0.35111223090819643, "y": -0.0985760241189894, "z": -0.5146475087953596},
    {"x": -0.27597323368156573, "y": 0.239893453842677, "z": -0.5166350285289446},
    ...
  ],
  "all_joint_angles": {"fl.hx": 0.013755097053945065, "fl.hy": 0.7961212992668152, "fl.kn": -1.5724135637283325, ...},
  "all_joint_velocities": {"fl.hx": -0.007001522462815046, "fl.hy": 0.0006701984675601125, "fl.kn": 0.00015050712681841105, ...}
}
```
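As a quick sketch of how these records might be consumed (the trajectory file name below is hypothetical; see the GitHub repo and the Colab demo for the exact file layout):

```python
# Minimal sketch: load one simulation trajectory shaped like the example
# above and read its per-step arrays. "trajectory_0.json" is hypothetical.
import json
import numpy as np

with open("trajectory_0.json") as f:
    traj = json.load(f)

print(traj["nl_command"], traj["scene"])

for step in traj["steps"]:
    rgb = np.load(step["rgb"])         # egocentric RGB frame
    depth = np.load(step["depth"])     # egocentric depth map
    x, y, z, yaw = step["state_body"]  # global body pose
```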
- Research
Robotics, Imitation Learning, Behavior Cloning, Reinforcement Learning, Machine Learning
There have been recent advances in robotic mobile manipulation; however, the field as a whole still lags behind. We believe one reason is the lack of useful, difficult benchmarks for mobile manipulation models. In particular, no prior benchmark provides data for long-horizon, room-to-room pick-and-place tasks comprising natural language, navigation, manipulation, and perception in both simulation and the real world, including a quadruped.
- Safe for research use
Suitable Use Case: Training and testing behavior cloning models.
Suitable Use Case: Learning reward functions via inverse reinforcement learning.
Suitable Use Case: Robot skill learning.
Suitable Use Case: Providing in-context examples for robot planning.
This dataset is intended to serve as a benchmark addressing the gap in integrating natural language, navigation, manipulation, and perception for pick-and-place mobile manipulation tasks that span room-to-room and floor-to-floor in both simulated and real environments. Mobile manipulation lags behind overall, and we believe one reason is the lack of difficult, comprehensive benchmarks against which models in development can be tested. LaNMP fills this gap.
Guidelines & Steps: As simple as referencing the BibTeX below.
BibTeX:
Coming soon!
- External - Open Access
- Dataset Website URL: https://www.dropbox.com/scl/fo/c1q9s420pzu1285t1wcud/AGMDPvgD5R1ilUFId0i94KE?rlkey=7lwmxnjagi7k9kgimd4v7fwaq&dl=0
- GitHub URL: https://github.com/h2r/LaNPM-Dataset/
- Crowdsourced - Paid
- Crowdsourced - Volunteer
- Survey, forms, or polls
- Others (Keyboard teleoperated, tablet-controller teleoperated)
Collection Type
Source: Prolific.
Platform: Prolific, a crowdsourcing platform for researchers to collect data.
Is this source considered sensitive or high-risk? No
Dates of Collection: [03 2024 - 04 2024]
Primary modality of collection data:
- Text Data
Update Frequency for collected data:
- Static
Additional Notes: Used to collect the natural language commands. Crowdsourced humans explore the simulated environments and come up with commands for tasks the robot can do in those environments.
Collection Type
Source: Human teleoperation
Platform: AI2THOR simulator
Is this source considered sensitive or high-risk? No
Dates of Collection: [03 2024 - 04 2024]
Primary modality of collection data:
- Multimodal (Navigation, Manipulation, Vision)
Update Frequency for collected data:
- Static
Additional Notes: Humans teleoperate a simulated robot via keyboard to collect the robot trajectory data.
Collection Type
Source: Human speech
Platform: N/A
Is this source considered sensitive or high-risk? No
Dates of Collection: [05 2024]
Primary modality of collection data:
- Text Data
Update Frequency for collected data:
- Static
Additional Notes: Used to collect the natural language commands. Humans explore the real-world environments and come up with commands for tasks the robot can do in those environments.
Collection Type
Source: Human teleoperation
Platform: Boston Dynamics Spot
Is this source considered sensitive or high-risk? No
Dates of Collection: [05 2024]
Primary modality of collection data:
- Multimodal (Navigation, Manipulation, Vision)
Update Frequency for collected data:
- Static
Additional Notes: Human teleoperates a real quadruped robot via a tablet/joystick controller to collect the robot trajectory data.
Static: Data was collected once from single or multiple sources.
Collection Method or Source
Description: Natural language commands
Methods employed: Utilized other humans to manually correct grammatical mistakes in the given textual natural language commands. These humans deleted commands that were not possible for the robot to execute or did not match the desired research goal.
Tools or libraries: N/A
Collection Method or Source
Description: Robot trajectories
Methods employed: Utilized other humans to manually delete incomplete collected trajectories.
Tools or libraries: N/A
- Natural language commands: The selection criteria included commands that describe a pick-and-place task, where the robot picks up an object and places it somewhere else, and that require the robot to go from room to room.
- Trajectories: The selection criteria included trajectories that execute the commands in the most efficient manner, minimize robot lag, and do not collide with objects in the environment.
- Combines natural language, navigation, manipulation, and perception robot data
- Mobile manipulation pick-and-place tasks that are room-to-room, and some cross-floor, making them long-horizon
- Utilizes a quadruped, which can handle terrain that other robots can't, such as stairs, enabling cross-floor tasks
- Diverse environments and objects
- Only pick-and-place tasks
- No ground-truth goal position of the target object
- Size
- Language
Intentionally Collected Attributes
Human attributes were labeled or collected as a part of the dataset creation process.
Field Name | Description |
---|---|
nl_command | Natural language commands given by humans telling the robot what task to do in the simulator |
language_command | Natural language commands given by humans telling the robot what task to do in the real-world |
Unintentionally Collected Attributes
Human attributes were not explicitly collected as a part of the dataset creation process but can be inferred using additional methods.
N/A
We wanted to capture a natural distribution of the commands humans would give a household robot to complete long-horizon mobile manipulation tasks. Rather than automatically generating commands with tools such as LLMs, we used humans to provide the commands, capturing what people really want done in households by assistant robots. Since the ultimate goal is to one day have assistive robots in homes and workplaces, capturing the commands humans would eventually give them is crucial for the research and development needed to reach that goal.
- Human Attribute: Prolific.com
- Human Attribute: In-person humans
- Safe to use with other data
- Make sure the datasets are both in the same format
- Do not mix at the time step level, only at the trajectory level; e.g., another dataset's trajectory Y can come after LaNMP trajectory X, but X's and Y's time steps should not be mixed (see the sketch below)
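A minimal sketch of the trajectory-level rule above (the two lists are hypothetical placeholders standing in for datasets already converted to a shared format):

```python
# Safe: concatenate whole trajectories from the two datasets.
lanmp_trajs = [{"nl_command": "...", "steps": []}]   # placeholder
other_trajs = [{"nl_command": "...", "steps": []}]   # placeholder

combined = lanmp_trajs + other_trajs  # trajectory-level mixing only

# Unsafe (do NOT do this): interleaving time steps from two different
# trajectories produces physically inconsistent sequences.
# bad_steps = lanmp_trajs[0]["steps"] + other_trajs[0]["steps"]
```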
- Safe to form and/or sample
- Cluster Sampling
- Haphazard Sampling
- Multi-stage sampling
- Random Sampling
- Stratified Sampling
- Systematic Sampling
- Weighted Sampling
Do not sample at the time step level, only at the trajectory level; e.g., sample trajectories 4-15 but not the time steps of those trajectories (see the sketch below).
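A minimal sketch of sampling at the trajectory level (simple random sampling shown; the trajectory list is a hypothetical placeholder):

```python
# Minimal sketch: sample whole trajectories, never individual time steps.
import random

trajectories = [{"nl_command": f"task {i}", "steps": []} for i in range(574)]

random.seed(0)  # illustrative seed
sampled = random.sample(trajectories, k=50)  # 50 whole trajectories
```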
- Training
- Testing
- Validation
- Fine Tuning
Exploration Demo: Google Colab notebook
Set | Number of data points |
---|---|
Train | 446 |
Test | 78 |
Above: We do not tune hyperparameters, so we use only train and test splits (85% and 15%, respectively). This applies to the simulation data only.
Additional Notes: This split was only used during the task generalization experiment. More details in the paper.
Statistic | Train | Test |
---|---|---|
Count | 446 | 78 |
Above: We do not tune hyperparameters, so we use only train and test splits (85% and 15%, respectively). This applies to the simulation data only (see the sketch below).
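A minimal sketch reproducing the 446/78 trajectory-level split above (the seed and shuffle are illustrative; the authors' actual assignment may differ):

```python
# Minimal sketch: an 85/15 train/test split over the 524 simulation
# trajectories, at the trajectory level only.
import random

sim_trajs = list(range(524))  # stand-ins for loaded trajectories
random.seed(0)                # illustrative seed
random.shuffle(sim_trajs)

train, test = sim_trajs[:446], sim_trajs[446:]
print(len(train), len(test))  # 446 78
```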
- Other (Fixing grammatical mistakes in natural language commands)
Transformation Type
Field Name | Description |
---|---|
nl_command | Natural language commands given by humans telling the robot what task to do in the simulator |
language_command | Natural language commands given by humans telling the robot what task to do in the real-world |
Additional Notes: Fixing grammatical mistakes in the commands, or deleting trajectories whose commands are incomplete.
Transformation Type
Method: Manually fixing grammatically incorrect natural language commands and injecting the corrected versions into their respective trajectories to replace the saved incorrect commands. Also deleting trajectories that have incomplete commands, e.g., "Pick up the blue".
Transformation Results: Trajectories with the fixed commands, and fewer trajectories overall due to the deletion of those with incomplete commands.
- Human Annotations (Expert)
- Human Annotations (Non-Expert)
- Human Annotations (Employees)
- Human Annotations (Crowdsourcing)
Expert | Number |
---|---|
Number of unique annotations | 50 |
Total number of annotations | 50 |
Average annotations per example | 1 |
Number of annotators | 1 |
Number of annotators per example | 1 |
Above: The real-world robot trajectory execution (teleoperation) data collection done by one of the authors.
Non-Expert | Number |
---|---|
Number of unique annotations | 50 |
Total number of annotations | 50 |
Average annotations per example | 1 |
Number of annotators | 7 |
Number of annotators per example | 1 |
Above: Humans that gave natural language commands of tasks for the real-world robot to execute.
Employees | Number |
---|---|
Number of unique annotations | 524 |
Total number of annotations | 524 |
Average annotations per example | 1 |
Number of annotators | 15 |
Number of annotators per example | 1 |
Above: Humans that executed the trajectories in the simulator.
Crowdsourcing | Number |
---|---|
Number of unique annotations | 524 |
Total number of annotations | 524 |
Average annotations per example | 1 |
Number of annotators | 41 |
Number of annotators per example | 1 |
Above: Humans that gave natural language commands of tasks for the simulated robot to execute.
Expert
Description: The real-world robot trajectory execution (teleoperation) data collection done by one of the authors.
Link: N/A
Platforms, tools, or libraries:
- Boston Dynamics Spot
Non-Expert
Description: Humans that gave natural language commands of tasks for the real-world robot to execute.
Link: N/A
Platforms, tools, or libraries:
- N/A
Employees
Description: Humans that executed the trajectories in the simulator.
Link: https://ai2thor.allenai.org/
Platforms, tools, or libraries:
- AI2THOR
Crowdsourcing
Description: Humans that gave natural language commands of tasks for the simulated robot to execute.
Link: https://www.prolific.com/
Platforms, tools, or libraries:
- Prolific
Expert Real-Robot Trajectory Collection
Task type: The real-world robot trajectory execution (teleoperation) data collection done by one of the authors
Number of unique annotators: 1
Expertise of annotators: Expert
Description of annotators: An author.
Language distribution of annotators: English
Geographic distribution of annotators: United States
Annotation platforms: Boston Dynamics Spot
Non-Expert Real-Robot Command Collection
Task type: Humans that gave natural language commands of tasks for the real-world robot to execute
Number of unique annotators: 7
Expertise of annotators: Non-Expert
Description of annotators: Students
Language distribution of annotators: English
Geographic distribution of annotators: United States
Annotation platforms: N/A
Employed Simulator Command Collection
Task type: Humans that executed the trajectories in the simulator
Number of unique annotators: 7
Expertise of annotators: Non-Expert
Description of annotators: General adults
Language distribution of annotators: English
Geographic distribution of annotators: United States and United Kingdom
Annotation platforms: Prolific.com
- English [100%]
Above: All the natural language commands.
- Unsampled
Classification, Regression, Supervised Learning, Imitation Learning
RT-1
Model Card: On page 21 of the paper.
ALFRED Seq2Seq
Model Card: No card available. Please refer to the GitHub repo instead.
Evaluation Results
Model | SR | Length | Grasp SR | RMSE vs. GT | Weighted Δ | CLIP EMA Score | End Goal Dist | CE Loss |
---|---|---|---|---|---|---|---|---|
Cross-Scene | ||||||||
--- ALFRED Seq2Seq | 0.0 | 655.09 ± 450.52 | 0.0 | 3.11 ± 0.63 | 0.0026 ± 0.0035 | 0.1614 ± 0.0120 | 12.42 ± 5.44 | 286.77 ± 20.31 |
--- RT-1 | 0.0 | 205.03 ± 27.36 | 0.0 | 9.50 ± 0.27 | 1.3423 ± 0.1133 | 0.1521 ± 0.0065 | 12.56 ± 6.67 | 80.98 ± 4.68 |
Task Generalization | ||||||||
--- ALFRED Seq2Seq | 0.0 | 501.60 ± 578.62 | 0.0 | 3.01 ± 1.18 | 0.0008 ± 0.0014 | 0.1681 ± 0.0327 | 12.83 ± 11.12 | 286.66 ± 398.80 |
--- RT-1 | 0.0 | 199.56 ± 106.11 | 0.0 | 9.74 ± 1.67 | 1.3980 ± 0.5834 | 0.1488 ± 0.0243 | 12.40 ± 12.20 | 82.61 ± 1.81 |
Ground Truth | 1.0 | 171.69 ± 70.80 | 1.0 | --- | 0.5576 ± 0.1751 | 0.2067 ± 0.0311 | --- | --- |
Additional Notes: These results are from the simulation data only.
Metrics used:
- Task Success (GTR): a binary value measuring whether an agent achieves the goal/completes the task specified in the command.
- Distance From Goal (GTR): the spatial distance between the agent's final position after executing a learned trajectory and the designated gold goal state.
$d = \frac{1}{2}\left(\sqrt{x_{gt\_body,n}^2 - x_{eval\_body,n}^2} + \sqrt{x_{gt\_ee,n}^2 - x_{eval\_ee,n}^2}\right)$
- Grasp Success Rate (GTR): the efficacy of the agent's attempts to grasp objects in the scene. Specifically, the percentage of attempts that result in successful object acquisition.
- Average RMSE (GTR): the average root-mean-square error of the agent's body and end-effector coordinates between the generated trajectory and the ground truth. It reports a weighted average between body and end-effector errors normalized across the maximum length of both trajectories.
$\mathrm{RMSE} = \sum_{i=0}^{n} \frac{1}{2}\left(\sqrt{x_{gt\_body,i}^2 - x_{eval\_body,i}^2} + \sqrt{x_{gt\_ee,i}^2 - x_{eval\_ee,i}^2}\right)$
- Average Number of Steps (GTR): the total number of actions an agent takes. It serves to evaluate a model's ability to replicate efficient human navigation.
- Mean and Standard Deviation in State Differences (GTI): the standard deviation in positional differences between successive timesteps in a trajectory. It assesses the control smoothness exhibited by the agent to compare learned trajectories against the fluidity and naturalness of the ground-truth trajectories.
$\Delta = \sum_{i=1}^{n} \frac{1}{2}\left(\sqrt{x_{eval\_body,i}^2 - x_{eval\_body,i-1}^2} + \sqrt{x_{eval\_ee,i}^2 - x_{eval\_ee,i-1}^2}\right)$
- CLIP Embedding Reward (GTI): the exponential moving average (EMA) of CLIP text-image correlation scores over all steps of a trajectory. Natural language task specification can be ambiguous and difficult to formulate into a structured goal condition. Inspired by previous works using CLIP for RL rewards, we propose this metric to capture complex semantic correlations between the trajectory and the task specification, that is, understanding, reasoning about, and grounding the task in the CLIP embedding space. This provides a measure of the agent's task comprehension and execution fidelity.
where $\mathrm{EMA}_i = \alpha\,\mathrm{EMA}_{i-1} + (1 - \alpha)\,r_i$ and $r_i := \mathrm{CLIP}(\mathrm{task}, \mathrm{img}_i)$.
Additional Notes: For robust evaluation, we consider two categories of metrics for the cross-scene and task generalization experiments: "ground truth relative" (GTR) metrics, which compare against trajectories in LaNMP as standards, and "ground truth independent" (GTI) metrics, which evaluate a trajectory (ground-truth or generated) on task understanding or smoothness.
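As an illustration of the CLIP EMA metric defined above (a sketch only; the per-step scores and the smoothing factor alpha are hypothetical, and a real CLIP text-image call would produce each r_i):

```python
# Minimal sketch of the CLIP Embedding Reward:
# EMA_i = alpha * EMA_{i-1} + (1 - alpha) * r_i, with r_i = CLIP(task, img_i).
def clip_ema(scores, alpha=0.9):
    """Return the final EMA over per-step CLIP text-image scores."""
    ema = scores[0]
    for r in scores[1:]:
        ema = alpha * ema + (1 - alpha) * r
    return ema

print(clip_ema([0.18, 0.20, 0.22, 0.25]))  # hypothetical per-step scores
```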
RT-1
Model Card: On page 21 of the paper.
Model Description: Robotics Transformer 1 (RT-1) is a model designed for generalizing across large-scale, multi-task datasets with real-time inference capabilities. RT-1 leverages a Transformer architecture to process images and natural language instructions to generate discretized actions for mobile manipulation. RT-1 is trained on a diverse dataset of approximately 130K episodes across more than 700 tasks collected using 13 robots. This enables RT-1 to learn through behavior cloning (BC) from human demonstrations annotated with detailed instructions.
- Model Size: 35M (params)
ALFRED Seq2Seq
Model Card: No card available. Please refer to the GitHub repo instead.
Model Description: The ALFRED paper introduces a sequence-to-sequence model leveraging a CNN-LSTM architecture with an attention mechanism for task execution. It encodes visual inputs via ResNet-18 and processes language through a bidirectional LSTM. A decoder leverages these multimodal inputs along with historical action data to iteratively predict subsequent actions and generate pixelwise interaction masks, enabling precise object manipulation within the given environment (see the sketch below).
- Model Size: 35M (params)
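To make the architecture description concrete, the following is a minimal sketch (not the authors' implementation; the attention mechanism and interaction-mask head are omitted, and every size is an illustrative assumption) of a CNN-LSTM seq2seq policy of this kind:

```python
# Minimal sketch of a CNN-LSTM seq2seq policy in the spirit of the ALFRED
# baseline: ResNet-18 visual features, a biLSTM language encoder, and an
# LSTM action decoder predicting one discrete action per time step.
import torch
import torch.nn as nn
import torchvision.models as models

class Seq2SeqPolicy(nn.Module):
    def __init__(self, vocab_size=1000, num_actions=12, hidden=256):
        super().__init__()
        resnet = models.resnet18(weights=None)
        self.vision = nn.Sequential(*list(resnet.children())[:-1])  # -> (B, 512, 1, 1)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lang = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True)
        self.decoder = nn.LSTMCell(512 + 2 * hidden, hidden)
        self.head = nn.Linear(hidden, num_actions)
        self.hidden = hidden

    def forward(self, frames, tokens):
        # frames: (T, 3, H, W) egocentric images; tokens: (1, L) command ids
        _, (h, _) = self.lang(self.embed(tokens))
        lang_feat = torch.cat([h[0], h[1]], dim=-1)  # (1, 2*hidden)
        hx = torch.zeros(1, self.hidden)
        cx = torch.zeros(1, self.hidden)
        logits = []
        for t in range(frames.shape[0]):
            v = self.vision(frames[t : t + 1]).flatten(1)  # (1, 512)
            hx, cx = self.decoder(torch.cat([v, lang_feat], dim=-1), (hx, cx))
            logits.append(self.head(hx))
        return torch.stack(logits)  # (T, 1, num_actions) per-step action logits

# Example: 5 frames at 224x224 and a 12-token command (all values illustrative).
policy = Seq2SeqPolicy()
out = policy(torch.randn(5, 3, 224, 224), torch.randint(0, 1000, (1, 12)))
print(out.shape)  # torch.Size([5, 1, 12])
```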
Expected Performance: We expected RT-1 to perform better than ALFRED Seq2Seq due to it being more recent and more advanced. We expected both models to perform poorly, especially on the Task Success metric.
Known Caveats: The model architectures had to be modified to make them work for LaNMP. RT-1 had to be pretrained by us instead of using the provided pretrained checkpoint. There were some simulator issues during real-time evaluation.