Datasets used in the paper

Dataset	Description	Source	Percentage in Training Mixture (RT-2-PaLI-X)	Percentage in Training Mixture (RT-2-PaLM-E)
WebLI	Around 10B image-text pairs across 109 languages, filtered to the 10% of examples with the highest cross-modal similarity scores, yielding 1B training examples.	Chen et al. (2023b), Driess et al. (2023)	N/A	N/A
Episodic WebLI	Not used in co-fine-tuning RT-2-PaLI-X.	Chen et al. (2023a)	N/A	N/A
Robotics Dataset	Demonstration episodes collected with a mobile manipulation robot. Each demonstration is annotated with a natural language instruction from one of seven skills.	Brohan et al. (2022)	50%	66%
Language-Table	Used for training on several prediction tasks.	Lynch et al. (2022)	N/A	N/A
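
The WebLI row describes keeping only the top 10% of image-text pairs ranked by cross-modal similarity. The sketch below illustrates that kind of filter on toy data. It assumes each pair already carries a similarity score; the function name `keep_top_percent`, the parameter `keep_percent`, and the toy pairs are hypothetical and are not taken from the paper's actual pipeline.

```python
from typing import List, Tuple


def keep_top_percent(
    scored_pairs: List[Tuple[str, float]], keep_percent: int = 10
) -> List[Tuple[str, float]]:
    """Return the highest-scoring `keep_percent` percent of (item, score) pairs.

    Hypothetical helper: the scoring model and any tie-breaking rules used
    for WebLI are not described in the table, so none are assumed here.
    """
    if not 0 < keep_percent <= 100:
        raise ValueError("keep_percent must be in the range 1..100")
    # Integer arithmetic avoids floating-point rounding in the cutoff.
    n_keep = len(scored_pairs) * keep_percent // 100
    # Sort by score, highest first, and keep the leading slice.
    ranked = sorted(scored_pairs, key=lambda pair: pair[1], reverse=True)
    return ranked[:n_keep]


# Toy data: 20 pairs with made-up similarity scores 0.00, 0.05, ..., 0.95.
toy_pairs = [(f"pair_{i}", i / 20) for i in range(20)]
kept = keep_top_percent(toy_pairs, keep_percent=10)
```

At the scale given in the table, the same 10% cutoff applied to roughly 10B pairs leaves about 1B examples. A full sort would be impractical at that size; a streaming threshold on the score would be a more realistic design, but the selection rule is the same.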
