
Commit

[CLEANUP]
Kye committed Dec 23, 2023
1 parent b25b62e commit 884b478
Showing 3 changed files with 32 additions and 48 deletions.
71 changes: 26 additions & 45 deletions README.md
@@ -23,7 +23,7 @@
---


Robotic Transformer 2 (RT-2) leverages both web and robotics data to generate actionable instructions for robotic control.
This is my implementation of the model behind RT-2. RT-2 uses PaLM-E as its backbone, with a vision encoder and a language backbone: images are embedded and concatenated into the same space as the language embeddings, so a single model attends over both. The architecture is straightforward to build, but it can lack a deep understanding of both the unified multimodal representation and the individual modality representations.
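
To make that concatenation step concrete, here is a minimal sketch. The module names, the 768-wide features, and the 196-patch vision output are illustrative assumptions, not the actual RT2 internals.

```python
import torch
from torch import nn

vocab_size, dim = 20000, 512

token_embed = nn.Embedding(vocab_size, dim)  # language token embeddings
img_proj = nn.Linear(768, dim)               # projects vision features into the shared space

tokens = torch.randint(0, vocab_size, (1, 1024))  # caption token ids
img_feats = torch.randn(1, 196, 768)              # hypothetical ViT patch features

lang_emb = token_embed(tokens)  # (1, 1024, 512)
img_emb = img_proj(img_feats)   # (1, 196, 512)

# Concatenate along the sequence axis so one decoder attends over both modalities
fused = torch.cat([img_emb, lang_emb], dim=1)  # (1, 1220, 512)
print(fused.shape)
```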

[CLICK HERE FOR THE PAPER](https://robotics-transformer2.github.io/assets/rt2.pdf)

@@ -35,13 +35,6 @@

RT-2 can be easily installed using pip:
```bash
pip install rt2
```

Additionally, you can manually install the dependencies:

```bash
pip install -r requirements.txt
```

# Usage


@@ -53,26 +46,21 @@

First, you need to initialize the `RT2` class. You can do this by providing the

```python

import torch
from rt2.model import RT2

# img: (batch_size, 3, 256, 256)
# caption: (batch_size, 1024)
img = torch.randn(1, 3, 256, 256)
caption = torch.randint(0, 20000, (1, 1024))

# model: RT2
model = RT2()

# Run the model on the image and caption
output = model(img, caption)
print(output.shape)  # (1, 1024, 20000)
```
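
The output is a logits tensor: one score for each of the 20,000 vocabulary entries at each of the 1,024 sequence positions. As a hedged sketch (greedy argmax decoding is an assumption here, not a documented API of this repo), discrete token ids could be recovered like so:

```python
# output: (1, 1024, 20000) logits from the forward pass above
token_ids = output.argmax(dim=-1)  # (1, 1024) most-likely token id per position
print(token_ids.shape)
```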

@@ -101,24 +89,19 @@

RT-2 is fine-tuned using both web and robotics data. The resultant model interpr
## Datasets

Datasets used in the paper:

| Dataset | Description | Source | Percentage in Training Mixture (RT-2-PaLI-X) | Percentage in Training Mixture (RT-2-PaLM-E) |
|---------|-------------|--------|----------------------------------------------|----------------------------------------------|
| WebLI | Around 10B image-text pairs across 109 languages, filtered to the top 10% of examples by cross-modal similarity score, yielding 1B training examples. | Chen et al. (2023b), Driess et al. (2023) | N/A | N/A |
| Episodic WebLI | Not used in co-fine-tuning RT-2-PaLI-X. | Chen et al. (2023a) | N/A | N/A |
| Robotics Dataset | Demonstration episodes collected with a mobile manipulation robot. Each demonstration is annotated with a natural language instruction from one of seven skills. | Brohan et al. (2022) | 50% | 66% |
| Language-Table | Used for training on several prediction tasks. | Lynch et al. (2022) | N/A | N/A |
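
As an illustration of the mixture percentages in the table above, here is a small sketch of how a co-fine-tuning batch source might be sampled. The 50/50 split matches the RT-2-PaLI-X row, and the dataset names and sampler are placeholders, not the paper's actual pipeline.

```python
import random

# Mixture weights from the table above (RT-2-PaLI-X): 50% robotics, 50% web data
mixture = {"robotics": 0.50, "web": 0.50}

def sample_source(rng: random.Random) -> str:
    """Pick which dataset the next training example is drawn from."""
    r = rng.random()
    cumulative = 0.0
    for name, weight in mixture.items():
        cumulative += weight
        if r < cumulative:
            return name
    return name  # guard against floating-point rounding

rng = random.Random(0)
counts = {"robotics": 0, "web": 0}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly 5000 of each
```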

# Appreciation

* Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski,
* Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu,
* Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog,
* Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang,
* Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch,
* Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi,
* Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong,
* Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu,
* and Brianna Zitkovich

for writing this amazing paper and advancing Humanity

* LucidRains for providing the base repositories for [PaLM](https://github.com/lucidrains/PaLM-rlhf-pytorch) and [RT-1](https://github.com/lucidrains/robotic-transformer-pytorch)

* And you, yes, you, the Human looking at this right now: I appreciate you and love you.

## Commercial Use Cases

@@ -128,25 +111,18 @@

The unique capabilities of RT-2 open up numerous commercial applications:
- **Healthcare**: In robotic surgeries or patient care, RT-2 can assist in understanding and performing tasks based on both visual and verbal instructions.
- **Smart Homes**: Integration of RT-2 in smart home systems can lead to improved automation, understanding homeowner instructions in a much more nuanced manner.

## Examples and Documentation

Detailed examples and comprehensive documentation for using RT-2 can be found in the [examples](https://github.com/kyegomez/RT-2/tree/master/examples) directory and the [documentation](https://github.com/kyegomez/RT-2/tree/master/docs) directory, respectively.

## Contributing

Contributions to RT-2 are always welcome! Feel free to open an issue or pull request on the GitHub repository.

## Contact

For any queries or issues, kindly open a GitHub issue or get in touch with [kyegomez](https://github.com/kyegomez).

## Citation

```bibtex
@inproceedings{brohan2023rt2,
  title={RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control},
  author={Anthony Brohan and Noah Brown and Justice Carbajal and Yevgen Chebotar and Xi Chen and Krzysztof Choromanski
and Tianli Ding and Danny Driess and Avinava Dubey and Chelsea Finn and Pete Florence and Chuyuan Fu
and Montse Gonzalez Arenas and Keerthana Gopalakrishnan and Kehang Han and Karol Hausman and Alexander Herzog
and Jasmine Hsu and Brian Ichter and Alex Irpan and Nikhil Joshi and Ryan Julian and Dmitry Kalashnikov and Yuheng Kuang
and Isabel Leal and Lisa Lee and Tsang-Wei Edward Lee and Sergey Levine and Yao Lu and Henryk Michalewski and Igor Mordatch
and Karl Pertsch and Kanishka Rao and Krista Reymann and Michael Ryoo and Grecia Salazar and Pannag Sanketi
and Pierre Sermanet and Jaspiar Singh and Anikait Singh and Radu Soricut and Huong Tran and Vincent Vanhoucke and Quan Vuong
and Ayzaan Wahid and Stefan Welker and Paul Wohlhart and Jialin Wu and Fei Xia and Ted Xiao and Peng Xu and Sichun Xu and Tianhe Yu
and Brianna Zitkovich},
  year={2023}
}
```


## License

RT-2 is provided under the MIT License. See the LICENSE file for details.
7 changes: 5 additions & 2 deletions example.py
@@ -1,11 +1,14 @@
import torch
from rt2.model import RT2

# img: (batch_size, 3, 256, 256)
# caption: (batch_size, 1024)
img = torch.randn(1, 3, 256, 256)
caption = torch.randint(0, 20000, (1, 1024))

# model: RT2
model = RT2()

# Run the model on the image and caption
output = model(img, caption)
print(output.shape)  # (1, 1024, 20000)
2 changes: 1 addition & 1 deletion rt2/model.py
@@ -1,6 +1,6 @@
import torch
from torch import nn
from zeta.structs import (
AutoregressiveWrapper,
Decoder,
Encoder,
