(this is a course project for CS391R 2022)
Decision Transformer reformulates offline reinforcement learning as a sequence modeling problem that can be solved effectively with large Transformer models. The work most closely related to our project is the original Decision Transformer.
TODO
Dataset download link: https://drive.google.com/drive/folders/1dHMUOSLUr6AwW3PETn1DQO9CWMklTuqy?usp=share_link
Download the files and place the datasets/ folder in this repo. These datasets are generated with robomimic using low-dimensional state representations and dense reward information.
MG (Machine-Generated): a mixture of suboptimal data from state-of-the-art RL agents.
PH + MH (Human): 500 demonstrations total, with 200 proficient-human (PH) and 300 multi-human (MH) demonstrations from teleoperators of varying proficiency.
MG + MH + PH: a more challenging combination of MG, MH, and PH data, weighted toward lower-quality MG data.
Lift: lift the cube off the table.
Can: pick up the can and place it in its proper spot.
We chose these tasks because they come with large amounts of low-quality, machine-generated data, which supports our goal of return conditioning on mixed-quality data.
We input states, actions, and returns-to-go (RTGs) into a causal Transformer to predict the desired actions. We combine each timestep's state, action, and return-to-go into a single token, which shortens the sequence length and reduces computational requirements. While the original Decision Transformer uses a deterministic policy, we train a multi-modal stochastic policy, which better models continuous action distributions.
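The single-token-per-timestep scheme can be sketched as below; the module name, dimensions, and the choice to pair each state with the previous action (so the token never contains the action being predicted) are our own illustrative assumptions rather than the project's exact implementation:

```python
import torch
import torch.nn as nn

class TimestepTokenizer(nn.Module):
    """Fuse (return-to-go, state, previous action) at each timestep into a
    single token, shortening the sequence fed to the causal Transformer
    versus the original three-tokens-per-timestep layout."""

    def __init__(self, state_dim: int, action_dim: int, d_model: int = 128):
        super().__init__()
        # One linear projection over the concatenated inputs -> one token.
        self.fuse = nn.Linear(state_dim + action_dim + 1, d_model)

    def forward(self, states, prev_actions, rtgs):
        # states: (B, T, state_dim); prev_actions: (B, T, action_dim);
        # rtgs: (B, T, 1). The previous action is paired with the current
        # state so the fused token does not leak the action to predict.
        x = torch.cat([states, prev_actions, rtgs], dim=-1)
        return self.fuse(x)  # (B, T, d_model)
```

With a context of T timesteps, this yields T tokens instead of 3T, cutting the attention cost roughly ninefold.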
During development, we found that robomimic uses sparse rewards by default: a binary success/no-success signal in the sequence data. We attempted to enable dense rewards in robomimic, but found that the dense returns were uncorrelated with dataset quality.
This debugging led us to manually alter the reward function, adding a semi-sparse success bonus that decreases with every time step. This gives the Decision Transformer a wider distribution of target RTGs than robomimic's default binary success signal. The maximum episode length is 500 steps, so a success after step 500 earns no bonus.
The function: bonus = max(500 - success_time, 0)
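The bonus above can be written as a small function; the name and the None-for-failure convention are our own assumptions:

```python
from typing import Optional

MAX_HORIZON = 500  # robomimic task horizon, in time steps

def semi_sparse_bonus(success_step: Optional[int]) -> int:
    """Success bonus that decays linearly with the time step at which the
    task succeeds. Episodes that never succeed, or that succeed after the
    500-step horizon, earn nothing."""
    if success_step is None:
        return 0
    return max(MAX_HORIZON - success_step, 0)
```

For example, succeeding at step 100 earns a bonus of 400, while succeeding at step 499 earns only 1, so faster successes map to visibly higher target RTGs.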
In future work, we hope to iterate further on this function, and possibly to alter the underlying dense reward rather than the bonus itself.
With this change, we recomputed the returns in the training data accordingly, as reflected in the new SequenceDataset.
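Recomputing the targets amounts to taking suffix sums of the modified rewards; a minimal sketch (the helper name is ours, not the SequenceDataset API):

```python
import numpy as np

def returns_to_go(rewards: np.ndarray) -> np.ndarray:
    """Undiscounted return-to-go at every timestep: the suffix sum of the
    (semi-sparse) per-step rewards. Used when rebuilding the training
    sequences after changing the reward function."""
    return np.cumsum(rewards[::-1])[::-1].copy()
```

Conditioning on these wider-spread RTGs is what lets the model distinguish high-quality trajectories from low-quality ones at training time.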
Longer sequence modeling improves action prediction and eases problems caused by multi-modal demonstrations:
[Naive BC] and [DT-1, PH Only]: Removing the low-quality data allows for expert performance, as in the original robomimic results
[DT-20]: Decision Transformer can (mostly) filter the good demonstrations from the machine-generated noise
[DT-3]: Including actions and RTGs in the input sequence makes this task significantly more difficult, but DT still far outperforms naive BC
[DT-3, DT-10, DT-20, all small]: Smaller Transformer sizes decrease performance on the Can task
[DT-3, Gaussian, Large]: Standard Gaussian policies are less capable of modeling multi-modal action distributions than our default Gaussian Mixture Model policy
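The Gaussian Mixture Model policy head referenced above can be sketched with PyTorch's built-in distributions; the class name, mode count, and log-std clamp range are our own assumptions:

```python
import torch
import torch.nn as nn
import torch.distributions as D

class GMMPolicyHead(nn.Module):
    """Map a Transformer output embedding to a mixture of diagonal
    Gaussians over continuous actions, allowing the stochastic policy to
    capture multi-modal demonstration behavior."""

    def __init__(self, d_model: int, action_dim: int, n_modes: int = 5):
        super().__init__()
        self.n_modes, self.action_dim = n_modes, action_dim
        self.logits = nn.Linear(d_model, n_modes)               # mixture weights
        self.means = nn.Linear(d_model, n_modes * action_dim)   # per-mode means
        self.log_stds = nn.Linear(d_model, n_modes * action_dim)

    def forward(self, h: torch.Tensor) -> D.MixtureSameFamily:
        batch = h.shape[:-1]
        mix = D.Categorical(logits=self.logits(h))
        means = self.means(h).reshape(*batch, self.n_modes, self.action_dim)
        stds = self.log_stds(h).reshape(*batch, self.n_modes,
                                        self.action_dim).clamp(-5, 2).exp()
        # Independent(..., 1) treats the action dimension as one event,
        # giving a per-mode diagonal Gaussian over the full action vector.
        comp = D.Independent(D.Normal(means, stds), 1)
        return D.MixtureSameFamily(mix, comp)
```

Training then maximizes the mixture log-likelihood of the demonstrated actions, and a unimodal Gaussian baseline falls out as the `n_modes=1` special case.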
Alex Chandler - alex [dot] chandler [at] utexas.edu
Jake Grigsby - grigsby [at] cs.utexas.edu
Omeed Tehrani - omeed [at] cs.utexas.edu