Why do we need two codebooks? So is this actually 1D+2D representation learning? #5
I am not sure if I understand it 100% correctly, but I do see that the MaskGIT VQ embedding is used.

For Tab. 3 (c), did you use the same model architecture (1 encoder + 2 decoders, i.e. one TiTok decoder + one pixel decoder)? Can we remove the pixel decoder and use the TiTok decoder to reconstruct the image directly, from TiTok space to pixel space? You mentioned the gap is because ...

BTW: I just read the code, and it seems open-muse is used, but I found it was removed from Hugging Face: https://huggingface.co/openMUSE. Where did you get the weights?
Hi, thanks for the valuable comments. To begin with, I would like to define some terms in MaskGIT that can be confusing. The MaskGIT framework contains a tokenizer (which we refer to as MaskGIT-VQ or MaskGIT-VQGAN) that shares the same architecture as the original VQGAN (referred to as Taming-VQ), except that the attention layers are removed (so technically MaskGIT-VQ does not have a stronger architecture). However, MaskGIT-VQ significantly outperforms the publicly available Taming-VQ (rFID 2.28 vs. 7.94), presumably because of a stronger training recipe, for which no details are revealed in their official code or paper. We refer to the MaskGIT generator itself simply as MaskGIT.
Thank you for this very detailed explanation! Could you elaborate more on which losses and hyperparameters you used for the single-stage training experiment?
Thanks to the author for the explanation. For simplicity, please add the image reconstruction code, i.e. image to 32 tokens and 32 tokens back to image. Also, what are the loss weights of recon_loss, quantizer_loss, commitment_loss, and codebook_loss during training?
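For reference, the four terms named above typically combine as in the standard VQ-VAE objective. The sketch below uses common default weights and generic names, not the values actually used for this model:

```python
# Generic VQ-VAE-style loss composition; the weights here are typical defaults,
# NOT the values used in this paper.
import torch
import torch.nn.functional as F

def vq_losses(x, x_rec, z_e, z_q, commit_weight=0.25, recon_weight=1.0):
    # reconstruction loss between input and decoded image
    recon_loss = F.mse_loss(x_rec, x)
    # codebook loss pulls the selected codebook entries toward the encoder outputs
    codebook_loss = F.mse_loss(z_q, z_e.detach())
    # commitment loss keeps encoder outputs close to their chosen codes
    commitment_loss = F.mse_loss(z_e, z_q.detach())
    quantizer_loss = codebook_loss + commit_weight * commitment_loss
    total = recon_weight * recon_loss + quantizer_loss
    return total, {"recon_loss": recon_loss, "codebook_loss": codebook_loss,
                   "commitment_loss": commitment_loss, "quantizer_loss": quantizer_loss}
```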
Both reconstruction (image -> 32 tokens -> image) and generation (class label -> 32 tokens -> image) already exist in the demo Jupyter notebook; we will provide a more explicit example in the README.
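A minimal sketch of that round trip, assuming a tokenizer object that exposes encode/decode methods (the demo notebook's actual function names may differ):

```python
# Hypothetical round-trip helper; `titok.encode` / `titok.decode` are assumed
# names, not necessarily the repository's actual API.
import torch

def reconstruct(titok, img: torch.Tensor) -> torch.Tensor:
    """image -> 32 discrete tokens -> image."""
    tokens = titok.encode(img)       # e.g. a (1, 32) tensor of integer token ids
    return titok.decode(tokens)      # e.g. a (1, 3, 256, 256) reconstructed image
```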
I was initially very excited about this paper. However, after reading the code, I found there are actually two codebooks and two representations: one is 1D (K=32) and the other is 2D (16x16). All the other models use one codebook and one representation, so why does this one use two codebooks and two representations? Why not just use the 1D codeword sequence for reconstruction?
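For concreteness, here is a minimal sketch of the two-codebook flow described in this question. The module names, layer types, and codebook/image sizes are toy placeholders (the real models are ViT/CNN modules with much larger codebooks); only the 32-token and 16x16 structure follows the numbers in this thread:

```python
# Hypothetical sketch of the two-codebook data flow; everything below is a
# toy stand-in, not the repository's actual modules. Straight-through gradients
# for the quantization steps are omitted for brevity.
import torch
import torch.nn as nn

B, K, D, V1, V2 = 2, 32, 8, 64, 64    # batch, 1D tokens, dim, toy codebook sizes
H = W = 64                            # toy image size (the paper uses 256x256)

titok_codebook   = nn.Embedding(V1, D)                # 1D (TiTok) codebook
maskgit_codebook = nn.Embedding(V2, D)                # 2D (MaskGIT-VQ) codebook
titok_encoder    = nn.Linear(3 * H * W, K * D)        # stand-in for the ViT encoder
titok_decoder    = nn.Linear(K * D, 16 * 16 * V2)     # stand-in for the ViT decoder
maskgit_vq_dec   = nn.Linear(16 * 16 * D, 3 * H * W)  # stand-in for the pixel decoder

img = torch.randn(B, 3, H, W)

# image -> 32 continuous 1D latent tokens
z = titok_encoder(img.flatten(1)).view(B, K, D)

# quantize against the 1D codebook (nearest neighbour)
dists  = torch.cdist(z.reshape(B * K, D), titok_codebook.weight)
ids_1d = dists.argmin(-1).view(B, K)       # (B, 32) discrete 1D token ids
z_q    = titok_codebook(ids_1d)            # (B, 32, D)

# 32 tokens -> a 16x16 grid of "proxy" codes drawn from the 2D codebook
proxy_logits = titok_decoder(z_q.flatten(1)).view(B, 16 * 16, V2)
ids_2d = proxy_logits.argmax(-1)           # (B, 256) proxy code ids

# 16x16 proxy codes -> pixels via the (frozen) MaskGIT-VQ decoder
recon = maskgit_vq_dec(maskgit_codebook(ids_2d).flatten(1)).view(B, 3, H, W)
```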