Model is adding an amino acid to the original sequence #92

Open
tony-res opened this issue Feb 26, 2024 · 5 comments

@tony-res

I'm using a nanobody PDB as the input to ProteinMPNN. I simply changed examples/submit_example_1.sh to point to the directory containing the attached PDB file.

When I run ProteinMPNN, the FASTA file has an additional proline inserted into the sequence. Is there something different about my PDB file that may be causing this behavior?

The PDB file has this sequence:

NANOBODY_TESTING.H
EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYY
ADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS

ProteinMPNN gives this for the FASTA:

NANOBODY_TESTING, score=1.6920, global_score=1.6920, fixed_chains=[], designed_chains=['H'], model_name=v_48_020, git_hash=8907e6671bfbfc92303b5f79c4b5e6ce47cdef57, seed=37
EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYY
ADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDTPAPNDYWGQGTLVTVSS

                                                       ^

Note the highlighted proline is not in the original PDB file.

I'm probably doing something wrong, but I'm having trouble seeing it. Any help would be greatly appreciated.

Thanks!
-Tony

NANOBODY_TESTING.txt

@tony-res
Author

I've tracked it down a little more. I think it is coming from the script that creates the parsed_pdbs.jsonl file because that file looks like this:

{"seq_chain_H": "EVQLVESGP-GLVQPGKSLRLSCVASGFTF----SGYGMHWVRQAPGKGLEWIALIIYD--ESNKYYADSVK-GRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDTPAPNDYWGQGTLVTVSS", "coords_chain_H": {"N_chain_H": [[-11.394, -4.934, -17.026], [-9.415, -5.505, -14.037], [-7.515, -2.819, -12.586], [-4.768, -1.263, -10.961], [-2.865, 1.551, -10.077], [-0.044, 2.882, -8.846], [1.493, 5.892, -7.505], [3.436, 8.866, -6.619], [6.281, 8.208, -5.929], [NaN, NaN, NaN], [9.799, 8.533, -4.849], [13.014, 10.127, -3.929], [13.389, 10.802, -0.512], [14.366, 12.408, 2.32], [15.559, 11.094, 5.561], [13.32, 11.104, 8.369], [11.537, 12.932, 7.067], [8.281, 13.081, 5.352], [6.011, 11.214, 3.226], [3.476, 10.527, 0.716], [1.955, 8.134, -0.727], [-0.102, 7.201, -3.541], [-2.64, 5.229, -4.93], [-4.499, 4.565, -7.639], [-7.528, 2.983, -8.404], [-9.327, 0.896, -10.749], [-11.897, -0.72, -12.862], [-13.244, -2.535, -11.189], [-15.546, -2.74, -8.358], [-14.286, -0.709, -6.165], [NaN, NaN, NaN], [NaN, NaN, NaN], [NaN, NaN, NaN], [NaN, NaN, NaN], 

Note that the P is already there, so the bug seems to be in the parse_multiple_chains.py script.
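A quick way to pin down where the parsed sequence departs from the original is to strip the gap characters and compare position by position. The following is a minimal sketch, not part of ProteinMPNN: the helper names `strip_gaps` and `mismatches` are hypothetical, and the two fragments are short stretches copied from the sequences quoted above.

```python
def strip_gaps(seq: str, gap: str = "-") -> str:
    """Remove gap characters such as those in the seq_chain_H field."""
    return seq.replace(gap, "")


def mismatches(expected: str, observed: str):
    """Return (index, expected_char, observed_char) for each differing position.

    zip() stops at the shorter string, so compare lengths separately.
    """
    return [(i, a, b) for i, (a, b) in enumerate(zip(expected, observed)) if a != b]


# Fragment of the sequence in the PDB file, and the same stretch from the parsed output.
expected_fragment = "KFYDPTAPNDYW"
parsed_fragment = strip_gaps("KFYDTPAPNDYW")

same_length = len(expected_fragment) == len(parsed_fragment)
diff = mismatches(expected_fragment, parsed_fragment)
print(same_length, diff)
```

For this fragment the two strings have the same length and differ only at two adjacent positions, which narrows the search to the residues around that point in the PDB file.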

@tony-res
Author

I printed out an intermediate variable seq in parse_multiple_chains.py. It gives me this:
[image: printed value of seq]
Note that at position 111, there is a dictionary with two values rather than one. That's where the bug is coming from. I'll see if I can find a patch and submit it.

@tony-res
Author

[image]
It's this. The PDB file is IMGT numbered, so some residue numbers carry an insertion letter. The parsing code assumes the residue number is just an integer, so positions 112A and 112 get lumped together.
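In the PDB format the residue number and the insertion code live in separate fixed columns (resSeq in columns 23-26, iCode in column 27), so the two can be kept apart as a pair. The sketch below illustrates that idea only; it is not the patch from this issue. The helpers `atom_line` and `residue_key` are hypothetical, and the three residues are made up for the demo rather than read from the attached file.

```python
def atom_line(serial, atom, res_name, chain, res_seq, icode=""):
    """Build only columns 1-27 of an ATOM record, enough for this demo."""
    return f"ATOM  {serial:>5} {atom:<4} {res_name:>3} {chain}{res_seq:>4}{icode:<1}"


def residue_key(line):
    """Return (resSeq, iCode) from PDB columns 23-26 and 27."""
    return int(line[22:26]), line[26:27].strip()


# Illustrative records: an insertion-coded residue listed before its plain number.
lines = [
    atom_line(1, " CA", "ASP", "H", 111),
    atom_line(2, " CA", "PRO", "H", 112, "A"),
    atom_line(3, " CA", "THR", "H", 112),
]

# A dict keyed by (number, insertion code) keeps 112A and 112 distinct,
# and plain dicts preserve the order in which residues appear in the file.
residues = {}
for line in lines:
    residues.setdefault(residue_key(line), line[17:20])

ordered = list(residues.items())
print(ordered)
```

Keying on the pair and keeping file order avoids having to decide how insertion letters should sort, which matters for numbering schemes where a lettered position does not simply follow its plain number.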

@tony-res
Author

            if resn[-1].isalpha(): 
                print(resn)
                resa,resn = resn[-1],int(resn[:-1])-1
                print(resn)
            else: 
                resa,resn = "",int(resn)-1

Note that if the residue number ends in an alphabetic character (e.g. "112A"), the code strips the character and subtracts 1 from the integer. So "112A" is stored at position 111, the same position that plain "112" maps to.
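The collision can be reproduced in isolation. This is a minimal sketch under stated assumptions: `split_resn` mirrors the branch quoted above (without the debug prints), the residue letters are illustrative rather than read from the attached file, and the final join assumes the insertion codes stored at each position are later iterated in sorted order.

```python
def split_resn(resn):
    """Mirror the quoted branch: return (insertion_code, zero_based_index)."""
    resn = resn.strip()
    if resn[-1].isalpha():
        return resn[-1], int(resn[:-1]) - 1
    return "", int(resn) - 1


# Illustrative residues in file order: 112A appears before 112.
records = [("111", "D"), ("112A", "P"), ("112", "T")]

seq = {}
for label, aa in records:
    resa, resn = split_resn(label)
    seq.setdefault(resn, {})[resa] = aa

# Positions holding more than one residue.
collisions = {idx: codes for idx, codes in seq.items() if len(codes) > 1}

file_order = "".join(aa for _, aa in records)
# Assumption: entries at one position are emitted in sorted key order ("" before "A").
sorted_order = "".join(seq[i][k] for i in sorted(seq) for k in sorted(seq[i]))
print(collisions, file_order, sorted_order)
```

Under that assumption, position 111 holds both residues and the rebuilt string no longer follows the order in the file, which is the kind of discrepancy reported at the top of this issue.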

@tony-res
Author

I created a patch and opened a PR.
