Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements to attribution #84

Merged
merged 5 commits into from
May 13, 2022

Conversation

whitead
Copy link
Contributor

@whitead whitead commented May 7, 2022

Following discussion from #79, #75, and #78 I've revised the attribution map to track both input and output token indices. As this made the attributions get a bit complicated, I added two dataclasses Attribution and AttributionMap to help make it clear what the meaning is of the various elements of the attribution. One (maybe) disadvantage is that it can be hard to know what the index means because it comes from the tokenizer within the library. One possible change would be to make index be strictly ascending or switch to string index.

Here are examples:

Encoder

s, am = sf.encoder('C1([O-])C=CC=C1Cl.CCO', attribute=True)
print(s)
for a in am:
    print(a)
[C][Branch1][C][O-1][C][=C][C][=C][Ring1][=Branch1][Cl].[C][C][O]
AttributionMap(index=0, token='[C]', attribution=[Attribution(index=0, token='C')])
AttributionMap(index=1, token='[O-1]', attribution=[Attribution(index=3, token='[O-]')])
AttributionMap(index=1, token='[Branch1]', attribution=[Attribution(index=3, token='[O-]')])
AttributionMap(index=2, token='[C]', attribution=[Attribution(index=3, token='[O-]')])
AttributionMap(index=4, token='[C]', attribution=[Attribution(index=5, token='C')])
AttributionMap(index=5, token='[=C]', attribution=[Attribution(index=7, token='C')])
AttributionMap(index=6, token='[C]', attribution=[Attribution(index=8, token='C')])
AttributionMap(index=7, token='[=C]', attribution=[Attribution(index=10, token='C')])
AttributionMap(index=8, token='[Ring1]', attribution=None)
AttributionMap(index=9, token='[=Branch1]', attribution=None)
AttributionMap(index=10, token='[Cl]', attribution=[Attribution(index=12, token='Cl')])
AttributionMap(index=11, token='[C]', attribution=[Attribution(index=13, token='C')])
AttributionMap(index=12, token='[C]', attribution=[Attribution(index=14, token='C')])
AttributionMap(index=13, token='[O]', attribution=[Attribution(index=15, token='O')])

Decoder

s, am = sf.decoder('[C][Branch1][C][O-1][C][=C][C][=C][Ring1][=Branch1][Cl]', attribute=True)
print(s)
for a in am:
    print(a)
C1([O-1])C=CC=C1Cl.CCO
AttributionMap(index=0, token='C', attribution=[Attribution(index=0, token='[C]')])
AttributionMap(index=5, token='[O-1]', attribution=[Attribution(index=1, token='[Branch1]'), Attribution(index=3, token='[O-1]')])
AttributionMap(index=8, token='C', attribution=[Attribution(index=4, token='[C]')])
AttributionMap(index=9, token='=', attribution=[Attribution(index=5, token='[=C]')])
AttributionMap(index=10, token='C', attribution=[Attribution(index=5, token='[=C]')])
AttributionMap(index=12, token='C', attribution=[Attribution(index=6, token='[C]')])
AttributionMap(index=13, token='=', attribution=[Attribution(index=7, token='[=C]')])
AttributionMap(index=14, token='C', attribution=[Attribution(index=7, token='[=C]')])
AttributionMap(index=18, token='Cl', attribution=[Attribution(index=10, token='[Cl]')])
AttributionMap(index=19, token='C', attribution=[Attribution(index=11, token='[C]')])
AttributionMap(index=21, token='C', attribution=[Attribution(index=12, token='[C]')])
AttributionMap(index=23, token='O', attribution=[Attribution(index=13, token='[O]')])

@whitead
Copy link
Contributor Author

whitead commented May 7, 2022

It seems the tox environment is set to test python 3.6 only, despite CI saying its testing python 3.7. This PR uses features from 3.7. Should I rewrite to target 3.6 or was this a misconfiguration in tox env?

@whitead whitead mentioned this pull request May 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants