Adding POS tagging while building pattern for Spaczzruler #24

Closed
Ibrokhimsadikov opened this issue Nov 9, 2020 · 13 comments

@Ibrokhimsadikov
Ibrokhimsadikov commented Nov 9, 2020

Hello, I am really liking spaczz for fuzzy matching entity patterns.

Quick question: is there a way to add constraints such as POS tags as well? For example, I want to extract only noun phrases of AS, but the fuzzy match is also getting me 'as' from "as above function".
Below, 'i' is each string from the list of vocab to match:
{'label': "ECHO", 'pattern': [{'TEXT': i, 'POS': 'NOUN'}], 'type': 'fuzzy'}

@gandersen101
Owner

Hi @Ibrokhimsadikov, thanks for the kind words. I have gotten behind on spaczz maintenance and improvements lately but am hoping to get back on track in the near future.

I believe implementing some form of POS constraints should be doable, but I'm going to have to think about how I actually want to go about it.

I will keep you updated here as that progresses.

@gandersen101 gandersen101 added the enhancement New feature or request label Nov 10, 2020
@gandersen101 gandersen101 self-assigned this Dec 2, 2020
@gandersen101
Owner

gandersen101 commented Dec 15, 2020

Hi @Ibrokhimsadikov, sorry there has not been much visible development on this issue yet. However, I did want to update you on where I am at with thinking/working through this.

The ideal way to add this feature would be to add fuzzy matching support directly to spaCy's matcher. However, because much of the matcher is written in Cython, that is beyond my current coding capabilities.

Accordingly, my original thought was to write a Python implementation very similar to spaCy's matcher. However, this quickly proved to be a massive undertaking that was mostly redundant.

Therefore, I think I am going to attempt this by writing an abstraction that translates these "fuzzy" patterns into spaCy matcher compatible patterns. It would find the fuzzy matches and then rewrite the patterns with the verbatim text found. For example:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The manager gave me acess to the database so now I can acces it.")
pattern = [{"TEXT": {"FUZZY": "access"}, "POS": "NOUN"}]
# AbstractedMatcher.add(pattern)
# Under the hood, this would find fuzzy matches of "access" in the text and then use those
# to rewrite the pattern into patterns that are compatible with spaCy's matcher:
[{"TEXT": "acess", "POS": "NOUN"}, {"TEXT": "acces", "POS": "NOUN"}]
# This would then only return the first misspelling of "access" ("acess"), as it is the noun form.

This will still take some time to develop but I feel better about this direction.
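
The rewrite step described above can be sketched in plain Python. This is only an illustration under stated assumptions, not spaczz's implementation: `rewrite_pattern` and `fuzzy_ratio` are hypothetical names, the standard library's `difflib` stands in for a real fuzzy-matching library, the 80 threshold is arbitrary, and only single-token patterns are handled. Each rewritten pattern is its own single-token list so that the patterns are alternatives rather than a two-token sequence.

```python
from difflib import SequenceMatcher


def fuzzy_ratio(a: str, b: str) -> float:
    """Similarity in [0, 100]. difflib is an assumed stand-in for a fuzzy library."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rewrite_pattern(pattern, tokens, min_ratio=80):
    """Expand one single-token "fuzzy" pattern into verbatim-text patterns.

    `pattern` is a list holding one token dict, for example
    [{"TEXT": {"FUZZY": "access"}, "POS": "NOUN"}].
    `tokens` is the list of token texts in the document.
    """
    token_spec = pattern[0]
    target = token_spec["TEXT"]["FUZZY"]
    # Collect the distinct token texts that are close enough to the target.
    found = []
    for text in tokens:
        if text not in found and fuzzy_ratio(text, target) >= min_ratio:
            found.append(text)
    # Keep every other constraint (such as POS) and swap in the verbatim text.
    return [[{**token_spec, "TEXT": text}] for text in found]


# Whitespace tokenization is used here only to keep the sketch dependency-free.
tokens = "The manager gave me acess to the database so now I can acces it .".split()
pattern = [{"TEXT": {"FUZZY": "access"}, "POS": "NOUN"}]
rewritten = rewrite_pattern(pattern, tokens)
print(rewritten)
```

The POS constraint is not evaluated in this sketch. It is carried through unchanged so that a matcher with access to POS tags could apply it to the verbatim patterns afterwards.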

In the meantime, I will post a more obtuse, but still useful, workaround that makes use of on-match callbacks with the FuzzyMatcher. I should get to that this evening.

@Ibrokhimsadikov
Author

Ibrokhimsadikov commented Dec 15, 2020

Dear @gandersen101,

First of all, thank you so much for not forgetting about me. I am very grateful for your effort, as this is the only library that integrates a fuzzy approach. With spaczz I was able to get more entities than when using only spaCy's matcher. As you know, one of the biggest issues in NER is building a dictionary/knowledge base, which usually comes with different string variations or synonyms, and that is a very time-consuming manual effort for custom NER. Spaczz does well, even though it comes at the expense of memory consumption when running inside the spaCy pipeline.

Also, is AbstractedMatcher your custom pipeline component, similar to SpaczzRuler?

Thank you so much. I check in on this repo from time to time to see your updates. I am looking forward to your "obtuse" :) solution so I can start testing it, as I am working with spaczz right now.

@gandersen101
Owner

gandersen101 commented Dec 16, 2020

Hi @Ibrokhimsadikov thanks for the kind words. I'm very happy that you and others are finding this project useful. I certainly haven't forgotten about this request, I've just had less time than I would like to work on spaczz lately.

The AbstractedMatcher in the example above is just a placeholder, which I will probably end up naming SpaczzMatcher, and I will also incorporate its functionality into the SpaczzRuler.

Below is a workaround with the FuzzyMatcher you can use for now. It will only work as expected with single token patterns and the flex argument set to 0. This is definitely a limited solution but you may be able to expand the idea. The eventual SpaczzMatcher will be much more flexible than this.

import spacy
from spacy.tokens import Span
from spaczz.matcher import FuzzyMatcher

nlp = spacy.load("en_core_web_md")
text = "The manager gave me acess to the database so now I can acces it."
doc = nlp(text)


def add_ent(matcher, doc, i, matches):
    """Callback on match function. Adds entities to doc with name of label."""
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end, _ratio = matches[i]
    entity = Span(doc, start, end, label=match_id)
    # If a token in the Span already belongs to an entity, skip it rather than raising.
    # spaCy reports conflicting/overlapping entities with a ValueError.
    try:
        doc.ents += (entity,)
    except ValueError:
        pass


def select_nouns(matcher, doc, i, matches):
    """Callback on match function. Will continue passing matches that are nouns."""
    # This will only work with single-token patterns.
    # Also calling the above callback within this function to add entities to the doc.
    match_id, start, _end, _ratio = matches[i]
    if doc[start].pos_ == "NOUN":
        add_ent(matcher, doc, i, matches)


matcher = FuzzyMatcher(nlp.vocab, flex=0)
# Flex = 0 with single-token patterns will approximate token matching for now.
matcher.add("TEST", [nlp("access")], on_match=select_nouns)
matches = matcher(doc)

# Only the noun version of "access" was added to the doc.
for ent in doc.ents:
    print((ent.text, ent.start, ent.end, ent.label_))
# ('acess', 4, 5, 'TEST')

Hope that helps for now!
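
One possible way to expand the idea beyond single-token patterns is to filter whole match spans by the POS tags of their tokens. The sketch below is dependency-free and makes several assumptions: `filter_by_pos` is a hypothetical helper (not part of spaczz), the POS tags are written by hand rather than produced by a spaCy model, and the ratio values are made up. The match tuples follow the `(match_id, start, end, ratio)` shape unpacked in the callbacks above.

```python
def filter_by_pos(matches, pos_tags, allowed=frozenset({"NOUN"}), require_all=False):
    """Keep (label, start, end, ratio) matches whose tokens carry an allowed POS tag.

    With require_all=False a span passes if any of its tokens is allowed;
    with require_all=True every token in the span must be allowed.
    """
    check = all if require_all else any
    return [
        match
        for match in matches
        if check(tag in allowed for tag in pos_tags[match[1]:match[2]])
    ]


# Hand-written tags (an assumption) for:
# "The manager gave me acess to the database so now I can acces it ."
pos_tags = [
    "DET", "NOUN", "VERB", "PRON", "NOUN", "ADP", "DET", "NOUN",
    "ADV", "ADV", "PRON", "AUX", "VERB", "PRON", "PUNCT",
]
# Hypothetical fuzzy matches of "access": token 4 ("acess") and token 12 ("acces").
matches = [("TEST", 4, 5, 91), ("TEST", 12, 13, 91)]

noun_matches = filter_by_pos(matches, pos_tags)
print(noun_matches)
```

With real spaCy objects, the same check could read `token.pos_` for each token in `doc[start:end]` inside an on-match callback.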

@Ibrokhimsadikov
Author

Thank you so much for your response. I will start using it. Immense thanks!

@ronyarmon

People interested in using the Cython source may find this question of interest:
https://stackoverflow.com/questions/65454160/incorporating-fuzzy-search-in-a-matcher-object

@gandersen101
Owner

Hi @ronyarmon thank you for keeping us updated with your research.

I hope to eventually Cythonize the algorithmic components of spaczz and integrate them with spaCy Vocab objects, but that is currently beyond my programming capabilities. It will be a fairly long-term process for me to develop my C/Cython skills enough to accomplish that, so if you and/or others are able to do it faster or better than I can, you'll certainly have my full support! If the spaCy team decides to implement some of this functionality, even better!

Ultimately, I made spaczz to provide features I didn't see anywhere else in the current spaCy ecosystem but I know for sure they could be implemented better than they are now.

In the meantime, I hope to have a new version of spaczz with this requested feature ready in the next couple weeks and will continue to provide updates here.

@gandersen101
Owner

As of now, I have broken the implementation of this feature into 5 distinct elements that I will be working on mostly sequentially.

  • Create the algorithm that will search through a Doc using token patterns.
  • Create mapping for output of algorithm to spaCy Matcher compatible patterns.
  • Wrap the algorithm and mapping into the SpaczzMatcher.
  • The SpaczzMatcher won't be able to return match ratios itself, so I will move all match ratio information to custom token/span attributes and properties to keep all the matchers consistent while retaining all desired information.
  • Integrate the new SpaczzMatcher into the SpaczzRuler.

Pull #35 completes the first task in this list. Hoping to have more done soon!

@gandersen101
Owner

More progress on this feature. Please see the roadmap below:

  • Create the algorithm that will search through a Doc using token patterns.
  • Create mapping for output of algorithm to spaCy Matcher compatible patterns.
  • Wrap the algorithm and mapping into the SpaczzMatcher.
  • The SpaczzMatcher won't be able to return match ratios itself, so I will move all match ratio information to custom token/span attributes and properties to keep all the matchers consistent while retaining all desired information.
  • Integrate the new SpaczzMatcher into the SpaczzRuler.

I am hoping to have this feature finished this week.

@ronyarmon your Stack Overflow question received an interesting response that I will explore in the near future. Since I am close to implementing this feature in my pure-Python way, I will finish that before exploring an extension of the spaCy Matcher.

@Ibrokhimsadikov
Author

Thank you for sharing that @gandersen101

@gandersen101
Owner

A few days overdue but this is closed by spaczz v0.4.0. Hopefully you all enjoy it. Please raise an issue if you run into any bugs!

@Ibrokhimsadikov
Author

Thank you so much @gandersen101, I will definitely try that. Just FYI, I know the speed issues are already known, but I want to share my observations: processing 2 million reports averaging 150 words each took approximately 20 days with spaczz, versus 3 days with spaCy's EntityRuler, in production on an AWS ml.m5.12xlarge notebook instance. For POS, spaczz is amazing. Thank you once again. I will implement the POS tagging capability as well.
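
For scale, the figures above work out to roughly 1.2 reports per second with spaczz versus about 7.7 per second with the EntityRuler, a slowdown of about 6.7x. The short calculation below only restates that arithmetic. It assumes the 20-day and 3-day figures are continuous wall-clock time on the same workload, which the comment does not state explicitly, and the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

n_docs = 2_000_000      # reports processed
words_per_doc = 150     # average words per report


def docs_per_second(days: float) -> float:
    """Throughput in reports per second, assuming continuous wall-clock time."""
    return n_docs / (days * SECONDS_PER_DAY)


spaczz_rate = docs_per_second(20)        # spaczz run: about 20 days
entity_ruler_rate = docs_per_second(3)   # EntityRuler run: about 3 days
slowdown = entity_ruler_rate / spaczz_rate

print(f"spaczz:      {spaczz_rate:.2f} docs/s, {spaczz_rate * words_per_doc:.0f} words/s")
print(f"EntityRuler: {entity_ruler_rate:.2f} docs/s, {entity_ruler_rate * words_per_doc:.0f} words/s")
print(f"slowdown:    {slowdown:.1f}x")
```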

@gandersen101
Owner

Hey @Ibrokhimsadikov. Thank you for the speed profiling. There is definitely a lot of room for improvement. Issue #41 turns into a performance discussion, and I am planning on doing some (hopefully substantial) enhancements very soon. I will also try to keep track of major performance updates in issue #20 over the long term.

Let me know if you have questions on the token matcher. There is an example in the readme and more in spaczz document tests and test suite.
