Adding POS tagging while building pattern for Spaczzruler #24

Ibrokhimsadikov · 2020-11-09T05:48:00Z

Hello, I am really liking spaczz for fuzzy matching entity patterns.

Quick question: is there a way to add, for example, POS tagging constraints as well? For example, I want to extract only noun phrases of AS, but the fuzzy match is also getting me 'as' from "as above function".
'i' below is each string from the list of vocab to match:
{'label': "ECHO", 'pattern': [{'TEXT': i, 'POS': 'NOUN'}], 'type': 'fuzzy'}

gandersen101 · 2020-11-10T00:01:20Z

Hi @Ibrokhimsadikov, thanks for the kind words. I have gotten behind on spaczz maintenance and improvements lately but am hoping to get back on track in the near future.

I believe implementing some form of POS constraints should be doable, but I'm going to have to think about how I actually want to go about it.

I will keep you updated here as that progresses.

gandersen101 · 2020-12-15T18:35:20Z

Hi @Ibrokhimsadikov, sorry there has not been much visible development on this issue yet. However, I did want to update you on where I am at with thinking/working through this.

The ideal way to add this feature would be to add fuzzy matching support directly to spaCy's matcher. However, because much of it is written in Cython, that is beyond my current coding capabilities.

Accordingly, my original thought was to write a Python implementation very similar to spaCy's matcher. However, this quickly proved to be a massive undertaking that was mostly redundant.

Therefore, I think I am going to attempt this by writing an abstraction that translates these "fuzzy" patterns into spaCy matcher compatible patterns. It would find the fuzzy matches, then rewrite the patterns with the verbatim text found. For example:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The manager gave me acess to the database so now I can acces it.")
pattern = [{"TEXT": {"FUZZY": "access"}, "POS": "NOUN"}]
# AbstractedMatcher.add(pattern)
# Under the hood, this would find fuzzy matches of "access" in the text and then use those
# to rewrite the pattern into patterns that are compatible with spaCy's matcher.
[{"TEXT": "acess", "POS": "NOUN"}, {"TEXT": "acces", "POS": "NOUN"}]
# This would then only return the first misspelling of "access", "acess", as it is the noun form.

This will still take some time to develop but I feel better about this direction.
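The rewrite step described above can be sketched without spaCy. This is only an illustration under stated assumptions, not spaczz's implementation: `rewrite_fuzzy_pattern` and `min_ratio` are hypothetical names, a plain list of strings stands in for the Doc, and the standard library's `difflib.SequenceMatcher` stands in for whatever fuzzy scorer the library actually uses.

```python
from difflib import SequenceMatcher


def rewrite_fuzzy_pattern(tokens, pattern, min_ratio=0.75):
    """Expand one single-token fuzzy pattern into verbatim token patterns.

    `tokens` is a list of token strings standing in for a spaCy Doc.
    `pattern` is a dict such as {"TEXT": {"FUZZY": "access"}, "POS": "NOUN"}.
    Both the function and `min_ratio` are hypothetical, for illustration only.
    """
    target = pattern["TEXT"]["FUZZY"]
    # Every key except TEXT (for example POS) is carried over unchanged.
    constraints = {key: value for key, value in pattern.items() if key != "TEXT"}
    rewritten = []
    seen = set()
    for text in tokens:
        ratio = SequenceMatcher(None, text.lower(), target.lower()).ratio()
        # Keep each sufficiently similar surface form once, in document order.
        if ratio >= min_ratio and text not in seen:
            seen.add(text)
            rewritten.append({"TEXT": text, **constraints})
    return rewritten


tokens = "The manager gave me acess to the database so now I can acces it .".split()
fuzzy_pattern = {"TEXT": {"FUZZY": "access"}, "POS": "NOUN"}
verbatim_patterns = rewrite_fuzzy_pattern(tokens, fuzzy_pattern)
print(verbatim_patterns)
```

The POS constraint is not evaluated here at all. It is simply copied into each rewritten pattern so that spaCy's matcher can apply it later, which is the division of labor the comment above describes.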

In the meantime I will post a more obtuse, but still useful, workaround that makes use of on-match callbacks with the FuzzyMatcher. I should get to that this evening.

Ibrokhimsadikov · 2020-12-15T19:00:20Z

Dear @gandersen101,

First of all, thank you so much for not forgetting about me. I am very grateful for your effort, as this is the only library that integrates a fuzzy approach. With spaczz I was able to get more entities than with spaCy's matcher alone. As you know, one of the biggest issues in NER is building a dictionary/knowledge base, which usually has to cover different variations of a string, or synonyms, and is a very time-consuming manual effort for custom NER. Spaczz is doing well, even though at the expense of memory consumption while running inside the spaCy pipeline.

Also, is AbstractedMatcher your custom pipeline component, similar to SpaczzRuler?

Thank you so much. I check in on this repo from time to time to see your updates. Looking forward to your "obtuse" :) solution so I can start testing it, as I am working with spaczz right now.

gandersen101 · 2020-12-16T04:14:25Z

Hi @Ibrokhimsadikov thanks for the kind words. I'm very happy that you and others are finding this project useful. I certainly haven't forgotten about this request, I've just had less time than I would like to work on spaczz lately.

The AbstractedMatcher in the example above is just a placeholder, which I will probably end up naming SpaczzMatcher, and I will also incorporate its functionality into the SpaczzRuler.

Below is a workaround with the FuzzyMatcher you can use for now. It will only work as expected with single token patterns and the flex argument set to 0. This is definitely a limited solution but you may be able to expand the idea. The eventual SpaczzMatcher will be much more flexible than this.

import spacy
from spacy.tokens import Span
from spaczz.matcher import FuzzyMatcher

nlp = spacy.load("en_core_web_md")
text = "The manager gave me acess to the database so now I can acces it."
doc = nlp(text)


def add_ent(matcher, doc, i, matches):
    """Callback on match function. Adds entities to doc with name of label."""
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end, _ratio = matches[i]
    entity = Span(doc, start, end, label=match_id)
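    # If a token in the Span already belongs to an entity, skip it rather than raising an exception.
    # ValueError is caught as well because spaCy reports conflicting doc.ents that way.
    try:
        doc.ents += (entity,)
    except (AttributeError, ValueError):
        pass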


def select_nouns(matcher, doc, i, matches):
    """Callback on match function. Will continue passing matches that are nouns."""
    # This will only work with single-token patterns.
    # Also calling the above callback within this function to add entities to the doc.
    match_id, start, _end, _ratio = matches[i]
    if doc[start].pos_ == "NOUN":
        add_ent(matcher, doc, i, matches)


matcher = FuzzyMatcher(nlp.vocab, flex=0)
# Flex = 0 with single-token patterns will approximate token matching for now.
matcher.add("TEST", [nlp("access")], on_match=select_nouns)
matches = matcher(doc)

# Only the noun version of "access" was added to the doc.
for ent in doc.ents:
    print((ent.text, ent.start, ent.end, ent.label_))

('acess', 4, 5, 'TEST')

Hope that helps for now!
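One way the callback idea above might be generalized is a small factory that builds the filter for any set of POS tags. This is a hedged sketch, not part of spaczz: `make_pos_filter` and `on_accept` are hypothetical names, the "Doc" is a plain list of stand-in objects with hand-assigned `pos_` values, and the match tuples (including the ratio of 91) are written by hand rather than produced by a matcher.

```python
from types import SimpleNamespace


def make_pos_filter(allowed_pos, on_accept):
    """Build an on-match callback that forwards a match only if every
    token in the matched span has a POS tag in `allowed_pos`.

    Hypothetical helper: requiring *all* tokens to qualify is an assumption.
    """
    allowed = set(allowed_pos)

    def callback(matcher, doc, i, matches):
        # Only start and end are needed; extra fields such as a ratio are ignored.
        _match_id, start, end, *_rest = matches[i]
        if all(doc[j].pos_ in allowed for j in range(start, end)):
            on_accept(matcher, doc, i, matches)

    return callback


# Stand-in for a parsed Doc: tokens with hand-assigned POS tags.
words = "The manager gave me acess to the database so now I can acces it .".split()
tags = ["DET", "NOUN", "VERB", "PRON", "NOUN", "ADP", "DET", "NOUN",
        "ADV", "ADV", "PRON", "AUX", "VERB", "PRON", "PUNCT"]
doc = [SimpleNamespace(text=w, pos_=t) for w, t in zip(words, tags)]

# Hand-written matches in the (match_id, start, end, ratio) shape used above.
matches = [("TEST", 4, 5, 91), ("TEST", 12, 13, 91)]

accepted = []
select_nouns = make_pos_filter({"NOUN"}, lambda m, d, i, ms: accepted.append(ms[i]))
for i in range(len(matches)):
    select_nouns(None, doc, i, matches)
print(accepted)
```

With a real FuzzyMatcher, the `on_accept` argument could be a function like `add_ent` above, and the returned callback would be passed as `on_match`.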

Ibrokhimsadikov · 2020-12-16T05:44:44Z

Thank you so much for your response. I will start using it. Immense thanks!

ronyarmon · 2020-12-26T06:43:09Z

People interested in using the Cython source may find this question of interest:
https://stackoverflow.com/questions/65454160/incorporating-fuzzy-search-in-a-matcher-object

gandersen101 · 2020-12-26T16:56:33Z

Hi @ronyarmon thank you for keeping us updated with your research.

I hope to eventually Cythonize the algorithmic components of spaczz and integrate them with spaCy Vocab objects, but that is currently beyond my programming capabilities. It will be a fairly long-term process for me to develop my C/Cython skills enough to accomplish that, so if you and/or others are able to accomplish it faster/better than I can, you'll certainly have my full support! If the spaCy team decides to implement some of this functionality, even better!

Ultimately, I made spaczz to provide features I didn't see anywhere else in the current spaCy ecosystem but I know for sure they could be implemented better than they are now.

In the meantime, I hope to have a new version of spaczz with this requested feature ready in the next couple weeks and will continue to provide updates here.

gandersen101 · 2020-12-30T02:32:49Z

As of now I have broken the implementation of this feature into 5 distinct elements that I will work on mostly sequentially:

1. Create the algorithm that will search through a Doc using token patterns.
2. Create a mapping from the output of the algorithm to spaCy Matcher compatible patterns.
3. Wrap the algorithm and mapping into the SpaczzMatcher.
4. The SpaczzMatcher won't be able to return match ratios itself, so I will move all match ratio information to custom token/span attributes and properties to keep all the matchers consistent while retaining all desired information.
5. Integrate the new SpaczzMatcher into the SpaczzRuler.

Pull #35 completes the first task in this list. Hoping to have more done soon!
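To make the first item in the list above concrete, here is a minimal pure-Python sketch of searching a token sequence with a multi-token pattern. It is an assumption-laden illustration and not the code from pull #35: `token_ratio`, `search_token_patterns`, and `min_ratio` are hypothetical names, tokens are plain strings, and `difflib.SequenceMatcher` stands in for the real fuzzy scorer.

```python
from difflib import SequenceMatcher


def token_ratio(text, spec):
    """Score one token against one pattern entry on a 0-100 scale.

    `spec` is {"TEXT": "to"} for an exact match or
    {"TEXT": {"FUZZY": "access"}} for a fuzzy one (hypothetical format).
    """
    wanted = spec["TEXT"]
    if isinstance(wanted, dict):
        target = wanted["FUZZY"]
        return round(100 * SequenceMatcher(None, text.lower(), target.lower()).ratio())
    return 100 if text == wanted else 0


def search_token_patterns(tokens, pattern, min_ratio=75):
    """Slide the pattern over `tokens` and collect windows where every
    token meets `min_ratio`. Returns (start, end, ratios) tuples."""
    width = len(pattern)
    results = []
    for start in range(len(tokens) - width + 1):
        ratios = [token_ratio(tokens[start + k], pattern[k]) for k in range(width)]
        if all(ratio >= min_ratio for ratio in ratios):
            results.append((start, start + width, ratios))
    return results


tokens = "The manager gave me acess to the database so now I can acces it .".split()
pattern = [{"TEXT": {"FUZZY": "access"}}, {"TEXT": "to"}]
found = search_token_patterns(tokens, pattern)
print(found)
```

Returning the per-token ratios alongside the span offsets keeps them available for later steps, which matters because, as item 4 notes, the spaCy Matcher itself cannot return match ratios.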

gandersen101 · 2021-01-11T03:14:58Z

More progress on this feature. Please see the roadmap below:

1. Create the algorithm that will search through a Doc using token patterns.
2. Create a mapping from the output of the algorithm to spaCy Matcher compatible patterns.
3. Wrap the algorithm and mapping into the SpaczzMatcher.
4. The SpaczzMatcher won't be able to return match ratios itself, so I will move all match ratio information to custom token/span attributes and properties to keep all the matchers consistent while retaining all desired information.
5. Integrate the new SpaczzMatcher into the SpaczzRuler.

I am hoping to have this feature finished this week.

@ronyarmon your stackoverflow question received an interesting response that I will explore in the near future. Seeing that I am close to implementing this feature in my pure-Python way, I will finish this before exploring expanding the spaCy Matcher.

Ibrokhimsadikov · 2021-01-11T03:22:38Z

Thank you for sharing that @gandersen101

gandersen101 · 2021-01-20T19:34:03Z

A few days overdue but this is closed by spaczz v0.4.0. Hopefully you all enjoy it. Please raise an issue if you run into any bugs!

Ibrokhimsadikov · 2021-01-20T19:45:31Z

Thank you so much @gandersen101, I will definitely try that. Just FYI, I know the speed issues are a known fact, but I want to share my observations: processing 2 million reports with an average of 150 words each took approximately 20 days, while spaCy's EntityRuler took 3 days, in production on an AWS ml.m5.12xlarge notebook instance. For POS, spaczz is amazing. Thank you once again, I will implement the POS tagging capability as well.
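For scale, the figures reported above can be turned into rough throughput numbers. The arithmetic below uses only the numbers in the comment (2 million reports, about 150 words each, 20 days versus 3 days) and assumes continuous processing around the clock, which the comment does not state.

```python
REPORTS = 2_000_000
WORDS_PER_REPORT = 150  # reported average
SECONDS_PER_DAY = 86_400


def throughput(days):
    """Return (reports per second, words per second) for a run of `days` days,
    assuming the job ran continuously."""
    seconds = days * SECONDS_PER_DAY
    return REPORTS / seconds, REPORTS * WORDS_PER_REPORT / seconds


spaczz_reports, spaczz_words = throughput(20)
ruler_reports, ruler_words = throughput(3)
slowdown = 20 / 3

print(f"spaczz:      {spaczz_reports:.2f} reports/s, {spaczz_words:.0f} words/s")
print(f"EntityRuler: {ruler_reports:.2f} reports/s, {ruler_words:.0f} words/s")
print(f"slowdown:    {slowdown:.1f}x")
```

Under that assumption the spaczz run works out to a little over one report per second, roughly 6.7 times slower than the EntityRuler run.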

gandersen101 · 2021-01-21T18:03:29Z

Hey @Ibrokhimsadikov. Thank you for the speed profiling. Definitely a lot of room for improvement. Issue #41 turns into a performance discussion and I am planning on doing some (hopefully substantial) enhancements very soon. I will also try to keep track of major performance updates in issue #20 over the long-term.

Let me know if you have questions on the token matcher. There is an example in the readme and more in spaczz document tests and test suite.

gandersen101 added the enhancement New feature or request label Nov 10, 2020

gandersen101 self-assigned this Dec 2, 2020

gandersen101 mentioned this issue Dec 24, 2020

Fuzzy Match of Term Combinations #34

Closed

gandersen101 mentioned this issue Dec 30, 2020

Enhancement tokensearcher #35

Merged

gandersen101 closed this as completed Jan 20, 2021
