Allow for exact matches only? #41

Open
buchanae opened this issue Jul 18, 2018 · 6 comments
@buchanae

Querying something like "BRCA1", I get a lot of seemingly unrelated matches such as "BRAT1".

This is obviously a symptom of how Elasticsearch matches text. Personally, I think fuzzy matches are dangerous in analytical use cases.

Could we add a query parameter to require an exact match? Or maybe it exists and I'm not seeing the docs?

@newgene
Member

newgene commented Jul 18, 2018

@buchanae A general query like q=BRCA1 matches against multiple fields, such as symbol and name, but it does not use fuzzy matching. The "BRAT1" gene matches because "BRCA1" is mentioned in its gene name.

You can get exactly what you need by using the fielded query:

q=symbol:BRCA1

or limited to human only:

q=symbol:BRCA1&species=human
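
For anyone scripting this, here is a minimal sketch of building the fielded query URL in Python. The base URL (the MyGene.info v3 query endpoint) and the helper name build_query_url are assumptions for illustration and are not taken from this thread; the sketch only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

# Assumption: the MyGene.info v3 query endpoint.
BASE_URL = "https://mygene.info/v3/query"


def build_query_url(field, term, species=None):
    """Build a fielded query URL such as ...?q=symbol:BRCA1 (hypothetical helper)."""
    params = {"q": f"{field}:{term}"}
    if species is not None:
        params["species"] = species
    # safe=":" keeps the field separator readable instead of encoding it as %3A.
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


print(build_query_url("symbol", "BRCA1"))
print(build_query_url("symbol", "BRCA1", species="human"))
```

Restricting the query to the symbol field is what removes hits such as BRAT1, which only mention the term in another field.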

@buchanae
Author

Ah, ok, thanks!

I actually can't even reproduce the results I mentioned now. Wish I had posted the query.

These are the queries I tried this morning: https://gist.github.com/buchanae/5cba60894e190c35da1ac3e1c7e5e511

@buchanae
Author

Here's an example I don't understand:

import mygene
mg = mygene.MyGeneInfo()
mg.querymany(["CBLB"], species='human', fields="symbol,alias,ensembl.gene", scopes="symbol,alias")
querying 1-1...done.
Finished.
1 input query terms found dup hits:
	[('CBLB', 2)]
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
[{'query': 'CBLB',
  '_id': '868',
  '_score': 89.78527,
  'alias': ['Cbl-b', 'Nbla00127', 'RNF56'],
  'ensembl': {'gene': 'ENSG00000114423'},
  'symbol': 'CBLB'},
 {'query': 'CBLB',
  '_id': '326625',
  '_score': 9.830278,
  'alias': ['ATR', 'CFAP23', 'cblB', 'cob'],
  'ensembl': {'gene': 'ENSG00000139428'},
  'symbol': 'MMAB'}]

Since I'm not passing returnall=True, shouldn't this return only the best hit?
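
Whatever the intended behavior of returnall turns out to be, a client-side workaround is possible. The following is a rough sketch, not part of the mygene client: best_hit_per_query is a hypothetical helper, and the sample list is trimmed from the output above. It keeps only the highest-scoring hit for each query term.

```python
def best_hit_per_query(hits):
    """Return a dict mapping each query term to its highest-scoring hit."""
    best = {}
    for hit in hits:
        # Entries without a score (for example, terms with no hit) are skipped.
        if "_score" not in hit:
            continue
        term = hit["query"]
        if term not in best or hit["_score"] > best[term]["_score"]:
            best[term] = hit
    return best


# Sample data trimmed from the querymany output shown above.
hits = [
    {"query": "CBLB", "_id": "868", "_score": 89.78527, "symbol": "CBLB"},
    {"query": "CBLB", "_id": "326625", "_score": 9.830278, "symbol": "MMAB"},
]

top = best_hit_per_query(hits)
print(top["CBLB"]["_id"], top["CBLB"]["symbol"])
```

Note that this trusts the relevance score, so it only helps when the intended gene is also the top-scoring one.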

@buchanae
Author

And another.

mg.querymany(["MCM3"], species='human', fields="symbol,alias,ensembl.gene", scopes="symbol,alias")
querying 1-1...done.
Finished.
1 input query terms found dup hits:
	[('MCM3', 2)]
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
[{'query': 'MCM3',
  '_id': '4172',
  '_score': 84.13076,
  'alias': ['HCC5', 'P1-MCM3', 'P1.h', 'RLFB'],
  'ensembl': {'gene': 'ENSG00000112118'},
  'symbol': 'MCM3'},
 {'query': 'MCM3',
  '_id': '4176',
  '_score': 5.8433404,
  'alias': ['CDC47',
   'MCM2',
   'P1.1-MCM3',
   'P1CDC47',
   'P85MCM',
   'PNAS146',
   'PPP1R104'],
  'ensembl': {'gene': 'ENSG00000166508'},
  'symbol': 'MCM7'}]

As far as I can tell, the second match happens because of a partial match on the alias P1.1-MCM3.
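
Until the indexing changes, partial matches like this can be filtered out on the client. Here is a minimal sketch, assuming the hits are plain dicts shaped like the output above; exact_matches is a hypothetical helper, not a mygene function. It keeps a hit only when the query string equals the symbol or one of the aliases in full.

```python
def exact_matches(hits, case_sensitive=True):
    """Keep hits whose symbol or alias equals the query string exactly."""
    def norm(value):
        return value if case_sensitive else value.lower()

    kept = []
    for hit in hits:
        aliases = hit.get("alias", [])
        # Defensive: treat a single alias given as a bare string like a list.
        if isinstance(aliases, str):
            aliases = [aliases]
        names = {norm(name) for name in [hit.get("symbol", ""), *aliases]}
        if norm(hit["query"]) in names:
            kept.append(hit)
    return kept


# Sample data copied from the two querymany outputs in this thread.
mcm3_hits = [
    {"query": "MCM3", "_id": "4172", "symbol": "MCM3",
     "alias": ["HCC5", "P1-MCM3", "P1.h", "RLFB"]},
    {"query": "MCM3", "_id": "4176", "symbol": "MCM7",
     "alias": ["CDC47", "MCM2", "P1.1-MCM3", "P1CDC47", "P85MCM",
               "PNAS146", "PPP1R104"]},
]
cblb_hits = [
    {"query": "CBLB", "_id": "868", "symbol": "CBLB",
     "alias": ["Cbl-b", "Nbla00127", "RNF56"]},
    {"query": "CBLB", "_id": "326625", "symbol": "MMAB",
     "alias": ["ATR", "CFAP23", "cblB", "cob"]},
]

print([h["_id"] for h in exact_matches(mcm3_hits)])
print([h["_id"] for h in exact_matches(cblb_hits)])
print([h["_id"] for h in exact_matches(cblb_hits, case_sensitive=False)])
```

The case-insensitive call still keeps MMAB for "CBLB", because "cblB" is a genuine alias of that gene apart from letter case; whether that counts as an exact match is a decision for the caller.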

@buchanae reopened this on Jul 19, 2018

@newgene
Member

newgene commented Jul 24, 2018

@buchanae The "alias" field is indexed as free text, because we observed that its values can sometimes contain whitespace. We can inspect the alias field further and optimize the indexing a bit (e.g. by not treating "-" as a word separator).
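
To illustrate the effect of the word separator, here is a toy sketch. It is not the Elasticsearch analyzer used by the service; tokenize and alias_matches are hypothetical helpers that only mimic the idea of splitting an alias into lowercase tokens and matching the query against those tokens.

```python
import re


def tokenize(text, split_on_hyphen):
    """Split text into lowercase tokens on whitespace, optionally also on '-'."""
    pattern = r"[\s\-]+" if split_on_hyphen else r"\s+"
    return [token for token in re.split(pattern, text.lower()) if token]


def alias_matches(query, alias, split_on_hyphen):
    """Return True when the query equals one of the alias tokens."""
    return query.lower() in tokenize(alias, split_on_hyphen)


# With "-" as a separator, "P1.1-MCM3" yields the token "mcm3", so "MCM3" matches.
print(tokenize("P1.1-MCM3", split_on_hyphen=True))
print(alias_matches("MCM3", "P1.1-MCM3", split_on_hyphen=True))

# Without it, the alias stays a single token and no longer matches.
print(tokenize("P1.1-MCM3", split_on_hyphen=False))
print(alias_matches("MCM3", "P1.1-MCM3", split_on_hyphen=False))
```

Splitting on whitespace alone would still break an alias such as "SEC12P-like 2 protein" into several tokens, so aliases containing spaces remain searchable word by word.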

@sirloon
Member

sirloon commented Aug 6, 2018

The "alias" field comes from the entrez_gene collection, which currently contains 21M documents:

  • 14642 documents have an alias containing a space (e.g. gene 814677, "SEC12P-like 2 protein")
  • 117911 documents have an alias containing a "-" (e.g. gene 35543593, "xcc-b100_0084")
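
A count of this kind can be sketched over an in-memory list of documents. This is only an illustration and not how the numbers above were produced: count_alias_patterns is a hypothetical helper, and the third sample document is made up.

```python
def count_alias_patterns(docs):
    """Count documents having at least one alias with a space or a hyphen."""
    counts = {"space": 0, "hyphen": 0}
    for doc in docs:
        aliases = doc.get("alias", [])
        # Defensive: treat a single alias given as a bare string like a list.
        if isinstance(aliases, str):
            aliases = [aliases]
        if any(" " in alias for alias in aliases):
            counts["space"] += 1
        if any("-" in alias for alias in aliases):
            counts["hyphen"] += 1
    return counts


docs = [
    {"_id": "814677", "alias": ["SEC12P-like 2 protein"]},  # example from above
    {"_id": "35543593", "alias": "xcc-b100_0084"},          # example from above
    {"_id": "example", "alias": ["ABC1"]},                  # made-up document
]

print(count_alias_patterns(docs))
```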

3 participants