KG2 Merge Fail (Snakemake) #1022

Closed
ecwood opened this issue Aug 18, 2020 · 39 comments

Comments

@ecwood
Collaborator

ecwood commented Aug 18, 2020

[Tue Aug 18 09:05:47 2020]
rule Merge:
    input: /home/ubuntu/kg2-build/kg2-owl.json, /home/ubuntu/kg2-build/kg2-uniprotkb.json, /home/ubuntu/kg2-build/kg2-semmeddb-edges.json, /home/ubuntu/kg2-build/kg2-chembl.json, /home/ubuntu/kg2-build/kg2-ensembl.json, /home/ubuntu/kg2-build/kg2-unichem.json, /home/ubuntu/kg2-build/kg2-ncbigene.json, /home/ubuntu/kg2-build/kg2-dgidb.json, /home/ubuntu/kg2-build/kg2-rtx-kg1.json, /home/ubuntu/kg2-build/kg2-repodb.json, /home/ubuntu/kg2-build/kg2-drugbank.json, /home/ubuntu/kg2-build/kg2-smpdb.json, /home/ubuntu/kg2-build/kg2-hmdb.json, /home/ubuntu/kg2-build/kg2-go-annotation.json
    output: /home/ubuntu/kg2-build/kg2.json, /home/ubuntu/kg2-build/kg2-orphans-edges.json
    jobid: 5

[/home/ubuntu/kg2-build/kg2-owl.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-owl.json] number of nodes added: 7076627
[/home/ubuntu/kg2-build/kg2-uniprotkb.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-uniprotkb.json] number of nodes added: 26490
[/home/ubuntu/kg2-build/kg2-semmeddb-edges.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-semmeddb-edges.json] number of nodes added: 38
[/home/ubuntu/kg2-build/kg2-chembl.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-chembl.json] number of nodes added: 1893001
[/home/ubuntu/kg2-build/kg2-ensembl.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-ensembl.json] number of nodes added: 67668
[/home/ubuntu/kg2-build/kg2-unichem.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-unichem.json] number of nodes added: 0
[/home/ubuntu/kg2-build/kg2-ncbigene.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-ncbigene.json] number of nodes added: 61559
[/home/ubuntu/kg2-build/kg2-dgidb.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-dgidb.json] number of nodes added: 4505
[/home/ubuntu/kg2-build/kg2-rtx-kg1.json] reading nodes from file
[OBO:go/extensions/go-plus.owl] inconsistent category information; keeping original category biolink:RelationshipType and discarding new category biolink:BiologicalProcess: GO:0006461
[OBO:go/extensions/go-plus.owl] inconsistent category_label information; keeping original category_label relationship_type and discarding new category_label biological_process: GO:0006461
Traceback (most recent call last):
  File "/home/ubuntu/kg2-code/merge_graphs.py", line 52, in <module>
    nodes[node_id] = kg2_util.merge_two_dicts(nodes[node_id], node)
  File "/home/ubuntu/RTX/code/kg2/kg2_util.py", line 490, in merge_two_dicts
    ret_dict[key] = list(first_element) + sorted(list(set(value + stored_value) - first_element))
TypeError: '<' not supported between instances of 'NoneType' and 'str'
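
For reference, this is the TypeError Python 3 raises whenever `sorted()` has to compare `None` with a string. A minimal reproduction, using a made-up list rather than the real node data:

```python
# Hypothetical synonym values; only the mix of str and None matters.
synonyms = ["OMIPALISIB", None, "Omipalisib"]

try:
    sorted(synonyms)
    message = ""
except TypeError as err:
    # Python 3 refuses to order None against str.
    message = str(err)

print(message)
```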
@ecwood
Collaborator Author

ecwood commented Aug 18, 2020

With the following change to kg2_util (see the two print lines at the end of the code block), I reran merge_graphs.py:

elif type(value) == list and type(stored_value) == list:
    if key != 'synonym':
        ret_dict[key] = sorted(list(set(value + stored_value)))
    else:
        if len(stored_value) > 0:
            first_element = {stored_value[0]}
        elif len(value) > 0 and len(stored_value) == 0:
            first_element = {value[0]}
        else:
            first_element = set()
        try:
            ret_dict[key] = list(first_element) + sorted(list(set(value + stored_value) - first_element))
        except:
            print("First element", list(first_element))
            print("Second part", list(set(value + stored_value) - first_element))

The list is being sorted, and the error (TypeError: '<' not supported between instances of 'NoneType' and 'str') is raised by sorted(), so the Second part must not contain a None. As you'll see below, Second part holds many items, including a None.

[/home/ubuntu/kg2-build/kg2-rtx-kg1.json] reading nodes from file
[OBO:go/extensions/go-plus.owl] inconsistent category information; keeping original category biolink:RelationshipType and discarding new category biolink:BiologicalProcess: GO:0006461
[OBO:go/extensions/go-plus.owl] inconsistent category_label information; keeping original category_label relationship_type and discarding new category_label biological_process: GO:0006461
First element ['InChI=1S/C25H17F2N5O3S/c1-35-25-23(32-36(33,34)24-5-3-18(26)12-21(24)27)11-17(13-29-25)15-2-4-22-20(10-15)19(7-8-28-22)16-6-9-30-31-14-16/h2-14,32H,1H3']
Second part ['COc1ncc(cc1NS(=O)(=O)c2ccc(F)cc2F)c3ccc4nccc(c5ccnnc5)c4c3', 'ASTRAZENECA:5599', 'USP/USAN:18664', 'CANDIDATES:18664', 'OMIPALISIB', None, 'SID137275909', 'Omipalisib', 'PUBCHEM_BIOASSAY:137275909', 'CGBJSGAELGCMKE-UHFFFAOYSA-N', '2,4-difluoro-N-(2-methoxy-5-(4-(pyridazin-4-yl)quinolin-6-yl)pyridin-3-yl)benzenesulfonamide']

I think that I can fix this by removing all "None" values from the list.

I'm going to tag @saramsey so that he is aware of this.

@saramsey
Member

saramsey commented Aug 18, 2020

Hi @ericawood how about

ret_dict[key] = list(first_element) + sorted(filter(None, list(set(value + stored_value) - first_element)))

(see added filter(None, ...) call). If you agree, can you please implement this fix?
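
One caveat about `filter(None, ...)`: with `None` as the function argument it drops every falsy item, not only `None`, so an empty-string synonym would disappear too. A small sketch with made-up values, comparing it with an explicit `is not None` test:

```python
values = ["b", None, "", "a"]  # made-up synonym values

# filter(None, ...) keeps only truthy items, so "" is dropped as well.
truthy_only = sorted(filter(None, values))

# An explicit test removes None and nothing else.
not_none = sorted(v for v in values if v is not None)

print(truthy_only)  # ['a', 'b']
print(not_none)     # ['', 'a', 'b']
```

Whether dropping empty strings is desirable for synonyms is a separate question; the sketch only shows that the two forms differ.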

@ecwood
Collaborator Author

ecwood commented Aug 18, 2020

Hi @saramsey, thank you for your quick response. Since you converted the list to a set, there is only 1 instance of None, so I used list.remove(). (I started testing it a couple of seconds before you commented.) If it fails, I think that filter(None, ...) will work. Either way, I will implement the fix.
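
A note on the `list.remove()` approach: `remove()` raises `ValueError` when the item is absent, so it needs a guard for synonym lists that contain no `None`. A sketch that sidesteps this by calling `set.discard()` before sorting; the helper name `sorted_rest` is hypothetical and is not taken from kg2_util.py:

```python
def sorted_rest(value, stored_value, first_element):
    """Return the merged synonyms minus first_element, sorted, without None.

    Hypothetical helper for illustration; not the code in kg2_util.py.
    """
    rest = set(value + stored_value) - first_element
    rest.discard(None)  # no error if None is absent, unlike list.remove()
    return sorted(rest)


merged = sorted_rest(["b", None], ["a", "c"], {"a"})
print(merged)  # ['b', 'c']
```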

@saramsey
Member

> Hi @saramsey, thank you for your quick response. Since you converted the list to a set, there is only 1 instance of None, so I used list.remove(). (I started testing it a couple of seconds before you commented.) If it fails, I think that filter(None, ...) will work. Either way, I will implement the fix.

Good catch. Please go with whichever fix you feel is more readable.

@saramsey
Member

saramsey commented Aug 18, 2020

Some module somewhere is sticking a None item in a node synonym list, I guess? I suppose we should track that down and fix it, under another issue (marked lower priority in light of this fix).

@saramsey
Member

Looks like maybe the None synonym came from KG1. I am investigating.

@saramsey
Member

From the debugging info that @ericawood posted in this issue, it looks like the build failed on merge_two_dicts for compound CHEMBL1236962 (omipalisib).

@ecwood
Collaborator Author

ecwood commented Aug 18, 2020

> From the debugging info that @ericawood posted in this issue, it looks like the build failed on merge_two_dicts for compound CHEMBL1236962 (omipalisib).

There were multiple that failed (I only posted the first one).

ecwood added a commit that referenced this issue Aug 18, 2020
@saramsey
Member

> > From the debugging info that @ericawood posted in this issue, it looks like the build failed on merge_two_dicts for compound CHEMBL1236962 (omipalisib).
>
> There were multiple that failed (I only posted the first one).

Understood, thanks. I am just tracking down the one that you posted, for starters.

@saramsey
Member

saramsey commented Aug 18, 2020

Some more debugging info. On kg2steve:

ubuntu@ip-172-31-3-188:~/kg2-build$ grep -c CHEMBL1236962 kg2-*.json
kg2-chembl-test.json:14
kg2-chembl.json:1
kg2-dgidb-test.json:11
kg2-dgidb.json:1
kg2-drugbank-test.json:0
kg2-drugbank.json:1
kg2-ensembl-test.json:0
kg2-ensembl.json:0
kg2-go-annotation-test.json:0
kg2-go-annotation.json:0
kg2-hmdb-test.json:0
kg2-hmdb.json:0
kg2-ncbigene-test.json:0
kg2-ncbigene.json:0
kg2-ont-test.json:0
kg2-ont.json:0
kg2-owl.json:0
kg2-repodb-test.json:0
kg2-repodb.json:0
kg2-report-test.json:0
kg2-report.json:0
kg2-rtx-kg1-test.json:0
kg2-rtx-kg1.json:1
kg2-semmeddb-edges.json:0
kg2-semmeddb-test-edges.json:0
kg2-simplified-report-test.json:0
kg2-simplified-report.json:0

@saramsey
Member

OK, the problem appears to be in kg2-chembl.json:

       {
            "id": "CHEMBL.COMPOUND:CHEMBL1236962",
            "iri": "https://identifiers.org/chembl.compound:CHEMBL1236962",
            "name": "OMIPALISIB",
            "full_name": "OMIPALISIB",
            "category": "biolink:ChemicalSubstance",
            "category_label": "chemical_substance",
            "description": "OMIPALISIB; FULL_MW:505.51; MAX_FDA_APPROVAL_PHASE: 1",
            "synonym": [
                "InChI=1S/C25H17F2N5O3S/c1-35-25-23(32-36(33,34)24-5-3-18(26)12-21(24)27)11-17(13-29-25)15-2-4-22-20(10-15)19(7-8-28-22)16-6-9-30-31-14-16/h2-14,32H,1H3",
                "CGBJSGAELGCMKE-UHFFFAOYSA-N",
                "COc1ncc(cc1NS(=O)(=O)c2ccc(F)cc2F)c3ccc4nccc(c5ccnnc5)c4c3",
                "USP/USAN:18664",
                null,
                "CANDIDATES:18664",
                "ASTRAZENECA:5599",
                "Omipalisib",
                "SID137275909",
                "OMIPALISIB",
                "2,4-difluoro-N-(2-methoxy-5-(4-(pyridazin-4-yl)quinolin-6-yl)pyridin-3-yl)benzenesulfonamide",
                "PUBCHEM_BIOASSAY:137275909"
            ],
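
The bare `null` in the synonym array is what surfaces as `None` during the merge: Python's `json` module decodes JSON `null` to `None`. A minimal sketch using a shortened stand-in for the node above:

```python
import json

# Shortened stand-in for the node above; only the null entry matters.
text = (
    '{"id": "CHEMBL.COMPOUND:CHEMBL1236962", '
    '"synonym": ["OMIPALISIB", null, "Omipalisib"]}'
)

node = json.loads(text)
has_none = None in node["synonym"]

print(node["synonym"])  # ['OMIPALISIB', None, 'Omipalisib']
```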

saramsey added a commit that referenced this issue Aug 18, 2020
@saramsey
Member

I have a hunch that 2ee6899 will fix this issue. Going to test on kg2steve.rtx.ai.

@saramsey
Member

saramsey commented Aug 18, 2020

Running this on kg2steve now (in the kg2-build dir):

python3 -u ~/kg2-code/chembl_mysql_to_kg_json.py mysql-config.conf chembl test.json

@saramsey
Member

saramsey commented Aug 18, 2020

Confirmed, 2ee6899 fixes this issue, at least for kg2-chembl.json:

[Screenshot: Screen Shot 2020-08-18 at 11 01 58 AM]

but of course we still need db043a4 for the general case...

@saramsey
Member

Hmm, still getting this issue at line 490 in kg2_util.py. From last night's build on kg2dev, during the running of merge_graphs.py, we got:

Traceback (most recent call last):
  File "/home/ubuntu/kg2-code/merge_graphs.py", line 52, in <module>
    nodes[node_id] = kg2_util.merge_two_dicts(nodes[node_id], node)
  File "/home/ubuntu/RTX/code/kg2/kg2_util.py", line 490, in merge_two_dicts
    ret_dict[key] = list(first_element) + sorted(filter(None, list(set(value + stored_value) - first_element)))
TypeError: '<' not supported between instances of 'NoneType' and 'str'

@ecwood
Collaborator Author

ecwood commented Aug 19, 2020

Hi @saramsey, when did it fail, and on which file? It is working fine on kg2steve; are you sure the code is up to date?

@saramsey
Member

Good questions!

As to when the failure occurred:

ubuntu@ip-172-31-0-169:~/kg2-build$ ls -alh build-kg2.log
-rw-rw-r-- 1 ubuntu ubuntu 28M Aug 19 09:31 build-kg2.log

so that's 0931 UTC today.

As for the code being up-to-date:

ubuntu@ip-172-31-0-169:~/kg2-code$ git status
On branch master
Your branch is up to date with 'origin/master'.

@ecwood
Collaborator Author

ecwood commented Aug 19, 2020

Hi @saramsey, thank you for your quick response! Is it alright if I do some digging on kg2dev?

@saramsey
Member

Sure

@saramsey
Member

The RTXConfiguration-config.json file in /home/ubuntu/kg2-build on kg2dev.rtx.ai looks reasonable; it is configured to query KG1 on arax.rtx.ai as I would expect.

@saramsey
Member

OK, I think it may be time to try catching the TypeError at runtime and print out the contents of first_element, value, and stored_value.
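
A sketch of that kind of instrumentation, assuming the synonym branch of merge_two_dicts looks like the snippet posted earlier in this issue. The function name `merge_synonyms` is hypothetical, and re-raising after printing is a choice made here so the build still stops at the first bad node:

```python
import sys


def merge_synonyms(value, stored_value):
    """Merge two synonym lists, keeping the stored first element in front.

    Hypothetical stand-in for the synonym branch of merge_two_dicts.
    """
    if stored_value:
        first_element = {stored_value[0]}
    elif value:
        first_element = {value[0]}
    else:
        first_element = set()
    try:
        return list(first_element) + sorted(set(value + stored_value) - first_element)
    except TypeError:
        # Dump the inputs that could not be sorted, then re-raise.
        print("first_element:", first_element, file=sys.stderr)
        print("value:", value, file=sys.stderr)
        print("stored_value:", stored_value, file=sys.stderr)
        raise
```

Catching only `TypeError`, rather than using a bare `except`, keeps unrelated failures from being mistaken for the sorting problem.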

@saramsey
Member

saramsey commented Aug 19, 2020

@ericawood can you try that, running merge_graphs.py like this:

/home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/merge_graphs.py \
                       --kgFiles /home/ubuntu/kg2-build/kg2-ont.json \
                                      /home/ubuntu/kg2-build/kg2-semmeddb-edges.json \
                                      /home/ubuntu/kg2-build/kg2-uniprotkb.json \
                                      /home/ubuntu/kg2-build/kg2-ensembl.json \
                                      /home/ubuntu/kg2-build/kg2-unichem.json \
                                      /home/ubuntu/kg2-build/kg2-chembl.json \
                                      /home/ubuntu/kg2-build/kg2-ncbigene.json \
                                      /home/ubuntu/kg2-build/kg2-dgidb.json \
                                      /home/ubuntu/kg2-build/kg2-repodb.json \
                                      /home/ubuntu/kg2-build/kg2-smpdb.json \
                                      /home/ubuntu/kg2-build/kg2-drugbank.json \
                                      /home/ubuntu/kg2-build/kg2-hmdb.json \
                                      /home/ubuntu/kg2-build/kg2-go-annotation.json \
                                      /home/ubuntu/kg2-build/kg2-rtx-kg1.json \
                          --kgFileOrphanEdges /home/ubuntu/kg2-build/kg2-orphans-edges.json \
                                     /home/ubuntu/kg2-build/kg2.json

@ecwood
Collaborator Author

ecwood commented Aug 19, 2020

> OK, I think it may be time to try catching the TypeError at runtime and print out the contents of first_element, value, and stored_value.

Hi @saramsey, that is a debugging statement I put in to figure out the problem.

@ecwood
Collaborator Author

ecwood commented Aug 19, 2020

I did run

/home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/merge_graphs.py  --kgFiles /home/ubuntu/kg2-build/kg2-uniprotkb.json /home/ubuntu/kg2-build/kg2-chembl.json /home/ubuntu/kg2-build/kg2-ensembl.json /home/ubuntu/kg2-build/kg2-unichem.json /home/ubuntu/kg2-build/kg2-ncbigene.json /home/ubuntu/kg2-build/kg2-dgidb.json /home/ubuntu/kg2-build/kg2-rtx-kg1.json /home/ubuntu/kg2-build/kg2-repodb.json /home/ubuntu/kg2-build/kg2-drugbank.json /home/ubuntu/kg2-build/kg2-smpdb.json /home/ubuntu/kg2-build/kg2-hmdb.json /home/ubuntu/kg2-build/kg2-go-annotation.json --kgFileOrphanEdges /home/ubuntu/kg2-build/kg2-orphans-edges.json /home/ubuntu/kg2-build/kg2.json

(no SemMed or UMLS/ontologies, for time's sake)

and the output (before I stopped it) got past kg2-rtx-kg1.json:

[/home/ubuntu/kg2-build/kg2-uniprotkb.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-uniprotkb.json] number of nodes added: 26483
[/home/ubuntu/kg2-build/kg2-chembl.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-chembl.json] number of nodes added: 1893001
[/home/ubuntu/kg2-build/kg2-ensembl.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-ensembl.json] number of nodes added: 67668
[/home/ubuntu/kg2-build/kg2-unichem.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-unichem.json] number of nodes added: 0
[/home/ubuntu/kg2-build/kg2-ncbigene.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-ncbigene.json] number of nodes added: 61559
[/home/ubuntu/kg2-build/kg2-dgidb.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-dgidb.json] number of nodes added: 4505
[/home/ubuntu/kg2-build/kg2-rtx-kg1.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-rtx-kg1.json] number of nodes added: 124811
[/home/ubuntu/kg2-build/kg2-repodb.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-repodb.json] number of nodes added: 1
[/home/ubuntu/kg2-build/kg2-drugbank.json] reading nodes from file

@saramsey
Member

OK, I guess we will have to try running the full merge_graphs.py command-line invocation, to see if we can trip this problem again.

@ecwood
Collaborator Author

ecwood commented Aug 19, 2020

> OK, I guess we will have to try running the full merge_graphs.py command-line invocation, to see if we can trip this problem again.

Could it be an issue with the order of KG files? The order used in Snakemake is different than that used in build-kg2.sh.

@saramsey
Member

Yes, order matters here. Please test on kg2dev using the exact invocation I showed above. Thanks.
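
To illustrate why order matters: in the synonym branch shown earlier in this issue, the first synonym of whichever node was stored first stays at the front of the merged list, so loading the files in a different order can yield a different merged node. A sketch with made-up synonyms; `merge_synonyms` is a hypothetical stand-in for that branch:

```python
def merge_synonyms(value, stored_value):
    """Hypothetical stand-in for the synonym branch of merge_two_dicts."""
    first_element = {stored_value[0]} if stored_value else set(value[:1])
    return list(first_element) + sorted(set(value + stored_value) - first_element)


from_file_a = ["zeta", "alpha"]  # made-up synonyms
from_file_b = ["beta", "alpha"]

a_then_b = merge_synonyms(from_file_b, from_file_a)  # file A was loaded first
b_then_a = merge_synonyms(from_file_a, from_file_b)  # file B was loaded first

print(a_then_b)  # ['zeta', 'alpha', 'beta']
print(b_then_a)  # ['beta', 'alpha', 'zeta']
```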

@saramsey
Member

I'm gonna wager this may be some kind of conflict between kg2-ont.json and kg2-rtx-kg1.json, maybe?

@ecwood
Collaborator Author

ecwood commented Aug 19, 2020

> I'm gonna wager this may be some kind of conflict between kg2-ont.json and kg2-rtx-kg1.json, maybe?

Should I take SMPDB out of the testing? It takes a few hours for merge to handle.

@saramsey
Member

> > I'm gonna wager this may be some kind of conflict between kg2-ont.json and kg2-rtx-kg1.json, maybe?
>
> Should I take SMPDB out of the testing? It takes a few hours for merge to handle.

This is a good idea, maybe to do on kg2steve?

@saramsey
Member

Based on the code branch at line 490, I think this has to involve something toxic in the synonym field of a node in some JSON file. Probably kg2-rtx-kg1.json, if I had to guess.

saramsey pushed a commit that referenced this issue Aug 19, 2020
@saramsey
Member

FWIW, I have written a script to look for a None entry in the synonym slot for all nodes in a KG2 JSON file. I am running it on kg2dev. The script is RTX/code/kg2/misc-tools/find_none_synonym.py.
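
For readers without access to the repository, a rough sketch of what such a check might look like. This is not the actual script: it assumes the KG2 JSON has a top-level `nodes` list whose entries carry `id` and `synonym` keys, as in the node dumps in this issue, and the function name and sample data are made up:

```python
def find_none_synonym_nodes(graph):
    """Return the IDs of nodes whose synonym list contains None."""
    bad_ids = []
    for node in graph.get("nodes", []):  # assumed top-level "nodes" key
        synonyms = node.get("synonym") or []
        if any(s is None for s in synonyms):
            bad_ids.append(node.get("id"))
    return bad_ids


# Made-up sample graph; a real run would pass the result of json.load().
sample = {
    "nodes": [
        {"id": "X:1", "synonym": ["a", None]},
        {"id": "X:2", "synonym": ["b"]},
        {"id": "X:3"},
    ]
}
bad = find_none_synonym_nodes(sample)
print(bad)  # ['X:1']
```

Loading an entire KG2 file with `json.load()` needs enough memory to hold the whole graph.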

saramsey pushed a commit that referenced this issue Aug 19, 2020
@saramsey
Member

OK, on kg2dev, the kg2-chembl.json file contains some nodes with None in the synonym field! This may be a clue.

{'category': 'biolink:ChemicalSubstance',
 'category_label': 'chemical_substance',
 'creation_date': None,
 'deprecated': False,
 'description': 'PRAZOSIN; FULL_MW:383.41; MAX_FDA_APPROVAL_PHASE: 4',
 'full_name': 'PRAZOSIN',
 'id': 'CHEMBL.COMPOUND:CHEMBL2',
 'iri': 'https://identifiers.org/chembl.compound:CHEMBL2',
 'name': 'PRAZOSIN',
 'provided_by': 'identifiers_org_registry:chembl',
 'publications': ['PMID:16250647',
                  'PMID:10579841',
                  'PMID:21900013',
                  'PMID:14985103',
                  'PMID:2542561',
                  'PMID:9572880',
                  'PMID:2896246',
                  'PMID:25813897',
                  'PMID:10602703',
                  'PMID:8778245',
                  'PMID:10891117',
                  'PMID:12519065',
                  'PMID:6133954',
                  'PMID:18851888',
                  'PMID:2535878',
                  'PMID:18588282',
                  'PMID:11437380',
                  'PMID:23382458',
                  'PMID:23241029',
                  'PMID:18788725',
                  'PMID:19734051',
                  'PMID:19586686',
                  'PMID:21726069',
                  'PMID:22961681',
                  'PMID:6296387',
                  'PMID:8960552',
                  'PMID:6620302',
                  'PMID:8096245',
                  'PMID:2886664',
                  'PMID:21232965',
                  'PMID:21377769',
                  'PMID:23956101',
                  'PMID:16723224',
                  'PMID:21236664',
                  'PMID:16472241',
                  'PMID:6094812',
                  'PMID:25075762',
                  'PMID:19445515',
                  'PMID:15646539',
                  'PMID:18457386',
                  'PMID:21458999',
                  'PMID:21185626',
                  'PMID:22194678',
                  'PMID:20014752',
                  'PMID:8064796',
                  'PMID:7731013',
                  'PMID:23466604',
                  'PMID:11728183',
                  'PMID:2887657',
                  'PMID:20829430',
                  'PMID:20850911',
                  'PMID:12110607',
                  'PMID:25557493',
                  'PMID:26948801',
                  'PMID:11814815',
                  'PMID:1967315',
                  'PMID:18983139',
                  'PMID:9380680',
                  'PMID:18625562',
                  'PMID:7699710',
                  'PMID:24900570',
                  'PMID:2884316',
                  'PMID:24805037',
                  'PMID:9667967',
                  'PMID:12166933',
                  'PMID:8759642',
                  'PMID:9548811',
                  'PMID:2896247',
                  'PMID:2888896',
                  'PMID:21051535',
                  'PMID:21549456',
                  'PMID:12482417',
                  'PMID:7513748',
                  'PMID:15267234',
                  'PMID:23403082',
                  'PMID:3806618',
                  'PMID:2989524',
                  'PMID:24332655',
                  'PMID:15633998',
                  'PMID:23073734',
                  'PMID:6142954',
                  'PMID:2863377',
                  'PMID:8523408',
                  'PMID:2894465',
                  'PMID:6123600',
                  'PMID:6136611',
                  'PMID:7752182',
                  'PMID:17336075',
                  'PMID:14521410',
                  'PMID:2879919',
                  'PMID:23683590',
                  'PMID:8101878',
                  'PMID:8917649',
                  'PMID:17391966',
                  'PMID:26475518',
                  'PMID:18378462',
                  'PMID:22855735',
                  'PMID:9822553',
                  'PMID:18768239',
                  'PMID:20875743',
                  'PMID:6150111',
                  'PMID:2842504',
                  'PMID:11448222',
                  'PMID:9135028',
                  'PMID:2896245',
                  'PMID:12877594',
                  'PMID:24630561',
                  'PMID:3746815',
                  'PMID:6133953',
                  'PMID:10893315',
                  'PMID:3361578',
                  'PMID:24365159',
                  'PMID:20070106',
                  'PMID:2562855',
                  'PMID:18426954',
                  'PMID:9857099',
                  'PMID:21908192',
                  'PMID:10522703',
                  'PMID:14584940',
                  'PMID:11602674',
                  'PMID:23582449',
                  'PMID:9214740',
                  'PMID:24304387',
                  'PMID:27876250',
                  'PMID:18760923',
                  'PMID:18490167',
                  'PMID:2579237',
                  'PMID:16033273',
                  'PMID:2567783',
                  'PMID:11405649',
                  'PMID:7310808',
                  'PMID:9888842',
                  'PMID:7310823',
                  'PMID:9871765',
                  'PMID:17870541',
                  'PMID:7658428',
                  'PMID:22541068',
                  'PMID:15911273',
                  'PMID:6312043',
                  'PMID:11462977',
                  'PMID:20022146',
                  'PMID:2785211',
                  'PMID:15935663',
                  'PMID:20547819',
                  'PMID:18372181',
                  'PMID:7562940',
                  'PMID:9888831',
                  'PMID:9276013',
                  'PMID:2903929',
                  'PMID:26988801',
                  'PMID:10395498'],
 'replaced_by': None,
 'synonym': ['InChI=1S/C19H21N5O4/c1-26-15-10-12-13(11-16(15)27-2)21-19(22-17(12)20)24-7-5-23(6-8-24)18(25)14-4-3-9-28-14/h3-4,9-11H,5-8H2,1-2H3,(H2,20,21,22)',
             'IENZQIKPVFGBNW-UHFFFAOYSA-N',
             'COc1cc2nc(nc(N)c2cc1OC)N3CCN(CC3)C(=O)c4occc4',
             'PUBCHEM_BIOASSAY:124882486',
             'SID50104275',
             'prazosine',
             'ATLAS:prazosin',
             'SID26751613',
             '4-(4-amino-6,7-dimethoxy-2-quinazolinyl)hexahydro-1-pyrazinyl-2-furylmethanone',
             'TP_TRANSPORTER:2000',
             'PUBCHEM_BIOASSAY:170465415',
             '[3H]prazosin',
             'SID50104273',
             'PRAZOSIN',
             'SID124882482',
             'PUBCHEM_BIOASSAY:26751613',
             'PUBCHEM_BIOASSAY:11112649',
             'SID124882484',
             'PRAZOSINE',
             'DRUGMATRIX:658',
             'SID50104274',
             'SID11113367',
             'PUBCHEM_BIOASSAY:11112650',
             'PUBCHEM_BIOASSAY:144207195',
             'SID11112650',
             'TP_TRANSPORTER:2483',
             'SID144207195',
             '[4-(4-Amino-6,7-dimethoxy-quinazolin-2-yl)-piperazin-1-yl]-furan-2-yl-methanone',
             'Prazosin',
             'PUBCHEM_BIOASSAY:11111665',
             'SID50104272',
             'PUBCHEM_BIOASSAY:124882484',
             '(Prazosin)[4-(4-Amino-6,7-dimethoxy-quinazolin-2-yl)-piperazin-1-yl]-furan-2-yl-methanone',
             '[3H]-prazosin',
             'SID11112649',
             'PUBCHEM_BIOASSAY:11113367',
             'ASTRAZENECA:741',
             'PUBCHEM_BIOASSAY:50104274',
             'ATC:16480',
             'SID50100502',
             '(4-(4-amino-6,7-dimethoxyquinazolin-2-yl)piperazin-1-yl)(furan-2-yl)methanone',
             'PUBCHEM_BIOASSAY:90340959',
             'PUBCHEM_BIOASSAY:50104273',
             'TP_TRANSPORTER:3435',
             'SID124882486',
             'SID26751614',
             'prazosin',
             None,
             'PUBCHEM_BIOASSAY:124882487',
             'PUBCHEM_BIOASSAY:26751614',
             'PUBCHEM_BIOASSAY:124882482',
             'PUBCHEM_BIOASSAY:50104275',
             'TP_TRANSPORTER:1298',
             'SID124882487',
             'TP_TRANSPORTER:1000',
             'Minizide',
             'SID170465415',
             'Prazocin',
             'Prazosine',
             'PUBCHEM_BIOASSAY:50104272',
             'PUBCHEM_BIOASSAY:50100502',
             'SID11111665',
             'SID90340959'],
 'update_date': '2018-12-10'}

@saramsey
Member

I think I may have found the bug in chembl_mysql_to_kg_json.py. Testing now.

saramsey added a commit that referenced this issue Aug 19, 2020
@saramsey
Member

See 78dc069. That is definitely a bug! I'm not entirely sure it is the root cause of this problem, but I strongly suspect that it is. Testing on kg2dev now.

saramsey added a commit that referenced this issue Aug 19, 2020
@saramsey
Member

Ugh, see also 879f0e2. This module really needs a code review and a run through a linter.

@saramsey
Member

OK, the issue of None showing up in the synonym field of a node in kg2-chembl.json has been fixed by 78dc069 and/or 879f0e2. See the test results on kg2dev.rtx.ai, run just now:

(kg2-venv) ubuntu@ip-172-31-0-169:~/kg2-build$ python ~/kg2-code/chembl_mysql_to_kg_json.py mysql-config.conf chembl kg2-chembl.json
have processed 100000 compounds
have processed 200000 compounds
have processed 300000 compounds
have processed 400000 compounds
have processed 500000 compounds
have processed 600000 compounds
have processed 700000 compounds
have processed 800000 compounds
have processed 900000 compounds
have processed 1000000 compounds
have processed 1100000 compounds
have processed 1200000 compounds
have processed 1300000 compounds
have processed 1400000 compounds
have processed 1500000 compounds
have processed 1600000 compounds
have processed 1700000 compounds
have processed 1800000 compounds
(kg2-venv) ubuntu@ip-172-31-0-169:~/kg2-build$ python ~/kg2-code/misc-tools/find_none_synonym.py kg2-chembl.json
(kg2-venv) ubuntu@ip-172-31-0-169:~/kg2-build$

@saramsey
Member

Expanded error information from the log file:

running merge_graphs.py
+ /home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/merge_graphs.py --kgFiles /home/ubuntu/kg2-build/kg2-ont.json /home/ubuntu/kg2-build/kg2-semmeddb-edges.json /home/ubuntu/kg2-build/kg2-uniprotkb.json /home/ubuntu/kg2-build/kg2-ensembl.json /home/ubuntu/kg2-build/kg2-unichem.json /home/ubuntu/kg2-build/kg2-chembl.json /home/ubuntu/kg2-build/kg2-ncbigene.json /home/ubuntu/kg2-build/kg2-dgidb.json /home/ubuntu/kg2-build/kg2-repodb.json /home/ubuntu/kg2-build/kg2-smpdb.json /home/ubuntu/kg2-build/kg2-drugbank.json /home/ubuntu/kg2-build/kg2-hmdb.json /home/ubuntu/kg2-build/kg2-go-annotation.json /home/ubuntu/kg2-build/kg2-rtx-kg1.json --kgFileOrphanEdges /home/ubuntu/kg2-build/kg2-orphans-edges.json /home/ubuntu/kg2-build/kg2.json
[/home/ubuntu/kg2-build/kg2-ont.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-ont.json] number of nodes added: 7060053
[/home/ubuntu/kg2-build/kg2-semmeddb-edges.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-semmeddb-edges.json] number of nodes added: 38
[/home/ubuntu/kg2-build/kg2-uniprotkb.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-uniprotkb.json] number of nodes added: 26483
[/home/ubuntu/kg2-build/kg2-ensembl.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-ensembl.json] number of nodes added: 67668
[/home/ubuntu/kg2-build/kg2-unichem.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-unichem.json] number of nodes added: 0
[/home/ubuntu/kg2-build/kg2-chembl.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-chembl.json] number of nodes added: 1893001
[/home/ubuntu/kg2-build/kg2-ncbigene.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-ncbigene.json] number of nodes added: 61559
[/home/ubuntu/kg2-build/kg2-dgidb.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-dgidb.json] number of nodes added: 4505
[/home/ubuntu/kg2-build/kg2-repodb.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-repodb.json] number of nodes added: 1
[/home/ubuntu/kg2-build/kg2-smpdb.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-smpdb.json] number of nodes added: 3698755
[/home/ubuntu/kg2-build/kg2-drugbank.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-drugbank.json] number of nodes added: 13564
[/home/ubuntu/kg2-build/kg2-hmdb.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-hmdb.json] number of nodes added: 10001
[/home/ubuntu/kg2-build/kg2-go-annotation.json] reading nodes from file
[/home/ubuntu/kg2-build/kg2-go-annotation.json] number of nodes added: 0
[/home/ubuntu/kg2-build/kg2-rtx-kg1.json] reading nodes from file
[OBO:go/extensions/go-plus.owl] inconsistent category information; keeping original category biolink:RelationshipType and discarding new category biolink:BiologicalProcess: GO:0006461
[OBO:go/extensions/go-plus.owl] inconsistent category_label information; keeping original category_label relationship_type and discarding new category_label biological_process: GO:0006461
Traceback (most recent call last):
  File "/home/ubuntu/kg2-code/merge_graphs.py", line 52, in <module>
    nodes[node_id] = kg2_util.merge_two_dicts(nodes[node_id], node)
  File "/home/ubuntu/RTX/code/kg2/kg2_util.py", line 490, in merge_two_dicts
    ret_dict[key] = list(first_element) + sorted(filter(None, list(set(value + stored_value) - first_element)))
TypeError: '<' not supported between instances of 'NoneType' and 'str'

@saramsey
Member

Just an FYI, since the merge on kg2dev has now completed, I have resumed the build on kg2dev from the point just after the merge.
