Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 73 - make axioms.owl update #93

Merged
merged 16 commits into from
Apr 9, 2020

Conversation

wdduncan
Copy link
Member

To test making the axioms.owl target, I have created the file my-Makefile. It contains a subset of the original Makefile. It is run using the -f flag: make -f my-Makefile axioms.owl

Current targets in my-Makefile are:

  • exposure-xco.obo
  • exposure-zeco.obo
  • exposure-peco.obo
    • In your comments, you say that this is now called pheco, but I couldn't find that ontology. Also, this has replaced eo.
  • exposure-ncit.obo
  • exposure-go.obo
  • exposure-exo.obo
  • exposure-ecto.obo
  • slim-exposure.obo
  • axioms.owl

Targets that still need to be re-worked are:

  • probs.tsv (planning to focus on this next)
  • exposure-mesh.obo
  • exposure-snomed.obo
  • exposure-mre.obo (what is mre?)
  • access-ncit.obo
  • exposure-wikidata.obo

I tried to use robot where I could but encountered difficulties.
Issue 1: memory.
Building exposure-ncit.obo had memory issues.
Using the query:

prefix xsd: <http://www.w3.org/2001/XMLSchema#> 
select ?id 
where
	{
		?id ?p ?o .
		filter ( !isblank(?id) )
		filter ( datatype(?o) = xsd:string )
		filter ( contains(?o, "xposure") )
	}

I got robot query to work by upping the memory to 16G:

export ROBOT_JAVA_ARGS=-Xmx16G && robot query \
	  -i tmp-in.obo \
	  -q xposure.sparql tmp-out.obo

However, this was very slow (over 5 minutes).

Using obo-grep.pl and obo-strip.pl was much faster:

obo-grep.pl -r ".*xposure.*" tmp-in.obo > tmp-xposure.obo # find obo stanzas containing *xposure*. 
obo-strip.pl -t id tmp-xposure.obo | grep id: | cut -f2 -d' ' | sed 's/NCIT:/http:\/\/purl\.obolibrary\.org\/obo\/NCIT_/g' > tmp-ids

## run robot filter to get any subclasses of the NCIT classes
robot filter -i tmp-xposure.obo --term-file tmp-ids --trim false --select "self children" -o tmp-out.obo

For building exposure.obo, robot extract worked fine, but using obo-grep.pl was better for filtering the obo file:

robot extract --method top -i tmp-in.obo -t GO:0050896 -o tmp-out.obo
obo-grep.pl -r 'name: response to' tmp-out.obo > tmp-go.obo

Issue 2: obo format error*
When I used robot merge on the obo files:

robot merge \
	  -i exposure-zeco.obo \
	  -i exposure-peco.obo \
	  -i exposure-exo.obo \
	  -i exposure-xco.obo \
	  -i exposure-ecto.obo \
	  -o tmp.obo

it threw the error:

OBO STRUCTURE ERROR Ontology does not conform to OBO structure rules:
multiple def tags not allowed. in frame:Frame(ENVO:00010483 id( ENVO:00010483)synonym( portion of environmental material EXACT)name( environmental material)namespace( plant_experimental_conditions_ontology)def( A portion of environmental material is a fiat object part which forms the medium or part of the medium of an environmental system.[DOI:10.1186/2041-1480-4-43 MA:ma ORCID:0000-0002-4366-3088 URL:http://ontology.buffalo.edu/smith/articles/niches.html ])def( A portion of environmental material is a fiat object which forms the medium or part of the medium of an environmental system.[DOI:10.1186/2041-1480-4-43 MA:ma ORCID:0000-0002-4366-3088 URL:http://ontology.buffalo.edu/smith/articles/niches.html ])is_a( BFO:0000040)is_a( BFO:0000024))

So instead, I used obo-cat.pl:

obo-cat.pl $^ | grep -v ^namespace: | grep -v ^property_value: | ./add-syns.pl > $@

Not sure of the best way to address this. Plus, it is nice to call obo-cat.pl with $^. Maybe a similar feature can be added to robot? I didn't test this, but maybe --inputs will work with "file1 file2". The documentation specifies a regular expression. So maybe this would work: --inputs "file1|file2".

@wdduncan wdduncan requested a review from cmungall March 28, 2020 17:55
@jamesaoverton
Copy link
Collaborator

I know very little about OBO format but here are some notes on ROBOT that I hope are helpful.

ROBOT is designed for OWL and almost every ROBOT operation uses OWLAPI to load ontologies. The main exception is robot query which primarily uses Apache Jena. OWLAPI is pretty heavy.

OWLAPI has code for reading/writing OBO format, which is what ROBOT uses. Like me, ROBOT knows very little about OBO format. Loading NCIT into OWLAPI or Jena is never going to be as fast as scanning through an OBO format file as text. As an aside, we are starting to play with a streaming XML processor for MIREOT, and handling ncit.owl is a use case: ontodev/robot#650

The OBO STRUCTURE ERROR is thrown by the OBO format code in OWLAPI, but robot convert has a --check flag to ignore that error:

http://robot.obolibrary.org/errors#obo-structure-error
http://robot.obolibrary.org/convert#converting-to-obo-format

@wdduncan
Copy link
Member Author

@jamesaoverton
Thanks for the info about --check false.

@cmungall
Copy link
Member

Plus, it is nice to call obo-cat.pl with $^

I assume this is in the context of a Makefile

Note that in the general case you can use make commands like this $(patsubst %,-i %,%^)

    robot COMMAND $(patsubst %,-i %,%^) OTHER_ARGS

It's a bit more verbose but not a big deal

But that is often not necessary

I didn't test this, but maybe --inputs will work with "file1 file2". The documentation specifies a regular expression. So maybe this would work: --inputs "file1|file2".

No need to guess, the docs are clear:

http://robot.obolibrary.org/merge

You can also specify merging of multiple files that match a pattern with --inputs. The argument to --inputs must be a quoted wildcard pattern. This option supports ? to match any single character, or * to match any number of characters.

This command will merge edit.owl and edit2.owl:

robot merge --inputs "edit*.owl" --output results/merged2.owl

This will work with $^ in Makefiles

It may be more unixesque to accept open ended lists of files as final argument but I'm happy with existing robot idioms, and introducing changes now would be disruptive.

Copy link
Member

@cmungall cmungall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for a my-Makefile. Be bold! You own the Makefile, make the changes there!

@cmungall
Copy link
Member

Regarding the speed issues in sparql queries. @balhoff has done some investigation here, and has made changes to the GO pipeline to speed things up by avoiding multiple invocations (the speed issue is frequently to do with initial load using owlapi, followed by conversion to Jena model, as James indicates).

I know you have also looked into the TDB option of ROBOT and found that helped. We've also discussed whether other triplestores, embedded or as a server, would help here. We can imagine this being orchestrated by ROBOT, as per the TDB option. We can also imagine workflows where a triplestore is loaded from multiple sources at the outset with subsequent queries happening on that. The beauty is the same SPARQL queries will work either way!

@balhoff
Copy link

balhoff commented Mar 31, 2020

When ROBOT gets ready to answer SPARQL queries, it first converts the ontology to a Jena Model, then adds the model to a Jena Dataset. On my laptop, for the GO build, the first part takes ~30 seconds, then the second step takes ~225 seconds. So it is important not to do this repeatedly if you are concerned about time.

@jamesaoverton
Copy link
Collaborator

@cmungall ROBOT calling an arbitrary SPARQL endpoint is worth discussing, but I'm not 100% convinced yet. Maybe a shared Python script would be just as good.

@balhoff Yes, by default ROBOT loads the ontology with OWLAPI, then converts to a Turtle string and loads into Jena (then converts back to OWLAPI after SPARQL Update). This is slow but allows for reading any OWL format and chaining. But with the robot query --tdb true flag we try to be smarter and load directly into Jena. If we can do it faster, let me know.

@balhoff
Copy link

balhoff commented Mar 31, 2020

@jamesaoverton

If we can do it faster, let me know.

It may be possible to speed up the model creation slightly by avoiding the intermediate string, but adding to the dataset takes by far the most time, and it is a single method call into Jena. I'm not sure if there are any options for speeding that up.

This is the slow part: https://github.com/ontodev/robot/blob/cca7343b5c17a29fc99b5e3ac81b75c9d73158a1/robot-core/src/main/java/org/obolibrary/robot/QueryOperation.java#L103

@wdduncan
Copy link
Member Author

@cmungall
I will edit make file with new steps.

@wdduncan
Copy link
Member Author

@jamesaoverton @balhoff
From the discussion I'm not sure if sure if robot query --tdb true would help with loading and querying ncit. Originally, I had hoped the ontology could be queried just referencing the purl and not having to go through a download step.
I agree that loading robot multiple times is not efficient. The largest ontologies that will need to be queried are ncit, go, and (eventually) snomed. Perhaps I could put these into tdb once and then query when the appropriate goals are triggered in the Makefile.

@wdduncan
Copy link
Member Author

@jamesaoverton

Ran the following in my makefile to merge three ontologies:

test-robot: exposure-peco.obo  exposure-xco.obo  exposure-zeco.obo
	robot convert -i exposure-peco.obo -o exposure-peco-tmp.obo
	robot convert -i exposure-xco.obo -o exposure-xco-tmp.obo
	robot convert -i exposure-zeco.obo -o exposure-zeco-tmp.obo


	mv exposure-peco-tmp.obo exposure-peco.obo
	mv exposure-xco-tmp.obo exposure-xco.obo
	mv exposure-zeco-tmp.obo exposure-zeco.obo

	robot merge $(patsubst %,-i %, $^) -o merge-test.obo

The purpose of the first three commands was to check if the robot would read and output the obo files.
This runs fine.

However, when I run the robot merge $(patsubst %,-i %, $^) -o merge-test.obo, I receive the error:

OBO STRUCTURE ERROR Ontology does not conform to OBO structure rules:
multiple def tags not allowed. in frame:Frame(ENVO:00010483 id( ENVO:00010483)synonym( portion of environmental material EXACT)name( environmental material)namespace( plant_experimental_conditions_ontology)def( A portion of environmental material is a fiat object part which forms the medium or part of the medium of an environmental system.[DOI:10.1186/2041-1480-4-43 MA:ma ORCID:0000-0002-4366-3088 URL:http://ontology.buffalo.edu/smith/articles/niches.html ])def( A portion of environmental material is a fiat object which forms the medium or part of the medium of an environmental system.[DOI:10.1186/2041-1480-4-43 MA:ma ORCID:0000-0002-4366-3088 URL:http://ontology.buffalo.edu/smith/articles/niches.html ])is_a( BFO:0000040)is_a( BFO:0000024))

I also tested running:

robot merge -i exposure-peco.obo -i exposure-xco.obo -i exposure-zeco.obo -o merge-test.obo

and received the same error. Any thoughts on how to address this?

Going to stick with obo-cat.pl approach for now.

@jamesaoverton
Copy link
Collaborator

jamesaoverton commented Mar 31, 2020

@wdduncan That is weird, and I can't think of a workaround off the top of my head that doesn't involve just working with .owl files. I'm really not the guy to ask about OBO format stuff.

Regarding your previous comment: One call to robot query can run multiple SPARQL queries. If you can't run all your queries at once, then yes it should be much more efficient to reuse TDB directories.

@matentzn
Copy link
Contributor

matentzn commented Apr 1, 2020

Maybe --check false works on merge as well? Or, if not, sed the culprits from OBO files. Thats what I usually do if I need valid OBO (the error pertains to the fact that you can gave

def: "Definition 1"
def: "Definition 2"

OBO only allows one def: tag, so the other needs to be removed. Ah, and one alternative is to merge them into owl, and then use convert --check false to transform back into OBO as a seconds step.

@wdduncan
Copy link
Member Author

wdduncan commented Apr 2, 2020

I've removed a lot from the makefile. Please let me know what I need to add back in.
Also, the makefile now includes a snomed mapping, but the this info isn't being picked up by rdfmatch.

@cmungall help?

format-version: 1.2
ontology: rad/axioms

[Term]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, a chemical entity should never be equivalent to an exposure to a chemical entity. We should filter these

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that need to be addressed before this PR can be merged?

@cmungall cmungall merged commit 4c7ce71 into EnvironmentOntology:master Apr 9, 2020
@cmungall
Copy link
Member

cmungall commented Apr 9, 2020

@wdduncan - I moved my-Makefile to Makefile but it seems to be have syntax errors:

$ make slim-exposure.obo
Makefile:118: *** missing separator.  Stop.

cmungall added a commit that referenced this pull request Apr 9, 2020
@wdduncan
Copy link
Member Author

I update the Makefile in master to be the correct one.

@wdduncan wdduncan deleted the issue-73 branch April 10, 2020 00:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants