SPARQL
"It is hardly surprising that the science they turned to for an explanation of things was divination, the science that revealed connections between words and things, proper names and the deductions that could be drawn from them...
-- Henri-Jean Martin,
The History and Power of Writing
Tim Berners-Lee (Web Inventor): "Trying to use the Semantic Web without SPARQL is like trying to use a relational database without SQL."
SPARQL was not designed to query relational data, but to query data conforming to the RDF data model.
This book's primary goal is to quickly get you comfortable using SPARQL to retrieve and update data and to make the best use of that retrieved data.
Subject (s) -> Predicate (p) -> Object (o) (see:RDF Triple: :Obi-Wan_Kenobi - dbo:occupation -> :Jedi)
(see:02.03.05 Blank Nodes and Why They're Useful)
SPARQL is a recursive acronym for "SPARQL Protocol and RDF Query Language", which is described by a set of specifications from the W3C.
(see:00.2 SPARQL 1.1 Query Language Specification)
(see:Recap RDF from Chapter 1)
RDF - Resource Description Framework - is a general model of how any piece of data, and representations of knowledge, can be expressed as so-called triples.
RDF Triples can be aggregated into graphs with subjects and objects as nodes, and predicates as arcs.
In this data model, you express facts with three-part statements known as triples.
(see:Semantic Triple: Subject (s)-> Predicate (p) -> Object (o))
A URI may look like a URL, and there may actually be a web page at that address, but there might not be; its primary job is to provide a unique name for something, not to tell you about a web page where you can send your browser.
The SPARQL Query Language specification recommends that files storing SPARQL queries have an extension of .rq, in lowercase.
A SPARQL query typically says "I want these pieces of information from the subset of the data that meets these conditions."
You describe the conditions with triple patterns, which are similar to RDF triples but may include variables to add flexibility in how they match against the data.
(see:Command: arq --data datafile.ttl --query queryfile.rq)
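The triple-pattern matching described above can be sketched in a few lines of Python. This is only a toy illustration, not how a real SPARQL processor such as arq works internally; the sample triples and the names `match_pattern` and `select` are assumptions made up for this sketch.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical sample data, modeled on the Obi-Wan example in these notes.
DATA: List[Triple] = [
    (":Obi-Wan_Kenobi", "dbo:occupation", ":Jedi"),
    (":Yoda", "dbo:occupation", ":Jedi"),
    (":Han_Solo", "dbo:occupation", ":Smuggler"),
]


def match_pattern(pattern: Triple, triple: Triple) -> Optional[Dict[str, str]]:
    """Return variable bindings if the pattern matches the triple, else None."""
    bindings: Dict[str, str] = {}
    for pattern_term, data_term in zip(pattern, triple):
        if pattern_term.startswith("?"):
            # A variable matches anything, but a variable that is repeated
            # in the pattern must bind to the same value each time.
            if bindings.setdefault(pattern_term, data_term) != data_term:
                return None
        elif pattern_term != data_term:
            # A constant term must match exactly.
            return None
    return bindings


def select(pattern: Triple, data: List[Triple]) -> List[Dict[str, str]]:
    """Collect the bindings for every triple that matches the pattern."""
    results = []
    for triple in data:
        bindings = match_pattern(pattern, triple)
        if bindings is not None:
            results.append(bindings)
    return results


# Roughly: SELECT ?who WHERE { ?who dbo:occupation :Jedi }
jedi = select(("?who", "dbo:occupation", ":Jedi"), DATA)
print(jedi)
```

The variable `?who` plays the role of the "pieces of information I want", and the two constant terms are the conditions the data must meet.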
Without the OPTIONAL keyword, a SPARQL processor will only return data for a graph pattern if it can match every single triple pattern in that graph pattern. This is a key reason why a query may return nothing.
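A minimal sketch of why a missing triple makes a whole row disappear, and what OPTIONAL changes. It assumes at most one value per subject and predicate, and the data and function names (`lookup`, `name_and_email`) are hypothetical; a real SPARQL processor is far more general.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical data: :leia has a name but no email triple.
DATA: List[Triple] = [
    (":luke", ":name", "Luke"),
    (":luke", ":email", "luke@example.org"),
    (":leia", ":name", "Leia"),
]


def lookup(data: List[Triple], subject: str, predicate: str) -> Optional[str]:
    """Return the first object for (subject, predicate), or None if absent."""
    for s, p, o in data:
        if s == subject and p == predicate:
            return o
    return None


def name_and_email(data: List[Triple], email_optional: bool):
    """Roughly: SELECT ?name ?email WHERE { ?s :name ?name . ?s :email ?email }
    With email_optional=True the second pattern behaves like
    OPTIONAL { ?s :email ?email }."""
    rows = []
    for subject in sorted({s for s, _, _ in data}):
        name = lookup(data, subject, ":name")
        email = lookup(data, subject, ":email")
        if name is None:
            continue  # the required pattern failed, so no row at all
        if email is None and not email_optional:
            continue  # every pattern is required, so the row is dropped
        rows.append((name, email))  # email stays None (unbound) if missing
    return rows


required = name_and_email(DATA, email_optional=False)
optional = name_and_email(DATA, email_optional=True)
print(required)
print(optional)
```

With every pattern required, Leia vanishes from the results; with the email pattern optional, she is returned with an unbound email.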
(see:Public DBpedia SPARQL endpoint: https://dbpedia.org/snorql/)
A SPARQL Endpoint is a Point of Presence on an HTTP network that's capable of receiving and processing SPARQL Protocol requests.
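As a rough sketch of what a SPARQL Protocol request looks like, the snippet below builds the URL for a query sent via HTTP GET, where the query text travels in a URL-encoded `query` parameter. The endpoint address `https://dbpedia.org/sparql` is an assumption here (the notes only link the SNORQL front end), and no network request is actually made.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: DBpedia's public endpoint lives at this address.
ENDPOINT = "https://dbpedia.org/sparql"

QUERY = """
SELECT ?p ?o
WHERE { <http://dbpedia.org/resource/Obi-Wan_Kenobi> ?p ?o }
LIMIT 5
"""


def build_request_url(endpoint: str, query: str) -> str:
    """Build a GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_request_url(ENDPOINT, QUERY)
print(url)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY.strip())
```

An HTTP client would then fetch this URL, typically with an Accept header asking for a result format such as `application/sparql-results+json`.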
The semantic web is a set of standards and best practices for sharing data and the semantics of that data over the Web for use by applications.
Use Linked Data as a set of best practices for sharing data across the web infrastructure so that applications (not human beings!) can more easily retrieve data from public sites with no need for screen scraping.
The Linked Open Data Cloud -- http://lod-cloud.net/
"The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." -- Tim Berners-Lee, James Hendler, Ora Lassila: The Semantic Web, Scientific American, 284(5), pp. 34-43(2001)
The Semantic Web is an extension of the traditional Web
The meaning of information (Semantics) is made explicit by formal (structured) and standardized knowledge representations (Ontologies).
to process the meaning of information automatically
to relate and integrate heterogeneous data
to deduce implicit (not evident) information from existing (evident) information in an automated way
The Semantic Web is a kind of global database that contains a universal network of semantic propositions.
(see:A Web of Data)
Obi-Wan Kenobi: https://dbpedia.org/resource/Obi-Wan_Kenobi
:Obi-Wan_Kenobi - dbo:occupation -> :Jedi
RDFS - RDF Schema: enables the capability to form models (https://dbpedia.org/ontology/Agent)
URL: Uniform Resource Locator, the most common type of URI; it usually uses a specific protocol to locate a resource on the World Wide Web.
URN: Uniform Resource Name, a type of URI that uses the urn: scheme; a typical example is the ISBN system.
RDF-related syntaxes such as Turtle, N3, and SPARQL use the <> brackets to tell a processor that something is an actual URI and not just some string of characters that begins with "http://".
The URIs that identify RDF resources are like the unique ID fields of relational database tables, except that they're universally unique, which lets you link data from different sources around the world instead of just linking data from different tables in the same database.
IRI: Internationalized Resource Identifier. Compared to a URI, its characters can be any Unicode characters, which includes Chinese, Japanese, Korean, and so on.
The SPARQL Query Language specification refers to IRIs when it talks about naming resources, and not to URIs or URLs, because IRI is the most inclusive term.
With a namespace (a set of names used for a particular purpose), it is possible to distinguish between different senses of a word.
W3C: released a spec describing how XML developers could say that certain terms come from specific namespaces, so that they could distinguish between different senses of a word.
(see:02.03 The Resources Description Framework (RDF))
(see:The basics of RDF)
RDF is a data model in which the basic unit of information is known as a triple.
A triple consists of a subject, a predicate, and an object. You can also think of these as a resource identifier, an attribute or property name, and an attribute or property value.
To remove any ambiguity from the information stated by a given triple, the triple's subject and predicate must be URIs. (We can use prefixed names in place of URIs)
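To make the "prefixed names in place of URIs" idea concrete, here is a small sketch that expands a prefixed name into a full URI. The prefix table and the function name `expand` are assumptions for illustration; in Turtle or SPARQL the mapping would come from the document's own prefix declarations.

```python
from typing import Dict

# Hypothetical prefix declarations; the empty prefix stands for names like :Jedi.
PREFIXES: Dict[str, str] = {
    "dbo": "http://dbpedia.org/ontology/",
    "": "http://dbpedia.org/resource/",
}


def expand(prefixed_name: str, prefixes: Dict[str, str]) -> str:
    """Expand 'prefix:local' into a full URI using the prefix table.

    This sketch expects a prefixed name, not a full URI such as 'http://...'.
    """
    prefix, separator, local = prefixed_name.partition(":")
    if not separator:
        raise ValueError("not a prefixed name: " + prefixed_name)
    if prefix not in prefixes:
        raise KeyError("undeclared prefix: " + prefix)
    return prefixes[prefix] + local


triple = (":Obi-Wan_Kenobi", "dbo:occupation", ":Jedi")
expanded = tuple(expand(term, PREFIXES) for term in triple)
print(expanded)
```

The prefixed form is only shorthand: after expansion, every term is an unambiguous full URI.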
The technical term for saving RDF as a string of bytes that can be saved on a disk is serialization.
All RDF serializations so far have been text files that used different syntaxes to represent the triples.
The serialization format is called Turtle.
The simplest format is called N-Triples, in which you write out complete URIs inside of angle brackets and strings inside of quotation marks. Each triple is on its own line with a period at the end. (N-Triples files use the extension .nt)
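A minimal sketch of writing triples in the N-Triples layout described above: full URIs in angle brackets, strings in quotation marks, one triple per line, each ending with a period. The `Literal` wrapper and the function names are assumptions for this sketch, and it only handles plain string literals (no language tags, datatypes, or blank nodes).

```python
from dataclasses import dataclass
from typing import Iterable, Tuple, Union


@dataclass(frozen=True)
class Literal:
    """Hypothetical wrapper marking a term as a string rather than a URI."""
    value: str


Term = Union[str, Literal]


def format_term(term: Term) -> str:
    if isinstance(term, Literal):
        # Escape backslashes first, then quotes and newlines.
        escaped = (
            term.value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        return '"' + escaped + '"'
    return "<" + term + ">"  # anything else is treated as a full URI


def to_ntriples(triples: Iterable[Tuple[Term, Term, Term]]) -> str:
    lines = []
    for s, p, o in triples:
        lines.append(f"{format_term(s)} {format_term(p)} {format_term(o)} .")
    return "\n".join(lines) + "\n"


obi_wan = "http://dbpedia.org/resource/Obi-Wan_Kenobi"
document = to_ntriples([
    (obi_wan, "http://dbpedia.org/ontology/occupation",
     "http://dbpedia.org/resource/Jedi"),
    (obi_wan, "http://www.w3.org/2000/01/rdf-schema#label",
     Literal("Obi-Wan Kenobi")),
])
print(document, end="")
```

The result could be saved to a file with the .nt extension.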
The oldest RDF serialization, RDF/XML, was part of the original RDF specification in 1999.
The best way to store RDF is in a database manager optimized for RDF triples; we call this a triplestore.
https://arangodb.com/community-server/
(see:ArangoDB Community Edition)
(see:GraphDB Lite (OntoText))
(see:AnzoGraph DB (Cambridge Semantics))
(see:DataStax Enterprise Graph)
(see:Memgraph Enterprise/Cloud Edition)
(see:Neo4j graph database)
(see:ONgDB Enterprise)
(see:RedisGraph)
(see:Memgraph Community Edition)
(see:Neo4j Community)
(see:ONgDB Community)
Apache’s Fuseki, along with the entire Jena project and all its plugins, is still actively developed as of October 2020. It supports SPARQL 1.1 Update and gets new features and enhancements with each new release, which takes place every quarter or so. We know that Fuseki can scale to loading the entire Wikidata dump.
Blazegraph, previously known as Bigdata, is a great triplestore that scales to billions of triples with thousands of proven use cases. In fact, it was so good that AWS bought the Blazegraph trademark almost five years ago and hired some of its staff, including the CEO. Unfortunately, that meant that most of Blazegraph’s development experience was used to create a competing product: Amazon Neptune. Although the official releases of Blazegraph have slowed down, it still supports SPARQL 1.1 and is by no means outdated.
(see:Amazon Neptune)
BrightstarDB is an RDF triple store. It does not require the definition of a database schema, and with the RDF data model, it can easily add and integrate data of all shapes. The core libraries have a small footprint and install with zero configuration for embedded applications.
Cayley is an open-source graph database inspired by the graph database behind Freebase and Google's Knowledge Graph.
Filament is a graph persistence framework and associated toolkits based on a navigational query style. A default persistence engine is included for storing graph objects and properties into simple relational tables but the actual storage model is pluggable.
GraphDB Lite is a free RDF triplestore that can store up to 100 million triples on a desktop computer. This version of GraphDB can be easily deployed using Java. SPARQL 1.1 queries are performed in memory, not using file-based indices. Reasoning operations for inferencing are supported in GraphDB Lite.
Graph Engine = RAM Store + Computation Engine + Graph Model
Graph Engine (GE) is a distributed in-memory data processing engine, underpinned by a strongly-typed RAM store and a general distributed computation engine.
The distributed RAM store provides a globally addressable high-performance key-value store over a cluster of machines. Through the RAM store, GE enables the fast random data access power over a large distributed data set.
The capability of fast data exploration and distributed parallel computing makes GE a natural large graph processing platform. GE supports both low-latency online query processing and high-throughput offline analytics on billion-node large graphs.
HyperGraphDB is a general purpose, open-source data storage mechanism based on a powerful knowledge management formalism known as directed hypergraphs. Designed mostly for knowledge management, AI and semantic web projects, it can also be used as an embedded object-oriented database for Java projects of all sizes.
- Gremlin Query Language
MapGraph API makes it easy to develop high performance graph analytics on GPUs. The API is based on the Gather-Apply-Scatter (GAS) model as used in GraphLab. To deliver high performance computation and efficiently utilize the high memory bandwidth of GPUs, MapGraph's CUDA kernels use multiple sophisticated strategies, such as vertex-degree-dependent dynamic parallelism granularity and frontier compaction.
Neo4j is an open-source graph database, implemented in Java and described as an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables.
OrientDB is a 2nd Generation Distributed Graph Database with the flexibility of Documents in one product. It can store 220,000 records per second on common hardware. Even for a Document based database, the relationships are managed as in Graph Databases with direct connections among records.
Orly is a non-relational database, meant to be fast and to scale for billions of users. Orly provides a single path to data and will eliminate our need for memcache due to its speed and high concurrency.
sones GraphDB is an object-orientated graph data storage for a large amount of highly connected semi-structured data in a distributed environment.
Weaver is a distributed graph store that provides horizontal scalability, high performance, and strong consistency. Weaver enables users to execute transactional graph updates and queries through a simple Python API.
(see:03.04 Searching Further in the Data)
The underscore "_" prefix means that this is a special kind of node known as a blank node or bnode.
WHERE { ?s <...> [ <> ?c ] }
[ ... content ... ] is a convenience in Turtle and SPARQL to introduce a bnode (typically in an object position) and add some properties of that bnode.
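To see what the square-bracket shorthand stands for, the sketch below expands it into plain triples by generating a fresh blank node label. The function names, the `_:b1`-style labels, and the address data are all hypothetical; real parsers choose their own blank node labels.

```python
import itertools
from typing import List, Tuple

Triple = Tuple[str, str, str]

_ids = itertools.count(1)


def new_bnode() -> str:
    """Generate a fresh blank node label such as _:b1 (the scheme is made up)."""
    return "_:b" + str(next(_ids))


def expand_brackets(
    subject: str, predicate: str, properties: List[Tuple[str, str]]
) -> List[Triple]:
    """Expand  subject predicate [ p1 o1 ; p2 o2 ]  into plain triples."""
    bnode = new_bnode()
    triples = [(subject, predicate, bnode)]  # the bnode sits in object position
    for p, o in properties:
        triples.append((bnode, p, o))  # its properties hang off the bnode
    return triples


# Turtle shorthand being expanded (hypothetical data):
#   :Obi-Wan_Kenobi :address [ :city "Mos Eisley" ; :planet :Tatooine ] .
triples = expand_brackets(
    ":Obi-Wan_Kenobi",
    ":address",
    [(":city", '"Mos Eisley"'), (":planet", ":Tatooine")],
)
for t in triples:
    print(t)
```

The blank node has no URI of its own; it exists only to group the properties inside the brackets.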
When you assign a name to a set of triples, you can then assign metadata to that set of triples; we call such named subsets of the data named graphs.
In an RDF database, a named graph is what we call a subset of our data that has been given a unique label (name). A graph database can contain any number of named graphs alongside its default graph, and each fact can be present in or absent from any graph.
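One way to picture a default graph plus named graphs is a mapping from graph name to a set of triples. The sketch below is only an in-memory illustration under that assumption; the `Dataset` class and the graph name `:starwars` are hypothetical and do not correspond to any particular triplestore's API.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

DEFAULT_GRAPH = None  # in this sketch the unnamed default graph is keyed by None


class Dataset:
    """A default graph plus any number of named graphs."""

    def __init__(self) -> None:
        self._graphs: Dict[Optional[str], Set[Triple]] = defaultdict(set)

    def add(self, triple: Triple, graph: Optional[str] = DEFAULT_GRAPH) -> None:
        self._graphs[graph].add(triple)

    def triples(self, graph: Optional[str] = DEFAULT_GRAPH) -> Set[Triple]:
        # .get avoids creating an empty entry for an unknown graph name
        return set(self._graphs.get(graph, set()))

    def graphs_containing(self, triple: Triple) -> List[str]:
        """Names of the named graphs in which the fact is present."""
        return sorted(
            name
            for name, members in self._graphs.items()
            if name is not None and triple in members
        )


ds = Dataset()
fact = (":Obi-Wan_Kenobi", "dbo:occupation", ":Jedi")
ds.add(fact)                     # present in the default graph
ds.add(fact, graph=":starwars")  # and also in a named graph
ds.add((":Yoda", "dbo:occupation", ":Jedi"), graph=":starwars")

print(ds.graphs_containing(fact))
print(len(ds.triples(":starwars")))
```

Because each graph has a name, metadata about a whole set of triples can be stated as ordinary triples whose subject is that name.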
RDF Schema and the RDF-based Web Ontology Language (OWL) add a typing mechanism to classify subjects and objects into hierarchies.
(see:Online Ontology)
- S rdfs:isDefinedBy O
Google, Bing and Yahoo use OWL to publish a joint vocabulary, for example: http://schema.org/City
Without defining a large, complex ontology, many RDF developers use just a few classes and properties from OWL to add metadata to their triples.
(see:Use Linked Data as a set of best practices for sharing data across the web infrastructure so that applications (not human beings!) can more easily retrieve data from public sites with no need for screen scraping.)
by Tim Berners-Lee 2006
by Tim Berners-Lee, 2010
**: sharing it in a machine-readable format (as opposed to a scan of a fax), regardless of the format
***: sharing data on the Web using a nonproprietary format, such as comma-separated values instead of Microsoft Excel spreadsheets
****: putting shared data in a Linked Data format, in which concepts were identified by URLs so that we could more easily cross-reference them with other data
*****: connecting the data to other data, by providing links to related data, especially links that make use of the URLs in the data
03.01.01 Using the Labels Provided by DBpedia - https://dbpedia.org/snorql/
(see:01.05 What Could Go Wrong?)
(see:07.01.02 OPTIONAL Is Very Optional)
(see:02.03.05 Blank Nodes and Why They're Useful)
(see:01.04 Searching for Strings)
(see:02.03.06 Named Graphs)