Patch performance #16

Merged
merged 29 commits into from
Feb 20, 2018
29 commits
8eabf8a
timing notes
cboettig Feb 16, 2018
21e50ae
Much faster, cleaner parsing of SPARQL returns
cboettig Feb 17, 2018
e0e90b1
testing
cboettig Feb 17, 2018
3b4c5f9
tweaking
cboettig Feb 17, 2018
c11eb84
Successful & fast rdf-join :rocket: :sparkles:
cboettig Feb 17, 2018
86376d8
move ex notebook to notebook/
cboettig Feb 17, 2018
702dac3
datalake
cboettig Feb 17, 2018
47875f4
data lake showing gh api example
cboettig Feb 17, 2018
b2d07ad
clean up tmp
cboettig Feb 17, 2018
9fed41c
run results using full lake
cboettig Feb 17, 2018
8d0a4b8
be better about cleaning up temp files
cboettig Feb 17, 2018
0d5c02b
add libs, run full data ex
cboettig Feb 18, 2018
06170d4
make 'data-lake.Rmd' into vignette
cboettig Feb 19, 2018
3bf5216
data lake example
cboettig Feb 19, 2018
26f1beb
suggest nycflights13 data
cboettig Feb 19, 2018
a49fb6a
rdf_add can handle NA as a blank node
cboettig Feb 20, 2018
e8f2927
c() method use turtle to save disk space
cboettig Feb 20, 2018
887294a
parser and serializer will guess format
cboettig Feb 20, 2018
39074cf
cleaning up as_rdf methods
cboettig Feb 20, 2018
481a329
datatype should not be assigned to blank nodes
cboettig Feb 20, 2018
5574938
use rdflib_base_uri throughout
cboettig Feb 20, 2018
3c3aceb
avoid c() by passing rdf arg
cboettig Feb 20, 2018
38598ea
option to reconnect to an existing database
cboettig Feb 20, 2018
7a02e04
indicate storage type in rdf() constructor instead
cboettig Feb 20, 2018
8d55f78
tests
cboettig Feb 20, 2018
7fe8fb6
good practice
cboettig Feb 20, 2018
4332540
newline
cboettig Feb 20, 2018
9efce89
update pkgdown
cboettig Feb 20, 2018
ec81511
skip has_bdb on appveyor
cboettig Feb 20, 2018
datalake
cboettig committed Feb 17, 2018
commit 702dac3cf5410bc000dc0859597cf8f1cf5beb06
112 changes: 112 additions & 0 deletions inst/notebook/data-lake.Rmd
@@ -0,0 +1,112 @@
---
title: Data Lake RDF
output: github_document
---

```{r include=FALSE}
knitr::opts_chunk$set(message=FALSE, warning = FALSE)
```

```{r libraries}
library(nycflights13)
library(tidyverse)
library(rdflib)

## experimental methods
source(system.file("examples/as_rdf.R", package="rdflib"))
```

```{r options}
#options(rdflib_storage = "BDB") ## may also be much slower...
options(rdflib_storage = "memory")
```



```{r include = FALSE}
# Use a smaller dataset if we do not have a BDB backend:
#if(!rdf_has_bdb()){
flights <- flights %>%
  filter(distance > 2600) # try smaller dataset
#}
```



## Tidyverse Style

Operations in `dplyr` on the `nycflights13` dataset are easy to write and fast to execute:

```{r tidyverse}
df <- flights %>%
  left_join(airlines) %>%
  left_join(planes, by="tailnum") %>%
  select(carrier, name, manufacturer, model) %>%
  distinct()
head(df)
```

## RDF Data Lake

In RDF, we simply toss all of our data into the triplestore, or to use a more evocative metaphor, the "Data Lake." We can then extract whatever tabular structure we need
by querying the data lake using SPARQL, something sometimes referred to as "schema-on-read,"
since we are specifying the desired format of the data when we pull it out of the lake.

This can serve as a very effective means of data integration, provided URIs are used reasonably consistently and diligently to identify subjects and properties (predicates), since just about any data can be added to the lake without worrying about whether it comes in a schema that matches the existing architecture of the database. This flexibility, not having to define your database schema at the start, is the primary strength of the RDF approach.

Okay, let's dump the `nycflights13` data into the data lake. First, the foreign keys in any table must be represented as URIs and not literal strings:

```{r}
as_uri <- function(x, base_uri = "x:") paste0(base_uri, x)
uri_flights <- flights %>%
  mutate(tailnum = as_uri(tailnum),
         carrier = as_uri(carrier))
```
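
As a quick illustration of what this helper produces, the sketch below applies the same `as_uri()` to a made-up vector of tail numbers (the values are hypothetical, not taken from the dataset). One thing worth noticing is that `paste0()` turns a missing value into the string `"x:NA"`, so rows with a missing foreign key may need separate handling. The `as_uri_keep_na()` variant is only an assumption about how one might deal with that; it is not part of `rdflib`.

```r
# Same helper as in the notebook.
as_uri <- function(x, base_uri = "x:") paste0(base_uri, x)

# Hypothetical tail numbers, one of them missing.
tail_numbers <- c("N00001", NA)

# paste0() coerces NA to the string "NA", so the missing key becomes "x:NA".
plain <- as_uri(tail_numbers)

# Hypothetical variant that keeps missing keys missing.
as_uri_keep_na <- function(x, base_uri = "x:") {
  ifelse(is.na(x), NA_character_, paste0(base_uri, x))
}
kept <- as_uri_keep_na(tail_numbers)

print(plain)
print(kept)
```

Whether a missing key should stay missing or become a shared `x:NA` node depends on how the downstream RDF methods treat `NA`, which this sketch does not try to settle.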


Similarly, when reading a table into RDF we have to declare its key column,
and again establish a `base_uri`, which allows RDF methods to distinguish between URIs (subjects, predicates, and foreign keys) and literal strings.

```{r write_rdf, results='hide', message=FALSE, warning=FALSE}
system.time(
  rdf <- c(
    as_rdf(airlines, "carrier", "x:"),
    as_rdf(planes, "tailnum", "x:"),
    as_rdf(uri_flights, NULL, "x:")
  )
)
```

Note that `flights` does not have a natural key: somewhat surprisingly, the `flight` number is not a unique key for this table, as the same flight number is reused on the same route at different times. So we treat each row as a unique anonymous key by setting the key to `NULL`.
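
Whether a column can serve as a key is easy to check before choosing it. The following is a minimal sketch in base R on a hypothetical toy table standing in for `flights` (the rows and the `is_unique_key()` helper are illustrative assumptions, not part of `rdflib` or `nycflights13`):

```r
# Hypothetical toy table: the same flight number appears on two different days.
toy_flights <- data.frame(
  flight    = c(15L, 15L, 51L),
  day       = c(1L, 2L, 1L),
  dep_delay = c(6L, 14L, -8L)
)

# A set of columns is a candidate key only if no combination of values repeats.
# anyDuplicated() returns 0 when there are no duplicates.
is_unique_key <- function(df, cols) {
  anyDuplicated(df[, cols, drop = FALSE]) == 0
}

flight_is_key     <- is_unique_key(toy_flights, "flight")
flight_day_is_key <- is_unique_key(toy_flights, c("flight", "day"))

print(flight_is_key)
print(flight_day_is_key)
```

When no column or small combination of columns passes such a check, falling back to one anonymous identifier per row, as done above with `NULL`, is a reasonable default.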

## Schema on read

We simply define the columns we want and we immediately get back the desired `data.frame`:


```{r query}
s <-
'SELECT ?carrier ?name ?manufacturer ?model ?dep_delay
WHERE {
?flight <x:tailnum> ?tailnum .
?flight <x:carrier> ?carrier .
?flight <x:dep_delay> ?dep_delay .
?tailnum <x:manufacturer> ?manufacturer .
?tailnum <x:model> ?model .
?carrier <x:name> ?name
}'

system.time(
  df <- rdf_query(rdf, s)
)

head(df)
```

Note that in place of joins, we give more semantically meaningful statements about the data:
e.g. `manufacturer` is a property of a `tailnum` (corresponding to a particular physical aircraft), not of a `flight` number. Departure delay `dep_delay` is a property of a flight, not of an aircraft (`tailnum`).

This is reminiscent of the way in which these data are organized in the relational database tables to begin with: we find `dep_delay` in the `flights` table and `manufacturer` in the `planes` table. Good relational design encourages this, but to work with the data the user is often left to do the required joins, which also creates tables where these semantics are less clear.

Tabular formats can often be sloppy about what is a key and what is a literal value, and also about whether a column with the same name in different tables means the same thing in both. Both of these things pose challenges for later use when joining data. RDF representation encourages greater discipline through the use of URIs (though we've run a bit roughshod over that with the cavalier use of `x:` here).
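
The same-name problem shows up in this very dataset: both `flights` and `planes` have a `year` column, one for the year of the flight and the other for the year of manufacture, which is why the join above specifies `by="tailnum"`. A minimal base R sketch on hypothetical toy tables (the values are made up for illustration) shows what happens with and without that care:

```r
# Hypothetical toy tables; `year` means something different in each.
toy_flights <- data.frame(tailnum = c("N1", "N2"), year = c(2013L, 2013L))
toy_planes  <- data.frame(tailnum = c("N1", "N2"), year = c(1999L, 2004L))

# Joining on every shared column name treats the two `year` columns as the
# same variable, so no rows match.
natural <- merge(toy_flights, toy_planes)

# Joining on the real key keeps both columns, disambiguated only by a suffix.
joined <- merge(toy_flights, toy_planes, by = "tailnum")

print(nrow(natural))
print(names(joined))
```

With distinct predicate URIs for the two meanings of `year`, the ambiguity would not arise in the first place; the shared `x:` prefix used in this notebook does not provide that separation.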
114 changes: 114 additions & 0 deletions inst/notebook/data-lake.md
@@ -0,0 +1,114 @@
Data Lake RDF
================

``` r
library(nycflights13)
library(tidyverse)
library(rdflib)

## experimental methods
source(system.file("examples/as_rdf.R", package="rdflib"))
```

``` r
#options(rdflib_storage = "BDB") ## may also be much slower...
options(rdflib_storage = "memory")
```

Tidyverse Style
---------------

Operations in `dplyr` on the `nycflights13` dataset are easy to write and fast to execute:

``` r
df <- flights %>%
  left_join(airlines) %>%
  left_join(planes, by="tailnum") %>%
  select(carrier, name, manufacturer, model) %>%
  distinct()
head(df)
```

    ## # A tibble: 4 x 4
    ##   carrier name                   manufacturer model
    ##   <chr>   <chr>                  <chr>        <chr>
    ## 1 HA      Hawaiian Airlines Inc. AIRBUS       A330-243
    ## 2 UA      United Air Lines Inc.  BOEING       767-424ER
    ## 3 UA      United Air Lines Inc.  <NA>         <NA>
    ## 4 UA      United Air Lines Inc.  BOEING       757-222

RDF Data Lake
-------------

In RDF, we simply toss all of our data into the triplestore, or to use a more evocative metaphor, the "Data Lake." We can then extract whatever tabular structure we need by querying the data lake using SPARQL, something sometimes referred to as "schema-on-read," since we are specifying the desired format of the data when we pull it out of the lake.

This can serve as a very effective means of data integration, provided URIs are used reasonably consistently and diligently to identify subjects and properties (predicates), since just about any data can be added to the lake without worrying about whether it comes in a schema that matches the existing architecture of the database. This flexibility, not having to define your database schema at the start, is the primary strength of the RDF approach.

Okay, let's dump the `nycflights13` data into the data lake. First, the foreign keys in any table must be represented as URIs and not literal strings:

``` r
as_uri <- function(x, base_uri = "x:") paste0(base_uri, x)
uri_flights <- flights %>%
  mutate(tailnum = as_uri(tailnum),
         carrier = as_uri(carrier))
```

Similarly, when reading a table into RDF we have to declare its key column, and again establish a `base_uri`, which allows RDF methods to distinguish between URIs (subjects, predicates, and foreign keys) and literal strings.

``` r
system.time(
  rdf <- c(
    as_rdf(airlines, "carrier", "x:"),
    as_rdf(planes, "tailnum", "x:"),
    as_rdf(uri_flights, NULL, "x:")
  )
)
```

Note that `flights` does not have a natural key: somewhat surprisingly, the `flight` number is not a unique key for this table, as the same flight number is reused on the same route at different times. So we treat each row as a unique anonymous key by setting the key to `NULL`.

Schema on read
--------------

We simply define the columns we want and we immediately get back the desired `data.frame`:

``` r
s <-
'SELECT ?carrier ?name ?manufacturer ?model ?dep_delay
WHERE {
?flight <x:tailnum> ?tailnum .
?flight <x:carrier> ?carrier .
?flight <x:dep_delay> ?dep_delay .
?tailnum <x:manufacturer> ?manufacturer .
?tailnum <x:model> ?model .
?carrier <x:name> ?name
}'

system.time(
  df <- rdf_query(rdf, s)
)
```

    ##    user  system elapsed
    ##   0.096   0.004   0.100

``` r
head(df)
```

    ## # A tibble: 6 x 5
    ##   carrier name                   manufacturer model     dep_delay
    ##   <chr>   <chr>                  <chr>        <chr>         <int>
    ## 1 x:HA    Hawaiian Airlines Inc. AIRBUS       A330-243          6
    ## 2 x:HA    Hawaiian Airlines Inc. AIRBUS       A330-243         14
    ## 3 x:UA    United Air Lines Inc.  BOEING       767-424ER        18
    ## 4 x:UA    United Air Lines Inc.  BOEING       767-424ER         1
    ## 5 x:HA    Hawaiian Airlines Inc. AIRBUS       A330-243         -8
    ## 6 x:UA    United Air Lines Inc.  BOEING       767-424ER         2

Note that in place of joins, we give more semantically meaningful statements about the data: e.g. `manufacturer` is a property of a `tailnum` (corresponding to a particular physical aircraft), not of a `flight` number. Departure delay `dep_delay` is a property of a flight, not of an aircraft (`tailnum`).

This is reminiscent of the way in which these data are organized in the relational database tables to begin with: we find `dep_delay` in the `flights` table and `manufacturer` in the `planes` table. Good relational design encourages this, but to work with the data the user is often left to do the required joins, which also creates tables where these semantics are less clear.

Tabular formats can often be sloppy about what is a key and what is a literal value, and also about whether a column with the same name in different tables means the same thing in both. Both of these things pose challenges for later use when joining data. RDF representation encourages greater discipline through the use of URIs (though we've run a bit roughshod over that with the cavalier use of `x:` here).