Namespace vignette #52

Open
wants to merge 7 commits into
base: main
288 changes: 288 additions & 0 deletions vignettes/namespaces.Rmd
@@ -0,0 +1,288 @@
---
title: "Hold Namespace For Me"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Hold Namespace For Me}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

```{r}
library("tinkr")
library("magrittr")
library("commonmark")
library("xml2")
library("xslt")
library("purrr")
```


## XML namespaces

XML namespaces are a lot like package namespaces in R: they let you avoid name
clashes ([for example, `table` can represent data or furniture](https://www.w3schools.com/XML/xml_namespaces.asp)).

Nodes in XML have no namespace unless you give them one, which means that XPath
queries can use the bare node names:

```{r}
d <- xml2::read_xml("<document>
<paragraph>
<text>hello there</text>
<text> ello here</text>
</paragraph>
</document>")
xml2::xml_ns(d)
xml2::xml_find_all(d, "//document")
xml2::xml_find_all(d, "//text[contains(text(), 'hello')]")
```

However, if a namespace is declared on a node, all of its descendants inherit
that namespace, which affects your XPath expressions:

```{r}
d <- xml2::read_xml("<document>
<paragraph xmlns='http://commonmark.org/xml/1.0'>
<text>hello there</text>
<text> ello here</text>
</paragraph>
</document>")
xml2::xml_ns(d)
xml2::xml_find_all(d, "//document")
xml2::xml_find_all(d, "//text[contains(text(), 'hello')]") # does not work
xml2::xml_find_all(d, "//d1:text[contains(text(), 'hello')]") # works; default namespace
```

Review comments on the `<paragraph xmlns='http://commonmark.org/xml/1.0'>` line:

Member: invalid URL

Member Author: Oh yeah, that's going to be a fun one for CRAN. That's the namespace we have for the stylesheet:

xmlns:md="http://commonmark.org/xml/1.0">

Fun fact that I need to include in the vignette: namespaces must be a valid URI, but do not have to be a valid URL. The documentation on this is painfully obtuse: https://www.w3.org/TR/xml-names/#sec-namespaces

Member Author: We could probably obfuscate this by adding a function that creates the URI so that CRAN doesn't pick it up in its scans.

Member: Actually I now realize it won't be found, cf. https://github.com/wch/r-source/blob/277dc7c97155e7dcc3f0649bc1bc7731a9f26b74/src/library/tools/R/urltools.R#L78 (since the URL won't be a link in the HTML file).

Member: However, in the text we might want to explain that it's a URI?

When a namespace is declared with `xmlns=<URI>`, it becomes a **default
namespace** and {xml2} assigns it the prefix `d1`. The {xml2} documentation
recommends renaming the prefix as soon as you read in a document and using the
namespace object to semantically prefix your XPath expressions:

```{r}
ns <- xml2::xml_ns(d)
ns <- xml2::xml_ns_rename(ns, d1 = "md")
ns
xml2::xml_find_all(d, "//md:text[contains(text(), 'hello')]", ns)
```
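
As a side note, a namespace name only has to be a valid URI; it does not have
to be a URL that resolves. Here is a minimal sketch of that, assuming a made-up
`urn:example:notes` namespace and a hypothetical `nt` prefix (neither comes
from commonmark or {tinkr}):

```r
library("xml2")

# "urn:example:notes" is an invented URI used only for illustration;
# it is a valid URI but not a URL that points anywhere.
d_urn <- xml2::read_xml(
  "<document xmlns='urn:example:notes'><text>hi</text></document>"
)

# Rename the automatic d1 prefix to a hypothetical, more readable one
ns_urn <- xml2::xml_ns_rename(xml2::xml_ns(d_urn), d1 = "nt")

# The prefixed query finds the node in the default namespace
found <- xml2::xml_find_all(d_urn, "//nt:text", ns_urn)
xml2::xml_text(found)
```

The parser only compares the namespace name as a string, so nothing is ever
fetched from it.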

You might be wondering: why don't we declare the namespace with a prefix from
the start, so that we don't need to rename it and pass the namespace object
around? Here's an example. Let's take our previous example and modify the
namespace attribute to have an `md` prefix:

```{r}
dc <- as.character(d)
cat(dc <- gsub("xmlns=", "xmlns:md=", dc))
dc <- xml2::read_xml(dc)
xml2::xml_ns(dc)
xml2::xml_find_all(dc, "//document")
xml2::xml_find_all(dc, "//text[contains(text(), 'hello')]") # works!
xml2::xml_find_all(dc, "//md:text") # prefix no longer works :(
```

Now the prefix syntax no longer works, but that's okay, because now we have
commonmark as our default namespace, right? Unfortunately, that's not quite the
case. You might notice that we can access the `document` node AND the `text`
node without a prefix, even though the `text` node sits inside the element that
declares the commonmark namespace and the `document` node sits outside of it.
That's because neither of these nodes actually has a namespace!

You can see this when we add a new node with a namespace prefix:

```{r}
pgp <- xml2::xml_find_first(dc, "//paragraph")
xml2::xml_add_child(pgp, "md:text", "hello from the md namespace")
xml2::xml_find_all(dc, "//md:text") # only one text element in the md namespace
xml2::xml_find_all(dc, "//text") # md namespace does not show up here
```

If a namespace is defined in the document with a prefix, *only nodes with that
prefix are considered to be inside the namespace*. This becomes important when
we want to pass our XML document through a stylesheet that expects the incoming
nodes to have a specific namespace, which is exactly how we transform the XML
representation of markdown back to markdown.

## Commonmark

The {tinkr} package streamlines the process of converting markdown to XML and
back again. We use `commonmark::markdown_xml()` as a starting point to generate
valid XML:


```{r cmark}
cat(cmk <- commonmark::markdown_xml("this is a **test**"))
xml <- xml2::read_xml(cmk)
xml
```

### It's default of the namespace

You can see from the commonmark output that it has a **default namespace** that
resolves to `http://commonmark.org/xml/1.0`, which means that we need to use
the default namespace prefix (`d1`) if we want to munge the data:

```{r munge}
xml2::xml_find_all(xml, "//d1:text")
```

### Who is up for semantics?

To make things more semantic, we could rename the namespace to have the "md"
prefix and carry around that object. Note: an `xml_namespace` object is a named
character vector, so we can create it with `structure()` and use it to write
semantically sensible XPath queries:

```{r}
ns <- structure(c(md = "http://commonmark.org/xml/1.0"), class = "xml_namespace")
xml2::xml_find_all(xml, "//md:text", ns)
```

Of course, now if we want to make any semantic XPath query, we need to include
both a prefix and a namespace object.
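
One way to cut down on that repetition is to wrap the namespace object in a
small helper. The sketch below assumes a hypothetical `find_md()` function and
a `md_ns` object; neither is part of {tinkr} or {xml2}, they are only here for
illustration:

```r
library("xml2")
library("commonmark")

# Namespace object with the "md" prefix, built the same way as above
md_ns <- structure(
  c(md = "http://commonmark.org/xml/1.0"),
  class = "xml_namespace"
)

# find_md() is a hypothetical helper: it always passes md_ns along,
# so callers only need to write the prefixed XPath expression
find_md <- function(x, xpath) {
  xml2::xml_find_all(x, xpath, ns = md_ns)
}

doc <- xml2::read_xml(commonmark::markdown_xml("this is a **test**"))
strong <- find_md(doc, "//md:strong")
xml2::xml_text(strong)
```

The prefix is still required in every XPath expression; the helper only saves
us from passing the namespace object each time.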

### Transformers!

The commonmark namespace allows us to transform our document to markdown using
an XSLT stylesheet, which is---that's right---an XML document:

```{r stysh}
sty <- xml2::read_xml(tinkr::stylesheet())
sty
```

Each `xsl:template` node in this stylesheet matches against a specific node in
the commonmark namespace (prefix: `md`) and emits text based on that node. This
allows us to write back to markdown:

```{r}
cat(xslt::xml_xslt(xml, sty))
```
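
To see which nodes the stylesheet responds to, we can list the `match`
attribute of its templates. This is a rough sketch that assumes the `xsl`
prefix is declared on the stylesheet root (so that {xml2} can resolve it from
the document itself); templates that are defined by name rather than by match
will show up as `NA`:

```r
library("xml2")

sty <- xml2::read_xml(tinkr::stylesheet())

# Collect every template node in the stylesheet
templates <- xml2::xml_find_all(sty, "//xsl:template")

# The match attribute holds the pattern each template responds to
matches <- xml2::xml_attr(templates, "match")
head(matches)
```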

In this way, we can programmatically transform the content of the markdown:

```{r}
xml <- commonmark::markdown_xml("this is a **test**") %>%
xml2::read_xml()

xml2::xml_find_all(xml, "//md:strong", ns) %>%
xml2::xml_set_name("code") %>%
xml2::xml_set_text("r cat('_test_')")

sty <- xml2::read_xml(tinkr::stylesheet())
cat(xslt::xml_xslt(xml, sty))
```

### Perils: adding nodes

Member: Is "perils" a common word? I understand it because it looks like the French word, but might it be a better idea to use "risks"?

Member Author: I think peril gives a better sense of "something that can not be avoided on this path" as opposed to risk, which has a random component.


A default namespace is all fun and games until you need to add new nodes. Take
for example the situation where we want to add a code block. In commonmark, it's
a `code_block` node with an `info` attribute stating the language and the text
inside is the code.

```{r}
xml <- commonmark::markdown_xml("this is a **test**") %>%
xml2::read_xml()
xml2::xml_add_child(xml, "code_block", info = "{r}", "1 + rnorm(1)\n")
xml
xml2::xml_find_all(xml, "//md:code_block", ns)
sty <- xml2::read_xml(tinkr::stylesheet())
cat(xslt::xml_xslt(xml, sty))
```

By all rights, the node should have been added correctly, but because we did
not specify a namespace, *it is not recognized as part of the `md` namespace*
even though we added it as a child of the document. The [best way to handle
this situation is to reparse the document][reparse]:

```{r reread}
xml %>%
as.character() %>%
xml2::read_xml() %>%
xslt::xml_xslt(sty) %>%
cat()
```

We *could* also try adding the namespace to the node when we add it:

```{r}
xml <- commonmark::markdown_xml("this is a **test**") %>%
xml2::read_xml()
xml2::xml_add_child(xml, "code_block",
xmlns = "http://commonmark.org/xml/1.0", info = "{r}", "1 + rnorm(1)\n")
xml
xml2::xml_find_all(xml, "//md:code_block", ns)
cat(xslt::xml_xslt(xml, sty))
```

It works, but let's take a look at our namespaces:

```{r}
xml2::xml_ns(xml)
```

Every node we add with an unnamed namespace adds another default namespace and,
if we are doing a lot of substitution, we can end up with hundreds of
namespaces.


## No Namespace?

What if we just tried to use no namespace?

```{r}
xml <- commonmark::markdown_xml("this is a **test**") %>%
xml2::read_xml() %>%
xml2::xml_ns_strip()
xml2::xml_add_child(xml, "code_block", info = "{r}", "1 + rnorm(1)\n")
xml
xml2::xml_find_all(xml, "//code_block")
sty <- xml2::read_xml(tinkr::stylesheet())
cat(xslt::xml_xslt(xml, sty))
```

We can now add new nodes and use XPath without namespace prefixes or objects,
but we have lost the ability to use our stylesheet :(

But! Maybe we can do this by adding the namespace at the last minute!

```{r}
xml2::xml_set_attr(xml, "xmlns", "http://commonmark.org/xml/1.0")
xml
cat(xslt::xml_xslt(xml, sty))
```

## Harnessing the power of namespaces

Member: Add a practical use case (in words, not code necessarily) for this?

Member Author: Good question! I use the alternate namespace in {pegboard} to help me identify and label pandoc fenced-div sections by adding pairs of equivalently labeled tags that are not part of the markdown document so I can easily parse the content with find_between().

Otherwise, I can see the masking pattern being useful if you wanted to create several versions of the same prose in a single document (e.g. if you were creating a quiz that you wanted randomized per student).

When you know that a prefixed namespace only applies to nodes with that prefix
and that all other nodes have no namespace, you can add nodes that serve as
anchors in your document or hide markdown elements. Let's say we wanted to hide
all markdown elements _except_ for code blocks. One way we could do this is to
set up a namespace and add a prefix to all non-code-block nodes:

```{r}
xml <- commonmark::markdown_xml("this is a **test**") %>%
xml2::read_xml() %>%
xml2::xml_ns_strip()
xml2::xml_add_child(xml, "code_block", info = "{r}", "1 + rnorm(1)\n")
xml
# Set the prefixed namespace in your document
xml2::xml_set_attr(xml, "xmlns:tnk", "https://docs.ropensci.org/tinkr")
# Find all nodes that are not code blocks
nocode <- xml2::xml_find_all(xml, ".//*[not(self::code_block)]")
nocode
# Change the namespace of these nodes
purrr::walk(nocode, xml2::xml_set_namespace, "tnk", "https://docs.ropensci.org/tinkr")
xml
xml2::xml_set_attr(xml, "xmlns", "http://commonmark.org/xml/1.0")
sty <- xml2::read_xml(tinkr::stylesheet())
cat(xslt::xml_xslt(xml, sty))
```


[reparse]: https://community.rstudio.com/t/adding-nodes-in-xml2-how-to-avoid-duplicate-default-namespaces/84870/2?u=zkamvar