ropensci · zkamvar · May 27, 2021 · May 27, 2021 · May 28, 2021 · May 28, 2021
diff --git a/vignettes/namespaces.Rmd b/vignettes/namespaces.Rmd
@@ -0,0 +1,288 @@
+---
+title: "Hold Namespace For Me"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Hold Namespace For Me}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>"
+)
+```
+
+```{r}
+library("tinkr")
+library("magrittr")
+library("commonmark")
+library("xml2")
+library("xslt")
+library("purrr")
+```
+
+
+## XML namespaces
+
+XML namespaces are a lot like package namespaces in R: they allow you to avoid
+name clashes ([for example, `table` can represent data or furniture](https://www.w3schools.com/XML/xml_namespaces.asp)).
+
+By default, XML nodes have no namespace unless you give them one, which means
+that you can use bare node names in XPath searches:
+
+```{r}
+d <- xml2::read_xml("<document>
+    <paragraph>
+      <text>hello there</text>
+      <text> ello  here</text>
+    </paragraph>
+  </document>")
+xml2::xml_ns(d)
+xml2::xml_find_all(d, "//document")
+xml2::xml_find_all(d, "//text[contains(text(), 'hello')]")
+```
+
+However, if a namespace is added to a node, all of its descendants inherit
+that namespace, which affects your XPath expressions:
+
+```{r}
+d <- xml2::read_xml("<document>
+    <paragraph xmlns='http://commonmark.org/xml/1.0'>
+      <text>hello there</text>
+      <text> ello  here</text>
+    </paragraph>
+  </document>")
+xml2::xml_ns(d)
+xml2::xml_find_all(d, "//document")
+xml2::xml_find_all(d, "//text[contains(text(), 'hello')]") # does not work
+xml2::xml_find_all(d, "//d1:text[contains(text(), 'hello')]") # works; default namespace
+```
+
+When a namespace is specified with `xmlns=<URI>`, {xml2} assigns it a **default
+namespace** prefix, which is `d1`. The {xml2} documentation recommends renaming
+the namespace as soon as you read in a document and using the namespace
+object to semantically prefix your XPath expressions:
+
+```{r}
+ns <- xml2::xml_ns(d)
+ns <- xml2::xml_ns_rename(ns, d1 = "md") 
+ns
+xml2::xml_find_all(d, "//md:text[contains(text(), 'hello')]", ns)
+```
+
+You might be wondering, why don't we prefix the namespace from the start to 
+avoid needing to rename and specify the namespace? Here's an example. Let's take
+our previous example and modify the namespace attribute to have an `md` prefix:
+
+```{r}
+dc <- as.character(d)
+cat(dc <- gsub("xmlns=", "xmlns:md=", dc))
+dc <- xml2::read_xml(dc)
+xml2::xml_ns(dc)
+xml2::xml_find_all(dc, "//document")
+xml2::xml_find_all(dc, "//text[contains(text(), 'hello')]") # works!
+xml2::xml_find_all(dc, "//md:text") # prefix no longer works :(
+```
+
+Now the prefix syntax no longer works, but that's okay, because now we have
+commonmark as our default namespace, right? Unfortunately, that's not quite the
+case. You might notice that we can access both the `document` node and the
+`text` node without a prefix, even though the `text` node sits inside the
+element that declares the commonmark namespace and the `document` node sits
+outside of it. That's because neither of these nodes actually has a namespace!
+
+You can see this when we add a new node with a namespace prefix:
+
+```{r}
+pgp <- xml2::xml_find_first(dc, "//paragraph")
+xml2::xml_add_child(pgp, "md:text", "hello from the md namespace")
+xml2::xml_find_all(dc, "//md:text") # only one text element in the md namespace
+xml2::xml_find_all(dc, "//text")    # md namespace does not show up here
+```
+
+If a namespace is defined in the document with a prefix, *only nodes with that
+prefix are considered to be inside the namespace*. This becomes important when
+we want to pass our XML document through a stylesheet that expects the incoming
+nodes to have a specific namespace, which is exactly how we transform the XML
+representation of markdown back to markdown.
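+
+One way to check which namespace a node actually belongs to is the XPath
+`namespace-uri()` function. The sketch below is illustrative only: the `demo`
+document is made up for this example, and it assumes `xml2::xml_find_chr()`
+picks up the `md` prefix from the document's own namespace declarations.
+
+```{r}
+demo <- xml2::read_xml("<document xmlns:md='http://commonmark.org/xml/1.0'>
+    <text>no prefix</text>
+    <md:text>md prefix</md:text>
+  </document>")
+# The unprefixed node reports an empty namespace URI
+xml2::xml_find_chr(demo, "namespace-uri(/document/text)")
+# The prefixed node reports the commonmark URI
+xml2::xml_find_chr(demo, "namespace-uri(/document/md:text)")
+```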
+
+## Commonmark
+
+The {tinkr} package streamlines converting markdown to XML and back again.
+We use `commonmark::markdown_xml()` as a starting point to generate valid XML:
+
+
+```{r cmark}
+cat(cmk <- commonmark::markdown_xml("this is a **test**"))
+xml <- xml2::read_xml(cmk)
+xml
+```
+
+### It's default of the namespace
+
+You can see from the commonmark output that it has a **default namespace** that
+resolves to `http://commonmark.org/xml/1.0`, which means that we need to use
+the default namespace if we want to munge the data:
+
+```{r munge}
+xml2::xml_find_all(xml, "//d1:text")
+```
+
+### Who is up for semantics?
+
+To make things more semantic, we could rename the namespace to have the "md"
+prefix and carry around that object. Note: an `xml_namespace` object is a named
+character vector, so we can create it with `structure()` and use it to introduce
+semantically sensible XPath queries:
+
+```{r}
+ns <- structure(c(md = "http://commonmark.org/xml/1.0"), class = "xml_namespace")
+xml2::xml_find_all(xml, "//md:text", ns)
+```
+
+Of course, now if we want to make any semantic XPath query, we need to include
+both a prefix and a namespace object.
+
+### Transformers!
+
+The commonmark namespace allows us to transform our document to markdown using 
+an XSLT stylesheet, which is---that's right---an XML document:
+
+```{r stysh}
+sty <- xml2::read_xml(tinkr::stylesheet())
+sty
+```
+
+Each `xsl:template` node in this stylesheet matches against a specific node in
+the commonmark namespace (prefix: `md`) and emits text based on that node. This
+allows us to write back to markdown:
+
+```{r}
+cat(xslt::xml_xslt(xml, sty))
+```
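+
+To make the matching concrete, here is a minimal sketch of a stylesheet with a
+single template. `mini_sty` and `mini_xml` are hypothetical objects written for
+this example (they are not part of {tinkr}). The template wraps `md:strong`
+nodes in asterisks and leaves everything else to XSLT's built-in rules, so the
+whitespace between XML nodes is passed through as-is.
+
+```{r}
+mini_sty <- xml2::read_xml('<xsl:stylesheet version="1.0"
+    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
+    xmlns:md="http://commonmark.org/xml/1.0">
+  <xsl:output method="text" encoding="utf-8"/>
+  <!-- only nodes in the commonmark namespace match the md: prefix -->
+  <xsl:template match="md:strong">**<xsl:apply-templates/>**</xsl:template>
+</xsl:stylesheet>')
+mini_xml <- xml2::read_xml(commonmark::markdown_xml("this is a **test**"))
+cat(xslt::xml_xslt(mini_xml, mini_sty))
+```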
+
+In this way, we can programmatically transform the content of the markdown:
+
+```{r}
+xml <- commonmark::markdown_xml("this is a **test**") %>%
+  xml2::read_xml()
+
+xml2::xml_find_all(xml, "//md:strong", ns) %>%
+  xml2::xml_set_name("code") %>%
+  xml2::xml_set_text("r cat('_test_')")
+
+sty <- xml2::read_xml(tinkr::stylesheet())
+cat(xslt::xml_xslt(xml, sty))
+```
+
+### Perils: adding nodes
+
+A default namespace is all fun and games until you need to add new nodes. Take,
+for example, the situation where we want to add a code block. In commonmark, a
+code block is a `code_block` node whose `info` attribute states the language and
+whose text is the code.
+
+```{r}
+xml <- commonmark::markdown_xml("this is a **test**") %>%
+  xml2::read_xml() 
+xml2::xml_add_child(xml, "code_block", info = "{r}", "1 + rnorm(1)\n")
+xml
+xml2::xml_find_all(xml, "//md:code_block", ns)
+sty <- xml2::read_xml(tinkr::stylesheet())
+cat(xslt::xml_xslt(xml, sty))
+```
+
+By all appearances, the node was added correctly, but because we did not
+specify a namespace, *it is not recognized as part of the `md` namespace* even
+though we added it as a child of the document. The [best way to handle this
+situation is to reparse the document][reparse]:
+
+```{r reread}
+xml %>%
+  as.character() %>%
+  xml2::read_xml() %>%
+  xslt::xml_xslt(sty) %>%
+  cat()
+```
+
+We *could* also try adding the namespace to the node when we add it:
+
+```{r}
+xml <- commonmark::markdown_xml("this is a **test**") %>%
+  xml2::read_xml() 
+xml2::xml_add_child(xml, "code_block", 
+  xmlns = "http://commonmark.org/xml/1.0", info = "{r}", "1 + rnorm(1)\n")
+xml
+xml2::xml_find_all(xml, "//md:code_block", ns)
+cat(xslt::xml_xslt(xml, sty))
+```
+
+It works, but let's take a look at our namespaces:
+
+```{r}
+xml2::xml_ns(xml)
+```
+
+Every node we add with its own `xmlns` attribute adds another default namespace,
+so if we are doing a lot of substitution, we can end up with hundreds of
+namespaces.
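+
+As a rough illustration of how this accumulates, the sketch below adds three
+nodes in a loop, each carrying its own `xmlns` attribute, and then counts the
+namespaces that {xml2} reports. The `acc` object and the loop are made up for
+this example, and the exact count may depend on your version of {xml2}.
+
+```{r}
+acc <- xml2::read_xml(commonmark::markdown_xml("this is a **test**"))
+for (i in 1:3) {
+  # each new node declares the commonmark URI as its own default namespace
+  xml2::xml_add_child(acc, "code_block",
+    xmlns = "http://commonmark.org/xml/1.0", info = "{r}", "1 + rnorm(1)\n")
+}
+# number of namespaces {xml2} now reports for the document
+length(xml2::xml_ns(acc))
+```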
+
+
+## No Namespace?
+
+What if we just tried to use no namespace?
+
+```{r}
+xml <- commonmark::markdown_xml("this is a **test**") %>%
+  xml2::read_xml() %>%
+  xml2::xml_ns_strip()
+xml2::xml_add_child(xml, "code_block", info = "{r}", "1 + rnorm(1)\n")
+xml
+xml2::xml_find_all(xml, "//code_block")
+sty <- xml2::read_xml(tinkr::stylesheet())
+cat(xslt::xml_xslt(xml, sty))
+```
+
+We can now add new nodes and use XPath without namespace prefixes or objects,
+but we have lost the ability to use our stylesheet :(
+
+But! Maybe we can do this by adding the namespace at the last minute!
+
+```{r}
+xml2::xml_set_attr(xml, "xmlns", "http://commonmark.org/xml/1.0")
+xml
+cat(xslt::xml_xslt(xml, sty))
+```
+
+## Harnessing the power of namespaces
+
+Once you know that a prefixed namespace applies only to nodes that carry its
+prefix, and that all other nodes have no namespace, you can add nodes that
+serve as anchors in your document or hide markdown elements. Let's say we
+wanted to hide all markdown elements _except_ for code blocks. One way we could
+do this is to set up a namespace and add a prefix to all non-code-block nodes:
+
+```{r}
+xml <- commonmark::markdown_xml("this is a **test**") %>%
+  xml2::read_xml() %>%
+  xml2::xml_ns_strip()
+xml2::xml_add_child(xml, "code_block", info = "{r}", "1 + rnorm(1)\n")
+xml
+# Set the prefixed namespace in your document
+xml2::xml_set_attr(xml, "xmlns:tnk", "https://docs.ropensci.org/tinkr")
+# Find all nodes that are not code blocks
+nocode <- xml2::xml_find_all(xml, ".//*[not(self::code_block)]")
+nocode
+# Change the namespace of these nodes
+purrr::walk(nocode, xml2::xml_set_namespace, "tnk", "https://docs.ropensci.org/tinkr")
+xml
+xml2::xml_set_attr(xml, "xmlns", "http://commonmark.org/xml/1.0")
+sty <- xml2::read_xml(tinkr::stylesheet())
+cat(xslt::xml_xslt(xml, sty))
+```
+
+
+[reparse]: https://community.rstudio.com/t/adding-nodes-in-xml2-how-to-avoid-duplicate-default-namespaces/84870/2?u=zkamvar
+
+