Namespace vignette #52
Very interesting 🤓 👏
My most important comment would be that the vignette lacks an introduction (what's the problem here?) and conclusion.
> ```{r}
> d <- xml2::read_xml("<document>
> <paragraph xmlns='http://commonmark.org/xml/1.0'>
invalid URL
Oh yeah, that's going to be a fun one for CRAN. That's the namespace we have for the stylesheet:
tinkr/inst/extdata/xml2md_gfm.xsl, line 5 in 935ed21:
> xmlns:md="http://commonmark.org/xml/1.0">
Fun fact that I need to include in the vignette: a namespace name must be a valid URI, but it does not have to be a valid URL. The documentation on this is painfully obtuse: https://www.w3.org/TR/xml-names/#sec-namespaces
We could probably obfuscate this by adding a function that creates the URI so that CRAN doesn't pick it up in its scans.
Actually, I now realize it won't be found, cf. https://github.com/wch/r-source/blob/277dc7c97155e7dcc3f0649bc1bc7731a9f26b74/src/library/tools/R/urltools.R#L78 (since the URL won't be a link in the HTML file).
However, in the text we might want to explain that it's a URI?
Co-authored-by: Maëlle Salmon <maelle.salmon@yahoo.se>
This is a really good point and really highlights my writing style: it's a bit like building a sandwich inside-out. I start with a salad and add the bread at the end.
Awesome work, mostly nitpicky comments. 😺
> cat(xslt::xml_xslt(xml, xslt_style))
> ```
>
> Read on to find out more about XML namespaces and their implications on your
"tinkering" 😁
> cat(xslt::xml_xslt(xml, sty))
> ```
>
> ### Perils: adding nodes
Is "perils" a common word? I understand it because it looks like the French word, but might it be a better idea to use "risks"?
I think peril gives a better sense of "something that cannot be avoided on this path" as opposed to risk, which has a random component.
> ## Harnessing the power of namespaces
>
> When you know that namespaces with prefixes will only respond to nodes with that
Add a practical use case (in words, not code necessarily) for this?
Good question! I use the alternate namespace in {pegboard} to help me identify and label pandoc fenced-div sections by adding pairs of equivalently labeled tags that are not part of the markdown document, so I can easily parse the content with find_between().
Otherwise, I can see the masking pattern being useful if you wanted to create several versions of the same prose in a single document (e.g. if you were creating a quiz that you wanted randomized per student).
> standardizing their markdown documents.
>
> [reparse]: https://community.rstudio.com/t/adding-nodes-in-xml2-how-to-avoid-duplicate-default-namespaces/84870/2?u=zkamvar
> [^1]: Well, mostly just Zhian.
The truth is you took the time to understand them, I just had to read your vignette!
> While developing {tinkr} we[^1] struggled a lot with understanding namespaces.
> This guide was our attempt at demystifying working with namespaces in {xml2}.
> For the casual user of {tinkr} who is interested in extracting data from
You could still write the take-home message for them i.e. what they need to know for normal use?
This will address #48. Here is the rendered version via `knitr::purl()` and `reprex::reprex()`.
Draft of vignette text (updated 2021-05-28)
Introduction
This document was written to address common confusions about XML namespaces and
their implications in constructing XPath queries, adding new XML nodes, and
converting XML to markdown. This guide is written for the user who is
comfortable with XPath queries and wants to understand more about how to handle
and manipulate their XML representation of markdown.
Motivation
The underlying motivation for {tinkr} was to wrap the process of converting
markdown documents to XML and back again. This process uses {commonmark} and
{xml2} to translate and read in the markdown to an XML document.
We use the {xslt} package to do the conversion from XML back to markdown.

One of the downsides of this conversion is that commonmark provides a default
namespace, which means that nodes in XPath queries must have a prefix that
defines the namespace. For example, an XPath query to select all paragraphs
that have executable R code looks like the following query:
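
The query itself did not survive in this rendering. A minimal sketch of what such a query could look like, assuming {commonmark} and {xml2} are installed; the markdown input and the exact XPath expression are illustrative and not taken from the vignette:

```r
library(xml2)

# Illustrative markdown: one paragraph with inline R code, one without
md <- "Two is `r 1 + 1`.\n\nNo code here.\n"

# commonmark emits XML in the default namespace http://commonmark.org/xml/1.0
xml <- read_xml(commonmark::markdown_xml(md))

# Every node name in the query needs the d1 prefix
r_paragraphs <- xml_find_all(
  xml,
  ".//d1:paragraph[d1:code[starts-with(text(), 'r ')]]"
)
length(r_paragraphs)
```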
We add `d1` because that is the prefix {xml2} assigns to the default namespace.
The {tinkr} difference
The XML document that {tinkr} generates has no namespace by default because
operations on an XML document without a namespace are easier than if there
were a default or a prefixed namespace.
However, removing the namespace has implications for exporting XML objects
because namespaces are important. For example, this namespace-less
document can no longer be converted with our XSLT stylesheet, which expects a
commonmark namespace:
To alleviate this, we add the namespace just before it’s converted in
`tinkr::to_md()`.

Read on to find out more about XML namespaces and their implications on your
tinkering.
XML namespaces
XML namespaces are a lot like package namespaces in R: they allow you to avoid
name clashes. For example, `table` can represent data or furniture.
By default, nodes in XML do not have namespaces unless you give them one, which
means that when you use XPath search, you can use the node names by default:
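
As a small sketch of this (the document below is a made-up example, not the one from the vignette):

```r
library(xml2)

# A tiny document with no namespace declarations
d <- read_xml("<document><paragraph><text>hello</text></paragraph></document>")

# Plain node names work in the XPath expression
plain <- xml_find_all(d, ".//text")
plain
```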
However if there is a namespace added to a node, all of its descendants will
inherit the namespace, which affects your XPath expressions.
Below we add the commonmark namespace to the paragraph node.
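
A sketch of such a document, with the namespace declared on the paragraph rather than on the root (the node contents are assumptions for illustration):

```r
library(xml2)

d <- read_xml(
  "<document>
     <paragraph xmlns='http://commonmark.org/xml/1.0'>
       <text>hello</text>
     </paragraph>
   </document>"
)

# {xml2} reports the unprefixed namespace under an automatic prefix
ns <- xml_ns(d)
ns
```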
Using the same XPath query as before no longer works; our call to
`xml2::xml_find_all()` returns nothing.

When a namespace is specified with `xmlns=<URI>`, {xml2} assigns it a
default namespace prefix, which is `d1`. Therefore editing our XPath query
like so will work:
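
Both queries side by side, as a sketch on an assumed example document:

```r
library(xml2)

d <- read_xml(
  "<document><paragraph xmlns='http://commonmark.org/xml/1.0'><text>hello</text></paragraph></document>"
)

# The unprefixed query no longer matches anything
unprefixed <- xml_find_all(d, ".//text")
length(unprefixed)

# The d1 prefix that {xml2} assigns to the default namespace does match
prefixed <- xml_find_all(d, ".//d1:text")
length(prefixed)
```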
But is it a good idea to use `d1` as a namespace prefix? No, the {xml2}
documentation recommends renaming the namespace as soon as you read in a
document and using the namespace object to semantically prefix your XPath
expressions:
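
A sketch of the renaming step with `xml2::xml_ns_rename()`, again on an assumed example document:

```r
library(xml2)

d <- read_xml(
  "<document><paragraph xmlns='http://commonmark.org/xml/1.0'><text>hello</text></paragraph></document>"
)

# Swap the automatic d1 prefix for a meaningful one; the document itself is
# unchanged, only the prefix-to-URI lookup object is
ns <- xml_ns_rename(xml_ns(d), d1 = "md")
ns
```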
Now we can modify our XPath query to use `md` as a prefix, but we also need to
supply the namespace as an argument to the command:
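
For instance (a sketch; the document and variable names are illustrative):

```r
library(xml2)

d <- read_xml(
  "<document><paragraph xmlns='http://commonmark.org/xml/1.0'><text>hello</text></paragraph></document>"
)
ns <- xml_ns_rename(xml_ns(d), d1 = "md")

# The md prefix only resolves because ns is passed along with the query
md_text <- xml_find_all(d, ".//md:text", ns = ns)
md_text
```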
You might be wondering, why isn’t it recommended to prefix the namespace from
the start to avoid needing to rename and specify the namespace? The reason is
that prefixed namespaces only apply to nodes with that prefix. Here’s
an example. Let’s take our previous example and modify the namespace attribute
to have an `md` prefix:

We can see that the XPath query without the prefix works.
However, the XPath query with the prefix no longer works.
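
A sketch of this behaviour, assuming a small example document in which the namespace is declared with an explicit `md` prefix:

```r
library(xml2)

d <- read_xml(
  "<document><paragraph xmlns:md='http://commonmark.org/xml/1.0'><text>hello</text></paragraph></document>"
)

# The prefix is declared, but no node actually uses it
xml_ns(d)

unprefixed <- xml_find_all(d, ".//text")
prefixed <- xml_find_all(d, ".//md:text")
c(unprefixed = length(unprefixed), prefixed = length(prefixed))
```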
You might be wondering why the prefixed XPath query worked earlier, when we
renamed a default namespace, but no longer works now that the namespace
explicitly defines the prefix. Isn’t everything below the `paragraph` node in
the commonmark namespace?

You might notice that we can access the `document` node AND the `text` node
without a prefix even though the `text` node appears to be inside the
commonmark namespace and the `document` node is outside of it. It’s because
neither of these nodes actually has a namespace!
This is demonstrated when we add a new node with the `md` prefix:

Now we can see that there are three text nodes, one of which has the `md`
namespace prefix. If we select the nodes with that prefix and without the
prefix, we will get one and two nodes, respectively.
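
A sketch of that experiment; the two starting `text` nodes and their contents are assumptions chosen to match the counts described above:

```r
library(xml2)

d <- read_xml(
  "<document><paragraph xmlns:md='http://commonmark.org/xml/1.0'><text>one</text><text>two</text></paragraph></document>"
)
para <- xml_find_first(d, ".//paragraph")

# Adding a node under the md prefix places it in the commonmark namespace
xml_add_child(para, "md:text", "three")

with_prefix <- xml_find_all(d, ".//md:text")
without_prefix <- xml_find_all(d, ".//text")
c(with_prefix = length(with_prefix), without_prefix = length(without_prefix))
```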
If a namespace is defined in the document with a prefix, only nodes with that
prefix are considered to be inside the namespace. This becomes important when
we want to pass our XML document through a stylesheet that expects the incoming
nodes to have a specific namespace, which is exactly how we transform the XML
representation of markdown back to markdown.
Commonmark
The {tinkr} package streamlines the process of converting markdown to XML and
back again. We use `commonmark::markdown_xml()` as a starting point to generate
valid XML:

Commonmark uses a default namespace

You can see from the commonmark output that it has a default namespace that
resolves to `http://commonmark.org/xml/1.0`, which means that we need to use
the default namespace if we want to munge the data:
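
A minimal sketch, assuming {commonmark} is installed; the one-line markdown input is illustrative:

```r
library(xml2)

xml <- read_xml(commonmark::markdown_xml("This is a **test**."))

# The root declares the commonmark namespace without a prefix
xml_ns(xml)

# so queries go through the automatic d1 prefix
strong <- xml_find_all(xml, ".//d1:strong")
xml_text(strong)
```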
Using a semantic prefix with the default namespace
To make things more semantic, we could rename the namespace to have the “md”
prefix and carry around that object. Note: an `xml_namespace` object is a named
character vector, so we can create it with `structure()` and use it to introduce
semantically sensible XPath queries:
Of course, now if we want to make any semantic XPath query, we need to include
both a prefix and a namespace object.
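
A sketch of building the namespace object by hand; `md_ns` is a name chosen here for illustration:

```r
library(xml2)

# A named character vector with the xml_namespace class
md_ns <- structure(
  c(md = "http://commonmark.org/xml/1.0"),
  class = "xml_namespace"
)

xml <- read_xml(commonmark::markdown_xml("This is a **test**."))

# Prefix in the query, namespace object in the call
strong <- xml_find_all(xml, ".//md:strong", ns = md_ns)
xml_text(strong)
```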
Transforming XML to markdown with XSLT
The commonmark namespace allows us to transform our document to markdown using
an XSLT stylesheet, which is—that’s right—an XML document:
Each `xsl:template` node in this stylesheet matches against a specific node in
the commonmark namespace (prefix: `md`) and emits text based on that node. This
allows us to write back to markdown:
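
To illustrate the mechanism without reproducing tinkr's stylesheet, here is a sketch with a hypothetical one-rule stylesheet, assuming {xslt} and {commonmark} are installed. Only `md:strong` has a template; every other node falls through to XSLT's built-in rules, which emit the text content:

```r
library(xml2)

# Hypothetical stylesheet written for this example; the real one is far larger
sty <- read_xml(
  '<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:md="http://commonmark.org/xml/1.0">
     <xsl:output method="text"/>
     <xsl:template match="md:strong">**<xsl:apply-templates/>**</xsl:template>
   </xsl:stylesheet>'
)

xml <- read_xml(commonmark::markdown_xml("This is a **test**."))

# The strong node is in the commonmark namespace, so the template fires
out <- xslt::xml_xslt(xml, sty)
cat(out)
```

If the same document had no namespace, the `md:strong` template would never match and the asterisks would be lost, which is why the stylesheet and the document have to agree on the namespace.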
We can in this way programmatically transform the content of the markdown. In
this example, we can change the `**test**` to be an inline R code chunk that
emits `_test_`.

Perils: adding nodes
A default namespace is all fun and games until you need to add new nodes. Take
for example the situation where we want to add a code block. In commonmark, it’s
a `code_block` node with an `info` attribute stating the language and the text
inside is the code.
By all appearances, the node was added correctly, but because we did not
specify a namespace, it is not recognized as part of the `md` namespace even
though we added it as a child of the document. The best way to handle this
situation is to reparse the document:
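
A sketch of the problem and the reparse fix, using a hand-written stand-in for commonmark output:

```r
library(xml2)

xml <- read_xml(
  "<document xmlns='http://commonmark.org/xml/1.0'><paragraph><text>Some text.</text></paragraph></document>"
)

# Add a code block without specifying a namespace
xml_add_child(xml, "code_block", "1 + 1\n", info = "{r}")

# The new node is not matched by a namespaced query
before <- xml_find_all(xml, ".//d1:code_block")
length(before)

# Reparse: serialize to text and read it back in, so that the new node
# inherits the default namespace declared on the root
xml <- read_xml(as.character(xml))
after <- xml_find_all(xml, ".//d1:code_block")
length(after)
```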
We could also try adding the namespace to the node when we add it:
It works, but let’s take a look at our namespaces:
Every node we add with an unnamed namespace adds another default namespace and,
in the end, if we are doing a lot of substitution, we can end up with hundreds
of namespaces.
No Namespace?
What if we just tried to use no namespace?
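
One way to try this is `xml2::xml_ns_strip()`, which removes default namespaces in place. A sketch on a hand-written stand-in for commonmark output:

```r
library(xml2)

xml <- read_xml(
  "<document xmlns='http://commonmark.org/xml/1.0'><paragraph><text>hello</text></paragraph></document>"
)

# Drop the default namespace from the document
xml_ns_strip(xml)

# Unprefixed queries now work
found_text <- xml_find_all(xml, ".//text")

# and newly added nodes are found straight away
xml_add_child(xml, "code_block", "1 + 1\n", info = "{r}")
found_code <- xml_find_all(xml, ".//code_block")
c(text = length(found_text), code_block = length(found_code))
```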
We can now add new nodes and use XPath without namespace prefixes or objects,
but we have lost the ability to use our stylesheet :(
But! Maybe we can do this by adding the namespace at the last minute!
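
A sketch of the last-minute approach. This is an assumption about how it can be done with {xml2}, not necessarily what `tinkr::to_md()` does internally: set `xmlns` on the root, then reparse so that every descendant picks up the namespace.

```r
library(xml2)

# A namespace-less document, as {tinkr} would hold it while you work on it
xml <- read_xml("<document><paragraph><text>hello</text></paragraph></document>")

# Declare the commonmark namespace on the root just before conversion
xml_set_attr(xml, "xmlns", "http://commonmark.org/xml/1.0")

# Reparse so that all descendants inherit the default namespace
xml <- read_xml(as.character(xml))

namespaced <- xml_find_all(xml, ".//d1:text")
length(namespaced)
```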
Harnessing the power of namespaces
When you know that namespaces with prefixes will only respond to nodes with that
prefix and all other nodes have no namespace, then you can add in nodes that can
serve as anchors in your document or hide markdown elements. Let’s say we
wanted to hide all markdown elements except for code blocks. One way we could
do this is to set up a namespace and add a prefix to all non-code-block nodes:
Conclusion
While developing {tinkr} we[1] struggled a lot with understanding namespaces.
This guide was our attempt at demystifying working with namespaces in {xml2}.
For the casual user of {tinkr} who is interested in extracting data from
markdown documents, this guide is not very useful, but we hope that this
guide proves useful for the user who wants to use this for cleaning and
standardizing their markdown documents.
[1] Well, mostly just Zhian.
Created on 2021-05-28 by the reprex package (v2.0.0)