[Question] Correct way to parse HTML with defusedxml? #100

dgtlmoon · 2023-10-17T09:11:31Z

Heya, thanks for the nice project!

    from defusedxml import ElementTree as etree
    tree = etree.fromstring(bytes(html_content, encoding='utf-8'))

This fails with "mismatched tag: line 47, column 4", whereas the previous lxml implementation tolerated mismatched tags automatically.

I dug around in the issue queue and found #31, #24 and others, but it's still not clear to me what the process should be. This is the change I made:

-    from lxml import etree, html
+    from defusedxml import ElementTree as etree
 
-    tree = html.fromstring(bytes(html_content, encoding='utf-8'))
+    tree = etree.fromstring(bytes(html_content, encoding='utf-8'))

# For some background: I'm trying to parse the HTML tree and then run an XPath query
r = tree.xpath(xpath_filter.strip(), namespaces={'re': 'http://exslt.org/regular-expressions'})

Is there a way to make lxml use defusedxml here for HTML, or am I going about this the right way? Or is my understanding of where defusedxml sits wrong?

And what about arbitrary HTML "from the wild", where there could be hanging or unclosed tags? Do I need to supply some kind of "recover=True" (https://lxml.de/api/lxml.etree.XMLParser-class.html)?

thanks again!

tiran · 2023-10-17T12:57:59Z

HTML is not XML. defusedxml only deals with XML and does not support HTML parsing. lxml is the tool for HTML parsing. Please reach out to the lxml project if you have security concerns with lxml.
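
To illustrate the "HTML is not XML" point, here is a minimal sketch that uses only the standard library. It assumes that defusedxml's ElementTree wrapper is as strict as the stdlib xml.etree.ElementTree parser it builds on, so the stdlib parser stands in for it here. The sample markup and the helper names (strict_xml_error, TagCollector, lenient_html_tags) are made up for this example. html.parser is used only to keep the sketch dependency-free; in the code from the question, lxml's html.fromstring plays the lenient role.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: <br> is never closed, which is fine in HTML
# but makes the document ill-formed as XML.
SAMPLE = "<html><body><p>one<br>two</p></body></html>"


def strict_xml_error(markup):
    """Return the XML parser's error message, or None if the markup is well-formed."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


class TagCollector(HTMLParser):
    """Records start tags; html.parser does not require tags to be balanced."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def lenient_html_tags(markup):
    """Feed markup to the lenient HTML parser and return the start tags it saw."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    # The strict XML parser rejects the unclosed <br> with a "mismatched tag" error.
    print(strict_xml_error(SAMPLE))
    # The HTML parser walks the same markup without complaint.
    print(lenient_html_tags(SAMPLE))
```

The strict parser stops at the first closing tag that does not match the innermost open element, which is the same class of error reported in the question. An HTML parser applies HTML's own rules for void and unclosed elements instead.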

tiran closed this as not planned on Oct 17, 2023
