[Question] Correct way to parse HTML with defusedxml? #100

Closed
dgtlmoon opened this issue Oct 17, 2023 · 1 comment

dgtlmoon commented Oct 17, 2023

Heya, thanks for the nice project!

    from defusedxml import ElementTree as etree
    tree = etree.fromstring(bytes(html_content, encoding='utf-8'))

This results in "mismatched tag: line 47, column 4". The previous lxml implementation, however, automatically tolerated mismatched tags.

I dug around in the issue queue and found #31, #24 and others, but it's still not clear what the process should be. This is the change I made:

-    from lxml import etree, html
+    from defusedxml import ElementTree as etree
 
-    tree = html.fromstring(bytes(html_content, encoding='utf-8'))
+    tree = etree.fromstring(bytes(html_content, encoding='utf-8'))

# (For some background: I'm trying to parse the HTML tree and then execute an XPath query.)
r = tree.xpath(xpath_filter.strip(), namespaces={'re': 'http://exslt.org/regular-expressions'})

Is there a way to make lxml use defusedxml here for HTML? Am I going about this the right way, or is my understanding of where defusedxml sits wrong?

And what about arbitrary HTML "from the wild", where there could be dangling or unclosed tags? Do I need to supply some kind of "recover=True" (https://lxml.de/api/lxml.etree.XMLParser-class.html)?

thanks again!

tiran (Owner) commented Oct 17, 2023

HTML is not XML. defusedxml only deals with XML and does not support HTML parsing. lxml is the tool for HTML parsing. Please reach out to the lxml project if you have security concerns with lxml.
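
To illustrate the distinction above, here is a minimal stdlib-only sketch (not from the thread). The `SAMPLE` string and the helper names `parses_as_xml` and `html_start_tags` are made up for illustration. It uses `xml.etree.ElementTree`, the strict parser whose functions `defusedxml.ElementTree` wraps, next to the tolerant `html.parser.HTMLParser`.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical "wild" HTML: <br> is never closed and the second <p> is left open.
SAMPLE = "<div><p>first<br></p><p>second</div>"


def parses_as_xml(markup: str) -> bool:
    """Return True if the strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # Raised for errors such as "mismatched tag: line ..., column ...".
        return False
    return True


class TagCollector(HTMLParser):
    """Records every start tag the HTML parser reports."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_start_tags(markup: str) -> list[str]:
    """Feed markup to the tolerant HTML parser and return the start tags seen."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print("accepted as XML:", parses_as_xml(SAMPLE))
    print("HTML start tags:", html_start_tags(SAMPLE))
```

With markup like this, the XML parser is expected to reject the input much as in the question, while the HTML parser keeps going. For XPath queries over real-world HTML, `lxml.html.fromstring` from the original code remains the relevant tool.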

tiran closed this as not planned on Oct 17, 2023