from defusedxml import ElementTree as etree
tree = etree.fromstring(bytes(html_content, encoding='utf-8'))
results in "mismatched tag: line 47, column 4". The previous lxml implementation, however, was automatically tolerant of mismatched tags.
I dug around in the issue queue and found #31, #24 and others, but it's not quite clear what the process should be.
- from lxml import etree, html
+ from defusedxml import ElementTree as etree
- tree = html.fromstring(bytes(html_content, encoding='utf-8'))
+ tree = etree.fromstring(bytes(html_content, encoding='utf-8'))
# (for some background, I'm trying to parse the HTML tree and then execute an XPath query)
r = tree.xpath(xpath_filter.strip(), namespaces={'re': 'http://exslt.org/regular-expressions'})
Is there a way to make lxml use defusedxml here for HTML? Or am I going about it the right way? Or is my understanding of where defusedxml sits wrong?
And what about the case of arbitrary HTML "from the wild", where there could be hanging or unclosed tags? Do I need to supply some kind of "recover=True" (https://lxml.de/api/lxml.etree.XMLParser-class.html)?
thanks again!
The text was updated successfully, but these errors were encountered:
HTML is not XML. defusedxml only deals with XML and does not support HTML parsing. lxml is the tool for HTML parsing. Please reach out to the lxml project if you have security concerns with lxml.
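To illustrate the point that HTML is not XML, here is a minimal sketch using only the Python standard library. It assumes a made-up sample string (SAMPLE_HTML) and hypothetical helper names (xml_parse_error, TagCollector, html_start_tags); none of these come from the issue. The strict XML parser in xml.etree.ElementTree, which defusedxml's ElementTree module builds on, rejects an unclosed tag, while the tolerant html.parser reads the same markup without complaint.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: acceptable as HTML in practice, but not well-formed
# XML because the <p> element is never closed.
SAMPLE_HTML = "<div><p>unclosed paragraph</div>"


def xml_parse_error(markup):
    """Return the strict XML parser's error message, or None if it parses."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


class TagCollector(HTMLParser):
    """Record start tags; html.parser does not enforce well-formedness."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def html_start_tags(markup):
    """Feed markup to the tolerant HTML parser and list the start tags seen."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


if __name__ == "__main__":
    # The XML parser reports a "mismatched tag" error for the sample.
    print(xml_parse_error(SAMPLE_HTML))
    # The HTML parser simply reports the tags it encountered.
    print(html_start_tags(SAMPLE_HTML))
```

This only shows why the XML route fails on real-world HTML. For tolerant HTML parsing combined with XPath queries, the html.fromstring call from lxml in the original code is the kind of tool the reply above points to.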
Heya, thanks for the nice project!