Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Doing .text() of a <script> brings empty string on 1.0.0-rc.2 #1050

Closed
Tracked by #1145
aletorrado opened this issue Jul 6, 2017 · 19 comments · Fixed by #2509
Closed
Tracked by #1145

Doing .text() of a <script> brings empty string on 1.0.0-rc.2 #1050

aletorrado opened this issue Jul 6, 2017 · 19 comments · Fixed by #2509
Labels

Comments

@aletorrado
Copy link

Hi, I've been doing .text() of tags for scraping purposes, and the latest release candidate breaks this. jQuery 1~3 and cheerio ~0 both works fine.

@jugglinmike
Copy link
Member

Thanks for the report! This is a tricky problem.

The updated behavior was merged to master (i.e. not just the 1.0.0 release branch) in gh-1018. Judging from the discussion in that patch (and in bug reports gh-986 and gh-)924), it seems to have been motivated by comparisons with anecdotal evidence of observed jQuery behavior. Missing from those discussions is evidence from the jQuery documentation of $.fn.text, but the docs do indeed support this:

The .text() method cannot be used on form inputs or scripts. To set or get
the text value of input or textarea elements, use the .val() method. To
get the value of a script element, use the .html() method.

Interestingly, this is not observable from Chrome or Firefox--the method returns the text contents of script elements in both of those browsers. The caveat is likely in place due to practical constraints for Internet Explorer. IE versions 6 through 8 do not provide text contents for script tags. I imagine the jQuery team opted to recommend using html because smoothing over IE's behavior had negative performance implications.

From this perspective, the change is technically valid. To promote portability between Cheerio and jQuery and between all web browsers, Cheerio should similarly recommend that consumers use html for this use case.

That said, such a policy definitely feels like a lowest common denominator solution, one that accounts for web browsers that are over eight years old and have negligible market share. It's also important to note that jQuery itself no longer supports Internet Explorer releases prior to version 9. (That detail tends to suggest a jQuery documentation change. I've opened an issue to discuss this with the jQuery
maintainers
.)

So with all of that said, I'm leaning towards reverting gh-1018. What do you think, @fb55?

@fb55
Copy link
Member

fb55 commented Aug 1, 2017

This is an interesting case. I was assuming that .text() was an alias for innerText, when it is using textContent under the hood. I'm fine with reverting #1018, even though the current behaviour is probably more useful when scraping pages (as both tags contents usually aren't visible).

@ghost
Copy link

ghost commented Aug 14, 2017

A bit bias ;) but the current behavior does support the specifications in the jQuery docs as you mentioned.

Having both options is best? Users wanting to scrape script tags can use .html() and users wanting to ignore script tags can use .text(). Reverting #1018 will give users only one option.

@FdezRomero
Copy link

FdezRomero commented Aug 23, 2017

I spent a lot of time wondering why I wasn't able to get the text() of a <script type="application/ld+json"> and finally found this issue. It should be documented somewhere as a breaking change is it's not reverted.

I agree with @fb55, I was confused as both innerText and textContent would return the content of the script (at least in Chrome).

I'm okay using html() instead if this behavior is more aligned with the jQuery API, as long as the breaking change is documented.

@tenbits
Copy link

tenbits commented Jan 22, 2019

I was also confused, that the text() method on script elements returns empty string. As for jQuery 3.1 the jQuery(script).text() returns the textContent of the element.

@tegaguru
Copy link

Honestly I think I just pulled out every strand of hair on my head head while trying to figure why I kept getting an empty string!

@fb55
Copy link
Member

fb55 commented Sep 11, 2020

Let's revert #1018, the current behavior seems pretty surprising.

@ljharb
Copy link

ljharb commented Sep 11, 2020

Before publishing that fix, please see if enzyme's test suite passes with it :-)

@fb55
Copy link
Member

fb55 commented Sep 11, 2020

@ljharb Do you have a suspicion that things might break? Would love to hear another take on this 😄

@ljharb
Copy link

ljharb commented Sep 12, 2020

It’s possible it’d only be conceptual and not actual breakage, but I’d hope enzyme has enough test coverage that it’d notice the change.

@matthewmueller
Copy link
Member

matthewmueller commented Sep 22, 2020

Next steps:

  • Create a PR reverting this change.
  • Test that PR against enzyme's test suite.
  • If failing tests in enzyme, discuss with @ljharb.
  • Assuming all good, update our tests, document breaking change, merge.

@fb55
Copy link
Member

fb55 commented Dec 20, 2020

@ljharb The last commit on the enzyme master branch is two years old, even though tags still seem to be pushed. What is the best source for enzyme's source code, so we can run the test suite?

@ljharb
Copy link

ljharb commented Dec 20, 2020

@fb55 thats the date the commit was authored; i landed it a week ago. The master branch is quite up to date.

@fb55
Copy link
Member

fb55 commented Dec 20, 2020

Ah, interesting 😄 Thanks for clarifying!

@fb55
Copy link
Member

fb55 commented Dec 21, 2020

I decided to punt on this for now. For reference, enzyme actually does not have a test covering it.

@ljharb
Copy link

ljharb commented Dec 22, 2020

@fb55 that is true; the second checkbox in the list above never happened. I can certainly try to add a test, or at least open a PR/branch with one, if that would help you fix this in cheerio?

@fb55 fb55 added the To Do label Dec 22, 2020
@codeyourwayup
Copy link

Yes. For all, please use html() instead of text() for now, it works!

@ljharb
Copy link

ljharb commented May 15, 2021

It won’t work if you need the text instead of the HTML.

@fb55
Copy link
Member

fb55 commented Apr 30, 2022

It only took five years, but this has been resolved. There are now textContent and innerText props that allow users to choose which of the two behaviours they want, and the text() method has been reverted to the old behaviour.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging a pull request may close this issue.

9 participants