This repository has been archived by the owner on Apr 12, 2024. It is now read-only.

google.com search results are linking doc pages without stylesheets #16432

Closed
1 of 3 tasks
hayzey opened this issue Feb 1, 2018 · 48 comments

@hayzey

hayzey commented Feb 1, 2018

I'm submitting a ...

  • bug report
  • feature request
  • other

Current behavior:

The CSS on the docs page for angular.element doesn't seem to be loading: https://docs.angularjs.org/partials/api/ng/function/angular.element.html

image

Expected / new behavior:

The page would have CSS applied as usual.

Minimal reproduction of the problem with instructions:

Go to https://docs.angularjs.org/partials/api/ng/function/angular.element.html

AngularJS version: 1.x.y

stable

Browser: [all | Chrome XX | Firefox XX | Edge XX | IE XX | Safari XX | Mobile Chrome XX | Android X.X Web Browser | iOS XX Safari | iOS XX UIWebView | iOS XX WKWebView ]

Chrome 63 in macOS

Anything else:

These are the only network requests I'm seeing:

image

@frederikprijck
Contributor

frederikprijck commented Feb 2, 2018

You don't want to use partials in the url, but instead point to the page (not the partial) https://docs.angularjs.org/api/ng/function/angular.element

If you search for angular.element on https://docs.angularjs.org/, you'll be redirected to https://docs.angularjs.org/api/ng/function/angular.element as well.

Partials are not supposed to have CSS.
If you inspect the source, you'll also see that this isn't even a valid HTML page; it's a partial (as you can tell from the URL).

@Narretz
Contributor

Narretz commented Feb 2, 2018

@frederikprijck is right. How did you get the partial url?

@hayzey
Author

hayzey commented Feb 2, 2018

@frederikprijck The CSS loads if I access it with that link.

@Narretz I just searched "angular.element docs" on Google and that was the first link that popped up.

@frederikprijck
Contributor

frederikprijck commented Feb 2, 2018

@Narretz @hayleytom

Wow that's correct and that's NOT good.

https://www.google.be/search?q=angular.element+docs&oq=angular.element+docs&aqs=chrome..69i57j69i60l2j69i61j0.198j0j7&sourceid=chrome&ie=UTF-8

So even though CSS is not expected to load on that page, the real problem is that Google shouldn't be indexing it at all! 😟

@Narretz
Contributor

Narretz commented Feb 2, 2018

I see. Something went wrong with the robots.txt during the migration, I assume.

@Narretz Narretz self-assigned this Feb 2, 2018
@petebacondarwin
Member

Weird, the robots.txt is:

User-agent: *

Disallow: /components/
Disallow: /examples/
Disallow: /img/
Disallow: /js/
Disallow: /partials/
Disallow: /ptore2e/
Disallow: /*.js$
Disallow: /*.map$
Disallow: /Error404.html

which looks correct to me. Any ideas?
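To make the effect of these rules concrete, here is a rough sketch of how a crawler that follows Google-style pattern matching (`*` as a wildcard, a trailing `$` as an end anchor, everything else as a prefix match) would evaluate them. The function names `ruleToRegExp` and `isDisallowed` are made up for illustration, and the sketch ignores `Allow` rules and longest-match precedence, so treat it as an approximation rather than a reference implementation.

```typescript
// Disallow rules copied from the robots.txt quoted above.
const disallowRules: string[] = [
  "/components/", "/examples/", "/img/", "/js/", "/partials/", "/ptore2e/",
  "/*.js$", "/*.map$", "/Error404.html",
];

// Hypothetical helper: turn one robots.txt path pattern into a RegExp.
// Assumed semantics: "*" matches any run of characters, a trailing "$"
// anchors the end of the path, and anything else is a literal prefix.
function ruleToRegExp(rule: string): RegExp {
  const anchored = rule.charAt(rule.length - 1) === "$";
  const body = anchored ? rule.slice(0, -1) : rule;
  const escaped = body
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + escaped + (anchored ? "$" : ""));
}

// Hypothetical helper: true if any Disallow rule matches the given path.
function isDisallowed(path: string, rules: string[]): boolean {
  return rules.some((rule) => ruleToRegExp(rule).test(path));
}
```

Under these assumptions, `isDisallowed("/partials/api/ng/function/angular.element.html", disallowRules)` is true while the page URL `/api/ng/function/angular.element` is allowed. Note that the same rules also block every script under `/js/`, which matters for a crawler that needs to execute JavaScript.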

@petebacondarwin
Member

Perhaps there was a random crawl during the switchover to Firebase Hosting that missed the robots file? Can we trigger a new crawl, @Narretz? I don't have access to the Search Console for this site.

@Narretz
Contributor

Narretz commented Feb 2, 2018

I just pushed an update that added this robots.txt. 😊 Before that, there actually was none. I'll check whether I have access.

@Narretz
Contributor

Narretz commented Feb 2, 2018

@petebacondarwin I don't have docs.angularjs.org in the search console either. Probably @IgorMinar must trigger the re-crawl.

@IgorMinar
Contributor

IgorMinar commented Feb 3, 2018

It's strange that the crawler seems not to be honoring the #! contract. We are correctly setting the meta tag.

I don't know if this is related, but I just saw a notification that https://code.angularjs.org/snapshot/docs/js/all-versions-data.js is blocked via robots.txt, which prevents the crawler that indexes the site after the JS has executed from working.

@IgorMinar
Contributor

IgorMinar commented Feb 3, 2018

some other issues I found:

@Narretz can you please take a look?

@IgorMinar
Contributor

except for robots.txt the other two issues still seem unresolved.

@Narretz I thought you said that the sitemap was fixed a few days ago. Did I misunderstand you?

@Narretz
Contributor

Narretz commented Feb 8, 2018

@IgorMinar I did, and when I checked it was there. Not sure what's going on. The sitemap is visible on the docs folders for the snapshots. (Not that we need them there, just saying that the build produces them)

@Narretz
Contributor

Narretz commented Feb 8, 2018

Ah I see, it was NOT copied to the deploy folder. Fixing right now ...

@Narretz
Contributor

Narretz commented Feb 9, 2018

@IgorMinar the site map is up: https://docs.angularjs.org/sitemap.xml

The link to the ajax crawling scheme above says this is deprecated. I assume as long as we allow the js crawler access we don't need this.

And it looks like the crawler hasn't reindexed the site at all. Because at least the partial url should not be in the results but it still is.
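For context, a sitemap of the kind discussed here is just an XML list of the canonical page URLs. The following is a minimal sketch of producing one from a list of page paths; it is not the Dgeni processor that the docs build actually uses, and the names `escapeXml` and `buildSitemap` are hypothetical.

```typescript
// Escape the characters that are special in XML text content.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical helper: build a minimal sitemap.xml body.
// Only the canonical page URLs are listed; partials are deliberately left out.
function buildSitemap(baseUrl: string, pagePaths: string[]): string {
  const entries = pagePaths.map(
    (path) => "  <url><loc>" + escapeXml(baseUrl + path) + "</loc></url>"
  );
  const lines = ['<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    .concat(entries)
    .concat(["</urlset>"]);
  return lines.join("\n");
}

const sitemap = buildSitemap("https://docs.angularjs.org", [
  "/api/ng/function/angular.element",
  "/guide/databinding",
]);
```

Listing `/api/ng/function/angular.element` but not `/partials/api/ng/function/angular.element.html` is what signals to a crawler which of the two URLs is the one meant for users.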

@IgorMinar
Contributor

I requested a recrawl using the sitemap via search console.

the issue with partials is that we used to use the fragment scheme before google had a js-enabled crawler, now that js-enabled crawler is a thing the fragment scheme is in the way... :-(

we spent a ton of time fixing js-crawler related issues for angular.io and I'm not sure if we want to go through the same effort on docs.angularjs.org. it would be better if we could just make the fragment scheme work well enough and not touch it any more.

@Narretz
Contributor

Narretz commented Feb 9, 2018

Okay, I actually wasn't aware that we used the escaped fragment rule, and so there is no rewrite rule (yet) for serving the escaped fragment. Shouldn't take too long to add it.

However, the site also says the ajax schemes will be discontinued in summer 2018, so I think we still need the site to be crawlable by the JS bot.

@IgorMinar IgorMinar changed the title from "CSS is broken on angular.element docs page" to "google.com search results are linking doc pages without stylesheets" Feb 9, 2018
@IgorMinar
Contributor

right, but I think we are now in the state where the escaped fragments are considered as results to be served to users, that's why we see bad search results.

We should restore the escaped fragments functionality so that the results go back to normal - even though this functionality has been deprecated by google's crawler team. supporting the js-crawler is a big undertaking especially compared to restoring the escaped fragment route.
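For readers unfamiliar with the scheme being discussed: under Google's (deprecated) AJAX crawling scheme, a hashbang URL has its fragment moved into an `_escaped_fragment_` query parameter, and a page without a hashbang that opts in through the fragment meta tag is requested with an empty `_escaped_fragment_=` parameter. The sketch below illustrates that mapping from the crawler's side. The function name `toEscapedFragmentUrl` is hypothetical, and the percent-encoding of special characters in the fragment is omitted for brevity.

```typescript
// Hypothetical helper: derive the URL the AJAX crawler would request.
//   "https://host/path#!state" -> "https://host/path?_escaped_fragment_=state"
//   "https://host/path"        -> "https://host/path?_escaped_fragment_="
// Simplification: the fragment value is not percent-encoded here.
function toEscapedFragmentUrl(url: string): string {
  const hashbang = url.indexOf("#!");
  const base = hashbang === -1 ? url : url.slice(0, hashbang);
  const fragment = hashbang === -1 ? "" : url.slice(hashbang + 2);
  // Append to an existing query string if there is one.
  const separator = base.indexOf("?") === -1 ? "?" : "&";
  return base + separator + "_escaped_fragment_=" + fragment;
}
```

The server is then expected to answer that second URL with a static HTML snapshot of the page, which is the part that was lost in the migration.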

@Narretz
Contributor

Narretz commented Feb 10, 2018

Okay, so here's what we need to do afaict:

Sound good?

@Narretz
Contributor

Narretz commented Feb 10, 2018

Clarification:

Not sure what the better approach is.

@IgorMinar
Contributor

IgorMinar commented Feb 10, 2018

yeah. that sounds good, I think, with a few comments:

I don't remember if google requests: https://docs.angularjs.org/api/ng/function/angular.element?_escaped_fragment_= or if the query param has some value. I think you got it right. There was some weirdness about this because of our use of "html5" urls rather than hashbang urls. Can you look at the old server config to confirm that this is right?

with regards to serving - since there is a finite number of urls to serve, wouldn't it be simpler to have dgeni generate the firebase rewrites and then we don't need to deal with serving these files ourselves via functions or what not. Dgeni already does something very very similar when it generates the sitemap. I'm just looking for the most reliable and low maintenance solution...

@Narretz
Contributor

Narretz commented Feb 11, 2018 via email

@Narretz
Contributor

Narretz commented Feb 11, 2018

Okay, so ? is a glob wildcard for a single character. However, it doesn't seem possible to escape it.

Actually, since ? matches any single character, it should still match ? in the url. I guess that means the query parameters are not matched to the rewrite after all.

@IgorMinar
Contributor

OK. So let's go with the cloud function.

Can we get this fixed ASAP? We are starting to see an increased number of people being affected by this. See: https://news.ycombinator.com/item?id=16353676

Narretz added a commit to Narretz/angular.js that referenced this issue Feb 12, 2018
This commit restores serving the plain partials (content) when a docs
page is accessed with ?_escaped_fragment_=.
The Google Ajax Crawler accesses these urls when the page has
`<meta name="fragment" content="!">` set.

During the migration to Firebase, this was lost, which resulted in Google
dropping the docs almost completely from the index.

We are using a Firebase cloud function to serve the partials. Since
we cannot access the static hosted files from the function, we have to
deploy them as part of the function directory instead, from which they
can be read.

Related to angular#16432
Related to angular#16417
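The core of such a function is the mapping from an incoming request to the partial file that should be served. The following is a sketch of that mapping only, not the code from the commit: the name `resolvePartial` is hypothetical, the `partials/<path>.html` layout is inferred from the URLs quoted in this thread, and the handling of the root page is deliberately left out.

```typescript
// Hypothetical helper: map a request URL (path plus query string) to the
// partial file to serve, or null if this is not an escaped-fragment request.
// Assumed layout: the page "/a/b" has its partial at "partials/a/b.html".
function resolvePartial(requestUrl: string): string | null {
  const queryStart = requestUrl.indexOf("?");
  if (queryStart === -1) {
    return null; // no query string, so this is a normal page request
  }
  const path = requestUrl.slice(0, queryStart);
  const query = requestUrl.slice(queryStart + 1);
  const hasEscapedFragment = query
    .split("&")
    .some((pair) => pair.split("=")[0] === "_escaped_fragment_");
  if (!hasEscapedFragment) {
    return null;
  }
  if (path.indexOf("..") !== -1) {
    return null; // refuse anything that looks like path traversal
  }
  const trimmed = path.replace(/^\/+|\/+$/g, "");
  if (trimmed === "") {
    return null; // root page handling is not covered by this sketch
  }
  return "partials/" + trimmed + ".html";
}
```

With this mapping, a request for `/api/ng/function/angular.element?_escaped_fragment_=` resolves to `partials/api/ng/function/angular.element.html`, which the function would read from its own deployed directory and return as the response body.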
@gkalpak
Member

gkalpak commented Feb 12, 2018

#16452 should fix it.

@gkalpak
Member

gkalpak commented Feb 12, 2018

@Narretz
Contributor

Narretz commented Feb 12, 2018

The firebase deployment failed? (Because the job passed). Anyway let's call it a night. I'll take a look tomorrow

@gkalpak
Member

gkalpak commented Feb 12, 2018

It failed, because the firebase functions dependencies are not installed (and are apparently necessary).
I'm giving it another try: #16453

@Narretz
Contributor

Narretz commented Feb 13, 2018

Thanks to @gkalpak the snapshots are here! https://docs.angularjs.org/guide/databinding?_escaped_fragment_=

@gkalpak
Member

gkalpak commented Feb 13, 2018

@IgorMinar
Contributor

yay! awesome! thanks @gkalpak

I tried to confirm that it actually worked, but the search console ui is confusing for the urls crawled via "escaped fragment" method. This is what I get:

screen shot 2018-02-12 at 4 39 32 pm

Note that the UI actually renders fine via the js-enabled crawler, except that the crawler is not allowed to index that view because several URLs are blocked in the robots.txt. @Narretz can you please fix them? This is the list:

| URL | Type | Reason | Severity |
|---|---|---|---|
| https://docs.angularjs.org/js/angular-bootstrap/dropdown-toggle.min.js | Script | Blocked | High |
| https://docs.angularjs.org/js/current-version-data.js | Script | Blocked | High |
| https://docs.angularjs.org/js/pages-data.js | Script | Blocked | High |
| https://docs.angularjs.org/js/nav-data.js | Script | Blocked | High |
| https://docs.angularjs.org/js/docs.min.js | Script | Blocked | High |
| https://docs.angularjs.org/components/google-code-prettify-1.0.1/src/prettify.js | Script | Blocked | Medium |
| https://docs.angularjs.org/components/marked-0.3.6/marked.min.js | Script | Blocked | Low |
| https://docs.angularjs.org/components/lunr-0.7.2/lunr.min.js | Script | Blocked | Low |
| https://docs.angularjs.org/components/google-code-prettify-1.0.1/src/lang-css.js | Script | Blocked | Low |
| https://docs.angularjs.org/img/angularjs-for-header-only.svg | Image | Blocked | Low |

@IgorMinar
Contributor

If I'm testing it correctly, then I think the fix in prod is working well. Example:

url: https://docs.angularjs.org/api/ng/directive
_escaped_fragment url: https://docs.angularjs.org/api/ng/directive?_escaped_fragment_=

both work!

@Narretz
Contributor

Narretz commented Feb 13, 2018

both work!

I would have been devastated if only one of these worked after all this :>

@IgorMinar I've updated the robots.txt to allow access to the js and images that are used by the docs app. (96bee0c) I've tested it with this tool: https://technicalseo.com/seo-tools/robots-txt/ That's not google specific though.

Can you also please check what the crawler sees for the direct partials/ urls like https://docs.angularjs.org/partials/api/ng/function/angular.element.html ? Because the current robots.txt excludes them, but the site is still indexed ...

@IgorMinar
Contributor

IgorMinar commented Feb 14, 2018 via email

@Narretz
Contributor

Narretz commented Feb 14, 2018

The googlebot can now crawl the docs pages better than before, but now it reports the partials/ as blocked :/ I need to find out if it's safe to allow the partials.

angularelementfetch

(Sorry, the page is in German)

@IgorMinar
Contributor

The results seem to be improving:

2018-02-13 at 11:28am PT:
screen shot 2018-02-13 at 11 28 48 am

2018-02-14 at 8:14am PT:
screen shot 2018-02-14 at 8 14 08 am

@IgorMinar
Contributor

@Narretz I think it's fine to unblock the partial html. Since we provide a sitemap, the crawler should understand that that html is not a url we want to publicize.

@IgorMinar
Contributor

The traffic is still down, but there seems to be a slight hint of improvement. We need more data to know for sure...

screen shot 2018-02-14 at 8 21 58 am

@Narretz
Contributor

Narretz commented Feb 14, 2018

@IgorMinar good to see the traffic is recovering.

For the partials I will add a noindex header, which should complement the sitemap.
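One plausible way to express such a header is an entry in the `headers` section of the Firebase Hosting configuration that sets `X-Robots-Tag: noindex` for everything under `/partials/`. The sketch below builds that entry as a typed object and serializes it; the interface name `HostingHeaderRule` and the variable names are made up for illustration, and the rule actually committed may differ.

```typescript
// Shape of one entry in the "headers" array of a Firebase Hosting config.
interface HostingHeaderRule {
  source: string;
  headers: { key: string; value: string }[];
}

// Assumed rule: ask crawlers not to index anything served under /partials/.
const partialsNoIndex: HostingHeaderRule = {
  source: "/partials/**",
  headers: [{ key: "X-Robots-Tag", value: "noindex" }],
};

// Serialized as it would sit inside firebase.json under "hosting".
const firebaseConfigJson: string = JSON.stringify(
  { hosting: { headers: [partialsNoIndex] } },
  null,
  2
);
```

A noindex header only takes effect if the crawler is allowed to fetch the URL in the first place, so this approach relies on the partials no longer being disallowed in robots.txt.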

@IgorMinar
Contributor

IgorMinar commented Feb 14, 2018 via email

Narretz added a commit to Narretz/angular.js that referenced this issue Feb 14, 2018
…ials/

The sitemap.xml might also prevent the indexing, as the partials are not
listed.

Related to angular#16432
Narretz added a commit that referenced this issue Feb 15, 2018
The sitemap.xml might also prevent the indexing, as the partials are not
listed.

Related to #16432 
Closes #16457
@Narretz
Contributor

Narretz commented Feb 15, 2018

The original search for angular.element docs now returns a good result:

https://www.google.be/search?q=angular.element+docs&oq=angular.element+docs&aqs=chrome..69i57j69i60l2j69i61j0.198j0j7&sourceid=chrome&ie=UTF-8

Unblocking the partials from the crawler has done the trick, I think. I have requested a re-index of docs.angularjs.org/api and its direct links.

Remaining issues:

  • The crawler cannot render docs.angularjs.org correctly - For docs.angularjs.org we don't serve the production version - problem with rewrite urls.
  • the crawler wants to render the examples in iframes as well. This is reported as low priority though.
  • Soft 404s for pages that no longer exist
  • some partials are accessed without the html file ending, which leads to wrong includes and 500 errors

@IgorMinar
Contributor

hourly stats:
screen shot 2018-02-21 at 10 21 36 am

daily stats:
screen shot 2018-02-21 at 10 28 35 am

we are slowly recovering but we are not back to normal yet. it looks like we are still ~20% off for daily session count.

@IgorMinar
Contributor

I don't think that there is any action we need to take. Let's just monitor this further.

@IgorMinar
Contributor

one more graph that shows that we are recovering this time from search console - note that this one is delayed - the last datapoint is from Feb 18:

screen shot 2018-02-21 at 10 33 53 am

Narretz added a commit to Narretz/angular.js that referenced this issue Feb 23, 2018
The sitemap.xml might also prevent the indexing, as the partials are not
listed.

Related to angular#16432
Closes angular#16457

Closes angular#16446
@Narretz
Contributor

Narretz commented Apr 4, 2018

This is resolved. The numbers are still lower than before the migration, but that is possibly an effect of the LTS announcement.
