add basic permissive robots.txt #1777
Conversation
I don't think it's a good idea to just allow every bot to index every part of our site. I think we want to consider which parts to disallow from indexing, e.g. Disallow: /projects/ and several other URLs (see the sketch below).
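Something along these lines, where /projects/ is the one path I suggested above and the other paths are purely hypothetical placeholders for URLs we'd identify in a real audit:

```
# Sketch only: /projects/ is the suggested path;
# the others are hypothetical examples pending an audit of our routes
User-agent: *
Disallow: /projects/
Disallow: /api/
Disallow: /admin/
```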
I also think we might want to disallow certain bots specifically, copying what other people have done. E.g., https://en.wikipedia.org/robots.txt blocks various bots, with comments about specifically why each is blocked. We don't really have time to do extensive research, so copying which bots other sites have chosen to block might be a good plan.
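For the per-bot approach, a minimal sketch of that Wikipedia-style pattern; the bot names below are just common examples of crawlers other sites block, not a researched list for us:

```
# Illustrative only: bots frequently blocked elsewhere, with reasons noted

# OpenAI's crawler; some sites opt out of AI training scrapes
User-agent: GPTBot
Disallow: /

# SEO crawler often cited as aggressive
User-agent: AhrefsBot
Disallow: /
```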
However, if you want to just allow all bots now and edit the robots.txt later on to disallow certain ones, I won't be offended if you go ahead and dismiss this review and merge this as-is.
Our implicit policy has historically been "index what you want", since we had no robots.txt file with instructions otherwise. The fact that Google changed their stance to "we won't index you without an explicit allowance defined in robots.txt" doesn't mean that we have changed our policy. There are only a few pages that are publicly accessible (and that we want indexed). All of the LF content is behind a login and cannot be indexed anyway. When we arrive at a point where we have public projects with public data, we will surely want that indexed as well. So I'd like to move forward with the simple robots.txt as proposed, for expediency's sake, and address other concerns in a separate PR.

I do have some concerns about simply copying a robots.txt from a large org: we may have different goals, and their rules are all based on the observed behavior of various bots, which inevitably changes over time. So I can't say that blocking bots that have acted badly in the past with respect to one org is necessarily a good move on our part - just some more thought on that.
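For reference, a fully permissive robots.txt is tiny; presumably the file in this PR is something close to the following (sketching from the PR title, not the actual diff):

```
# Permit all crawlers to index everything
User-agent: *
Disallow:
```

An empty Disallow value means nothing is off-limits, which matches the "index what you want" policy described above.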
It's strange that there is a code-formatting failure for files that weren't touched by this PR. I will ignore that for this PR and address it in a separate one.
With permission, moving forward with this PR for expediency.
Fixes #1776
Description
It appears that one of the reasons languageforge.org does not show up in Google search results is the absence of a robots.txt file (who knew?).
Screenshots
From Google Search Console (screenshot omitted).
Checklist
Testing
This PR needs to be merged first; testing can really only be done once the robots.txt has shipped to production (see the quick check below).
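Once it has shipped, one quick way to verify (assuming the file is served at the site root on the production host mentioned above) would be:

```
# Check that the file is reachable and returns 200 after deploy
curl -i https://languageforge.org/robots.txt
```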