This repository has been archived by the owner on Jul 5, 2021. It is now read-only.

fix: exclude wikipedia published without canonical urls #334

Merged
merged 2 commits into master from fix/exclude-old-wiki-via-robots.txt
Sep 12, 2019

Conversation

lidel
Contributor

@lidel lidel commented Sep 2, 2019

This PR supersedes #330 – see the discussion there for why we don't exclude the entire gateway.

This PR adds a /robots.txt that protects us from crawlers by excluding specific popular Wikipedia snapshots published without link rel=canonical (ipfs/distributed-wikipedia-mirror#48).

Context: ipfs/distributed-wikipedia-mirror#48, openzim/mwoffliner#963
Closes #328

License: MIT
Signed-off-by: Marcin Rataj <lidel@lidel.org>

# better safe than sorry
User-agent: *
Disallow: /harming/humans
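
The snippet above is the playful part of the file; the operative part of the PR excludes specific Wikipedia snapshot paths while leaving the rest of the gateway crawlable. A minimal sketch of that shape, using a hypothetical placeholder path rather than the actual snapshot CIDs from the PR:

```text
# Hypothetical sketch: block all crawlers from a specific Wikipedia
# snapshot published without <link rel=canonical>, but leave everything
# else crawlable. (QmExampleWikipediaSnapshotCid is a placeholder,
# not a real CID from this PR.)
User-agent: *
Disallow: /ipfs/QmExampleWikipediaSnapshotCid/
```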
Contributor


They will work around this with the 0th law. ;p

Contributor Author


😬 PR welcome 🙃

Contributor

@jessicaschilling jessicaschilling left a comment


I.A. approves. (Also LGTM)

static/robots.txt

License: MIT
Signed-off-by: Marcin Rataj <lidel@lidel.org>
@lidel lidel requested a review from olizilla September 4, 2019 11:02
@jessicaschilling
Contributor

@lidel -- Did we reach a decision on what to do about this?

@lidel
Contributor Author

lidel commented Sep 12, 2019

@jessicaschilling IMO this PR is safe to merge: it excludes selected Wikipedia mirrors but leaves everything else crawlable.
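
The behavior described here can be checked with Python's stdlib robots.txt parser. The snapshot path below is a hypothetical placeholder, not the actual CID excluded by this PR:

```python
from urllib import robotparser

# Hypothetical robots.txt mirroring the PR's approach: disallow a specific
# snapshot path, leave the rest of the gateway crawlable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /ipfs/QmExampleWikipediaSnapshotCid/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The excluded snapshot is blocked for all crawlers...
print(rp.can_fetch("*", "/ipfs/QmExampleWikipediaSnapshotCid/wiki/Foo"))  # False
# ...while other gateway content stays crawlable.
print(rp.can_fetch("*", "/ipfs/QmSomeOtherContent/"))  # True
```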

@jessicaschilling jessicaschilling merged commit b388c28 into master Sep 12, 2019
@jessicaschilling jessicaschilling deleted the fix/exclude-old-wiki-via-robots.txt branch September 12, 2019 17:10
Development

Successfully merging this pull request may close these issues.

Hotfix: Missing robots.txt
4 participants