Block internet search engines from indexing the mirror #48

bardiharborow · 2018-01-30T00:22:32Z

If possible, are you able to make your mirror non-indexed by internet search engines? There is very minimal benefit for clearnet users to run across three (WMF, WikiVisually and ipfs) different copies of the Wikipedia article every time they search for something.

Stebalien · 2018-01-30T02:03:48Z

> There is very minimal benefit for clearnet users to run across three (WMF, WikiVisually and ipfs) different copies of the Wikipedia article every time they search for something.

The benefit is entirely for clearnet users. Tor users, for example, will (almost) always be able to access Wikipedia over Tor, so they'll see little benefit.

We should probably add a rel=canonical link pointing to Wikipedia to the head of each page but I haven't thought through the possible ramifications/downsides of this approach.
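For illustration, here is a minimal sketch of what such a link could look like for a given article. The `WIKIPEDIA_BASE` URL layout and the `canonical_link_tag` helper are assumptions made for this example, not something the mirror's build scripts are known to contain.

```python
from html import escape
from urllib.parse import quote

# Assumed URL layout for the original article; the thread does not spell it out.
WIKIPEDIA_BASE = "https://en.wikipedia.org/wiki/"


def canonical_link_tag(title: str) -> str:
    """Return a <link rel="canonical"> element pointing at the Wikipedia article."""
    # Wikipedia URLs use underscores in place of spaces in article titles.
    path = quote(title.replace(" ", "_"))
    href = escape(WIKIPEDIA_BASE + path, quote=True)
    return f'<link rel="canonical" href="{href}">'


print(canonical_link_tag("Alan Turing"))
```

The resulting element would go inside the `<head>` of the mirrored page.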

bardiharborow · 2018-01-30T02:35:33Z

> The benefit is entirely for clearnet users. Tor users, for example, will (almost) always be able to access Wikipedia over Tor, so they'll see little benefit.

If anonymity is the concern, then accessing IPFS through the ipfs.io endpoint is no more anonymous to large scale surveillance than accessing Wikipedia directly, and if anything I trust the Wikimedia Foundation to handle server logs better than ipfs.io. Users of the actual IPFS software will presumably discover the mirror through different means than Google, and will not be impacted by this change.

> We should probably add a rel=canonical link pointing to Wikipedia to the head of each page but I haven't thought through the possible ramifications/downsides of this approach.

Doing so would have the intended effect of removing the mirror from Google search results, and it is actually the preferred way to implement this.

Stebalien · 2018-01-30T18:44:17Z

@bardiharborow

> If anonymity is the concern, then accessing IPFS through the ipfs.io endpoint is no more anonymous to large scale surveillance than accessing Wikipedia directly, and if anything I trust the Wikimedia Foundation to handle server logs better than ipfs.io. Users of the actual IPFS software will presumably discover the mirror through different means than Google, and will not be impacted by this change.

Ah, I think the confusion may be around the definition of "clearnet". IPFS is a clearnet. That is, it's not a darknet (it provides no anonymity at the moment). Darknets get no benefit because the exit nodes tend to be in countries with strong free speech laws.

> Users of the actual IPFS software will presumably discover the mirror through different means than Google, and will not be impacted by this change.

Unlikely. We don't have any IPFS search mechanisms and rely entirely on web search engines. That's probably one of the reasons we don't use rel=canonical links.

rameshvarun · 2018-04-02T07:20:52Z

+1 for setting rel='canonical' links. I'm starting to see the mirror pop up frequently on the first page of Google results just from normal everyday use. Canonical links should avoid this duplication and make the mirror a good web citizen.

nemobis · 2018-10-22T12:53:38Z

I agree with adding rel="canonical": it's annoying to see search duplicates. By not indexing outdated content, you'll also alleviate the concerns raised in other issues such as #55 and #49.

Actually, what's the purpose of indexing all the pages at all? A noindex meta tag may be appropriate.
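As a rough sketch of the noindex alternative, the snippet below inserts a robots meta tag right after the opening `<head>` tag of a page. The `add_noindex` helper is hypothetical. Unlike a canonical link, which points search engines at the original article, a noindex tag asks them to drop the page from results altogether.

```python
import re

NOINDEX_TAG = '<meta name="robots" content="noindex">'

# Matches an opening <head> tag (with or without attributes), but not <header>.
HEAD_RE = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def add_noindex(html: str) -> str:
    """Insert the noindex meta tag after the opening <head> tag, once."""
    if NOINDEX_TAG in html:
        return html  # already present, nothing to do
    match = HEAD_RE.search(html)
    if match is None:
        return html  # no <head> found; leave the document untouched
    return html[: match.end()] + NOINDEX_TAG + html[match.end():]
```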

wesleylima · 2019-02-13T04:45:37Z

The lack of a canonical tag comes from the HTML files generated by Kiwix's mwoffliner. I opened an issue: openzim/mwoffliner#564

nemobis · 2019-04-30T07:03:34Z

> The lack of a canonical tag comes from the HTML files generated by Kiwix's mwoffliner.

I understand, but you can also add a canonical link in the webserver response headers.
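A minimal sketch of the header-based variant: the canonical relation can be sent as an HTTP `Link` response header instead of an HTML element. The `canonical_link_header` helper and the example URL are assumptions for illustration; how the header would actually be attached depends on the gateway or web server in use.

```python
def canonical_link_header(canonical_url: str) -> tuple[str, str]:
    """Return a (name, value) pair for an HTTP Link header with rel="canonical"."""
    return ("Link", f'<{canonical_url}>; rel="canonical"')


name, value = canonical_link_header("https://en.wikipedia.org/wiki/Alan_Turing")
print(f"{name}: {value}")
```

This avoids touching the generated HTML at all, at the cost of requiring control over the serving layer.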

lidel · 2019-09-02T19:36:47Z

I fixed this upstream (openzim/mwoffliner#963) 👌
Old snapshots are about to be excluded via /robots.txt (ipfs-inactive/website#334)
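To illustrate the robots.txt approach, the sketch below checks a rule set with Python's standard `urllib.robotparser`. The robots.txt content and the `QmOldSnapshotHash` / `QmNewSnapshotHash` path segments are placeholders invented for this example, not the rules or hashes used in ipfs-inactive/website#334.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block crawlers from one old snapshot path only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /ipfs/QmOldSnapshotHash/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(path: str, user_agent: str = "*") -> bool:
    """Return True if the given path may be fetched under the rules above."""
    return parser.can_fetch(user_agent, path)


print(is_crawlable("/ipfs/QmOldSnapshotHash/wiki/Alan_Turing.html"))
print(is_crawlable("/ipfs/QmNewSnapshotHash/wiki/Alan_Turing.html"))
```

Note that robots.txt only discourages crawling of the listed paths; it does not tell search engines which copy is the original, which is what the canonical link is for.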

Remaining steps before this issue can be closed:

- mwoffliner 1.9 is released with the fix
- wiki snapshots at http://wiki.kiwix.org/wiki/Content_in_all_languages are made with the updated mwoffliner and include <link rel="canonical"
- snapshots are put on IPFS and pinned on a reliable cluster
- snapshot-hashes.yml is updated to versions with canonical links

OR:

While "Add canonical link header" (openzim/mwoffliner#963) solves the problem for new snapshots, it is still possible that the script will be run against an old ZIM without the header. Before adding to IPFS, the script should check whether the root document contains the header and, if not, manually add it to every document.
Filed "Add rel="canonical" for search extensions" (#65) to track this
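The check-then-patch step described above could be sketched roughly as follows. This is not the code from #65: the in-memory `pages` mapping, the `url_for` callback, and all helper names are assumptions chosen to keep the example self-contained, and the regular expressions are a simplification rather than a full HTML parser.

```python
import re
from html import escape
from typing import Callable

# Simplified patterns; a real script might prefer a proper HTML parser.
CANONICAL_RE = re.compile(r"""<link\b[^>]*\brel=["']canonical["']""", re.IGNORECASE)
HEAD_RE = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def has_canonical(html: str) -> bool:
    """Return True if the document already declares a canonical link."""
    return CANONICAL_RE.search(html) is not None


def add_canonical(html: str, url: str) -> str:
    """Insert a canonical link after the opening <head> tag if none exists."""
    if has_canonical(html):
        return html
    match = HEAD_RE.search(html)
    if match is None:
        return html  # no <head> found; leave the document untouched
    tag = f'<link rel="canonical" href="{escape(url, quote=True)}">'
    return html[: match.end()] + tag + html[match.end():]


def ensure_canonical(
    pages: dict[str, str], root: str, url_for: Callable[[str], str]
) -> dict[str, str]:
    """If the root document lacks a canonical link, patch every document."""
    if has_canonical(pages[root]):
        return pages  # snapshot was built with the upstream fix
    return {path: add_canonical(html, url_for(path)) for path, html in pages.items()}
```

Checking only the root document keeps the common case cheap: a snapshot built with the fixed mwoffliner is detected after reading a single file.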

I will be checking on the mwoffliner/kiwix situation, but if someone has spare bandwidth and can help speed things up, please contribute upstream and post updates here.

lidel · 2021-02-15T15:44:04Z

This has been fixed by #65 and will be solved upstream when new snapshots are published as part of #60 #61.

This was referenced Aug 28, 2019

Add canonical link openzim/mwoffliner#564

Closed

fix: add robots.txt to exclude gateway paths ipfs-inactive/website#330

Closed

Add canonical link header openzim/mwoffliner#963

Merged

Hotfix: Missing robots.txt ipfs-inactive/website#328

Closed

lidel added a commit to ipfs-inactive/website that referenced this issue Sep 2, 2019

fix: exclude wikipedia published without canonical urls

81ce3ef

Context: ipfs/distributed-wikipedia-mirror#48 License: MIT Signed-off-by: Marcin Rataj <lidel@lidel.org>

lidel mentioned this issue Sep 2, 2019

fix: exclude wikipedia published without canonical urls ipfs-inactive/website#334

Merged

lidel added a commit to ipfs-inactive/website that referenced this issue Sep 2, 2019

fix: exclude wikipedia published without canonical urls

e0a083d

Context: ipfs/distributed-wikipedia-mirror#48 License: MIT Signed-off-by: Marcin Rataj <lidel@lidel.org>

This was referenced Sep 9, 2019

Update tr.wikipedia-on-ipfs.org #60

Closed

Update en.wikipedia-on-ipfs.org #61

Closed

momack2 mentioned this issue Sep 10, 2019

Add all the other wikipedia snapshots #63

Open

17 tasks

lidel mentioned this issue Sep 27, 2019

Add rel="canonical" for search engines #65

Closed

derhuerst mentioned this issue Oct 28, 2019

serve with canonical link to "normal" Wikipedia derhuerst/wikipedia-feed-ui#1

Open

lidel closed this as completed Feb 15, 2021
