From acffccf36cf97274fac9c3da909b3298cfea3875 Mon Sep 17 00:00:00 2001 From: Jonah Aragon Date: Sun, 28 May 2023 09:51:00 -0500 Subject: [PATCH 1/2] Create 4021-archive-controls.md --- proposals/4021-archive-controls.md | 36 ++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) create mode 100644 proposals/4021-archive-controls.md diff --git a/proposals/4021-archive-controls.md b/proposals/4021-archive-controls.md new file mode 100644 index 00000000000..fe3a005b740 --- /dev/null +++ b/proposals/4021-archive-controls.md @@ -0,0 +1,36 @@ +# MSC4021: Archive client controls + +The creation of archive.matrix.org indicates that search engine indexing of public Matrix rooms is a goal, but more +granular control over how rooms should be indexed and displayed in search engine results should be granted to room +admins. + +The current solution determines indexing eligibility based on `world_readable` `public` history visibility, but this is +not an ideal solution because these settings only imply world readability within regular Matrix clients to most users, +as opposed to the wider internet. Most alternative social media platforms provide separate settings for profile +visibility and search engine indexing, for example. + + +## Proposal + +Add an `m.room.archive_controls` state event where you can specify information about if and how you would like your +room to be crawled. The room directory must relay this information to clients. + +| key | type | value | description | required +|--|--|--|--|-- +| `robots` | [string] | Valid [robots meta rules](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#directives) | A list of rules which should be included in a `robots` meta tag and/or [HTTP header](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#xrobotstag-implementation) by public-facing clients. e.g. `["noarchive"]` or `["noindex", "nofollow"]`. 
+| `via` | string | Hostname | A hostname which should be set as the canonical archive URL. e.g. `"archive.matrix.org"`. + +Public-facing clients like [matrix-public-archive](https://github.com/matrix-org/matrix-public-archive) should validate +these rules before returning them in a response. + +When `via` is specified, the client should return a [rel=canonical link element](https://developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls#rel-canonical-link-method) +and/or a [rel=canonical HTTP header](https://developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls#rel-canonical-header-method) +with the response pointing to the archive URL on the specified hostname. This prevents the Matrix.org public archive +from returning duplicate content or taking precedence in search results over an organization's self-hosted archive. + +For example, if `via` is set to `"archive.example.net"` in `#main:example.net`, the page at +https://archive.matrix.org/r/main:example.net/date/2023/05/28 should return this HTTP header: + +``` +Link: ; rel="canonical" +``` From 144fff4030a2b662627cc9aba05d89d2998591b8 Mon Sep 17 00:00:00 2001 From: Jonah Aragon Date: Sun, 28 May 2023 14:35:44 -0500 Subject: [PATCH 2/2] Update 4021-archive-controls.md --- proposals/4021-archive-controls.md | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/proposals/4021-archive-controls.md b/proposals/4021-archive-controls.md index fe3a005b740..ddb349564d0 100644 --- a/proposals/4021-archive-controls.md +++ b/proposals/4021-archive-controls.md @@ -13,16 +13,22 @@ visibility and search engine indexing, for example. ## Proposal Add an `m.room.archive_controls` state event where you can specify information about if and how you would like your -room to be crawled. The room directory must relay this information to clients. +room to be crawled. 
The [/publicRooms API](https://spec.matrix.org/v1.7/client-server-api/#get_matrixclientv3publicrooms) +must relay this information to clients. | key | type | value | description | required |--|--|--|--|-- +| `archive` | boolean | | Whether the room should be included in room directory listings which are intended to be viewed by the public | | `robots` | [string] | Valid [robots meta rules](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#directives) | A list of rules which should be included in a `robots` meta tag and/or [HTTP header](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#xrobotstag-implementation) by public-facing clients. e.g. `["noarchive"]` or `["noindex", "nofollow"]`. | `via` | string | Hostname | A hostname which should be set as the canonical archive URL. e.g. `"archive.matrix.org"`. Public-facing clients like [matrix-public-archive](https://github.com/matrix-org/matrix-public-archive) should validate these rules before returning them in a response. +When `archive` is `false`, clients which display a room directory intended for public internet consumption (e.g. +matrix-public-archive or matrix-static) should exclude that room from being displayed. Clients which provide access +to native Matrix users (e.g. Element) should ignore this setting. + When `via` is specified, the client should return a [rel=canonical link element](https://developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls#rel-canonical-link-method) and/or a [rel=canonical HTTP header](https://developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls#rel-canonical-header-method) with the response pointing to the archive URL on the specified hostname. 
This prevents the Matrix.org public archive @@ -34,3 +40,13 @@ https://archive.matrix.org/r/main:example.net/date/2023/05/28 should return this ``` Link: ; rel="canonical" ``` + + +## Alternatives + +- [MSC2291](https://github.com/matrix-org/matrix-spec-proposals/pull/2291) could provide an alternative method of + specifying this information. However, this proposal includes the web archive metadata in the room directory API, + in order to access this information efficiently (this is a [requirement](https://github.com/matrix-org/matrix-public-archive/issues/47#issuecomment-1536938601) + for the matrix-public-archive project, for example). This proposal also allows rooms to opt out of publicly accessible + room directories without clients like matrix-public-archive needing to join the room to read the state, and should + be interpreted by any client built for public web crawler access rather than [specific bots/clients](https://github.com/matrix-org/matrix-spec-proposals/pull/2291/files#diff-2b62d9e1c5ef21f7e10959da64da4000a69069b4dfb5d436db30d12c6bd23cb7R21-R23).
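
As a rough illustration of the behaviour this proposal describes, a public-facing client might translate the content of an `m.room.archive_controls` event into response headers along the lines below. This is only a sketch and not part of the patch. The function names, the allowlist of robots rules (an illustrative subset, not the full set of directives), the hostname check, and the treatment of a missing `archive` key as "listed" are all assumptions. It also assumes the canonical URL is simply the requested URL with its host replaced by `via`.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical allowlist: an illustrative subset of parameterless robots rules.
ALLOWED_ROBOTS_RULES = frozenset(
    {"all", "none", "noindex", "nofollow", "noarchive", "nosnippet", "noimageindex"}
)

# Hypothetical, deliberately strict hostname check (optional port allowed).
# Rejecting anything else keeps whitespace and CR/LF out of the header value.
HOSTNAME_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9.-]*[A-Za-z0-9])?(?::\d{1,5})?")


def should_list_publicly(content: dict) -> bool:
    """Exclude the room only when `archive` is explicitly false.

    Treating a missing key as "listed" is an assumption of this sketch.
    """
    return content.get("archive") is not False


def archive_response_headers(content: dict, request_url: str) -> dict:
    """Build HTTP headers from `m.room.archive_controls` event content."""
    headers = {}

    # Validate `robots` before echoing it: keep only known string rules.
    robots = content.get("robots")
    if isinstance(robots, list):
        rules = [r for r in robots if isinstance(r, str) and r in ALLOWED_ROBOTS_RULES]
        if rules:
            headers["X-Robots-Tag"] = ", ".join(rules)

    # Point rel=canonical at the same path on the `via` hostname.
    via = content.get("via")
    if isinstance(via, str) and HOSTNAME_RE.fullmatch(via):
        parts = urlsplit(request_url)
        canonical = urlunsplit(parts._replace(netloc=via))
        headers["Link"] = f'<{canonical}>; rel="canonical"'

    return headers
```

Calling `archive_response_headers({"via": "archive.example.net"}, "https://archive.matrix.org/r/main:example.net/date/2023/05/28")` under these assumptions yields a `Link` header that targets the same path on `archive.example.net`. Unknown robots rules and malformed hostnames are dropped rather than relayed.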