MSC4021: Archive client controls #4021
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,52 @@ | ||
# MSC4021: Archive client controls | ||
|
||
The creation of archive.matrix.org indicates that search engine indexing of public Matrix rooms is a goal, but more | ||
granular control over how rooms should be indexed and displayed in search engine results should be granted to room | ||
admins. | ||
|
||
The current solution determines indexing eligibility based on `world_readable` `public` history visibility, but this is | ||
not an ideal solution because these settings only imply world readability within regular Matrix clients to most users, | ||
as opposed to the wider internet. Most alternative social media platforms provide separate settings for profile | ||
visibility and search engine indexing, for example. | ||
|
||
|
||
## Proposal | ||
|
||
Add an `m.room.archive_controls` state event where you can specify information about if and how you would like your | ||
For example, people building a blog or forum on Matrix would use similar
Maybe we only need to be generic with a

Maybe, but I didn't want this to be confused with controls over Matrix chat/integration bots, which this really isn't. It's more of a control over a specific class of clients in my mind, which I wasn't sure how to refer to. Unless you think this has a purpose outside of clients which are intended for public unauthenticated access, but I think a comments system on a blog would also fall under that category. |
||
room to be crawled. The [/publicRooms API](https://spec.matrix.org/v1.7/client-server-api/#get_matrixclientv3publicrooms) | ||
must relay this information to clients. | ||
|
||
| key | type | value | description | required | ||
|--|--|--|--|-- | ||
| `archive` | boolean | | Whether the room should be included in room directory listings which are intended to be viewed by the public | | ||
| `robots` | [string] | Valid [robots meta rules](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#directives) | A list of rules which should be included in a `robots` meta tag and/or [HTTP header](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#xrobotstag-implementation) by public-facing clients. e.g. `["noarchive"]` or `["noindex", "nofollow"]`. | ||
Comment on lines +21 to +22
For the Matrix Public Archive, there are kind of two things to consider:
The For the display decision, it's less clear whether
And the Matrix Public Archive really just allows you to view a public Matrix room with some potential caching on top (it doesn't store anything). But this might be an overloaded usage of Depending on the answer here, the Perhaps the display should be keyed off something else entirely anyway. |
||
| `via` | string | Hostname | A hostname which should be set as the canonical archive URL. e.g. `"archive.matrix.org"`. | ||
|
||
Public-facing clients like [matrix-public-archive](https://github.com/matrix-org/matrix-public-archive) should validate | ||
these rules before returning them in a response. | ||
|
||
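As a rough illustration of the validation step described above, a public-facing client might filter the `robots` list before emitting it. This is only a sketch: the directive set below is an assumed subset of the robots meta directives, and the helper names `validate_robots` and `robots_header` are hypothetical, not part of this proposal or of matrix-public-archive.

```python
import re

# Assumption: a subset of the directives documented for the robots meta tag.
SIMPLE_DIRECTIVES = {
    "all", "noindex", "nofollow", "none", "noarchive",
    "nosnippet", "notranslate", "noimageindex",
}
# Directives that carry a numeric value, e.g. "max-snippet:50".
VALUED_DIRECTIVE = re.compile(r"^(max-snippet|max-video-preview):-?\d+$")


def validate_robots(rules):
    """Return only the entries that look like known robots directives."""
    if not isinstance(rules, list):
        return []
    valid = []
    for rule in rules:
        if not isinstance(rule, str):
            continue  # ignore non-string entries in the event content
        normalized = rule.strip().lower()
        if normalized in SIMPLE_DIRECTIVES or VALUED_DIRECTIVE.match(normalized):
            valid.append(normalized)
    return valid


def robots_header(content):
    """Build an X-Robots-Tag header value from event content, or None."""
    rules = validate_robots(content.get("robots"))
    return ", ".join(rules) if rules else None
```

With content such as `{"archive": True, "robots": ["noindex", "nofollow"]}`, `robots_header` would yield `noindex, nofollow`; unknown or malformed entries are dropped rather than echoed into the response.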
When `archive` is `false`, clients which display a room directory intended for public internet consumption (e.g. | ||
matrix-public-archive or matrix-static) should exclude that room from being displayed. Clients which provide access | ||
to native Matrix users (e.g. Element) should ignore this setting. | ||
|
||
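The directory rule above could be sketched as a simple filter over room directory entries. The proposal does not say how `/publicRooms` would carry this information, so the `archive_controls` key is purely hypothetical, as is the choice to keep rooms that have no controls set.

```python
def public_directory(rooms, controls_key="archive_controls"):
    """Drop rooms whose controls explicitly set `archive` to False.

    `controls_key` is a hypothetical field name; rooms without any
    controls are kept, which is an assumption rather than specified behavior.
    """
    visible = []
    for room in rooms:
        controls = room.get(controls_key)
        if isinstance(controls, dict) and controls.get("archive") is False:
            continue  # the room opted out of public-facing directories
        visible.append(room)
    return visible
```

A native client such as Element would simply not apply this filter.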
When `via` is specified, the client should return a [rel=canonical link element](https://developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls#rel-canonical-link-method) | ||
and/or a [rel=canonical HTTP header](https://developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls#rel-canonical-header-method) | ||
with the response pointing to the archive URL on the specified hostname. This prevents the Matrix.org public archive | ||
from returning duplicate content or taking precedence in search results over an organization's self-hosted archive. | ||
|
||
For example, if `via` is set to `"archive.example.net"` in `#main:example.net`, the page at | ||
https://archive.matrix.org/r/main:example.net/date/2023/05/28 should return this HTTP header: | ||
This seems to assume that all archivers will have the same URL format, which may not be true. If they all run matrix-public-archive, then that may be, but it's possible that some other archiving software may use a different format.

An alternative could be to have
It would miss out on features like date pagination, although it now occurs to me that for the purposes of web indexing, this might actually be preferable behavior?

The problem with this alternative is that it might be more difficult for the self-hosted client at archive.example.net to parse and not include this canonical link header, because I don't think it would be ideal for the canonical archive to return this header. So I don't know, maybe that's something to leave up to client interpretation, maybe a standard URL format should be part of the spec? 🤷♂️

If we wanted something specific to the Matrix Public Archive URL format, we could use an event type scoped to the sub-domain like |
||
|
||
``` | ||
Link: <https://archive.example.net/r/main:example.net/date/2023/05/28>; rel="canonical" | ||
``` | ||
|
||
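One way a client might derive that header value is sketched below. It assumes both archives share the same URL path format (which is not guaranteed if different archiving software is used), and the function name `canonical_link_header` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_link_header(request_url, via):
    """Return a Link header value pointing at the same path on `via`.

    Returns None when no `via` is set or when the request is already
    being served from the canonical host.
    Assumption: the canonical archive uses the same path layout.
    """
    parts = urlsplit(request_url)
    if not via or parts.netloc == via:
        return None
    canonical = urlunsplit(("https", via, parts.path, parts.query, ""))
    return f'<{canonical}>; rel="canonical"'
```

Returning `None` for the canonical host keeps the self-hosted archive from pointing a canonical link at itself.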
|
||
## Alternatives | ||
|
||
- [MSC2291](https://github.com/matrix-org/matrix-spec-proposals/pull/2291) could provide an alternative method of | ||
specifying this information. However, this proposal includes the web archive metadata in the room directory API, | ||
in order to access this information efficiently (this is a [requirement](https://github.com/matrix-org/matrix-public-archive/issues/47#issuecomment-1536938601) | ||
for the matrix-public-archive project, for example). This proposal also allows rooms to opt-out of publicly accessible | ||
room directories without clients like matrix-public-archive needing to join the room to read the state, and should | ||
be interpreted by any client built for public web crawler access rather than [specific bots/clients](https://github.com/matrix-org/matrix-spec-proposals/pull/2291/files#diff-2b62d9e1c5ef21f7e10959da64da4000a69069b4dfb5d436db30d12c6bd23cb7R21-R23) |
#2291
Should be added as an alternative.
Also this likely would need to depend on fed peeking since currently you need to join a room to access the info which some people may find bad.
I don't think that 2291 is an alternative; I think the goals are different. 2291 indicates whether the bot is allowed to crawl the room, whereas it looks like the intent for this one is to communicate to search engines whether they are allowed to index the room. For example, I might want my room available on archive.matrix.org, but I may not want Google to index it and present it in search results.
Imho the indexing is already a form of crawling a room. That's my reasoning. And the other msc can also be used for this case imho. It's a little more generic than this one
There is a desire (matrix-org/matrix-viewer#47 (comment)) to have the room directory API include this sort of information directly, which is why I'm not sure 2291 will work here. I edited the doc to expand upon this and add 2291 as an alternative. Because this is intended to function more similarly to `m.room.join_rules`, I don't think fed peeking is an issue, but I'm not sure.
@uhoreg With MSC2291, I think this could be achieved with a mix of `messages` and `log` in the `m.room.robots` event. Am I misinterpreting?
`m.room.robots`
The names are slightly confusing to what they actually do.
In MSC2291, `messages` is intended to indicate whether the bot itself is allowed to index messages, whereas this proposal is intended to communicate preferences to other crawlers that crawl the bot's logs. This may be able to be done with an addition to 2291 (e.g. add a new property), but 2291 itself doesn't do this.
Feels like there might not be any difference between the bot itself (Matrix Public Archive) and a different crawler that crawls the bot's logs (a search engine). They're both accessing the same information (same or derived) and feels like `messages` from MSC2291 to control indexing of messages covers that. In other words, if `messages: false`, the archive can't index messages and neither can search engines.

Basically, any bot preference should probably be passed down for other bots to follow?

I think if it's a wildcard `*`, it should apply to downstream bots. It's less clear how things should flow if someone specified an app. Perhaps it wouldn't flow in the specific app case but could use the `*` rules to govern how search engines look at it.

And maybe we want to define some generic "search_engines" key for example since it might be common. But not all of the preferences are applicable since we can't pass along all of this preference detail seamlessly (impedance mismatch).