MSC4021: Archive client controls #4021

Draft
52 changes: 52 additions & 0 deletions proposals/4021-archive-controls.md
@@ -0,0 +1,52 @@
# MSC4021: Archive client controls
Contributor

#2291

Should be added as an alternative.

Also, this would likely need to depend on federated peeking, since currently you need to join a room to access this information, which some people may object to.

Member

I don't think that 2291 is an alternative; I think the goals are different. 2291 indicates whether the bot is allowed to crawl the room, whereas it looks like the intent for this one is to communicate to search engines whether they are allowed to index the room. For example, I might want my room available on archive.matrix.org, but I may not want Google to index it and present it in search results.

Contributor

IMHO, indexing is already a form of crawling a room; that's my reasoning. The other MSC can also be used for this case, and it's a little more generic than this one.

Author

@jonaharagon jonaharagon May 28, 2023

There is a desire (matrix-org/matrix-viewer#47 (comment)) to have the room directory API include this sort of information directly, which is why I'm not sure 2291 will work here. I edited the doc to expand upon this and add 2291 as an alternative. Because this is intended to function more similarly to m.room.join_rules I don't think fed peeking is an issue, but I'm not sure.

Contributor

@MadLittleMods MadLittleMods Jun 29, 2023

For example, I might want my room available on archive.matrix.org, but I may not want Google to index it and present it in search results.

@uhoreg With MSC2291, I think this could be achieved with a mix of messages and log in the m.room.robots event. Am I misinterpreting?

m.room.robots

{
  "*": {
    "messages": false,
    "log": true
  }
}
  • messages: (boolean) whether the bot is allowed to index the room's
    messages. Default: true if m.room.history_visibility is
    world_readable, and false otherwise.
  • log: (boolean) whether the bot is allowed to display logs of the room to
    users. This will be false if messages is false. Default: true if
    m.room.history_visibility is world_readable, and false otherwise.

The names are slightly confusing given what they actually do.

Member

@uhoreg uhoreg Jun 30, 2023

In MSC2291, messages is intended to indicate whether the bot itself is allowed to index messages, whereas this proposal is intended to communicate preferences to other crawlers that crawl the bot's logs. It may be possible to do this with an addition to 2291 (e.g. add a new property), but 2291 itself doesn't do this.

Contributor

It feels like there might not be any difference between the bot itself (Matrix Public Archive) and a different crawler that crawls the bot's logs (a search engine). They're both accessing the same information (the same or derived), and it feels like messages from MSC2291, which controls indexing of messages, covers that. In other words, if messages: false, the archive can't index messages and neither can search engines.

Basically, any bot preference should probably be passed down for other bots to follow?

I think if it's a wildcard *, it should apply to downstream bots. It's less clear how things should flow if someone specified an app. Perhaps it wouldn't flow in the specific app case but could use the * rules to govern how search engines look at it.

And maybe we want to define some generic "search_engines" key for example since it might be common. But not all of the preferences are applicable since we can't pass along all of this preference detail seamlessly (impedance mismatch).


The creation of archive.matrix.org indicates that search engine indexing of public Matrix rooms is a goal, but more
granular control over how rooms should be indexed and displayed in search engine results should be granted to room
admins.

The current solution determines indexing eligibility based on `world_readable` `public` history visibility, but this is
not an ideal solution because these settings only imply world readability within regular Matrix clients to most users,
as opposed to the wider internet. Most alternative social media platforms provide separate settings for profile
visibility and search engine indexing, for example.


## Proposal

Add an `m.room.archive_controls` state event where you can specify information about if and how you would like your
Contributor

@MadLittleMods MadLittleMods Jun 1, 2023

m.room.archive_controls feels very specific to the archive use case and we may want to be more generic.

For example, people building a blog or forum on Matrix would use similar robots controls (see other beyond-chat applications for Matrix).

Maybe we only need to be generic with an m.room.robots state event, and other archive-specific event types would still be useful.

Author

@jonaharagon jonaharagon Jun 1, 2023

Maybe, but I didn't want this to be confused for controls over Matrix chat/integration bots, which this really isn't. It's more of a control over a specific class of clients in my mind, which I wasn't sure how to refer to.

Unless you think this has purpose outside of clients which are intended for public unauthenticated access, but I think a comments system on a blog would also fall under that category.

room to be crawled. The [/publicRooms API](https://spec.matrix.org/v1.7/client-server-api/#get_matrixclientv3publicrooms)
must relay this information to clients.

| key | type | value | description | required |
|--|--|--|--|--|
| `archive` | boolean | | Whether the room should be included in room directory listings which are intended to be viewed by the public. | |
| `robots` | [string] | Valid [robots meta rules](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#directives) | A list of rules which should be included in a `robots` meta tag and/or [HTTP header](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#xrobotstag-implementation) by public-facing clients. e.g. `["noarchive"]` or `["noindex", "nofollow"]`. | |
Comment on lines +21 to +22
Contributor

@MadLittleMods MadLittleMods Jun 1, 2023

For the Matrix Public Archive, there are kind of two things to consider:

  • Whether you want to show up in the archive at all (display)
  • Whether you want to allow search engines to index that content (indexing)

The robots field definitely covers the search engine indexing decision, since a room can opt out with noindex.

For the display decision, it's less clear whether robots can cover it. But noarchive sounds pretty decent just by name and also because of what it means:

noarchive

Requests the search engine not to cache the page content.

-- https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta/name#other_metadata_names

And the Matrix Public Archive really just allows you to view a public Matrix room with some potential caching on top (it doesn't store anything). But this might be an overloaded usage of noarchive, since caching is not the same as displaying, which is what the archive does at its core.

Depending on the answer here, the archive field may be redundant compared to what can be specified in robots.

Perhaps the display should be keyed off something else entirely anyway.

| `via` | string | Hostname | A hostname which should be set as the canonical archive URL. e.g. `"archive.matrix.org"`. | |

Public-facing clients like [matrix-public-archive](https://github.com/matrix-org/matrix-public-archive) should validate
these rules before returning them in a response.
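
As a rough illustration of that validation step, here is a minimal Python sketch. The helper names (`validate_archive_controls`, `robots_header_value`), the allowlist of directives, and the hostname pattern are all assumptions made for this example; the proposal itself only says that the rules should be valid robots meta rules and that `via` is a hostname.

```python
import re

# Assumption: an illustrative allowlist of simple robots directives.
# The proposal does not enumerate which rules count as valid.
KNOWN_ROBOTS_RULES = {
    "all", "noindex", "nofollow", "none", "noarchive",
    "nosnippet", "notranslate", "noimageindex",
}

# Assumption: `via` is a bare hostname made of dot-separated labels.
HOSTNAME_RE = re.compile(
    r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
    r"(\.[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?)*"
)


def validate_archive_controls(content):
    """Return a sanitized copy of an m.room.archive_controls content dict."""
    clean = {}

    # `archive` must be a real boolean; anything else is dropped.
    if isinstance(content.get("archive"), bool):
        clean["archive"] = content["archive"]

    # Keep only string rules that appear in the allowlist, lowercased.
    robots = content.get("robots")
    if isinstance(robots, list):
        rules = [
            rule.lower()
            for rule in robots
            if isinstance(rule, str) and rule.lower() in KNOWN_ROBOTS_RULES
        ]
        if rules:
            clean["robots"] = rules

    # `via` must look like a hostname, not a URL or arbitrary text.
    via = content.get("via")
    if isinstance(via, str) and len(via) <= 253 and HOSTNAME_RE.fullmatch(via):
        clean["via"] = via

    return clean


def robots_header_value(clean):
    """Join validated rules into a value usable in a robots meta tag or header."""
    return ", ".join(clean.get("robots", []))
```

Dropping unknown rules rather than rejecting the whole event is one possible design choice; it means a single malformed entry does not discard the room's other preferences.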

When `archive` is `false`, clients which display a room directory intended for public internet consumption (e.g.
matrix-public-archive or matrix-static) should exclude that room from being displayed. Clients which provide access
to native Matrix users (e.g. Element) should ignore this setting.
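
The filtering rule above could be sketched as follows. This assumes the room directory response carries the event content under a hypothetical `archive_controls` key on each room entry; the proposal requires `/publicRooms` to relay the information but does not name the field. Rooms that do not set `archive` at all are treated as listable here, which is also an assumption.

```python
def public_directory_rooms(rooms):
    """Drop rooms that have explicitly opted out with `archive: false`."""
    return [
        room
        for room in rooms
        # Hypothetical field name; only an explicit False excludes the room.
        if room.get("archive_controls", {}).get("archive") is not False
    ]
```

A native client such as Element would skip this filter entirely and show the unfiltered directory.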

When `via` is specified, the client should return a [rel=canonical link element](https://developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls#rel-canonical-link-method)
and/or a [rel=canonical HTTP header](https://developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls#rel-canonical-header-method)
with the response pointing to the archive URL on the specified hostname. This prevents the Matrix.org public archive
from returning duplicate content or taking precedence in search results over an organization's self-hosted archive.

For example, if `via` is set to `"archive.example.net"` in `#main:example.net`, the page at
https://archive.matrix.org/r/main:example.net/date/2023/05/28 should return this HTTP header:
Member

This seems to assume that all archivers will have the same URL format, which may not be true. If they all run matrix-public-archive, then that may be, but it's possible that some other archiving software may use a different format.

Author

An alternative could be to have via be a full URI, like https://archive.example.net/r/main:example.net, and then https://archive.matrix.org/r/main:example.net/date/2023/05/28 would return:

Link: <https://archive.example.net/r/main:example.net>; rel="canonical"

It would miss out on features like date pagination, although it now occurs to me that for the purposes of web indexing, this might actually be preferable behavior?

Author

The problem with this alternative is that it might be more difficult for the self-hosted client at archive.example.net to recognize that it is itself the canonical archive and omit this canonical link header; I don't think it would be ideal for the canonical archive to return this header. So I don't know, maybe that's something to leave up to client interpretation, or maybe a standard URL format should be part of the spec? 🤷‍♂️

Contributor

@MadLittleMods MadLittleMods Jun 1, 2023

If we wanted something specific to the Matrix Public Archive URL format, we could use an event type scoped to the sub-domain like org.matrix.archive.canonical to convey this information.


```
Link: <https://archive.example.net/r/main:example.net/date/2023/05/28>; rel="canonical"
```
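
A minimal sketch of how a client might derive that header, assuming (as the example above does) that every archive shares the same URL path format, which is not guaranteed across different archiving software. The function name `canonical_link_header` is hypothetical, and skipping the header when the request already targets the `via` host is an assumption about desirable behavior rather than something the proposal specifies.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_link_header(request_url, via):
    """Build a rel="canonical" Link header value, or None if not applicable."""
    parts = urlsplit(request_url)

    # Assumption: the canonical archive itself does not emit the header.
    if not via or parts.netloc == via:
        return None

    # Assumption: same path on the canonical host; query and fragment dropped.
    canonical = urlunsplit(("https", via, parts.path, "", ""))
    return f'<{canonical}>; rel="canonical"'
```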


## Alternatives

- [MSC2291](https://github.com/matrix-org/matrix-spec-proposals/pull/2291) could provide an alternative method of
specifying this information. However, this proposal includes the web archive metadata in the room directory API
so that this information can be accessed efficiently (this is a [requirement](https://github.com/matrix-org/matrix-public-archive/issues/47#issuecomment-1536938601)
for the matrix-public-archive project, for example). This proposal also allows rooms to opt out of publicly accessible
room directories without clients like matrix-public-archive needing to join the room to read the state, and should
be interpreted by any client built for public web crawler access rather than [specific bots/clients](https://github.com/matrix-org/matrix-spec-proposals/pull/2291/files#diff-2b62d9e1c5ef21f7e10959da64da4000a69069b4dfb5d436db30d12c6bd23cb7R21-R23)