diff --git a/docs/faq.md b/docs/faq.md
index fb5a8275..3c5fda0b 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -17,19 +17,32 @@ And with the introduction of the jump to date API via
 [MSC3030](https://github.com/matrix-org/matrix-spec-proposals/pull/3030), we could show
 messages from any given date and day-by-day navigation.
 
-## How do I opt out and keep my room from being indexed by search engines?
-
-All public Matrix rooms are accessible to view in the Matrix Public Archive. But only
-rooms with history visibility set to `world_readable` are indexable by search engines.
-
-Also see https://github.com/matrix-org/matrix-public-archive/issues/47 to track better
-opt out controls.
+## Why did the archive bot join my room?
+
+Only public Matrix rooms with `shared` or `world_readable` [history
+visibility](https://spec.matrix.org/latest/client-server-api/#room-history-visibility) are
+accessible in the Matrix Public Archive. In some clients like Element, the `shared`
+option equates to "Members only (since the point in time of selecting this option)" and
+`world_readable` to "Anyone" under the **room settings** -> **Security & Privacy** ->
+**Who can read history?**.
+
+But the archive bot (`@archive:matrix.org`) will join any public room because it doesn't
+know the history visibility without first joining. Any room without `world_readable` or
+`shared` history visibility will lead to a `403 Forbidden`. And if the public room is in
+the room directory, it will be listed in the archive but will still lead to a `403
+Forbidden` in that case.
+
+The Matrix Public Archive doesn't hold onto any data (it's
+stateless) and requests the messages from the homeserver every time. The
+[archive.matrix.org](https://archive.matrix.org/) instance has some caching in place, 5
+minutes for the current day, and 2 days for past content.
 
-For [archive.matrix.org](https://archive.matrix.org/), you can ban the
-`@archive:matrix.org` user if you don't want your room content to be shown in the
-archive at all.
+The Matrix Public Archive only allows rooms with `world_readable` history visibility to
+be indexed by search engines. See the [opt
+out](#how-do-i-opt-out-and-keep-my-room-from-being-indexed-by-search-engines) topic
+below for more details.
 
-## Why does the archive user join rooms instead of browsing them as a guest?
+### Why does the archive user join rooms instead of browsing them as a guest?
 
 Guests require `m.room.guest_access` to access a room. Most public rooms do not allow
 guests because even the `public_chat` preset when creating a room does not allow guest
@@ -37,11 +50,22 @@ access. Not being able to view most public rooms is the major blocker on being a
 use guest access. The idea is if I can view the messages from a Matrix client as a
 random user, I should also be able to see the messages in the archive.
 
-Keep in mind that only rooms with history visibility set to `world_readable` are
-indexable by search engines. The Matrix Public Archive doesn't hold onto any data (it's
-stateless) and requests the messages from the homeserver every time. The
-[archive.matrix.org](https://archive.matrix.org/) instance has some caching in place, 5
-minutes for the current day, and 2 days for past content.
+Guest access is also a much different ask than read-only access since guests can also
+send messages in the room which isn't always desirable. The archive bot is read-only and
+does not send messages.
+
+## How do I opt out and keep my room from being indexed by search engines?
+
+Only public Matrix rooms with `shared` or `world_readable` history visibility are
+accessible to view in the Matrix Public Archive. But only rooms with history visibility
+set to `world_readable` are indexable by search engines.
+
+Also see https://github.com/matrix-org/matrix-public-archive/issues/47 to track better
+opt out controls.
+
+As a workaround for [archive.matrix.org](https://archive.matrix.org/) today, you can ban
+the `@archive:matrix.org` user if you don't want your room content to be shown in the
+archive at all.
 
 ## Technical details
 
diff --git a/server/lib/matrix-utils/ensure-room-joined.js b/server/lib/matrix-utils/ensure-room-joined.js
index fc92536a..09826f43 100644
--- a/server/lib/matrix-utils/ensure-room-joined.js
+++ b/server/lib/matrix-utils/ensure-room-joined.js
@@ -3,14 +3,19 @@
 const assert = require('assert');
 
 const urlJoin = require('url-join');
+const StatusError = require('../errors/status-error');
 const { fetchEndpointAsJson } = require('../fetch-endpoint');
 const getServerNameFromMatrixRoomIdOrAlias = require('./get-server-name-from-matrix-room-id-or-alias');
+const MatrixPublicArchiveURLCreator = require('matrix-public-archive-shared/lib/url-creator');
 
 const config = require('../config');
-const StatusError = require('../errors/status-error');
+const basePath = config.get('basePath');
+assert(basePath);
 const matrixServerUrl = config.get('matrixServerUrl');
 assert(matrixServerUrl);
 
+const matrixPublicArchiveURLCreator = new MatrixPublicArchiveURLCreator(basePath);
+
 async function ensureRoomJoined(
   accessToken,
   roomIdOrAlias,
@@ -43,6 +48,19 @@ async function ensureRoomJoined(
       method: 'POST',
       accessToken,
       abortSignal,
+      body: {
+        reason:
+          `Joining room to check history visibility. ` +
+          `If your room is public with shared or world readable history visibility, ` +
+          `it will be accessible at ${matrixPublicArchiveURLCreator.archiveUrlForRoom(
+            roomIdOrAlias
+            // We don't need to include the `viaServers` option here because the archive
+            // will already be joined to the room from this request itself and we don't
+            // need to make the URL any longer/noisier than it needs to be.
+          )}. ` +
+          `See the FAQ for more details: ` +
+          `https://github.com/matrix-org/matrix-public-archive/blob/main/docs/faq.md#why-did-the-archive-bot-join-my-room`,
+      },
     });
     assert(
       joinData.room_id,
diff --git a/test/e2e-tests.js b/test/e2e-tests.js
index 9f21fdfd..f7b4bc2b 100644
--- a/test/e2e-tests.js
+++ b/test/e2e-tests.js
@@ -14,6 +14,7 @@ const chalk = require('chalk');
 const RethrownError = require('../server/lib/errors/rethrown-error');
 const MatrixPublicArchiveURLCreator = require('matrix-public-archive-shared/lib/url-creator');
 const { fetchEndpointAsText, fetchEndpointAsJson } = require('../server/lib/fetch-endpoint');
+const ensureRoomJoined = require('../server/lib/matrix-utils/ensure-room-joined');
 const config = require('../server/lib/config');
 const {
   MS_LOOKUP,
@@ -999,10 +1000,11 @@ describe('matrix-public-archive', () => {
           // avoid problems jumping to the latest activity since we can't control the
           // timestamp of the membership event.
           const archiveAppServiceUserClient = await getTestClientForAs();
-          await joinRoom({
-            client: archiveAppServiceUserClient,
-            roomId: roomId,
-          });
+          // We use `ensureRoomJoined` instead of `joinRoom` because we're joining
+          // the archive user here and want the same join `reason` to avoid a new
+          // state event being created (`joinRoom` -> `{ displayname, membership }`
+          // whereas `ensureRoomJoined` -> `{ reason, displayname, membership }`)
+          await ensureRoomJoined(archiveAppServiceUserClient.accessToken, roomId);
 
           // Just spread things out a bit so the event times are more obvious
           // and stand out from each other while debugging and so we just have