Add reason why the archive bot is joining the room #262

Merged
merged 9 commits into from
Jun 9, 2023
56 changes: 40 additions & 16 deletions docs/faq.md
@@ -17,31 +17,55 @@ And with the introduction of the jump to date API via
[MSC3030](https://github.com/matrix-org/matrix-spec-proposals/pull/3030), we could show
messages from any given date and day-by-day navigation.

## How do I opt out and keep my room from being indexed by search engines?

All public Matrix rooms are accessible to view in the Matrix Public Archive. But only
rooms with history visibility set to `world_readable` are indexable by search engines.

Also see https://github.com/matrix-org/matrix-public-archive/issues/47 to track better
opt out controls.
## Why did the archive bot join my room?

Only public Matrix rooms with `shared` or `world_readable` [history
visibility](https://spec.matrix.org/latest/client-server-api/#room-history-visibility) are
accessible in the Matrix Public Archive. In some clients like Element, the `shared`
option equates to "Members only (since the point in time of selecting this option)" and
`world_readable` to "Anyone" under the **room settings** -> **Security & Privacy** ->
**Who can read history?**.

But the archive bot (`@archive:matrix.org`) will join any public room because it doesn't
know the history visibility without first joining. Any room without `world_readable` or
`shared` history visibility will lead to a `403 Forbidden`. And if the public room is in
MadLittleMods marked this conversation as resolved.
the room directory, it will be listed in the archive but will still lead to a `403
Forbidden` in that case.

The Matrix Public Archive doesn't hold onto any data (it's
stateless) and requests the messages from the homeserver every time. The
[archive.matrix.org](https://archive.matrix.org/) instance has some caching in place, 5
minutes for the current day, and 2 days for past content.
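
As a rough illustration of the caching policy described above (this is not code from the PR), a minimal sketch might pick a cache lifetime depending on whether the requested day is the current day. The function and constant names are hypothetical, and comparing calendar days in UTC is an assumption.

```javascript
// Hypothetical constants mirroring the numbers quoted in the FAQ.
const CURRENT_DAY_CACHE_SECONDS = 5 * 60; // 5 minutes
const PAST_DAY_CACHE_SECONDS = 2 * 24 * 60 * 60; // 2 days

// Assumption: "current day" is decided by comparing UTC calendar dates.
function toUtcDayString(date) {
  return date.toISOString().slice(0, 10); // e.g. "2023-06-09"
}

// Returns how long (in seconds) a page for `requestedDate` could be cached.
function cacheMaxAgeSeconds(requestedDate, now = new Date()) {
  const isCurrentDay = toUtcDayString(requestedDate) === toUtcDayString(now);
  return isCurrentDay ? CURRENT_DAY_CACHE_SECONDS : PAST_DAY_CACHE_SECONDS;
}
```

A short lifetime for the current day keeps new messages visible quickly, while past days change rarely and can be cached for longer.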

For [archive.matrix.org](https://archive.matrix.org/), you can ban the
`@archive:matrix.org` user if you don't want your room content to be shown in the
archive at all.
The Matrix Public Archive only allows rooms with `world_readable` history visibility to
be indexed by search engines. See the [opt
out](#how-do-i-opt-out-and-keep-my-room-from-being-indexed-by-search-engines) topic
below for more details.
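
The rules in this section can be summarized in a small sketch. It is only an illustration, not the archive's actual implementation: `classifyRoomAccess` is a hypothetical helper, and it assumes the room is already known to be public.

```javascript
// Hypothetical helper: maps a public room's `m.room.history_visibility` value
// to what the FAQ says the archive will do with it.
function classifyRoomAccess(historyVisibility) {
  // Viewable in the archive only with `shared` or `world_readable` history.
  const viewable =
    historyVisibility === 'shared' || historyVisibility === 'world_readable';
  // Indexable by search engines only with `world_readable` history.
  const indexable = historyVisibility === 'world_readable';

  return {
    viewable,
    indexable,
    // Any other visibility (e.g. `invited` or `joined`) leads to a 403 Forbidden.
    httpStatus: viewable ? 200 : 403,
  };
}
```

For example, a `shared` room would be viewable but not indexable, while a `joined` room would result in a `403 Forbidden`.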

## Why does the archive user join rooms instead of browsing them as a guest?
### Why does the archive user join rooms instead of browsing them as a guest?

Guests require `m.room.guest_access` to access a room. Most public rooms do not allow
guests because even the `public_chat` preset when creating a room does not allow guest
access. Not being able to view most public rooms is the major blocker on being able to
use guest access. The idea is if I can view the messages from a Matrix client as a
random user, I should also be able to see the messages in the archive.

Keep in mind that only rooms with history visibility set to `world_readable` are
indexable by search engines. The Matrix Public Archive doesn't hold onto any data (it's
stateless) and requests the messages from the homeserver every time. The
[archive.matrix.org](https://archive.matrix.org/) instance has some caching in place, 5
minutes for the current day, and 2 days for past content.
Guest access is also a much different ask than read-only access since guests can also
send messages in the room which isn't always desirable. The archive bot is read-only and
does not send messages.

## How do I opt out and keep my room from being indexed by search engines?

Only public Matrix rooms with `shared` or `world_readable` history visibility are
accessible to view in the Matrix Public Archive. But only rooms with history visibility
set to `world_readable` are indexable by search engines.

Also see https://github.com/matrix-org/matrix-public-archive/issues/47 to track better
opt out controls.

As a workaround for [archive.matrix.org](https://archive.matrix.org/) today, you can ban
the `@archive:matrix.org` user if you don't want your room content to be shown in the
archive at all.
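
Most room admins would ban the user from their client, but as a sketch of what that workaround looks like at the API level, the snippet below builds (without sending) a request for the Matrix client-server `POST /_matrix/client/v3/rooms/{roomId}/ban` endpoint. The helper name, the example `reason` text, and the way the homeserver URL and access token are passed in are assumptions for illustration.

```javascript
// Hypothetical helper: describes the HTTP request that would ban the archive
// user from a room. It only builds the request; it does not send anything.
function buildBanArchiveUserRequest(homeserverUrl, roomId, accessToken) {
  const url = new URL(
    `/_matrix/client/v3/rooms/${encodeURIComponent(roomId)}/ban`,
    homeserverUrl
  );

  return {
    url: url.toString(),
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${accessToken}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        user_id: '@archive:matrix.org',
        // Example reason text only; any string (or none) would work.
        reason: 'Opting out of archive.matrix.org',
      }),
    },
  };
}
```

The returned `url` and `options` could then be passed to `fetch(url, options)` by a user with permission to ban in that room.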

## Technical details

20 changes: 19 additions & 1 deletion server/lib/matrix-utils/ensure-room-joined.js
@@ -3,14 +3,19 @@
const assert = require('assert');
const urlJoin = require('url-join');

const StatusError = require('../errors/status-error');
const { fetchEndpointAsJson } = require('../fetch-endpoint');
const getServerNameFromMatrixRoomIdOrAlias = require('./get-server-name-from-matrix-room-id-or-alias');
const MatrixPublicArchiveURLCreator = require('matrix-public-archive-shared/lib/url-creator');

const config = require('../config');
const StatusError = require('../errors/status-error');
const basePath = config.get('basePath');
assert(basePath);
const matrixServerUrl = config.get('matrixServerUrl');
assert(matrixServerUrl);

const matrixPublicArchiveURLCreator = new MatrixPublicArchiveURLCreator(basePath);

async function ensureRoomJoined(
accessToken,
roomIdOrAlias,
@@ -43,6 +48,19 @@ async function ensureRoomJoined(
method: 'POST',
accessToken,
abortSignal,
body: {
reason:
`Joining room to check history visibility. ` +
`If your room is public with shared or world readable history visibility, ` +
I think this is hostile wording towards the users as the term shared doesn't appear outside of the spec. Please consider using client terminology such as Element Web members-only?

Contributor Author
I've updated the FAQ with the equivalents you might see in the UI but it's not possible to be exhaustive for how every client might expose these options. We're also trying to be brief and to the point with this join reason so I'm going to just let the pointer to the FAQ take the lead for people trying to understand more.

`it will be accessible at ${matrixPublicArchiveURLCreator.archiveUrlForRoom(
roomIdOrAlias
// We don't need to include the `viaServers` option here because the archive
// will already be joined to the room from this request itself and we don't
// need to make the URL any longer/noisier than it needs to be.
)}. ` +
`See the FAQ for more details: ` +
`https://github.com/matrix-org/matrix-public-archive/blob/main/docs/faq.md#why-did-the-archive-bot-join-my-room`,
},
});
assert(
joinData.room_id,
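
To make the change in this hunk easier to follow in isolation, here is a standalone sketch of how a join request carrying a `reason` might be assembled. It is not the PR's code: `buildJoinReason` and `buildJoinRequest` are hypothetical helpers, `archiveUrl` stands in for the result of `matrixPublicArchiveURLCreator.archiveUrlForRoom(roomIdOrAlias)`, and the use of the `v3` join endpoint path is an assumption.

```javascript
// The FAQ link used in the join reason (same URL as in the diff above).
const FAQ_URL =
  'https://github.com/matrix-org/matrix-public-archive/blob/main/docs/faq.md#why-did-the-archive-bot-join-my-room';

// Builds the human-readable reason attached to the archive bot's join event.
function buildJoinReason(archiveUrl) {
  return (
    `Joining room to check history visibility. ` +
    `If your room is public with shared or world readable history visibility, ` +
    `it will be accessible at ${archiveUrl}. ` +
    `See the FAQ for more details: ${FAQ_URL}`
  );
}

// Assumption: the join goes through `POST /_matrix/client/v3/join/{roomIdOrAlias}`,
// which accepts an optional `reason` field in its JSON body.
function buildJoinRequest(matrixServerUrl, roomIdOrAlias, archiveUrl) {
  const url = new URL(
    `/_matrix/client/v3/join/${encodeURIComponent(roomIdOrAlias)}`,
    matrixServerUrl
  );

  return {
    url: url.toString(),
    method: 'POST',
    body: { reason: buildJoinReason(archiveUrl) },
  };
}
```

Keeping the reason short and pointing at the FAQ matches the author's reply above: the join reason stays brief, and the FAQ carries the client-specific terminology.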
10 changes: 6 additions & 4 deletions test/e2e-tests.js
@@ -14,6 +14,7 @@ const chalk = require('chalk');
const RethrownError = require('../server/lib/errors/rethrown-error');
const MatrixPublicArchiveURLCreator = require('matrix-public-archive-shared/lib/url-creator');
const { fetchEndpointAsText, fetchEndpointAsJson } = require('../server/lib/fetch-endpoint');
const ensureRoomJoined = require('../server/lib/matrix-utils/ensure-room-joined');
const config = require('../server/lib/config');
const {
MS_LOOKUP,
@@ -999,10 +1000,11 @@ describe('matrix-public-archive', () => {
// avoid problems jumping to the latest activity since we can't control the
// timestamp of the membership event.
const archiveAppServiceUserClient = await getTestClientForAs();
await joinRoom({
client: archiveAppServiceUserClient,
roomId: roomId,
});
// We use `ensureRoomJoined` instead of `joinRoom` because we're joining
// the archive user here and want the same join `reason` to avoid a new
// state event being created (`joinRoom` -> `{ displayname, membership }`
// whereas `ensureRoomJoined` -> `{ reason, displayname, membership }`)
await ensureRoomJoined(archiveAppServiceUserClient.accessToken, roomId);

// Just spread things out a bit so the event times are more obvious
// and stand out from each other while debugging and so we just have
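
The reasoning in the test comment above can be sketched as a simple content comparison. This is a simplified model and an assumption about homeserver behavior: it supposes a repeated join with identical `m.room.member` content is treated as a no-op, while any difference in content (such as an added `reason`) produces a new state event. The helper name and the example content values are hypothetical.

```javascript
const { isDeepStrictEqual } = require('node:util');

// Simplified model: a new membership event is needed whenever the content
// that would be sent differs from the member's current content.
function wouldCreateNewMembershipEvent(currentContent, nextContent) {
  return !isDeepStrictEqual(currentContent, nextContent);
}

// Content shapes taken from the test comment; the values are placeholders.
const fromJoinRoom = { displayname: 'archive', membership: 'join' };
const fromEnsureRoomJoined = {
  reason: 'Joining room to check history visibility. ...',
  displayname: 'archive',
  membership: 'join',
};

// Joining via `joinRoom` in the test and `ensureRoomJoined` in the app would differ.
const mixedHelpers = wouldCreateNewMembershipEvent(fromJoinRoom, fromEnsureRoomJoined);
// Using `ensureRoomJoined` in both places keeps the content identical.
const sameHelper = wouldCreateNewMembershipEvent(fromEnsureRoomJoined, {
  ...fromEnsureRoomJoined,
});
```

Under this model, `mixedHelpers` is `true` and `sameHelper` is `false`, which is why the test joins with `ensureRoomJoined`: the extra membership event would otherwise shift the timeline the test is asserting against.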