Don't allow previewing shared history rooms #239

Merged on Jun 27, 2023 (3 commits)

@tulir tulir commented May 30, 2023

Only `world_readable` can be considered as opting into having history publicly available on the web. Anything else must not be viewable without login until there's a dedicated state event for opting into archiving.

See #47
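
A minimal sketch of the rule described above, assuming a `roomData` object with a `historyVisibility` field as in the diff context of this PR. The function name `isAllowedToViewRoom` is hypothetical and is not taken from the project's code:

```javascript
// Hypothetical helper: only rooms that opted into public history are viewable.
// `roomData.historyVisibility` is assumed to hold the room's
// `m.room.history_visibility` value.
function isAllowedToViewRoom(roomData) {
  return roomData.historyVisibility === 'world_readable';
}
```

Under this sketch, a room with `shared` history is rejected even when its join rule is `public`.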

Only `world_readable` can be considered as opting into having history publicly on the web. Anything else must not be archived until there's a dedicated state event for opting into archiving.
// Only `world_readable` or `shared` rooms that are `public` are viewable in the archive
const allowedToViewRoom =
  roomData.historyVisibility === 'world_readable' ||
  (roomData.historyVisibility === 'shared' && roomData.joinRule === 'public');

Having public shared rooms viewable but not indexed by search engines is by design.

My reply in the opt-out issue probably explains this best so far:

"archived" is a bit of an overloaded term here, but given that this project is called "Matrix Public Archive" I can see where the confusion may be coming from. Any public room should be viewable in Matrix Public Archive. The idea is that if a random Matrix user can view the room, then it should be viewable in the archive. But only history_visibility: "world_readable" rooms are indexable by search engines.

The Matrix Public Archive doesn't hold onto any data (it's stateless) and requests the messages from the homeserver every time (it archives nothing). The archive.matrix.org instance has some caching in place: 5 minutes for the current day, and 2 days for past content.

I've tried to clarify more of this in the FAQ document and added more details on why not guest access/peeking.

Banning @archive:matrix.org will prevent the room from showing up on archive.matrix.org and the cache will expire after 5-minutes/2-days for any content that is showing there now. Adding better opt-out controls like this issue is discussing is on the list 👍. I've updated the description with the current MSC proposals out there.

-- #47 (comment)

@MadLittleMods MadLittleMods Jun 27, 2023

We plan to move ahead with this PR to remove shared public rooms from the archive ⏩

The points in favor of keeping shared rooms accessible are summarized below. In the end, though, while the archive bot respected the technical implications and doesn't expose messages to any further audience (random people), there is a social obligation to consider: this option has been represented as "members only" in many clients, which doesn't leave any nuance.

Otherwise, the main idea was that if I can view the messages from a Matrix client as a random user, I should also be able to see the messages in the archive. In both the native Matrix client and archive cases, it’s the same result when a random user wants to view a shared room:

  • A random Matrix user accesses the room, they see the history
  • A random user accesses the room in the archive, they see the history
  • Search engines are not allowed in either case (that only applies to world_readable rooms)
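
As a rough illustration of the earlier design these bullets describe (not the archive's actual code), viewing and indexing can be treated as two separate predicates. Both function names are hypothetical, and the `roomData` fields are assumed to match the diff context in this PR:

```javascript
// Earlier design (before this PR): anything a random Matrix user could read
// by simply joining was also viewable in the archive.
function wasViewableBeforeThisChange(roomData) {
  return (
    roomData.historyVisibility === 'world_readable' ||
    (roomData.historyVisibility === 'shared' && roomData.joinRule === 'public')
  );
}

// Search engine indexing was only ever allowed for `world_readable` rooms.
function isIndexableBySearchEngines(roomData) {
  return roomData.historyVisibility === 'world_readable';
}
```

A public `shared` room was therefore viewable but never indexable, which is the case this PR removes.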

The join is mostly a technical detail to anyone trying to view the room. While I don't think the join event provides much value to the room in the normal cases, it could have benefit in tracing bad actors for moderation.

From the spec:

  • world_readable - All events while this is the m.room.history_visibility value may be shared by any participating homeserver with anyone, regardless of whether they have ever joined the room.
  • shared - Previous events are always accessible to newly joined members. All events in the room are accessible, even those sent when the member was not a part of the room.

Removing shared rooms does mean we’re re-introducing friction for a portion of people, friction which the archive eliminates (which homeserver do I choose, which client, why do I even need an account, how do I view this on mobile, how do I reference and share this message with someone not already in the Matrix ecosystem, etc). But people can now update their room to be world_readable as they see fit to regain these benefits for their community.

@bkil

This comment was marked as off-topic.

@MadLittleMods MadLittleMods merged commit 1d3e930 into matrix-org:main Jun 27, 2023
@@ -155,7 +155,6 @@ const fetchRoomData = traceFunction(async function (
stateCanonicalAliasResDataOutcome,

Thanks for the contribution and patience @tulir 🙇 🐦

MadLittleMods added a commit that referenced this pull request Jun 27, 2023
MadLittleMods added a commit that referenced this pull request Jun 28, 2023
MadLittleMods added a commit that referenced this pull request Jun 28, 2023
MadLittleMods added a commit that referenced this pull request Jun 28, 2023
MadLittleMods added a commit that referenced this pull request Jun 30, 2023
Happens to address part of #271
but made primarily as a follow-up to #239

---

Only 42% of rooms on the `matrix.org` room directory are `world_readable`, which means most pages of rooms would be half-empty if we just naively fetched 9 rooms at a time.

Ideally, we would be able to just add a filter directly to `/publicRooms` in order to only grab the `world_readable` rooms and still get full pages but the filter option doesn't allow us to slice by `world_readable` history visibility.

Instead, we have to paginate until we get a full grid of 9 rooms, then make a final `/publicRooms` request to backtrack to the exact continuation point so the next page won't skip any rooms in between.
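
The paginate-then-backtrack idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: `fetchPublicRooms` is a hypothetical wrapper around `/publicRooms` that takes `{ limit, since }` and resolves to `{ chunk, next_batch }`, and each entry in `chunk` is assumed to carry the `world_readable` flag.

```javascript
// Sketch only: `fetchPublicRooms` is a hypothetical wrapper around the
// homeserver's `/publicRooms` endpoint resolving to `{ chunk, next_batch }`.
async function fetchWorldReadablePage(fetchPublicRooms, { pageSize = 9, since } = {}) {
  const rooms = [];
  let token = since;

  while (true) {
    const res = await fetchPublicRooms({ limit: pageSize, since: token });

    // Walk the chunk in order, counting how many directory entries we consume.
    let consumed = 0;
    for (const room of res.chunk) {
      consumed += 1;
      if (room.world_readable) rooms.push(room);
      if (rooms.length === pageSize) break;
    }

    const gridIsFull = rooms.length === pageSize;
    if (gridIsFull && consumed < res.chunk.length) {
      // We stopped partway through this chunk. Re-request from the same token
      // with `limit` set to the number of entries consumed, so the returned
      // `next_batch` points right after the last room we looked at.
      const backtrack = await fetchPublicRooms({ limit: consumed, since: token });
      return { rooms, nextToken: backtrack.next_batch };
    }

    if (gridIsFull || !res.next_batch || res.chunk.length === 0) {
      // Either the chunk ended exactly where the grid filled up,
      // or the directory has no more entries to offer.
      return { rooms, nextToken: res.next_batch };
    }

    token = res.next_batch;
  }
}
```

Passing the returned `nextToken` as `since` for the following page picks up at the first directory entry that was not yet examined, so no room is skipped or shown twice.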

---

We had empty spaces in the grid before because some rooms in the room directory are private, and we already filtered those out. But that was a much rarer experience since only 2% of rooms were private.