This is a good idea that I hadn't fully considered. There are a few issues that ...

vidarh · on March 6, 2023

There's nothing that requires the post URLs to even be on the same hostname as the user address. Mine are not. So serving the content from a different location for a new install is always possible. My Mastodon/Fediverse address is @vidar@galaxybound.com, while my posts are all hosted on m.galaxybound.com, e.g. [1]. In this case this is a result of installing Mastodon on the latter, while the former just redirects the webfinger, but AFAIK there's also nothing preventing e.g. the address from being @user@domain1, the profile being on domain2, and the canonical URI of the posts being on domain3.

In terms of being wasteful, that might be a consideration for large shared hosts, but in reality media storage will outpace it by a large factor for most users. E.g. when I moved my account from mastodon.social, the export had 9.6MB of media I'd posted, and the uncompressed JSON for all my posts took up just 800K. That's despite being someone who posts relatively few pictures and very rarely any video.

> Additionally, web scraping isn't perfect. It assumes static content, which may not be possible with ACLs or other behaviour. It also assumes that there's a single link entry point to the content, or that all the necessary link entry points are known.

You don't need to scrape, as Mastodon supports export (but that will not be true for every type of fediverse software), but you can scrape as for federation to work your endpoints need to be accessible to all but instances that are defederated. You can also get at every post trivially: look up the user in webfinger to find their actor, read the outbox URL from the actor, retrieve the outbox, and retrieve the link to every post. This works for every piece of fediverse software that cares about ensuring federation works.
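A rough sketch of that webfinger-to-outbox walk, in Python. Treat it as an illustration under assumptions rather than a tested crawler: the function names are made up for this example, it assumes the webfinger response carries a `self` link of type `application/activity+json`, that the actor document has an `outbox` property, and that the outbox is an OrderedCollection paged via `first`/`next`. The `fetch` parameter is only there so the walk can be tried against canned data without touching the network.

```python
import json
import urllib.parse
import urllib.request

ACTIVITY_JSON = "application/activity+json"


def http_get_json(url):
    """Fetch a URL and parse the body as JSON (assumes the server honours Accept)."""
    headers = {"Accept": f"{ACTIVITY_JSON}, application/jrd+json;q=0.9"}
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def webfinger_url(address):
    """Build the webfinger lookup URL for an address like @user@domain."""
    user, domain = address.lstrip("@").split("@")
    resource = urllib.parse.quote(f"acct:{user}@{domain}", safe=":@")
    return f"https://{domain}/.well-known/webfinger?resource={resource}"


def actor_url(jrd):
    """Pick the ActivityPub actor link out of a webfinger (JRD) response."""
    for link in jrd.get("links", []):
        if link.get("rel") == "self" and link.get("type") == ACTIVITY_JSON:
            return link["href"]
    raise LookupError("no ActivityPub actor link in webfinger response")


def outbox_post_ids(address, fetch=http_get_json):
    """Yield the ID of every object referenced from the user's outbox."""
    actor = fetch(actor_url(fetch(webfinger_url(address))))
    outbox = fetch(actor["outbox"])
    # Paged outboxes point at a first page; unpaged ones hold the items directly.
    page = outbox.get("first", outbox)
    while page:
        if isinstance(page, str):
            page = fetch(page)
        for item in page.get("orderedItems", []):
            # Items are usually activities wrapping an object (embedded or by URL).
            obj = item.get("object", item) if isinstance(item, dict) else item
            yield obj["id"] if isinstance(obj, dict) else obj
        page = page.get("next")
```

Calling `list(outbox_post_ids("@user@domain"))` against a live server would issue one request per outbox page, so a real crawler would want rate limiting and error handling on top.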

[1] https://m.galaxybound.com/@vidar/109965336245254807

danpalmer · on March 6, 2023

It's true there's a disconnection between the account domain and the hosting domain, I already use this, but the Activity Pub object IDs are on the _hosting domain_, and if you want to move that you do need the IDs to continue to resolve in order to not (partially) defederate.

As far as I am aware there's no other good way to move that content between software on the same domain. Mastodon will import/export, but that data still assumes it'll be backed by Mastodon code with the Mastodon database schema, which is not correct if you're migrating to or from Mastodon.

> but you can scrape as for federation to work your endpoints need to be accessible to all but instances that are defederated

...with Activity Pub. Mostly.

This is not true for all linked data, and my post was mostly looking at the general case of linked data rather than sticking to Mastodon or Activity Pub in particular. Activity Pub can be authenticated in some cases, although I think that may be something Mastodon has implemented on top. Not sure.

vidarh · on March 6, 2023

> It's true there's a disconnection between the account domain and the hosting domain, I already use this, but the Activity Pub object IDs are on the _hosting domain_, and if you want to move that you do need the IDs to continue to resolve in order to not (partially) defederate.

Yes, that is true. There's nothing you can do about that without breaking basic web expectations of URLs staying the same. The new endpoints can serve up the old content or Announce references to them, but the old URLs do need to continue resolving and at a minimum serve up a redirect if you want maximum availability.

It would be a nice improvement to have a URL scheme that allowed referencing posts relative to a webfinger lookup to reduce the impact of that.
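To make the "at a minimum serve up a redirect" fallback concrete, here is a minimal sketch using only the Python standard library. The `MOVED` table, the example paths and the handler name are hypothetical, and whether a given remote instance actually follows a redirect when dereferencing an object ID is up to that software, so this only covers the serving side.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical table of old object paths and where each object lives now.
MOVED = {
    "/users/alice/statuses/1": "https://new.example.test/objects/1",
}


def redirect_target(path, moved=MOVED):
    """Return the new URL for an old path, ignoring any query string."""
    return moved.get(path.split("?", 1)[0])


class LegacyHandler(BaseHTTPRequestHandler):
    """Answers requests for old object URLs with a permanent redirect."""

    def do_GET(self):
        target = redirect_target(self.path)
        if target is None:
            self.send_error(404)
            return
        self.send_response(301)
        self.send_header("Location", target)
        self.end_headers()
```

To try it locally, something like `http.server.HTTPServer(("127.0.0.1", 8080), LegacyHandler).serve_forever()` would serve the redirects; in practice the same mapping would more likely live in the web server or reverse proxy config.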

> As far as I am aware there's no other good way to move that content between software on the same domain. Mastodon will import/export, but that data still assumes it'll be backed by Mastodon code with the Mastodon database schema, which is not correct if you're migrating to or from Mastodon.

The export is an archive of JSON representing ActivityPub activities and objects. There's very little in the export that is Mastodon-specific in any way. I have lots of complaints about Mastodon, but the export format is not one of them. If you write ActivityPub software, just support that format.
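As a sketch of what "just support that format" might involve on the import side: the snippet below assumes the export is a tar archive containing an `outbox.json` file holding an OrderedCollection of activities. The file name, the archive layout and the function names are assumptions for this example, so check them against a real export before relying on them.

```python
import json
import posixpath
import tarfile


def export_activities(archive_path, member="outbox.json"):
    """Return the activities from the outbox file inside an export archive."""
    with tarfile.open(archive_path, "r:*") as tar:
        for info in tar:
            if info.isfile() and posixpath.basename(info.name) == member:
                with tar.extractfile(info) as fh:
                    return json.load(fh).get("orderedItems", [])
    raise FileNotFoundError(f"{member} not found in {archive_path}")


def created_objects(activities):
    """Keep only the objects embedded in Create activities (the user's own posts)."""
    return [
        activity["object"]
        for activity in activities
        if activity.get("type") == "Create" and isinstance(activity.get("object"), dict)
    ]
```

Boosts show up as activities that reference someone else's object by URL, which is why `created_objects` skips anything whose `object` is not an embedded document.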

> Activity Pub can be authenticated in some cases, although I think that may be something Mastodon has implemented on top. Not sure.

None of the retrieval endpoints for the core activity data can be authenticated in Mastodon if you want federation to keep working properly, and much more data is retrievable without auth unless the user has marked specific subsets of data private. But if you're crawling your own account, with the right Accept: header, you'll get pretty much the same content in mostly the same (ActivityPub JSON) format as you get from an export.

mariusor · on March 6, 2023

In my opinion you already commit a big fallacy by implying that objects need to be related to database access.

Once an object has been created on an instance - with very few exceptions - there's no need to tamper with it. Storing it as JSON on disk is perfectly acceptable.
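A minimal sketch of what that could look like, assuming objects are keyed by their `id` and never rewritten after creation. The directory layout (a SHA-256 of the ID, fanned out by its first two hex characters) and the function names are choices made for this example, not something any particular server does.

```python
import hashlib
import json
from pathlib import Path


def object_path(root, object_id):
    """Map an object ID (a URL) to a stable file path under root."""
    digest = hashlib.sha256(object_id.encode("utf-8")).hexdigest()
    return Path(root) / digest[:2] / f"{digest}.json"


def store_object(root, obj):
    """Write the object to disk as JSON and return the path used."""
    path = object_path(root, obj["id"])
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(obj, ensure_ascii=False), encoding="utf-8")
    return path


def load_object(root, object_id):
    """Read a previously stored object back from disk."""
    return json.loads(object_path(root, object_id).read_text(encoding="utf-8"))
```

Hashing the ID avoids having to sanitise URLs into file names, at the cost of the files no longer being browsable by hand.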

danpalmer · on March 6, 2023

JSON on disk is a perfectly acceptable database format ;)

More seriously, I used "database" as a proxy for looking something up in storage of some sort. Whether that's JSON files on disk, relational data in a SQL database, or anything in between is just down to the access, consistency, durability, etc. requirements of the system.