
I was told that I should post this here.

cross-posted from: https://lemmy.world/post/932750

Say you decide to self-host a Lemmy instance. When you create that instance, do you immediately need to download and store all the data that has ever been posted to all federated Lemmy instances? Or perhaps you only need to download and store everything that is posted to the federated Lemmy instances from that point forward? Or better yet, do you only store what the users on that instance do (i.e. their posts, and posts to the communities hosted on that instance)?

[-] hawkwind@lemmy.management 25 points 1 year ago* (last edited 1 year ago)

When you create that instance, do you immediately need to download and store all the data that has ever been posted to all federated Lemmy instances?

Run my own instance. @Candelestine@lemmy.world is right but there are more details. Federation is not a "sync." When your instance needs to fetch from another instance it will, but it does not get history. You can get a specific comment or post from any time however.

Or perhaps you only need to download and store everything that is posted to the federated Lemmy instances from that point forward?

This is not the default either. Only communities that your users subscribe to will be updated by their "origin" instances.

Or better yet, do you only store what the users on that instance do (i.e. their posts, and posts to the communities hosted on that instance)?

This does happen, but it also stores what your users do on remote instances as well as "copies" of what they interact with. Images (currently the only media hosted by lemmy servers) are linked to their "origin" as well. So you are storing text of posts and comments.

[-] captain_samuel_brady@lemm.ee 15 points 1 year ago

So let’s say I’m on lemm.ee and I decide that I want to see “All.” Does that mean I’m only seeing what other users on lemm.ee are subscribed to?

[-] hawkwind@lemmy.management 21 points 1 year ago

That is exactly what that means and it's frustrating to say the least, because it's not clear that's what's happening.

[-] captain_samuel_brady@lemm.ee 13 points 1 year ago

I’m not really sure how this is supposed to work long-term, then. I can’t imagine anyone wants to be on an instance with only a fraction of the content available. It makes perfect sense when subscribing, but surfing All loses its appeal. I understand the challenges, but I hope there’s a creative solution at some point. It seems like folks will gravitate to the instances with the most stability and users.

[-] hawkwind@lemmy.management 12 points 1 year ago

I think you're right. People will gravitate to the most stable large instances because their "All" will be as close to 100% as possible without doing anything special. I wrote a script to seed instances and update subscriptions, but it uses a single account that is subscribed to everything so that other users can see everything. That's not something that would normally happen. Maybe that needs to be part of the base software?

[-] briongloid@aussie.zone 5 points 1 year ago

Knowing that instances only pull posts/comments that occur after the first subscription, it will become less and less viable to choose a small instance if Lemmy doesn't add the option of adjustable pull settings.

[-] aaron@lemmy.jcaks.net 3 points 1 year ago

Have you got a link to that script? I want to seed my local private instance!

[-] russjr08@outpost.zeuslink.net 2 points 1 year ago

I don't suppose your script is published anywhere? My comment adjacent to yours mentions how something like Mastodon's Relay system would really help solve this issue, and it sounds like what you've made is probably the closest thing we'd have to a relay system for a while (given the core devs being super super busy with the existing issues).

On a side note, I wish there was also a way to set the homepage of an instance to "All" as well (which can be done user-side, but not globally), my instance only has a meta-community for announcements, so I can imagine that it just looks like an absolute ghost town to anyone who stops by.

[-] hawkwind@lemmy.management 4 points 1 year ago
[-] aaron@lemmy.jcaks.net 3 points 1 year ago

That's cool, thanks i'll check it out. I also found https://github.com/Fmstrat/lcs

[-] Snickers@on.syrma.cc 2 points 1 year ago

That‘s so helpful, thank you for sharing it with us! One question: If I want to update my known communities in a month or so, can I just rerun the script or will that cause issues?

[-] hawkwind@lemmy.management 2 points 1 year ago* (last edited 1 year ago)

You should be able to rerun it anytime. It only gets stuff that doesn’t exist on your instance. That’s how it was designed. It is dependent on browse.feddit.de however. :(
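The "only gets stuff that doesn't exist on your instance" behaviour is what makes a rerun safe. The script itself isn't shown in this thread, so the following is only a minimal sketch of that idea; the function name and the community names are hypothetical, and no real Lemmy or browse.feddit.de API is called.

```python
def communities_to_fetch(directory, already_known):
    """Return directory entries the local instance has not fetched yet, in order."""
    known = set(already_known)
    missing = []
    for name in directory:
        if name not in known:
            missing.append(name)
            known.add(name)  # also drops duplicates within the directory listing
    return missing


# Hypothetical data: a community directory listing and what the instance already has.
directory = ["!poodles@B", "!cats@C", "!dogs@D"]
local = ["!cats@C"]

first_run = communities_to_fetch(directory, local)
local.extend(first_run)  # pretend the fetches succeeded
second_run = communities_to_fetch(directory, local)

print(first_run)   # communities fetched on the first run
print(second_run)  # nothing left to do on a rerun
```

Because the second run finds nothing new, rerunning a month later would only pick up communities that appeared in the directory in the meantime.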

[-] Snickers@on.syrma.cc 1 points 1 year ago

Awesome! I hope lemmy will directly integrate something like this in the future, if there‘s an open github issue i could upvote let me know.

[-] russjr08@outpost.zeuslink.net 4 points 1 year ago

It'd be really cool if we had something similar to Mastodon's "Relays" where you basically subscribe to a firehose of posts from everyone who's on a server connected to that relay (they show up in the "Federated Timeline").

I don't know exactly how this would work for Lemmy, but it seems like if we had a system like this it could really help tackle this issue.

[-] Fizz@lemmy.nz 1 points 1 year ago

I think there is already some software that fetches content but it's early in development.

[-] TurnItOff_OnAgain@lemmy.world 4 points 1 year ago

Just so I am understanding the feeds...

Subscribed - just the stuff you are subscribed to

Local - just the stuff in your instance

All - the stuff you subscribe to, the stuff in your instance, and stuff that people in your instance follow from other instances

That correct?

[-] hawkwind@lemmy.management 1 points 1 year ago* (last edited 1 year ago)

Correct. All also includes communities fetched but not subscribed to, however these are more like stubs. They are in your database but not being updated with activity since no one is subscribed. At least that’s my understanding.
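One way to picture the three feeds is as filters over whatever is already in your instance's local database. This is a rough sketch under that assumption, not Lemmy's actual query logic; the `Post` type, the function names, and the sample data are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    title: str
    community: str  # "name@instance", e.g. "poodles@B"


def home_instance(community):
    """Return the instance part of a 'name@instance' community identifier."""
    return community.rsplit("@", 1)[1]


def subscribed_feed(stored, my_subscriptions):
    """Only posts from communities this particular user follows."""
    return [p for p in stored if p.community in my_subscriptions]


def local_feed(stored, local_instance):
    """Only posts from communities hosted on this instance."""
    return [p for p in stored if home_instance(p.community) == local_instance]


def all_feed(stored):
    """Everything the instance has stored, which is not everything that exists."""
    return list(stored)


# Hypothetical database of instance "A": it only holds what reached it.
stored = [
    Post("Meta announcement", "meta@A"),
    Post("Poodle pics", "poodles@B"),    # I follow this one
    Post("Router advice", "homelab@C"),  # another local user follows this one
]

print([p.title for p in subscribed_feed(stored, {"poodles@B"})])
print([p.title for p in local_feed(stored, "A")])
print([p.title for p in all_feed(stored)])
```

A post in a community that nobody on instance A has ever fetched or followed never enters `stored`, so it cannot show up in "All" either.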

[-] burtek@programming.dev 2 points 1 year ago

Would you say it makes sense to have accounts on the 2-3 instances that you're most interested in rather than 1 account and being dependent on federation?

[-] Spzi@lemm.ee 7 points 1 year ago

There's no general answer, it depends on your personal preference.

If you want to have most content available, register on an instance which has an according policy; which federates with anybody and is federated by everybody (both directions can make a difference).

The downside however is, this also opens the door to all sorts of bad actors, including bots and spam.

So I personally tried to strike a balance and am so far quite happy on lemm.ee.

This tool is pretty handy to make informed decisions: https://fba.ryona.agency/ It allows you to check federation status both ways.

[-] norgur@discuss.tchncs.de 3 points 1 year ago

Thanks for that link. Really interesting.

[-] Kalcifer@lemmy.world 3 points 1 year ago

When your instance needs to fetch from another instance it will

Meaning it will only fetch what is being actively looked at?

Only communities that your users subscribe to will be updated by their “origin” instances.

So when an external community is subscribed to from an account located on your local instance, from the point of subscribing forward, your local instance will begin downloading every single post that will ever be made to that subscribed community, regardless of who posted it?

Or better yet, do you only store what the users on that instance do (i.e. their posts, and posts to the communities hosted on that instance)?

This does happen, but it also stores what your users do on remote instances as well as “copies” of what they interact with. Images (currently the only media hosted by lemmy servers) are linked to their “origin” as well. So you are storing text of posts and comments.

This is the main point of confusion to me. From my current understanding, it feels as if it contradicts what you had previously said:

Only communities that your users subscribe to will be updated by their “origin” instances.

If it's already pulling in all posts and comments on that community, what use is specifically storing anything that the users do on that community? Would it not be already stored?

[-] hawkwind@lemmy.management 4 points 1 year ago

It works a lot like email between instances. Let’s call your self hosted instance “A” and the popular remote instance “B.”

User on A searches for “poodles” and finds a community !poodles@B. When they click the search results: A sends B mail saying “send me the last 10 posts for poodles.” B sends A mail with the posts and the user sees the posts, but none have comments.

If nothing else happens then those 10 posts will just hang out doing nothing on A, but if the user clicks subscribe then A sends another mail to B saying “my user wants to follow poodles.” B replies saying “cool, I’ll send you everything from poodles now.” Now, any time a post or comment happens, B checks its list of subscribing instances and sends them copies.

If user on A comments on !poodles@B or posts, it creates it on A but sends a mail to B saying “here is some new stuff for poodles!”
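The A and B exchange above can be played out with a toy in-memory model. This is only a sketch of the flow as described in this comment: the `Instance` class and its method names are invented for illustration, there is no real networking or ActivityPub here, and the limit of 10 simply mirrors the "last 10 posts" in the example.

```python
class Instance:
    def __init__(self, name):
        self.name = name
        self.store = {}      # community -> items held in this instance's database
        self.followers = {}  # community hosted here -> instances subscribed to it

    def host(self, community):
        self.store[community] = []
        self.followers[community] = []

    def accept(self, community, item, sender=None):
        """Origin side: store a new item, then push copies to every follower."""
        self.store[community].append(item)
        for follower in self.followers[community]:
            if follower is not sender:  # the sender already kept its own copy
                follower.store.setdefault(community, []).append(item)

    def fetch_recent(self, origin, community, limit=10):
        """One-off fetch (e.g. after a search): recent posts only, no updates."""
        self.store[community] = list(origin.store[community][-limit:])

    def follow(self, origin, community):
        """Subscribe: from now on the origin pushes new activity here."""
        if self not in origin.followers[community]:
            origin.followers[community].append(self)

    def post_remote(self, origin, community, item):
        """Local user posts to a remote community: keep a copy, notify the origin."""
        self.store.setdefault(community, []).append(item)
        origin.accept(community, item, sender=self)


a, b = Instance("A"), Instance("B")
b.host("poodles")
for i in range(12):
    b.accept("poodles", f"old post {i}")

a.fetch_recent(b, "poodles")               # search result: last 10 posts, no history
b.accept("poodles", "not yet following")   # A never hears about this one
a.follow(b, "poodles")
b.accept("poodles", "after follow")        # pushed to A
a.post_remote(b, "poodles", "hello from A")

print(len(a.store["poodles"]), len(b.store["poodles"]))
```

A ends up with fewer items than B: the two oldest posts and the one published between the fetch and the follow never reach it, which is the "federation is not a sync" point made earlier in the thread.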

[-] Kalcifer@lemmy.world 2 points 1 year ago* (last edited 1 year ago)

Thank you for the explanation!

Unfortunately, it seems, if I understand correctly, that this is not sustainable in the long term for small instances/servers. If Lemmy continues to grow in popularity, then the influx of content will continue to increase, thereby pushing small servers out of participation due to lack of resources. The data storage requirements, I fear, will become a very limiting issue.

I feel that if servers only tracked what their users directly participated in (i.e. only save comments, and posts directly made by the user), this issue would not be as problematic.

For example, I would like to host my own instance with only my account on it. I was initially hoping that my data storage requirements would only be directly proportional to how much I, as a user, use Lemmy; the server would only need to store my personally created data, and nothing else. Unfortunately, however, it appears that I would also have to have enough resources to sustain everyone else's posts, which is a far steeper requirement.

[-] palitu@lemmy.perthchat.org 3 points 1 year ago

Well, it really comes down to how many subscriptions there are.

A small instance may only sub to 100 communities, so it is not too bad.

But on the flip side, it means that the big instance needs to send everything to a huge number of small instances.

In practice I do not think it will be too bad, there will be a set of medium sized instances that most will be attracted to, and they will have the p80 of communities subbed. Smaller ones will be for more technical people who will not worry that they need to ensure the content is subbed to, as they will understand how it works.

I think over time, services that aggregate community details will spring up, and be incorporated into the lemmy search, so it is easier to find things across the entire fediverse, not just your instance. I think there will be a large set of muggle-type user improvements over the next couple of months.

[-] hawkwind@lemmy.management 3 points 1 year ago

Media takes up space. The text from posts and comments is trivial. The database for lemmy.world is only 25 GB. Wikipedia text is only 21 GB.

[-] Max_P@lemmy.max-p.me 2 points 1 year ago

It's not quite as bad, because you're still being pushed what you subscribe to. So while you do indeed get a fair bit of content you might never see, it's necessary for you to be able to browse those communities and even being able to compute what threads are active/trending/hot/updated or whatever else filter you use. Because that's all computed locally on your instance.

It's also an efficiency advantage: if your instance has a lot of users, having everything locally means that you offer a much smoother experience, and also you're contributing to the remote instance not being so busy with traffic as you're not just proxying everything to it and increasing the remote's load.

For your storage concerns, there's nothing preventing you from purging content older than a week or two regularly via a cronjob.
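As a rough illustration of what such an age-based purge could look like, here is a sketch against a made-up table. Everything in it is an assumption: Lemmy uses Postgres and its real schema is different, so the `cached_post` table, its columns, and the `purge_older_than` function are hypothetical, and SQLite is used only so the example runs on its own. A real purge would also need to account for related rows such as comments and votes.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def purge_older_than(conn, days, now=None):
    """Delete rows older than `days` days and return how many were removed."""
    now = now or datetime.now(timezone.utc)
    cutoff = int((now - timedelta(days=days)).timestamp())
    cur = conn.execute("DELETE FROM cached_post WHERE published_ts < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


# Hypothetical table standing in for federated content cached on the instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cached_post (id INTEGER PRIMARY KEY, title TEXT, published_ts INTEGER)"
)

now = datetime(2023, 7, 16, tzinfo=timezone.utc)
rows = [("fresh", now - timedelta(days=2)), ("stale", now - timedelta(days=30))]
conn.executemany(
    "INSERT INTO cached_post (title, published_ts) VALUES (?, ?)",
    [(title, int(when.timestamp())) for title, when in rows],
)

deleted = purge_older_than(conn, days=14, now=now)
remaining = [row[0] for row in conn.execute("SELECT title FROM cached_post")]
print(deleted, remaining)
```

A cron entry would then just run a script like this on a schedule; the two-week cutoff is the "week or two" mentioned above, not a recommended value.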

It's not that bad so far:

8,0K    volumes/lemmy-ui
887M    volumes/pictrs
646M    volumes/postgres
1,5G    total
[-] lodion@aussie.zone 0 points 1 year ago

Your instance must be very new, very few users, very inactive... or all of the above. I stood up aussie.zone just under a month ago, Postgres DB is currently 9.6GB.

this post was submitted on 02 Jul 2023
54 points (93.5% liked)

Selfhosted
