26
37
submitted 3 months ago by Baku@aussie.zone to c/datahoarder@lemmy.ml

While clicking through some random Lemmy instances, I found one that's due to be shut down in about a week — https://dmv.social. I'm trying to archive what I can onto the Wayback Machine, but I'm not sure what the most efficient way to go about it is.

At the moment, what I've been doing is going through each community and archiving each sort type (except the ones under a month, since the instance was locked a month ago) with capture outlinks enabled. But is there a more efficient way to do it? I know of the Internet Archive's save-from-spreadsheet tool, which would probably work well, but I don't know how I'd go about crawling all the links into a sitemap or CSV or something similar. I don't have the know-how to set up a web crawler/spider.
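
For reference, this is roughly the kind of enumeration I have in mind, though I haven't tested it (it assumes dmv.social still serves the standard Lemmy API and that curl and jq are available):

# Untested sketch: page through the (assumed still reachable) standard Lemmy API
# and collect post URLs, one per line, ready to paste into a spreadsheet.
# Increase the page range if the instance has more than ~2500 local posts.
for page in $(seq 1 50); do
  curl -s "https://dmv.social/api/v3/post/list?type_=Local&sort=Old&limit=50&page=$page" \
    | jq -r '.posts[].post.ap_id'
  sleep 1   # be gentle with a server that's already on its way out
done > dmv_social_posts.txt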

Any suggestions?

27
12

It seems the SSD sometimes heats up and the content disappears from the device, mostly on my router, sometimes on my laptop.
Do you know what I should configure to put the drive to sleep, or something similar, to reduce the heat?

I'm starting my datahoarder journey now that I've replaced my internal NVMe SSD.

It's just a 500 GB drive that I attached to my D-Link router running OpenWrt. I configured it with Samba, and everything worked fine when I finished the setup. I just have some media files on it, which I read through Jellyfin.

After a few days the content disappears. It's not a connection problem with the shared drive, since when I SSH into the router the files aren't shown there either.
I need to physically remove the drive and connect it again.
When I do this I notice it's somewhat hot. Not scalding, just hot.

I also tried connecting it directly to my laptop running Ubuntu. There the drive sometimes stays cool and the data shows up without issue for days.
But sometimes it also heats up and the data disappears (even when the data wasn't being used, i.e. I hadn't configured Jellyfin to read from the drive).

I'm not sure how to let the SSD sleep for periods of time, or throttle it so it can cool off.
Any suggestions?
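
In case it's useful, this is the kind of thing I've been meaning to try (a rough, untested sketch; the device path and package name are assumptions, and many USB-SATA bridges ignore these commands entirely):

# Rough, untested sketch for OpenWrt: ask the drive to drop into a low-power state when idle.
opkg update && opkg install hdparm

hdparm -B 127 /dev/sda    # enable APM at a level that allows low-power states
hdparm -S 120 /dev/sda    # standby after 10 minutes of idle (120 * 5 seconds)
hdparm -y /dev/sda        # or force standby right away to test the effect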

28
28

What in your hoard do you treasure the most? I imagine for a lot of us it's photos and videos of our families, which I'd love to hear about, but I'm also interested in rare bits of media or information that make your collection unique.

29
13
submitted 4 months ago* (last edited 4 months ago) by mulcahey@lemmy.world to c/datahoarder@lemmy.ml

Last year Elon Musk accidentally revealed that he has a burner account on Twitter called @ErmnMusk.

Now that account is gone.

I'm looking for an archive: its tweets, its likes, anything and everything. Does anyone know where to find one?

30
37
submitted 4 months ago by otacon239@feddit.de to c/datahoarder@lemmy.ml

So I’ve been consolidating all of my storage and removing all the duplicates and junk files.

In actual physical storage, this was spread across 12TB worth of hard drives, all partially full.

After everything was said and done, I’m using 1.3TB of space if you don’t include games. ¯\_(ツ)_/¯

This is stuff dating back to 2015. Sometimes it’s actually worth it to just clean up your junk files.
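
For anyone wanting to do something similar, most of the grunt work can be handled by an off-the-shelf duplicate finder. A rough sketch with fdupes (illustrative only, made-up paths, not necessarily what I used; review before deleting anything):

# List duplicate files across the old drives, review, then delete the extras.
fdupes -r /mnt/old1 /mnt/old2 /mnt/old3 > duplicates.txt   # dry run: just list the duplicate sets
fdupes -rdN /mnt/old1 /mnt/old2 /mnt/old3                  # keeps the first copy in each set, deletes the rest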

31
29
submitted 4 months ago by ylai@lemmy.ml to c/datahoarder@lemmy.ml
32
-4
submitted 4 months ago by ylai@lemmy.ml to c/datahoarder@lemmy.ml
33
19
submitted 4 months ago* (last edited 4 months ago) by CorrodedCranium@leminal.space to c/datahoarder@lemmy.ml

I imagine a lot of people who are into data hoarding already know a lot of this, but I thought the video was pretty neat. It briefly covers the history of different compression formats and gives a short explanation of why you might want to use one over another.

I'd recommend checking it out if you want 15 minutes of background noise.


For anyone new to data compression, Techquickie and CrashCourse have videos on it. If you really want to go down the rabbit hole, you could check out media compression and see how things like JPEGs and PNGs work.

34
18
submitted 4 months ago by ylai@lemmy.ml to c/datahoarder@lemmy.ml
35
36
submitted 5 months ago by Wilshire@lemmy.world to c/datahoarder@lemmy.ml
36
7

cross-posted from: https://sh.itjust.works/post/14280067

What is the best tool to get URLs for all tweets within a given date range?

The ideal behaviour I'm looking for would be something like this:

Input: https://twitter.com/SpaceX 2023-09-01 2024-02-08

Output:

  • https://twitter.com/SpaceX/status/1755763378449183003#m
  • https://twitter.com/SpaceX/status/1755759459765567825#m
  • https://twitter.com/SpaceX/status/1755752291578302545#m
  • ...

What would be the best tool to achieve this? Thanks in advance!
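
One candidate I've come across is snscrape, which, assuming it still works after the Twitter/X lockdowns, can apparently produce exactly this kind of list:

# Sketch with snscrape; it may no longer work against current Twitter/X.
# By default it prints one status URL per line for the matching tweets.
pip install snscrape
snscrape twitter-search "from:SpaceX since:2023-09-01 until:2024-02-08" > spacex_urls.txt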

37
23
submitted 5 months ago* (last edited 5 months ago) by aleq@lemmy.world to c/datahoarder@lemmy.ml

Not sure if this is a better fit for datahoarder or some selfhosting community, but I'm putting my money on this one.

The problem

I currently have a cute little server with two drives connected to it, running a few different services (mostly media serving and torrents). The key facts here are that (1) it's cute and little, and (2) it's handling pretty bulky data. Cute and little doesn't go very well with big RAID setups and such, and apart from upgrading one of the drives, I'm probably at my limit in terms of how much storage I can physically fit in the machine. Also, if I want to reinstall it or something, that's very difficult to do without downtime, since I'd have to move the drives and services off to a different machine (not a huge problem since I'm the only one using it, but I don't like it).

Solution

A distributed FS would definitely solve the issue of physically fitting more drives into the chassis, since I could basically just connect drives to a Raspberry Pi and have that Pi join the distributed FS. Great.

I think it could also solve the issue of potential downtime when I reinstall or do maintenance, since I can have multiple services read off the same distributed FS and reroute my reverse proxy to the new services while the old ones are taken offline. There will potentially be a disruption, but no downtime.

Candidates

I know there are many different solutions for distributed filesystems, such as Ceph, MooseFS, GlusterFS and MinIO. I'm kinda leaning towards Ceph because of its integration with Proxmox, but it also seems like the most complicated solution of the bunch. Is it worth it? What are your experiences with these, and given the above description of my use case, which do you think would be the best fit?

Since I already have a lot of data it's a bonus if it's easy to migrate from my current filesystem somehow.

My current setup uses a lot of hard links as well, so it's a big bonus if the solution has something similar (i.e. some easy way of storing the same data in multiple places without duplicating it).
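
For context, this is the hard-link pattern my current setup relies on and that I'd want the new filesystem to support in some form (made-up paths, just for illustration):

# The torrent client keeps seeding from downloads/ while the media library sees a tidy path,
# with the data stored only once on disk.
ln /tank/downloads/Some.Movie.2023.mkv /tank/library/movies/Some.Movie.2023.mkv
stat -c '%h %n' /tank/library/movies/Some.Movie.2023.mkv   # link count 2, still one copy on disk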

38
12

cross-posted from: https://lemmy.dbzer0.com/post/13532369

DDoSecrets, responsible for hosting leaks such as EpikFail and BlueLeaks, will stop its activities. I would like help from anyone who has space left so we can download everything and keep seeding.

Torrent download links: https://data.ddosecrets.com/

39
8

I'm going to archive some YouTube videos. What's the proper way to convert them from MP4 to WebM and so on, or vice versa?

In the past, when I couldn't play a video file for whatever reason, I would just rename the file, but I'm assuming there are better ways to do it. And is there a specific order I have to go in? (e.g. with audio, going from .mp3 to .flac doesn't make sense.)
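
If it helps, this is roughly what I've gathered so far, assuming ffmpeg is the right tool (please correct me if it isn't):

# Re-encoding an MP4 (H.264/AAC) into a WebM (VP9/Opus) transcodes the streams, so it's lossy:
ffmpeg -i input.mp4 -c:v libvpx-vp9 -c:a libopus output.webm
# If only the container needs to change and it supports the existing codecs,
# the streams can be copied without re-encoding (fast and lossless):
ffmpeg -i input.mp4 -c copy output.mkv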

Thanks in advance.

40
20
41
6
submitted 5 months ago by kionite231@lemmy.ca to c/datahoarder@lemmy.ml

I have scraped a lot of links from Instagram and Threads using Selenium in Python. It was a good learning experience. I will be running that script for a few more days and will see how many more media links I can scrape from Instagram and Threads.

However, the problem is that the media isn't tagged, so we don't know what type of media each link points to. I wonder if there is an AI or something that can categorize these random media links into an organized list.

If you want to download all the media from the links, you can run the following commands:

# This command downloads the file with all the links
wget -O links.txt https://gist.githubusercontent.com/Ghodawalaaman/f331d95550f64afac67a6b2a68903bf7/raw/7cc4cc57cdf5ab8aef6471c9407585315ca9d628/gistfile1.txt
# This command actually downloads the media from the links file fetched above
wget -i links.txt

I was thinking about how to store all of this. There are two ways: the first is to just keep the links.txt file and download the content when needed; the second is to download the content from the links now and save it to a hard drive. The second method will consume more space, so the first one seems better, imo.

I hope it was something you like :)

42
5
43
20
submitted 6 months ago* (last edited 6 months ago) by otp@sh.itjust.works to c/datahoarder@lemmy.ml

I've got a fairly new 14 TB Seagate Expansion. It works fine, and I've been using it for a month and a bit.

I don't know how long it's been doing this, but the power supply is making a very faint alarm sound. The power supply is plugged into a Belkin surge protector (powered on, with the "protected" status light lit), which is plugged into an outlet. The HDD is currently not plugged into a computer.

It's not a beep or electrical whine. It's a distinct weewooweewoo. I couldn't even determine the source until I pressed my ear against it.

Googling just points me towards the typical "my HDD is making a sound, how long do I have until it dies" posts, but nothing about an alarm sound coming from the power supply itself.

I'll check again if it makes the alarm in other conditions, but in the meanwhile, I was hoping someone here might know something.

Thanks in advance!

EDIT: The sound only happens when...

  • Power adapter is plugged into the HDD, AND the outlet
  • HDD is NOT plugged into the computer.

Plugging it into the computer stops the noise from the power adapter.

44
38

It seems like 6 or 7 years ago there was research into new forms of storage, using crystals or DNA, that promised ultra-high-density storage. I know the read/write speeds were not very fast, but I thought by now there would be more progress in the area. Apparently in 2021 there was a team that got a 16 GB file stored in DNA. In the last month there's a company (Biomemory) that lets you store 1 KB of data in DNA for $1,000, but if you want to read it, you have to send it back to them. I don't understand why you would use that today.

I wonder if it will ever be viable for us to have DNA readers/writers... but I also wonder if there are other new types of data storage coming up that might be just as good.

If you know anything about the DNA research or other new storage forms, what do you think is the most promising one?

45
15
submitted 6 months ago* (last edited 6 months ago) by Deckweiss@lemmy.world to c/datahoarder@lemmy.ml

Sorry for not doing much research beforehand and asking a newbie question. I am looking for some entry-point info on the question:

How would one go about datahoarding lemmy?

It seems to be a grade above what I've been doing so far (downloading video/audio from streaming platforms and backing up web articles and blog posts as PDFs) due to Lemmy's distributed nature and the ActivityPub protocol.


Relevant stuff that I've found so far but haven't studied extensively:

  1. This does not seem to store most of the data https://github.com/tgxn/lemmy-explorer
46
12
submitted 6 months ago* (last edited 6 months ago) by HiddenLayer5@lemmy.ml to c/datahoarder@lemmy.ml

So I have a nearly full 4 TB hard drive in my server that I want to make an offline backup of. However, the only spare hard drives I have are a few 500 GB and 1 TB ones, so the entire contents will not fit all at once, but I do have enough total space for it. I also only have one USB hard drive dock so I can only plug in one hard drive at a time, and in any case I don't want to do any sort of RAID 0 or striping because the hard drives are old and I don't want a single one of them failing to make the entire backup unrecoverable.

I could just play digital Tetris and manually copy individual directories to each smaller drive until they fill up, while mentally keeping track of which directories still need to be copied when I change drives, but I'm hoping for a more automatic and less error-prone way. Ideally, I'd want something that can automatically begin copying the entire contents of a given drive or directory to a drive that isn't big enough to fit everything, automatically stop at the last file that will fit in its entirety (I don't want to split files between drives), and then wait for me to unplug the first drive, plug in another drive, and specify a new mount point before continuing to copy the remaining files, using as many drives as necessary to copy everything.

Does anyone know of something that can accomplish all of this on a Linux system?
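
To make the question more concrete, here's an untested sketch of the behaviour I'm after (hypothetical paths; a proper tool would presumably be far more robust):

# Copies files from SRC to DEST until DEST runs out of room, recording what was
# copied so the next run, with a different drive mounted at DEST, picks up the rest.
SRC=/mnt/full4tb
DEST=/mnt/backup_drive
DONE_LIST=$HOME/copied_files.txt      # grows across runs, one path per line
touch "$DONE_LIST"

find "$SRC" -type f | sort | while read -r f; do
    grep -Fxq "$f" "$DONE_LIST" && continue                          # already on an earlier drive
    size=$(stat -c %s "$f")
    avail=$(df --output=avail -B1 "$DEST" | tail -n1 | tr -d ' ')
    [ "$size" -ge "$avail" ] && continue                             # doesn't fit here, try the next file
    mkdir -p "$DEST/$(dirname "${f#$SRC/}")"
    cp -a "$f" "$DEST/${f#$SRC/}" && echo "$f" >> "$DONE_LIST"
done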

47
11
submitted 6 months ago* (last edited 6 months ago) by yo_scottie_oh@lemmy.ml to c/datahoarder@lemmy.ml

Hello c/datahoarder! I need your help. Not sure whether this has been asked before—I've tried searching the web, but the only advice I can find is how to download episodes for podcasts whose feeds are still active.

The problem I'm trying to solve is that one of my favorite podcasts, Endless Boundaries Jam Radio, went offline during the pandemic. All the usual feed aggregators still show up in internet searches, but since they are just feed aggregators, not file hosts, all the episodes are now dead links (e.g. on Podbay, TuneIn, etc.).

Thing is, I had already downloaded several episodes using the Playapod app on my iPhone. It's usable for now, but I'm very concerned about when I need to upgrade to a new phone.

Is there a trick for accessing the individual files on my iPhone that were downloaded through a third-party app such as Playapod? TIA

EDIT: I figured out how to do what I wanted. Once I had installed ifuse and related dependencies (e.g. libimobiledevice) on my Linux PC, I could connect my iPhone to my PC via USB and browse the files on my iPhone in my distro's default file browser. Many folders are named as GUIDs, making it harder to tell what's what by just looking at their names, but I narrowed down the right folder by opening up the Disk Usage Analyzer app in Linux. In my case, the Playapod app is one of very few apps with more than a gigabyte of data. I still have to go through and figure out which episode each mp3 file is, but that's still better than having nothing at all.
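
For anyone who wants to try the same thing, the rough sequence looks something like this (package names assume a Debian/Ubuntu-style distro, and I actually browsed through the desktop file manager rather than mounting by hand):

# Rough sketch; exact steps may differ on your distro and iOS version.
sudo apt install ifuse libimobiledevice-utils
idevicepair pair                           # confirm the "trust this computer" prompt on the phone
mkdir -p ~/iphone
ifuse ~/iphone                             # mounts the phone's media partition
# ifuse --documents com.example.playapod ~/iphone   # hypothetical bundle ID: mounts a single app's documents instead
du -sh ~/iphone/* | sort -h                # the app folders holding gigabytes of data stand out
fusermount -u ~/iphone                     # unmount when finished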

Thanks to everyone who responded. I hope this info helps anyone else in a similar predicament!

48
22
What to do with extra HDDs (discuss.tchncs.de)

Hey guys, I'm setting up my NAS (OpenMediaVault) and very much enjoying it! It now runs my Nextcloud and a couple of other services. I've got a mirrored ZFS setup of two 8 TB drives.

I've got another two 8 TB drives and am debating whether I should add them as an extra mirror vdev or create a new pool for an extra backup. I'm not sure that extra backup is necessary though, since I already have a daily cloud backup. My drives are only 14% used, so I'm not even sure I should put the new ones in the pool yet. What do you guys think?
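
For reference, the two options as I understand them would look roughly like this (pool and device names are placeholders, and I may well have the replication part wrong):

# Option 1: grow the existing pool by adding the new pair as a second mirror vdev
zpool add tank mirror /dev/sdc /dev/sdd

# Option 2: keep them separate as a backup pool and replicate snapshots into it
zpool create backup mirror /dev/sdc /dev/sdd
zfs snapshot -r tank@weekly
zfs send -R tank@weekly | zfs receive -F backup/tank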

49
27
submitted 7 months ago* (last edited 5 months ago) by Lemmchen@feddit.de to c/datahoarder@lemmy.ml
50
4
submitted 7 months ago* (last edited 7 months ago) by Nogami@lemmy.world to c/datahoarder@lemmy.ml

Just wondering if anyone knows which SAS connectors on the SAS826A backplane control which ports?

On my current setup only ports 8-11 are working, so I've got some troubleshooting ahead of me.

The online manuals show the connectors but unhelpfully don't indicate which ports are being used for each.

Also, does anyone know what the ribbon cable beside the SAS wires is used for on Supermicro cables? I don't recall seeing it on other SAS cables.


datahoarder

6272 readers

Who are we?

We are digital librarians. Among us are represented the various reasons to keep data -- legal requirements, competitive requirements, uncertainty of permanence of cloud services, distaste for transmitting your data externally (e.g. government or corporate espionage), cultural and familial archivists, internet collapse preppers, and people who do it themselves so they're sure it's done right. Everyone has their reasons for curating the data they have decided to keep (either forever or For A Damn Long Time). Along the way we have sought out like-minded individuals to exchange strategies, war stories, and cautionary tales of failures.

We are one. We are legion. And we're trying really hard not to forget.

-- 5-4-3-2-1-bang from this thread

founded 4 years ago