I was being cut off; I manage it with chunking techniques. Unfortunately they took down the file, so now I have no source to pull from.
Arthas
I was, and that is why it was taking so long for me to download: I use my custom downloader, which uses various techniques to chunk the download. Unfortunately, it seems like they've now removed the file completely, so my downloader has no source to pull from and is stopped at 36 GB.
Some bad news: it looks like the dataset 9 zip file link doesn't work anymore. They appear to have removed the file, so my download stopped at 36 GB. I'm not familiar with their site, so is it normal for them to remove files and maybe put them back at the same link once they've reorganized them? Or will we have to scrape each PDF, like another user has been doing?
Yeah, still chugging away slowly. It may take me a few days, actually; it's quite slow, but so far it appears to be getting it.
I have various chunking techniques that I use. I adaptively modify the request size of the chunks, as I've noticed that at times the CDN will give large amounts, then micro amounts. I haven't figured out the exact backoff rate, but I have retry mechanisms in place. The CDN is very annoying, but so far my methods are working, just slowly.
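For anyone wanting to try something similar, here is a rough sketch of adaptive chunking with retries. It is not the actual downloader, which isn't shown in this thread. The names (`adapt_chunk_size`, `download`, `make_http_fetch`), the doubling/halving rule, the size limits, and the exponential backoff delays are all assumptions chosen for illustration. It also assumes the server honours HTTP `Range` requests.

```python
import http.client
import time
import urllib.request

MIN_CHUNK = 64 * 1024          # assumed floor for a single request
MAX_CHUNK = 32 * 1024 * 1024   # assumed ceiling for a single request
TRANSIENT = (OSError, http.client.HTTPException)  # errors worth retrying


def adapt_chunk_size(size, got_full_chunk):
    """Double the request after a full chunk, halve it after a short or failed one."""
    if got_full_chunk:
        return min(size * 2, MAX_CHUNK)
    return max(size // 2, MIN_CHUNK)


def make_http_fetch(url, timeout=30):
    """Build a fetch(start, end) callable that issues one HTTP Range request."""
    def fetch(start, end):
        req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read()
    return fetch


def download(fetch, total, out, start_size=1024 * 1024,
             max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Pull bytes [0, total) through fetch(start, end) and write them to out."""
    offset, size, failures = 0, start_size, 0
    while offset < total:
        end = min(offset + size, total) - 1   # inclusive end, as in HTTP Range
        requested = end - offset + 1
        try:
            data = fetch(offset, end)[:requested]  # ignore any surplus bytes
        except TRANSIENT:
            data = b""
        if data:
            out.write(data)
            offset += len(data)               # a short read still makes progress
            failures = 0
            size = adapt_chunk_size(size, len(data) == requested)
        else:
            failures += 1
            if failures > max_retries:
                raise RuntimeError(f"giving up at byte offset {offset}")
            sleep(base_delay * 2 ** (failures - 1))  # exponential backoff
            size = adapt_chunk_size(size, False)
```

With a real URL this would be called as `download(make_http_fetch(url), total_size, open("out.zip", "wb"))`, where `total_size` comes from the server's reported content length. Taking `fetch` as a parameter keeps the chunking logic independent of the HTTP client, so it can be exercised without touching the network.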
OK, great. As for comparing files, I would likely do a hash check. It shouldn't be difficult to identify truly unique files that way. It'll take a few days for a decent computer to generate all the hashes, but it should be pretty automated. I'll reach out once I have it completed.
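A hash check like the one mentioned above could look roughly like this. It is a minimal sketch: the choice of SHA-256, the function names (`file_sha256`, `group_by_hash`), and the 1 MiB read size are assumptions, not something settled in this thread. Files are read in blocks so large ones don't have to fit in memory.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_sha256(path, block_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it block by block."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def group_by_hash(root):
    """Map each digest to the list of files under root that share that content."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[file_sha256(path)].append(path)
    return dict(groups)
```

Each key in the result is one truly unique file, and any list with more than one path is a set of byte-identical duplicates. Comparing two collections then comes down to comparing their sets of digests.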
I am downloading dataset 9 and should have the full 180 GB zip done in a day. To confirm: is the link on DOJ to the dataset 9 zip now updated to be clean of CSAM or not? As much as I wish to help the cause, I do not want any of that type of material on my server unless permission has been given to host it only for credible researchers who need access to all files for their investigation. I have no way of knowing what is legally permitted when it comes to redistributing the files to legitimate investigators, so my plans to help create a torrent may be squashed. Please let me know.
I used AI to analyze the ~36 GB that I was able to download before they erased the zip file from the server.
I haven't looked into downloading the partials from archive.org yet to see if I have any useful files from dataset 9 that archive.org doesn't have yet.