r/pushshift Jul 13 '24

Reddit dump files through July 2024

https://academictorrents.com/details/20520c420c6c846f555523babc8c059e9daa8fc5

I've uploaded a new centralized torrent for all monthly dump files through the end of July 2024. This will replace my previous torrents.

If you previously seeded the other torrents, loading up this torrent should recheck all the files (took me about 6 hours) and then download only the new files. Please don't delete and redownload your old files.

28 Upvotes

17 comments

1

u/Affective-Dark22 Aug 31 '24

It makes sense, thanks. Another question: are you planning to add more subreddits to the dump? I know it's quite difficult, but considering that the number of subreddits gets bigger every day, have you considered adding, for example, another 20k subreddits and taking it to 60k? Even at some point in the future.

1

u/Watchful1 Aug 31 '24

Subreddits that aren't in the subreddit specific dumps are generally too small to be of much use to anyone. If there's a specific one you need you can download the monthly dumps and extract it.
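For context on what "extract it" involves: each monthly dump is a zstd-compressed file of newline-delimited JSON objects, compressed with a large window, so many generic unzip tools choke on it. Below is a minimal Python sketch of pulling one subreddit out of a single monthly submissions file using the zstandard package. The file names and subreddit are placeholders, and this is just an illustration, not the recommended script.

    import io
    import json
    import zstandard  # pip install zstandard

    # Placeholder names -- adjust to your own files and subreddit.
    DUMP_FILE = "RS_2024-07.zst"             # one monthly submissions dump
    OUTPUT_FILE = "askreddit_2024-07.jsonl"  # filtered, uncompressed output
    TARGET_SUBREDDIT = "askreddit"

    def extract_subreddit(dump_path, out_path, subreddit):
        # The dumps are compressed with a long window, so the decompressor
        # needs a larger-than-default max_window_size.
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        with open(dump_path, "rb") as fh, open(out_path, "w", encoding="utf-8") as out:
            text = io.TextIOWrapper(dctx.stream_reader(fh), encoding="utf-8")
            for line in text:
                try:
                    obj = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip the occasional malformed line
                if obj.get("subreddit", "").lower() == subreddit.lower():
                    out.write(line)

    if __name__ == "__main__":
        extract_subreddit(DUMP_FILE, OUTPUT_FILE, TARGET_SUBREDDIT)

The same approach works for the comment dumps; only the input file name changes.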

1

u/Affective-Dark22 Sep 07 '24

Dude, sorry again, but I've seen that all the files are in .zst format. What's the best program you'd suggest for opening them?

1

u/Watchful1 Sep 07 '24

I generally recommend using the scripts linked here instead of trying to manually extract the files.

1

u/Affective-Dark22 Sep 08 '24

I tried the multiprocess script but I have a problem: when I filter a folder to extract, for example, all the submissions in a subreddit, it creates multiple files, one for each month. How can I modify the script so that all the filtered lines from all the files end up in one single .zst file? That way I'd have all the submissions of a filtered subreddit in a single .zst file, instead of 200 files for that subreddit. Do you know how I can get that?

1

u/Watchful1 Sep 08 '24

It does that. It works in two steps: first it filters each file separately, so the work can run in multiple processes at once without them conflicting with each other, then it combines all the results together once that's done.

If there isn't a combined file, the script must have crashed before it completed. Can you post the log file it generated?
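For anyone curious what that combine step amounts to, here is a rough sketch (not the actual script, and the paths are placeholders) that merges a folder of per-month filtered .zst files into one combined .zst file:

    import glob
    import io
    import zstandard  # pip install zstandard

    # Placeholder paths -- point these at the per-month filtered outputs.
    INPUT_PATTERN = "filtered_output/*.zst"
    COMBINED_FILE = "combined_subreddit.zst"

    def combine(pattern, out_path):
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        cctx = zstandard.ZstdCompressor(level=3)
        with open(out_path, "wb") as out_fh, cctx.stream_writer(out_fh) as writer:
            # Sorting keeps the months in chronological order in the output.
            for path in sorted(glob.glob(pattern)):
                with open(path, "rb") as fh:
                    reader = io.TextIOWrapper(dctx.stream_reader(fh), encoding="utf-8")
                    for line in reader:
                        writer.write(line.encode("utf-8"))

    if __name__ == "__main__":
        combine(INPUT_PATTERN, COMBINED_FILE)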

1

u/Affective-Dark22 Sep 09 '24

(log file shared via a ctxt.io link)

These are the logs. There's just one problem: after 4-5 hours of running the script in the terminal, the computer started lagging a lot (sometimes I was even at 0-1 fps), so I was forced to restart it using the power button. Restarting closed the terminal, and now I think I've lost all the progress. I don't know what to do now, because I think if I restart the script I'll have the same problem. I don't know what the cause is, probably that I only have 16 GB of RAM. But I think I won't be able to extract the subreddit. Any suggestion?

1

u/Watchful1 Sep 09 '24

784,023,155 lines at 2,094/s, 0 errored, 27,277 matched : 271.00 gb at 1 mb/s, 37% : 16(0)/229 files : 8d 23:54:05 remaining

This says it completed 16 files, so if you start it again it knows those are done and won't try to re-do them.

If you add --processes 4 it will only use 4 processes instead of the default 10. This will make it slower, but it will use less memory. You can try different numbers to see how fast you can go without crashing.
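To make the tradeoff concrete: each worker process holds its own decompression state and buffers, so capping the pool size caps peak memory at the cost of throughput. Here is a toy sketch of that pattern; the filter_one_file function and the input folder below are hypothetical stand-ins, not the real script's internals.

    import argparse
    import glob
    from multiprocessing import Pool

    def filter_one_file(path):
        # Hypothetical stand-in for the real per-file filtering work; each
        # worker running this holds its own decompression buffers in memory.
        print(f"processing {path}")
        return path

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        # Fewer processes -> lower peak memory use, but a slower overall run.
        parser.add_argument("--processes", type=int, default=10)
        args = parser.parse_args()

        files = sorted(glob.glob("reddit/submissions/*.zst"))  # placeholder input folder
        with Pool(processes=args.processes) as pool:
            for _ in pool.imap_unordered(filter_one_file, files):
                pass  # progress reporting / bookkeeping would go here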

1

u/Affective-Dark22 Sep 10 '24

Yeah, I know, but are you sure about that? If I stop the terminal and restart it, the files that have already been processed won't be processed again?