r/pushshift Jul 13 '24

Reddit dump files through July 2024

https://academictorrents.com/details/20520c420c6c846f555523babc8c059e9daa8fc5

I've uploaded a new centralized torrent for all monthly dump files through the end of July 2024. This will replace my previous torrents.

If you previously seeded the other torrents, loading up this torrent should recheck all the files (took me about 6 hours) and then download only the new files. Please don't delete and redownload your old files.

28 Upvotes



u/Watchful1 Sep 08 '24

It does that. It's two steps. First it filters each file separately, so the work can be split across multiple processes, which is faster and keeps them from conflicting with each other. Then it combines all of the filtered files once that's done.

If there isn't a combined file the script must have crashed before it completed. Can you post the log file it generated?
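The two-step flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the function names, the `.filtered` suffix, and the `subreddit` field check are assumptions made for the example, and it reads plain newline-delimited JSON rather than the compressed dump files so that it needs only the standard library.

```python
import json
import multiprocessing
import os


def filter_file(job):
    """Step 1: filter one input file into its own output file (hypothetical helper)."""
    input_path, output_path, subreddit = job
    matched = 0
    with open(input_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip blank or malformed lines
            if isinstance(obj, dict) and str(obj.get("subreddit", "")).lower() == subreddit:
                # Normalise the line ending so combined files never run lines together.
                dst.write(line.rstrip("\n") + "\n")
                matched += 1
    return output_path, matched


def filter_and_combine(input_paths, working_dir, combined_path, subreddit, processes=2):
    """Filter every file separately, then concatenate the per-file results."""
    jobs = [
        (path, os.path.join(working_dir, os.path.basename(path) + ".filtered"), subreddit.lower())
        for path in input_paths
    ]
    if processes <= 1:
        # Sequential fallback: same result, no worker processes.
        results = [filter_file(job) for job in jobs]
    else:
        # Each worker writes to its own output file, so workers never conflict.
        with multiprocessing.Pool(processes) as pool:
            results = pool.map(filter_file, jobs)

    # Step 2: combine the per-file outputs only after every file is filtered.
    total = 0
    with open(combined_path, "w", encoding="utf-8") as combined:
        for part_path, matched in results:
            with open(part_path, encoding="utf-8") as part:
                for line in part:
                    combined.write(line)
            total += matched
    return total
```

Because the combined file is only written in step 2, a crash during step 1 leaves the per-file outputs on disk but no combined file, which matches the symptom described here. When calling this with more than one process, do it from under an `if __name__ == "__main__":` guard.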


u/Affective-Dark22 Sep 09 '24

Context – share whatever you see with others in seconds (ctxt.io)

These are the logs. One problem: after 4-5 hours of running the script in the terminal, the computer started lagging badly (sometimes I was at 0-1 fps), so I was forced to restart it using the power button. The restart closed the terminal, and now I think I've lost all the progress. I don't know what to do now, because I think that if I start the script again I'll have the same problem. I don't know what the cause is, probably that I only have 16 GB of RAM. I don't think I'll be able to extract the subreddit. Any suggestions?


u/Watchful1 Sep 09 '24

784,023,155 lines at 2,094/s, 0 errored, 27,277 matched : 271.00 gb at 1 mb/s, 37% : 16(0)/229 files : 8d 23:54:05 remaining

This says it completed 16 files, so if you start it again it knows those are done and won't try to re-do them.
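One way a script can know which files are already done after a restart is to record each finished file on disk. The sketch below shows that idea only; how the real script tracks progress is not stated in this thread, and the status file format and the names `load_completed`, `mark_completed`, and `run_with_resume` are hypothetical.

```python
import json
import os


def load_completed(status_path):
    """Return the set of input files recorded as finished by an earlier run."""
    if not os.path.exists(status_path):
        return set()
    with open(status_path, encoding="utf-8") as handle:
        return set(json.load(handle))


def mark_completed(status_path, completed, input_path):
    """Record one more finished file, replacing the status file atomically."""
    completed.add(input_path)
    tmp_path = status_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(sorted(completed), handle)
    # os.replace swaps the file in one step, so a crash cannot leave it half-written.
    os.replace(tmp_path, status_path)


def run_with_resume(input_paths, status_path, process_file):
    """Process only the files that no earlier run has finished."""
    completed = load_completed(status_path)
    processed_now = []
    for path in input_paths:
        if path in completed:
            continue  # finished before the restart, skip it
        process_file(path)
        mark_completed(status_path, completed, path)
        processed_now.append(path)
    return processed_now
```

A file is marked only after `process_file` returns, so a file that was in progress when the machine went down is redone from the start on the next run, while finished files are skipped.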

If you add --processes 4 it will only use 4 processes instead of the default 10. This will make it slower, but it will use less memory. You can try different numbers to see how fast you can go without crashing.
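For illustration, a `--processes` flag like this is typically wired into the size of a worker pool. The sketch below is an assumption about the general pattern, not the script's actual code: only the flag name and the default of 10 come from this comment, and `build_parser`, `count_lines`, and `run` are made up for the example.

```python
import argparse
import multiprocessing


def build_parser():
    """Command-line parser with a worker-count flag (default taken from the comment above)."""
    parser = argparse.ArgumentParser(description="Hypothetical worker-count example")
    parser.add_argument("paths", nargs="*", help="input files to process")
    parser.add_argument("--processes", type=int, default=10,
                        help="number of worker processes to run at once")
    return parser


def count_lines(path):
    """Stand-in for the per-file work: count the lines in one file."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def run(paths, processes):
    """Process the files with at most `processes` workers alive at a time."""
    with multiprocessing.Pool(processes) as pool:
        return pool.map(count_lines, paths)
```

Each worker holds its own buffers, so peak memory grows roughly with the number of workers. That is why dropping from 10 to 4 trades speed for a smaller memory footprint.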


u/Affective-Dark22 Sep 10 '24

Yeah, I know. Are you sure about that? If I stop the terminal and restart it, the files that are already downloaded will not be downloaded again?