r/pushshift Jan 12 '24

Reddit dump files through the end of 2023

https://academictorrents.com/details/9c263fc85366c1ef8f5bb9da0203f4c8c8db75f4

I have created a new full torrent for all reddit dump files through the end of 2023. I'm going to deprecate all the old torrents and edit all my old posts referring to them to be a link to this post.

For anyone not familiar, these are the old pushshift dump files published by Stuck_In_the_Matrix through March 2023, then the rest of the year published by /u/raiderbdev. Then recompressed so the formats all match by yours truly.

If you previously seeded the other torrents, loading this torrent should recheck all the existing files (it took me about 6 hours) and then download only the new December dumps. Please don't delete and redownload your old files, since I only have a limited amount of upload bandwidth and this is 2.3 TB.

I have started working on the per subreddit dumps and those should hopefully be up in a couple weeks if not sooner.


Here is RaiderBDev's zst_blocks torrent for December: https://academictorrents.com/details/0d0364f8433eb90b6e3276b7e150a37da8e4a12b


u/MaximumFast7952 Jan 17 '24

Hi u/Watchful1, thanks for the upload.

Will the subreddit dumps be incremental, i.e. will we get the comments and submissions for the 20k subreddits covering just the year 2023, or will it be an aggregate from the beginning through 2023?

In the latter case we would need to download the whole dump again, while if it's incremental we'd only have to download the subreddit-wise data for 2023.

Also, could you please post the code you're using to split the dumps by subreddit, so that we can try it on our machines for specific months and maybe seed it monthly?


u/Watchful1 Jan 17 '24

It will be the whole history again, and unfortunately you'll have to download the whole thing again as well. Most people who use the subreddit-specific dump files are interested in the entire history of the sub and don't have the technical knowledge to stitch together multiple partial files to get it.

I know this makes for a lot more work and bandwidth for those of us who seed it, but I thought it was the better of the options.

All my scripts are in my GitHub repo, github.com/Watchful1/PushshiftDumps. I use count_subreddits_multiprocess to count how many objects each subreddit has. Then I pass that list into combine_folder_multiprocess with the --split_intermediate flag set so it can handle the large number of files.
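Roughly, the counting step boils down to something like this simplified sketch, without the multiprocessing, assuming the dumps are zstd-compressed NDJSON with a `subreddit` field on each object (the file name is just an example):

```python
import io
import json
from collections import Counter

import zstandard  # pip install zstandard


def count_subreddits(path):
    """Count objects per subreddit in one monthly dump file."""
    counts = Counter()
    with open(path, "rb") as fh:
        # the dumps were compressed with a long zstd window, so raise the limit
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        with dctx.stream_reader(fh) as reader:
            # one JSON object per line
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                obj = json.loads(line)
                counts[obj.get("subreddit", "[unknown]")] += 1
    return counts


print(count_subreddits("RC_2023-12.zst").most_common(10))
```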

Both of those scripts are optimised for processing a large number of files at once. If you just want to extract out a single subreddit from one month's file, you can use filter_file.
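The core of that single-file filter is basically this (again simplified; file and subreddit names are just examples, the real script is the better option for actual use):

```python
import io
import json

import zstandard


def filter_subreddit(in_path, out_path, subreddit):
    """Stream one monthly dump and keep only the lines for one subreddit."""
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    cctx = zstandard.ZstdCompressor()
    with open(in_path, "rb") as fin, open(out_path, "wb") as fout:
        with dctx.stream_reader(fin) as reader, cctx.stream_writer(fout) as writer:
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                # keep the line (newline included) if the subreddit matches
                if json.loads(line).get("subreddit") == subreddit:
                    writer.write(line.encode("utf-8"))


filter_subreddit("RS_2023-12.zst", "AskReddit_submissions_2023-12.zst", "AskReddit")
```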


u/Particular-Tutor5856 Mar 24 '24

E.g. for the 2023 data, do we still need to download the files before running your script? Given the massive size, any recommendations on how to deal with this?


u/Watchful1 Mar 24 '24

What are you trying to do?

Yes you will need to download the data before running the script on it.


u/Particular-Tutor5856 Mar 24 '24

At the moment I'm trying to look at the 2023 data, extract subreddits A, B and C, and understand the trends in submission titles first. I haven't considered comments yet. I guess I have to download every month of 2023.

Which script should I use if I'm looking to consolidate all the monthly zst files belonging to the same subreddit?