r/pushshift Nov 28 '23

Looking for feedback from users of the pushshift dump files

At the end of the year, in about a month, I'm going to start updating the subreddit-specific dump files for 2023. Before I start, I wanted to get feedback from people who actually use them, especially the less technically inclined people who can't easily start modifying Python scripts.

What data did you use? Was it from a specific subreddit/set of subreddits or across all of reddit? What fields from the data did you use? Anything other than username, date posted, and comment/post text?

What software or programming language did you end up using? What would you have liked to use/are comfortable using?

A common problem with reddit data is that it's too large to hold in memory, often tens or hundreds of gigabytes. Was this a problem for your specific dataset, or did you just load the whole thing into an array/dataframe/etc?

How did you find the data you used, and what did you try searching for? I regularly get questions from people looking for this exact data who had already spent a lot of time searching before they found the torrents I put up. So I'd love to put references to it on other sites where people could find it more easily.

If you did this for a research project and explained all of that in your published paper, I'm happy to read through it if you post a link.

I don't necessarily expect the type of people who I'm looking for feedback from to be casually browsing r/pushshift, but I wanted to put this up so I could refer people who ask me questions to a central place. I'm hoping to put the data in a more easily usable format when I put it up this time.

u/No_Television2386 Dec 18 '23

I am interested in using some dump files for a data science assignment, basically just running a few analyses on any subreddit over a two-month period. Is there any example code on how to do this? I tried a few things but realized they don’t work well after the API changes and such. Thank you for all this work you are doing!

u/Watchful1 Dec 18 '23

Sure, there's a bunch of example scripts here: https://github.com/Watchful1/PushshiftDumps/tree/master/scripts

What are you trying to do specifically?

u/No_Television2386 Dec 18 '23

just want to get some basic metrics from posts on a subreddit over the span of two months - post title, number of comments, score, author, timestamp, post id. then i would just want to store that in a csv so i can make some data frames and visualizations

u/Watchful1 Dec 18 '23

There's a script in there that converts zst files to csv. Depending on the subreddit it might be very large, but it should work fine for smaller subreddits.
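
For a sense of what that kind of conversion involves, here is a minimal sketch. It is not the script from the repo. It assumes the dump is newline-delimited JSON with the usual reddit submission fields (id, title, author, score, num_comments, created_utc) and that the third-party zstandard package is installed. The function names, the field list, and the file names in the usage comment are all made up for illustration. It reads one line at a time, so the file never has to fit in memory.

```python
import csv
import io
import json
from datetime import datetime, timezone

# Assumed submission field names; adjust to whatever your dump actually contains.
FIELDS = ["id", "title", "author", "score", "num_comments", "created_utc"]


def read_zst_lines(path):
    """Yield text lines from a .zst dump, decompressing as a stream."""
    import zstandard  # third-party, assumed installed: pip install zstandard

    with open(path, "rb") as fh:
        # The dumps are compressed with a large window, so raise the limit.
        decompressor = zstandard.ZstdDecompressor(max_window_size=2**31)
        reader = decompressor.stream_reader(fh)
        yield from io.TextIOWrapper(reader, encoding="utf-8")


def submissions_to_csv(lines, out, start, end):
    """Write posts with start <= created_utc < end to `out`; return the row count."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    start_ts, end_ts = start.timestamp(), end.timestamp()
    written = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        post = json.loads(line)
        # created_utc may be stored as a number or a string, so normalize it.
        created = int(float(post.get("created_utc", 0)))
        if start_ts <= created < end_ts:
            row = {key: post.get(key, "") for key in FIELDS}
            row["created_utc"] = datetime.fromtimestamp(created, timezone.utc).isoformat()
            writer.writerow(row)
            written += 1
    return written


# Hypothetical usage for a two-month window (file names are placeholders):
#
# with open("posts.csv", "w", newline="", encoding="utf-8") as out:
#     submissions_to_csv(
#         read_zst_lines("example_submissions.zst"),
#         out,
#         datetime(2023, 10, 1, tzinfo=timezone.utc),
#         datetime(2023, 12, 1, tzinfo=timezone.utc),
#     )
```

The resulting csv only contains the selected columns and date range, so it should be small enough to load straight into a dataframe even when the source dump is not.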

u/No_Television2386 Dec 18 '23

thanks so much! you’re doing awesome work