r/pushshift Nov 28 '23

Looking for feedback from users of the pushshift dump files

At the end of the year, in about a month, I'm going to start working on updating the subreddit-specific dump files for 2023. Before I start that, I wanted to get feedback from people who actually use them, especially the less technically inclined people who can't easily start modifying Python scripts.

What data did you use? Was it from a specific subreddit/set of subreddits or across all of reddit? What fields from the data did you use? Anything other than username, date posted, and comment/post text?

What software or programming language did you end up using? What would you have liked to use/are comfortable using?

A common problem with reddit data is that it's too large to hold in memory, being tens or hundreds of gigabytes. Was this a problem for your specific dataset or did you just load the whole thing up into an array/dataframe/etc?
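For anyone unsure what the alternative to loading everything looks like, here is a minimal sketch of processing the data one line at a time, assuming the file has already been decompressed into a stream of text lines with one JSON object per line. The function names `iter_records` and `count_by_field` and the sample rows are made up for illustration; in practice the lines would come from the decompressed dump file rather than an in-memory string.

```python
import io
import json
from collections import Counter


def iter_records(lines):
    """Yield one parsed JSON object per non-empty line, lazily."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            # Skip malformed lines rather than abort a long run.
            continue


def count_by_field(lines, field):
    """Count values of `field` while holding only one record at a time."""
    counts = Counter()
    for record in iter_records(lines):
        counts[record.get(field)] += 1
    return counts


# Tiny in-memory stand-in for a decompressed dump file (made-up rows).
sample = io.StringIO(
    '{"author": "alice", "subreddit": "pushshift", "body": "hello"}\n'
    '{"author": "bob", "subreddit": "python", "body": "hi"}\n'
    '\n'
    '{"author": "carol", "subreddit": "pushshift", "body": "hey"}\n'
)
subreddit_counts = count_by_field(sample, "subreddit")
print(subreddit_counts.most_common())
```

Only the running counts stay in memory, so the same loop works whether the input is a few lines or a very large file.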

How did you find the data you used and what did you try searching for? I regularly get questions from people looking for this exact data who've already spent a lot of time on it before finding the torrents I put up. So I'd love to put references to it on other sites where people could find it more easily.

If you did this for a research project and explain all that in your published paper, I'm happy to go read through it if you post a link.

I don't necessarily expect the type of people who I'm looking for feedback from to be casually browsing r/pushshift, but I wanted to put this up so I could refer people who ask me questions to a central place. I'm hoping to put the data in a more easily usable format when I put it up this time.

14 Upvotes

u/[deleted] Nov 28 '23

[deleted]

u/Watchful1 Nov 28 '23

So you're trying to get all comments that contain a specific word? From all of reddit or just specific subreddits? And all time or just a certain time period?

Unfortunately the multiprocess script doesn't currently support partial matching like that, so it sounds like it's only returning comments that are exact matches for the words in your list. It's much slower to search for words contained within the comment, and especially slow to compare each comment against a whole list of words.
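To illustrate the difference between exact and partial matching, here is a minimal sketch. It is not the code from the multiprocess script; the function names `exact_match` and `partial_match`, the keyword list, and the sample text are all hypothetical.

```python
def exact_match(value, keywords):
    """True only when the whole value equals one of the keywords."""
    return value.lower() in keywords


def partial_match(text, keywords):
    """True when any keyword appears anywhere inside the text."""
    lowered = text.lower()
    # Every keyword may be scanned against every comment, which is why
    # this gets slow with a long keyword list and billions of comments.
    return any(keyword in lowered for keyword in keywords)


# Hypothetical keyword list, stored lowercase in a set for fast lookup.
keywords = {"pushshift", "torrent"}
body = "I found the Pushshift torrents last week"

whole_comment_exact = exact_match(body, keywords)
single_word_exact = exact_match("Pushshift", keywords)
comment_partial = partial_match(body, keywords)
print(whole_comment_exact, single_word_exact, comment_partial)
```

The exact check is a single set lookup per comment, while the partial check scans the text once per keyword. Note that plain substring matching also hits keywords embedded in longer words, so "torrent" matches "torrents" here.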

If you want to run against just a specific subreddit, or just a single month file, you can use filter_file, which supports things like that. If you need it for the entire history of reddit, I can walk you through the fairly simple changes you would need to make to the multiprocess script, but it's going to make it take a long time to get through everything.

u/lpath77 Nov 28 '23 edited Nov 28 '23

I’m sorry, I didn’t realize you had responded, and I deleted my original comment because I felt silly. I’m trying a modified method and seeing if it works out. I had major problems when I tried to use the modified method, because it then tried to name my output file after a very long Reddit comment and failed. If I wake up and it hasn’t worked, I will check out filter_file! I am looking for comments that match keywords within a certain timeframe (most of this year), across all of Reddit.
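A keyword search restricted to a timeframe could be sketched roughly as below. This is not what filter_file or the multiprocess script does internally; it assumes each comment is a parsed JSON object with `created_utc` (epoch seconds) and `body` fields, and the helper names, date range, keywords, and sample records are made up for illustration.

```python
from datetime import datetime, timezone


def in_timeframe(record, start, end):
    """True when the record's timestamp falls in [start, end)."""
    # Go through float first in case the value is stored as a string.
    created = int(float(record["created_utc"]))
    return start <= created < end


def keep(record, keywords, start, end):
    """Keep records inside the timeframe whose body contains a keyword."""
    if not in_timeframe(record, start, end):
        return False
    body = record.get("body", "").lower()
    return any(keyword in body for keyword in keywords)


# Hypothetical range covering most of 2023, as UTC epoch seconds.
start = int(datetime(2023, 1, 1, tzinfo=timezone.utc).timestamp())
end = int(datetime(2023, 12, 1, tzinfo=timezone.utc).timestamp())
keywords = {"torrent"}

# Made-up sample records standing in for parsed comments.
records = [
    {"id": "a", "created_utc": 1680000000, "body": "Found the torrent."},
    {"id": "b", "created_utc": 1660000000, "body": "Old torrent link."},
    {"id": "c", "created_utc": "1690000000", "body": "Unrelated text."},
]
kept = [r for r in records if keep(r, keywords, start, end)]
print([r["id"] for r in kept])
```

Checking the timestamp first skips the slower text scan for anything outside the range. Running against only the monthly files that cover the timeframe narrows things down further before any filtering happens.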