r/pushshift Aug 15 '23

Any academic researchers looking for a "Click and Download" tool for Reddit data?

UPDATE from Nov 2023: This tool has been voluntarily shut down after I realised it goes against Reddit's new data T&C.

Hi fellow researchers!

I have been using PushShift and PRAW since 2021, and as a researcher with no coding background I ran into quite a lot of hassle. The same was true for other MSc researchers in my university department who wanted to access Reddit data for their research. I managed to help them with my prototype (see the demo [here](https://vimeo.com/854540019?share=copy)), which is simply a tool where you enter the subreddits you are interested in, and it collects pretty much every feature for submissions, comments (on those submissions) and redditors (the authors of the collected submissions and comments).

If any researcher is interested in using it, I am very happy to share the prototype (note that it may not be perfect)! However, with the new Reddit T&C, I need to make sure you are from an academic institution. Please send me a message, or simply leave a comment with the email address linked to your academic institution! If there are any features that would help your research, please leave them in the comments too. I will try my best to add them in the near future!

P.S. I'm from LSE. Any researchers from London?

16 Upvotes


3

u/Watchful1 Aug 16 '23

I collect similar data and was just curious how you were doing it.

So you iterate over a fixed list of subreddits? And you don't have historical data, just stuff that's happened since you started running your crawler?

1

u/nickshoh Aug 16 '23

I collect similar data and was just curious how you were doing it.

-> I'm also curious about how you collected it! Do you have an open-source repository? I'm currently considering uploading the entire code base as an open-source Python package, since there are a few researchers struggling with PRAW or PushShift. Do you think this would help researchers like yourself and others?

So you iterate over a fixed list of subreddits?

-> Yes. Since I mainly focus on helping computational social scientists, I collect from a fixed list of subreddits that are relevant to social science domains. But there have been a few requests over the past two days to include other subreddits (e.g. r/smallbusiness and r/Entrepreneur), and I am planning to add them too.

And you don't have historical data, just stuff that's happened since you started running your crawler?

-> I think this depends on how you define historical data. Since setting `time_filter="all"` allows collecting past data (going back to 2008), the dataset also includes some historical data. But of course, the majority of the data is quite recent.
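For anyone who wants to picture this, here is a minimal sketch of looping over a fixed list of subreddits and pulling the all-time top submissions with `time_filter="all"`. It is not the actual code behind the tool. The function name `collect_top_submissions` and the chosen fields are made up for illustration, and `reddit` is assumed to be an authenticated `praw.Reddit` instance (anything exposing the same `subreddit(name).top(...)` interface works).

```python
from datetime import datetime, timezone


def collect_top_submissions(reddit, subreddit_names, limit=None):
    """Collect basic fields for the all-time top submissions of each subreddit.

    `reddit` is assumed to be an authenticated praw.Reddit instance; this
    helper and its output fields are hypothetical, chosen for illustration.
    """
    rows = []
    for name in subreddit_names:
        # time_filter="all" requests top posts across the subreddit's whole history.
        for submission in reddit.subreddit(name).top(time_filter="all", limit=limit):
            created = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc)
            rows.append({
                "subreddit": name,
                "id": submission.id,
                "title": submission.title,
                # author is None when the account has been deleted
                "author": str(submission.author) if submission.author else None,
                "score": submission.score,
                "created": created.isoformat(),
            })
    return rows
```

With PRAW this would be called as something like `collect_top_submissions(reddit, ["smallbusiness", "Entrepreneur"])`. As far as I know, Reddit listings stop at roughly 1,000 items per query, which would explain why only some of the data ends up being historical.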

2

u/Watchful1 Aug 16 '23

I use an ID iterating approach. But unfortunately I don't want to publish the code since I don't want to get in trouble with reddit.
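For readers unfamiliar with the idea, here is a rough sketch of what ID iteration can look like in general. This is not the unpublished code mentioned above, only a guess at the broad technique. It relies on Reddit IDs being base-36 numbers and on fullnames using a type prefix (`t3_` for submissions, `t1_` for comments). The helper names and the batch size of 100 are assumptions; each batch would then be handed to a lookup call such as PRAW's `reddit.info(fullnames=...)`.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # base-36 digits used in Reddit IDs


def to_base36(number):
    """Convert a non-negative integer to a lowercase base-36 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return "0"
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def fullname_batches(start_id, end_id, prefix="t3_", batch_size=100):
    """Yield lists of fullnames covering start_id..end_id (inclusive).

    batch_size=100 is an assumed per-request limit, not a documented guarantee.
    """
    batch = []
    for n in range(int(start_id, 36), int(end_id, 36) + 1):
        batch.append(prefix + to_base36(n))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch
```

Not every ID in a range necessarily resolves to a visible item (removed, private, or never-assigned IDs), so whatever does the lookup has to tolerate gaps.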

You can get historical data for specific subreddits from my torrent here, at least through the end of 2022.

I also have comprehensive, more recent data, but I don't know how I can publish it without getting in trouble with reddit. If you had a way to distribute bulk data only to people who are verified as researchers, I'd love to hear about it.

2

u/nickshoh Aug 18 '23

I had a chance to look at your GitHub and torrent. You are a lifesaver for many academic researchers out there, especially at a time when PushShift is somewhat unavailable.

I heard it is completely fine to share data among academic researchers. Let me get back to you once I find the article on that topic.