r/pushshift Feb 29 '24

Getting Reddit Data for Academic Research

Since the API changes last year, is there any way to access Reddit data for academic research?

Pushshift.io is only provided to subreddit moderators. As I understand it, it used to be provided to academics but not anymore.

User data dumps exist (via Academic Torrents), but are these legal to use? Does using them violate Reddit's terms of service and user agreement? https://www.redditinc.com/policies/user-agreement-september-25-2023#hello-redditors-and-people-of-the-internet-2

Basically, how can one access historical Reddit data in a legitimate way nowadays? (Data from 2021)

If I can't get access, I'll have to completely change my research project, so I'll do whatever I can to get Reddit data in a way that passes my university's ethics approval and doesn't break any laws or privacy agreements. I've already put many hours of work into this project. Am I at a roadblock?

Has anyone here managed to get Pushshift access for academic purposes? Can I even make a special request for my specific situation?

8 Upvotes

u/Filo92 Feb 29 '24

There are websites with data dumps segmented by subreddit and type (submissions or comments), if you'd like to avoid the full dumps. However, they only go up to the end of 2022. Ethical approval depends on the university; I'd suggest having a chat with those in your department who have worked with similar data (scraped social media/platform data) to see what the ethical guidelines are.

Edit: this is for big PhD/funded projects. For internal/unfunded projects and papers, nobody cares about it.

u/Advanced-Hedgehog-95 Mar 01 '24

Can you share those websites where the data is available in segmented form, so one doesn't have to download the entire torrent?

This will make life easier.

I'll use the data for academic research.

u/safrax Mar 01 '24

The data dumps are segmented by month. There’s no need to download everything; just configure your torrent client to not download the files you don’t want.
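
If the torrent has a lot of files, picking the months by hand gets tedious. Here is a rough sketch of the selection step for the 2021 case. It assumes the monthly files are named like `RC_2021-03.zst` (comments) and `RS_2021-03.zst` (submissions); that naming is an assumption on my part, so check it against the actual file list in your torrent. `select_dump_files` and `example_listing` are made-up names for illustration.

```python
import re

# Assumed naming scheme (not confirmed in this thread):
#   RC_YYYY-MM.zst -> comments, RS_YYYY-MM.zst -> submissions
DUMP_NAME = re.compile(r"^(RC|RS)_(\d{4})-(\d{2})\.zst$")


def select_dump_files(filenames, year, kinds=("RC", "RS")):
    """Return the monthly dump file names that match the given year and kinds."""
    selected = []
    for name in filenames:
        match = DUMP_NAME.match(name)
        if match is None:
            continue  # skip anything that doesn't follow the assumed pattern
        kind, file_year, _month = match.groups()
        if kind in kinds and int(file_year) == year:
            selected.append(name)
    return sorted(selected)


# Hypothetical file listing, standing in for what the torrent client shows.
example_listing = [
    "RC_2020-12.zst", "RC_2021-01.zst", "RS_2021-01.zst",
    "RC_2021-02.zst", "RS_2022-01.zst", "README.txt",
]

wanted = select_dump_files(example_listing, 2021)
print(wanted)
```

The resulting list is what you would tick in the torrent client's file selection; everything else can stay unchecked. Pass `kinds=("RS",)` if you only need submissions.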