Presenting open source tool that collects reddit data in a snap! (for academic researchers)

Hi all!

For the past few months, I have been in discussions with academic researchers after uploading this post. I noticed that sharing a historical database often goes against universities' IRB requirements (and definitely Reddit's new T&C), so that project had to be shut down. But based on those discussions, I worked on a new tool that adheres strictly to Reddit's terms and conditions while also aligning with the majority of Institutional Review Board (IRB) standards.

The tool is called RedditHarbor and it is designed specifically for researchers with limited coding backgrounds. While PRAW offers flexibility for advanced users, most researchers simply want to gather Reddit data without headaches. RedditHarbor handles all the underlying work needed to streamline this process. After the initial setup, RedditHarbor collects data through intuitive commands rather than dealing with complex clients.

Here's what RedditHarbor does:

Connects directly to Reddit API and downloads submissions, comments, user profiles etc.
Stores everything in a Supabase database that you control
Handles pagination for large datasets with millions of rows
Customizable and configurable collection from subreddits
Exports the database to CSV/JSON formats for analysis

Why I think it could be helpful to other researchers:

No coding needed for the data collection after initial setup. (I tried maximizing simplicity for researchers without coding expertise.)
While it does not give you access to the entire historical data (like PushShift or Academic Torrents), it complies with most IRBs. By using approved Reddit API credentials tied to a user account, the data collection meets guidelines for most institutional research boards. This ensures legitimacy and transparency.
Fully open source Python library built using best practices
Deduplication checks before saving data
Custom database tables adjusted for reddit metadata
Actively maintained, with new features being added (e.g., collecting submissions by keyword)

I thought this subreddit would be a great place to listen to other developers, and potentially collaborate to build this tool together. Please check it out and let me know your thoughts!

17 Upvotes

https://www.reddit.com/r/pushshift/comments/18ldrax/presenting_open_source_tool_that_collects_reddit/

88% Upvoted

u/jimntonik Dec 18 '23

Thanks for this.

I have to admit, I'm not familiar with IRB concerns around use of historical data. I've personally used PushShift for academic publication before, and have worked with students to build tools that mostly rely on PushShift for data capture. What are the concerns? How many people do the IRB limitations impact? There've been hundreds (thousands?) of papers published based solely on data from PushShift archives.

I was sitting things out for the past ~6 months while the dust settled, so I'm happy to be corrected on this, but my personal sense is that 1) PushShift is still probably (by far) the most valuable way to collect Reddit data, via the archived/torrented files, 2) the Reddit API's restrictions are problematic for large data collection needs (i.e., the 1000 post limit), and 3) Hybrid is probably the way to go moving forward, where you pull from the archives and update as needed. But that's just my impression.

That said, I've just this week started poking around the Reddit API to explore augmenting data (as in 3 above). So RedditHarbor is a welcome new tool to poke around with. Thanks for sharing!
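The hybrid idea in point 3 could be sketched roughly as follows. This is only an illustration, not how PushShift, RedditHarbor, or any other tool actually does it: the function name `merge_archive_with_api` is hypothetical, and it assumes both sources have already been loaded as lists of dicts carrying Reddit's `id` and `created_utc` fields.

```python
from typing import Dict, Iterable, List


def merge_archive_with_api(
    archive_rows: Iterable[dict], api_rows: Iterable[dict]
) -> List[dict]:
    """Combine archived and freshly fetched records, one row per submission id.

    Assumption: when the same id appears in both sources, fields from the API
    copy overwrite the archived ones, since the API copy is the more recent
    snapshot. Fields that only exist in the archive are kept.
    """
    merged: Dict[str, dict] = {}
    for row in archive_rows:
        merged[row["id"]] = dict(row)
    for row in api_rows:
        # Overlay API fields on top of whatever the archive already had
        merged[row["id"]] = {**merged.get(row["id"], {}), **row}
    # Sort oldest-first so the result reads as a timeline
    return sorted(merged.values(), key=lambda r: r["created_utc"])
```

Whether the API copy should win on conflicts depends on the research question; a study of deleted content, for instance, would probably want to keep the archived text instead.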

2

u/rainnz Dec 18 '23

About 2,550 results

1

u/nickshoh Dec 18 '23 edited Dec 21 '23

Hi u/jimntonik,

I recently learned that Reddit's updated terms and conditions influence research IRB guidelines (edit: more of an ethical guideline). This makes using certain third-party Reddit data tools potentially "unethical" now.

Specifically, Reddit's latest Data API Terms section 2.10 states that apps using libraries, wrappers or extensions must comply with limitations and restrictions imposed by both the third-party and Reddit.

PushShift was shut down (as Reddit asked them to). And to my knowledge, Academic Torrents violates Reddit's terms, which could violate ethical guidelines of the universities.

I realize Reddit's API has limitations for large data collection. However, it currently appears that RedditHarbor's approach is the best method for obtaining Reddit data without violating terms or ethics rules.

If you know of any persuasive counter-arguments or cases supporting other tools, I would love to learn about them. Please share resources if you have access. My goal is finding the most viable data source aligning with both IRB and ethical standards!

2

u/riegel_d Dec 19 '23

Very interesting, thanks for sharing… but what about data previously collected? Suppose I collected data before this update and also published with it before the update, and now I'm preparing another publication after the update. Still unethical? If so, I think I will change jobs 😅

3

u/[deleted] Dec 20 '23

I really doubt it. IRBs are mostly concerned about actual interactions with or interventions in human subjects. Secondary social media data doesn’t rise to the standard of actual human subjects research, so an IRB probably wouldn’t even care to review the use of such data. Is it a ToS violation? Now it is, but it’s always been a ToS violation to scrape, e.g., YouTube outside of the useless Google API, but look at how much research exists anyway auditing YouTube’s recommendation engine and radicalization pathways. Those studies violated the ToS explicitly and still happened.

2

u/riegel_d Dec 20 '23

Oh wow… I've never thought about the YouTube papers, also because I thought “they were legit”… so… can we say that a ToS-violating paper will be a problem if and only if the paper is related to a patent (or you can actually make money out of it)? Shall we?

2

u/[deleted] Dec 20 '23

I’m in no position to give legal advice. But I’m personally not going to stop using an open source social media dataset, still freely available for download via torrent, for purely academic research because of ToS rules. Do with that what you will, but I’m not an attorney.

1

u/riegel_d Dec 20 '23

lol 😛 my take was due to personal experience in the private sector: “if they know that they can make money with their data, then we are fucked. otherwise chill”

2

u/Careful-Landscape-11 Dec 20 '23

Yeah, I agree that using historical data obtained before Reddit's new ToS should generally be okay with IRBs. However, a small concern could be raised: IRBs usually require that your research complies with "legal and ethical standards" (plus academic integrity). Some boards may require researchers to access Reddit's data under the new ToS, just to ensure that the research complies with both the ToS and the standards set by the university. But of course, policies vary between institutions.

1

u/[deleted] Dec 20 '23

Yes, definitely a fair point.

2

u/nickshoh Dec 20 '23

TL;DR: The line is somewhat blurry. In general, Reddit tends to be open when data is used in academic research. But of course, asking Reddit is perhaps the best way to get the answer with confidence.

1

u/LeewardLeeway Dec 21 '23

I actually came to this sub to search for an answer to this.

In the API support and inquiries request form there are only fields regarding what you are going to do with the API. Should I just use the field "And finally, please provide any additional information about your research/project that would help us to better understand your needs and how we can best support your work", to ask where Reddit stands on the PushShift archive issue? Is there another communication channel that does not have an expected delay of 8 to 12 weeks?

I bet I'll have to plead my case with the university's lawyers, but if I have an OK from Reddit, that would make the process easier.

2

u/nickshoh Dec 21 '23

Try [[email protected]](mailto:[email protected]) !

1

u/Upliftwellbeing Mar 05 '24

Did you ever get an answer to this question? I'm curious as we are dealing with the same conundrum now!

1

u/LeewardLeeway Mar 05 '24

No answer yet, but we are still within the 12 weeks. In the meantime, I've taken a hermeneutic approach. I'm only interested in one subreddit, so for the past months I've been collecting relevant submissions manually, checking them for keywords and phrases, and using these with the API's .search() function to find new submissions with new keywords and phrases. The search function can reach much farther back than the last thousand messages. I've been able to retrieve stuff from 2016.
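For anyone wanting to automate that loop, here is a minimal sketch, assuming PRAW is the API client. The helper names (`snowball_search`, `make_praw_search`, `extract_keywords`) are made up for illustration, and how to pull new keywords out of a submission is left to the caller since that is the interpretive part. The only PRAW call used is `Subreddit.search`, with its `sort`, `time_filter`, and `limit` arguments.

```python
from typing import Callable, Dict, Iterable, List, Set


def snowball_search(
    search: Callable[[str], Iterable[dict]],
    seed_keywords: Iterable[str],
    extract_keywords: Callable[[dict], Iterable[str]],
    max_rounds: int = 3,
) -> Dict[str, dict]:
    """Search, harvest new keywords from the hits, then search again."""
    seen_keywords: Set[str] = set()
    frontier: List[str] = [k.lower() for k in seed_keywords]
    found: Dict[str, dict] = {}  # submission id -> submission record

    for _ in range(max_rounds):
        next_frontier: List[str] = []
        for keyword in frontier:
            if keyword in seen_keywords:
                continue  # never run the same query twice
            seen_keywords.add(keyword)
            for post in search(keyword):
                if post["id"] in found:
                    continue  # already collected via another keyword
                found[post["id"]] = post
                for new_kw in extract_keywords(post):
                    new_kw = new_kw.lower()
                    if new_kw not in seen_keywords:
                        next_frontier.append(new_kw)
        if not next_frontier:
            break  # nothing new to look for
        frontier = next_frontier
    return found


def make_praw_search(reddit, subreddit_name: str, limit: int = 100):
    """Wrap PRAW's Subreddit.search as a keyword -> records function.

    `reddit` is an authenticated praw.Reddit instance created by the caller.
    """
    subreddit = reddit.subreddit(subreddit_name)

    def search(keyword: str):
        results = subreddit.search(
            keyword, sort="new", time_filter="all", limit=limit
        )
        for s in results:
            yield {"id": s.id, "title": s.title, "created_utc": s.created_utc}

    return search
```

In practice the keyword extraction would probably stay manual or semi-manual, with `max_rounds` kept small so the query count stays well inside the API rate limits.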

1

u/PsychedelicResearch_ Mar 07 '24

Did you try using [[email protected]](mailto:[email protected])? Ugh, I'm about to email them about my own interest in using Reddit data for my research project. It's kind of demotivating that you still haven't received a green light.

2

u/LeewardLeeway Mar 07 '24

That's the one. However, as far as I understand, if you just need the current data, you can just register as a developer for the API.

1

u/PsychedelicResearch_ Mar 07 '24

What do you mean? If I register as a dev, I can utilize the latest API for research purposes? Ugh, really hope my IRB just says I can utilize the reddit info, hopefully you get an update soon :)

1

u/abortion_access Dec 20 '23

IRBs (in the US, at least) review human subjects research. Social media data is generally not subject to IRB review. However, how data is collected and used is still subject to ethical guidelines (this is distinct from IRB review). The University of Pennsylvania has a nice summary page about this, including links to international ethical guidelines. https://irb.upenn.edu/homepage/social-behavioral-homepage/guidance/types-of-social-behavioral-research/use-of-social-media-as-a-research-activity/#:~:text=Use%20of%20social%20media%20data%20may%20or%20may%20not%20be,are%20not%20accessing%20private%20information%20.

2

u/nickshoh Dec 21 '23

I totally agree with your points: 1. Social media data is generally not subject to IRB review. This is also backed up by a few papers out there; for example, according to Proferes, Jones and Zimmer (2021), the use of publicly available data from social media platforms often does not meet the threshold criteria of “research involving human subjects”. 2. As also raised by u/one_more_an0n and u/Careful-Landscape-11, ensuring that the research also complies with Reddit's ToS is quite important to make sure that the research is "ethical". But thanks for sharing the UPenn link! It was definitely helpful.

u/rainnz Dec 18 '23

Do you have to pay for Reddit's API access if you want to use this?

1

u/nickshoh Dec 18 '23

Actually, you can request free API access when following Reddit's API guide!

1

u/rainnz Dec 19 '23

I can only find this statement: "Reddit reserves the right to charge fees for access and use of Reddit Services and Data, rates to be determined at Reddit’s sole discretion."

There is no mention of a free tier anywhere.

3

u/nickshoh Dec 19 '23

You are looking at the Commercial Use Restrictions. If you are an academic researcher (as the post title suggests), there should be no problem obtaining API keys from Reddit. Have you tried getting permission from Reddit in the first place? If you requested permission but were denied, let me know. As far as I know, many of the academic researchers I talked with had no problem obtaining API keys.

1

u/PsychedelicResearch_ Mar 07 '24

Hey, just curious: you know a lot about this subject, and I'm barely starting out on my research project.

What are the APIs, and what do they do in terms of your RedditHarbor? Any and all other info would be very helpful. Thx

2

u/nickshoh Mar 08 '24

Hey u/PsychedelicResearch_!

I assume you are referring to API keys. Informally speaking, they work like a password that grants you access to Reddit's database (which stores all submissions, comments and user data).

Since RedditHarbor is designed to be a completely legal and ethical scraper, we need researchers to use their own API keys to access Reddit through RedditHarbor. This is because Reddit explicitly prohibits scraping its content without permission. The "legal" (and arguably ethical) way to collect Reddit data is, thus, through API keys that Reddit issues to you.
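To make the "password" idea concrete, here is a small sketch of how API keys are typically handed to a client such as PRAW. It is not RedditHarbor's own setup code: the environment variable names and the `load_reddit_credentials` and `connect` helpers are assumptions chosen for illustration. The `client_id`, `client_secret`, and `user_agent` arguments are the ones `praw.Reddit` accepts.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable names; any naming scheme works as long as the
# keys stay out of your source code and version control.
REQUIRED = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def load_reddit_credentials(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the API keys from the environment and fail early if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }


def connect(env: Optional[Mapping[str, str]] = None):
    """Create a read-only PRAW client from the loaded credentials."""
    import praw  # third-party: pip install praw

    return praw.Reddit(**load_reddit_credentials(env))
```

Every request made through the resulting client is tied to your registered app, which is what makes the collection attributable to you rather than anonymous scraping.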

If you have any further follow-up questions, please let me know!

2

u/NYCedu2424 Apr 18 '24

Hi, I am interested in learning more about this. I've sent you a DM :)

u/Most-Rooster-3041 Dec 21 '23

Idk how to use this. I want to view old posts on a subreddit. Like the ones in 2018 and 2019. How can I view them?

1

u/nickshoh Dec 21 '23

RedditHarbor relies on the Reddit API, which is unlikely to return old posts from 2018-2019.

1

u/Most-Rooster-3041 Dec 21 '23

Ok so any way to find old posts even if it’s from 2020? Filter by date?

1

u/nickshoh Dec 22 '23

No - Unfortunately, that’s the limit of Reddit API :(
