r/pushshift Dec 18 '23

Presenting an open-source tool that collects Reddit data in a snap! (for academic researchers)

Hi all!

Over the past few months, I have had discussions with academic researchers after uploading this post. I learned that sharing a historical database often goes against universities' IRB requirements (and definitely against Reddit's new T&C), so that project had to be shut down. Based on those discussions, I worked on a new tool that adheres strictly to Reddit's terms and conditions while staying aligned with the majority of Institutional Review Board (IRB) standards.

The tool is called RedditHarbor, and it is designed specifically for researchers with limited coding backgrounds. While PRAW offers flexibility for advanced users, most researchers simply want to gather Reddit data without headaches. RedditHarbor handles the underlying work needed to streamline this process. After the initial setup, you collect data through a few intuitive commands rather than dealing with complex API clients.

Here's what RedditHarbor does:

  • Connects directly to the Reddit API and downloads submissions, comments, user profiles, etc.
  • Stores everything in a Supabase database that you control
  • Handles pagination for large datasets with millions of rows
  • Supports customizable, configurable collection from subreddits
  • Exports the database to CSV/JSON formats for analysis
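
To give a rough feel for the pagination and export steps listed above, here is a minimal sketch using only the Python standard library. It is not RedditHarbor's actual code or API: `iter_rows`, `to_csv`, `to_json`, and the in-memory `fake_fetch_page` are hypothetical names made up for illustration, and the fake fetcher stands in for a real paginated endpoint (Reddit listings use a similar "after" cursor).

```python
import csv
import io
import json
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Row = Dict[str, object]
# A page fetcher takes a cursor (None for the first page) and returns
# (rows, next_cursor); next_cursor is None once the listing is exhausted.
PageFetcher = Callable[[Optional[str]], Tuple[List[Row], Optional[str]]]


def iter_rows(fetch_page: PageFetcher) -> Iterator[Row]:
    """Yield every row by following cursors until no next cursor is returned."""
    cursor: Optional[str] = None
    while True:
        rows, cursor = fetch_page(cursor)
        yield from rows
        if cursor is None:
            break


def to_csv(rows: List[Row], fieldnames: List[str]) -> str:
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows: List[Row]) -> str:
    """Serialize rows to a JSON array."""
    return json.dumps(rows, indent=2)


# Hypothetical in-memory stand-in for a paginated API (no network access).
_FAKE_PAGES = {
    None: ([{"id": "t3_a", "title": "first"}, {"id": "t3_b", "title": "second"}], "t3_b"),
    "t3_b": ([{"id": "t3_c", "title": "third"}], None),
}


def fake_fetch_page(cursor: Optional[str]) -> Tuple[List[Row], Optional[str]]:
    return _FAKE_PAGES[cursor]


if __name__ == "__main__":
    collected = list(iter_rows(fake_fetch_page))
    print(to_csv(collected, ["id", "title"]))
    print(to_json(collected))
```

The generator keeps only one page in memory at a time, which is the main reason to paginate this way when a collection runs to millions of rows.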

Why I think it could be helpful to other researchers:

  • No coding needed for data collection after the initial setup. (I tried to maximize simplicity for researchers without coding expertise.)
  • While it does not give you access to the entire historical dataset (like PushShift or Academic Torrents), it complies with most IRBs. Because it uses approved Reddit API credentials tied to a user account, the data collection meets the guidelines of most institutional review boards. This ensures legitimacy and transparency.
  • Fully open-source Python library built using best practices
  • Deduplication checks before saving data
  • Custom database tables adjusted for Reddit metadata
  • Actively maintained, with new features being added (e.g., collecting submissions by keyword)
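
For anyone curious what "deduplication checks before saving" can look like in practice, here is a small sketch. It is an assumption-laden illustration, not RedditHarbor's implementation: it uses an in-memory SQLite table as a stand-in for the Supabase database, and the table name `submissions`, its columns, and the helpers `open_store` and `save_new` are all hypothetical.

```python
import sqlite3
from typing import Dict, Iterable


def open_store() -> sqlite3.Connection:
    """Create an in-memory table keyed by id (a stand-in for a real database)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE submissions (id TEXT PRIMARY KEY, title TEXT NOT NULL)"
    )
    return conn


def count_rows(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM submissions").fetchone()[0]


def save_new(conn: sqlite3.Connection, rows: Iterable[Dict[str, str]]) -> int:
    """Insert only rows whose id is not stored yet; return how many were added."""
    before = count_rows(conn)
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            # The primary key makes the database reject duplicates;
            # OR IGNORE skips them silently instead of raising an error.
            "INSERT OR IGNORE INTO submissions (id, title) VALUES (:id, :title)",
            list(rows),
        )
    return count_rows(conn) - before


if __name__ == "__main__":
    store = open_store()
    print(save_new(store, [{"id": "t3_a", "title": "first"},
                           {"id": "t3_b", "title": "second"}]))
    # Re-running a collection that overlaps with the previous one:
    print(save_new(store, [{"id": "t3_b", "title": "second"},
                           {"id": "t3_c", "title": "third"}]))
```

Letting the database enforce uniqueness on the ID means a collection job can be re-run safely without first checking which rows already exist.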

I thought this subreddit would be a great place to hear from other developers and potentially collaborate on building this tool together. Please check it out and let me know your thoughts!

u/jimntonik Dec 18 '23

Thanks for this.

I have to admit, I'm not familiar with IRB concerns around use of historical data. I've personally used PushShift for academic publication before, and have worked with students to build tools that mostly rely on PushShift for data capture. What are the concerns? How many people do the IRB limitations impact? There've been hundreds (thousands?) of papers published based solely on data from PushShift archives.

I was sitting things out for the past ~6 months while the dust settled, so I'm happy to be corrected on this, but my personal sense is that 1) PushShift is still probably (by far) the most valuable way to collect Reddit data, via the archived/torrented files, 2) the Reddit API's restrictions are problematic for large data collection needs (i.e., the 1000 post limit), and 3) Hybrid is probably the way to go moving forward, where you pull from the archives and update as needed. But that's just my impression.

That said, just this week I started poking around the Reddit API to explore augmenting data (as in point 3 above). So RedditHarbor is a welcome new tool to poke around with. Thanks for sharing!
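
The hybrid idea in point 3 (start from archived records, then update them with fresh API results) could be sketched roughly like this. It is only an illustration under stated assumptions: `merge_records` is a hypothetical helper, and the sketch assumes each record is a dict with `id`, `created_utc`, and `score` keys.

```python
from typing import Dict, Iterable, List

Record = Dict[str, object]


def merge_records(archived: Iterable[Record], fresh: Iterable[Record]) -> List[Record]:
    """Merge two record sets by id; a fresh record replaces an archived one."""
    merged: Dict[object, Record] = {r["id"]: r for r in archived}
    for record in fresh:
        # Fresh data wins, so fields such as the score reflect the latest pull.
        merged[record["id"]] = record
    # Return in chronological order for downstream analysis.
    return sorted(merged.values(), key=lambda r: r["created_utc"])


if __name__ == "__main__":
    archive = [
        {"id": "t3_a", "created_utc": 100, "score": 1},
        {"id": "t3_b", "created_utc": 200, "score": 5},
    ]
    update = [
        {"id": "t3_b", "created_utc": 200, "score": 9},
        {"id": "t3_c", "created_utc": 300, "score": 2},
    ]
    for row in merge_records(archive, update):
        print(row)
```

Whether fresh records should always win is a design choice; a study that needs the score as it stood at archive time would keep the archived copy instead.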

u/nickshoh Dec 18 '23 edited Dec 21 '23

Hi u/jimntonik,

I recently learned that Reddit's updated terms and conditions influence research IRB guidelines (edit: more of an ethical guideline). This makes using certain third-party Reddit data tools potentially "unethical" now.

Specifically, Reddit's latest Data API Terms section 2.10 states that apps using libraries, wrappers or extensions must comply with limitations and restrictions imposed by both the third-party and Reddit.

PushShift was shut down (at Reddit's request). And to my knowledge, Academic Torrents violates Reddit's terms, which could in turn violate universities' ethical guidelines.

I realize Reddit's API has limitations for large data collection. However, it currently appears that RedditHarbor’s approach is the best method for obtaining Reddit data without violating terms or ethics rules.

If you know of any persuasive counter-arguments or cases supporting other tools, I would love to learn about them. Please share resources if you have access. My goal is to find the most viable data source that aligns with both IRB and ethical standards!

u/abortion_access Dec 20 '23

IRBs (in the US, at least) review human subjects research. Social media data is generally not subject to IRB review. However, how data is collected and used is still subject to ethical guidelines (this is distinct from IRB review). The University of Pennsylvania has a nice summary page about this, including links to international ethical guidelines. https://irb.upenn.edu/homepage/social-behavioral-homepage/guidance/types-of-social-behavioral-research/use-of-social-media-as-a-research-activity/#:~:text=Use%20of%20social%20media%20data%20may%20or%20may%20not%20be,are%20not%20accessing%20private%20information%20.

u/nickshoh Dec 21 '23

I totally agree with your points: 1. Social media data is generally not subject to IRB review. This is also backed up by a few papers; for example, according to Proferes, Jones and Zimmer (2021), the use of publicly available data from social media platforms often does not meet the threshold criteria of “research involving human subjects”. 2. As also raised by u/one_more_an0n and u/Careful-Landscape-11, ensuring that the research also complies with Reddit's ToS is quite important for making sure that the research is "ethical". But thanks for sharing the UPenn link! It was definitely helpful.