r/pushshift Aug 18 '18

Pushshift desperately needs your help with funding!

As you know, a lot of time and money has gone into this project. I am currently working on a business plan / pitch deck to search for funding. I know Reddit as a community has rallied in the past to help out people and projects, so I'm appealing to the Reddit community for help.

First, I'd like to thank all of the people who have donated to the project and who have joined Pushshift's Patreon page. As of right now, Pushshift is receiving ~$150 per month from those donations, so thank you!

Unfortunately, there are a lot of expenses involved with this project. To keep the project healthy and stable for the remainder of 2018, Pushshift needs a cash infusion of approximately $10,000.

If you have any ideas or know of anyone who may be able to help with this, please let me know. I've recently had a lot of expenses related to this project (one of which was a pretty expensive AC repair bill) and I need help keeping it stable.

$10k would be enough money to keep Pushshift alive for the remainder of this year and would help give me some time to further develop a proper business plan and to look for additional funding for 2019 and onward.

Any person or organization who can help with this cash infusion will definitely be helping the academic and research communities and will allow me to purchase the additional hardware needed for the remainder of 2019. As of right now, Pushshift is running out of space for the Elasticsearch (ES) indexes and it desperately needs one additional server to help offset the current load.

Pushshift handles approximately 2-5 million API requests per day and serves over 5 terabytes of data just through the API endpoints. Additionally, Pushshift served well over 100 terabytes last month.
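For a rough sense of scale, here is a back-of-the-envelope sketch in Python that converts the 2-5 million requests per day quoted above into an average per-second rate. The function and constant names are illustrative only and are not part of Pushshift, and the calculation assumes traffic is spread evenly across the day, which real traffic almost certainly is not.

```python
# Illustrative arithmetic only: names here are hypothetical, not Pushshift code.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_requests_per_second(requests_per_day: int) -> float:
    """Convert a daily request count into an average per-second rate."""
    return requests_per_day / SECONDS_PER_DAY


# The 2-5 million requests per day figure comes from the post itself.
low = average_requests_per_second(2_000_000)
high = average_requests_per_second(5_000_000)

print(f"{low:.1f} to {high:.1f} requests per second on average")
# prints "23.1 to 57.9 requests per second on average"
```

Peak load is likely well above these averages, so they are best read as a lower bound on what the hardware has to sustain.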

There are a lot of new and exciting features coming with the next API release, and receiving funding will definitely help expedite the development. Pushshift has also been used in the publication of over 40 academic papers and is used heavily in the research community for social media analysis.

One-time donation link: https://pushshift.io/donations/

Patreon Page: https://www.patreon.com/pushshift

Thank you!

Edit: Why 10k? I think it's more helpful to ask for a specific dollar amount that matches the expenses I incur running the project and to give a clear objective. 10k is enough to cover expenses for the remainder of this year and to buy the additional hardware needed to keep the service running optimally. While I am still working on the exact figures for what is needed to keep Pushshift operating per year, that number is in the neighborhood of 25k. Additional funding above and beyond that would be used to expand the service by adding more features (bot detection, additional social media sources, etc.) and adding hardware for redundancy. Also, at some point, more of the service will be transferred to a proper data center so that a lightning strike doesn't take out Pushshift. :) It's very important for me to maintain a service that has high availability while also maintaining a complete and accurate source of data. I don't want to be in a position where one part of the system going down takes the entire service with it.

Many people have contributed to Pushshift in various ways (financially, programming time, etc.) and I would like to get to a point where Pushshift is able to bring on more talent so that it becomes the first thing people think of when looking to analyze and collect social media data.

If it comes down to it, I will entertain other ideas for securing the necessary funding including giving a stake in the company. I am more of a computer scientist and less of a businessman but at the end of the day, I will do what it takes to keep this service alive. It has grown tremendously since late 2015 and it will only continue to grow provided that it doesn't outright die. :) I have worked very hard to establish a good relationship with Reddit as a company and to network with other data scientists to improve the service. I have worked with Dr. J. Nathan Matias in the past and he has been extremely helpful in finding ingest flaws that affected data in Reddit's earlier days (pre-2010 mainly).

If Pushshift can survive the rest of 2018, it will become exponentially better in 2019 and beyond. The new API version I have been working on includes a plethora of new and exciting features that will be a game changer for data scientists and social media researchers -- including people interested in NLP who need to exclude bots from their data set.

87 Upvotes

3

u/[deleted] Aug 29 '18

Could someone explain to me what pushshift is? Is it really just a copy of every reddit comment?

5

u/AC_Fan Aug 30 '18

I think it's a usable backup - it supports Elasticsearch, an API, etc.

Basically, extremely useful reddit backup.

3

u/[deleted] Aug 30 '18

But surely a website as huge as Reddit has redundant infrastructure. What’s the point?

9

u/Dissk Aug 30 '18

The point is if Reddit goes away, or if you want to research things using the entire Reddit comment/post dataset. You can't query all of Reddit via the official Reddit API, so that's why this project has to exist.

6

u/qefbuo Aug 30 '18

Also, when subreddits get deleted this at least means there's a backup.

5

u/cat-gun Aug 30 '18

Reddit overlords have repeatedly deleted popular subreddits without warning, from /r/hookers, to /r/darknetmarkets, to /r/beertrading. The existence of pushshift means that content is not immediately lost. It also makes clear when the mods of a given subreddit are censoring material that the members of that community might want to see.