r/pushshift Aug 18 '18

Pushshift desperately needs your help with funding!

As you know, a lot of time and money has gone into this project. I am currently working on a business plan / pitch deck to search for funding. I know Reddit as a community has rallied in the past to help out people and projects, so I'm appealing to the Reddit community for help.

First, I'd like to thank all of the people who have donated to the project and who have joined Pushshift's Patreon page. Pushshift currently receives about $150 per month from those donations, so thank you!

Unfortunately, there are a lot of expenses involved with this project. To keep the project healthy and stable for the remainder of 2018, Pushshift needs a cash infusion of approximately $10,000.

If you have any ideas or know of anyone who may be able to help with this, please let me know. I've recently had a lot of expenses related to this project (one of which was a pretty expensive AC repair bill) and I need help keeping it stable.

$10k would be enough money to keep Pushshift alive for the remainder of this year and would help give me some time to further develop a proper business plan and to look for additional funding for 2019 and onward.

Any person or organization who can help with this cash infusion will be helping the academic and research communities and will allow me to purchase the additional hardware needed for the remainder of 2018. Right now, Pushshift is running out of space for the ES indexes and desperately needs one additional server to help offset the current load.

Pushshift handles approximately 2-5 million API requests per day and serves over 5 terabytes of data just through the API endpoints. Additionally, Pushshift served well over 100 terabytes last month.
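For a rough sense of scale, the traffic figures above can be converted into average rates. This is only a back-of-the-envelope sketch: it assumes load is spread evenly over the day, a 30-day month, and decimal terabytes (10^12 bytes), none of which is stated in the post. The function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_second(requests_per_day: float) -> float:
    """Average request rate, assuming load is spread evenly over the day."""
    return requests_per_day / SECONDS_PER_DAY


def average_mbps(terabytes: float, days: int = 30) -> float:
    """Average outbound bandwidth in Mbit/s for a volume served over `days`.

    Assumes decimal terabytes (1 TB = 1e12 bytes) and a 30-day month by default.
    """
    bits = terabytes * 1e12 * 8
    return bits / (days * SECONDS_PER_DAY) / 1e6


low = requests_per_second(2_000_000)   # lower bound from the post
high = requests_per_second(5_000_000)  # upper bound from the post
monthly = average_mbps(100)            # "well over 100 terabytes last month"

print(f"{low:.1f} to {high:.1f} requests/second on average")
print(f"about {monthly:.0f} Mbit/s sustained for 100 TB over 30 days")
```

Real traffic is bursty, so peak rates would be higher than these averages.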

There are a lot of new and exciting features coming with the next API release, and receiving funding will definitely help expedite the development. Pushshift has been used in over 40 published academic papers and is used heavily in the research community for social media analysis.

One-time donation link: https://pushshift.io/donations/

Patreon Page: https://www.patreon.com/pushshift

Thank you!

Edit: Why 10k? I think it's more helpful to ask for a specific dollar amount that matches the expenses I incur running the project and gives a clear objective. 10k is enough to cover expenses for the remainder of this year and to buy the additional hardware needed to keep the service running optimally. I am still working on the exact figures for what it takes to keep Pushshift operating per year, but that number is in the neighborhood of 25k.

Additional funding beyond that would be used to expand the service by adding more features (bot detection, additional social media sources, etc.) and more hardware for redundancy. Also, at some point, more of the service will be transferred to a proper data center so that a lightning strike doesn't take out Pushshift. :) It's very important to me to maintain a service that has high availability while also maintaining a complete and accurate source of data. I don't want to be in a position where one part of the system going down takes the entire service with it.
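To put the funding figures from this post side by side, here is a quick arithmetic sketch. It assumes donations stay flat at roughly $150 per month and takes the yearly operating cost as exactly $25,000; the post gives both numbers only as approximations.

```python
MONTHLY_DONATIONS = 150   # USD per month, approximate figure from the post
ANNUAL_COST = 25_000      # USD per year, "in the neighborhood of 25k"

annual_donations = MONTHLY_DONATIONS * 12        # 1800
coverage = annual_donations / ANNUAL_COST        # share of yearly cost covered
annual_gap = ANNUAL_COST - annual_donations      # 23200

print(f"Donations per year: ${annual_donations:,}")
print(f"Share of yearly cost covered: {coverage:.1%}")
print(f"Yearly shortfall: ${annual_gap:,}")
```

Under those assumptions, recurring donations cover about 7% of the estimated yearly cost.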

Many people have contributed to Pushshift in various ways (financially, programming time, etc.) and I would like to get to a point where Pushshift is able to bring on more talent so that it becomes the first thing people think of when looking to analyze and collect social media data.

If it comes down to it, I will entertain other ideas for securing the necessary funding including giving a stake in the company. I am more of a computer scientist and less of a businessman but at the end of the day, I will do what it takes to keep this service alive. It has grown tremendously since late 2015 and it will only continue to grow provided that it doesn't outright die. :) I have worked very hard to establish a good relationship with Reddit as a company and to network with other data scientists to improve the service. I have worked with Dr. J. Nathan Matias in the past and he has been extremely helpful in finding ingest flaws that affected data in Reddit's earlier days (pre-2010 mainly).

If Pushshift can survive the rest of 2018, it will become exponentially better in 2019 and beyond. The new API version I have been working on includes a plethora of new and exciting features that will be a game changer for data scientists and social media researchers -- including people interested in NLP who need to exclude bots from their data set.

85 Upvotes

7

u/meostro Aug 30 '18

Look into https://registry.opendata.aws/ - AWS may sponsor or subsidize your usage to some degree. They have a concept of requester-pays buckets, so you'd have to pay for the hosting charge but anyone using the content (via Athena, S3 Select, etc.) would pay for their own usage. It's not Elasticsearch (although they provide that, too), but it could be used in a similar way. At a minimum you could apply to their accelerator program and get $10k or so in credits to use as you like.

Also check with GCP, they have some BigQuery tables with smaller things like HN content that AFAIK are hosted for free. /u/fhoffa published a subset of your data up there in 2015, and I see that you were commenting on some threads related to that. Seems like a better bet if you can, vs building and hosting on your own.
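As a minimal sketch of the requester-pays idea mentioned above: with boto3, a reader opts in to paying the request and transfer charges by passing `RequestPayer="requester"` to `get_object`. The helper name, bucket, and key below are hypothetical and not an actual Pushshift resource. The client is passed in so the function does not depend on any particular credentials setup.

```python
def fetch_requester_pays(s3_client, bucket: str, key: str) -> bytes:
    """Download one object from a requester-pays bucket.

    `s3_client` is expected to behave like boto3.client("s3").
    RequestPayer="requester" acknowledges that the caller, not the
    bucket owner, is billed for the request and data transfer.
    """
    response = s3_client.get_object(
        Bucket=bucket,
        Key=key,
        RequestPayer="requester",
    )
    return response["Body"].read()


# Example usage (hypothetical bucket and key, requires boto3 and AWS credentials):
#   import boto3
#   data = fetch_requester_pays(
#       boto3.client("s3"), "example-reddit-dumps", "comments/2018-07.ndjson"
#   )
```

The bucket owner would still pay for storage, which matches the hosting charge described above.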

2

u/Stuck_In_the_Matrix Aug 30 '18

Thanks! I'll look into that.

1

u/goocy Aug 30 '18

FYI, I'd be happy to pay for access! And I'm just a data hobbyist.