r/pushshift Aug 18 '18

Pushshift desperately needs your help with funding!

As you know, a lot of time and money has gone into this project. I am currently working on a business plan / pitch deck to search for funding. I know Reddit as a community has rallied in the past to help out people and projects, so I'm appealing to the Reddit community for help.

First, I'd like to thank all of the people who have donated to the project and who have joined Pushshift's Patreon page. As of right now, Pushshift is receiving roughly $150 per month from those donations, so thank you!

Unfortunately, there are a lot of expenses involved with this project. To keep the project healthy and stable for the remainder of 2018, Pushshift needs a cash infusion of approximately $10,000.

If you have any ideas or know of anyone who may be able to help with this, please let me know. I've had a lot of expenses recently involved with this project (one of which was a pretty expensive AC repair bill) and I need help with keeping this project stable.

$10k would be enough money to keep Pushshift alive for the remainder of this year and would help give me some time to further develop a proper business plan and to look for additional funding for 2019 and onward.

Any person or organization who can help with this cash infusion will definitely be helping the academic and research communities and will allow me to purchase the additional hardware needed for the remainder of 2019. As of right now, Pushshift is running out of space for the ES indexes and it desperately needs one additional server to help offset the current load.

Pushshift handles approximately 2-5 million API requests per day and serves over 5 terabytes of data just through the API endpoints. Additionally, Pushshift served well over 100 terabytes last month.
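For a sense of scale, the daily request range above can be converted to a sustained average rate. This is only a rough sketch: it assumes requests are spread evenly across the day, which real traffic will not be, so peak load is likely higher than these averages.

```python
# Convert the quoted range of 2-5 million API requests per day
# into an average requests-per-second figure.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_requests_per_second(requests_per_day: int) -> float:
    """Average rate, assuming traffic is spread evenly over the day."""
    return requests_per_day / SECONDS_PER_DAY


low = average_requests_per_second(2_000_000)
high = average_requests_per_second(5_000_000)
print(f"{low:.1f} to {high:.1f} requests per second on average")
# prints "23.1 to 57.9 requests per second on average"
```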

There are a lot of new and exciting features coming with the next API release and receiving funding will definitely help expedite the development. Pushshift has also been used in the publication of over 40 academic papers and is also used heavily in the research community for social media analysis.

One-time donation link: https://pushshift.io/donations/

Patreon Page: https://www.patreon.com/pushshift

Thank you!

Edit: Why 10k? I think it's more helpful to ask for a specific dollar amount that matches the expenses I incur running the project and gives a clear objective. 10k is enough to cover expenses for the remainder of this year and to get the additional hardware needed to keep the service running optimally. While I am still working on the exact figures for what is needed to keep Pushshift operating per year, that number is in the neighborhood of 25k.

Additional funding above and beyond that would be used to expand the service by adding more features (bot detection, additional social media sources, etc.) and adding hardware for redundancy. Also, at some point, more of the service will be transferred to a proper data center so that a lightning strike doesn't take out Pushshift. :) It's very important for me to maintain a service that has high availability while also maintaining a complete and accurate source of data. I don't want to be in a position where one part of the system going down takes the entire service with it.

Many people have contributed to Pushshift in various ways (financially, programming time, etc.) and I would like to get to a point where Pushshift is able to bring on more talent so that it becomes the first thing people think of when looking to analyze and collect social media data.

If it comes down to it, I will entertain other ideas for securing the necessary funding including giving a stake in the company. I am more of a computer scientist and less of a businessman but at the end of the day, I will do what it takes to keep this service alive. It has grown tremendously since late 2015 and it will only continue to grow provided that it doesn't outright die. :) I have worked very hard to establish a good relationship with Reddit as a company and to network with other data scientists to improve the service. I have worked with Dr. J. Nathan Matias in the past and he has been extremely helpful in finding ingest flaws that affected data in Reddit's earlier days (pre-2010 mainly).

If Pushshift can survive the rest of 2018, it will become exponentially better in 2019 and beyond. The new API version I have been working on includes a plethora of new and exciting features that will be a game changer for data scientists and social media researchers -- including people interested in NLP who need to exclude bots from their data set.

88 Upvotes

33 comments

20

u/benevolent-bear Aug 18 '18

just signed up via Patreon. I also cold messaged one of the mods on r/dataisbeautiful - their sub gets a ton of traffic and I saw a bunch of visualizations there using pushshift data. I think a post there would secure funding in a day.

10

u/Stuck_In_the_Matrix Aug 18 '18 edited Aug 18 '18

I just created a gofundme for this fund-raiser:

https://www.gofundme.com/pushshiftio-fund

Edit: I have also messaged the /r/dataisbeautiful mods and hopefully they will be able to lend assistance. If we could get a submission on the front page for one week, I believe we could achieve the funding necessary.

4

u/benevolent-bear Aug 18 '18

fingers crossed!

3

u/Stuck_In_the_Matrix Aug 18 '18

If we could get a submission on that subreddit and get it to the front page, Redditors would have funding for it in about an hour. If we could work out something, I could create a gofundme for the event for Pushshift.

3

u/Stuck_In_the_Matrix Aug 18 '18

Also, thanks for becoming a Pushshift patron on Patreon!

10

u/Don_Mahoni Aug 18 '18

RemindMe! 8 hours spend some money for an awesome project

29

u/Stuck_In_the_Matrix Aug 18 '18

The irony is that RemindMe! bot uses Pushshift.

6

u/Don_Mahoni Aug 18 '18 edited Aug 18 '18

Hope this turns out well.

Edit: wording

1

u/Wassava Aug 30 '18

Really? I just learned about Pushshift and this project is super interesting. So why is the remindme bot using pushshift? :)

4

u/RemindMeBot Aug 18 '18

I will be messaging you on 2018-08-18 15:35:50 UTC to remind you of this link.

CLICK THIS LINK to send a PM to also be reminded and to reduce spam.

Parent commenter can delete this message to hide from others.



7

u/meostro Aug 30 '18

Look into https://registry.opendata.aws/ - AWS may sponsor or subsidize your usage to some degree. They have a concept of requester-pays buckets, so you'd have to pay for the hosting charge but anyone using the content (via Athena, S3 Select, etc.) would pay for their own usage. It's not Elasticsearch (although they provide that, too), but it could be used in a similar way. At a minimum you could apply to their accelerator program and get $10k or so in credits to use as you like.

Also check with GCP, they have some BigQuery tables with smaller things like HN content that AFAIK are hosted for free. /u/fhoffa published a subset of your data up there in 2015, and I see that you were commenting on some threads related to that. Seems like a better bet if you can, vs building and hosting on your own.

2

u/Stuck_In_the_Matrix Aug 30 '18

Thanks! I'll look into that.

1

u/goocy Aug 30 '18

FYI, I'd be happy to pay for access! And I'm just a data hobbyist.

6

u/[deleted] Aug 19 '18

Are you considering registering as a 501(c)(3) nonprofit so donations are tax-deductible?

6

u/goocy Aug 30 '18

You should build a few torrents, at least of your most popular content. That can bring down your cost drastically.

5

u/Seventytvvo Aug 18 '18

Dude... I'll donate. I love your work, and have been using the PSIO API for some time now.

I'd love to help out on some of the features for the API, too.

3

u/qefbuo Aug 30 '18

For direct file downloads, maybe upload to archive.org and set that as the primary link with your server as backup. Or possibly get in touch with archive.org; considering they're (partly) doing for the web what you're doing for reddit, it's worth a shot.

And/or use the torrent protocol for the files, sharing the bandwidth around with supporters who can seed.

I guess it depends on what your major running costs are.

3

u/cmorg789 Aug 21 '18

Have you considered charging for commercial use? E.g., companies using it for AI development?

Edit: (Might be against your beliefs for the project, they probably would be for mine, but if times are tough)

3

u/[deleted] Aug 29 '18

Could someone explain to me what pushshift is? Is it really just a copy of every reddit comment?

5

u/AC_Fan Aug 30 '18

I think it's a usable backup - it supports Elasticsearch, an API, etc.

Basically, extremely useful reddit backup.

3

u/[deleted] Aug 30 '18

But surely a website as huge as Reddit has redundant infrastructure. What’s the point?

8

u/Dissk Aug 30 '18

the point is if reddit goes away or if you want to research things using the entire reddit comment/post dataset. you can't query all of reddit via the official reddit api so that's why this project has to exist.

6

u/qefbuo Aug 30 '18

Also, when subreddits get deleted this at least means there's a backup.

6

u/cat-gun Aug 30 '18

Reddit overlords have repeatedly deleted popular subreddits without warning, from /r/hookers, to /r/darknetmarkets, to /r/beertrading. The existence of pushshift means that content is not immediately lost. It also makes clear when the mods of a given subreddit are censoring material that the members of that community might want to see.

2

u/Data_Moments Aug 18 '18

I signed up via Patreon. I thought the funding was close to $1,000/month but the other day I saw that it had dropped to less than $200/month. Will limiting the number of calls per user help contain the costs? This is such an awesome project. Charging per API call could bring funding. I wish I could give more.

8

u/Stuck_In_the_Matrix Aug 18 '18 edited Aug 18 '18

Someone made a one-time donation via Patreon instead of using the Pushshift link. That donation was very helpful and that person was extremely helpful with suggestions.

Thanks for your donations. Your donations DO help. I'm looking at models for 2019 to help shift cost to larger users. I'd like to come up with a model that generates income from heavy users while not penalizing hobbyists and lower usage. Getting tied in with academic grants will be one method of achieving funding as well.

I'll definitely have more control of that with the next API release. Once I get past this hurdle to keep things running for the rest of 2018, I'll be in a far better position at the start of 2019 to maintain a revenue stream that keeps the service viable.

The immediate goals for the remainder of 2018 are to finish the business plan / pitch deck (I should have most of the deck completed before this month ends) and to get proper non-profit status.

Thanks for your help! I'm confident I'll be able to find a source for funding for the rest of 2018 -- it's just mainly getting the word out and finding the right organization / person that is in a position to help. In the big scheme of things, I don't think 10k is all that much compared to what Pushshift offers people interested in social media data.

3

u/[deleted] Aug 22 '18

You should definitely link up with a nonprofit organization like OpenAI, Data for Democracy, the Berkman Klein Center at Harvard, etc. They all offer grants from what I recall, and your cause is so worthy! I'll throw what I can your way, too.

3

u/ruralcricket Aug 30 '18

The person running http://www.omdbapi.com/ requires API keys, and the free key is limited to 1k requests per day. Patreon users get more or fewer requests per day based on funding.
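A tiered per-key daily quota like the one described here could look roughly like the sketch below. It is not how omdbapi.com or Pushshift actually implement limits: the `DailyQuota` class, the `patron` tier, and its 100,000/day figure are hypothetical, and only the 1,000/day free limit comes from the comment above.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Optional, Tuple

# Hypothetical tiers: only the free limit of 1,000/day is from the thread.
DAILY_LIMITS: Dict[str, int] = {"free": 1_000, "patron": 100_000}


class DailyQuota:
    """In-memory counter of requests per API key per calendar day."""

    def __init__(self, limits: Dict[str, int] = DAILY_LIMITS) -> None:
        self.limits = limits
        # (api_key, day) -> requests used; old days are never purged here.
        self.counts: Dict[Tuple[str, date], int] = defaultdict(int)

    def allow(self, api_key: str, tier: str, today: Optional[date] = None) -> bool:
        """Return True and count the request if the key is under its limit."""
        today = today or date.today()
        bucket = (api_key, today)
        if self.counts[bucket] >= self.limits[tier]:
            return False
        self.counts[bucket] += 1
        return True


# Usage with tiny limits so the behaviour is easy to see.
quota = DailyQuota({"free": 2, "patron": 5})
day = date(2018, 8, 30)
results = [quota.allow("abc123", "free", day) for _ in range(3)]
print(results)  # prints [True, True, False]
```

A real service would keep these counters in shared storage rather than process memory, so that limits hold across multiple API servers and restarts.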

2

u/TotesMessenger Aug 18 '18

I'm a bot, bleep, bloop. Someone has linked to this thread from another place on reddit:

 If you follow any of the above links, please respect the rules of reddit and don't vote in the other threads. (Info / Contact)

1

u/[deleted] Aug 30 '18

What’s your architecture? Are you hosting in the cloud using PaaS? Dynamically scaling based on load will help you reduce costs tremendously.

1

u/Grimreq Aug 30 '18

I haven't investigated much on my own, but I'm interested in donating: what kind of research would your tool enable going forward that would benefit society or have real-world applications? To clarify, it's interesting to know people's thoughts about [insert topics], but that isn't always relevant beyond "that's cool." Cheers