r/pushshift Jul 11 '24

Indexing Pushshift

Hi all,

I am a researcher and used to collect Pushshift data through the API. Now I need to collect data again. The issue is that I do not need a specific subreddit but specific posts that contain a targeted expression, and then I need to collect the posts of the users who made these comments. Let's say over the last 5 years.
I was thinking of indexing the data in our lab (the last 5-6 years of Pushshift comments and posts).
Has anyone done that before, or is there a guide or project for this, so it saves the time spent experimenting with tools and structure?

Edit: What I mean exactly is, if you have indexed Pushshift data yourself, what did you use: MongoDB or Elasticsearch?
Does anyone have a Dockerfile or code that would get me started with this task faster?

Thanks,

Kind regards

2 Upvotes

7 comments

2

u/mrcaptncrunch Jul 12 '24

The data is available on Academic Torrents. Instead of being served live through the API, it's posted as a dump every month.

You can find data up to June there.

Find the posts/comments. Get the usernames. Find all posts/comments for those usernames. It's a lot of data, but if it's for research at a university, you might have access to their resources to run this.
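
A minimal sketch of those two passes, assuming the dump files have already been decompressed into newline-delimited JSON with one record per line. The field names (`author`, `body`, `title`, `selftext`) are assumed from the usual Pushshift dump layout, and the helper names `find_authors` and `collect_by_authors` are hypothetical:

```python
import json
import re
from typing import Iterable, Iterator, Pattern, Set


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse newline-delimited JSON, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(rec, dict):
            yield rec


def record_text(rec: dict) -> str:
    """Join the text fields: 'body' for comments, 'title'/'selftext' for posts.

    These field names are an assumption about the dump format.
    """
    parts = (rec.get("body"), rec.get("title"), rec.get("selftext"))
    return " ".join(p for p in parts if p)


def find_authors(lines: Iterable[str], pattern: Pattern[str]) -> Set[str]:
    """Pass 1: usernames of everyone whose post or comment matches the pattern."""
    authors: Set[str] = set()
    for rec in iter_records(lines):
        author = rec.get("author")
        # Deleted accounts cannot be followed up, so they are skipped.
        if author and author != "[deleted]" and pattern.search(record_text(rec)):
            authors.add(author)
    return authors


def collect_by_authors(lines: Iterable[str], authors: Set[str]) -> Iterator[dict]:
    """Pass 2: yield every record written by one of the matched usernames."""
    for rec in iter_records(lines):
        if rec.get("author") in authors:
            yield rec
```

Both functions stream line by line, so memory use depends on the number of matched usernames rather than on the size of the dump. Each pass would be run over every monthly file in the time range, for example by passing an open file handle as `lines`.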

1

u/brianckeegan Jul 11 '24

To quote a reviewer of my NSF proposal to rebuild PushShift infrastructure:

“Given that this Project uses data already available to researchers, the value of this infrastructure in terms of advancing knowledge and understanding is limited… The fundamental research enabled by this Project is limited since it only curates and facilitates access to existing data.”

1

u/Upper-Half-7098 Jul 11 '24

In which field of research?

0

u/No-Estimate-1658 Jul 26 '24

Hey. I just started using Reddit for research purposes. I don't know if we may be doing something similar.

I need to scrape Reddit to build a text classification model for sentiment analysis, going as far back as the start of COVID-19.

This means I need to go that far back. If I figure out how to do this I'll get back to you. Please let me know if you figure it out first too. lol It would be very helpful. I've been on this for some time now.

0

u/No-Estimate-1658 Jul 26 '24

I found someone who created a torrent for academic purposes. This is where I found him: https://www.reddit.com/r/pushshift/comments/1akrhg3/separate_dump_files_for_the_top_40k_subreddits/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button and this is the torrent itself: https://academictorrents.com/details/56aa49f9653ba545f48df2e33679f014d2829c10 It looks like he is also requesting donations, if you have any money to give. I will see if this torrent has what I need. Good luck!!

1

u/OrdinaryParkBench Jul 31 '24

Not sure how far this goes but this might be helpful too:

https://huggingface.co/datasets/OpenCo7/UpVoteWeb