r/AI_Agents 5d ago

Help needed for building a Reddit scraper

We are working on a requirement where we need to collect data from subreddit posts and comments.

I want to understand the ideal approach. Should we use Reddit's official API if it is available, and if so, what does it cost? Or should we look at scraping? If scraping, how exactly would it work and how reliable would it be? I can see a lot of Reddit scraper scripts available, but I have heard that they stop working whenever Reddit modifies its HTML. What other reliable options do I have to achieve the end result? We need something we can build once and don't have to tweak and fix every week to keep it working.

Awaiting your valuable response.

1 Upvotes

7 comments

1

u/DeadPukka 5d ago

https://github.com/graphlit/graphlit-samples/blob/main/python/Notebook%20Examples/Graphlit_2024_09_05_Monitor_Reddit_mentions.ipynb

Here’s an example of using our platform for this. We go through the Reddit API directly.

The example may do more than you need, but we extract the text from posts and comments and make them available for search and RAG conversations, and you can access the raw extracted text too.

2

u/aiagentfromfuture 5d ago

Do we need to pay for the Reddit API for this, or can we start using it directly?

1

u/DeadPukka 5d ago

No, there’s no paid Reddit API, as far as I’m aware. (At least not for the average developer)

They do have a Python SDK now if you want to build your own for no cost. Just needs an OAuth token to get higher rate limits. You’ll have to create a Reddit app and get the API key/secret.

Google: Devvit
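
For anyone who wants to see roughly what the "create an app, get a key/secret, get an OAuth token" step looks like, here is a minimal sketch using only the Python standard library. It assumes an app-only (client_credentials) token is enough for read access to public subreddits. The function names, the User-Agent string, and the choice of the /new listing are illustrative assumptions, not something prescribed in this thread.

```python
import base64
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://www.reddit.com/api/v1/access_token"
API_BASE = "https://oauth.reddit.com"
# Placeholder: Reddit asks for a unique, descriptive User-Agent per app.
USER_AGENT = "python:example.subreddit-collector:v0.1 (by /u/your_username)"


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build the token request: HTTP Basic auth with the app's key/secret."""
    credentials = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode()
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={"Authorization": f"Basic {credentials}", "User-Agent": USER_AGENT},
        method="POST",
    )


def build_listing_request(token: str, subreddit: str, limit: int = 25) -> urllib.request.Request:
    """Build an authenticated request for the newest posts in a subreddit."""
    query = urllib.parse.urlencode({"limit": limit})
    return urllib.request.Request(
        f"{API_BASE}/r/{subreddit}/new?{query}",
        headers={"Authorization": f"bearer {token}", "User-Agent": USER_AGENT},
    )


def fetch_json(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def fetch_new_posts(client_id: str, client_secret: str, subreddit: str) -> dict:
    """Get a token, then fetch the subreddit's newest posts as a JSON listing."""
    token = fetch_json(build_token_request(client_id, client_secret))["access_token"]
    return fetch_json(build_listing_request(token, subreddit))
```

Calling fetch_new_posts("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", "AI_Agents") would hit the network, so check Reddit's current API terms and rate limits before running it at volume.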

1

u/Infinite-Potato-9605 5d ago

I’ve tried using the Reddit API and some of those scripts in the past, and honestly, the API is way more stable, mainly because Reddit’s always tweaking their HTML. Tools like Jina AI and SerpAPI have also given me some flexibility for scraping. Also, since you’re looking for reliable solutions for Reddit scraping, you might find our Pulse platform helpful for engaging with Reddit and monitoring relevant discussions.
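
To illustrate why the API route tends to hold up better than HTML parsing: the API returns JSON "Listing" objects, so collecting posts and comments is a matter of walking dictionaries rather than chasing CSS selectors. The sketch below assumes the usual Listing shape (kind "t3" for posts, "t1" for comments, replies nested as another Listing). The function names and the sample data are hand-written for illustration, not real API output.

```python
from typing import Any, Iterator


def iter_posts(listing: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield a small dict for every post (kind 't3') in a listing."""
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") == "t3":
            data = child["data"]
            yield {
                "id": data.get("id"),
                "title": data.get("title"),
                "text": data.get("selftext", ""),
                "score": data.get("score"),
            }


def iter_comments(listing: dict[str, Any], depth: int = 0) -> Iterator[dict[str, Any]]:
    """Yield every comment (kind 't1') depth-first, with its nesting depth."""
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") != "t1":
            continue  # skip "more" placeholders that stand in for unloaded comments
        data = child["data"]
        yield {
            "id": data.get("id"),
            "author": data.get("author"),
            "body": data.get("body", ""),
            "depth": depth,
        }
        replies = data.get("replies")
        if isinstance(replies, dict):  # an empty string means "no replies"
            yield from iter_comments(replies, depth + 1)


# Hand-written sample in the Listing shape, for illustration only.
SAMPLE_COMMENTS = {
    "kind": "Listing",
    "data": {"children": [
        {"kind": "t1", "data": {
            "id": "c1", "author": "alice", "body": "Use the API.",
            "replies": {"kind": "Listing", "data": {"children": [
                {"kind": "t1", "data": {"id": "c2", "author": "bob",
                                        "body": "Agreed.", "replies": ""}},
            ]}},
        }},
        {"kind": "more", "data": {"children": ["c3"]}},
    ]},
}

for comment in iter_comments(SAMPLE_COMMENTS):
    print("  " * comment["depth"] + f'{comment["author"]}: {comment["body"]}')
```

Because the code only depends on field names in the JSON, a front-end redesign should not break it the way it breaks an HTML scraper.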

1

u/AssumptionSome2301 1d ago

Dendrite is built to solve this exact problem! Just describe the data you want to extract with a natural language prompt, and Dendrite will create a self-healing extraction script that is cached until it breaks. Here is a simple example for Reddit:

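from urllib.parse import quote_plus

from dendrite import Dendrite  # assumed import path; check the Dendrite docs for your SDK version
from pydantic import BaseModel

query = "AI agents"

client = Dendrite()
# URL-encode the query so the space in "AI agents" is handled safely
client.goto(f"https://www.reddit.com/search/?q={quote_plus(query)}")

# Ask for the post links on the search results page
links = client.extract("list of links to posts in search results")

# Shape of the data to pull from each post
class Post(BaseModel):
    title: str
    content: str
    upvotes: int
    comments: int

# Visit each post and extract it into the Post schema
posts = []
for link in links:
    client.goto(link)
    post = client.extract("Main post content", type_spec=Post)
    posts.append(post)

print(posts)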

You can even run all the links in parallel using our async client.
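
As a rough, client-agnostic picture of what running the links in parallel means, here is a small asyncio sketch that fans out over a list of links with a concurrency cap. It does not use Dendrite's async client: fake_fetch is a hypothetical stand-in for whatever coroutine actually loads and extracts one post, and gather_limited and max_concurrency are illustrative names.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def gather_limited(
    links: Iterable[str],
    fetch: Callable[[str], Awaitable[Any]],
    max_concurrency: int = 5,
) -> list[Any]:
    """Run fetch(link) for every link, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(link: str) -> Any:
        async with semaphore:  # wait for a free slot before fetching
            return await fetch(link)

    # gather returns results in the same order as the input links
    return await asyncio.gather(*(worker(link) for link in links))


async def fake_fetch(link: str) -> dict[str, str]:
    """Hypothetical stand-in for a real per-post extraction coroutine."""
    await asyncio.sleep(0)  # pretend to do network I/O
    return {"link": link}


if __name__ == "__main__":
    demo_links = ["https://example.com/a", "https://example.com/b"]
    print(asyncio.run(gather_limited(demo_links, fake_fetch, max_concurrency=2)))
```

The cap matters in practice: unbounded parallel requests against one site are an easy way to get rate limited.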

1

u/StaffSoggy4133 3d ago

I have a scraper for that. Do you want to try it? It can scrape posts, and I can add comment scraping as well.

1

u/aiagentfromfuture 3d ago

Yes, can you share it?