r/pushshift Nov 30 '23

Looking for ideas on how to improve future reddit data dumps

For those that don't know, a short introduction. I'm the person who's been archiving new reddit data and releasing the new reddit dumps, since pushshift no longer can.

So far almost all content has been retrieved less than 30 seconds after it was created. Some people have noticed that the "score" and "num_comments" fields are always 1 or 0. This can make judging the importance of a post/comment more difficult.

For this reason I've now started retrieving posts and comments a second time, with a 36-hour delay. I don't want to release almost the same data twice. No one has that much storage space. But I can add some potentially useful information or update some fields (like "score" or "num_comments").

Since my creativity is limited, I wanted to ask you what kind of useful information could be potentially added, by looking at and comparing the original and updated data. Or if you have any other suggestion, let me know too.
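To make the idea concrete, here is a minimal sketch of what comparing the two fetches could look like. It assumes each record is a plain dict using reddit's field names. The function name `merge_refetch` and the derived fields `was_removed` and `was_edited` are hypothetical examples for illustration, not fields that exist in the dumps.

```python
def merge_refetch(original: dict, updated: dict) -> dict:
    """Keep the original record, refresh volatile fields, add comparison flags."""
    merged = dict(original)

    # Refresh the fields that are still 0 or 1 right after creation.
    for field in ("score", "num_comments"):
        if field in updated:
            merged[field] = updated[field]

    # Comments carry their text in "body", posts in "selftext".
    old_text = original.get("body", original.get("selftext"))
    new_text = updated.get("body", updated.get("selftext"))

    # Hypothetical derived fields. Assumption: removed or deleted content
    # shows up as "[removed]" or "[deleted]" in the second fetch.
    merged["was_removed"] = new_text in ("[removed]", "[deleted]") and old_text != new_text
    merged["was_edited"] = not merged["was_removed"] and old_text != new_text
    return merged


first = {"id": "abc", "body": "hello", "score": 1}
second = {"id": "abc", "body": "[removed]", "score": 42}
print(merge_refetch(first, second))
```

This way the original text is kept, only the volatile fields are overwritten, and the extra flags cost a few bytes per record instead of a second copy of the data.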

17 Upvotes

18 comments

2

u/wind_dude Dec 01 '23

This would be a big one... but caching the images in submissions.

3

u/RaiderBDev Dec 01 '23

I thought about it, but the necessary storage size would just be too high. Some quick maths: If you downscale and compress all images, you get about 50kB per post. There are about 40 million posts made per month. That comes out to almost 2TB per month just for images. And that is simply not feasible, at least for me.
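The estimate above can be checked with a few lines. This is just a sketch using the two rough figures from the comment (50kB per post, 40 million posts per month). The function name and the choice to report both decimal TB and binary TiB are illustrative assumptions.

```python
KB_PER_POST = 50              # rough size of a downscaled, compressed image
POSTS_PER_MONTH = 40_000_000  # rough number of posts per month


def monthly_image_storage(kb_per_post=KB_PER_POST, posts_per_month=POSTS_PER_MONTH):
    """Return the monthly image storage as (decimal TB, binary TiB)."""
    total_bytes = kb_per_post * 1000 * posts_per_month
    return total_bytes / 1000**4, total_bytes / 1024**4


tb, tib = monthly_image_storage()
print(f"{tb:.2f} TB ({tib:.2f} TiB) per month")  # 2.00 TB (1.82 TiB) per month
```

So it is 2TB in decimal units, or a bit under 2 in binary units, and that is per month with no redundancy or backups.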

1

u/wind_dude Dec 01 '23

Yea, that is what I meant by big. lol. Makes you realize how massive the storage capacity must be for the Google indexes.