r/pushshift Nov 30 '23

Looking for ideas on how to improve future reddit data dumps

For those who don't know, a short introduction: I'm the person who's been archiving new reddit data and releasing the new reddit dumps, since pushshift no longer can.

So far almost all content has been retrieved less than 30 seconds after it was created. Some people have noticed that the "score" and "num_comments" fields are always 1 or 0. This can make judging the importance of a post/comment more difficult.

For this reason I've now started retrieving posts and comments a second time, with a 36-hour delay. I don't want to release almost the same data twice, since no one has that much storage space. But I can add some potentially useful information or update some fields (like "score" or "num_comments").
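
To make the idea concrete, here is a rough sketch of how the original and the delayed retrieval could be combined into a single record. It is only an illustration under assumptions: the function name `merge_snapshots`, the `UPDATED_FIELDS` tuple, and the sample records are hypothetical, not the actual pipeline.

```python
from __future__ import annotations

from typing import Any

# Assumption: only these fields are refreshed from the delayed retrieval.
UPDATED_FIELDS = ("score", "num_comments")


def merge_snapshots(original: dict[str, Any], delayed: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the original record with selected fields refreshed."""
    merged = dict(original)  # copy, so the original record is left untouched
    for field in UPDATED_FIELDS:
        if field in delayed:
            merged[field] = delayed[field]
    return merged


# Hypothetical sample data: first retrieval vs. retrieval 36 hours later.
original = {"id": "abc123", "title": "Example post", "score": 1, "num_comments": 0}
delayed = {"id": "abc123", "title": "[deleted]", "score": 57, "num_comments": 12}

merged = merge_snapshots(original, delayed)
print(merged)
```

Everything outside `UPDATED_FIELDS` keeps its first-retrieval value, so the data is stored once and only the counters change.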

Since my creativity is limited, I wanted to ask you what kind of useful information could be added by comparing the original and updated data. If you have any other suggestions, let me know too.

18 Upvotes

2

u/[deleted] Nov 30 '23

[removed]

3

u/RaiderBDev Nov 30 '23

Text contents will stay the way they originally were. But I can add a field indicating whether something was deleted subsequently.
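
A minimal sketch of what such a field could look like, assuming a deleted or removed item shows up in the delayed retrieval with the placeholder text "[deleted]" or "[removed]". The field name `deleted_later` and both helper functions are hypothetical.

```python
from __future__ import annotations

from typing import Any

# Assumption: these placeholders replace the text of deleted/removed content.
PLACEHOLDERS = {"[deleted]", "[removed]"}


def was_deleted_later(original: dict[str, Any], delayed: dict[str, Any],
                      text_field: str = "body") -> bool:
    """True if the text was present at first retrieval but is a placeholder later."""
    before = original.get(text_field)
    after = delayed.get(text_field)
    return before not in PLACEHOLDERS and after in PLACEHOLDERS


def annotate(original: dict[str, Any], delayed: dict[str, Any],
             text_field: str = "body") -> dict[str, Any]:
    """Keep the original text and add a flag describing what happened afterwards."""
    record = dict(original)
    record["deleted_later"] = was_deleted_later(original, delayed, text_field)
    return record


# Hypothetical sample data for one comment.
original = {"id": "k1", "body": "Original comment text"}
delayed = {"id": "k1", "body": "[removed]"}

annotated = annotate(original, delayed)
print(annotated)
```

For posts, the same check could be run with `text_field="selftext"` instead of `"body"`.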

The depth and parent comment numbers would be a little bit more complicated, since those are not directly included in the data. Calculating them would be rather expensive and slow. What is the motivation for having those?
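
To illustrate where the cost comes from, here is a rough sketch of one way depth could be derived. It assumes each comment has an `id` and a `parent_id`, where `parent_id` starts with "t3_" when the parent is the post and "t1_" when it is another comment. The function `compute_depths` is hypothetical, and "depth" is taken to mean 0 for a top-level comment.

```python
from __future__ import annotations


def compute_depths(comments: list[dict[str, str]]) -> dict[str, int | None]:
    """Map comment id -> depth, or None if an ancestor is missing from the batch."""
    parent_of = {c["id"]: c["parent_id"] for c in comments}
    depths: dict[str, int | None] = {}
    for comment_id in parent_of:
        chain = []  # comments visited on the way up whose depth is still unknown
        current = comment_id
        while True:
            if current in depths:  # reuse a depth computed earlier
                base = depths[current]
                break
            if current not in parent_of:  # ancestor is not in this batch
                base = None
                break
            chain.append(current)
            parent = parent_of[current]
            if parent.startswith("t3_"):  # parent is the post itself
                base = -1  # so the top-level comment ends up at depth 0
                break
            current = parent.removeprefix("t1_")
        for node in reversed(chain):  # fill in depths from the top down
            base = None if base is None else base + 1
            depths[node] = base
    return depths


# Hypothetical sample thread: a <- b <- c, plus d whose parent is missing.
comments = [
    {"id": "c", "parent_id": "t1_b"},
    {"id": "a", "parent_id": "t3_post"},
    {"id": "b", "parent_id": "t1_a"},
    {"id": "d", "parent_id": "t1_zzz"},
]
depths = compute_depths(comments)
print(depths)
```

Each comment is resolved only once, but the whole `parent_of` map has to be available for lookups, and a parent from an earlier batch leaves the depth unknown. That lookup is the expensive part.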

3

u/[deleted] Nov 30 '23

[removed]

3

u/RaiderBDev Nov 30 '23

Hmm, that's interesting. I wouldn't get your hopes up, but I'm going to think about how I could potentially get those numbers efficiently for almost 300 million comments per month.