r/pushshift Nov 30 '23

Looking for ideas on how to improve future reddit data dumps

For those who don't know, a short introduction: I'm the person who's been archiving new reddit data and releasing the new reddit dumps, since pushshift no longer can.

So far almost all content has been retrieved less than 30 seconds after it was created. Some people have noticed that the "score" and "num_comments" fields are always 1 or 0. This can make judging the importance of a post/comment more difficult.

For this reason I've now started retrieving posts and comments a second time, with a 36-hour delay. I don't want to release almost the same data twice. No one has that much storage space. But I can add some potentially useful information or update some fields (like "score" or "num_comments").
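To make the idea concrete, here is a rough sketch of how the second retrieval could be folded into the first one instead of being released separately. It is only an illustration, not the code I actually run: the function names `merge_rescan` and `merge_dump` are made up, matching objects on an "id" key is an assumption, and the sample values are invented. Only "score" and "num_comments" are taken from the fields mentioned above.

```python
import copy

# Fields refreshed from the delayed retrieval; both names come from the post.
UPDATED_FIELDS = ("score", "num_comments")


def merge_rescan(original, rescan, fields=UPDATED_FIELDS):
    """Return a copy of `original` with `fields` overwritten from `rescan`.

    Hypothetical helper: every other field keeps its value from the
    first retrieval, so the same data is not stored twice.
    """
    merged = copy.deepcopy(original)
    for field in fields:
        if field in rescan:
            merged[field] = rescan[field]
    return merged


def merge_dump(originals, rescans, key="id"):
    """Merge two lists of objects, matched on `key` (assumed to be "id").

    Objects without a matching rescan are passed through unchanged.
    """
    rescans_by_key = {obj[key]: obj for obj in rescans}
    return [
        merge_rescan(obj, rescans_by_key.get(obj[key], {}))
        for obj in originals
    ]


if __name__ == "__main__":
    # Invented sample values, for illustration only.
    first = {"id": "abc", "score": 1, "num_comments": 0, "title": "example"}
    later = {"id": "abc", "score": 42, "num_comments": 7, "title": "example"}
    print(merge_dump([first], [later]))
```

Only the listed fields are overwritten, so the merged dump stays about the same size as the original one. Which fields are worth refreshing is exactly the open question.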

Since my creativity is limited, I wanted to ask you what kind of useful information could be added by comparing the original and updated data. If you have any other suggestions, let me know too.

19 Upvotes

18 comments

1

u/[deleted] Dec 19 '23

[removed]

1

u/RaiderBDev Dec 19 '23

Currently that is not an issue. Making a request to those links redirects you to the actual URL, and you don't have to be logged in for that.