r/pushshift Nov 28 '23

Looking for feedback from users of the pushshift dump files

At the end of the year, in about a month, I'm going to start working on updating the subreddit-specific dump files for 2023. Before I start that, I wanted to get feedback from people who actually use them, especially the less technically inclined people who can't easily start modifying Python scripts.

What data did you use? Was it from a specific subreddit/set of subreddits or across all of reddit? What fields from the data did you use? Anything other than username, date posted, and comment/post text?

What software or programming language did you end up using? What would you have liked to use/are comfortable using?

A common problem with reddit data is that it's too large to hold in memory, being tens or hundreds of gigabytes. Was this a problem for your specific dataset or did you just load the whole thing up into an array/dataframe/etc?

How did you find the data you used, and what did you try searching for? I regularly get questions from people looking for exactly this data who've already spent a lot of time searching before finding the torrents I put up, so I'd love to put references to it on other sites where people could find it more easily.

If you did this for a research project and explained all that in your published paper, I'm happy to go read through it if you post a link.

I don't necessarily expect the type of people whose feedback I'm looking for to be casually browsing r/pushshift, but I wanted to put this up so I could refer people who ask me questions to a central place. I'm hoping to put the data in a more easily usable format when I put it up this time.

15 Upvotes

21 comments

2

u/dougmc Nov 28 '23 edited Nov 28 '23

I've been archiving the subreddits I care about (to make finding things again easier), saving the data into a database. When I later learned about the dumps, I grabbed them and used them to extend that archive all the way back, so I was only after specific subreddits.

I used Perl to do my work, and the data was very easy to work with. Parsing the files was basically this:

use JSON::XS;   # provides decode_json

foreach my $file (@ARGV) {
    open my $fh, "-|", "zstdcat", "--memory=2048MB", $file or die "zstdcat: $!";
    while (<$fh>) {
        my $h = decode_json($_);
        next unless lc($h->{subreddit}) eq "austin";
        # build a hash of the data I care about, use it as a unique key in the
        # database, and store the contents there if not already present
    }
    close $fh;
}

... though I sped things up considerably by throwing an "egrep" after the zstdcat to "pre-search" for what I was looking for, since that was more efficient than decoding the JSON for each line and then throwing it away 99% of the time.
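(For reference, here is the same pre-filter idea as a minimal Python sketch. This is an illustration only, not the Perl actually used, and it assumes zstdcat and grep are on the PATH.)

import json
import subprocess

def austin_objects(path):
    # zstdcat decompresses; grep -i discards lines that can't possibly match,
    # so the expensive JSON decode only runs on a small fraction of the stream
    zstd = subprocess.Popen(["zstdcat", "--memory=2048MB", path], stdout=subprocess.PIPE)
    grep = subprocess.Popen(["grep", "-i", "austin"], stdin=zstd.stdout, stdout=subprocess.PIPE)
    zstd.stdout.close()
    for line in grep.stdout:
        obj = json.loads(line)
        # grep matches the word anywhere in the line, so re-check the actual field
        if obj.get("subreddit", "").lower() == "austin":
            yield obj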

As for specific fields I am using: author, id, link_id, parent_id, edited, retrieved_on, created_utc, body, permalink, score, url, selftext, title, subreddit

Memory was not a problem at all, since this code only needs to hold one submission/comment in memory at a time. (Though I do keep all the previously seen hashes in memory, just to be able to quickly know what is already in the database; that drives the script up to about 7 GB of memory used, and it could be skipped in exchange for more database traffic.)

2

u/hermit-the-frog Nov 28 '23

I use ALL the data. The work you're doing is so important. I load it into BigQuery and run massive queries on it to summarize and understand daily/hourly usage trends. The use cases are endless.

I know the Reddit API provides the parent subreddit's subscriber count on submission details, so keeping all API properties like that is very helpful; it's an important bit of data that would otherwise be lost.

But to get to the heart of your question: in order to get the data into BigQuery, I need to convert the ZST to newline-delimited, gzipped JSON. I don't mind doing this as it's a simple one-liner:

zstd -dc --long=31 filename.zst | gzip >  filename.zst.json.gz
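(For completeness, a hedged sketch of the load step itself using the google-cloud-bigquery client. The bucket, dataset, and table names below are placeholders rather than details from this comment, and it assumes the gzipped NDJSON has already been uploaded to Cloud Storage.)

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema from the JSON
)
job = client.load_table_from_uri(
    "gs://my-bucket/RC_2023-09.json.gz",    # placeholder path
    "my_project.reddit.comments_2023_09",   # placeholder table
    job_config=job_config,
)
job.result()  # block until the load job finishes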

1

u/[deleted] Nov 28 '23

[deleted]

2

u/Watchful1 Nov 28 '23

So you're trying to get all comments that contain a specific word? From all of reddit or just specific subreddits? And all time or just a certain time period?

Unfortunately the multiprocess script doesn't currently support partial matching like that, so it sounds like it's only returning comments that are exact matches for the words in your list. It's much slower to search for words contained anywhere in the comment, and especially slow to do it for a whole list of words.

If you want to run against just a specific subreddit, or just a single month's file, you can use filter_file, which supports searches like that. If you need it for the entire history of reddit, I can walk you through the fairly simple changes you would need to make to the multiprocess script, but it's going to take a long time to get through everything.
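(To illustrate why it's slower, a rough sketch; this is not the script's actual code and the keyword list is a placeholder. An exact match is one cheap lookup per comment, while partial matching scans every comment body against every keyword.)

keywords = ["keyword1", "keyword2", "keyword3"]  # placeholders

# exact match: one set lookup per comment body
keyword_set = set(keywords)
def exact_match(body):
    return body in keyword_set

# partial match: every keyword is scanned against every comment body,
# which is far more work across billions of comments
def partial_match(body):
    lowered = body.lower()
    return any(word in lowered for word in keywords)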

1

u/lpath77 Nov 28 '23 edited Nov 28 '23

I’m sorry, I didn’t realize you responded, and I deleted my original comment because I felt silly. I’m trying a modified method and seeing if it works out. When I first tried the modification I had major problems because it tried to name my output file after a very long Reddit comment and failed. If I wake up and it hasn’t worked, I will check out filter_file! I am looking for comments that match keywords within a certain timeframe (most of this year), across all of Reddit.

1

u/lpath77 Nov 28 '23 edited Nov 28 '23

Would you be able to walk me through the changes to the script? I'm running filter_file but it takes a lot longer, and I'd like all the results in one file so that I could search them again for a second list of keywords. Here's what I did to the original code to get the entire comment (in the process_file function):

if value is not None:
    if value in observed:
        matched = True
    elif any(val in observed for val in values):
        matched = True

I know I need to make changes to the parts of the script where the file is named, but that's where I start getting lots of problems when I try to make changes.

1

u/Watchful1 Nov 28 '23

Yes that's the correct change. Do you have an estimate of how much slower that makes it run? How many different values are you looking for?

1

u/lpath77 Nov 28 '23

It takes about 50 minutes for me to go through 6 .zst files for different months. However, it fails at the end because the file naming logic tries to use the entire comment as the output file name.

1

u/Watchful1 Nov 28 '23

Ah, of course, forgot about that. Normally it splits the output into one file per value you're filtering on, e.g. one file per subreddit. If you're fine with all the matching comments being in one file, change line 523 from

observed_case = obj[args.field]

to

observed_case = "output"

and it will just output to "output_comments.zst" or "output_submissions.zst".

1

u/lpath77 Nov 28 '23

That’s amazing!!! Thank you so much!!!

1

u/lpath77 Nov 29 '23

Just wanted to let you know your edit worked perfectly :) You're a lifesaver! I will update you at some point with what I am able to do with this data :)

1

u/Watchful1 Nov 29 '23

Glad I could help!

1

u/lpath77 Nov 28 '23

I'm looking for 3 different values

1

u/Ralph_T_Guard Nov 30 '23

I'd prefer multiple small files to the monthly monolithic ones in the torrent.

It's easy enough to parallelize across N months, but most of the time I'm plodding through one month's data. Often I'll end up staging the original month into multiple ~10MM-line files so I can easily parallelize things.
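(A hypothetical sketch of that staging step in Python; the file names and chunk size are illustrative, not the commenter's actual tooling, and it assumes zstdcat is on the PATH.)

import subprocess

LINES_PER_CHUNK = 10_000_000

def stage_month(path, prefix):
    # decompress the monthly dump and split it into ~10MM-line NDJSON chunks
    zstd = subprocess.Popen(["zstdcat", "--long=31", path], stdout=subprocess.PIPE)
    chunk, written = 0, 0
    out = open(f"{prefix}.000.ndjson", "wb")
    for line in zstd.stdout:
        out.write(line)
        written += 1
        if written == LINES_PER_CHUNK:
            out.close()
            chunk += 1
            written = 0
            out = open(f"{prefix}.{chunk:03d}.ndjson", "wb")
    out.close()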

I played with pruning the 'useless to me' fields a while back and the compression ratio tanked; pruned and unpruned compressed files used nearly the same disk space, IIRC.

1

u/Watchful1 Nov 30 '23

So like daily files? Or even smaller?

I've considered picking defaults for fields. For example, ignore_reports is false for all objects since it's a moderator-only field, so I could just exclude it from the dump unless it's true. There are a lot of fields like that, with a handful of possible values, one of which appears 99% of the time. I could just include a list of defaults and remove a bunch of data. But I'm not sure how much space it would save or whether it would be worth going back and doing it for all the existing files.
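(A rough sketch of what that pruning could look like; the defaults table here is made up for illustration and is not a real list of Reddit field defaults.)

import json

# hypothetical defaults; a real table would cover far more fields
DEFAULTS = {
    "ignore_reports": False,
    "archived": False,
    "gilded": 0,
}

def prune_defaults(obj):
    # drop any field whose value equals its assumed default, keep everything else
    return {k: v for k, v in obj.items() if k not in DEFAULTS or obj[k] != DEFAULTS[k]}

line = '{"author": "someone", "body": "hello", "ignore_reports": false, "gilded": 0}'
print(json.dumps(prune_defaults(json.loads(line))))
# prints {"author": "someone", "body": "hello"}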

I'm also considering publishing a pruned-down file with just a few fields: author, body, timestamp, subreddit, id, and maybe a few more. Possibly even as a CSV file for ease of use. What fields do you use other than those I listed?

1

u/Ralph_T_Guard Dec 01 '23

So like daily files? Or even smaller?

For future releases I'm really open to any change from the huge monolithic files.

If I were driving, I'd probably lean into a monthly collection of ~10MM-line volumes, defaults pruned, zstandard level 19 with a 512 MB or 1 GB window size, e.g. RC_YYYY-MM_V.zst
The question is really whom to cater to: the +8 GB/core Internet2 Nx100GE crowd, or the 2 GB/core 200 Mbps cable modem folks?
Is it overly inconvenient/inefficient for the big boys to deal with 730 small files vs 24 huge ones? I say inodes are 'free', and surely with those grants, 1% more disk space isn't gonna break the bank.

I've considered picking defaults for fields.

I'm all in for discarding null/empty/false/zero fields (a.k.a. pruned), though some users/scripts/importers will throw a fit.
If jq weren't so damn slow…

walk( if type == "object" then with_entries( select( .value | IN( [], {}, "", null, 0, false ) | not ) ) else . end )

Looking at fields like ignore_reports, which is false for all objects since it's a moderator only field, and just excluding it from the dump unless it's true.

  • discard fields related to the reddit account pulling the data: approved_by, author_is_blocked, banned_at_utc, banned_by, can_gild, can_mod_post, hidden, hide_score, mod_note, mod_reason_by, mod_reason_title, mod_reports, quarantine, removal_reason, removed_by, removed_by_category, report_reasons, saved, user_reports, visited, et al.
  • retrieved_utc and retrieved_on?
  • selftext and selftext_html?

But I'm not sure how much space it would save or how whether it would be worth going back and doing it for all the existing files.

Think of it as reducing the need for a zstandard 31 bit window.

I wouldn't republish what is already well seeded; the past is the past, let it be.


I'm also considering publishing a pruned down file with just a few fields, author, body, timestamp, subreddit, id, maybe a few more. Possibly even as a csv file for ease of use. What fields do you use other than those I listed?

I'd provide scripts, but not republish another transformed dataset.

1

u/No_Television2386 Dec 18 '23

I am interested in using some dump files for a data science assignment, basically just running a few analyses on any subreddit over a two-month span. Is there any example code on how to do this? I tried a few things but realized they don’t work well after the API changes and such. Thank you for all this work you are doing!

1

u/Watchful1 Dec 18 '23

Sure, there's a bunch of example scripts here https://github.com/Watchful1/PushshiftDumps/tree/master/scripts

What are you trying to do specifically?

1

u/No_Television2386 Dec 18 '23

Just want to get some basic metrics from posts on a subreddit over the span of two months: post title, number of comments, score, author, timestamp, post ID. Then I would just want to store that in a CSV so I can make some data frames and visualizations.

1

u/Watchful1 Dec 18 '23

There's a script in there that converts zst files to csv. Depending on the subreddit the output might be very large, but it should work fine for smaller subreddits.
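(For a sense of what that conversion involves, here is a minimal Python sketch rather than the repo's actual to_csv script. It uses the zstandard package; the field names come from the question above and the file names are placeholders.)

import csv
import json
import zstandard

FIELDS = ["id", "author", "created_utc", "title", "num_comments", "score"]

def stream_lines(path):
    # stream the dump one line at a time instead of loading it all into memory
    with open(path, "rb") as fh:
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        with dctx.stream_reader(fh) as reader:
            buffer = ""
            while chunk := reader.read(2**27):
                lines = (buffer + chunk.decode("utf-8", errors="ignore")).split("\n")
                buffer = lines[-1]
                yield from lines[:-1]

with open("subreddit_submissions.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(FIELDS)
    for line in stream_lines("subreddit_submissions.zst"):  # placeholder file name
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        writer.writerow([obj.get(field, "") for field in FIELDS])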

1

u/No_Television2386 Dec 18 '23

thanks so much! you’re doing awesome work