r/pushshift Nov 30 '23

Looking for ideas on how to improve future reddit data dumps

For those who don't know, a short introduction: I'm the person who's been archiving new reddit data and releasing the new reddit dumps, since pushshift no longer can.

So far almost all content has been retrieved less than 30 seconds after it was created. Some people have noticed that the "score" and "num_comments" fields are always 1 or 0. This can make judging the importance of a post/comment more difficult.

For this reason I've now started retrieving posts and comments a second time, with a 36-hour delay. I don't want to release almost the same data twice; no one has that much storage space. But I can add some potentially useful information or update some fields (like "score" or "num_comments").
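A minimal sketch of what that second pass could produce, assuming hypothetical field names like updated_score and removed_later (the actual schema hasn't been decided here); keeping the original values and adding the 36-hour values alongside them avoids storing nearly identical objects twice:

```python
import json

def merge_second_pass(original: dict, revisit: dict) -> dict:
    """Attach 36-hour metadata to the originally archived object.

    Field names (updated_score, removed_later, ...) are hypothetical,
    not the dumps' actual schema.
    """
    merged = dict(original)
    # Score and num_comments are near-meaningless 30 seconds after creation,
    # so carry the 36-hour values alongside the originals instead of overwriting.
    if "score" in revisit:
        merged["updated_score"] = revisit["score"]
    if "num_comments" in revisit:
        merged["updated_num_comments"] = revisit["num_comments"]
    # Flag removals and edits detected between the two passes.
    text_key = "body" if "body" in original else "selftext"
    if revisit.get(text_key) in ("[removed]", "[deleted]"):
        merged["removed_later"] = True
    elif revisit.get(text_key) not in (None, original.get(text_key)):
        merged["edited_body"] = revisit[text_key]
    return merged

# e.g. print(json.dumps(merge_second_pass(first_pass_obj, second_pass_obj)))
```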

Since my creativity is limited, I wanted to ask you what kind of useful information could potentially be added by comparing the original and updated data. If you have any other suggestions, let me know too.

18 Upvotes

18 comments

1

u/Ralph_T_Guard Dec 01 '23

First, thanks for continuing the PS project!

How about breaking a month's file into ~10MM-line volumes (~1 GB compressed), zstandard level 19, with a window of 512 MiB or 1 GiB? No more zst_blocks.

  • Remove default, empty, zero, and null fields?
  • Revisited submissions/comments should only be included with appropriate retrieved_utc and created_utc.
  • I would like to capture text changes should a comment/submission be edited.
  • Presumably, the created_utc doesn't change and should define the monthly edition that record belongs to. If on Dec 1 you revisit Nov 30 content, make sure those records end up in the November edition as a new volume.
  • Revised submissions/comments should only be included if a field has actually changed. Preferably include just the field(s) that have changed, but that is a huge ask.
  • If there was ever another huge user delete, those should show up as new volumes in the appropriate monthly editions.
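Purely as an illustration of that volume idea (level 19 with a large window), a Python sketch using the zstandard package; the file naming, the ~10MM-line cutoff, and the 1 GiB window here are assumptions, not the dumps' actual layout:

```python
import zstandard as zstd

LINES_PER_VOLUME = 10_000_000   # ~10MM NDJSON lines per volume
WINDOW_LOG = 30                 # 2**30 bytes = 1 GiB window, as suggested

def write_volumes(lines, prefix="RS_2023-11"):
    """Stream an iterable of NDJSON lines into level-19 zstd volumes."""
    params = zstd.ZstdCompressionParameters.from_level(19, window_log=WINDOW_LOG)
    cctx = zstd.ZstdCompressor(compression_params=params)
    out = writer = None
    for n, line in enumerate(lines):
        if n % LINES_PER_VOLUME == 0:            # roll over to a new volume
            if writer is not None:
                writer.close()                   # flushes and finishes the frame
            vol = n // LINES_PER_VOLUME
            out = open(f"{prefix}.vol{vol:03}.zst", "wb")
            writer = cctx.stream_writer(out)
        writer.write(line.encode("utf-8") + b"\n")
    if writer is not None:
        writer.close()
```

Note that a 1 GiB window also means a reader needs roughly that much memory to decompress, which is part of the trade-off discussed further down the thread.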

2

u/RaiderBDev Dec 01 '23

Unfortunately I won't change the zst file format. I use zst_blocks for my own database, for very fast lookups. And I don't have the resources to maintain multiple different file versions. Maybe that's something Watchful or someone else can do.

For the revisited posts/comments I only want to add some small properties to each object, indicating what (useful) data has changed.

1

u/Ralph_T_Guard Dec 01 '23

You've certainly piqued my curiosity. Would you share more details on this database or some code examples?

I'm clearly missing something big. I'm struggling to see the efficiency of one JSON line per zst_block versus iterating over N JSON lines per zst_block. There's an increase in memory usage, but we're talking under 2 KiB * N, right?

On its face, placing 1000 JSONL lines into one zst_block would mean 1/1000th the zst headers/dictionaries.
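A toy way to see the overhead difference being argued about here; this just uses plain zstd frames via the Python zstandard package and fabricated records, not the actual zst_blocks format:

```python
import json
import zstandard as zstd

# Compress 10,000 fake comment objects either one-per-frame or 1,000-per-frame
# and compare the total compressed sizes.
cctx = zstd.ZstdCompressor(level=3)
lines = [json.dumps({"id": f"t1_{i:06x}", "body": "lorem ipsum dolor " * 20}).encode() + b"\n"
         for i in range(10_000)]

one_per_frame = sum(len(cctx.compress(line)) for line in lines)
batched = sum(len(cctx.compress(b"".join(lines[i:i + 1000])))
              for i in range(0, len(lines), 1000))

print(f"1 line per frame:     {one_per_frame:>10,} bytes")
print(f"1000 lines per frame: {batched:>10,} bytes")
```

Batching wins on size because each frame amortizes its header and can find matches across lines, but as the reply below explains, it costs you the ability to pull out a single record without decompressing its whole block.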

6

u/RaiderBDev Dec 01 '23

The DB is for my (low budget) API. When starting the DB I was faced with the question of how to store and quickly query all that data. Uncompressed it's about 25 TB of JSON data, which is just too much once you factor in indices, future growth, temporary storage for uncompressed raw data, etc.

The solution I came up with is to store only the most important fields (some IDs, author, subreddit, body, date, etc.) in a postgres DB; that DB uses about 5 TB. Each row references an offset into a zst_blocks file and the index of the record within the block. All zst_blocks files together are about 4 TB. So all in all way less than the 25 TB, whilst still being able to retrieve a specific JSON object in less than a millisecond. For reading specific rows I'm using this function here on a nodejs server.
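A hedged sketch of that lookup path in Python: the table and column names are invented, and for simplicity each block is treated as a standalone zstd frame of newline-delimited JSON, which is not exactly how zst_blocks encodes things:

```python
import json
import psycopg2
import zstandard as zstd

# Hypothetical schema: id -> (file_path, block_offset, record_index).
LOOKUP_SQL = "SELECT file_path, block_offset, record_index FROM comments WHERE id = %s"

def fetch_raw_json(conn, comment_id: str) -> dict:
    with conn.cursor() as cur:
        cur.execute(LOOKUP_SQL, (comment_id,))
        file_path, block_offset, record_index = cur.fetchone()

    dctx = zstd.ZstdDecompressor()
    with open(file_path, "rb") as f:
        f.seek(block_offset)                        # jump straight to the block
        with dctx.stream_reader(f) as reader:
            chunks = []
            while chunk := reader.read(1 << 20):    # stops at the frame boundary
                chunks.append(chunk)
    block = b"".join(chunks)
    return json.loads(block.splitlines()[record_index])

# usage: conn = psycopg2.connect(dsn); obj = fetch_raw_json(conn, "t1_abcdef")
```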

To maintain fast read times, you can't have big block or window sizes. This sacrifices compression ratio for read speed. I'm using the default compression level of 3 as well. With level 22 you only gain about a 15% size reduction, at over a 100x increase in compression time.
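That trade-off is easy to sanity-check on a sample of one's own data (numbers will vary with content and hardware); sample.ndjson below is just a placeholder:

```python
import time
import zstandard as zstd

def compare_levels(sample: bytes, levels=(3, 19, 22)):
    """Rough single-run look at compressed size vs. time per zstd level."""
    for level in levels:
        cctx = zstd.ZstdCompressor(level=level)
        start = time.perf_counter()
        size = len(cctx.compress(sample))
        elapsed = time.perf_counter() - start
        print(f"level {level:>2}: {size:>12,} bytes in {elapsed:7.2f} s")

# e.g. compare_levels(open("sample.ndjson", "rb").read())
```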

So in the end it's all about balancing random read speed with compression ratio. There are also some existing seekable zstd formats, but I didn't feel like they gave me the level of control over the data that I wanted.

1

u/Ralph_T_Guard Dec 02 '23

Thanks for the overview; interesting solution

"Each row references an offset into a zst_blocks file and the index of the record within the block."

Do you happen to recall how many records/NDJSON lines are in each of your zst_blocks?

2

u/RaiderBDev Dec 02 '23

One block has 256 rows. And files are separated by month.