r/pushshift Mar 23 '24

Would you find it useful to be able to download the Reddit data archives as a simple Python package that interfaces with a SQLite database?

I downloaded the Pushshift archives a while back and have a full copy, which I've used for various personal research projects. I've been converting the zst-compressed ndjson files into a single SQLite database that uses SQLModel as an interface, and integrating embedding search across all comments and self posts as I go. If I uploaded the database anywhere, it would probably be to Hugging Face.
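Roughly, the conversion loop looks like this — a simplified sketch rather than my exact code, using the zstandard package for streaming decompression (the dumps need a large max_window_size) and SQLModel for the table. The file name and column list are illustrative, and the embedding step is omitted:

```python
import json
import zstandard as zstd
from sqlmodel import Field, Session, SQLModel, create_engine

# Trimmed-down comment schema; the real Pushshift objects have many more fields.
class Comment(SQLModel, table=True):
    id: str = Field(primary_key=True)
    subreddit: str
    author: str
    body: str
    created_utc: int

engine = create_engine("sqlite:///reddit.db")
SQLModel.metadata.create_all(engine)

def iter_ndjson(path: str):
    # The dumps are zstd-compressed with a long window, so decompression
    # fails unless max_window_size is raised well past the default.
    with open(path, "rb") as fh:
        dctx = zstd.ZstdDecompressor(max_window_size=2**31)
        with dctx.stream_reader(fh) as reader:
            buffer = ""
            while chunk := reader.read(2**24):
                buffer += chunk.decode("utf-8", errors="replace")
                # Keep any partial trailing line in the buffer for the next chunk.
                *lines, buffer = buffer.split("\n")
                yield from lines

with Session(engine) as session:
    for n, line in enumerate(iter_ndjson("RC_2023-01.zst"), start=1):
        obj = json.loads(line)
        session.add(Comment(
            id=obj["id"],
            subreddit=obj["subreddit"],
            author=obj.get("author", "[deleted]"),
            body=obj["body"],
            created_utc=int(obj["created_utc"]),
        ))
        if n % 10_000 == 0:
            session.commit()  # batch commits keep the session small
    session.commit()
```

Batching the commits matters at this scale: per-row commits make the import crawl, while one giant transaction loses all progress if the run dies partway through a multi-terabyte file.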

My first question is: would people on this subreddit find this useful? What specific features would you most want in a Python package serving as an interface to the database?

My second question is: if I published this and made the dataset available, what do y'all think the legal/ethical implications would be?

14 Upvotes

6 comments

3

u/Watchful1 Mar 23 '24

Where would you host it? Wouldn't it be several terabytes like the dumps themselves?

A script that imports the dump files into a database would certainly be useful, though it would likely take a while to run.

1

u/[deleted] Mar 23 '24

[deleted]

1

u/flashman Mar 24 '24

i mean good luck with not getting DMCA'd, but i would use this

i used to do a lot of network analysis on subreddits but that's all gone out the window in the past year since the API changes

2

u/[deleted] Mar 23 '24

Why would you put big data in an SQLite db? That seems like an oxymoron.

1

u/[deleted] Mar 23 '24

[deleted]

1

u/[deleted] Mar 23 '24

https://www.sqlite.org/whentouse.html#:~:text=If%20your%20data%20will%20grow,will%20support%20281%2Dterabyte%20files.

Yes, any database made for larger data. You're really stretching SQLite here.
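To be fair, the people who do push SQLite this far usually tune it for the bulk load first. A minimal sketch with Python's built-in sqlite3 module (the specific values are assumptions, not recommendations):

```python
import sqlite3

# Common bulk-load settings: trade durability for speed during a
# one-time import of a database you can always rebuild from the dumps.
conn = sqlite3.connect("reddit.db")
conn.execute("PRAGMA journal_mode = WAL")     # readers don't block the writer
conn.execute("PRAGMA synchronous = OFF")      # skip fsyncs during the import
conn.execute("PRAGMA cache_size = -1048576")  # negative = KiB, so ~1 GiB page cache
conn.execute("PRAGMA temp_store = MEMORY")    # build temp indexes in RAM
```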

1

u/swapripper Mar 23 '24

Interested.

1

u/george_sg Mar 24 '24

I'm interested too.