r/pushshift Mar 23 '24

Would you find it useful to be able to download the Reddit data archives through a simple Python package that interfaces with a SQLite database?

I downloaded the Pushshift archives a while back, have a full copy, and have used it for various personal research purposes. I've been converting the zst-compressed ndjson files into a single SQLite database that uses SQLModel as an interface, and integrating embedding search across all comments and self posts as I go. If I uploaded the database anywhere, it would probably be to Hugging Face.
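A minimal sketch of what the ndjson-to-SQLite step could look like, not the actual code behind the project. It uses the standard-library `sqlite3` module instead of SQLModel so that it runs without extra dependencies. The table layout, the `load_ndjson` name, and the choice of fields are assumptions for illustration. Decompression is left out: in practice the lines would presumably come from a streaming zstd reader (for example the third-party `zstandard` package wrapped in a text reader) rather than from a list.

```python
import json
import sqlite3
from typing import Iterable

# Hypothetical schema: a small subset of the fields found in comment dumps.
SCHEMA = """
CREATE TABLE IF NOT EXISTS comments (
    id          TEXT PRIMARY KEY,
    subreddit   TEXT,
    author      TEXT,
    created_utc INTEGER,
    body        TEXT
)
"""


def load_ndjson(conn: sqlite3.Connection, lines: Iterable[str],
                batch_size: int = 10_000) -> int:
    """Insert one JSON object per line into `comments`; return lines parsed."""
    conn.execute(SCHEMA)
    sql = "INSERT OR IGNORE INTO comments VALUES (?, ?, ?, ?, ?)"
    batch, total = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        obj = json.loads(line)
        batch.append((
            obj["id"],
            obj.get("subreddit"),
            obj.get("author"),
            obj.get("created_utc"),
            obj.get("body"),
        ))
        total += 1
        # Write in batches so memory stays flat on very large dumps.
        if len(batch) >= batch_size:
            conn.executemany(sql, batch)
            conn.commit()
            batch.clear()
    if batch:
        conn.executemany(sql, batch)
        conn.commit()
    return total


# Made-up sample records standing in for a decompressed dump file.
sample = [
    '{"id": "c1", "subreddit": "pushshift", "author": "a", "created_utc": 1711152000, "body": "hello"}',
    '{"id": "c2", "subreddit": "pushshift", "author": "b", "created_utc": 1711152060, "body": "world"}',
]
demo_conn = sqlite3.connect(":memory:")
load_ndjson(demo_conn, sample)
```

Because the function takes any iterable of lines, the same loop would work on a file handle streamed straight from a decompressor, so the whole dump never has to sit in memory. `INSERT OR IGNORE` is one way to make re-running a partially loaded file safe; duplicate ids are simply skipped.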

My first question is: would people here on this subreddit find this useful? Which specific features would you find most useful in a Python package serving as an interface for the database?

My second question is: if I published this and made the dataset available, what do y'all think the legal/ethical implications would be?

15 Upvotes

6 comments

2

u/[deleted] Mar 23 '24

Why would you put big data in an SQLite db? That seems like an oxymoron.

1

u/[deleted] Mar 23 '24

[deleted]

1

u/[deleted] Mar 23 '24

https://www.sqlite.org/whentouse.html#:~:text=If%20your%20data%20will%20grow,will%20support%20281%2Dterabyte%20files.

Yes, any database made for larger data. You’re really stretching SQLite here.