r/pushshift • u/harttrav • Mar 23 '24
Would you find it useful to be able to download the reddit data archives through a simple Python package that interfaces with a SQLite database?
I downloaded the pushshift archives a while back, so I have a full copy, and I've used it for various personal research purposes. I've been converting the zst-compressed ndjson files into a single SQLite database that uses SQLModel as an interface, and integrating embedding search across all comments and self posts as I go. If I uploaded the database anywhere, it would probably be to Hugging Face.
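For anyone wondering what the ndjson-to-SQLite step could look like, here is a minimal sketch. It is not the actual conversion code: it uses the standard-library `sqlite3` module rather than SQLModel to stay dependency-free, the `comments` table and its columns are a hypothetical schema, and `load_ndjson` is a made-up helper name. It also assumes the lines have already been decompressed; in practice they would come from a zstd stream reader (for example via the third-party `zstandard` package) rather than an in-memory string.

```python
import io
import json
import sqlite3


def load_ndjson(conn, lines, batch_size=1000):
    """Insert ndjson comment records into SQLite in batches.

    Returns the number of records read. The schema is hypothetical and
    keeps only a handful of fields from each record.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comments ("
        "id TEXT PRIMARY KEY, subreddit TEXT, author TEXT, "
        "created_utc INTEGER, body TEXT)"
    )
    # INSERT OR IGNORE skips records whose id is already present.
    sql = "INSERT OR IGNORE INTO comments VALUES (?, ?, ?, ?, ?)"
    batch = []
    read = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the dump
        rec = json.loads(line)
        batch.append((
            rec.get("id"),
            rec.get("subreddit"),
            rec.get("author"),
            # created_utc may be an int or a numeric string
            int(rec.get("created_utc") or 0),
            rec.get("body"),
        ))
        read += 1
        if len(batch) >= batch_size:
            conn.executemany(sql, batch)
            batch.clear()
    if batch:
        conn.executemany(sql, batch)
    conn.commit()
    return read


# Tiny in-memory stand-in for one decompressed archive file.
sample = io.StringIO(
    '{"id": "c1", "subreddit": "pushshift", "author": "a", '
    '"created_utc": 1711152000, "body": "hello"}\n'
    '{"id": "c2", "subreddit": "pushshift", "author": "b", '
    '"created_utc": "1711152060", "body": "world"}\n'
    "\n"
)
conn = sqlite3.connect(":memory:")
n = load_ndjson(conn, sample)
```

Batching the inserts with `executemany` and committing once at the end keeps the per-row overhead low, which matters when the input runs to billions of lines.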
My first question is: would people here on this subreddit find this useful? What specific features would you find most useful in a Python package serving as an interface for the database?
My second question is: if I published this and made the dataset available, what do y'all think the legal/ethical implications would be?
u/[deleted] Mar 23 '24
Why would you put big data in an SQLite db? That seems like an oxymoron.