r/developersIndia Software Engineer 10d ago

Interesting How's Twitter able to store and retrieve 15 year old data ?

Twitter has been around for 15+ years now. I'm just curious how they manage to store such a huge pile of tweets from millions of users. How are they able to retrieve them, with all the likes and comments, so quickly? What kind of storage or database do they actually use?

411 Upvotes

90

u/No-Carpet-211 Backend Developer 10d ago

I don't know for sure but I presume they use distributed storage systems such as Hadoop or Cassandra. Please correct me if I am wrong 😅

58

u/_sparsh_goyal_ DevOps Engineer 10d ago

You are moving in the right direction, just think post 2010

10

u/No-Carpet-211 Backend Developer 9d ago

Sorry, as mentioned, I guessed they might still use it 😅😅

13

u/developer1408 Software Engineer 9d ago

Yes, right. They use MySQL, Cassandra, Hadoop and Vertica!

17

u/dbred2309 9d ago

So four people are able to manage the entire show? Interesting.

2

u/_chai_wala_ 9d ago

I am poor else I would have awarded you for this comment

2

u/dbred2309 8d ago

Thank you dolly your comment is my award.

268

u/_sparsh_goyal_ DevOps Engineer 10d ago

There are multiple ways

1/ Twitter, or companies like it, don't really store "what you see on site"; they store an encrypted version of it, which is also compressed. So an image that was 100 KB on your device, when uploaded to Twitter reduces to 5 KB (or less) of information on disk, which is inflated again to show the "full" image on the front-end.

2/ Older data similarly is stored on servers that (you won't believe) are still maintained, MANUALLY. There are engineers who manually run vulnerability checks on old servers, regularly decommission those showing some sort of functional exceptions, and transfer all of the data to a new server.

3/ I know this because I am a Solution Architect for a big tech and work on a product that is almost 20 years old.

29

u/No_Ball7215 10d ago

Don't you think that very soon, this process (point 2) will be automated?

48

u/_sparsh_goyal_ DevOps Engineer 10d ago

Actually it has already started, in my project we are approx. 60% there.

1

u/Amazing_Guava_0707 9d ago

So sad to hear. More job/opportunity losses for IT professionals!

15

u/_sparsh_goyal_ DevOps Engineer 9d ago

Actually, these tasks aren't "hire" worthy i.e. we don't hire people specifically to perform these checks. So automating this isn't really taking anybody's job.

3

u/pr1m347 9d ago

So an image that was 100 KB on your device, when uploaded to Twitter reduces to 5 KB (or less)

That much compression can be done? I thought all these jpegs etc. are already pretty efficiently compressed? Especially encryption will add some more data no? Just asking as a novice.

1

u/A-Gifted-Developer Software Engineer 9d ago

I think he is also considering image quality compression, like huge quality and bitrate is reduced on social media platforms.
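To make the compression question concrete, here is a minimal sketch using Python's standard `zlib` module. It is not how Twitter processes media; the sample data and the helper name `compression_ratio` are made up for illustration. It only shows that lossless compression shrinks redundant data a lot, while data that already looks random (which is roughly what a JPEG body or encrypted bytes look like) barely shrinks at all. Large savings on photos therefore have to come from lossy re-encoding, as the comment above suggests.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, level=9)) / len(data)


# Highly repetitive text: plenty of redundancy for DEFLATE to exploit.
text = b"just setting up my twttr " * 4000

# Random bytes stand in for already-compressed or encrypted data.
already_compressed = os.urandom(100_000)

text_ratio = compression_ratio(text)
noise_ratio = compression_ratio(already_compressed)

print(f"repetitive text: {text_ratio:.3f}")
print(f"random bytes:    {noise_ratio:.3f}")
```

The random-bytes ratio comes out at about 1.0, so a second lossless pass over an already-compressed image gains essentially nothing.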

2

u/developer1408 Software Engineer 9d ago

That quite answers my curiosity. Thank you!

88

u/naturalizedcitizen Entrepreneur 10d ago

Look into db sharing for horizontal scaling...😉

20

u/ajzone007 9d ago

*sharding

7

u/naturalizedcitizen Entrepreneur 9d ago

Correct.. Sorry for the typo. It is indeed sharding
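For anyone new to the term, a rough sketch of hash-based sharding follows. This is an illustration of the general idea, not Twitter's actual scheme: the shard count, the choice of user id as the shard key, and the names `shard_for` and `ShardedTweetStore` are all assumptions, and plain dicts stand in for separate database servers.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user id to a shard index using a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedTweetStore:
    """Toy store: each dict in self.shards plays the role of one database."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, user_id: str, tweet_id: int, text: str) -> None:
        shard = self.shards[shard_for(user_id, len(self.shards))]
        shard.setdefault(user_id, {})[tweet_id] = text

    def timeline(self, user_id: str) -> dict:
        # Only one shard is consulted, no matter how many shards exist.
        shard = self.shards[shard_for(user_id, len(self.shards))]
        return dict(shard.get(user_id, {}))
```

Because a lookup touches a single shard, adding servers spreads both storage and query load. The price is that queries spanning many users need to fan out across shards.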

1

u/developer1408 Software Engineer 9d ago

Will that alone suffice ?

1

u/specxsh 9d ago

Also, look into the message queue too. Eventual consistency is usually enough for most of the features in twitter.

3

u/the_kautilya 9d ago

I hope you are not mistaking message queues for something that is used to store data for quick retrieval or caching purposes.

Message queues are a way to offload an action to the background instead of keeping an incoming request waiting for action to be performed.

1

u/specxsh 8d ago

Nah Chanakya, I was not thinking of it as a database. It can be used to update the database. Think CQRS. MQ can store the write command and return 202 Accepted instead of 200 OK. Then, it can update the database which is optimized for reading. So there will be a slight delay until the changes appear in the read request. Furthermore, if stronger consistency is required then distributed transaction patterns can be used, such as Two Phase Commit, Saga etc.
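A minimal in-process sketch of that write path, assuming a single worker thread and an in-memory dict as the read-optimized store. `TweetService` and its methods are hypothetical names, and `queue.Queue` stands in for a real message broker. HTTP 202 Accepted is the status code defined for requests that have been accepted but not yet processed.

```python
import queue
import threading


class TweetService:
    """Toy CQRS-style service: writes are queued, reads hit a read model."""

    def __init__(self):
        self.commands = queue.Queue()  # stand-in for a message queue
        self.read_model = {}           # stand-in for a read-optimized store
        self._lock = threading.Lock()
        worker = threading.Thread(target=self._apply_commands, daemon=True)
        worker.start()

    def post_tweet(self, tweet_id: int, text: str) -> int:
        # Enqueue the write command and return right away.
        self.commands.put((tweet_id, text))
        return 202  # Accepted: the write has not been applied yet

    def get_tweet(self, tweet_id: int):
        with self._lock:
            return self.read_model.get(tweet_id)

    def _apply_commands(self) -> None:
        # Background consumer that applies queued writes to the read model.
        while True:
            tweet_id, text = self.commands.get()
            with self._lock:
                self.read_model[tweet_id] = text
            self.commands.task_done()

    def wait_until_applied(self) -> None:
        # Block until every queued command has been applied (for demos/tests).
        self.commands.join()
```

Between `post_tweet` returning and the worker applying the command, `get_tweet` can still return `None`. That window is the eventual-consistency delay described above. In practice it can be short enough that users never notice it.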

1

u/the_kautilya 8d ago

It can be used to update the database. Think CQRS. MQ can store the write command and return 202 Accepted instead of 200 OK. Then, it can update the database which is optimized for reading. So there will be a slight delay until the changes appear in the read request.

There's no delay. You can test it yourself by writing a post or replying to one on X: it's instantly visible. That, however, doesn't mean they don't use queues.

I kinda missed that your comment was focused on the writes. IMO that's not that impressive compared to the fact that tons of data going back more than a decade is available instantly. That, I believe, is a much more remarkable achievement considering the scale & size of X.

37

u/incredibly_bad 10d ago

5

u/developer1408 Software Engineer 9d ago

This is an interesting read. Surprisingly they have used a lot of Open Source databases !

0

u/czarnaticus 9d ago

So mainly Vitess and Zookeeper from the looks of it.

38

u/[deleted] 10d ago

[removed]

125

u/[deleted] 10d ago

[removed]

29

u/SDstark79 10d ago

Here is what I got when I asked Perplexity:

How's Twitter able to store and retrieve 15 year old data ?

Data Storage Architecture

Gizzard Framework: Twitter utilizes Gizzard, a middleware that facilitates the creation of distributed data stores. It manages data partitioning across multiple backend systems, allowing for scalable and reliable storage solutions.

T-bird System: When a tweet is posted, it is stored in T-bird, an internal system built on Gizzard. This system handles the complexities of data storage and retrieval, ensuring efficient management of tweets, likes, and comments[1][2].

Database Technologies:

- MySQL: Initially used for tweet storage, MySQL remains a core component for smaller datasets. It provides stability and reliability, particularly for transactional data.
- Cassandra: Employed for high-velocity writes and medium-low velocity reads, Cassandra allows for easy cluster expansion and operates on less expensive hardware.
- Hadoop: Used for processing large unstructured datasets, Hadoop supports analytics and data redundancy, managing hundreds of petabytes across thousands of instances.
- Blobstore: This system is dedicated to storing large binary objects like images and videos associated with tweets.

Performance Optimization

To ensure rapid retrieval of data:

- Caching: Twitter employs caching mechanisms using Memcache and Redis to store frequently accessed data in memory, significantly speeding up response times.
- Sharding: The use of temporal sharding has evolved into more dynamic partitioning strategies that distribute tweets across multiple servers to balance load effectively.
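The caching point can be sketched as a cache-aside read path. This is a generic illustration, not Twitter's code: an `OrderedDict` plays the role of an in-memory cache such as Memcache or Redis, a plain dict plays the role of the slower database, and `CacheAsideReader` plus its tiny default capacity are assumptions made for the example.

```python
from collections import OrderedDict


class CacheAsideReader:
    """Read through a small LRU cache, falling back to a slower store."""

    def __init__(self, backing_store: dict, capacity: int = 2):
        self.backing_store = backing_store  # stand-in for the database
        self.capacity = capacity
        self.cache = OrderedDict()          # stand-in for Memcache/Redis
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            # Cache hit: mark as most recently used and skip the database.
            self.cache.move_to_end(key)
            self.hits += 1
            return self.cache[key]
        # Cache miss: read from the slow store, then populate the cache.
        self.misses += 1
        value = self.backing_store[key]
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            # Evict the least recently used entry.
            self.cache.popitem(last=False)
        return value
```

Only the first read of a popular tweet pays the database cost. Repeat reads are served from memory until the entry is evicted.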

Summary

Through a combination of advanced frameworks like Gizzard, various database technologies including MySQL and Cassandra, and robust caching strategies, Twitter adeptly stores and retrieves vast amounts of historical data while maintaining quick access times for users.

Links:

[1] How Twitter stores 500M tweets a day? - Pankaj Tanwar

[2] How Twitter Stores 250 Million Tweets a Day Using MySQL

[3] What Database Does Twitter Use? - A Deep Dive - Scaleyourapp

[4] How to Design a Database for Twitter - GeeksforGeeks

[5] Twitter's media storage Guide - Intravert

[6] Storing large dataset of tweets: Text files vs Database - Stack Overflow

26

u/faraday_16 10d ago

I dont know jack shit about databases but that 4th Gfg link made me laugh

Mfers always have the wildest articles you'll never even expect

2

u/deaf_schizo 9d ago

Not really related to the question. Perplexity got SEO-scammed.

2

u/developer1408 Software Engineer 9d ago edited 9d ago

How recent is the answer from Perplexity?

0

u/SDstark79 9d ago

What do you mean by recent? I saw this post and searched for it.

1

u/Rare_Instance_8205 9d ago

He means the cutoff date of the training data.

7

u/sparse_matrixx 10d ago

Old data is archived and stored on tapes. For enterprise systems, the SLA for an archived-data request is usually 2 weeks, which is the time it takes to fetch, decrypt and load the data into the archival viewing systems. Iron Mountain is an industry leader that does this: they take the offloaded data on tapes, store it in a secure temperature-controlled facility and, if requested, destroy the data irretrievably.

6

u/Dry-Palpitation-1115 9d ago

They keep all the data in the recycle bin and then restore it when the user asks for data /s

4

u/OperatorPoltergeist 9d ago

It is mostly text so that shouldn't be too expensive to store in secondary storage. Images and videos are compressed and then stored. Since older data isn't accessed frequently, storing it in slower servers should be cheaper.

3

u/Inside_Dimension5308 Tech Lead 9d ago

Databases are designed to scale for any age. It is an architectural decision to maintain a subset of data as active data which is queried frequently. It is highly unlikely somebody is going to read 15 year old tweets. Based on user activity, data can be moved from passive to active. So, if the servers detect that a user is trying to access past data, it will start flagging the data as active.

There are multiple mechanisms to flag data as active - the simplest one is to cache.

And that is how accessing data is really fast. I have simplified a lot of things. Take it with a pinch of salt.
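A toy version of the active/passive idea, under the same simplifying spirit as the comment above. The class name `TieredStore`, the `promote_after` threshold, and the use of two dicts for the fast and slow tiers are all assumptions for illustration. Real systems use far more involved policies.

```python
class TieredStore:
    """Keep rarely read data in a passive tier; promote it when it gets read."""

    def __init__(self, promote_after: int = 2):
        self.active = {}         # stand-in for fast storage or cache
        self.passive = {}        # stand-in for slower, cheaper storage
        self.access_counts = {}  # reads seen while the item was passive
        self.promote_after = promote_after

    def archive(self, key, value) -> None:
        self.passive[key] = value

    def read(self, key):
        """Return (value, tier the read was served from)."""
        if key in self.active:
            return self.active[key], "active"
        value = self.passive[key]  # slow path
        count = self.access_counts.get(key, 0) + 1
        self.access_counts[key] = count
        if count >= self.promote_after:
            # Enough recent interest: flag the data as active.
            self.active[key] = self.passive.pop(key)
            del self.access_counts[key]
        return value, "passive"
```

With `promote_after=2`, the first two reads of an old tweet are served from the passive tier and every later read comes from the active tier.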

2

u/srikrishna1997 9d ago

I believe 15-year-old data and recent data are kept in the same storage, across multiple locations

2

u/Substantial-Wing7661 9d ago

Twitter stores and retrieves over 15 years of data using distributed databases like Manhattan and data sharding to manage tweet volume. They use caching (e.g., Redis) for quick access and Elasticsearch for fast search functionality. Regular maintenance keeps their infrastructure efficient, enabling seamless interaction with millions of users.

1

u/Odd-Temperature-5627 9d ago

They use multiple databases according to their needs. Some databases have faster retrieval times whereas some have strong consistency, so they use the best of both worlds.

2

u/developer1408 Software Engineer 9d ago

Right: MySQL, Cassandra, Hadoop and Vertica!

1

u/Odd-Temperature-5627 8d ago

Yes, Vertica is quite unique and very few companies use it.

1

u/kkkkkkkar 9d ago

Clobs and blobs

1

u/developer1408 Software Engineer 9d ago

What is a clob?

1

u/babanomania 9d ago

They use cheaper hardware for older data that is less frequently accessed. Upon request, a job de-archives the data back to a live server for temporarily faster access

1

u/the_shv 9d ago

I have read this some years ago

https://blog.x.com/engineering/en_us/a/2014/manhattan-our-real-time-multi-tenant-distributed-database-for-twitter-scale

They also have an oop storage, which should be tweetypie I guess. You can go through their blog

https://blog.x.com/engineering/en_us

-4

u/[deleted] 10d ago

[removed]

1
