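Frequent Twitter users may have noticed that the site has suffered several overload-related outages over the past few months. Growth of 300,000 new users per day during the World Cup was a major factor in the overload. It has also pushed Twitter to build its own data center, which is now under construction in Salt Lake City.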
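Although it will undoubtedly be smaller than the $1 billion data center Apple is building in North Carolina, a Twitter spokesperson said the company is stepping up work on a custom-built data center of its own, which will come online this year.
“Having a dedicated data center will give the site more capacity to accommodate growth in users,” Twitter's Jean-Paul Cozzatti wrote on the company's engineering blog.
“In this data center, Twitter will have full control over its network and system configuration, will occupy a large block of commercial floor space, and will use specially designed power and cooling equipment. The data center will house servers from multiple vendors, running open source operating systems and applications.”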
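Until recently, Twitter used a data center built in the Bay Area by NTT America, the U.S. arm of Nippon Telegraph and Telephone Corporation (NTT). “We will continue to work with NTT America to manage our existing facility; this is the first data center we have custom-built,” a Twitter spokesperson told us.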
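This is another major move by a large social networking site since Facebook unveiled its own data center in January. Facebook's data center is in Oregon, where many other companies, including Amazon and Google, also have data centers. The big Internet companies cluster there for a reason: the region offers cheap electricity, a suitably cool climate, and corporate tax breaks.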
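Twitter recently faced a public relations crisis triggered by a flood of complaints from users who could not access the site properly (chief among them, probably, that new users were at times unable to sign up). Twitter's company blog outlined the problems, and Cozzatti, one of Twitter's technical leads, described them in detail in his own blog post. The core problem was that on Monday a long-running query hung Twitter's main users database and left most of the users table locked, which affected most of the service. Twitter had to force-restart the database, a process that took more than 12 hours! Now you can perhaps understand why they want more control over their systems :)
“We frequently compare the tasks of scaling, maintaining, and tweaking Twitter to building a rocket in mid-flight,” Cozzatti wrote.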
Twitter engineering article on performance:
On Monday, a fault in the database that stores Twitter user records caused problems on both Twitter.com and our API. The short, non-technical explanation is that a mistake led to some problems that we were able to fix without losing any data.
While we were able to resolve these issues by Tuesday morning, we want to talk about what happened and use this as an opportunity to discuss the recent progress we’ve made in improving Twitter’s performance and availability. We recently covered these topics in a pair of June posts (here and on our company blog).
Riding a rocket
Making sure Twitter is a stable platform and a reliable service is our number one priority. The bulk of our engineering efforts are currently focused on this effort, and we have moved resources from other important projects to focus on the issue.
As we said last month, keeping pace with record growth in Twitter’s user base and activity presents some unique and complex engineering challenges. We frequently compare the tasks of scaling, maintaining, and tweaking Twitter to building a rocket in mid-flight.
During the World Cup, Twitter set records for usage. While the event was happening, our operations and infrastructure engineers worked to improve the performance and stability of the service. We have made more than 50 optimizations and improvements to the platform, including:
* Doubling the capacity of our internal network;
* Improving the monitoring of our internal network;
* Rebalancing the traffic on our internal network to redistribute the load;
* Doubling the throughput to the database that stores tweets;
* Making a number of improvements to the way we use memcache, improving the speed of Twitter while reducing internal network traffic; and,
* Improving page caching of the front and profile pages, reducing page load time by 80 percent for some of our most popular pages.
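The post does not say how the page caching was implemented. As a rough illustration of the general idea behind the last two items, here is a minimal cache-aside sketch: a rendered page is stored under a key with an expiry time, and the expensive render step runs only on a miss. `PageCache`, `get_page`, and the TTL value are hypothetical names and parameters chosen for this example, and the in-process dictionary merely stands in for a memcache-style store.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class PageCache:
    """Tiny in-process stand-in for a memcache-style store with expiry."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it so the caller re-renders the page.
            del self._entries[key]
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


def get_page(cache: PageCache, key: str, render: Callable[[], str]) -> str:
    """Cache-aside lookup: serve from the cache, render and store on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    page = render()  # the expensive step (database reads, templating)
    cache.set(key, page)
    return page
```

With a cache like this, repeated requests for a popular profile page within the TTL skip the render step entirely, which is the kind of effect that would cut both page load time and internal traffic.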
So what happened Monday?
While we’re continuously improving the performance, stability and scalability of our infrastructure and core services, there are still times when we run into problems unrelated to Twitter’s capacity. That’s what happened this week.
On Monday, our users database, where we store millions of user records, got hung up running a long-running query; as a result, most of the table became locked. The locked users table manifested itself in many ways: users were unable to sign up, sign in, update their profile or background images, and responses from the API were malformed, rendering the response unusable to many of the API clients. In the end, this affected most of the Twitter ecosystem: our mobile, desktop, and web-based clients, the Twitter support and help system, and Twitter.com.
To remedy the locked table, we force-restarted the database server in recovery mode, a process that took more than 12 hours (the database covers records for more than 125 million users — that’s a lot of records). During the recovery, the users table and related tables remained unavailable. Unfortunately, even after the recovery process completed, the table remained in an unusable state. Finally, yesterday morning we replaced the partially-locked user db with a copy that was fully available (in the parlance of database admins everywhere, we promoted a slave to master), fixing the database and all of the related issues.
We have taken steps to ensure we can more quickly detect and respond to similar issues in the future. For example, we are prepared to more quickly promote a slave db to a master db, and we put additional monitoring in place to catch errant queries like the one that caused Monday’s incidents.
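The post does not describe the monitoring that was added. One simple way to catch errant queries of this kind, sketched here under stated assumptions, is to poll the database's list of running queries and flag any that exceed a time budget. `RunningQuery`, `find_errant_queries`, and the threshold are hypothetical names for illustration; the snapshot would come from whatever process listing the database in use provides.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class RunningQuery:
    """One row of a hypothetical snapshot of queries currently executing."""
    query_id: int
    elapsed_seconds: float
    statement: str


def find_errant_queries(queries: Iterable[RunningQuery],
                        max_seconds: float) -> List[RunningQuery]:
    """Return queries that have run longer than max_seconds, longest first."""
    slow = [q for q in queries if q.elapsed_seconds > max_seconds]
    return sorted(slow, key=lambda q: q.elapsed_seconds, reverse=True)
```

A monitor could call `find_errant_queries(snapshot, max_seconds=60)` on each poll and alert an operator for every query returned, so that a query holding locks on a large table is noticed within minutes instead of after users report failures.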
Long-term solutions
As we said last month, we are working on long-term solutions to make Twitter more reliable (news that we are moving into our own data center this fall, which we announced this afternoon, is just one example). This will take time, and while there has been short-term pain, our capacity has improved over the past month.
Finally, despite the rapid growth of our company, we’re still a relatively small crew maintaining a comparatively large (rocket) ship. We’re actively looking for engineering talent, with more than 20 openings currently. If you’re interested in learning more about the problems we’re solving or “joining the flock,” check out our jobs page.