You can also read this post in Chinese.
We wanted to update you all on the status of the data recovery that we are currently performing.
TLDR: Your data is safe!
As you probably already know, last Friday we had a serious outage that resulted in complete data loss on one article storage node, which consists of two servers. Due to an operational error, a simple command intended for only one of the servers was replicated to both of them, just a few days before a scheduled full backup and archival. This was a huge operational mistake on our side, and we are taking serious measures to make sure it does not happen again.
This node held article content data from August 2015 until now. The two servers had a total of 6TB of data. Immediately after realizing the mistake, we shut down the application to prevent further data corruption (polling servers use those nodes to check for older articles in order not to duplicate them). The servers themselves were then shut down to minimize the chance of the data being overwritten, and a technician was dispatched to boot them from a USB drive and mount the filesystems as read-only. We were able to salvage 3 days’ worth of data from the databases’ transaction logs, which we used to populate a new temporary database and restart the service. That’s why on Friday you had only 3 days of article history. Unfortunately we also had to stop certain feeds from updating, because they don’t contain article dates and we can’t tell whether their articles are already in the crashed database. Allowing all articles to be inserted would have duplicated articles in a huge number of feeds, and many users would have seen them duplicated a second time once we restore the database. This limit is still enforced and will remain in place until we fully recover the database.
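For the technically curious, the restriction on feeds without dates can be pictured roughly as follows. This is only a minimal sketch under our own assumptions, not the actual polling code: the names `classify_article`, `Decision`, `known_guids` and `window_start` are hypothetical, and the rule that dated articles older than the salvaged window are skipped is an assumption made for illustration.

```python
from datetime import datetime, timezone
from enum import Enum
from typing import Optional, Set


class Decision(Enum):
    INGEST = "ingest"  # safe to insert now
    SKIP = "skip"      # already stored, or assumed to be in the database being recovered
    DEFER = "defer"    # cannot decide until the full database is back


def classify_article(
    guid: str,
    published: Optional[datetime],
    known_guids: Set[str],
    window_start: datetime,
) -> Decision:
    """Decide what to do with a polled article while only recent history is available.

    known_guids  -- identifiers present in the temporary database (hypothetical)
    window_start -- start of the salvaged history window (hypothetical)
    """
    if guid in known_guids:
        # Already present in the temporary database: a plain duplicate.
        return Decision.SKIP
    if published is None:
        # No date in the feed: the article may or may not exist in the
        # crashed database, so inserting it risks a duplicate later.
        return Decision.DEFER
    if published < window_start:
        # Assumption: anything older than the salvaged window was stored
        # before the outage and will come back with the restore.
        return Decision.SKIP
    # Dated inside the salvaged window and not yet stored: a new article.
    return Decision.INGEST


# Illustrative values only.
start = datetime(2016, 5, 24, tzinfo=timezone.utc)
print(classify_article("item-1", None, set(), start))
```

The point of the third outcome is that an undated article is neither inserted nor thrown away; it simply waits until the restored database can answer the question.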
After the service was restored, we started extracting raw data from the filesystems. This is a very long process, since it is performed at a very low level, but it finally finished just a few hours ago. From those files we can now extract tabular data that can be fed into a database, which is what we are doing right now. Judging by the speed of extraction, our initial estimate of Saturday June 4 still holds, and we might even finish on Friday. Data will be 100% restored by then!
We will inform you of the progress in this post and we’ll try to update it as often as possible.
- Phase 1 (extracting raw data into files): 100% completed…
- Phase 2 (generating readable data from files): 100% completed… ETA: Friday June 3
- Phase 3 (loading data into the databases): 100% completed… ETA: Monday June 6
- Progress last updated: June 6 1PM GMT
UPDATE June 4 4AM GMT:
Phase 3 is going slower than expected. The ETA for full recovery is now Monday June 6.
However, we have restored enough data that we can now enable polling of all feeds, even those without pubDate. You should start seeing updates from all your feeds in the next few hours.
UPDATE June 6 1PM GMT:
Data recovery complete!