Answers.ros.org Offline
Incident Report for ROS
Postmortem

We're sorry about the outage of answers.ros.org. We know that it's a valuable resource that many of you search and reference every day.

On December 11th, we noticed degraded performance and identified the issue as high disk load caused by maintenance/migration operations on our shared hosting platform. Disk IO became the bottleneck for the server, and it started crashing. On December 13th, the repeated crashing due to disk IO limits eventually caused our PostgreSQL database to become corrupted.

With the help of several coworkers, we found a way on December 19th to clear the corruption with minimal data loss. The corruption occurred in the TOAST tables, which are where PostgreSQL stores large strings. The recommended way to recover from this is to clear the corrupted entries from the tables. Clearing the corrupted entries also required finding and clearing their references in cross-referenced tables. Of the more than 180,000 posts on the site, we found approximately 8 that were corrupted and had to be cleared; however, about 15 associated tables had to be cleaned of references to each post. So although the downtime was long, the data loss was minimal.
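As an illustration of the "find the corrupted entries" step, here is a minimal Python sketch of a row-by-row scan. It is not the procedure we actually ran, only a common pattern under stated assumptions: `find_unreadable_ids` and `fake_read_row` are hypothetical names, and `read_row` stands in for whatever query forces the database to read a post's full text (for example, selecting the large text column for a single id).

```python
from typing import Callable, Iterable, List


def find_unreadable_ids(
    ids: Iterable[int], read_row: Callable[[int], object]
) -> List[int]:
    """Return the ids for which read_row raises, i.e. rows that cannot be read."""
    bad: List[int] = []
    for row_id in ids:
        try:
            # Reading one row at a time isolates a failure to a single id.
            read_row(row_id)
        except Exception:
            bad.append(row_id)
    return bad


def fake_read_row(row_id: int) -> str:
    """Stand-in reader for illustration: pretends rows 3 and 7 are corrupted."""
    if row_id in (3, 7):
        raise RuntimeError("simulated read failure")
    return "post body"


corrupted = find_unreadable_ids(range(1, 11), fake_read_row)
print(corrupted)  # [3, 7]
```

The ids collected this way are the candidates to clear, together with any rows in other tables that reference them.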

After resolving the corruption, we migrated the database to a dedicated PostgreSQL host instead of running it locally on the server, and brought the server back online on December 20th.

Going forward, we will be setting up frequent backups so that, in the case of a catastrophic database failure, we will also be able to roll back to an earlier state. We have also learned a lot more about debugging PostgreSQL databases. With these changes we hope to avoid future issues, but if they do occur we will have more options available and be able to respond more quickly.
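For readers wondering what a frequent backup might look like in practice, the following Python sketch builds a timestamped `pg_dump` command. It is only an assumption about one possible setup, not a description of our deployment: the database name `answers_db`, the directory `/var/backups/postgres`, and the helper `build_backup_command` are all hypothetical. The `pg_dump` options used (`--format=custom`, `--file`, `--dbname`) are standard ones.

```python
from datetime import datetime, timezone
from typing import List


def build_backup_command(dbname: str, backup_dir: str, now: datetime) -> List[str]:
    """Build a pg_dump invocation that writes a timestamped custom-format dump."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    outfile = f"{backup_dir}/{dbname}-{stamp}.dump"
    return [
        "pg_dump",
        "--format=custom",    # compressed archive, restorable with pg_restore
        f"--file={outfile}",
        f"--dbname={dbname}",
    ]


# Hypothetical database name and backup directory, for illustration only.
cmd = build_backup_command(
    "answers_db",
    "/var/backups/postgres",
    datetime(2017, 12, 20, 23, 54, tzinfo=timezone.utc),
)
print(" ".join(cmd))
```

A scheduler such as cron could run the resulting command at a fixed interval, and each dump could later be restored with `pg_restore` to roll back to that point in time.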

Posted almost 2 years ago. Dec 20, 2017 - 23:54 UTC

Resolved
Answers.ros.org appears to be up and stable with the new deployment. Closing the incident.
Posted almost 2 years ago. Dec 20, 2017 - 22:55 UTC
Monitoring
The database migration has been completed and answers.ros.org is back online. In the process of resolving the corruption we had to clear about 8 posts and their associated metadata. We will be monitoring the new deployment but encourage you to start using it. We'll keep this incident open until we've had a full weekday under full load without issue.
Posted almost 2 years ago. Dec 20, 2017 - 05:44 UTC
Update
We had some complications deploying onto the new hosting. We are now shooting for bringing up the website tomorrow, Wednesday.
Posted almost 2 years ago. Dec 19, 2017 - 23:50 UTC
Update
We have successfully exported the corrupted databases. The cleanup appears to have affected only 5 posts. We will be loading the export onto new hosting infrastructure tomorrow morning, with the goal of having the site back online before the end of tomorrow.
Posted almost 2 years ago. Dec 19, 2017 - 01:18 UTC
Identified
Answers.ros.org's database was corrupted in a recent hosting migration. The site has been taken offline. We are working to clear the corrupted data and will bring the site back online as soon as that is done.

This is a follow-up to: https://status.ros.org/incidents/jfql57hvr474
Posted almost 2 years ago. Dec 18, 2017 - 19:12 UTC