We're sorry about the outage of answers.ros.org. We know that it's a valuable resource that many of you search and reference every day.
On December 11th, we noticed degraded performance and identified the cause as high disk load from maintenance/migration operations on our shared hosting platform. Disk IO became the bottleneck for the server, and it started crashing. By December 13th, the repeated crashes caused by the disk IO limits had corrupted our PostgreSQL database.
With the help of several coworkers, we found a way to clear the corruption with minimal data loss on December 19th. The corruption occurred in the TOAST tables, which are where PostgreSQL stores large values such as long strings. The recommended way to recover is to clear the corrupted entries from the tables, which also required finding and clearing references in cross-referenced tables. Of the over 180,000 posts on the site, we found approximately 8 that were corrupted and had to be cleared; however, about 15 associated tables had to be purged of references for each post. So although the downtime was long, the data loss was minimal.
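Finding the corrupted rows amounts to forcing PostgreSQL to read every row's large columns and noting which reads fail. This is a minimal sketch of that scan, not the exact queries we ran: the `fetch_row` callable stands in for a real query (e.g. `SELECT * FROM post WHERE id = %s` via a driver such as psycopg2), and the table and column names are assumptions for illustration.

```python
# Sketch of a row-by-row scan to locate TOAST-corrupted rows.
# A real fetch_row would run a query selecting every column (to force
# detoasting) against the live database; reading a corrupted row raises
# an error like "missing chunk number 0 for toast value ...".
# Here fetch_row is abstracted so the scanning logic is self-contained.

def find_corrupted_rows(ids, fetch_row):
    """Return the ids whose rows raise an error when fully read.

    fetch_row(row_id) must read every column, not just the primary key,
    so that corrupted TOAST chunks are actually touched.
    """
    corrupted = []
    for row_id in ids:
        try:
            fetch_row(row_id)
        except Exception:
            # Reading this row's large columns failed: its TOAST data
            # is unreadable, so flag it for clearing.
            corrupted.append(row_id)
    return corrupted
```

Once flagged, each such row is deleted along with its references in the associated tables, which is the manual cross-reference cleanup described above.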
After resolving the corruption, we migrated the database to a dedicated PostgreSQL host instead of running it locally on the server, and brought the service back online on December 20th.
Going forward, we will be setting up frequent backups so that, in the case of a catastrophic database failure, we can roll back to an earlier state. We have also learned a lot more about debugging PostgreSQL databases. With these changes we hope to avoid future issues, but if they do occur we will have more options available and be able to respond more quickly.
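A frequent-backup setup of the kind described above can be as simple as a scheduled logical dump. This is an illustrative config fragment only: the schedule, host, database name, and paths are assumptions, not our actual configuration.

```shell
# Illustrative crontab entry (host, user, database, and paths are
# hypothetical): take a compressed pg_dump every night at 02:00.
# In crontab syntax, % must be escaped as \%.
0 2 * * *  pg_dump -Fc -h db.example.internal -U askbot askbot \
           -f /var/backups/askbot-$(date +\%Y\%m\%d).dump
# A dump in custom format (-Fc) can later be restored with:
#   pg_restore -d askbot /var/backups/askbot-YYYYMMDD.dump
```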