As we all know, failures are a perennial pain for operations and maintenance staff. I believe every operations engineer has one KPI: availability. High availability means few failures. Companies differ in their availability standards and in how they rate failures, but the ways to avoid failures are the same.
1、Every change needs a rollback plan, tested in an identical environment
Every change must have a rollback plan, and both the change and its rollback should be tested in an environment identical to the real one. Things you have never done before tend to hit you where you least expect it. Years of operations experience tell us that changes that have never been performed before carry the highest probability of error. So we need to make every change reversible, and plan how to return to the original state at any step that could go wrong. Good operations engineers stay away from operations that have no rollback plan. In a sense, operations is a discipline of experience and of trial and error.
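As a minimal sketch of a reversible change, the example below applies a set of SQL statements inside one transaction and rolls everything back if a statement fails or a post-change check does not pass. It uses Python's built-in sqlite3 module purely for illustration; the `config` table, the `max_conn` key, and both function names are hypothetical, and the statements are assumed to be data changes (INSERT, UPDATE, DELETE), since schema changes may not be covered by a transaction in every database.

```python
import sqlite3


def max_conn_is_positive(conn):
    """Hypothetical post-change check: the max_conn setting must stay above zero."""
    row = conn.execute(
        "SELECT value FROM config WHERE key = 'max_conn'"
    ).fetchone()
    return row is not None and int(row[0]) > 0


def apply_change_with_rollback(conn, statements, check):
    """Run all statements in one transaction.

    Commits and returns True only if every statement succeeds and
    check(conn) is true; otherwise rolls back and returns False.
    """
    try:
        for sql in statements:
            conn.execute(sql)
        if not check(conn):
            raise RuntimeError("post-change check failed")
        conn.commit()
        return True
    except Exception:
        # Return to the state before the change started.
        conn.rollback()
        return False


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE config (key TEXT PRIMARY KEY, value TEXT)")
    conn.execute("INSERT INTO config VALUES ('max_conn', '100')")
    conn.commit()

    bad_change = ["UPDATE config SET value = '0' WHERE key = 'max_conn'"]
    print(apply_change_with_rollback(conn, bad_change, max_conn_is_positive))
    print(conn.execute("SELECT value FROM config").fetchone())
```

Running the same script against a copy of the real data first is one way to rehearse both the change and the rollback before touching production.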
2、Be cautious about destructive operations
What counts as a destructive operation? For a database: DROP TABLE, DROP DATABASE, TRUNCATE TABLE, and DELETE of all data. Once these operations have run, it is almost impossible to roll the data back, and even where it is possible the cost is very high. Executing such a statement is trivial; recovering the data afterwards is very hard. These operations call for extra caution.
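One way to build in that caution is a small guard in front of whatever tool runs the SQL. The sketch below is an assumption of how such a guard could look, not a feature of any particular database client: it flags the statement types listed above with regular expressions and lets them through only when the operator has typed the name of the target. It is a heuristic, not a SQL parser, and the function names are hypothetical.

```python
import re

# Statements that destroy data outright.
_DROP_OR_TRUNCATE = re.compile(
    r"^\s*(DROP\s+(TABLE|DATABASE)|TRUNCATE\s+TABLE)\b", re.IGNORECASE
)
# DELETE is treated as destructive only when it has no WHERE clause.
_DELETE = re.compile(r"^\s*DELETE\s+FROM\b", re.IGNORECASE)
_WHERE = re.compile(r"\bWHERE\b", re.IGNORECASE)


def is_destructive(sql):
    """Heuristically decide whether a single SQL statement destroys data."""
    if _DROP_OR_TRUNCATE.match(sql):
        return True
    return bool(_DELETE.match(sql)) and not _WHERE.search(sql)


def may_execute(sql, typed_name, expected_name):
    """Allow destructive SQL only if the operator typed the exact target name."""
    if not is_destructive(sql):
        return True
    return typed_name == expected_name
```

For example, `may_execute("DROP TABLE orders", typed, "orders")` would only return true when the operator has typed `orders` again by hand, which forces a moment of attention before the statement runs.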
3、Set the command prompt
Set the prompt so that it tells you which database you are working on and which directory you are in. If you have multiple tabs open and every tab looks the same, it is easy to switch to the wrong one and operate on it by mistake. Once the prompt is set, the probability of this problem is much smaller.
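Most shells and database clients let you customize the prompt through their own settings. As a tool-neutral illustration, the sketch below builds a prompt string for a hypothetical in-house admin console written in Python; the function name, the prompt layout, and the idea of an environment tag such as PROD are assumptions, not taken from any specific client.

```python
import os
import socket


def build_prompt(database, environment, cwd=None, host=None):
    """Build a prompt showing environment, host, directory, and database."""
    cwd = cwd if cwd is not None else os.getcwd()
    host = host if host is not None else socket.gethostname()
    # Upper-case the environment so production stands out in every tab.
    return f"[{environment.upper()}] {host}:{cwd} ({database})> "
```

With this layout, two tabs connected to different databases no longer look the same, because the host and database name are visible before every command.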
4、Back up and verify backup validity
We need to take backups. But once you have a backup, can you sit back and relax? No. You need to verify that the backup is valid. No backup tool can guarantee that what it backed up will restore 100% correctly. So backing up is not only taking the backup but also verifying it; a backup that cannot restore the correct data is just a waste of space.
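The backup-then-verify idea can be sketched with Python's built-in sqlite3 module, used here only because it is self-contained. The example copies a database to a file, reopens the copy, runs SQLite's integrity check, and compares row counts with the source. The function names and the `orders` table are hypothetical, the table names are assumed to come from a trusted list, and a real verification would normally restore onto a separate machine and compare more than row counts.

```python
import os
import sqlite3
import tempfile


def backup_and_verify(source, backup_path, tables):
    """Back up `source` to backup_path, then check that the copy is usable."""
    dest = sqlite3.connect(backup_path)
    try:
        source.backup(dest)  # online backup, available since Python 3.7
    finally:
        dest.close()

    # Reopen the backup as if restoring it and inspect what is inside.
    restored = sqlite3.connect(backup_path)
    try:
        if restored.execute("PRAGMA integrity_check").fetchone()[0] != "ok":
            return False
        for table in tables:  # trusted names only: interpolated into SQL
            query = f"SELECT COUNT(*) FROM {table}"
            if source.execute(query).fetchone() != restored.execute(query).fetchone():
                return False
        return True
    finally:
        restored.close()


def demo():
    """Create a small database, back it up to a temp file, and verify it."""
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
    source.executemany("INSERT INTO orders (item) VALUES (?)", [("a",), ("b",)])
    source.commit()
    with tempfile.TemporaryDirectory() as tmp:
        result = backup_and_verify(source, os.path.join(tmp, "orders.bak"), ["orders"])
    source.close()
    return result
```

Calling `demo()` exercises the whole cycle; the important design choice is that the verification reads from the backup file itself, not from the source.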
5、Handovers and vacations are when failures are most likely; be cautious when making changes
The operations team and its engineers should stay as calm as possible during a change. When you take over someone else's work, confirm the change plan again and again. Asking for help is not a sign of incompetence. Before going on vacation, it is best to hand over everything in progress, ideally with a document stating what to do under which circumstances. When you take over work from someone who is on vacation, postpone whatever can be postponed; if something really has to be executed, tirelessly confirm the details of every step with the original operator.
6、Set up an alarm to get timely error information
Alarms let you know in time that something is wrong with the system. Performance monitoring gives you the historical performance data of the system: you can analyze what was happening when a fault occurred and confirm its real cause, or follow trends, spot the early signs of a fault, and optimize and adjust ahead of time. Alarms and performance monitoring are not completely independent; many performance monitoring items can also trigger alarms.
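To illustrate how one monitored metric can serve both purposes, the sketch below raises an alarm when the latest sample crosses a threshold and a softer warning when the recent average is rising toward it. The function name, the window size, the 80% warning level, and the message format are all assumptions chosen for the example, not recommendations from any monitoring product.

```python
from statistics import mean


def evaluate(metric, history, threshold, window=3):
    """Return alert messages for one metric from its history (oldest first)."""
    alerts = []
    if not history:
        return alerts
    latest = history[-1]
    if latest >= threshold:
        # Hard alarm: the current value is already over the limit.
        alerts.append(f"ALARM {metric}: {latest} >= {threshold}")
    elif len(history) >= 2 * window:
        # Trend check: compare the last window with the one before it.
        recent = mean(history[-window:])
        earlier = mean(history[-2 * window:-window])
        if recent > earlier and recent >= 0.8 * threshold:
            alerts.append(f"WARN {metric}: rising, recent average {recent:.1f}")
    return alerts
```

For instance, `evaluate("disk_used_pct", [60, 62, 64, 74, 76, 78], 90)` produces a warning before the disk ever reaches the 90% alarm level, which leaves time to adjust early.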