Sunday, 31 May 2015

Refactoring & Reliability

We rely on so many systems that their reliability is becoming more and more important.

Pay bonuses are determined by the output of performance review systems; research grants are handed out based on researcher tracking systems; and entire institutions may put their faith in yet more systems for visa enforcement.

The failure of these systems can lead to distress, financial loss, the closure of organisations, or even prosecution. Clearly, we want these systems to have a low rate of failures, be those failures design flaws or implementation defects.

Unfortunately for the developers of the aforementioned systems, they all have a common (serious) problem: the business rules around them are often in flux. The systems must therefore have the dual properties of flexibility and reliability -- and very often, these are in contradiction with one another.

Reliability requires requirements, specification, design, test suites, design and code review, change control, monitoring, and many other processes to prevent, detect, and recover from failures in the system. Each step in the process is designed as a filter for certain kinds of failure. Without these filters, failures can start creeping into a production system. The same filters, though, reduce a team's agility: its capability to respond to new opportunities and changing business rules.

On the other hand, the flexibility demanded by the team's environment is often attained through traditional object-oriented design, typically by writing to specific design patterns. If a system is not already in a state considered to be "good design," a team will apply refactorings.

Refactorings are small, semantics-preserving changes to the source of a system, made with the goal of migrating towards a better design. This sounds perfect: any analysis and testing which took place prior to the refactoring should still be valid [1]!
However, even though the semantics of the source are preserved (humans do occasionally make mistakes!), other observable properties of the program are not. Any formal argument made regarding correctness, or time, space, or power requirements, may no longer be valid after the refactoring.
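As a minimal sketch of what "semantics-preserving" means (the functions and figures here are hypothetical, not from any particular system), an Extract Method refactoring leaves the observable result unchanged while altering the code's structure and call graph:

```python
# Before: one monolithic function computing an invoice total.
def invoice_total_before(items):
    total = 0.0
    for price, qty in items:
        total += price * qty
    if total > 100.0:
        total *= 0.95  # bulk discount
    return total

# After: the discount rule extracted into its own function.
# The observable behaviour is identical, but the shape of the
# code -- and hence any argument made about it -- has changed.
def apply_discount(total):
    return total * 0.95 if total > 100.0 else total

def invoice_total_after(items):
    subtotal = sum(price * qty for price, qty in items)
    return apply_discount(subtotal)

items = [(20.0, 3), (25.0, 2)]
assert invoice_total_before(items) == invoice_total_after(items)
```

Both versions agree on every input, which is exactly why prior testing remains valid; what has changed is the structure a reviewer or prover must now reason about.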

Not only does refactoring undermine any previous formal argument, it can often make it more difficult to construct a new argument for the new program. This is because many refactoring techniques introduce additional indirection, duplicate loops, or use dynamically allocated objects. These are surprisingly difficult to deal with in a formal argument -- so much so that many safety-critical environments, for example SPARK Ada, simply do not support them, and many common standards aimed at safety-critical systems likewise ban them.
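To illustrate the point (a hypothetical sketch, not drawn from any real codebase), a Replace Conditional with Polymorphism refactoring trades a statically enumerable branch for dynamic dispatch and a dynamically allocated object:

```python
# Before: a flat conditional -- a proof can enumerate both cases
# directly at the point of use.
def shipping_cost_before(kind, weight):
    if kind == "standard":
        return 2.0 * weight
    return 5.0 * weight  # express


# After: the same rule behind dynamic dispatch.  The behaviour is
# unchanged, but a formal argument must now establish which concrete
# class is bound at each call site, and the object is allocated at
# run time -- both of which restricted environments may forbid.
class Standard:
    def cost(self, weight):
        return 2.0 * weight

class Express:
    def cost(self, weight):
        return 5.0 * weight

def shipping_cost_after(kind, weight):
    carrier = Standard() if kind == "standard" else Express()
    return carrier.cost(weight)


assert shipping_cost_before("express", 2.0) == shipping_cost_after("express", 2.0)
```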

I am not arguing against refactoring. I think it's a great tool to have in one's toolbox. I also think that like any other tool, it needs to be used carefully and with prior thought. I'd also shy away from the idea that just because something's important, it is critical. With a suitable development process, a development team can remain agile whilst still reducing the risk of a serious failure to an acceptable level.

In the end, it's exactly that -- a balance of risks. If a team is not responsive, it may miss out on significant opportunities. To mitigate this risk, teams introduce flexibility into their code through refactoring. To mitigate the risk of those refactorings causing a serious failure [2], the team should employ other mitigations, for example unit and integration testing, design and code review, static analysis, and so on. Ideally, to maintain the team's agility, these should be as automated and as integrated into standard development practice as possible. Every team and project is different, so each would need to assess which processes best mitigate the risks whilst maintaining that flexibility and agility.
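One lightweight, easily automated mitigation is a characterisation test: before refactoring, pin down the code's current observable behaviour so any accidental semantic change is caught. A small sketch (the function and test names are hypothetical), runnable directly or under a test runner such as pytest:

```python
# characterisation tests for a function about to be refactored

def normalise_name(name):
    """Collapse whitespace and capitalise each word of a name."""
    return " ".join(part.capitalize() for part in name.split())

# These assertions record today's behaviour.  If a refactoring
# changes any of it, the suite fails before the change ships.
def test_collapses_whitespace():
    assert normalise_name("ada   lovelace") == "Ada Lovelace"

def test_empty_input():
    assert normalise_name("") == ""

test_collapses_whitespace()
test_empty_input()
```

Because the tests assert only on externally observable behaviour, they survive the refactoring itself and become part of the automated filter the team runs on every change.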

[1] Fowler states that for refactorings to be "safe", you should have (as a minimum) comprehensive unit tests.
[2] Assuming that the "original" system wasn't going to cause a serious failure regardless.