How Systems Go Bad During Change

Posted under: From the Blog, 29th of April, 2010

Most business processes sit a knife's edge away from collapse these days, and it is only active listening to the process operators that keeps them out of the black hole.

Most standard IT-enabled processes, payroll for example, depend on the IT system getting it right for most transactions, so that the people (who cost the most) can fix up the smaller number of errors and exceptions with manual adjustments, workarounds and the like. IT does the standard stuff, and humans clean up the mess that IT makes or can't handle.

And under budget restraint, in most processes this labour balance sits on a knife edge, because we afford only just enough staff to handle the errors, in just enough time, to prevent the system melting down: melting down under mounting data errors, its core data becoming unusable, and those who depend on the system losing confidence as the frequency, persistence and impact of the errors grow.

Systems generally contain bad data and only approach accuracy at certain times: an accounting system comes close at balance and reconciliation, asset systems have their most accurate data just after the corrections made at stocktake, and so on.

Most of these periodic data cleansing exercises have become fundamental to keeping systems viable, holding them back from the edge of data collapse. Periods of standard processing corrode data, with software bugs and operator errors eating away at data quality. The system lumbers on, becoming less usable, waiting for the next periodic stocktake, reconciliation or other scrape-out of built-up data rust.
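To make the stocktake case concrete, here is a toy sketch of what the periodic scrape-out actually does: it measures how far the system's picture has drifted from reality and sizes the corrections. The asset register and stocktake figures are invented purely for illustration.

```python
# A toy example of the periodic "scrape-out": compare what the system believes
# it holds against an independent count, and size the corrections. The asset
# register and stocktake figures are invented for illustration.

system_register = {"laptops": 412, "monitors": 530, "phones": 198}   # what the system says
stocktake_count = {"laptops": 398, "monitors": 530, "phones": 205}   # what was actually counted

corrections = {
    item: stocktake_count[item] - system_register[item]
    for item in system_register
    if stocktake_count[item] != system_register[item]
}

drift = sum(abs(delta) for delta in corrections.values())
print(f"Corrections needed: {corrections}")
print(f"Records drifted since the last stocktake: {drift}")
```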

What happens during change is that the balance shifts, sometimes subtly but with big impact. A small rise in the error rate pushes the system past the operational people's ability to cope, and the system bogs down under the weight of its own errors. A payroll system that previously coped with a 20% error rate collapses at 25 or 30% because the pay staff can no longer keep up. Staff waiting for correct pay, who previously lived happily with the 20% error rate because their pay would be fixed in a reasonable and predictable time, now riot. Errors they used to tolerate become unacceptable as their confidence in the degrading system evaporates.
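The knife edge is really a capacity balance: errors arrive at some rate, and a fixed pool of staff can clear only so many per pay run. The sketch below is a toy illustration of that dynamic, not a model of any real payroll; the transaction volume and fixing capacity are invented numbers chosen only to show how a modest rise in error rate tips a stable backlog into runaway growth.

```python
# A toy sketch of the capacity balance, not a model of any real payroll.
# All figures (transaction volume, fixing capacity) are invented for illustration.

def simulate_backlog(error_rate, transactions=10_000, fix_capacity=2_200, runs=12):
    """Return the uncorrected-error backlog after each pay run.

    error_rate   -- fraction of transactions needing manual correction
    fix_capacity -- corrections the (fixed) pay staff can clear per run
    """
    backlog = 0
    history = []
    for _ in range(runs):
        backlog += int(transactions * error_rate)   # new errors arrive this run
        backlog = max(0, backlog - fix_capacity)    # staff clear what they can
        history.append(backlog)
    return history

for rate in (0.20, 0.25, 0.30):
    print(f"{rate:.0%} error rate -> backlog per run: {simulate_backlog(rate)}")
```

With these made-up numbers, a 20% error rate is absorbed indefinitely; at 25% the backlog grows by 300 every run, at 30% by 800, and it never comes back.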

I have often been struck by how remote IT people and organisational leaders can be from this knife-edge reality of most operational staff. Understanding and managing change requires operational knowledge that can often muddy and discomfort aspiring strategic thinkers.

The only trouble is that the rust and mud are real, and the data will always be corroding. What matters is how fast the corrosion is happening and how we are cleaning it up.

Fixing the errors is one thing, but restriking the balance matters more: why did the error rate go up, and what can we do to bring it back down? Most system changes use a pilot, parallel or staged implementation precisely so that the new system reduces error rates rather than increasing them.
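A parallel run is, at heart, a disagreement counter. The sketch below is a hypothetical illustration: legacy_pay, new_pay and the employee records are invented stand-ins, and the point is simply that the new system's error rate is something you measure against the old one under real inputs before you cut over.

```python
# A hypothetical parallel-run check: feed the same inputs through the old and
# new pay calculations and count disagreements. legacy_pay, new_pay and the
# employee records below are invented stand-ins, not anyone's real system.

def discrepancy_rate(employees, legacy_pay, new_pay, tolerance=0.01):
    """Fraction of employees for whom the new system disagrees with the legacy one."""
    mismatches = sum(
        1 for emp in employees
        if abs(legacy_pay(emp) - new_pay(emp)) > tolerance
    )
    return mismatches / len(employees) if employees else 0.0

# Stand-in data and pay rules: hours x rate, with the new system dropping the
# overtime loading by mistake.
employees = [{"hours": h, "rate": 30.0} for h in (38, 40, 45, 50)]
legacy_pay = lambda e: e["hours"] * e["rate"] + max(0, e["hours"] - 40) * e["rate"] * 0.5
new_pay = lambda e: e["hours"] * e["rate"]

print(f"Parallel-run discrepancy rate: {discrepancy_rate(employees, legacy_pay, new_pay):.0%}")
```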

Anyone implementing blindly, without operational knowledge of error rates under real load, is likely to finish in the mud.
