Recovery Best Practices
By: Dan Tragresser
With the reduction in staffing at power plants over the past two decades, many traditionally routine engineering and maintenance tasks have fallen by the wayside. With limited resources, operations and engineering personnel must focus their time and efforts based on priority. Quite often, keeping a unit online or quickly returning a unit to service takes priority over continuous improvement actions such as investigations and root cause analysis.
When a unit trips or experiences an event, the site incurs costs from lost production, regulatory penalties, and, where applicable, outage scope, hardware replacement, and the purchase of make-up power. These costs can quickly make returning to service the only priority. Unfortunately, reviewing event operational data and event precursors, and collecting evidence during unit disassembly, very often fall below the priority of returning to service. Collecting or re-creating evidence after the fact is nearly impossible, so this lack of priority often means the root cause of the trip or event is never understood.
Within large, complex plants and turbomachinery, trips or minor events are common, but are rarely isolated, one-off events. Many trips and events are repetitive in nature, and worse, are early indications of a more serious event to come. While the cost of delays in returning to service may be high, the cost of not solving the root cause may be orders of magnitude higher, particularly if a failure event happens a second time.
For unit trips, best practices include:
- Hold regular, cross-functional trip reviews.
- Where possible, consider holding reviews across similar sites within a parent company.
- Utilize knowledge and solutions that may already have been developed.
- Trend trip events and frequency over a 1-to-3-year period.
- Measure the success of prior projects based on the reduction of occurrences or elimination over a multi-year period.
- Trips may be seasonal in nature, and recurrence may span timeframes greater than one year.
- Review each trip as a near miss and assess potential consequences that may not have occurred this time.
- Consider including trip investigation in site or corporate level procedures and celebrate successes.
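The trending practice above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the log entries, cause categories, and function names are hypothetical placeholders, and a real site would load trips from its own records.

```python
from collections import Counter
from datetime import date

# Hypothetical trip log: (trip date, assigned cause category).
# These entries are placeholders for illustration only.
TRIP_LOG = [
    (date(2021, 1, 14), "flame detection"),
    (date(2021, 7, 2), "exhaust spread"),
    (date(2022, 1, 20), "flame detection"),
    (date(2022, 8, 11), "lube oil pressure"),
    (date(2023, 1, 9), "flame detection"),
]


def trips_per_year(log):
    """Count trips per calendar year, sorted by year."""
    counts = Counter(trip_date.year for trip_date, _ in log)
    return dict(sorted(counts.items()))


def recurring_causes(log, min_years=2):
    """Map each cause seen in at least `min_years` distinct years
    to the sorted calendar months in which it occurred."""
    years_by_cause = {}
    months_by_cause = {}
    for trip_date, cause in log:
        years_by_cause.setdefault(cause, set()).add(trip_date.year)
        months_by_cause.setdefault(cause, set()).add(trip_date.month)
    return {
        cause: sorted(months_by_cause[cause])
        for cause, years in years_by_cause.items()
        if len(years) >= min_years
    }


if __name__ == "__main__":
    print(trips_per_year(TRIP_LOG))
    print(recurring_causes(TRIP_LOG))
```

A cause that returns in the same month across several years, as the placeholder "flame detection" entries do here, is the kind of seasonal pattern that a single-year review would miss.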
For unit events, the cost of an event requiring an outage and hardware replacement, not including the purchase of make-up power, can very quickly escalate to millions of dollars. Compare that cost to the cost of a dedicated, independent resource for the time required to perform a comprehensive investigation. Also consider the cost of the investigation versus the cost of recurrence, or of a similar event with more serious consequences. The cost of the resource and investigation will almost always be lost in the noise of the overall cost. Best practices include:
- In nearly all cases, site and outage resources will be dedicated to the speedy rehabilitation of the unit.
- Critical evidence is often lost or destroyed, unintentionally, based on the need to return to service quickly.
- A dedicated, independent resource provides the best option to ensure that useful evidence is collected.
- Assign a dedicated, independent resource to collect and review data and findings.
- If a site resource is not available, borrow from a sister site or corporate team, ideally someone with an outside perspective and not necessarily an expert in the field.
- Consider an external independent resource such as an industry consultant.
- It will likely require a team to complete the overall root cause analysis; however, the likelihood of success will be much greater if facts and details are collected by a dedicated resource.
- Initial steps as a dedicated, independent resource:
- Ensure that backups of controller and DCS data and alarm logs are completed before the records time out.
- Interview individuals who were on site at the time of the event and/or in the days prior.
- There is no such thing as too many pictures. It is common to find a critical link or detail in the background of a picture taken for another reason.
- Clearly articulate hold points at which the independent resource will require inspections or data collection through the disassembly process.
- Collect and preserve samples and evidence.
- Where available, utilize other fleet assets to enable a detailed causal analysis with corrective and preventative actions.
- Demonstrating a commitment to fleet risk reduction can minimize impacts with regulators and insurers.
- Once an event occurs, the site's limited resources will be fully occupied; creating an investigation plan at that point is too late.
- Discuss with site insurers whether the cost of an investigation can be included in an event insurance claim, and what they would expect in order to cover that cost.
- Maintain a list of resources, internal and external, to call upon as dedicated, independent resources.
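One way a dedicated resource might support the data backup and evidence preservation steps above is to record a checksum manifest of every exported file as soon as it is collected. The sketch below is an assumption-laden illustration, not a described procedure: it assumes the controller and DCS exports, alarm logs, and photos have already been copied into a single folder, and the function names are hypothetical.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(evidence_dir):
    """List every file under `evidence_dir` with its size and checksum."""
    root = Path(evidence_dir)
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        entries.append({
            "file": path.relative_to(root).as_posix(),
            "bytes": path.stat().st_size,
            "sha256": sha256_of(path),
        })
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "files": entries,
    }


def changed_files(evidence_dir, manifest):
    """Return manifest entries that are now missing or no longer match."""
    root = Path(evidence_dir)
    changed = []
    for entry in manifest["files"]:
        path = root / entry["file"]
        if not path.is_file() or sha256_of(path) != entry["sha256"]:
            changed.append(entry["file"])
    return changed
```

Storing the manifest separately from the evidence folder lets the investigation team later confirm that nothing was altered or lost during the return-to-service work.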
Identifying the root cause of an event might be cumbersome, but it is far less cumbersome than dealing with the same type of event on a recurring basis.
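The comparison between investigation cost and recurrence cost can be made concrete with simple arithmetic. The sketch below is only illustrative: every figure in it is a hypothetical placeholder rather than data from any site, and the expected-cost model (a fixed annual recurrence probability applied linearly over a planning horizon) is a deliberately crude assumption.

```python
def investigation_share(investigation_cost, event_cost):
    """Investigation cost as a fraction of the cost of one event."""
    return investigation_cost / event_cost


def expected_recurrence_cost(event_cost, annual_probability, years):
    """Rough expected cost of repeat events over a planning horizon.

    Assumes the same recurrence probability each year and treats
    years as independent; a simplification for illustration only.
    """
    return event_cost * annual_probability * years


if __name__ == "__main__":
    # Hypothetical placeholder figures, not taken from any real event.
    event_cost = 3_000_000        # outage plus hardware replacement
    investigation_cost = 60_000   # dedicated, independent resource
    annual_probability = 0.2      # assumed chance of recurrence per year
    horizon_years = 3

    share = investigation_share(investigation_cost, event_cost)
    exposure = expected_recurrence_cost(
        event_cost, annual_probability, horizon_years
    )
    print(f"Investigation is {share:.1%} of one event's cost")
    print(f"Expected recurrence exposure: ${exposure:,.0f}")
```

Substituting a site's own estimates for these placeholders gives a quick sense of whether the investigation cost is small relative to the exposure it is meant to reduce.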
Structural Integrity has team members and laboratory facilities available to support event investigations and to act as independent consultants on an emergent basis.