Introduction
The reasons why Antarctic subglacial lakes exist are now well understood. In essence, their existence stems from a combination of three factors: (1) there is a reduction of the ice melting temperature at the base, caused by the weight of the ice above; (2) the existence of an ice sheet acts as an insulator, protecting the base from the ultra-cold temperatures at the surface; and (3) the heat present under the ice can cause the ice to melt (Siegert and others, 2007). Studies of accreted ice in the lower part of the Vostok Subglacial Lake borehole have provided insights into the geomicrobiology (Priscu and others, 1999) and lake water physical, chemical and biological processes (Siegert and others, 2001) without direct access to lake waters. Several major research projects are now at the stage of planning for access to subglacial lakes with measurement and sampling instruments, namely at Ellsworth (Mowlem and others, 2011), Vostok (Lukin and Bulat, 2011) and Whillans lakes (Priscu and others, 2010; Fricker and others, 2011).
Both the Scientific Committee on Antarctic Research (http://www.scar.org/researchgroups/lifescience/SCAR_CoC_SAEs.pdf), via a sub-committee denoted as SALE (Subglacial Antarctic Lake Environments), and the US National Research Council Committee on Principles of Environmental Stewardship for the Exploration and Study of Subglacial Environments (http://www.nap.edu/catalog.php?record_id=11886) recommend that any probe deployment in a subglacial lake should not contaminate the ecosystem. Clean access is also a scientific requirement, as it will ensure the integrity of the observations and the validity of the conclusions drawn. However, clean access presents huge engineering challenges in these environments, which are likely to be met only if two conditions are fulfilled: the borehole is drilled using a clean drilling method; and the probe and all components that it interacts with are fully cleaned prior to access to the lake. It is not surprising that accessing subglacial Antarctic lakes with science probes is a multi-risk engineering project, where technical or operational failure may be catastrophic to the campaign and project. The probe may fail to operate, it may be lost or it may contaminate the lake. In addition, delay in the deployment may incur the risk that the hole may have partially closed by the time the probe is ready to deploy.
It is therefore imperative to define a framework that can be used to quantify the probe operational risk, which includes its availability and reliability. Such a framework is needed to allow scientists and engineers to make risk-informed decisions prior to and during the deployment. While such frameworks have been used in other domains (e.g. in infection control by Nicas and Sun, 2006), there is a dearth of peer-reviewed literature on the reliability of instruments and methods in glaciology. The focus has been on the reliability of scientific measurements, for example the reliability of ice-coring analysis (Alley, 2010).
In this paper, we present a formal approach for estimating the risk of a science probe deployment into a subglacial lake. The method is particularly relevant for deployments where the borehole cannot be permanently maintained and thus the availability and reliability requirements for the probe must be high. The motivating setting for this work has been the plan to deploy a probe into Ellsworth Subglacial Lake (ESL) through a hole made by a hot-water drill, in 2013 (Siegert, 2005). Here we show how the three components of a formal approach can be applied to assess the likely success of the probe deployment. Risks with drilling the access hole are discussed elsewhere (Brito and others, in press); while these are not insignificant, hot-water drilling has been applied successfully in a number of projects (e.g. Engelhardt and others, 2000) and the issues are known through field experience. No such field experience exists for a clean probe to access a subglacial lake.
Risk Estimation Process for Probe Deployment
This section presents the formal approach for estimating scientific probe risk for subglacial lake exploration. The method combines three techniques: first, modelling the deployment sequence using a state diagram based on Markov chains (Feller, 1950); second, estimation of the transition probabilities between states using fault tree analysis (FTA); and finally, estimation of the probability of failure of individual failure modes, using expert judgment.
It is important to define risk in the context of this work from the several interpretations available. The most frequently referenced interpretation is the definition introduced by Kaplan and Garrick (1981), where risk is defined by three parameters: scenario $S_i$, likelihood $L_i$ and consequence $C_i$. Here the interpretation of risk changes, depending on whether availability or reliability is being addressed. When discussing availability (the probability that the equipment will be ready in an appropriate phase when required), risk is defined by scenario $S_i$, which is the current state of deployment, and the likelihood $L_i$, which is the probability of a single transition or a set of transitions taking place. When discussing reliability, risk is defined by scenario $S_i$, here the failure scenario, and likelihood $L_i$, which is the probability of failure due to scenario $S_i$.
Markov chain representation of the probe deployment
The probe deployment consists of a sequence of phases, with several risks foreseen at each phase. Markov chain theory is useful for modelling a problem where there is a sequence of events and the probability of an event taking place depends only on the outcome of the preceding trial (Feller, 1950). Given that the transition probability $p_{j,k}$ associated with a pair of events $(E_j, E_k)$ is known, where $E_k$ is the event of interest and $E_j$ is the event that took place in the previous trial, and given that the probability $a_j$ of being in the initial state $E_j$ is also known, the probability of going from initial state $E_j$ to the next state $E_k$ is computed as

$$P\{(E_j, E_k)\} = a_j \, p_{j,k}.$$

For the general case, considering a sequence of many transitions, given that event $E_{j_0}$ precedes $E_{j_1}$, which precedes $E_{j_2}$, and so on for the remaining events, the joint probability distribution is computed using

$$P\{(E_{j_0}, E_{j_1}, \ldots, E_{j_n})\} = a_{j_0} \, p_{j_0,j_1} \, p_{j_1,j_2} \cdots p_{j_{n-1},j_n}.$$
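The product rule for a sequence of transitions can be illustrated with a short Python sketch. It multiplies the initial-state probability by the transition probability of each successive step. The three-state chain, its state labels and all numerical values are hypothetical and chosen only for illustration; they are not taken from the deployment model.

```python
def path_probability(initial, transitions, path):
    """Probability of visiting the states in `path` in the given order.

    initial     -- dict: state -> probability of starting in that state
    transitions -- dict: (from_state, to_state) -> transition probability
    path        -- sequence of state labels, first entry is the start state
    """
    p = initial.get(path[0], 0.0)
    # Multiply by one transition probability per consecutive pair of states.
    for src, dst in zip(path, path[1:]):
        p *= transitions.get((src, dst), 0.0)
    return p


# Hypothetical chain: "A" may repeat (rework) before moving on to "B", then "C".
initial = {"A": 1.0}
transitions = {
    ("A", "A"): 0.1, ("A", "B"): 0.9,
    ("B", "A"): 0.2, ("B", "C"): 0.8,
}

direct = path_probability(initial, transitions, ["A", "B", "C"])
with_retry = path_probability(initial, transitions, ["A", "A", "B", "C"])
print(direct, with_retry)
```

A path that includes one repeat of the first state is less likely than the direct path by exactly the self-transition probability, which is how rework loops enter the calculation.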
Combining estimates of failure modes
Fault tree analysis is a well-established method for establishing the reliability characteristics of systems. It is a method in which the probability of an event is estimated using Boolean logic to combine probabilities of occurrence of lower-level events. The low-level events may, among others, be individual component failures, human error or design faults (O’Connor, 1995). FTA allows ranking of failure modes and identification of the minimal cut-set; this is the shortest credible way through the tree from fault to initiating event.
Estimating probability of failure
In engineering, typically, component reliability predictions are made based on data abstracted from databanks or collected from results of specific reliability testing (e.g. USDOD and Rome Laboratories, 1991; Offshore Reliability Data (OREDA) database from the offshore oil and gas industry at http://webshop.dnv.com/global/category.asp?c0=2631). This is only possible if component failure data are either readily available or tests can be performed on many samples in a realistic environment. However, the combination of the polar environment, a borehole hundreds or thousands of metres deep and subglacial lakes presents an extreme environment impossible to recreate realistically in the laboratory. In addition, the instrumentation and support equipment will largely be purpose-made. Hence failure data, as frequency of occurrence from databases or tests, are not available. As a result, reliability prediction must be based on the third technique, that of eliciting experts’ judgments (O’Hagan and others, 2006). The use of expert subjective judgments for reliability prediction was first introduced in the nuclear engineering industry in the 1950s; Keeney and von Winterfeldt (1991) provide a review. Since then, the approach has been adopted by many industries, and applied to space exploration systems. More recently it has been introduced for risk assessment and management of autonomous underwater vehicles (AUVs) and their missions, particularly for missions under floating ice shelves (Brito and others, 2010).
Ellsworth Probe Deployment Process
The exploration of ESL is a UK-led multidisciplinary project that involves cooperation across 15 institutions (http://www.geos.ed.ac.uk/research/ellsworth/members.html). For the probe and associated systems, the key project partners are the UK National Oceanography Centre (NOC), the British Antarctic Survey (BAS) and the University of Edinburgh. Details of the infrastructure and logistics required to support the probe deployment have been described elsewhere (Siegert and others, 2007; Woodward and others, 2010). Reviews of the hot-water drilling system designed and deployed by BAS are given by Makinson (1993, 1994).
The Ellsworth probe (Fig. 1) is 5.17 m long and 0.20 m in diameter, and is expected to weigh ~250 kg. The outer cases are made of titanium. Power will be supplied from generators on the surface as high-voltage direct current through cables in the tether. The probe is divided into five gas-filled pressure cases. The upper pressure case contains the power and communications link to the tether. In the middle of the probe there are three pressure cases, each containing the electronics, couplings and actuator motors for a free flooded sampler system consisting of a set of eight pressure-tolerant water bottles, with a capacity of 100 mL per bottle, a filter stack for the collection of lake water filtrand, and two in situ gear pumps to drive fluid into the samplers. The pressure case at the bottom of the probe contains electronic sensor control systems, sensors and a short gravity core sediment sampler. An upward-pointing acoustic altimeter, high-definition (HD) camera and lights will be mounted on top of the upper pressure case. This system is replicated at the bottom of the probe, looking downwards.
The sediment sampler will capture at least one sample of sediment of 1 cm diameter and 20 cm length. The filtrand samplers will filter 200 L of lake water onto 0.2 µm filters. All samplers have been purpose-designed at NOC. In situ measurements will be conducted using a series of commercially available sensors. These sensors will measure: (1) pressure (two sensors for redundancy); (2) temperature (two sensors for redundancy); (3) conductivity (two sensors for redundancy); (4) oxygen by Clark Electrode; (5) pH; (6) redox potential, Eh (one bespoke sensor (doi: 10.1029/2009JC005776) and one commercial (Idronaut) sensor); and (7) oxygen by fluorescence lifetime optode.
During deployment the probe will be connected to the surface control station via the tether. An Aramid polyester fibre tether links the top of the probe, the top sheave (of the gantry) and the winch system. Embedded in the tether are optical fibres and the electrical cables. The outer layer of the tether is made of polyurethane, for mechanical protection, to provide a cleanable surface and to prevent microbial egress from inside the tether. The probe will be lowered at a speed of ~1 m s⁻¹.
Markov Chain Model of Ellsworth Probe Deployment
Each phase of the deployment consists of a series of many activities and these are prone to faults or human error. The Markov chain captures and illustrates the desired sequence of events for a successful deployment, and also provides paths for any necessary rework if a transition between states is not successful.
The Markov chain that represents the probe deployment is generic; it is not specific to ESL. However, the failure modes in each transition might differ for different probe designs or deployment methods. The Markov chain comprises 13 states (X1-X13), corresponding to 13 deployment phases (Fig. 2; Table 1). State names are abbreviated to facilitate model description and subsequent analysis.
The starting point is a built and laboratory-tested probe. The deployment starts at the systems testing phase (St), at sea, being a more amenable environment to rectify problems than the Antarctic. All systems are run as if they were performing a real deployment; instruments, video cameras, sediment and water samplers and communications are tested. This will be, in some ways, a science mission; samples will be returned to the laboratory for analysis and the probe’s instrument parameters compared with those measured by standard oceanographic instruments. If the probe passes the systems test, it is disassembled and individual components are cleaned before reassembling. Thus, following successful systems testing (St) the deployment moves to the state of pre-deployment cleaning (Pc). The probe is cleaned both outside and inside. Components are individually cleaned using wet or dry methods, depending on the material. Components made of titanium (e.g. the outer case) can be autoclaved. This is a robust cleaning process where the component is steamed at 145°C for 40 min. Electronic components are cleaned using hydrogen peroxide vapour (HPV; Mowlem and others, 2011). Once the probe is cleaned and assembled, the system undergoes a pre-deployment functional test (Pt). Only the electrical systems can be tested during this phase. The probe is kept in an environment that has been cleaned to reduce microbial population. If the test is successful, the whole system of probe, tether and winch is transported from the UK to the science camp in two stages.
Neither the final cleaning after probe manufacture, nor any on-site cleaning (see below) will be verified by direct testing. Instead we will pre-qualify our cleaning procedures to ensure that the required microbial population reduction can be guaranteed. (To qualify each process we will purposely contaminate (positive control) test coupons to which the process is applied. This raises the microbiological population above the detection limit of our assessment techniques both before and after our cleaning processes. Thus an accurate measurement of the reduction achieved and the reliability of the method can be made.) This removes the risk that accessing and sampling the equipment itself introduces contamination. Only if a visible or obvious fault (e.g. a tear in a bag, a broken seal, a fault reported by the HPV machine) is encountered will a process be recognizable as having failed in these final stages.
For the first stage, on board a ship to Punta Arenas, Chile, the probe will be in a single container. The container will be isolated and split into cleaned and sealed compartments. The probe will be kept inside a cleaned plastic bag, which in turn will be inside a hard case; all these components will have been cleaned prior to transportation. The case will be insulated and heated. The temperature inside the case will be monitored during transportation. The corers will also be transported in cleaned boxes. The minimum temperature in the box should be –20°C and the maximum 40°C. If the temperature in the box drops below –20°C the probe cannot be used because this may damage the sensors or sample bottles through freezing of electrolyte in the sensors or water held as a control in one sample bottle in each sampler carousel. Once the system arrives in South America, the system will be tested; this is the post-stage 1 transportation test (Trt). The aim is to ascertain if, during transport, the system suffered unacceptable temperatures or vibrations. If temperatures or vibrations exceeded set limits, the probe would be removed from the case and an electronic test of the system carried out. The second stage of the transportation takes it to the science camp. This will be via a combination of an aircraft flight to the Ellsworth Mountains plus tractor train to the lake site. Once the probe arrives at the science camp, the probe will undergo an on-site test (Ot). This involves connecting the probe to a test lead, and then to the tether through the winch, and switching the probe on. The electronic status of each sensor is tested. There is a risk that the system may be contaminated during the on-site testing phase (Ot). If so, the probe or the individual component will have to be re-cleaned, and the process moves to the on-site cleaning (Oc) state.
The on-site cleaning will consist of using HPV, ultraviolet (UV) radiation treatment and ethanol wipes; it will not be as complete as the predeployment cleaning. Some components treated in the laboratory cannot be cleaned in the field.
The next phase is the sheave deployment (Sd). At this time the 36 cm diameter borehole will have been completed, and will refreeze at ~6 mm h⁻¹, giving ~26 hours for the probe deployment and recovery. This stage is therefore extremely time-critical. The sheave is kept in a protective environment to avoid contamination, and is integrated with the winch container. When the probe is ready, the sheave is positioned above the borehole by moving the winch container on rails to stops marking the correct position. During this phase the borehole is capped with a gland that is securely fixed to the structure. The gland includes a shut-off valve and a lining through the porous firn ice to enable a degree of blowout protection (Brito and others, in press). If a failure takes place during this phase, the sheave and container are retracted, cleaned and functional tests are conducted, hence the transition $p_{7,6}$ in Table 2.
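As a rough consistency check on the quoted time window, the closure time can be estimated under two assumptions that the text does not state explicitly: that the refreezing rate refers to the reduction in borehole diameter, and that the limiting clearance is set by the 0.20 m probe diameter given in the probe description. Under those assumptions the hole closes onto the probe after approximately

$$t \approx \frac{360\ \mathrm{mm} - 200\ \mathrm{mm}}{6\ \mathrm{mm\,h^{-1}}} \approx 26.7\ \mathrm{h},$$

which is in line with the ~26 hours available for deployment and recovery.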
During the probe positioning (Pp) phase, the probe is craned into place and the bottom of its protective case attached to the well head. The pre-deployment phase (Ppr) is arguably the most complex phase of the deployment. In simple terms: the outer protective hard probe case is removed, revealing a soft flexible inner case with valves and airlocks at either end; the valve that gives access to the borehole is opened via an airlock and the probe is partially lowered using the crane into the borehole; the sheave container is connected to the top of the flexible probe case (via a flexible polymer tube and another airlock) and the probe is connected to the tether (using an integral glove box). There are also cleaning processes where all valves, airlocks and the glove box are cleaned, using alcohol wipes, heating to above 0°C and treatment with HPV. UV treatment is used as a reserve method if the HPV machine fails. The probe and tether are then lowered through the gland and the probe deployed (Pd). The probe is lowered to the lake, the samplers are remotely controlled from the surface and so are all other sensors.
After hauling in, the probe is recaptured (Pr) into its protective case and handed over to the science team; this is the handover state in the Markov chain (Ho). A large sediment corer will then be deployed (Siegert and others, 2012) and if the ice hole allows, a second probe may be deployed, going through the same process as the first probe. There is an event associated with each transition probability; a brief summary is presented in Table 2. Above, the successful transitions have been described; unsuccessful transitions loop back to the same state or to a previous state.
Estimating Reliability of the Probe and Processes
Having established the topology of the Markov chain, the next task is to estimate the values of the transition probabilities. A transition probability of 1 means that there is a 100% chance that the deployment will move from one phase to the next and the probability of failure is 0. The probability of failure $F(t)$ for a transition is a measure of the reliability $R(t)$ of the components and processes deployed in that particular phase:

$$R(t) = 1 - F(t),$$

where $F(t)$ is the cumulative failure distribution. For example, the probe pre-deployment phase, X9, described earlier, involves a sequence of steps with several possible failure modes, including ‘top valve fails to open’, ‘winch failure caused by human error’ and ‘failure to clean the latex membrane using UV treatment’. The central predicament is how to calculate the probability of a successful transition from state X9 to X10 based on the probabilities of multiple failure modes. Fault tree analysis (FTA), event tree analysis (ETA) and failure mode and effect analysis (FMEA) are techniques commonly used to elicit and structure failure modes that are likely to drive the probability of failure of a system (O’Connor, 1995). We chose FTA because its graphical structure provides the best insight into system functionality.
Fault tree analysis
FTA uses a graphical logical representation of the system functionality to represent fault propagation in complex systems. The graphical representation consists of one end event (rectangle; this is the event for which we are interested in computing the probability of failure), a number of base events (triangles) and a number of logical AND and OR gates. Base events are not dependent on any other events or combination of events. Once the probabilities of failures for all base events have been estimated, the mathematical model encoded by the graphical structure can be used to calculate the probability of failure for the end event, in this example $p_{9,9}$. Figure 3 shows the fault tree for this phase. The transition probability $p_{9,10}$ of the desired outcome is computed from the probability of failing the deployment, as indicated in Figure 2, i.e. $p_{9,10} = 1 - p_{9,9}$.
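The way AND and OR gates combine base-event probabilities into an end-event probability can be sketched in a few lines of Python. The sketch assumes that base events are statistically independent, which is the usual simplification and is not stated explicitly here. The tree structure, event names and probabilities are hypothetical and do not reproduce the fault tree of Figure 3.

```python
from math import prod


def or_gate(probs):
    """Gate output fails if at least one (independent) input fails."""
    return 1.0 - prod(1.0 - p for p in probs)


def and_gate(probs):
    """Gate output fails only if every (independent) input fails."""
    return prod(probs)


# Hypothetical base-event failure probabilities (illustrative values only).
valve_fails = 0.01
primary_clean_fails = 0.05
reserve_clean_fails = 0.05

# A primary method backed by a reserve method fails only if both fail.
cleaning_fails = and_gate([primary_clean_fails, reserve_clean_fails])

# The end event occurs if either the valve or the cleaning step fails.
p_fail = or_gate([valve_fails, cleaning_fails])

# Probability of a successful transition to the next state.
p_success = 1.0 - p_fail
print(p_fail, p_success)
```

In this illustration the redundant pair contributes far less to the end-event probability than the single valve, which shows how a fault tree exposes the dominant failure modes.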
The base events on the fault tree each capture a potential failure mode. Some failure modes are of mechanical actuation or of a structural nature; some are related to electronics and software systems. Human errors are also captured.
Formal Expert Judgment Elicitation
The probability judgments for each failure mode were obtained during eight workshops, of 3.5 hours each, spread over a 2 month period from October to December 2010. Formal judgment elicitation is a technique used to ensure that experts are properly trained in the task of assigning probability to events. A formal elicitation is needed to ensure that the process is transparent and repeatable and that biases are removed (Keeney and von Winterfeldt, 1991). We adopted the SHeffield ELicitation Framework (SHELF) package (http://www.tonyohagan.co.uk/shelf/index.html) because the aim is to elicit a probability distribution rather than a single probability judgment, allowing the capture of uncertainty. In addition, the SHELF method encourages experts to agree on the final judgment, denoted as behavioural aggregation. The debate and intensive questioning seeks to remove any bias introduced by individual judgments. A summary on how the SHELF package was applied to estimate the risk posed by each failure mode is presented below.
Expert selection
There are different schools of thought as to what is the ideal number of experts for a judgment elicitation exercise. At one extreme, attempting to quantify the probability of failure of a nuclear power station safety valve, Keeney and von Winterfeldt (1991) elicited judgments from 40 experts. Cooke and Goossens (2000) state, ‘in general the maximum number of experts should be used, but at least four’. Experts should be chosen to obtain a spectrum of views and to cover a range of specialisms. An expert is someone that has great knowledge of the subject matter; however, expertise also involves how the person organizes and uses that knowledge (O’Hagan and others, 2006, p. 27). The SHELF package does not specify a minimum number of experts, but research has shown that a group of experts does not perform better than the best expert in the group (the difficulty being knowing who is the best expert). The group of experts for this task was formed by engineers and scientists working on the project. A total of ten experts took part in the judgment elicitation exercise: four mechanical engineers (with a combined 48 years’ experience in oceanography); three electronic engineers (with a combined 48 years’ experience); one postdoctoral scientist in chemistry (3 years’ experience); the principal scientist (10 years’ experience); and an external mechanical engineer from BAS (30 years’ experience) provided assessments for the crane failure modes.
Background information
Experts were briefed on the purpose of the elicitation process in a 1 day seminar. All experts and project managers took part in this exercise.
Training
Experts were trained on basic probability concepts. The elicitation was conducted by the facilitator (Brito). Experts were briefed on judgment elicitation, some fallacies of assigning probabilities to events (e.g. biases) and how a formal elicitation process aims to address fallacies (Tversky and Kahneman, 1974; Kynn, 2008). This was followed by an expert judgment elicitation exercise facilitated by one of the experts for three failure modes. Experts were asked to assess optical connector failure, optical multiplexer failure and optical fibre failure.
Eliciting probabilities
Experts were asked to provide a plausible range for the probability of failure P(X). For each failure mode, they were asked to estimate the lower bound L and the upper bound U of P(X), so that L < P(X) < U. During this stage, experts discussed factors pertaining to each failure mode and how, in the worst-case scenario, these factors could increase P(X), and also how, in the best-case scenario, the probability of failure could be reduced. Experts had to agree that P(X) was in this range. Next, experts were asked to draw the shape of a unimodal probability distribution. Based on the shape of the probability distribution, they were asked to individually identify the median (M) and to record evidence that supported their assessment. In the next step the experts were asked to assign the lower quartile (LQ) and upper quartile (UQ). The lower quartile was chosen so that the intervals L < P(X) < LQ and LQ < P(X) < M each carried a probability of 0.25. Similarly, the upper quartile was chosen so that the intervals M < P(X) < UQ and UQ < P(X) < U each carried a probability of 0.25.
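The five elicited values define four intervals of equal probability, which is enough to sketch a cumulative distribution. The Python sketch below joins the points with straight lines; this piecewise-linear form is a deliberately simple stand-in chosen for illustration, not the parametric fitting performed with the SHELF package. The function name and the numerical values are hypothetical.

```python
from bisect import bisect_right


def elicited_cdf(lower, lq, median, uq, upper):
    """Return a piecewise-linear CDF through the five elicited points.

    Each of the four intervals between consecutive points carries
    a probability of 0.25, as in the quartile-based elicitation.
    """
    xs = [lower, lq, median, uq, upper]
    if any(a >= b for a, b in zip(xs, xs[1:])):
        raise ValueError("values must satisfy L < LQ < M < UQ < U")
    qs = [0.0, 0.25, 0.5, 0.75, 1.0]

    def cdf(x):
        if x <= lower:
            return 0.0
        if x >= upper:
            return 1.0
        i = bisect_right(xs, x) - 1          # interval containing x
        frac = (x - xs[i]) / (xs[i + 1] - xs[i])
        return qs[i] + 0.25 * frac           # interpolate within the interval

    return cdf


# Hypothetical judgments for one failure mode (illustrative values only).
cdf = elicited_cdf(lower=0.001, lq=0.005, median=0.01, uq=0.02, upper=0.05)
print(cdf(0.0075), cdf(0.01), cdf(0.03))
```

Plotting such curves for each expert gives a quick visual check of where individual judgments disagree before the group discussion.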
Behavioural aggregation
The cumulative distributions for each expert’s judgments were plotted using the median, upper and lower quartile. Experts were then asked to discuss the reasons supporting their assessments, to jointly specify M, LQ and UQ, and to record any disagreements.
Subglacial Lake Ellsworth Probe and Deployment Processes Reliability
This section presents and describes the results, phase by phase, of combining the elicitation process with fault trees, focusing on the final probability of failure for each transition.
The assessments for each failure mode were propagated in a fault tree, as in Figure 3, before fitting a beta distribution to the parameters of each probability distribution. As an example, Figure 4 shows the fitted cumulative probability distributions for the first and second levels for the fault tree of Figure 3. They provide a visual method of communicating and comparing the relative importance of the different subsystems to overall reliability. They also show the uncertainty in the expert judgments; the steeper the curves, the less uncertainty over the probability judgment. Fitting a beta distribution allows 95% confidence probability judgments to be estimated, and these are the measures presented here. Thus users of the probe can be 95% confident that the failure probabilities are no greater than those presented. As the fault trees and cumulative frequency plots are voluminous, only this one example is presented here; the full set is presented in an open-access report (available by request from http://eprints.soton.ac.uk/187347/).
Systems test (X1)
The aim of this phase is to simulate the probe deployment in the Antarctic with a probe test at sea. All systems will be tested. Experience with testing of AUVs has shown that faults can be discovered at this stage, even though thorough testing has taken place in the laboratory. The overall elicited probability of failure for this transition was 0.139.
The most critical fault source was ‘probe systems’, which includes probe electronics, tether and topside electronics, at 0.11. Next was tether handling with three failure modes: ‘bag failure’, ‘winch failure’ and ‘sheave handling’. Bag failure was assigned a probability of 0.0185, winch failure 0.0075 and sheave handling 0.00034. The sediment sampler had two failure modes: ‘loses sample’ and ‘breaks off’. The probability of failure for each of these failure modes was < 0.000139.
Probe cleaning (X2)
Having passed the systems test, the probe is then cleaned inside and outside. Inside cleaning includes the sensors and inside of the case. Outside cleaning involves cleaning the outer case of the probe and other external components that may contaminate the lake. Two assessments were conducted for this phase of the deployment. The first was at the start of the project, on 24 November 2010, the second on 24 October 2011. The estimated risk of failure to clean the probe at the start of the project was 0.976. This is because at that time the team had not tested the effectiveness of HPV, UV and other potential treatments. There was also significant uncertainty with respect to the type of materials that would comprise different sensors and external systems. The HPV treatment was considered very inefficient while the UV treatment was considered to cause severe material degradation. With this very high probability of failure it was almost certain that the deployment would not leave this state. Consequently, a major review of cleaning was instigated, and changes implemented (Siegert and others, 2012).
The results of these improvements were considered at the second workshop on the cleaning process. The probability of failure of the revised process was assessed as 0.000353. Based on the fault tree for this phase, with 26 failure modes, the probability of failure to clean the inside of the probe was 0.000350, and 0.000108 for failure to clean the outside. Of the failure modes for the inside of the probe, failure to clean the motors and the Optode oxygen sensor were deemed most likely, with a probability of 0.000175. For the remaining failure modes, the probability of failure was of the order of 10⁻⁷. The single biggest factor for the reduction in estimated risk was successful experimental results with testing of our cleaning procedures. While the reliability and performance of the procedures we use are well documented for ideal and laboratory conditions, our uncertainty arose from lack of knowledge as to how well these could be implemented with our specific materials and structures.
Results of laboratory HPV testing have shown that this technique is more effective than initially anticipated in cleaning the outside components of the probe. (A panel of test coupons was contaminated separately with both Geobacillus stearothermophilus and Pseudomonas fluorescens and passed through our cleaning process consisting of detergent wash, 70% ethanol wash, rinsing and drying before HPV or UV treatment. The specific sequence and extent of treatment depends on the material and structure. However, we have demonstrated reliable log6 reduction in population. Log6 is a standard accepted by the US Environmental Protection Agency. This was assessed using cell staining and fluorescence microscopy.) Five failure modes were identified for the outside systems; these were failure to clean: the hot-water drill hose; the ethylene propylene terpolymer o-rings; the titanium case of the probe; the jacket of the probe tether; and the winch. (The hot-water drill hose has been cleaned during testing prior to shipment by flushing with hot water, and this was verified, with no culturable microorganisms detected. It will be further cleaned on-site with alcohol wipes, and again during operation as it is flushed with hot water.) Failure to clean the hot-water drill hose was most likely, with a probability of 6.15 × 10⁻⁵, with failure to clean the ethylene propylene terpolymer o-rings second most likely, with a probability of 4.68 × 10⁻⁵. HPV was considered so effective that the remaining failure modes were assigned a probability of 0.
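The figures for the outside systems can be cross-checked with a short worked calculation. Assuming that the five failure modes are independent and are combined through an OR gate (an assumption made here for illustration), and noting that three of them were assigned a probability of 0, the probability of failing to clean the outside of the probe is

$$1 - \left(1 - 6.15 \times 10^{-5}\right)\left(1 - 4.68 \times 10^{-5}\right) \approx 1.083 \times 10^{-4},$$

which agrees with the value of 0.000108 quoted for failure to clean the outside.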
Post-cleaning electronic test (X3)
The fault tree for this phase is a subset of the fault tree for the deployment phase (Fig. 3), with the mechanical systems and the sediment sampler not tested. However, both probes are tested in X3. The probability of failure for this phase was 0.112. An electronic failure would result in the need to disassemble the probe, correct the faulty component, test and re-clean the component.
Probe transportation (X4)
Two distinct hazards present a risk to the probe during transportation: vibration and temperature. Probe vibration will be minimized by a damping mechanism that supports the case. However, excessive vibration may break mechanical links, or human error (e.g. poor or incorrect assembly) may occur. Consequently there are three vibration failure modes: (1) excessive vibration, beyond design limits; (2) failure of the damping system; (3) procedural failure.
The probe can also fail due to exposure to excessively low or high temperatures. The temperature inside the container will be controlled by a mechanism designed in-house. Once on-site, heating system failures could expose the probe to temperatures below –20°C, causing sensor degradation and potentially reducing the quality of the observations.
The probability of failure during transport was assessed at 0.027. The top failure mode was ‘on-site failure to control probe temperature’ at 0.0167.
Probe on-site test (X5)
The on-site test consists of performing a probe electronic test, identical to that carried out during phase X3 with the same probability of failure of 0.112. A failure on-site would mean the case(s) having to be opened, the fault found and rectified, followed by on-site cleaning (X6). Another way of presenting this assessment is that there is an 89% probability that the probe will not have to be opened on-site (and subsequently cleaned).
Probe on-site cleaning (X6)
On-site cleaning may need to take place if either the probe or sheave is dropped, or if the electronic test fails and thus the probe needs dismantling in order to assess and replace faulty components. The environmental conditions and reduced number of staff limit the type of cleaning that can be performed on-site. As a result, only HPV and UV can be applied. The HPV may fail to clean due to an equipment failure or due to a failure of technique, or it may cause material degradation. The two failure modes for UV treatment are failure to clean and failure due to material degradation. The overall probability of failure was 0.0399.
Sheave deployment (X7)
The sheave is removed from the container and manually carried through a polythene tube into position; the tether is then pulled over it in an airlock that partly consists of a flexible polymer tube. The failure modes identified were: (1) tether handling, (2) design layout, (3) design fault, (4) handling of the sheave and (5) polymer tube failure. The tether handling failure mode was dependent on two other failure modes: (1) design mitigation and (2) procedural mitigation. The probability of failure is < 0.124. If a failure occurs on-site, the problems would need to be corrected and the sheave would need to be cleaned (X6) before attempting a second deployment.
Probe positioning (X8)
The probability of failure to position the probe is < 0.00479. The first-level failure modes are (1) probe freezes, (2) probe manually dropped, (3) failure to align the probe, (4) crane failure to lift the probe and (5) probe dropped by the crane. The probabilities for the crane-related failure modes were estimated by the crane designer based at BAS at < 0.00478.
Probe pre-deployment (X9)
The first-level failure modes for this phase are: (1) valves fail to open; (2) tether failure to mate; (3) failure to move the glove box; (4) failure to align either the glove-box flange or the top valve; (5) failure to detach the hard case; (6) crane failure; (7) probe bag damaged; (8) failure to clean either the glove box or valves; (9) winch failure; and (10) failure to release the probe. Some of these failure modes were decomposed into subsidiary modes (e.g. failure to open the valves was decomposed into four, one for each valve). The probability of failure in the pre-deployment phase is < 0.0941. The probe would be held in this position while remedial action was taken if possible (the p9,9 transition), or if the problems were more substantial, a dummy deployment would take place and the path via X10 back to on-site testing X5 would be followed.
Probe deployment (X10)
The probe deployment is divided into two stages. First, the probe is lowered to the lake, controlled by the winch driver. Second, operators in the control room send signals to activate all sensors and samplers and monitor the data and the probe performance. The fault tree is similar to that defined for probe testing phase X5. The only difference is that here we consider that only one probe is deployed rather than two. The probability of failure is < 0.13.
Probe recapturing (X12)
During probe recapturing, the probe is switched off and lifted to the surface controlled by the winch driver. The probability of failure is < 0.0263.
Probe loss (X11)
Probe loss may come as a result of failing to recapture; thus the probability of loss, at 0.0263, is determined by the Markov condition, that the transition probabilities out of a node must sum to 1, applied to node X12.
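The Markov condition can be illustrated with a minimal sketch for node X12. The only value taken from the text is the loss probability; the label `recovered` for the complementary outcome is a hypothetical name, since the destination state of a successful recapture is not given in this section.

```python
# Outgoing transitions from the recapture state X12.
p_loss = 0.0263               # X12 -> X11 (probe loss), as quoted in the text
p_recovered = 1.0 - p_loss    # X12 -> hypothetical "recovered" outcome

row_x12 = {"X11_loss": p_loss, "recovered": p_recovered}

# Markov condition: the probabilities leaving a node sum to 1.
assert abs(sum(row_x12.values()) - 1.0) < 1e-12
print(round(p_recovered, 4))  # 0.9737
```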
Summary of the expert judgments
Experts assessed the probability of failure for all base events in the fault trees. We had a total of 387 individual expert assessments. For some failure modes, judgments were elicited from five experts. For other failure modes the panel agreed that one expert would best represent the group view of the risk assessment; this expert would normally be the most experienced individual in the topic under consideration. This is the case, for example, for crane failures during the probe positioning phase: experts agreed that Steve White (from BAS) would best represent the group’s view.
Clearly, experts will not always agree, so it is often necessary to conduct an analysis of the expert judgments. Given the large number of assessments, in this paper we review the top ten most critical failure modes only. These are: (1) tether handling failure (during sheave deployment, state X7), with a UQ of 0.143; (2) communication optical connector failure (during system testing phase), with a UQ of 0.04; (3) design failure during sheave deployment, with a UQ of 0.04; (4) handling of the sheave (during sheave deployment), with a UQ of 0.0333; (5) on-site failure to control probe temperature during on-site test (state X5), with a UQ of 0.0133; (6) tether handling failure caused by bag damage during systems pre-deployment test (state X1), with a UQ of 0.0133; (7) tether electronic connector failure during systems pre-deployment test (state X1), with a UQ of 0.0133; (8) topside systems failure due to human error during systems pre-deployment test (state X1), with a probability of 0.01; (9) electronics sensor failure, i.e. at least one sensor fails, during pre-deployment systems tests (state X1), with a UQ of 0.005; and (10) tether mechanical failure (state X1), with a UQ of 0.004.
The judgments provided for the probability of tether handling failure ranged between 0.05 and 0.2, showing that the panel was very confident in this assessment. It is not unusual for expert judgments to span over three orders of magnitude (Keeney and von Winterfeldt, 1991). The fitted probability distributions for this failure mode are presented in Figure 5a. Tether handling is an operator procedure. Experts agreed that this procedure would be rehearsed prior to the deployment. Three experts (N.R., K.S. and E.W.) provided the same assessments for the lower quartile, median and upper quartile. Experts J.C. and J.W. were slightly more optimistic, but the differences with experts N.R., K.S. and E.W. are very small. There is no evidence of biases in this assessment.
Three experts (J.C., E.W. and L.F.) assessed the probability of optical connector failure; the fitted probability distributions are presented in Figure 5b. All three experts are experienced electronics and communications engineers. L.F. is the most experienced in fibre optics, particularly for remotely operated vehicle (ROV) applications, where he has experienced this type of failure mode before. J.C. and E.W. provided the most optimistic assessments, both of the order of 10⁻³. L.F. was more pessimistic, his assessments being of the order of 10⁻². We believe L.F. used the representativeness heuristic for estimating the probability of optical connector failure. The representativeness heuristic is a mental shortcut for assessing the probability of event A based on its degree of resemblance to event B (Tversky and Kahneman, 1974). Biases can be introduced using this heuristic; however, given that the probe optical connector will be provided by the same supplier and given that the probe will operate in a similar fashion to an ROV, we have no evidence to support the argument that L.F. has introduced biases.
Of the ten most critical failure modes, this is the assessment for which we have observed the largest discrepancy between expert judgments. Following a discussion, the group agreed that L.F.’s assessments would better represent the panel view.
The probability distribution for design failure during sheave deployment is presented in Figure 5c. With a probability range 0.01–0.1 we can see that the difference in expert judgment is small. For this failure mode, disagreements are more evident in the upper quartile. The agreed upper quartile was 0.04; E.W. was the most optimistic expert, assigning an upper quartile of 0.0181; N.R. was the most pessimistic expert, assigning an upper quartile of 0.0667. The agreed UQ presents a good representation of the experts’ assessment.
Sheave handling is a human operator activity. The panel agreed that the risk of sheave handling failure should be smaller than the risk of tether handling failure. The fitted probability distributions for this failure mode are presented in Figure 5d. J.W. was the most optimistic expert, with LQ of 0.0125, M of 0.0167 and UQ of 0.0333. J.C. assigned the same values for LQ and M, but a UQ of 0.05. Following a discussion between experts, the panel agreed that assessments provided by J.W. would best represent the panel view, provided that the sheave handling procedure is rehearsed prior to the deployment.
Controlling the probe temperature is key to the success of the deployment, as some sensors can be permanently damaged if the ambient temperature is below –26°C. For example, the Idronaut platform uses a reference that contains a gel which can freeze if the temperature drops below –26°C. The protective system will consist of a case with foam. The panel agreed that the judgments of J.W. and N.R. best represent the group view. Following a brief discussion, both experts assigned the same values to all parameters of the probability of failure distribution. The probability distribution for this failure mode is presented in Figure 5e.
The panel agreed that the failure distribution for tether bag failure should be bounded between 0.005 and 0.02. The fitted probability distributions are presented in Figure 5f. Experts provided similar values for the median and upper quartile. For the lower quartile, N.R. was slightly more pessimistic than the rest of the group, with an LQ of 0.01. The panel agreed that the LQ assessment provided by J.W. and E.W., of 0.00667, would best represent the panel view. We have no reason to believe that there was bias in this assessment.
The probability distribution for the tether electronic connector failure is bounded between 0.0002 and 0.002. All experts assigned a median of 0.000667. J.C. was the most optimistic expert, with an assessment of 0.000284 for the LQ and a UQ of 0.0008; E.W. assigned a LQ of 0.0003333 and UQ of 0.001. The panel agreed on a more confident distribution, presented in Figure 5g.
For the topside system operator error failure mode, the panel agreed that the probability distribution is bounded between 0.002 and 0.02. As shown in Figure 5h, the panel agreed on the most pessimistic view presented by experts. The probability distribution for sensor failure is bounded between 0.001 and 0.01. As shown in Figure 5i, the agreed distribution is similar to the unweighted linear opinion pool.
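To make the idea of an unweighted linear opinion pool concrete, the sketch below averages two experts' cumulative distributions, using the tether electronic connector judgments of J.C. and E.W. quoted above. It assumes that each expert's distribution can be approximated by a piecewise-linear CDF through the bounds, quartiles and median, and that both experts share the agreed bounds; the fitted distributions used in the paper are not reproduced, and the function names are illustrative.

```python
from bisect import bisect_right

def piecewise_cdf(points):
    """Return a CDF that interpolates linearly between (value, probability) points."""
    xs = [x for x, _ in points]
    ps = [p for _, p in points]

    def cdf(x):
        if x <= xs[0]:
            return 0.0
        if x >= xs[-1]:
            return 1.0
        i = bisect_right(xs, x) - 1          # segment containing x
        frac = (x - xs[i]) / (xs[i + 1] - xs[i])
        return ps[i] + frac * (ps[i + 1] - ps[i])

    return cdf

def linear_pool(cdfs):
    """Unweighted linear opinion pool: the plain average of the experts' CDFs."""
    return lambda x: sum(c(x) for c in cdfs) / len(cdfs)

# Agreed bounds and the two experts' quartiles for tether electronic connector failure.
L, U = 0.0002, 0.002
jc = piecewise_cdf([(L, 0.0), (0.000284, 0.25), (0.000667, 0.5), (0.0008, 0.75), (U, 1.0)])
ew = piecewise_cdf([(L, 0.0), (0.0003333, 0.25), (0.000667, 0.5), (0.001, 0.75), (U, 1.0)])
pooled = linear_pool([jc, ew])

print(round(pooled(0.000667), 3))  # 0.5, since both experts share the same median
```

Because the pool averages probabilities rather than quantiles, the pooled upper quartile lies between the two experts' upper quartiles rather than at either one.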
The panel agreed that J.W. would best represent the panel on tether mechanical failure mode. The probability distribution for this failure mode is bounded between 0.001 and 0.005, which represents a very confident assessment.
Expert judgment is naturally a subjective process, but this does not imply that the process is biased, as we have shown in this review of the expert assessments.
Estimating Probe Availability
Having explained how FTA can be used to populate the transitions in the Markov chain of the probe deployment, in this section we describe how the model can be used to perform analyses of availability. First the transition matrix for the Markov chain is specified, based on the transition probabilities as elicited and set out as 95% quantiles in Figure 6. However, in order to reflect the uncertainty in the risk assessment of different phases of the deployment, all parameters of the probability distributions (L, LQ, M, UQ and U) have been propagated using the Markov formalism and a beta distribution fitted to these parameters to ease data analysis.
Probe availability, without any rework (i.e. all of the steps between the first and the last in the sequences below all go according to plan without any problems), was calculated for the following scenarios:
- Probability that the probe will go from being ready for its system test at sea to being cleaned and ready for transportation to Antarctica: p1,2 × p2,3 × p3,4 = 0.76. All the remedial work needed if the probe fails during any of these stages would be done before the probe leaves the UK.
- Probability that the probe will go from passing its tests in the UK to passing its tests on-site: p4,5 × p5,7 = 0.86. Remedial work would need to be done on-site, and would be done before the hot-water drilling started.
- Probability that the probe will go from having passed its on-site tests to completing the pre-deployment phase: p7,8 × p8,9 = 0.87. These two phases can be practised, and faults corrected, before the hot-water drilling starts, as the wellhead will be in place above an auger-drilled hole of 15 m length.
- Probability that the deployment will go from successful pre-deployment to recovery: p9,10 × p10,12 = 0.79. This is the time-critical period, as the hole has been drilled and will only be open for ~26 hours.
- Probability of successfully deploying and recovering either of the two probes, given both passed their on-site testing: 1 – (1 – p9,10 × p10,12)² = 0.96.
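Most of the scenario figures above can be approximately reproduced from the phase failure probabilities quoted earlier. The sketch below assumes that each forward transition probability equals one minus the failure probability of the phase being completed, that the values given as upper limits ("<") are used directly, and that the two probes fail independently. The dictionary keys and function name are illustrative.

```python
from math import prod

# Phase failure probabilities quoted in the text (limit values where given as "<").
fail = {
    "X4_transport": 0.027,
    "X5_onsite_test": 0.112,
    "X7_sheave": 0.124,
    "X8_positioning": 0.00479,
    "X9_predeployment": 0.0941,
    "X10_deployment": 0.13,
}

def success(*phases):
    """Probability that every listed phase succeeds without rework."""
    return prod(1.0 - fail[name] for name in phases)

uk_to_site = success("X4_transport", "X5_onsite_test")        # p4,5 x p5,7
site_to_predeploy = success("X7_sheave", "X8_positioning")    # p7,8 x p8,9
deploy_and_recover = success("X9_predeployment", "X10_deployment")  # p9,10 x p10,12

# At least one of two independent probes is deployed and recovered.
either_of_two = 1.0 - (1.0 - deploy_and_recover) ** 2

print(round(uk_to_site, 2), round(site_to_predeploy, 2),
      round(deploy_and_recover, 2), round(either_of_two, 2))
# 0.86 0.87 0.79 0.96
```

The first scenario is omitted because the failure probability for the system test phase (X1) is not stated in this section.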
It is nearly impossible to validate these predictions of availability, as this would require significant field testing and process rehearsal. However, they can be compared against availability estimates for other systems of comparable complexity. Availability figures for five types of unmanned aerial vehicles (UAVs), namely Predator RQ-1A (concept demonstrator), Predator RQ-1B (early production), Pioneer RQ-2A (1990–91), Pioneer RQ-2B and Hunter RQ-5 (reliability enhanced 1996–2001), were published by the US Office of the Secretary of Defense (www.acq.osd.mil/uas/docs/reliabilitystudy.pdf, but the document is no longer available). The average availability for these vehicles was 0.77. For the Autosub3 AUV, the estimated availability is 0.75 (Brito and Griffiths, 2011). For one Ellsworth probe, the availability for deployment during the critical period after the hole has been drilled is 0.79. Given that the probe is a totally new development, the availability figure is encouraging compared with those of the UAVs and the AUV. However, it is sufficiently low that the need for a second probe to be available is clear, both from good practice and from the elicitation and statistical propagation arguments presented here.
Conclusion
Clean access to subglacial lakes is now a requirement for Antarctic science expeditions. Given the engineering and environmental challenges, it is imperative that scientists adopt a formal approach for estimating the likelihood of success. The approach described for estimating the availability of a scientific probe, based on, among other factors, the effectiveness of the cleaning processes and the reliability of components and operators, provides both a means of assessing the risks involved and a means of communicating them.
While the three separate techniques – Markov chain, fault analysis supported by fault trees, and expert judgment elicitation – are well known, their integration in the manner described here is novel. The major limitation is the lack of verification for the quantitative results of the process. Much rests on the elicited expert judgments for a whole series of activities that have yet to take place. However, after deployment, drawing upon the experience gained, exactly the same structure of the Markov chains and fault trees as set out here can be used with newly elicited judgments to revisit the predictions. Experience may also show that other paths in the Markov chain need to be added or existing paths removed; new states may be needed. Consequently this first model would be refined, and would gain in realism, while keeping the underlying formalism.
Perhaps more important than the initial predictions of availability are the discussions at the workshops held to determine and then populate the fault trees with failure probabilities. The first round of application of the fault tree analyses helped engineers to identify design vulnerabilities. This led to major changes with respect to the deployment and cleaning processes. For example, the sheave deployment was deemed a high-risk operation; following this assessment the design team changed the sheave deployment process, so that the sheave is deployed together with the probe.
Throughout this process, the main actors were the project engineers and scientists. The acceptable risk was not specified by the project owner. If this value had been specified it would have allowed us to follow a targeted risk approach for the probe deployment. Instead, we followed a revealed risk approach; the estimated risk was compared to that of other vehicles and platforms. The decision is left to the engineers and managers to accept or reject the revealed risk.
This paper has not discussed the different shapes of the agreed probability-of-failure distributions or the amount of agreement or disagreement between experts for all assessments. We have presented an analysis of the expert judgments for the ten most critical failures only. From this analysis we concluded that disagreements between expert judgments are not significant. We believe that this is a result of the judgment elicitation process. A key phase of this project is for experts to agree on the upper and lower bound of the probability distribution.
Future work will seek to compare the predicted and actual transition probabilities following the deployment of the probe at Lake Ellsworth in 2013.
Acknowledgements
M.B. and G.G. were supported by the UK Natural Environment Research Council (NERC) under the Oceans 2025 research programme, while M.M. was supported under NERC grant NE/G004242/1. We thank the Ellsworth team at the National Oceanography Centre, Southampton, for freely taking part in the judgment elicitation exercise. We thank our colleagues at BAS, particularly Keith Makinson, for providing insightful details on the hot-water drilling process, and the experts that took part in the elicitation exercise.