Dr. Aly, O.
Computer Science
Introduction
The purpose of this discussion is to analyze Bayesian analysis and the reasons for using it when making decisions under uncertainty. The discussion also addresses the assumptions of Bayesian analysis, the cases that call for it, and the problems associated with it.
Probability Theory and Probability Calculus
Various words and terms are used to describe uncertainty and related concepts, such as probability, chance, randomness, luck, hazard, and fate (Hand, Mannila, & Smyth, 2001). Modeling uncertainty is a required component of almost all data analysis (Hand et al., 2001). There are several reasons for such uncertainty. One reason is that the data may be only a sample from the population under study, so the uncertainty concerns the extent to which different samples differ from each other and from the overall population (Hand et al., 2001). Another reason is that a prediction about tomorrow based on today’s data is subject to uncertainty about what the future will bring (Hand et al., 2001). A further reason is ignorance: some value cannot be observed, and conclusions must be based on a “best guess” about it (Hand et al., 2001).
A distinction can be drawn between probability theory and the probability calculus (Hand et al., 2001). Probability theory is concerned with the interpretation of probability, while the probability calculus is concerned with the manipulation of the mathematical representation of probability (Hand et al., 2001).
A probability measures how likely a particular event is to occur. Mathematicians refer to a set of potential outcomes of an experiment or trial to which a probability of occurrence can be assigned (Fischetti, Mayor, & Forte, 2017). Probabilities are expressed as a number between 0 and 1 or as a percentage out of 100 (Fischetti et al., 2017). A probability of 0 denotes an impossible outcome, and a probability of 1 describes an event that is certain to occur (Fischetti et al., 2017). An example is a coin flip, which has two outcomes: heads or tails. Since these two outcomes cover the entire sample space, they are said to be collectively exhaustive, and they are mutually exclusive, meaning that they can never co-occur (Fischetti et al., 2017). Thus, the probabilities of obtaining heads and tails are P(heads) = 0.50 and P(tails) = 0.50, respectively. Moreover, when the probability of one outcome does not affect the probability of the other, the events are described as independent (Fischetti et al., 2017). For independent events, the probability of event A and event B occurring together is the product of the probability of A and the probability of B (Fischetti et al., 2017).
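As a quick illustration of these ideas, the following minimal Python sketch (the language choice is purely for illustration and is not taken from the cited sources) simulates repeated coin flips and checks that the joint probability of two independent events is approximately the product of their individual probabilities.

```python
import random

# Estimate P(heads) for a fair coin, and check that two independent flips
# have joint probability P(A and B) roughly equal to P(A) * P(B).
random.seed(42)
n = 100_000

first = [random.random() < 0.5 for _ in range(n)]   # flip 1: heads?
second = [random.random() < 0.5 for _ in range(n)]  # flip 2: heads?

p_first = sum(first) / n
p_second = sum(second) / n
p_both = sum(a and b for a, b in zip(first, second)) / n

print(f"P(heads, flip 1) ~ {p_first:.3f}")
print(f"P(heads, flip 2) ~ {p_second:.3f}")
print(f"P(both heads)    ~ {p_both:.3f}  (product: {p_first * p_second:.3f})")
```

With 100,000 simulated pairs of flips, the estimated joint probability comes out close to 0.25, the product of the two marginal probabilities.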
Subjective vs. Objective Probability Views
In the first half of the 20th century, the dominant statistical paradigm was the frequentist approach (Hand et al., 2001; O’Hagan, 2004). The frequentist view takes the perspective that probability is an “objective” concept, where the probability of an event is defined as the limiting proportion of times that the event would occur in repetitions of a substantially identical situation (Hand et al., 2001). The coin flip described above is an example of this frequentist notion of probability.
In the second half of the 20th century, the dominance of the frequentist view started to fade (Hand et al., 2001; O’Hagan, 2004). Although the vast majority of statistical analysis in practice is still frequentist, a competing view of “subjective probability” has acquired increasing importance (Hand et al., 2001; O’Hagan, 2004). The principles and methodologies for data analysis derived from the “subjective” view are often referred to as “Bayesian” statistics (Fischetti et al., 2017; Hand et al., 2001; O’Hagan, 2004). The calculus is the same for the two viewpoints, even though the underlying interpretation is entirely different (Hand et al., 2001).
Bayes’ Theorem
Bayes’ Theorem is named after the 18th-century minister Thomas Bayes, whose paper presented to the Royal Society in 1763 first used the Bayesian argument (O’Hagan, 2004). Although the Bayesian approach can be traced back to Thomas Bayes, its modern incarnation began only in the 1950s and 1960s (O’Hagan, 2004).
The central tenet of Bayesian statistics is the explicit characterization of all forms of uncertainty in a data analysis problem, including uncertainty about any parameters estimated from the data, uncertainty as to which among a set of model structures is best or closest to the “truth,” uncertainty in any forecast that is made, and so forth (Hand et al., 2001). Because the Bayesian interpretation is subjective, when evidence is scarce, there are sometimes wildly different degrees of belief among different people (Fischetti et al., 2017; Hand et al., 2001; O’Hagan, 2004).
From the perspective of “subjective” probability, the Bayesian interpretation views probability as the degree of belief in a claim or hypothesis (Fischetti et al., 2017; Hand et al., 2001). Bayesian inference provides a method to update that belief or hypothesis in the light of new evidence (Fischetti et al., 2017). Bayes’ Theorem is defined in equation (1) (Fischetti et al., 2017; Hand et al., 2001; O’Hagan, 2004); a worked numerical sketch is given after the definitions below.

P(H | E) = P(E | H) × P(H) / P(E)     (1)
Where:
- H is the hypothesis.
- E is the evidence.
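To give a concrete sense of how equation (1) updates a belief, the following sketch applies it to a hypothetical screening scenario; all of the numbers (prior, likelihood, and false-positive rate) are assumptions chosen purely for illustration.

```python
# Hypothetical numbers, chosen only for illustration.
p_h = 0.01              # prior P(H): the hypothesis (e.g., a condition) is true
p_e_given_h = 0.95      # likelihood P(E|H): evidence observed when H is true
p_e_given_not_h = 0.10  # P(E|not H): false-positive rate

# Total probability of the evidence, P(E).
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Bayes' theorem, equation (1): P(H|E) = P(E|H) * P(H) / P(E)
p_h_given_e = p_e_given_h * p_h / p_e
print(f"Posterior P(H|E) = {p_h_given_e:.3f}")  # roughly 0.088
```

Even with strong evidence, the posterior belief remains modest here because the prior probability of the hypothesis was small, which is exactly the kind of synthesis Bayes’ Theorem formalizes.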
Bayesian Continuous Posterior Distribution
When working with Bayesian analysis, the hypothesis concerns a continuous parameter or many parameters (Fischetti et al., 2017). Bayesian analysis therefore usually yields a continuous posterior called a “posterior distribution” (Fischetti et al., 2017). Bayesian methods use Bayes’ rule, which expresses a powerful framework for combining sample information with a prior expert opinion to produce an updated, or posterior, expert opinion (Giudici, 2005). In Bayesian analysis, a parameter is treated as a random variable whose uncertainty is modeled by a probability distribution (Giudici, 2005). This distribution is the expert’s prior distribution p(θ), stated in the absence of the sampled data (Giudici, 2005). The likelihood is the distribution of the sample, conditional on the values of the random variable θ: p(x|θ) (Giudici, 2005). Bayes’ rule provides an algorithm to update the expert’s opinion in the light of the data, producing the so-called posterior distribution p(θ|x), as shown in equation (2) below, with c = p(x) a constant that does not depend on the unknown parameter θ (Giudici, 2005).

p(θ|x) = p(x|θ) p(θ) / c,  where c = p(x)     (2)
The posterior distribution is the main Bayesian inferential tool (Giudici, 2005). Once it is obtained, it is easy to derive any inference of interest (Giudici, 2005). For instance, to obtain a point estimate, a summary of the posterior distribution is taken, such as the mean or the mode (Giudici, 2005). Similarly, confidence intervals can be derived by taking any two values of θ such that the probability of θ belonging to the interval described by those two values corresponds to the given confidence level (Giudici, 2005). As θ is a random variable, it is now correct to interpret the confidence level as a probabilistic statement: (1 − α) is the coverage probability of the interval, namely, the probability that θ assumes values in the interval (Giudici, 2005). The Bayesian approach is thus described as a coherent and flexible procedure (Giudici, 2005).
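A minimal sketch of these ideas, assuming a conjugate Beta prior for a binomial proportion θ (the prior parameters and data below are invented for illustration), shows how point estimates and an interval can be read directly off the posterior distribution.

```python
from scipy import stats

# Assumed setup: a Beta prior for the proportion θ combined with binomial data.
# The Beta prior is conjugate, so the posterior has the closed form
# Beta(a + k, b + n - k).
a, b = 2, 2          # prior Beta(2, 2): mildly informative, centred on 0.5
n, k = 50, 36        # observed data: 36 successes in 50 trials

posterior = stats.beta(a + k, b + n - k)

point_mean = posterior.mean()                 # posterior mean as point estimate
point_mode = (a + k - 1) / (a + b + n - 2)    # posterior mode as point estimate
lower, upper = posterior.interval(0.95)       # central 95% interval for θ

print(f"Posterior mean : {point_mean:.3f}")
print(f"Posterior mode : {point_mode:.3f}")
print(f"95% interval   : ({lower:.3f}, {upper:.3f})")
```

Here the interval has the direct probabilistic reading described above: the posterior probability that θ lies inside it is 0.95.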
Special Case of Bayesian Estimates: Maximum Likelihood Estimator (MLE)
The MLE is a special case of a Bayesian estimate: when a constant (flat) distribution expressing a vague state of prior knowledge is assumed as the prior for θ, the posterior mode is equal to the MLE (Giudici, 2005). More generally, when a large sample is considered, the Bayesian posterior distribution approaches an asymptotic normal distribution with the MLE as its expected value (Giudici, 2005).
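Continuing the binomial sketch above, the following short example assumes a flat Beta(1, 1) prior and checks numerically that the posterior mode coincides with the maximum likelihood estimate k/n.

```python
# With a flat Beta(1, 1) prior (a constant density on [0, 1]), the posterior
# for a binomial proportion is Beta(1 + k, 1 + n - k), whose mode (k / n)
# is exactly the maximum likelihood estimate.
n, k = 50, 36
a, b = 1, 1                                      # flat prior

posterior_mode = (a + k - 1) / (a + b + n - 2)   # equals k / n
mle = k / n

print(posterior_mode, mle)  # both 0.72
```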
Bayesian Statistics
Bayesian statistics follows a four-step process (O’Hagan, 2004). The first step is to create a statistical model linking the data to the parameters. The second step is to formulate prior information about the parameters. The third step is to combine the two sources of information using Bayes’ theorem. The last step is to use the resulting posterior distribution to derive inferences about the parameters (O’Hagan, 2004). Figure 1 illustrates the synthesis of the information by Bayes’ theorem.

Figure 1. Synthesis of the Information by Bayes’ Theorem (O’Hagan, 2004).
Figure 2 illustrates Bayes’ Theorem using a “triplot,” in which the prior distribution, the likelihood, and the posterior distribution are all plotted on the same graph. The prior information is represented by the dashed line, in this example lying between -4 and +4. The dotted line represents the likelihood of the data, which favors parameter values between 0 and 3 and argues strongly against any value below -2 or above +4 (O’Hagan, 2004). The posterior, represented by the solid line, puts these two sources of information together (O’Hagan, 2004). Thus, for values below -2 the posterior density is minimal because the data say that these values are highly implausible, while values above +4 are ruled out by the prior (O’Hagan, 2004). While the data favor values around 1.5 and the prior prefers values around 0, the posterior listens to both; the synthesis is a compromise, and the parameter is most likely to be around 1 (O’Hagan, 2004).

Figure 2. Triplot. Prior Density (dashed), Likelihood (dotted), and Posterior Density (solid) (O’Hagan, 2004).
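Since the original figure is not reproduced here, the following sketch draws a triplot in the same spirit (the normal prior and likelihood below are assumptions for illustration, not O’Hagan’s exact example): a prior centred at 0, a likelihood centred at 1.5, and a posterior that compromises at roughly 1.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# A triplot sketch: prior (dashed), likelihood (dotted), posterior (solid).
theta = np.linspace(-5, 5, 500)
prior = stats.norm(0, 1.5)         # prior preferring values around 0
likelihood = stats.norm(1.5, 1.0)  # data favouring values around 1.5

# For a normal prior and normal likelihood, the posterior is also normal,
# with precision-weighted mean and combined precision.
prec = 1 / 1.5**2 + 1 / 1.0**2
post_mean = (0 / 1.5**2 + 1.5 / 1.0**2) / prec
posterior = stats.norm(post_mean, prec**-0.5)

plt.plot(theta, prior.pdf(theta), "--", label="prior")
plt.plot(theta, likelihood.pdf(theta), ":", label="likelihood")
plt.plot(theta, posterior.pdf(theta), "-", label="posterior")
plt.xlabel("parameter")
plt.legend()
plt.show()
```

With these assumed densities, the posterior mean works out to about 1.04, matching the “compromise around 1” behaviour described for the figure.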
Bayes Assumption, Naïve Bayes, and Bayes Classifier
Any joint distribution can be simplified by making appropriate independence assumptions, essentially approximating a full table of probabilities by products of much smaller tables (Hand et al., 2001). At the extreme, all the variables can be assumed to be conditionally independent given the class. This assumption is sometimes referred to as the Naïve Bayes or first-order Bayes assumption (Alexander & Wang, 2017; Baştanlar & Özuysal, 2014; Hand et al., 2001; Suguna, Sakthi Sakunthala, Sanjana, & Sanjhana, 2017). The conditional independence model is linear in the number of variables (p) rather than exponential (Hand et al., 2001). To use the model for classification, the product form is simply used for the class-conditional distributions, yielding the Naïve Bayes classifier (Hand et al., 2001). The reduction in the number of parameters comes at a cost, because an extreme independence assumption is made (Hand et al., 2001). In some cases the conditional independence assumption is quite reasonable, but in many practical cases it may not be realistic (Hand et al., 2001). Even when the independence assumption is not a realistic model of the probabilities involved, it may still permit relatively accurate classification performance (Hand et al., 2001).
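As an illustrative sketch (the use of scikit-learn’s Gaussian Naïve Bayes on a standard toy dataset is a convenience choice, not something taken from the cited sources), the classifier below models each feature with an independent class-conditional Gaussian, which is precisely the first-order Bayes assumption in product form.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Naive Bayes on a toy dataset: each feature gets its own independent
# class-conditional Gaussian, and the class-conditional density is the
# product of these one-dimensional densities.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
print(f"Held-out accuracy: {clf.score(X_test, y_test):.3f}")
```

Even with the strong independence assumption, classifiers of this kind often achieve respectable accuracy, which is the point made above about accurate classification despite an unrealistic density model.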
The Naïve Bayes model can easily be generalized in many different directions (Hand et al., 2001). The simplicity, parsimony, and interpretability of the Naïve Bayes model have led to its widespread popularity, particularly in the machine learning literature (Hand et al., 2001). The model can be generalized equally well by including some but not all dependencies beyond first-order (Hand et al., 2001). However, the conventional wisdom in practice is that such additions to the model often provide only limited improvements in classification performance on many data sets, underscoring the difference between building accurate density estimators and building good classifiers (Hand et al., 2001).
Markov Chain Monte Carlo (MCMC) for Bayesian Analysis
Computing tools developed explicitly for Bayesian analysis are more powerful than anything available for frequentist methods, in the sense that Bayesians can now tackle enormously intricate problems that frequentist methods cannot begin to address (O’Hagan, 2004). The advent of MCMC methods in the early 1990s served to emancipate the implementation of Bayesian analysis (Allenby, Bradlow, George, Liechty, & McCulloch, 2014).
The transformation is continuing, and computational developments are shifting the balance consistently in favor of Bayesian methods (O’Hagan, 2004). MCMC is a simulation technique whose concept is to bypass the mathematical operations rather than to implement them (O’Hagan, 2004). Bayesian inference is carried out by randomly drawing a sizeable simulated sample from the posterior distribution (O’Hagan, 2004). The underlying idea is that a sufficiently large sample from any distribution can represent the whole distribution effectively (O’Hagan, 2004).
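The following random-walk Metropolis sampler is a minimal sketch of the MCMC idea (the target density is an assumed toy example, not one from the cited sources): the posterior needs to be known only up to the normalising constant c = p(x), and a large simulated sample stands in for the full distribution.

```python
import numpy as np

# Random-walk Metropolis: one of the simplest MCMC algorithms.
rng = np.random.default_rng(0)

def log_unnormalised_posterior(theta):
    # Assumed toy target, proportional to a N(1, 0.75^2) density.
    return -0.5 * ((theta - 1.0) / 0.75) ** 2

samples = np.empty(20_000)
theta = 0.0
for i in range(samples.size):
    proposal = theta + rng.normal(scale=0.5)                    # random-walk proposal
    log_accept = (log_unnormalised_posterior(proposal)
                  - log_unnormalised_posterior(theta))
    if np.log(rng.uniform()) < log_accept:                      # accept/reject step
        theta = proposal
    samples[i] = theta

burned = samples[5_000:]                                         # discard burn-in
print(f"Posterior mean ~ {burned.mean():.2f}, sd ~ {burned.std():.2f}")
```

The retained draws approximate the posterior, so any summary (mean, intervals, functions of the parameter) can be computed directly from the simulated sample.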
Bayesian Analysis Application
Bayesian methods are now the tools of choice in many application areas (O’Hagan, 2004). Bayes’ Theorem has been applied to, and proven useful in, various disciplines and contexts, such as breaking the German Enigma code during World War II, saving millions of lives (Fischetti et al., 2017). Furthermore, an essential application of Bayes’ rule arises in predictive classification problems (Giudici, 2005). Bayesian analyses applied to marketing science problems have become increasingly popular due to their ability to capture individual-level customer heterogeneity (Allenby et al., 2014). Big Data is promoting the collection and archiving of an unprecedented amount of data (Allenby et al., 2014). The marketing industry is convinced that there is gold in such Big Data (Allenby et al., 2014). Big Data is described as discrete, and as huge because of its breadth rather than its depth (Allenby et al., 2014). It provides large amounts of shallow data that do not reveal the state of the respondent or the state of the market (Allenby et al., 2014). Bayesian methods are found to be useful in marketing analysis because of their ability to deal with large, shallow datasets and their ability to produce exact, finite-sample inference (Allenby et al., 2014).
The pharmaceutical industry is another example of the use of Bayesian methods: pharmaceutical companies are regularly forced to abandon drugs that have just failed to demonstrate beneficial effects in frequentist terms, even when a Bayesian analysis suggests that it would be worth persevering (O’Hagan, 2004). The analysis of DNA evidence is another example of the use of Bayesian analysis (O’Hagan, 2004). Other applications of Bayesian methods include fraud detection (Bolton & Hand, 2002) and the modeling of social networks (Rodriguez, 2012).
Bayesian Analysis Software
There is a growing range of software available to assist with Bayesian analysis (O’Hagan, 2004). Two particular packages are in general use and freely available: First Bayes and WinBUGS (O’Hagan, 2004).
First Bayes is an elementary program aimed at helping the beginner learn and understand how Bayesian methods work (O’Hagan, 2004). WinBUGS is a robust program for carrying out MCMC computations and is in widespread use for serious Bayesian analysis (O’Hagan, 2004).
Advantages and Disadvantages of Bayesian Analysis
Bayesian methods and classical methods both have advantages and disadvantages, and there are some similarities (sas.com, n.d.). When the sample size is large, Bayesian inference often provides results for parametric models that are very similar to the results produced by frequentist methods (sas.com, n.d.). Bayesian analysis has the following five advantages:
- It provides a natural and principled way of combining prior information with data, within a solid decision-theoretical framework. Thus, past information about a parameter can be incorporated to form a prior distribution for future analysis. When new observations become available, the previous posterior distribution can be used as a prior. All inferences logically follow from Bayes’ Theorem (sas.com, n.d.).
- It provides inferences that are conditional on the data and are precise, without reliance on asymptotic approximation. Small-sample inference proceeds in the same manner as if one had a large sample. Bayesian analysis can also estimate any function of the parameters directly, without using the “plug-in” method, which estimates functionals by plugging the estimated parameters into the functionals (sas.com, n.d.); a sketch of this is given after this list.
- It adheres to the likelihood principle: if two distinct sampling designs yield proportional likelihood functions for θ, then all inferences about θ should be identical from these two designs. Classical inference does not, in general, adhere to the likelihood principle (sas.com, n.d.).
- It provides interpretable answers, such as “the true parameter θ has a probability of 0.95 of falling in a 95% credible interval” (sas.com, n.d.).
- It provides a convenient setting for a wide range of models, such as hierarchical models and missing data problems (sas.com, n.d.).
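As a sketch of the “no plug-in” point above (continuing the assumed Beta-Binomial posterior used earlier), a function of the parameter such as the odds θ/(1 − θ) can be summarised by transforming posterior draws directly, rather than by plugging a point estimate into the function.

```python
import numpy as np
from scipy import stats

# Assumed posterior from the earlier Beta-Binomial sketch: Beta(2 + 36, 2 + 14).
draws = stats.beta(38, 16).rvs(size=100_000, random_state=1)

# Transform each posterior draw of θ into the odds θ / (1 - θ), then summarise
# the transformed sample directly: no plug-in step is needed.
odds = draws / (1 - draws)

print(f"Posterior mean of the odds: {odds.mean():.2f}")
print(f"95% interval for the odds : "
      f"({np.percentile(odds, 2.5):.2f}, {np.percentile(odds, 97.5):.2f})")
```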
Bayesian Analysis also has the following disadvantages:
- It does not tell how to select a prior; there is no correct method for choosing one. Bayesian inference requires skill to translate subjective prior beliefs into a mathematically formulated prior, and it can generate misleading results if done without caution (sas.com, n.d.).
- It can produce posterior distributions which are heavily influenced by the priors. From a practical point of view, subject matter experts might disagree with the validity of the chosen prior (sas.com, n.d.).
- It often comes with a high computational cost, especially in models with a large number of parameters. Also, simulations provide slightly different answers unless the same random seed is used. However, these slight variations do not contradict the claim that Bayesian inferences are precise or exact: the posterior distribution of a parameter is exact, given the likelihood function and the priors, while simulation-based estimates of posterior quantities can vary because of the random number generator used in the procedures (sas.com, n.d.).
References
Alexander, C., & Wang, L. (2017). Big data analytics in heart attack prediction. The Journal of Nursing Care, 6(393).
Allenby, G. M., Bradlow, E. T., George, E. I., Liechty, J., & McCulloch, R. E. (2014). Perspectives on Bayesian Methods and Big Data. Customer Needs and Solutions, 1(3), 169-175.
Baştanlar, Y., & Özuysal, M. (2014). Introduction to machine learning. In miRNomics: MicroRNA Biology and Computational Analysis (pp. 105-128). Springer.
Bolton, R. J., & Hand, D. J. (2002). Statistical fraud detection: A review. Statistical Science, 235-249.
Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis. Packt Publishing.
Giudici, P. (2005). Applied data mining: statistical methods for business and industry: John Wiley & Sons.
Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining.
O’Hagan, A. (2004). Bayesian statistics: principles and benefits. Frontis, 31-45.
Rodriguez, A. (2012). Modeling the dynamics of social networks using Bayesian hierarchical blockmodels. Statistical Analysis and Data Mining: The ASA Data Science Journal, 5(3), 218-234.
sas.com. (n.d.). Bayesian Analysis: Advantages and Disadvantages. Retrieved from https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_introbayes_sect006.htm.
Suguna, M., Sakthi Sakunthala, N., Sanjana, S., & Sanjhana, S. (2017). A Survey on Prediction of Heart Diseases Using Big Data Algorithms.