• The goal is to classify vectors of discrete-valued features via a generative approach. That is,

    Let $x \in \{1, \dots, K\}^D$ be the feature vector and $y \in \{1, \dots, C\}$ the class label. We define a generative classifier using the following equation:

    $$p(y = c \mid x, \theta) \propto p(x \mid y = c, \theta)\, p(y = c \mid \theta)$$

    • We refer to $p(x \mid y = c, \theta)$ as the class conditional density (that is, the likelihood of generating $x$ given $y = c$).

    • We refer to $p(y = c \mid \theta)$ as the class prior. The class prior is usually denoted as $\pi$, so

      $$p(y = c \mid \pi) = \pi_c$$

      It captures the prior assumptions about the label distribution.

  • Assume: the features are conditionally independent given the class (this is technically a naive assumption, hence the name). That is,

    $$p(x \mid y = c, \theta) = \prod_{d=1}^{D} p(x_d \mid y = c, \theta_{dc})$$

    where $c$ is the value of the label $y$. Note that the distributions on the RHS can be substituted with whatever we want (e.g., a Gaussian or a Bernoulli).

  • We estimate $\theta$ from the dataset $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$. Then, given a test input $x$, compute $p(y = c \mid x, \hat{\theta})$ (a minimal sketch follows this list).
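
    To make the recipe above concrete, here is a minimal sketch of plain (point-estimate) Naive Bayes for categorical features. It is not from the original notes; the function names, the integer encoding of features in $\{0, \dots, K-1\}$, and the pseudo-count `alpha` are assumptions for illustration.

    ```python
    import numpy as np

    def fit_naive_bayes(X, y, K, C, alpha=1.0):
        """Point-estimate Naive Bayes for discrete features.

        X: (N, D) integer array with entries in {0, ..., K-1}
        y: (N,)   integer array with entries in {0, ..., C-1}
        alpha: small pseudo-count to avoid zero probabilities
        """
        N, D = X.shape
        # Class prior: pi_c = N_c / N
        pi = np.bincount(y, minlength=C) / N
        # theta[d, c, k] = p(x_d = k | y = c), estimated from class-conditional counts
        theta = np.zeros((D, C, K))
        for c in range(C):
            Xc = X[y == c]
            for d in range(D):
                counts = np.bincount(Xc[:, d], minlength=K) + alpha
                theta[d, c] = counts / counts.sum()
        return pi, theta

    def predict_proba(x, pi, theta):
        """Compute p(y = c | x) for a single test vector x of length D."""
        D, C, K = theta.shape
        # log p(y = c) + sum_d log p(x_d | y = c), in log space for numerical stability
        log_post = np.log(pi).copy()
        for d in range(D):
            log_post += np.log(theta[d, :, x[d]])
        log_post -= log_post.max()
        post = np.exp(log_post)
        return post / post.sum()
    ```

    For a test vector `x`, `predict_proba` evaluates $\log \pi_c + \sum_d \log \theta_{dc, x_d}$ for every class and normalizes, which is exactly the classification rule above.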

Bayesian Naive Bayes

  • The prior becomes:

    $$p(\theta) = p(\pi) \prod_{d=1}^{D} \prod_{c=1}^{C} p(\theta_{dc})$$

    This follows from the rule of product, since the parameters are assumed to be independent a priori.

  • Correspondingly, the posterior becomes

    $$p(\theta \mid \mathcal{D}) = p(\pi \mid \mathcal{D}) \prod_{d=1}^{D} \prod_{c=1}^{C} p(\theta_{dc} \mid \mathcal{D})$$

    where $\theta_{dc}$ is the parameter for the distribution of feature $x_d$ given class $c$.

    Also, the factors on the RHS are chosen as needed (e.g., they can be Dirichlet or Beta).

    The posterior has the same form as the prior, except it is conditioned on the evidence seen in the dataset.

  • We can analyze the likelihood as follows. For a single datapoint,

    $$p(x_n, y_n \mid \theta) = p(y_n \mid \pi) \prod_{d=1}^{D} p(x_{nd} \mid \theta_d) = \prod_{c=1}^{C} \pi_c^{\mathbb{1}(y_n = c)} \prod_{d=1}^{D} \prod_{c=1}^{C} p(x_{nd} \mid \theta_{dc})^{\mathbb{1}(y_n = c)}$$

    We also have, for the whole dataset,

    $$p(\mathcal{D} \mid \theta) = \prod_{c=1}^{C} \pi_c^{N_c} \prod_{d=1}^{D} \prod_{c=1}^{C} \prod_{n : y_n = c} p(x_{nd} \mid \theta_{dc})$$

    where $N_c$ is the number of datapoints with label $c$.

  • At test time, the goal is to compute

    $$p(y = c \mid x, \mathcal{D})$$

    Note that to actually compute the above in a Bayesian manner, we must integrate over $\pi$ and $\theta$ to get the marginal distribution:

    $$p(y = c \mid x, \mathcal{D}) \propto \left[ \int \pi_c \, p(\pi \mid \mathcal{D}) \, d\pi \right] \prod_{d=1}^{D} \left[ \int p(x_d \mid y = c, \theta_{dc}) \, p(\theta_{dc} \mid \mathcal{D}) \, d\theta_{dc} \right]$$
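
    As a worked example (not in the original notes): if we take the priors mentioned above to be Dirichlet, i.e. assume $p(\pi) = \mathrm{Dir}(\pi \mid \alpha)$ and $p(\theta_{dc}) = \mathrm{Dir}(\theta_{dc} \mid \beta)$, conjugacy with the categorical likelihood makes both integrals closed-form smoothed count ratios:

    ```latex
    % Posterior predictive terms under the assumed Dirichlet priors.
    % N_c   = number of training points with label c
    % N_dck = number of points with label c whose d-th feature equals k
    \int \pi_c \, p(\pi \mid \mathcal{D}) \, d\pi
      = \frac{N_c + \alpha_c}{N + \sum_{c'} \alpha_{c'}},
    \qquad
    \int p(x_d = k \mid y = c, \theta_{dc}) \, p(\theta_{dc} \mid \mathcal{D}) \, d\theta_{dc}
      = \frac{N_{dck} + \beta_k}{N_c + \sum_{k'} \beta_{k'}}
    ```

    Plugging these into the product above recovers the familiar additive (Laplace-style) smoothing of the count-based estimates.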

Filtering

  • Since we assume the features are conditionally independent, we need to choose the appropriate features.
  • One way to select the features is through variable filtering, that is, by keeping only the top-ranked features that are most relevant to the problem.
  • One way to rank them is to measure the mutual information between each feature $X_d$ and the label $Y$ (see the sketch below).
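
    A minimal sketch of such a filter (not from the original notes; the helper names and the `top_k` parameter are illustrative assumptions), using the empirical estimate $I(X_d; Y) = \sum_{x_d, y} \hat{p}(x_d, y) \log \frac{\hat{p}(x_d, y)}{\hat{p}(x_d)\,\hat{p}(y)}$ computed from counts:

    ```python
    import numpy as np

    def mutual_information(xd, y, eps=1e-12):
        """Empirical mutual information I(X_d; Y) for one discrete feature and the label."""
        xs, ys = np.unique(xd), np.unique(y)
        mi = 0.0
        for xv in xs:
            for yv in ys:
                p_xy = np.mean((xd == xv) & (y == yv))  # joint \hat{p}(x_d, y)
                p_x = np.mean(xd == xv)                 # marginal \hat{p}(x_d)
                p_y = np.mean(y == yv)                  # marginal \hat{p}(y)
                if p_xy > 0:
                    mi += p_xy * np.log(p_xy / (p_x * p_y + eps))
        return mi

    def filter_features(X, y, top_k):
        """Rank features by I(X_d; Y) and return the indices of the top_k most informative ones."""
        scores = np.array([mutual_information(X[:, d], y) for d in range(X.shape[1])])
        return np.argsort(scores)[::-1][:top_k]
    ```

    `filter_features(X, y, top_k)` returns the column indices to keep before fitting the Naive Bayes model.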

Links