In the integrated BNT classifier, we propose to derive a decision tree from a Bayesian network (built upon the original data) instead of deriving the tree directly from the original data. By doing so, we expect the structure of the tree to be more stable, because the variable correlations are already taken into account in the Bayesian network, which may reduce the variable masking problem. To the best of our knowledge, this way of building decision trees has not been explored in previous studies.

To select a particular decision node in the BNT classifier, we use the mutual information value calculated between two nodes in the Bayesian network. This mutual information value is to some extent equivalent to the entropy measure that C4.5 decision trees use. It is defined as the expected entropy reduction of one node due to a finding (observation) related to the other node. The dependent variable is called the query variable (denoted by the symbol Q); the independent variables are called finding variables (denoted by the symbol F). The expected reduction in entropy (measured in bits) of Q due to a finding related to F can therefore be calculated according to the following equation (Pearl, 1988):

I(Q, F) = Σ_q Σ_f p(q, f) log2 [ p(q, f) / (p(q) p(f)) ]        (7.1)

where p(q, f) is the posterior probability that a particular state q of Q and a particular state f of F occur together; p(q) is the prior probability that a state q of Q will occur and p(f) is the prior probability that a state f of F will occur. The probabilities are summed across all states of Q and across all states of F. Because the posterior probability p(q, f) is divided by the product of the prior probabilities p(q) and p(f), this calculation identifies which finding variable best explains the variability of the dependent variable. That variable is then selected as the most important variable in our decision tree.
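As an illustration, the measure in equation (7.1) can be computed directly from a joint probability table, with the marginals obtained by summation. The function name and the probabilities below are hypothetical and are not taken from the example network:

```python
import math

def mutual_information(joint):
    """Expected entropy reduction I(Q, F) as in equation (7.1).

    joint: dict mapping (q, f) -> p(q, f); the priors p(q) and p(f)
    are derived by summing the joint distribution over the other variable.
    """
    p_q, p_f = {}, {}
    for (q, f), p in joint.items():
        p_q[q] = p_q.get(q, 0.0) + p
        p_f[f] = p_f.get(f, 0.0) + p
    info = 0.0
    for (q, f), p in joint.items():
        if p > 0:  # terms with p(q, f) = 0 contribute nothing
            info += p * math.log2(p / (p_q[q] * p_f[f]))
    return info

# Hypothetical joint distribution over Q = mode and F = gender
joint = {
    ("bike", "male"): 0.15, ("bike", "female"): 0.25,
    ("car", "male"): 0.35, ("car", "female"): 0.25,
}
print(round(mutual_information(joint), 5))  # prints 0.03031
```

For independent variables the ratio p(q, f) / (p(q) p(f)) equals 1 everywhere, so the measure is 0 bits, as expected.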

To this end we calculated the expected reduction in entropy of the dependent variable for each of the finding variables. The finding variable that achieves the highest reduction in entropy is selected as the root node of the tree. To better illustrate the idea of building a BNT classifier, consider again, by way of example, the network that was shown in Figure 7.1. In this case, the dependent variable is 'Mode choice' and the finding variables are 'Driving license', 'Gender' and 'Number of cars'. In a first step, we can for instance calculate the expected reduction in entropy between the 'Mode choice' and the 'Gender' variable.

The calculation of the joint probabilities P(Mode_i, Gender_j) for i = {bike, car} and j = {male, female} is the same as explained in Section 3.2. The calculation of the individual prior probabilities P(Mode_i) and P(Gender_j) is straightforward as well (see Section 3.2). As a result, the expected reduction in entropy according to formula (7.1) is:

In a similar way, I(Mode Choice, Driving License) = 0.01781 and I(Mode Choice, Number of cars) = 0.01346 can be calculated.

Since I(Mode Choice, Driving License) > I(Mode Choice, Number of cars) > I(Mode Choice, Gender), the variable 'Driving License' is selected as the root node of the tree (see Figure 7.3). Once the root node has been determined, the tree is split into different branches according to the different states (values) of the root node. To this end, evidence can be entered for each state of the root node in the Bayesian network and the entropy value can be re-calculated for all combinations between the remaining finding nodes (excluding the root node) and the query node. The node that achieves the highest entropy reduction is used for splitting up that particular branch of the root node. In our example, the root node 'Driving License' has two branches: Driving License = yes and Driving License = no. For the split in the first branch (Driving License = yes), only two variables have to be taken into account (since the root node is excluded): 'Number of cars' and 'Gender'. The expected reduction in entropy is calculated in the same way as shown above, except that evidence needs to be entered for the node 'Driving License', that is P(Driving License = yes; Driving License = no) = (1; 0) (since we are in the first branch). The procedure for doing this was already described in Section 4.3. In this case, I(Mode Choice, Gender) = 0.02282 and I(Mode Choice, Number of cars) = 0.07630. Since I(Mode Choice, Number of cars) > I(Mode Choice, Gender), the variable 'Number of cars' is selected as the next split in this first branch. The whole process then becomes recursive and is repeated for all possible branches in the tree. Computer code was written to automate the whole process. The final decision tree for this simple Bayesian network is shown in Figure 7.3.

Figure 7.3: The final integrated BNT decision tree classifier (example).
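The recursive procedure can be sketched as follows. Since no particular Bayesian-network library is assumed here, this version approximates entering evidence for a branch by filtering a table of records on the chosen state, and computes the entropy reduction of equation (7.1) from the empirical frequencies; the function names (`mi`, `build_tree`) and the record format are illustrative:

```python
import math
from collections import Counter

def mi(rows, query, feature):
    """Empirical expected entropy reduction I(query, feature), equation (7.1)."""
    n = len(rows)
    joint = Counter((r[query], r[feature]) for r in rows)
    pq = Counter(r[query] for r in rows)
    pf = Counter(r[feature] for r in rows)
    return sum((c / n) * math.log2((c / n) / ((pq[q] / n) * (pf[f] / n)))
               for (q, f), c in joint.items())

def build_tree(rows, query, features):
    """Recursively select the feature with the highest entropy reduction."""
    if not features or len({r[query] for r in rows}) <= 1:
        # Leaf: majority state of the query variable
        return Counter(r[query] for r in rows).most_common(1)[0][0]
    best = max(features, key=lambda f: mi(rows, query, f))
    branches = {}
    for value in {r[best] for r in rows}:
        # Filtering on one state of the split node stands in for
        # entering evidence for that state in the Bayesian network
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, query,
                                     [f for f in features if f != best])
    return (best, branches)
```

On a toy table in which driving-licence holders always choose the car, the root split lands on the licence variable and each branch collapses to a leaf, mirroring the structure of the worked example above.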
