Decision tree using rpart

R’s rpart package provides a powerful framework for growing classification and regression trees.

Example 1: A motivating problem

First let’s define a problem. There’s a common scam amongst motorists whereby a person will slam on his breaks in heavy traffic with the intention of being rear-ended. The person will then file an insurance claim for personal injury and damage to his vehicle, alleging that the other driver was at fault. Suppose we want to predict which of an insurance company’s claims are fraudulent using a decision tree.

To start, we need to build a training set of known fraudulent claims.

train <- data.frame(
  ClaimID = c(1,2,3),
  RearEnd = c(TRUE, FALSE, TRUE),
  Fraud = c(TRUE, FALSE, TRUE)
)

train

   ClaimID RearEnd Fraud
 1       1    TRUE  TRUE
 2       2   FALSE FALSE
 3       3    TRUE  TRUE

In order to grow our decision tree, we have to first load the rpart package.

library(rpart)

mytree <- rpart(
  Fraud ~ RearEnd, 
  data = train, 
  method = "class"
)

mytree

 n= 3 
 
 node), split, n, loss, yval, (yprob)
       * denotes terminal node
 
 1) root 3 1 TRUE (0.3333333 0.6666667) *

Notice the output shows only a root node. This is because rpart has some default parameters that prevented our tree from growing. Namely minsplit and minbucket. minsplit is “the minimum number of observations that must exist in a node in order for a split to be attempted” and minbucket is “the minimum number of observations in any terminal node”. See what happens when we override these parameters.

mytree <- rpart(
  Fraud ~ RearEnd, 
  data = train, 
  method = "class", 
  minsplit = 2, 
  minbucket = 1
)

mytree

 n= 3 
 
 node), split, n, loss, yval, (yprob)
       * denotes terminal node
 
 1) root 3 1 TRUE (0.3333333 0.6666667)  
   2) RearEnd< 0.5 1 0 FALSE (1.0000000 0.0000000) *
   3) RearEnd>=0.5 2 0 TRUE (0.0000000 1.0000000) *

Now our tree has a root node, one split and two leaves (terminal nodes). Observe that rpart encoded our boolean variable as an integer (false = 0, true = 1). We can plot mytree by loading the rattle package (and some helper packages) and using the fancyRpartPlot() function.

library(rattle)
library(rpart.plot)
library(RColorBrewer)

# plot mytree
fancyRpartPlot(mytree, caption = NULL)

By default, rpart uses gini impurity to select splits when performing classification. You can use information gain instead by specifying it in the parms parameter.

mytree <- rpart(
  Fraud ~ RearEnd, 
  data = train, 
  method = "class",
  parms = list(split = 'information'), 
  minsplit = 2, 
  minbucket = 1
)

mytree

 n= 3 
 
 node), split, n, loss, yval, (yprob)
       * denotes terminal node
 
 1) root 3 1 TRUE (0.3333333 0.6666667)  
   2) RearEnd< 0.5 1 0 FALSE (1.0000000 0.0000000) *
   3) RearEnd>=0.5 2 0 TRUE (0.0000000 1.0000000) *

Now suppose our training set looked like this.

train <- data.frame(
  ClaimID = c(1,2,3),
  RearEnd = c(TRUE, FALSE, TRUE),
  Fraud = c(TRUE, FALSE, FALSE)
)

train

   ClaimID RearEnd Fraud
 1       1    TRUE  TRUE
 2       2   FALSE FALSE
 3       3    TRUE FALSE

If we try to build a decision tree on this data..

mytree <- rpart(
  Fraud ~ RearEnd, 
  data = train, 
  method = "class", 
  minsplit = 2, 
  minbucket = 1
)

mytree

 n= 3 
 
 node), split, n, loss, yval, (yprob)
       * denotes terminal node
 
 1) root 3 1 FALSE (0.6666667 0.3333333) *

Once again we’re left with just a root node. Internally, rpart keeps track of something called the complexity of a tree. The complexity measure is a combination of the size of a tree and the ability of the tree to separate the classes of the target variable. If the next best split in growing a tree does not reduce the tree’s overall complexity by a certain amount, rpart will terminate the growing process. This amount is specified by the complexity parameter, cp, in the call to rpart(). Setting cp to a negative amount ensures that the tree will be fully grown.

mytree <- rpart(
  Fraud ~ RearEnd, 
  data = train, 
  method = "class", 
  minsplit = 2, 
  minbucket = 1, 
  cp = -1
)

fancyRpartPlot(mytree, caption = NULL)

You can also weight each observation for the tree’s construction by specifying the weights argument to rpart().

mytree <- rpart(
  Fraud ~ RearEnd, 
  data = train, 
  method = "class", 
  minsplit = 2, 
  minbucket = 1,
  weights = c(0.4, 0.4, 0.2)
)

fancyRpartPlot(mytree, caption = NULL)

To alter the default, equal penalization of mislabeled target classes set the loss component of the parms parameter to a matrix where the (i,j) element is the penalty for misclassifying an i as a j. (The loss matrix must have 0s in the diagonal). For example, consider the following training data.

train <- data.frame(
  ClaimID = 1:7,
  RearEnd = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  Whiplash = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
  Fraud = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

train

   ClaimID RearEnd Whiplash Fraud
 1       1    TRUE     TRUE  TRUE
 2       2    TRUE     TRUE  TRUE
 3       3   FALSE     TRUE  TRUE
 4       4   FALSE     TRUE FALSE
 5       5   FALSE     TRUE FALSE
 6       6   FALSE    FALSE FALSE
 7       7   FALSE    FALSE FALSE

mytree <- rpart(
  Fraud ~ RearEnd + Whiplash, 
  data = train, 
  method = "class",
  maxdepth = 1, 
  minsplit = 2, 
  minbucket = 1
)

fancyRpartPlot(mytree, caption = NULL)

rpart has determined that RearEnd was the best variable for identifying a fraudulent claim. BUT there was one fraudulent claim in the training dataset that was not a rear-end collision. If the insurance company wants to identify a high percentage of fraudulent claims without worrying too much about investigating non-fraudulent claims they can set the loss matrix to penalize claims incorrectly labeled as fraudulent three times less than claims incorrectly labeled as non-fraudulent.

lossmatrix <- matrix(c(0,1,3,0), byrow = TRUE, nrow = 2)
lossmatrix

      [,1] [,2]
 [1,]    0    1
 [2,]    3    0

mytree <- rpart(
  Fraud ~ RearEnd + Whiplash, 
  data = train, 
  method = "class",
  maxdepth = 1, 
  minsplit = 2, 
  minbucket = 1,
  parms = list(loss = lossmatrix)
)

fancyRpartPlot(mytree, caption = NULL)

Decision tree using rpart

Example 1: A motivating problem

Example 2: kyphosis dataset