### Advantages of Classification and Regression Trees (decision trees) (from Good & Hardin 2012:261ff.)
# (1) CART is useful "both as a preliminary to multiple regression as its primary splitters ought to be used as blocking variables, and for the presentation of results in a readily understandable format."
# (2) "Unlike regression coefficients, the branches of a decision tree lend themselves readily to interpretation by the nonstatistician." [This is not really different from the second part of argument 1; STG]
# (3) "We can influence the shape of the tree if we can specify the proportions of the various categories in the population at large."
# (4) "we can assign losses or penalties on a one-by-one basis to each specific type of misclassification, rather than use some potentially misleading aggregate measure such as least-square error."
# "Decision trees should be used whenever predictors are interdependent and their interaction may lead to reinforcing synergistic effects and/or a mixture of continuous and categorical variables, highly skewed data, and large numbers of missing observations adds to the complexity of the analysis."
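Points (3) and (4) can be sketched in code. The `tree` package used below does not expose these options, so this illustration uses `rpart` (which ships with R), whose `parms` argument takes both class priors and a loss matrix; the data and object names (`d`, `fit.prior`, `fit.loss`) are made up for the example:

```r
# a minimal sketch of points (3) and (4), with synthetic data and rpart
library(rpart)
set.seed(1)
d <- data.frame(x = c(rnorm(100, 0), rnorm(100, 1)),
                y = factor(rep(c("a", "b"), each=100)))
# (3) specify the population proportions of the categories via a prior:
fit.prior <- rpart(y ~ x, data=d, method="class",
                   parms=list(prior=c(0.9, 0.1)))
# (4) penalize misclassifying a true "b" as "a" 5 times as heavily as the
# reverse via a loss matrix (zeros on the diagonal):
fit.loss <- rpart(y ~ x, data=d, method="class",
                  parms=list(loss=matrix(c(0, 5, 1, 0), nrow=2)))
```

Changing the prior or the loss matrix shifts where the splits fall and which class the leaves predict, which is exactly the kind of fine-grained control the quote describes.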
### Disadvantages of Classification and Regression Trees (decision trees)
# "Regression trees are also known for their instability (Breiman, 1996). A small change in the training set [...] may lead to a different choice when building a node, which, in turn, may represent a dramatic change in the tree, particularly if the change occurs in top-level nodes. Branching is also affected by data density and sparseness, with more branching and smaller bins in data regions where data points are dense." (Good & Hardin 2012:263)
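The instability point can be made concrete with a small synthetic sketch (hypothetical data and names; `rpart`, which ships with R, stands in for a generic tree fitter). Two equally informative predictors make the choice of splitting variable fragile, so dropping a few rows can change the tree; whether it actually does depends on the seed:

```r
# instability sketch: refit after a small change to the training set
library(rpart)
set.seed(2)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- factor(ifelse(d$x1 + d$x2 + rnorm(200) > 0, "a", "b")) # x1, x2 equally predictive
fit.full <- rpart(y ~ x1 + x2, data=d,           method="class")
fit.pert <- rpart(y ~ x1 + x2, data=d[-(1:10), ], method="class") # 10 rows removed
# compare the splitting variables chosen at each node:
fit.full$frame$var
fit.pert$frame$var
```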
### Advantages of Classification and Regression Trees (from James et al. 2013:315f.)
# (1) Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression.
# (2) Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters.
# (3) Trees can be displayed graphically and are easily interpreted even by a non-expert (especially if they are small).
# (4) Trees can easily handle qualitative predictors without the need to create dummy variables.
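Point (4) can be shown directly (synthetic data, hypothetical names; `rpart` used for a self-contained example): a factor predictor goes into the formula as-is, with no `model.matrix`/dummy coding, and the split groups its levels:

```r
# a factor predictor is split on directly; no dummy variables needed
library(rpart)
set.seed(3)
d <- data.frame(f = factor(sample(c("low", "mid", "high"), 300, replace=TRUE)))
d$y <- factor(ifelse(d$f == "high", "yes", "no"))
fit <- rpart(y ~ f, data=d, method="class")
fit$frame$var # the tree splits on f itself, grouping its levels
```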
### Disadvantages of Classification and Regression Trees
# Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches seen in this book.
#+ fig.width=12, fig.height=8
rm(list=ls(all=TRUE)) # clear memory
set.seed(1) # set the seed of the random number generator (for replicability)
library(tree) # load useful package
# load and inspect data
summary(x <- read.delim("104_08b_clauseorders.csv"))
attach(x)
addmargins(table(SUBORDTYPE, ORDER))
summary(cart.0 <- # return the summary of an object called cart.0
tree(ORDER ~ # modeling with a classification tree the variable ORDER as a function of
SUBORDTYPE)) # SUBORDTYPE
# note: the residual mean deviance here is the same as in a glm:
glm(ORDER ~ SUBORDTYPE, data=x, family=binomial)$deviance
glm(ORDER ~ SUBORDTYPE, data=x, family=binomial)$df.residual
# note: the Misclassification error rate here is the same as in a glm:
1 - sum(diag(table(predict(glm(ORDER ~ SUBORDTYPE, data=x, family=binomial))>0, ORDER))/length(ORDER))
plot(cart.0) # plot the classification tree
text(cart.0, pretty=0) # add labels to it
# now to a bigger, more realistic tree ...
summary(cart.1 <- # return the summary of an object called cart.1
tree( # modeling with a classification tree
ORDER ~ # the variable ORDER as a function of
SUBORDTYPE+LEN_MC+LEN_SC+LENGTH_DIFF+CONJ+MORETHAN2CL)) # all other variables
# nearly all predictors show up in the tree - LEN_SC, however, does not
# this tree makes classifications with an accuracy of 1-0.2134=0.7866 (not much better than SUBORDTYPE alone)
# determine the nature of the effect(s) numerically
cart.1 # split criteria, number of obs., deviance, prediction, fractions of observations
# for instance, node 1, the root node
# - has 403 observations
# - has an overall deviance of 503.80 (that's glm()$null.deviance)
# - predicts "mc-sc": prop.table(table(ORDER))
# for instance, node 2
# - results from splitting on SUBORDTYPE and picking "caus"
# - has 199 observations
# - has a deviance of 106.4 (glm(ORDER ~ 1, data=x, subset=x$SUBORDTYPE=="caus", family=binomial))
# - predicts "mc-sc": prop.table(table(ORDER[SUBORDTYPE=="caus"]))
# etc.
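The node deviances reported above can be recomputed by hand: for a node with class counts n_k out of n observations, the deviance is -2 * sum(n_k * log(n_k/n)). Using the 275 mc-sc vs. 128 sc-mc cases implied by the confusion matrix further down (counts inferred, not stated directly), this recovers the root deviance:

```r
# node deviance from class counts: -2 * sum(n_k * log(n_k/n))
counts <- c(275, 128) # mc-sc vs. sc-mc, as implied by the confusion matrix below
-2 * sum(counts * log(counts / sum(counts))) # approx. 503.80, the root deviance
```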
# when does a tree stop splitting? for tree(), whenever (see ?tree.control)
# (1) a node's deviance falls below a threshold fraction of the root deviance (default: mindev=0.01)
# (2) the number of observations in the node falls below another threshold (default: minsize=10)
# (3) a split would create a child node with too few observations (default: mincut=5)
# (rpart() controls the same behavior with cp=0.01, minsplit=20, and maxdepth=30)
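These thresholds can be changed to force larger or smaller trees. A self-contained sketch (synthetic data, hypothetical names) using `rpart.control`, where the cp/minsplit/maxdepth thresholds live; the `tree` package's analogue is `tree.control(mincut, minsize, mindev)`:

```r
# restricting tree growth: maxdepth=1 allows at most one split (a "stump")
library(rpart)
set.seed(4)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- factor(ifelse(d$x1 + 0.5*d$x2 + rnorm(200) > 0, "a", "b"))
stump <- rpart(y ~ x1 + x2, data=d, method="class",
               control=rpart.control(maxdepth=1))
nrow(stump$frame) # at most 3 rows: the root plus (at most) two leaves
```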
predictions.num <- # make predictions.num
predict(cart.1) # the predictions for the data from cart.1
predictions.cat <- # make predictions.cat
predict(cart.1, # the predictions for the data from cart.1
type="class") # but this time the categorical class predictions
table(ORDER, # cross-tabulate the actually produced orders of clauses
predictions.cat) # against the predictions
# accuracy:
(92+225) / length(predictions.cat) # (tp+tn) / (tp+tn+fp+fn) = 0.7866005
# precision:
92/(92+50) # tp/(tp+fp) = 0.6478873
# recall/sensitivity:
92/(92+36) # tp/(tp+fn) = 0.71875
# F:
2*((0.6478873*0.71875)/(0.6478873+0.71875)) # 2 * ((prec*recall)/(prec+recall)) = 0.6814815
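Rather than typing the cell counts into each formula, the four measures can be wrapped in a small helper function (a hypothetical name, not part of any package); applied to the counts from the confusion matrix above, it reproduces the figures just computed:

```r
# helper computing accuracy, precision, recall, and F from tp/fp/fn/tn counts
class.metrics <- function(tp, fp, fn, tn) {
   prec <- tp/(tp+fp); rec <- tp/(tp+fn)
   c(accuracy  = (tp+tn)/(tp+fp+fn+tn),
     precision = prec,
     recall    = rec,
     F         = 2*prec*rec/(prec+rec))
}
round(class.metrics(tp=92, fp=50, fn=36, tn=225), 7) # 0.7866005 0.6478873 0.71875 0.6814815
```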
# determine the nature of the effect(s) graphically
plot(cart.1) # plot the classification tree
text(cart.1, pretty=4, all=TRUE) # add labels to it (all=TRUE also labels the inner nodes)
# validation 1: comparing classification to prediction accuracy
sampler <- sample( # make sampler the random ordering of
rep(c("training", "test"), # the words "training" and "test"
c(302, 101))) # repeated 302 and 101 times respectively
cart.validation.training <- # make cart.validation.training
tree(formula(cart.1), # a classification tree with the same formula as cart.1
data=x[sampler=="training",]) # but applied only to the 302 training cases
predictions.validation.test <- # make predictions.validation.test
predict(cart.validation.training, # the predictions from cart.validation.training
newdata=x[sampler=="test",], # applied only to the 101 test cases
type="class") # return the categorical class predictions
sum(predictions.validation.test == # compute the number of cases where the prediction for the test data
ORDER[sampler=="test"]) / # is the same as what actually happened in the test data
length(predictions.validation.test) # and divide that by the number of test predictions (for a %)
# very similar to the accuracy of the whole data set
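A single random split can be lucky or unlucky, so one could repeat the split many times and look at the distribution of test accuracies; the same `replicate()` idea carries over to cart.1 and x. A self-contained sketch with synthetic data and `rpart` (hypothetical names throughout):

```r
# repeated training/test splits yield a distribution of test accuracies
library(rpart)
set.seed(5)
d <- data.frame(x1 = rnorm(403), x2 = rnorm(403))
d$y <- factor(ifelse(d$x1 + rnorm(403) > 0, "a", "b"))
accuracies <- replicate(100, {
   s <- sample(rep(c("training", "test"), c(302, 101)))
   fit <- rpart(y ~ x1 + x2, data=d[s=="training", ], method="class")
   mean(predict(fit, newdata=d[s=="test", ], type="class") == d$y[s=="test"])
})
summary(accuracies) # how much the test accuracy varies across splits
```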
# validation 2: can we, or do we need to, prune the tree?
pruning <- # make pruning
cv.tree(cart.1, # the result of cross-validating the tree
FUN=prune.misclass) # based on the number of misclassifications
plot(pruning$size, # plot the pruned tree sizes
pruning$dev, # against the CV error at each size (with FUN=prune.misclass, $dev holds misclassification counts, not deviances)
type="b"); grid() # using points and lines; add a grid
# the CV error is lowest for 4 and 7 nodes; we pick 4 (7 is the unpruned tree's size, so it would mean no pruning at all)
cart.1.pruned <- # make cart.1.pruned
prune.misclass(cart.1, # a version of cart.1 pruned down to
best=4) # only 4 terminal nodes
plot(cart.1.pruned) # plot the classification tree
text(cart.1.pruned, pretty=0, all=TRUE) # add labels to it
# but does it do worse? (there should be hardly any difference)
predictions.cat.pruned <- # make predictions.cat
predict(cart.1.pruned, # the predictions for the data from cart.1.pruned
type="class") # the categorical class predictions
table(ORDER, # cross-tabulate the actually produced orders of clauses
predictions.cat.pruned) # against the predictions from the pruned tree
# accuracy:
(74+242) / length(predictions.cat) # (tp+tn) / (tp+tn+fp+fn) = 0.7841191
# precision:
74/(74+33) # tp/(tp+fp) = 0.6915888
# recall/sensitivity:
74/(74+54) # tp/(tp+fn) = 0.578125
# F:
2*((0.6915888*0.578125)/(0.6915888+0.578125)) # 2 * ((prec*recall)/(prec+recall)) = 0.6297872
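For a side-by-side answer to "does pruning hurt?", the accuracies can be computed directly from the cell counts of the two confusion matrices above (counts copied from them; `acc` is a hypothetical helper name):

```r
# unpruned vs. pruned accuracy from the two confusion matrices above
full   <- c(tp=92, fp=50, fn=36, tn=225)
pruned <- c(tp=74, fp=33, fn=54, tn=242)
acc <- function(m) unname((m["tp"] + m["tn"]) / sum(m))
c(full=acc(full), pruned=acc(pruned)) # 0.7866005 vs. 0.7841191: hardly any difference
```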