决策树算法是一种非参数化监督学习方法,可用于分类和回归。其基本原理是用一系列的 if-else 分支对未知数据进行判断,从而给出预测结果;if-else 中的特征和阈值等参数则通过样本训练得到。决策树也可以看作是对目标函数的一系列分段常数近似。
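为了直观说明"一系列 if-else 分支"的含义,下面给出一个手写的极简示例:每个内部节点将某个特征与阈值比较,叶节点给出一个常数预测。其中的特征名与阈值均为假设的演示值,实际应用中这些参数由样本训练得到。

# 一个手写的极简"决策树",阈值 2.45、1.75 为假设的演示值
def toy_tree_predict(petal_length, petal_width):
    if petal_length <= 2.45:       # 内部节点:特征与阈值比较
        return "setosa"            # 叶节点:输出一个常数预测
    else:
        if petal_width <= 1.75:
            return "versicolor"
        else:
            return "virginica"

print(toy_tree_predict(1.4, 0.2))  # 输出 setosa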
决策树有以下优点:
- 易于理解和解释,并且整棵决策树可以可视化。
- 只需要较少的训练数据,且不需要对数据做归一化等预处理。
- 预测的计算开销与训练样本数成对数关系。
- 可以处理多输出问题。
- 可以通过统计检验来验证模型,并给出模型的可靠性。
- 即使数据不完全符合模型假设,也能表现良好。
同时具有以下缺点:
- 可能生成过于复杂的树而过拟合,导致模型泛化能力差。可通过剪枝、设置叶节点最小样本数或限制树的最大深度来避免这种情况(参见列表后的示例)。
- 可能不够稳定:数据的微小变化就可能生成完全不同的决策树。可通过集成学习(ensemble)来减小这种影响。
- 决策树的预测既不光滑也不连续,而是分段常数近似,因此不适于外插。
- 实际构建决策树时往往使用贪心算法寻找局部最优,因此无法保证得到全局最优的决策树。同样可以通过集成学习来缓解。
- 有一些概念决策树很难表达,比如异或(XOR)、奇偶(parity)或多路选择器(multiplexer)问题。
- 如果数据集中某些类别占主导地位,训练出的决策树会产生偏差,建议先对数据集做类别平衡。
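下面用一个简单的示例说明如何通过限制最大深度、叶节点最小样本数以及代价复杂度剪枝参数来抑制过拟合(基于 scikit-learn 的现成接口,数据集与参数取值仅作演示):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 载入演示数据并划分训练/测试集
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth 限制树深,min_samples_leaf 限制叶节点最小样本数,
# ccp_alpha 为代价复杂度剪枝强度,三者都用来抑制过拟合
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                             ccp_alpha=0.01, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))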
相比于前两种监督学习分类方法 Gaussian Naive Bayes 和 Nearest Centroid,决策树要复杂许多,涉及较复杂的树形数据结构及其相关算法。下面给出 scikit-learn 中决策树的源代码;由于 scikit-learn 采用了 Cython 来加速计算,原代码不够直观且无法直接以 Python 运行,此处对其进行了修改,全部用 Python 实现。
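为了说明下文各模块如何协同工作,先给出一个最小的拼装示例:构造 criterion、splitter、builder 与 Tree,在一个小数据集上建树并预测。这里假设下文的 Utils.py、Criterion.py、Splitter.py、Tree.py 已组织为一个名为 mytree 的包(包名与各参数取值均为假设,仅作演示):

import numpy as np
# 假设下文四个文件已组成名为 mytree 的包
from mytree.Criterion import Gini
from mytree.Splitter import BestSplitter
from mytree.Tree import Tree, DepthFirstTreeBuilder

X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
y = np.array([[0.], [0.], [1.], [1.]])   # 目标需为二维数组 (n_samples, n_outputs)
n_classes = np.array([2])                # 每个输出的类别数

criterion = Gini(1, n_classes)
# 参数依次为 criterion, max_features, min_samples_leaf, min_weight_leaf, random_state
splitter = BestSplitter(criterion, 2, 1, 0.0, np.random.RandomState(0))
tree = Tree(2, n_classes, 1)
# 参数依次为 splitter, min_samples_split, min_samples_leaf, min_weight_leaf,
# max_depth, min_impurity_decrease, min_impurity_split
builder = DepthFirstTreeBuilder(splitter, 2, 1, 0.0, 3, 0.0, 0.0)
builder.build(tree, X, y)

# predict 返回每个样本所落叶节点中各类别的(加权)样本计数,取 argmax 得到类别
print(np.argmax(tree.predict(X), axis=1))  # 预计输出 [0 0 1 1]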
Utils.py
SIZE_MAX = 2**32 - 1
# =============================================================================
# Stack data structure
# =============================================================================
# A record on the stack for depth-first tree growing
class StackRecord:
start = 0
end = 0
depth = 0
parent = 0
is_left = 1
impurity = 1
n_constant_features = 0
# =============================================================================
# PriorityHeap data structure
# =============================================================================
# A record on the frontier for best-first tree growing
class PriorityHeapRecord:
node_id = 0
start = 0
end = 0
pos = 0
depth = 0
is_leaf = 1
impurity = 1
impurity_left = 1
impurity_right = 1
improvement = 1
# =============================================================================
# Stack data structure
# =============================================================================
class Stack:
"""A LIFO data structure.
Attributes
----------
capacity : SIZE_t
        The number of elements the stack can hold; if more are added then ``self.stack_``
needs to be resized.
top : SIZE_t
The number of elements currently on the stack.
stack : StackRecord pointer
The stack of records (upward in the stack corresponds to the right).
"""
def __init__(self, capacity):
self.capacity = capacity
self.top = 0
self.stack_ = []
for i in range(capacity):
self.stack_.append(StackRecord())
def is_empty(self):
return self.top <= 0
def push(self, start, end, depth, parent,
is_left, impurity,
n_constant_features):
"""Push a new element onto the stack.
Return -1 in case of failure to allocate memory (and raise MemoryError)
or 0 otherwise.
"""
top = self.top
# Resize if capacity not sufficient
if top >= self.capacity:
for idx in range(self.capacity):
self.stack_.append(StackRecord())
self.capacity *= 2
stack = self.stack_[top]
stack.start = start
stack.end = end
stack.depth = depth
stack.parent = parent
stack.is_left = is_left
stack.impurity = impurity
stack.n_constant_features = n_constant_features
# Increment stack pointer
self.top = top + 1
return 0
def pop(self):
"""Remove the top element from the stack and copy to ``res``.
Returns 0 if pop was successful (and ``res`` is set); -1
otherwise.
"""
top = self.top
stack = self.stack_
if top <= 0:
return (-1, None)
res = stack[top - 1]
self.top = top - 1
return (0, res)
# =============================================================================
# PriorityHeap data structure
# =============================================================================
class PriorityHeap:
"""A priority queue implemented as a binary heap.
The heap invariant is that the impurity improvement of the parent record
    is larger than the impurity improvement of the children.
Attributes
----------
capacity : SIZE_t
The capacity of the heap
heap_ptr : SIZE_t
The water mark of the heap; the heap grows from left to right in the
array ``heap_``. The following invariant holds ``heap_ptr < capacity``.
heap_ : PriorityHeapRecord*
The array of heap records. The maximum element is on the left;
the heap grows from left to right
"""
def __init__(self, capacity):
self.capacity = capacity
self.heap_ptr = 0
self.heap_ = []
for idx in range(capacity):
self.heap_.append(PriorityHeapRecord())
def is_empty(self):
return self.heap_ptr <= 0
def heapify_up(self, heap, pos):
"""Restore heap invariant parent.improvement > child.improvement from
``pos`` upwards. """
if pos == 0:
return
        parent_pos = (pos - 1) // 2  # integer division: index of the parent record
if heap[parent_pos].improvement < heap[pos].improvement:
heap[parent_pos], heap[pos] = heap[pos], heap[parent_pos]
self.heapify_up(heap, parent_pos)
def heapify_down(self, heap, pos, heap_length):
"""Restore heap invariant parent.improvement > children.improvement from
``pos`` downwards. """
left_pos = 2 * (pos + 1) - 1
right_pos = 2 * (pos + 1)
largest = pos
if (left_pos < heap_length and
heap[left_pos].improvement > heap[largest].improvement):
largest = left_pos
if (right_pos < heap_length and
heap[right_pos].improvement > heap[largest].improvement):
largest = right_pos
if largest != pos:
heap[pos], heap[largest] = heap[largest], heap[pos]
self.heapify_down(heap, largest, heap_length)
def push(self, node_id, start, end, pos,
depth, is_leaf, improvement,
impurity, impurity_left,
impurity_right):
"""Push record on the priority heap.
Return -1 in case of failure to allocate memory (and raise MemoryError)
or 0 otherwise.
"""
        heap_ptr = self.heap_ptr
        # Resize if capacity not sufficient
        if heap_ptr >= self.capacity:
            for idx in range(self.capacity):
                self.heap_.append(PriorityHeapRecord())
            self.capacity *= 2
        # Fetch the record slot only after any resize, so the index is valid
        heap = self.heap_[heap_ptr]
# Put element as last element of heap
heap.node_id = node_id
heap.start = start
heap.end = end
heap.pos = pos
heap.depth = depth
heap.is_leaf = is_leaf
heap.impurity = impurity
heap.impurity_left = impurity_left
heap.impurity_right = impurity_right
heap.improvement = improvement
# Heapify up
self.heapify_up(self.heap_, heap_ptr)
# Increase element count
self.heap_ptr = heap_ptr + 1
return 0
def pop(self):
"""Remove max element from the heap. """
heap_ptr = self.heap_ptr
heap = self.heap_
if heap_ptr <= 0:
return -1
# Take first element
res = heap[0]
# Put last element to the front
heap[0], heap[heap_ptr - 1] = heap[heap_ptr - 1], heap[0]
# Restore heap invariant
if heap_ptr > 1:
self.heapify_down(heap, 0, heap_ptr - 1)
self.heap_ptr = heap_ptr - 1
return (0, res)
class Node:
# Base storage structure for the nodes in a Tree object
left_child = 0 # id of the left child of the node
right_child = 0 # id of the right child of the node
feature = 0 # Feature used for splitting the node
threshold = 1 # Threshold value at the node
impurity = 1 # Impurity of the node (i.e., the value of the criterion)
n_node_samples = 0 # Number of samples at the node
weighted_n_node_samples = 1 # Weighted number of samples at the node
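Utils.py 中的 Stack 供深度优先建树使用,PriorityHeap 供最佳优先建树使用。下面是一个简单的使用示意(直接使用上面定义的两个类,各参数均为随意取的演示值):

# Stack:后进先出
stack = Stack(2)
# push 参数依次为 start, end, depth, parent, is_left, impurity, n_constant_features
stack.push(0, 100, 0, -1, 0, 0.5, 0)
stack.push(0, 50, 1, 0, 1, 0.3, 0)
rc, rec = stack.pop()                      # 先弹出后压入的记录
print(rc, rec.start, rec.end, rec.depth)   # 预计输出 0 0 50 1

# PriorityHeap:improvement 最大的记录先弹出
heap = PriorityHeap(2)
# push 参数依次为 node_id, start, end, pos, depth, is_leaf,
# improvement, impurity, impurity_left, impurity_right
heap.push(0, 0, 100, 50, 0, 0, 0.10, 0.5, 0.4, 0.3)
heap.push(1, 0, 50, 25, 1, 0, 0.25, 0.4, 0.2, 0.1)
rc, rec = heap.pop()
print(rc, rec.node_id, rec.improvement)    # 预计输出 0 1 0.25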
Criterion.py
import numpy as np
from .Splitter import log
class Criterion:
"""Interface for impurity criteria.
This object stores methods on how to calculate how good a split is using
different metrics.
"""
def proxy_impurity_improvement(self):
"""Compute a proxy of the impurity reduction
This method is used to speed up the search for the best split.
It is a proxy quantity such that the split that maximizes this value
also maximizes the impurity improvement. It neglects all constant terms
of the impurity decrease for a given split.
The absolute impurity improvement is only computed by the
impurity_improvement method once the best split has been found.
"""
(impurity_left, impurity_right) = self.children_impurity()
return (- self.weighted_n_right * impurity_right
- self.weighted_n_left * impurity_left)
def impurity_improvement(self, impurity):
"""Compute the improvement in impurity
This method computes the improvement in impurity when a split occurs.
The weighted impurity improvement equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity
- N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples
at the current node, N_t_L is the number of samples in the left child,
and N_t_R is the number of samples in the right child,
Parameters
----------
impurity : double
The initial impurity of the node before the split
Return
------
double : improvement in impurity after the split occurs
"""
(impurity_left, impurity_right) = self.children_impurity()
return ((self.weighted_n_node_samples / self.weighted_n_samples) *
(impurity - (self.weighted_n_right /
self.weighted_n_node_samples * impurity_right)
- (self.weighted_n_left /
self.weighted_n_node_samples * impurity_left)))
class ClassificationCriterion(Criterion):
"""Abstract criterion for classification."""
def __init__(self, n_outputs,
n_classes):
"""Initialize attributes for this criterion.
Parameters
----------
n_outputs : SIZE_t
The number of targets, the dimensionality of the prediction
n_classes : numpy.ndarray, dtype=SIZE_t
The number of unique classes in each target
"""
self.sample_weight = None
self.samples = None
self.start = 0
self.pos = 0
self.end = 0
self.n_outputs = n_outputs
self.n_samples = 0
self.n_node_samples = 0
self.weighted_n_node_samples = 0.0
self.weighted_n_left = 0.0
self.weighted_n_right = 0.0
# Count labels for each output
self.n_classes = []
sum_stride = 0
# For each target, set the number of unique classes in that target,
# and also compute the maximal stride of all targets
for k in range(n_outputs):
self.n_classes.append(n_classes[k])
if n_classes[k] > sum_stride:
sum_stride = n_classes[k]
self.sum_stride = sum_stride
n_elements = n_outputs * sum_stride
self.sum_total = np.zeros(n_elements)
self.sum_left = np.zeros(n_elements)
self.sum_right = np.zeros(n_elements)
def init(self, y,
sample_weight, weighted_n_samples,
samples, start, end):
"""Initialize the criterion at node samples[start:end] and
children samples[start:start] and samples[start:end].
Returns -1 in case of failure to allocate memory (and raise MemoryError)
or 0 otherwise.
Parameters
----------
y : array-like, dtype=DOUBLE_t
The target stored as a buffer for memory efficiency
sample_weight : array-like, dtype=DOUBLE_t
The weight of each sample
weighted_n_samples : double
The total weight of all samples
samples : array-like, dtype=SIZE_t
A mask on the samples, showing which ones we want to use
start : SIZE_t
The first sample to use in the mask
end : SIZE_t
The last sample to use in the mask
"""
self.y = y
self.sample_weight = sample_weight
self.samples = samples
self.start = start
self.end = end
self.n_node_samples = end - start
self.weighted_n_samples = weighted_n_samples
self.weighted_n_node_samples = 0.0
w = 1.0
offset = 0
for k in range(self.n_outputs):
for j in range(self.n_classes[k]):
self.sum_total[offset + j] = 0
offset += self.sum_stride
for p in range(start, end):
i = samples[p]
# w is originally set to be 1.0, meaning that if no sample weights
# are given, the default weight of each sample is 1.0
            if sample_weight is not None:
w = sample_weight[i]
# Count weighted class frequency for each target
for k in range(self.n_outputs):
c = int(self.y[i, k])
self.sum_total[k * self.sum_stride + c] += w
self.weighted_n_node_samples += w
# Reset to pos=start
self.reset()
return 0
def reset(self):
"""Reset the criterion at pos=start
Returns -1 in case of failure to allocate memory (and raise MemoryError)
or 0 otherwise.
"""
self.pos = self.start
self.weighted_n_left = 0.0
self.weighted_n_right = self.weighted_n_node_samples
n_classes = self.n_classes
for k in range(self.n_outputs):
for j in range(n_classes[k]):
idx = k*self.sum_stride + j
self.sum_left[idx] = 0
self.sum_right[idx] = self.sum_total[idx]
return 0
def reverse_reset(self):
"""Reset the criterion at pos=end
Returns -1 in case of failure to allocate memory (and raise MemoryError)
or 0 otherwise.
"""
self.pos = self.end
self.weighted_n_left = self.weighted_n_node_samples
self.weighted_n_right = 0.0
for k in range(self.n_outputs):
for j in range(self.n_classes[k]):
idx = k*self.sum_stride + j
self.sum_right[idx] = 0
self.sum_left[idx] = self.sum_total[idx]
return 0
def update(self, new_pos):
"""Updated statistics by moving samples[pos:new_pos] to the left child.
Returns -1 in case of failure to allocate memory (and raise MemoryError)
or 0 otherwise.
Parameters
----------
new_pos : SIZE_t
The new ending position for which to move samples from the right
child to the left child.
"""
pos = self.pos
end = self.end
w = 1.0
# Update statistics up to new_pos
#
# Given that
# sum_left[x] + sum_right[x] = sum_total[x]
# and that sum_total is known, we are going to update
# sum_left from the direction that require the least amount
        # of computations, i.e. from pos to new_pos or from end to new_pos.
if (new_pos - pos) <= (end - new_pos):
for p in range(pos, new_pos):
i = self.samples[p]
                if self.sample_weight is not None:
w = self.sample_weight[i]
for k in range(self.n_outputs):
label_index = k * self.sum_stride + int(self.y[i, k])
self.sum_left[label_index] += w
self.weighted_n_left += w
else:
self.reverse_reset()
for p in range(end - 1, new_pos - 1, -1):
i = self.samples[p]
                if self.sample_weight is not None:
w = self.sample_weight[i]
for k in range(self.n_outputs):
label_index = k * self.sum_stride + int(self.y[i, k])
self.sum_left[label_index] -= w
self.weighted_n_left -= w
# Update right part statistics
self.weighted_n_right = self.weighted_n_node_samples - self.weighted_n_left
for k in range(self.n_outputs):
for c in range(self.n_classes[k]):
idx = k*self.sum_stride + c
self.sum_right[idx] = self.sum_total[idx] - self.sum_left[idx]
self.pos = new_pos
return 0
def node_value(self):
"""Compute the node value of samples[start:end] and save it into dest.
Parameters
----------
dest : double pointer
The memory address which we will save the node value into.
"""
dest = []
for k in range(self.n_outputs):
for j in range(self.n_classes[k]):
dest.append(self.sum_total[k*self.sum_stride+j])
return dest
class Entropy(ClassificationCriterion):
r"""Cross Entropy impurity criterion.
This handles cases where the target is a classification taking values
0, 1, ... K-2, K-1. If node m represents a region Rm with Nm observations,
then let
count_k = 1 / Nm \sum_{x_i in Rm} I(yi = k)
be the proportion of class k observations in node m.
The cross-entropy is then defined as
cross-entropy = -\sum_{k=0}^{K-1} count_k log(count_k)
"""
def node_impurity(self):
"""Evaluate the impurity of the current node, i.e. the impurity of
samples[start:end], using the cross-entropy criterion."""
entropy = 0.0
for k in range(self.n_outputs):
for c in range(self.n_classes[k]):
count_k = self.sum_total[k*self.sum_stride+c]
if count_k > 0.0:
count_k /= self.weighted_n_node_samples
entropy -= count_k * log(count_k)
return entropy / self.n_outputs
def children_impurity(self):
"""Evaluate the impurity in children nodes
i.e. the impurity of the left child (samples[start:pos]) and the
impurity the right child (samples[pos:end]).
Parameters
----------
impurity_left : double pointer
The memory address to save the impurity of the left node
impurity_right : double pointer
The memory address to save the impurity of the right node
"""
entropy_left = 0.0
entropy_right = 0.0
for k in range(self.n_outputs):
for c in range(self.n_classes[k]):
count_k = self.sum_left[k*self.sum_stride+c]
if count_k > 0.0:
count_k /= self.weighted_n_left
entropy_left -= count_k * log(count_k)
count_k = self.sum_right[k*self.sum_stride+c]
if count_k > 0.0:
count_k /= self.weighted_n_right
entropy_right -= count_k * log(count_k)
impurity_left = entropy_left / self.n_outputs
impurity_right = entropy_right / self.n_outputs
return (impurity_left, impurity_right)
class Gini(ClassificationCriterion):
r"""Gini Index impurity criterion.
This handles cases where the target is a classification taking values
0, 1, ... K-2, K-1. If node m represents a region Rm with Nm observations,
then let
count_k = 1/ Nm \sum_{x_i in Rm} I(yi = k)
be the proportion of class k observations in node m.
The Gini Index is then defined as:
index = \sum_{k=0}^{K-1} count_k (1 - count_k)
= 1 - \sum_{k=0}^{K-1} count_k ** 2
"""
def node_impurity(self):
"""Evaluate the impurity of the current node, i.e. the impurity of
samples[start:end] using the Gini criterion."""
gini = 0.0
for k in range(self.n_outputs):
sq_count = 0.0
for c in range(self.n_classes[k]):
count_k = self.sum_total[k*self.sum_stride+c]
sq_count += count_k * count_k
gini += 1.0 - sq_count / (self.weighted_n_node_samples *
self.weighted_n_node_samples)
return gini / self.n_outputs
def children_impurity(self):
"""Evaluate the impurity in children nodes
i.e. the impurity of the left child (samples[start:pos]) and the
impurity the right child (samples[pos:end]) using the Gini index.
Parameters
----------
impurity_left : double pointer
The memory address to save the impurity of the left node to
impurity_right : double pointer
The memory address to save the impurity of the right node to
"""
gini_left = 0.0
gini_right = 0.0
for k in range(self.n_outputs):
sq_count_left = 0.0
sq_count_right = 0.0
for c in range(self.n_classes[k]):
count_k = self.sum_left[k*self.sum_stride+c]
sq_count_left += count_k * count_k
count_k = self.sum_right[k*self.sum_stride+c]
sq_count_right += count_k * count_k
gini_left += 1.0 - sq_count_left / (self.weighted_n_left *
self.weighted_n_left)
gini_right += 1.0 - sq_count_right / (self.weighted_n_right *
self.weighted_n_right)
impurity_left = gini_left / self.n_outputs
impurity_right = gini_right / self.n_outputs
return (impurity_left, impurity_right)
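下面用一小段独立的代码直观验证上述两个 impurity 公式,以单输出、类别计数为 [2, 2] 的节点为例(数值仅作演示):

import math

counts = [2.0, 2.0]                 # 节点内各类别的(加权)样本数
n = sum(counts)

# Gini: 1 - sum(p_k ** 2)
gini = 1.0 - sum((c / n) ** 2 for c in counts)

# 交叉熵: -sum(p_k * log2(p_k)),只对非零计数求和
entropy = -sum((c / n) * math.log2(c / n) for c in counts if c > 0.0)

print(gini, entropy)                # 预计输出 0.5 1.0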
Splitter.py
import numpy as np
import math
RAND_R_MAX = 0x7FFFFFFF
DEFAULT_SEED = 1
# rand_r replacement using a 32bit XorShift generator
# See http://www.jstatsoft.org/v08/i14/paper for details
def our_rand_r(seed):
"""Generate a pseudo-random np.uint32 from a np.uint32 seed"""
# seed shouldn't ever be 0.
if (seed == 0):
seed = DEFAULT_SEED
seed ^= seed << 13
seed ^= seed >> 17
seed ^= seed << 5
# Note: we must be careful with the final line cast to np.uint32 so that
# the function behaves consistently across platforms.
#
# The following cast might yield different results on different platforms:
# wrong_cast = <UINT32_t> RAND_R_MAX + 1
#
# We can use:
# good_cast = <UINT32_t>(RAND_R_MAX + 1)
# or:
# cdef np.uint32_t another_good_cast = <UINT32_t>RAND_R_MAX + 1
return seed % (RAND_R_MAX + 1)
def rand_int(low, high, random_state):
"""Generate a random integer in [low; end)."""
return low + our_rand_r(random_state) % (high - low)
def rand_uniform(low, high, random_state):
"""Generate a random double in [low; high)."""
return ((high - low) * our_rand_r(random_state) /
RAND_R_MAX) + low
def log(x):
return math.log(x) / math.log(2.0)
INFINITY = np.inf
# Mitigate precision differences between 32 bit and 64 bit
FEATURE_THRESHOLD = 1e-7
# Constant to switch between algorithm non zero value extract algorithm
# in SparseSplitter
EXTRACT_NNZ_SWITCH = 0.1
class SplitRecord:
# Data to track sample split
feature = 0 # Which feature to split on.
pos = 0 # Split samples array at the given position,
# i.e. count of samples below threshold for feature.
# pos is >= end if the node is a leaf.
threshold = 1.0 # Threshold to split at.
improvement = 1.0 # Impurity improvement given parent node.
impurity_left = 1.0 # Impurity of the left split.
impurity_right = 1.0 # Impurity of the right split.
def copy(self):
tmp = SplitRecord()
tmp.feature = self.feature
tmp.pos = self.pos
tmp.threshold = self.threshold
tmp.improvement = self.improvement
tmp.impurity_left = self.impurity_left
tmp.impurity_right = self.impurity_right
return tmp
def _init_split(self, start_pos):
self.impurity_left = INFINITY
self.impurity_right = INFINITY
self.pos = start_pos
self.feature = 0
self.threshold = 0.
self.improvement = -INFINITY
class Splitter:
"""Abstract splitter class.
Splitters are called by tree builders to find the best splits on both
sparse and dense data, one split at a time.
"""
def __init__(self, criterion, max_features,
min_samples_leaf, min_weight_leaf,
random_state):
"""
Parameters
----------
criterion : Criterion
The criterion to measure the quality of a split.
max_features : SIZE_t
The maximal number of randomly selected features which can be
considered for a split.
min_samples_leaf : SIZE_t
The minimal number of samples each leaf can have, where splits
which would result in having less samples in a leaf are not
considered.
min_weight_leaf : double
The minimal weight each leaf can have, where the weight is the sum
of the weights of each sample in it.
random_state : object
The user inputted random state to be used for pseudo-randomness
"""
self.criterion = criterion
self.samples = None
self.n_samples = 0
self.features = None
self.n_features = 0
self.feature_values = None
self.sample_weight = None
self.max_features = max_features
self.min_samples_leaf = min_samples_leaf
self.min_weight_leaf = min_weight_leaf
self.random_state = random_state
def init(self,
X,
y,
sample_weight,
X_idx_sorted=None):
"""Initialize the splitter.
Take in the input data X, the target Y, and optional sample weights.
Returns -1 in case of failure to allocate memory (and raise MemoryError)
or 0 otherwise.
Parameters
----------
X : object
This contains the inputs. Usually it is a 2d numpy array.
y : ndarray, dtype=DOUBLE_t
This is the vector of targets, or true labels, for the samples
sample_weight : DOUBLE_t*
The weights of the samples, where higher weighted samples are fit
closer than lower weight samples. If not provided, all samples
are assumed to have uniform weight.
X_idx_sorted : ndarray, default=None
The indexes of the sorted training input samples
"""
self.rand_r_state = self.random_state.randint(0, RAND_R_MAX)
n_samples = X.shape[0]
# Create a new array which will be used to store nonzero
# samples from the feature of interest
self.samples = []
for idx in range(n_samples):
self.samples.append(0)
weighted_n_samples = 0.0
j = 0
for i in range(n_samples):
# Only work with positively weighted samples
            if sample_weight is None or sample_weight[i] != 0.0:
self.samples[j] = i
j += 1
            if sample_weight is not None:
weighted_n_samples += sample_weight[i]
else:
weighted_n_samples += 1.0
# Number of samples is number of positively weighted samples
self.n_samples = j
self.weighted_n_samples = weighted_n_samples
n_features = X.shape[1]
self.features = []
for i in range(n_features):
self.features.append(0)
for i in range(n_features):
self.features[i] = i
self.n_features = n_features
self.feature_values = []
for idx in range(n_samples):
self.feature_values.append(0)
self.constant_features = []
for idx in range(n_features):
self.constant_features.append(0)
self.y = y
self.sample_weight = sample_weight
return 0
def node_reset(self, start, end):
"""Reset splitter on node samples[start:end].
Returns -1 in case of failure to allocate memory (and raise MemoryError)
or 0 otherwise.
Parameters
----------
start : SIZE_t
The index of the first sample to consider
end : SIZE_t
The index of the last sample to consider
weighted_n_node_samples : ndarray, dtype=double pointer
The total weight of those samples
"""
self.start = start
self.end = end
self.criterion.init(self.y,
self.sample_weight,
self.weighted_n_samples,
self.samples,
start,
end)
weighted_n_node_samples = self.criterion.weighted_n_node_samples
return (0, weighted_n_node_samples)
def node_value(self):
"""Copy the value of node samples[start:end] into dest."""
return self.criterion.node_value()
def node_impurity(self):
"""Return the impurity of the current node."""
return self.criterion.node_impurity()
class BaseDenseSplitter(Splitter):
def __init__(self, criterion, max_features,
min_samples_leaf, min_weight_leaf,
random_state):
super().__init__(criterion, max_features,
min_samples_leaf, min_weight_leaf,
random_state)
self.X_idx_sorted_ptr = None
self.X_idx_sorted_stride = 0
self.sample_mask = None
def init(self,
X,
y,
sample_weight,
X_idx_sorted=None):
"""Initialize the splitter
Returns -1 in case of failure to allocate memory (and raise MemoryError)
or 0 otherwise.
"""
# Call parent init
Splitter.init(self, X, y, sample_weight)
self.X = X
return 0
class BestSplitter(BaseDenseSplitter):
"""Splitter for finding the best split."""
def node_split(self, impurity,
n_constant_features):
"""Find the best split on node samples[start:end]
Returns -1 in case of failure to allocate memory (and raise MemoryError)
or 0 otherwise.
"""
# Find the best split
samples = self.samples
start = self.start
end = self.end
features = self.features
constant_features = self.constant_features
n_features = self.n_features
Xf = self.feature_values
max_features = self.max_features
min_samples_leaf = self.min_samples_leaf
min_weight_leaf = self.min_weight_leaf
random_state = self.rand_r_state
current_proxy_improvement = -INFINITY
best_proxy_improvement = -INFINITY
f_i = n_features
n_visited_features = 0
# Number of features discovered to be constant during the split search
n_found_constants = 0
# Number of features known to be constant and drawn without replacement
n_drawn_constants = 0
n_known_constants = n_constant_features
# n_total_constants = n_known_constants + n_found_constants
n_total_constants = n_known_constants
best = SplitRecord()
current = SplitRecord()
_init_split(best, end)
# Sample up to max_features without replacement using a
# Fisher-Yates-based algorithm (using the local variables `f_i` and
# `f_j` to compute a permutation of the `features` array).
#
# Skip the CPU intensive evaluation of the impurity criterion for
# features that were already detected as constant (hence not suitable
# for good splitting) by ancestor nodes and save the information on
# newly discovered constant features to spare computation on descendant
# nodes.
while (f_i > n_total_constants and # Stop early if remaining features
# are constant
(n_visited_features < max_features or
# At least one drawn features must be non constant
n_visited_features <= n_found_constants + n_drawn_constants)):
n_visited_features += 1
# Loop invariant: elements of features in
# - [:n_drawn_constant[ holds drawn and known constant features;
# - [n_drawn_constant:n_known_constant[ holds known constant
# features that haven't been drawn yet;
# - [n_known_constant:n_total_constant[ holds newly found constant
# features;
# - [n_total_constant:f_i[ holds features that haven't been drawn
# yet and aren't constant apriori.
# - [f_i:n_features[ holds features that have been drawn
# and aren't constant.
# Draw a feature at random
f_j = rand_int(n_drawn_constants, f_i - n_found_constants,
random_state)
if f_j < n_known_constants:
# f_j in the interval [n_drawn_constants, n_known_constants[
features[n_drawn_constants], features[f_j] = features[f_j], features[n_drawn_constants]
n_drawn_constants += 1
else:
# f_j in the interval [n_known_constants, f_i - n_found_constants[
f_j += n_found_constants
# f_j in the interval [n_total_constants, f_i[
current.feature = features[f_j]
# Sort samples along that feature; by
# copying the values into an array and
# sorting the array in a manner which utilizes the cache more
# effectively.
for i in range(start, end):
Xf[i] = self.X[samples[i], current.feature]
tmp_list = [(Xf[i], samples[i]) for i in range(start, end)]
tmp_list.sort(key=lambda item: item[0])
for i in range(start, end):
Xf[i] = tmp_list[i-start][0]
samples[i] = tmp_list[i-start][1]
# TODO: original sort algorithm
# sort(Xf, samples, end - start, start)
if Xf[end - 1] <= Xf[start] + FEATURE_THRESHOLD:
features[f_j], features[n_total_constants] = features[n_total_constants], features[f_j]
n_found_constants += 1
n_total_constants += 1
else:
f_i -= 1
features[f_i], features[f_j] = features[f_j], features[f_i]
# Evaluate all splits
self.criterion.reset()
p = start
while p < end:
while (p + 1 < end and
Xf[p + 1] <= Xf[p] + FEATURE_THRESHOLD):
p += 1
# (p + 1 >= end) or (X[samples[p + 1], current.feature] >
# X[samples[p], current.feature])
p += 1
# (p >= end) or (X[samples[p], current.feature] >
# X[samples[p - 1], current.feature])
if p < end:
current.pos = p
# Reject if min_samples_leaf is not guaranteed
if (((current.pos - start) < min_samples_leaf) or
((end - current.pos) < min_samples_leaf)):
continue
self.criterion.update(current.pos)
# Reject if min_weight_leaf is not satisfied
if ((self.criterion.weighted_n_left < min_weight_leaf) or
(self.criterion.weighted_n_right < min_weight_leaf)):
continue
current_proxy_improvement = self.criterion.proxy_impurity_improvement()
if current_proxy_improvement > best_proxy_improvement:
best_proxy_improvement = current_proxy_improvement
# sum of halves is used to avoid infinite value
current.threshold = Xf[p - 1] / 2.0 + Xf[p] / 2.0
if ((current.threshold == Xf[p]) or
(current.threshold == INFINITY) or
(current.threshold == -INFINITY)):
current.threshold = Xf[p - 1]
best = current.copy() # copy
# Reorganize into samples[start:best.pos] + samples[best.pos:end]
if best.pos < end:
partition_end = end
p = start
while p < partition_end:
if self.X[samples[p], best.feature] <= best.threshold:
p += 1
else:
partition_end -= 1
samples[p], samples[partition_end] = samples[partition_end], samples[p]
self.criterion.reset()
self.criterion.update(best.pos)
best.improvement = self.criterion.impurity_improvement(impurity)
(best.impurity_left, best.impurity_right) = self.criterion.children_impurity()
# Respect invariant for constant features: the original order of
# element in features[:n_known_constants] must be preserved for sibling
# and child nodes
for i in range(n_known_constants):
features[i] = constant_features[i]
# Copy newly found constant features
for i in range(n_found_constants):
constant_features[n_known_constants+i] = features[n_known_constants+i]
# Return values
split = best
n_constant_features = n_total_constants
return (0, split, n_constant_features)
# Sort n-element arrays pointed to by Xf and samples, simultaneously,
# by the values in Xf. Algorithm: Introsort (Musser, SP&E, 1997).
def sort(Xf, samples, n, start):
if n == 0:
return
maxd = 2 * int(log(n))
introsort(Xf, samples, n, maxd, start)
def swap(Xf, samples, i, j):
# Helper for sort
Xf[i], Xf[j] = Xf[j], Xf[i]
samples[i], samples[j] = samples[j], samples[i]
def median3(Xf, n, idx):
# Median of three pivot selection, after Bentley and McIlroy (1993).
# Engineering a sort function. SP&E. Requires 8/3 comparisons on average.
a = Xf[idx]
b = Xf[n // 2 + idx]
c = Xf[n - 1 + idx]
if a < b:
if b < c:
return b
elif a < c:
return c
else:
return a
elif b < c:
if a < c:
return a
else:
return c
else:
return b
# Introsort with median of 3 pivot selection and 3-way partition function
# (robust to repeated elements, e.g. lots of zero features).
def introsort(Xf, samples, n, maxd, idx):
while n > 1:
if maxd <= 0: # max depth limit exceeded ("gone quadratic")
heapsort(Xf, samples, n, idx)
return
maxd -= 1
pivot = median3(Xf, n, idx)
# Three-way partition.
i = l = idx
r = n + idx
while i < r:
if Xf[i] < pivot:
swap(Xf, samples, i, l)
i += 1
l += 1
elif Xf[i] > pivot:
r -= 1
swap(Xf, samples, i, r)
else:
i += 1
introsort(Xf, samples, l-idx, maxd, idx)
n -= (r-idx)
idx = r
def sift_down(Xf, samples, start, end):
# Restore heap order in Xf[start:end] by moving the max element to start.
root = start
while True:
child = root * 2 + 1
# find max of root, left child, right child
maxind = root
if child < end and Xf[maxind] < Xf[child]:
maxind = child
if child + 1 < end and Xf[maxind] < Xf[child + 1]:
maxind = child + 1
if maxind == root:
break
else:
swap(Xf, samples, root, maxind)
root = maxind
def heapsort(Xf, samples, n, idx):
# heapify
start = (n - 2) // 2 + idx
end = n + idx
while True:
sift_down(Xf, samples, start, end)
if start == 0:
break
start -= 1
# sort by shrinking the heap, putting the max element immediately after it
end = n - 1 + idx
while end > idx:
swap(Xf, samples, idx, end)
sift_down(Xf, samples, idx, end)
end = end - 1
class RandomSplitter(BaseDenseSplitter):
"""Splitter for finding the best random split."""
def node_split(self, impurity,
n_constant_features):
"""Find the best random split on node samples[start:end]
Returns -1 in case of failure to allocate memory (and raise MemoryError)
or 0 otherwise.
"""
# Draw random splits and pick the best
samples = self.samples
start = self.start
end = self.end
features = self.features
constant_features = self.constant_features
n_features = self.n_features
Xf = self.feature_values
max_features = self.max_features
min_samples_leaf = self.min_samples_leaf
min_weight_leaf = self.min_weight_leaf
random_state = self.rand_r_state
best = SplitRecord()
current = SplitRecord()
current_proxy_improvement = - INFINITY
best_proxy_improvement = - INFINITY
f_i = n_features
# Number of features discovered to be constant during the split search
n_found_constants = 0
# Number of features known to be constant and drawn without replacement
n_drawn_constants = 0
n_known_constants = n_constant_features
# n_total_constants = n_known_constants + n_found_constants
n_total_constants = n_known_constants
n_visited_features = 0
_init_split(best, end)
# Sample up to max_features without replacement using a
# Fisher-Yates-based algorithm (using the local variables `f_i` and
# `f_j` to compute a permutation of the `features` array).
#
# Skip the CPU intensive evaluation of the impurity criterion for
# features that were already detected as constant (hence not suitable
# for good splitting) by ancestor nodes and save the information on
# newly discovered constant features to spare computation on descendant
# nodes.
while (f_i > n_total_constants and # Stop early if remaining features
# are constant
(n_visited_features < max_features or
# At least one drawn features must be non constant
n_visited_features <= n_found_constants + n_drawn_constants)):
n_visited_features += 1
# Loop invariant: elements of features in
# - [:n_drawn_constant[ holds drawn and known constant features;
# - [n_drawn_constant:n_known_constant[ holds known constant
# features that haven't been drawn yet;
# - [n_known_constant:n_total_constant[ holds newly found constant
# features;
# - [n_total_constant:f_i[ holds features that haven't been drawn
# yet and aren't constant apriori.
# - [f_i:n_features[ holds features that have been drawn
# and aren't constant.
# Draw a feature at random
f_j = rand_int(n_drawn_constants, f_i - n_found_constants,
random_state)
if f_j < n_known_constants:
# f_j in the interval [n_drawn_constants, n_known_constants[
features[n_drawn_constants], features[f_j] = features[f_j], features[n_drawn_constants]
n_drawn_constants += 1
else:
# f_j in the interval [n_known_constants, f_i - n_found_constants[
f_j += n_found_constants
# f_j in the interval [n_total_constants, f_i[
current.feature = features[f_j]
# Find min, max
min_feature_value = self.X[samples[start], current.feature]
max_feature_value = min_feature_value
Xf[start] = min_feature_value
for p in range(start + 1, end):
current_feature_value = self.X[samples[p], current.feature]
Xf[p] = current_feature_value
if current_feature_value < min_feature_value:
min_feature_value = current_feature_value
elif current_feature_value > max_feature_value:
max_feature_value = current_feature_value
if max_feature_value <= min_feature_value + FEATURE_THRESHOLD:
features[f_j], features[n_total_constants] = features[n_total_constants], current.feature
n_found_constants += 1
n_total_constants += 1
else:
f_i -= 1
features[f_i], features[f_j] = features[f_j], features[f_i]
# Draw a random threshold
current.threshold = rand_uniform(min_feature_value,
max_feature_value,
random_state)
if current.threshold == max_feature_value:
current.threshold = min_feature_value
# Partition
p, partition_end = start, end
while p < partition_end:
if Xf[p] <= current.threshold:
p += 1
else:
partition_end -= 1
Xf[p], Xf[partition_end] = Xf[partition_end], Xf[p]
samples[p], samples[partition_end] = samples[partition_end], samples[p]
current.pos = partition_end
# Reject if min_samples_leaf is not guaranteed
if (((current.pos - start) < min_samples_leaf) or
((end - current.pos) < min_samples_leaf)):
continue
# Evaluate split
self.criterion.reset()
self.criterion.update(current.pos)
# Reject if min_weight_leaf is not satisfied
if ((self.criterion.weighted_n_left < min_weight_leaf) or
(self.criterion.weighted_n_right < min_weight_leaf)):
continue
current_proxy_improvement = self.criterion.proxy_impurity_improvement()
if current_proxy_improvement > best_proxy_improvement:
best_proxy_improvement = current_proxy_improvement
best = current.copy() # copy
# Reorganize into samples[start:best.pos] + samples[best.pos:end]
if best.pos < end:
if current.feature != best.feature:
p, partition_end = start, end
while p < partition_end:
if self.X[samples[p], best.feature] <= best.threshold:
p += 1
else:
partition_end -= 1
samples[p], samples[partition_end] = samples[partition_end], samples[p]
self.criterion.reset()
self.criterion.update(best.pos)
best.improvement = self.criterion.impurity_improvement(impurity)
(best.impurity_left, best.impurity_right) = self.criterion.children_impurity()
# Respect invariant for constant features: the original order of
# element in features[:n_known_constants] must be preserved for sibling
# and child nodes
for i in range(n_known_constants):
features[i] = constant_features[i]
# Copy newly found constant features
for i in range(n_found_constants):
            constant_features[n_known_constants+i] = features[n_known_constants+i]
# Return values
split = best
n_constant_features = n_total_constants
return (0, split, n_constant_features)
注意此处保留了原始的排序算法(sort、introsort 等函数),但实际排序改用了 Python 自带的 sort 来完成。两者对相等数值的排序先后顺序不一定一致,但至少在后面的测试中预测正确率是一样的。
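下面用一小段独立代码演示这种"(特征值, 样本编号) 成对排序"的做法,以及 Python 内置排序的稳定性(数值仅作演示):

Xf = [0.3, 0.1, 0.3, 0.2]
samples = [10, 11, 12, 13]

# 将特征值与对应的样本编号绑在一起排序,再拆回两个数组
pairs = sorted(zip(Xf, samples), key=lambda item: item[0])
Xf = [v for v, _ in pairs]
samples = [s for _, s in pairs]
print(Xf)       # [0.1, 0.2, 0.3, 0.3]
print(samples)  # [11, 13, 10, 12]:两个 0.3 保持原有相对顺序(稳定排序)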
Tree.py
import numpy as np
from .Utils import Node, Stack, PriorityHeap, PriorityHeapRecord, SIZE_MAX
# =============================================================================
# Types and constants
# =============================================================================
INFINITY = np.inf
EPSILON = np.finfo('double').eps
# Some handy constants (BestFirstTreeBuilder)
IS_FIRST = 1
IS_NOT_FIRST = 0
IS_LEFT = 1
IS_NOT_LEFT = 0
TREE_LEAF = -1
_TREE_LEAF = TREE_LEAF
_TREE_UNDEFINED = -2
INITIAL_STACK_SIZE = 10
# Depth first builder ---------------------------------------------------------
class DepthFirstTreeBuilder:
"""Build a decision tree in depth-first fashion."""
def __init__(self, splitter, min_samples_split,
min_samples_leaf, min_weight_leaf,
max_depth, min_impurity_decrease,
min_impurity_split):
self.splitter = splitter
self.min_samples_split = min_samples_split
self.min_samples_leaf = min_samples_leaf
self.min_weight_leaf = min_weight_leaf
self.max_depth = max_depth
self.min_impurity_decrease = min_impurity_decrease
self.min_impurity_split = min_impurity_split
def build(self, tree, X, y,
sample_weight=None,
X_idx_sorted=None):
"""Build a decision tree from the training set (X, y)."""
if tree.max_depth <= 10:
init_capacity = (2 ** (tree.max_depth + 1)) - 1
else:
init_capacity = 2047
tree._resize(init_capacity)
# Parameters
splitter = self.splitter
max_depth = self.max_depth
min_samples_leaf = self.min_samples_leaf
min_weight_leaf = self.min_weight_leaf
min_samples_split = self.min_samples_split
min_impurity_decrease = self.min_impurity_decrease
min_impurity_split = self.min_impurity_split
# Recursive partition (without actual recursion)
splitter.init(X, y, sample_weight, X_idx_sorted)
n_node_samples = splitter.n_samples
impurity = INFINITY
first = 1
max_depth_seen = -1
rc = 0
stack = Stack(INITIAL_STACK_SIZE)
# push root node onto stack
rc = stack.push(0, n_node_samples, 0, _TREE_UNDEFINED, 0, INFINITY, 0)
if rc == -1:
# got return code -1 - out-of-memory
raise MemoryError()
while not stack.is_empty():
(_, stack_record) = stack.pop()
start = stack_record.start
end = stack_record.end
depth = stack_record.depth
parent = stack_record.parent
is_left = stack_record.is_left
impurity = stack_record.impurity
n_constant_features = stack_record.n_constant_features
n_node_samples = end - start
(_, weighted_n_node_samples) = splitter.node_reset(start, end)
is_leaf = (depth >= max_depth or
n_node_samples < min_samples_split or
n_node_samples < 2 * min_samples_leaf or
weighted_n_node_samples < 2 * min_weight_leaf)
if first:
impurity = splitter.node_impurity()
first = 0
is_leaf = (is_leaf or (impurity <= min_impurity_split))
if not is_leaf:
(_, split, n_constant_features) = splitter.node_split(impurity, n_constant_features)
# If EPSILON=0 in the below comparison, float precision
# issues stop splitting, producing trees that are
# dissimilar to v0.18
is_leaf = (is_leaf or split.pos >= end or
(split.improvement + EPSILON <
min_impurity_decrease))
node_id = tree._add_node(parent, is_left, is_leaf, split.feature,
split.threshold, impurity, n_node_samples,
weighted_n_node_samples)
if node_id == SIZE_MAX:
rc = -1
break
# Store value for all nodes, to facilitate tree/model
# inspection and interpretation
tree.value[node_id*tree.value_stride:(node_id+1)*tree.value_stride] = splitter.node_value()
if not is_leaf:
# Push right child on stack
rc = stack.push(split.pos, end, depth + 1, node_id, 0,
split.impurity_right, n_constant_features)
if rc == -1:
break
# Push left child on stack
rc = stack.push(start, split.pos, depth + 1, node_id, 1,
split.impurity_left, n_constant_features)
if rc == -1:
break
if depth > max_depth_seen:
max_depth_seen = depth
if rc >= 0:
rc = tree._resize(tree.node_count)
if rc >= 0:
tree.max_depth = max_depth_seen
if rc == -1:
raise MemoryError()
# Best first builder ----------------------------------------------------------
def _add_to_frontier(rec, frontier):
"""Adds record ``rec`` to the priority queue ``frontier``
Returns -1 in case of failure to allocate memory (and raise MemoryError)
or 0 otherwise.
"""
return frontier.push(rec.node_id, rec.start, rec.end, rec.pos, rec.depth,
rec.is_leaf, rec.improvement, rec.impurity,
rec.impurity_left, rec.impurity_right)
class BestFirstTreeBuilder:
"""Build a decision tree in best-first fashion.
The best node to expand is given by the node at the frontier that has the
highest impurity improvement.
"""
def __init__(self, splitter, min_samples_split,
min_samples_leaf, min_weight_leaf,
max_depth, max_leaf_nodes,
min_impurity_decrease, min_impurity_split):
self.splitter = splitter
self.min_samples_split = min_samples_split
self.min_samples_leaf = min_samples_leaf
self.min_weight_leaf = min_weight_leaf
self.max_depth = max_depth
self.max_leaf_nodes = max_leaf_nodes
self.min_impurity_decrease = min_impurity_decrease
self.min_impurity_split = min_impurity_split
def build(self, tree, X, y,
sample_weight=None,
X_idx_sorted=None):
"""Build a decision tree from the training set (X, y)."""
# Parameters
splitter = self.splitter
max_leaf_nodes = self.max_leaf_nodes
# Recursive partition (without actual recursion)
splitter.init(X, y, sample_weight, X_idx_sorted)
frontier = PriorityHeap(INITIAL_STACK_SIZE)
split_node_left = PriorityHeapRecord()
split_node_right = PriorityHeapRecord()
n_node_samples = splitter.n_samples
max_split_nodes = max_leaf_nodes - 1
max_depth_seen = -1
rc = 0
# Initial capacity
init_capacity = max_split_nodes + max_leaf_nodes
tree._resize(init_capacity)
# add root to frontier
rc = self._add_split_node(splitter, tree, 0, n_node_samples,
INFINITY, IS_FIRST, IS_LEFT, None, 0,
split_node_left)
if rc >= 0:
rc = _add_to_frontier(split_node_left, frontier)
if rc == -1:
raise MemoryError()
while not frontier.is_empty():
(_, record) = frontier.pop()
node = tree.nodes[record.node_id]
is_leaf = (record.is_leaf or max_split_nodes <= 0)
if is_leaf:
# Node is not expandable; set node as leaf
node.left_child = _TREE_LEAF
node.right_child = _TREE_LEAF
node.feature = _TREE_UNDEFINED
node.threshold = _TREE_UNDEFINED
else:
# Node is expandable
# Decrement number of split nodes available
max_split_nodes -= 1
# Compute left split node
rc = self._add_split_node(splitter, tree,
record.start, record.pos,
record.impurity_left,
IS_NOT_FIRST, IS_LEFT, node,
record.depth + 1,
split_node_left)
if rc == -1:
break
# tree.nodes may have changed
node = tree.nodes[record.node_id]
# Compute right split node
rc = self._add_split_node(splitter, tree, record.pos,
record.end,
record.impurity_right,
IS_NOT_FIRST, IS_NOT_LEFT, node,
record.depth + 1,
split_node_right)
if rc == -1:
break
# Add nodes to queue
rc = _add_to_frontier(split_node_left, frontier)
if rc == -1:
break
rc = _add_to_frontier(split_node_right, frontier)
if rc == -1:
break
if record.depth > max_depth_seen:
max_depth_seen = record.depth
if rc >= 0:
rc = tree._resize(tree.node_count)
if rc >= 0:
tree.max_depth = max_depth_seen
if rc == -1:
raise MemoryError()
def _add_split_node(self, splitter, tree,
start, end, impurity,
is_first, is_left, parent,
depth,
res):
"""Adds node w/ partition ``[start, end)`` to the frontier. """
n_constant_features = 0
min_impurity_decrease = self.min_impurity_decrease
min_impurity_split = self.min_impurity_split
        (_, weighted_n_node_samples) = splitter.node_reset(start, end)
if is_first:
impurity = splitter.node_impurity()
n_node_samples = end - start
is_leaf = (depth >= self.max_depth or
n_node_samples < self.min_samples_split or
n_node_samples < 2 * self.min_samples_leaf or
weighted_n_node_samples < 2 * self.min_weight_leaf or
impurity <= min_impurity_split)
if not is_leaf:
(_, split, n_constant_features) = splitter.node_split(impurity, n_constant_features)
# If EPSILON=0 in the below comparison, float precision issues stop
# splitting early, producing trees that are dissimilar to v0.18
is_leaf = (is_leaf or split.pos >= end or
split.improvement + EPSILON < min_impurity_decrease)
# TODO: XXX
        node_id = tree._add_node(tree.nodes.index(parent)
                                 if parent is not None
                                 else _TREE_UNDEFINED,
is_left, is_leaf,
split.feature, split.threshold, impurity, n_node_samples,
weighted_n_node_samples)
if node_id == SIZE_MAX:
return -1
# compute values also for split nodes (might become leafs later).
tree.value[node_id*tree.value_stride:(node_id+1)*tree.value_stride] = splitter.node_value()
res.node_id = node_id
res.start = start
res.end = end
res.depth = depth
res.impurity = impurity
if not is_leaf:
# is split node
res.pos = split.pos
res.is_leaf = 0
res.improvement = split.improvement
res.impurity_left = split.impurity_left
res.impurity_right = split.impurity_right
else:
# is leaf => 0 improvement
res.pos = end
res.is_leaf = 1
res.improvement = 0.0
res.impurity_left = impurity
res.impurity_right = impurity
return 0
# =============================================================================
# Tree
# =============================================================================
class Tree:
"""Array-based representation of a binary decision tree.
The binary tree is represented as a number of parallel arrays. The i-th
element of each array holds information about the node `i`. Node 0 is the
tree's root. You can find a detailed description of all arrays in
`_tree.pxd`. NOTE: Some of the arrays only apply to either leaves or split
nodes, resp. In this case the values of nodes of the other type are
arbitrary!
Attributes
----------
node_count : int
The number of nodes (internal nodes + leaves) in the tree.
capacity : int
The current capacity (i.e., size) of the arrays, which is at least as
great as `node_count`.
max_depth : int
The depth of the tree, i.e. the maximum depth of its leaves.
children_left : array of int, shape [node_count]
children_left[i] holds the node id of the left child of node i.
For leaves, children_left[i] == TREE_LEAF. Otherwise,
children_left[i] > i. This child handles the case where
X[:, feature[i]] <= threshold[i].
children_right : array of int, shape [node_count]
children_right[i] holds the node id of the right child of node i.
For leaves, children_right[i] == TREE_LEAF. Otherwise,
children_right[i] > i. This child handles the case where
X[:, feature[i]] > threshold[i].
feature : array of int, shape [node_count]
feature[i] holds the feature to split on, for the internal node i.
threshold : array of double, shape [node_count]
threshold[i] holds the threshold for the internal node i.
value : array of double, shape [node_count, n_outputs, max_n_classes]
Contains the constant prediction value of each node.
impurity : array of double, shape [node_count]
impurity[i] holds the impurity (i.e., the value of the splitting
criterion) at node i.
n_node_samples : array of int, shape [node_count]
n_node_samples[i] holds the number of training samples reaching node i.
weighted_n_node_samples : array of int, shape [node_count]
weighted_n_node_samples[i] holds the weighted number of training samples
reaching node i.
"""
# Wrap for outside world.
# WARNING: these reference the current `nodes` and `value` buffers, which
# must not be freed by a subsequent memory allocation.
# (i.e. through `_resize` or `__setstate__`)
def __init__(self, n_features, n_classes,
n_outputs):
"""Constructor."""
# Input/Output layout
self.n_features = n_features
self.n_outputs = n_outputs
self.n_classes = []
for idx in range(n_outputs):
self.n_classes.append(n_classes[idx])
self.max_n_classes = np.max(n_classes)
self.value_stride = n_outputs * self.max_n_classes
# Inner structures
self.max_depth = 0
self.node_count = 0
self.capacity = 0
self.value = []
self.nodes = []
def _resize(self, capacity=SIZE_MAX):
"""Guts of _resize
Returns -1 in case of failure to allocate memory (and raise MemoryError)
or 0 otherwise.
"""
        if capacity == self.capacity and self.nodes is not None:
return 0
if capacity == SIZE_MAX:
if self.capacity == 0:
capacity = 3 # default initial value
else:
capacity = 2 * self.capacity
for idx in range(capacity-self.capacity):
self.nodes.append(Node())
# value memory is initialised to 0 to enable classifier argmax
if capacity > self.capacity:
for idx in range(self.capacity*self.value_stride, capacity * self.value_stride):
self.value.append(0)
# if capacity smaller than node_count, adjust the counter
if capacity < self.node_count:
self.node_count = capacity
self.capacity = capacity
return 0
def _add_node(self, parent, is_left, is_leaf,
feature, threshold, impurity,
n_node_samples,
weighted_n_node_samples):
"""Add a node to the tree.
The new node registers itself as the child of its parent.
Returns (size_t)(-1) on error.
"""
node_id = self.node_count
if node_id >= self.capacity:
if self._resize() != 0:
return SIZE_MAX
node = self.nodes[node_id]
node.impurity = impurity
node.n_node_samples = n_node_samples
node.weighted_n_node_samples = weighted_n_node_samples
if parent != _TREE_UNDEFINED:
if is_left:
self.nodes[parent].left_child = node_id
else:
self.nodes[parent].right_child = node_id
if is_leaf:
node.left_child = _TREE_LEAF
node.right_child = _TREE_LEAF
node.feature = _TREE_UNDEFINED
node.threshold = _TREE_UNDEFINED
else:
# left_child and right_child will be set later
node.feature = feature
node.threshold = threshold
self.node_count += 1
return node_id
def predict(self, X):
"""Predict target for X."""
out = self._get_value_ndarray().take(self.apply(X), axis=0,
mode='clip')
if self.n_outputs == 1:
out = out.reshape(X.shape[0], self.max_n_classes)
return out
def apply(self, X):
"""Finds the terminal region (=leaf node) for each sample in X."""
return self._apply_dense(X)
def _apply_dense(self, X):
"""Finds the terminal region (=leaf node) for each sample in X."""
# Check input
if not isinstance(X, np.ndarray):
raise ValueError("X should be in np.ndarray format, got %s"
% type(X))
# Extract input
X_ndarray = X
n_samples = X.shape[0]
# Initialize output
out = np.zeros((n_samples,), dtype=np.intp)
out_ptr = 0
for i in range(n_samples):
node = self.nodes[0]
# While node not a leaf
while node.left_child != _TREE_LEAF:
# ... and node.right_child != _TREE_LEAF:
if X_ndarray[i, node.feature] <= node.threshold:
out_ptr = node.left_child
node = self.nodes[node.left_child]
else:
out_ptr = node.right_child
node = self.nodes[node.right_child]
out[i] = out_ptr # node offset
return out
def compute_feature_importances(self, normalize=True):
"""Computes the importance of each feature (aka variable)."""
nodes = self.nodes
node = 0
        end_node = self.node_count  # only iterate over nodes actually added to the tree
normalizer = 0.
importances = np.zeros((self.n_features,))
while node != end_node:
if nodes[node].left_child != _TREE_LEAF:
# ... and node.right_child != _TREE_LEAF:
left = nodes[nodes[node].left_child]
right = nodes[nodes[node].right_child]
importances[nodes[node].feature] += (
nodes[node].weighted_n_node_samples * nodes[node].impurity -
left.weighted_n_node_samples * left.impurity -
right.weighted_n_node_samples * right.impurity)
node += 1
importances /= nodes[0].weighted_n_node_samples
if normalize:
normalizer = np.sum(importances)
if normalizer > 0.0:
# Avoid dividing by zero (e.g., when root is pure)
importances /= normalizer
return importances
def _get_value_ndarray(self):
"""Wraps value as a 3-d NumPy array.
The array keeps a reference to this Tree, which manages the underlying
memory.
"""
n_value = len(self.value)
tmp = np.zeros((self.node_count, self.n_outputs, self.max_n_classes))
if (n_value < self.node_count*self.n_outputs*self.max_n_classes):
for idx in range(n_value):
i = idx//(self.n_outputs*self.max_n_classes)
j = (idx - i*(self.n_outputs*self.max_n_classes))//self.max_n_classes
k = idx - i*(self.n_outputs*self.max_n_classes) - j*self.max_n_classes
tmp[i][j][k] = self.value[idx]
else:
for i in range(self.node_count):
for j in range(self.n_outputs):
for k in range(self.max_n_classes):
tmp[i][j][k] = self.value[(i*self.n_outputs+j)*self.max_n_classes+k]
return tmp
def _get_node_ndarray(self):
"""Wraps nodes as a NumPy struct array.
The array keeps a reference to this Tree, which manages the underlying
memory. Individual fields are publicly accessible as properties of the
Tree.
"""
return np.array(self.nodes)
def compute_partial_dependence(self, X,
target_features,
out):
"""Partial dependence of the response on the ``target_feature`` set.
For each sample in ``X`` a tree traversal is performed.
Each traversal starts from the root with weight 1.0.
At each non-leaf node that splits on a target feature, either
the left child or the right child is visited based on the feature
value of the current sample, and the weight is not modified.
At each non-leaf node that splits on a complementary feature,
both children are visited and the weight is multiplied by the fraction
of training samples which went to each child.
At each leaf, the value of the node is multiplied by the current
weight (weights sum to 1 for all visited terminal nodes).
Parameters
----------
X : view on 2d ndarray, shape (n_samples, n_target_features)
The grid points on which the partial dependence should be
evaluated.
target_features : view on 1d ndarray, shape (n_target_features)
The set of target features for which the partial dependence
should be evaluated.
out : view on 1d ndarray, shape (n_samples)
The value of the partial dependence function on each grid
point.
"""
weight_stack = np.zeros(self.node_count,
dtype=np.float64)
node_idx_stack = np.zeros(self.node_count,
dtype=np.intp)
for sample_idx in range(X.shape[0]):
# init stacks for current sample
stack_size = 1
node_idx_stack[0] = 0 # root node
weight_stack[0] = 1 # all the samples are in the root node
total_weight = 0
while stack_size > 0:
# pop the stack
stack_size -= 1
current_node_idx = node_idx_stack[stack_size]
current_node = self.nodes[current_node_idx]
if current_node.left_child == _TREE_LEAF:
# leaf node
out[sample_idx] += (weight_stack[stack_size] *
self.value[current_node_idx])
total_weight += weight_stack[stack_size]
else: # non-leaf node
# determine if the split feature is a target feature
is_target_feature = False
for feature_idx in range(target_features.shape[0]):
if target_features[feature_idx] == current_node.feature:
is_target_feature = True
break
if is_target_feature:
# In this case, we push left or right child on stack
if X[sample_idx, feature_idx] <= current_node.threshold:
node_idx_stack[stack_size] = current_node.left_child
else:
node_idx_stack[stack_size] = current_node.right_child
stack_size += 1
else:
# In this case, we push both children onto the stack,
# and give a weight proportional to the number of
# samples going through each branch.
# push left child
node_idx_stack[stack_size] = current_node.left_child
left_sample_frac = (
self.nodes[current_node.left_child].weighted_n_node_samples /
current_node.weighted_n_node_samples)
current_weight = weight_stack[stack_size]
weight_stack[stack_size] = current_weight * left_sample_frac
stack_size += 1
# push right child
node_idx_stack[stack_size] = current_node.right_child
weight_stack[stack_size] = (
current_weight * (1 - left_sample_frac))
stack_size += 1
# Sanity check. Should never happen.
if not (0.999 < total_weight < 1.001):
raise ValueError("Total weight should be 1.0 but was %.9f" %
total_weight)
# =============================================================================
# Build Pruned Tree
# =============================================================================
class _CCPPruneController:
"""Base class used by build_pruned_tree_ccp and ccp_pruning_path
to control pruning.
"""
def stop_pruning(self, effective_alpha):
"""Return 1 to stop pruning and 0 to continue pruning"""
return 0
class _AlphaPruner(_CCPPruneController):
"""Use alpha to control when to stop pruning."""
def __init__(self, ccp_alpha):
self.ccp_alpha = ccp_alpha
self.capacity = 0
def stop_pruning(self, effective_alpha):
# The subtree on the previous iteration has the greatest ccp_alpha
# less than or equal to self.ccp_alpha
return self.ccp_alpha < effective_alpha
def after_pruning(self, in_subtree):
"""Updates the number of leaves in subtree"""
for i in range(in_subtree.shape[0]):
if in_subtree[i]:
self.capacity += 1
class _PathFinder(_CCPPruneController):
"""Record metrics used to return the cost complexity path."""
def __init__(self, node_count):
self.ccp_alphas = np.zeros(shape=(node_count), dtype=np.float64)
self.impurities = np.zeros(shape=(node_count), dtype=np.float64)
self.count = 0
def save_metrics(self, effective_alpha,
subtree_impurities):
self.ccp_alphas[self.count] = effective_alpha
self.impurities[self.count] = subtree_impurities
self.count += 1
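# Weakest-link pruning: for every internal node t the effective alpha is
#     alpha_eff(t) = (R(t) - R(T_t)) / (|leaves(T_t)| - 1)
# where R(t) is the weighted impurity of t treated as a leaf (r_node below)
# and R(T_t) is the sum of the weighted impurities of the leaves of the
# branch rooted at t (r_branch below). The branch with the smallest
# alpha_eff is collapsed first, and the loop repeats until the controller
# signals a stop.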
def _cost_complexity_prune(leaves_in_subtree, # OUT
orig_tree, controller):
"""Perform cost complexity pruning.
This function takes an already grown tree, `orig_tree`, and outputs a
boolean mask `leaves_in_subtree` marking the leaves of the pruned tree. The
controller signals when the pruning should stop and is passed the
metrics of the subtrees during the pruning process.
Parameters
----------
leaves_in_subtree : unsigned char[:]
Output for leaves of subtree
orig_tree : Tree
Original tree
controller : _CCPPruneController
Cost complexity controller
"""
n_nodes = orig_tree.node_count
# prior probability using weighted samples
weighted_n_node_samples = orig_tree.weighted_n_node_samples
total_sum_weights = weighted_n_node_samples[0]
impurity = orig_tree.impurity
# weighted impurity of each node
r_node = np.empty(shape=n_nodes, dtype=np.float64)
child_l = orig_tree.children_left
child_r = orig_tree.children_right
parent = np.zeros(shape=n_nodes, dtype=np.intp)
# Only uses the start and parent variables
stack = Stack(INITIAL_STACK_SIZE)
stack_record = None
rc = 0
n_leaves = np.zeros(shape=n_nodes, dtype=np.intp)
r_branch = np.zeros(shape=n_nodes, dtype=np.float64)
# candidate nodes that can be pruned
candidate_nodes = np.zeros(shape=n_nodes,
dtype=np.uint8)
# nodes in subtree
in_subtree = np.ones(shape=n_nodes, dtype=np.uint8)
max_float64 = np.finfo(np.float64).max
# find parent node ids and leaves
for i in range(r_node.shape[0]):
r_node[i] = (
weighted_n_node_samples[i] * impurity[i] / total_sum_weights)
# Push root node, using StackRecord.start as node id
rc = stack.push(0, 0, 0, -1, 0, 0, 0)
if rc == -1:
raise MemoryError("pruning tree")
while not stack.is_empty():
_, stack_record = stack.pop()
node_idx = stack_record.start
parent[node_idx] = stack_record.parent
if child_l[node_idx] == _TREE_LEAF:
# ... and child_r[node_idx] == _TREE_LEAF:
leaves_in_subtree[node_idx] = 1
else:
rc = stack.push(child_l[node_idx], 0, 0, node_idx, 0, 0, 0)
if rc == -1:
raise MemoryError("pruning tree")
rc = stack.push(child_r[node_idx], 0, 0, node_idx, 0, 0, 0)
if rc == -1:
raise MemoryError("pruning tree")
# computes number of leaves in all branches and the overall impurity of
# the branch. The overall impurity is the sum of r_node in its leaves.
for leaf_idx in range(leaves_in_subtree.shape[0]):
if not leaves_in_subtree[leaf_idx]:
continue
r_branch[leaf_idx] = r_node[leaf_idx]
# bubble up values to ancestor nodes
current_r = r_node[leaf_idx]
while leaf_idx != 0:
parent_idx = parent[leaf_idx]
r_branch[parent_idx] += current_r
n_leaves[parent_idx] += 1
leaf_idx = parent_idx
for i in range(leaves_in_subtree.shape[0]):
candidate_nodes[i] = not leaves_in_subtree[i]
# save metrics before pruning
controller.save_metrics(0.0, r_branch[0])
# while root node is not a leaf
while candidate_nodes[0]:
# computes ccp_alpha for subtrees and finds the minimal alpha
effective_alpha = max_float64
for i in range(n_nodes):
if not candidate_nodes[i]:
continue
subtree_alpha = (r_node[i] - r_branch[i]) / (n_leaves[i] - 1)
if subtree_alpha < effective_alpha:
effective_alpha = subtree_alpha
pruned_branch_node_idx = i
if controller.stop_pruning(effective_alpha):
break
# stack uses only the start variable
rc = stack.push(pruned_branch_node_idx, 0, 0, 0, 0, 0, 0)
if rc == -1:
raise MemoryError("pruning tree")
# descendants of branch are not in subtree
while not stack.is_empty():
_, stack_record = stack.pop()
node_idx = stack_record.start
if not in_subtree[node_idx]:
continue # branch has already been marked for pruning
candidate_nodes[node_idx] = 0
leaves_in_subtree[node_idx] = 0
in_subtree[node_idx] = 0
if child_l[node_idx] != _TREE_LEAF:
# ... and child_r[node_idx] != _TREE_LEAF:
rc = stack.push(child_l[node_idx], 0, 0, 0, 0, 0, 0)
if rc == -1:
raise MemoryError("pruning tree")
rc = stack.push(child_r[node_idx], 0, 0, 0, 0, 0, 0)
if rc == -1:
raise MemoryError("pruning tree")
leaves_in_subtree[pruned_branch_node_idx] = 1
in_subtree[pruned_branch_node_idx] = 1
# updates number of leaves
n_pruned_leaves = n_leaves[pruned_branch_node_idx] - 1
n_leaves[pruned_branch_node_idx] = 0
# computes the increase in r_branch to bubble up
r_diff = r_node[pruned_branch_node_idx] - r_branch[pruned_branch_node_idx]
r_branch[pruned_branch_node_idx] = r_node[pruned_branch_node_idx]
# bubble up values to ancestors
node_idx = parent[pruned_branch_node_idx]
while node_idx != -1:
n_leaves[node_idx] -= n_pruned_leaves
r_branch[node_idx] += r_diff
node_idx = parent[node_idx]
controller.save_metrics(effective_alpha, r_branch[0])
controller.after_pruning(in_subtree)
def _build_pruned_tree_ccp(
tree, # OUT
orig_tree,
ccp_alpha):
"""Build a pruned tree from the original tree using cost complexity
pruning.
The values and nodes from the original tree are copied into the pruned
tree.
Parameters
----------
tree : Tree
Location to place the pruned tree
orig_tree : Tree
Original tree
ccp_alpha : positive double
Complexity parameter. The subtree with the largest cost complexity
that is smaller than ``ccp_alpha`` will be chosen. By default,
no pruning is performed.
"""
n_nodes = orig_tree.node_count
leaves_in_subtree = np.zeros(
shape=n_nodes, dtype=np.uint8)
pruning_controller = _AlphaPruner(ccp_alpha=ccp_alpha)
_cost_complexity_prune(leaves_in_subtree, orig_tree, pruning_controller)
_build_pruned_tree(tree, orig_tree, leaves_in_subtree,
pruning_controller.capacity)
def ccp_pruning_path(orig_tree):
"""Computes the cost complexity pruning path.
Parameters
----------
orig_tree : Tree
Original tree.
Returns
-------
path_info : dict
Information about pruning path with attributes:
ccp_alphas : ndarray
Effective alphas of subtree during pruning.
impurities : ndarray
Sum of the impurities of the subtree leaves for the
corresponding alpha value in ``ccp_alphas``.
"""
leaves_in_subtree = np.zeros(
shape=orig_tree.node_count, dtype=np.uint8)
path_finder = _PathFinder(orig_tree.node_count)
_cost_complexity_prune(leaves_in_subtree, orig_tree, path_finder)
total_items = path_finder.count
ccp_alphas = np.empty(shape=total_items,
dtype=np.float64)
impurities = np.empty(shape=total_items,
dtype=np.float64)
count = 0
while count < total_items:
ccp_alphas[count] = path_finder.ccp_alphas[count]
impurities[count] = path_finder.impurities[count]
count += 1
return {'ccp_alphas': ccp_alphas, 'impurities': impurities}
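# Rebuilding the pruned tree is a depth-first copy of the original tree that
# stops descending at nodes flagged in leaves_in_subtree. The right child is
# pushed before the left child so that the left subtree is copied first,
# which gives the new nodes depth-first, left-to-right ids.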
def _build_pruned_tree(
tree, # OUT
orig_tree,
leaves_in_subtree,
capacity):
"""Build a pruned tree.
Build a pruned tree from the original tree by transforming the nodes in
``leaves_in_subtree`` into leaves.
Parameters
----------
tree : Tree
Location to place the pruned tree
orig_tree : Tree
Original tree
leaves_in_subtree : unsigned char memoryview, shape=(node_count, )
Boolean mask for leaves to include in subtree
capacity : SIZE_t
Number of nodes to initially allocate in pruned tree
"""
tree._resize(capacity)
# value_stride for original tree and new tree are the same
value_stride = orig_tree.value_stride
max_depth_seen = -1
rc = 0
# Only uses the start, depth, parent, and is_left variables
stack = Stack(INITIAL_STACK_SIZE)
stack_record = None
# push root node onto stack
rc = stack.push(0, 0, 0, _TREE_UNDEFINED, 0, 0.0, 0)
if rc == -1:
raise MemoryError("pruning tree")
while not stack.is_empty():
_, stack_record = stack.pop()
orig_node_id = stack_record.start
depth = stack_record.depth
parent = stack_record.parent
is_left = stack_record.is_left
is_leaf = leaves_in_subtree[orig_node_id]
node = orig_tree.nodes[orig_node_id]
new_node_id = tree._add_node(
parent, is_left, is_leaf, node.feature, node.threshold,
node.impurity, node.n_node_samples,
node.weighted_n_node_samples)
if new_node_id == SIZE_MAX:
rc = -1
break
# copy value from original tree to new tree
for idx in range(value_stride):
tree.value[value_stride*new_node_id+idx] = orig_tree.value[value_stride*orig_node_id+idx]
if not is_leaf:
# Push right child on stack
rc = stack.push(
node.right_child, 0, depth + 1, new_node_id, 0, 0.0, 0)
if rc == -1:
break
# push left child on stack
rc = stack.push(
node.left_child, 0, depth + 1, new_node_id, 1, 0.0, 0)
if rc == -1:
break
if depth > max_depth_seen:
max_depth_seen = depth
if rc >= 0:
tree.max_depth = max_depth_seen
if rc == -1:
raise MemoryError("pruning tree")
DecisionTree.py
import numbers
import warnings
from math import ceil
import copy
import numpy as np
from .Criterion import Criterion, Gini, Entropy
from .Splitter import Splitter, BestSplitter, RandomSplitter
from .Tree import DepthFirstTreeBuilder
from .Tree import BestFirstTreeBuilder
from .Tree import Tree
from .Tree import _build_pruned_tree_ccp
from .Tree import ccp_pruning_path
__all__ = ["DecisionTreeClassifier",
"ExtraTreeClassifier"]
def clone(estimator, *, safe=True):
"""Constructs a new estimator with the same parameters.
Clone does a deep copy of the model in an estimator
without actually copying attached data. It yields a new estimator
with the same parameters that has not been fit on any data.
Parameters
----------
estimator : {list, tuple, set} of estimator objects or estimator object
The estimator or group of estimators to be cloned.
safe : bool, default=True
If safe is false, clone will fall back to a deep copy on objects
that are not estimators.
"""
estimator_type = type(estimator)
# XXX: not handling dictionaries
if estimator_type in (list, tuple, set, frozenset):
return estimator_type([clone(e, safe=safe) for e in estimator])
elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
if not safe:
return copy.deepcopy(estimator)
else:
if isinstance(estimator, type):
raise TypeError("Cannot clone object. " +
"You should provide an instance of " +
"scikit-learn estimator instead of a class.")
else:
raise TypeError("Cannot clone object '%s' (type %s): "
"it does not seem to be a scikit-learn "
"estimator as it does not implement a "
"'get_params' method."
% (repr(estimator), type(estimator)))
klass = estimator.__class__
new_object_params = estimator.get_params(deep=False)
for name, param in new_object_params.items():
new_object_params[name] = clone(param, safe=False)
new_object = klass(**new_object_params)
params_set = new_object.get_params(deep=False)
# quick sanity check of the parameters of the clone
for name in new_object_params:
param1 = new_object_params[name]
param2 = params_set[name]
if param1 is not param2:
raise RuntimeError('Cannot clone object %s, as the constructor '
'either does not set or modifies parameter %s' %
(estimator, name))
return new_object
def check_random_state(seed):
"""Turn seed into a np.random.RandomState instance
Parameters
----------
seed : None | int | instance of RandomState
If seed is None, return the RandomState singleton used by np.random.
If seed is an int, return a new RandomState instance seeded with seed.
If seed is already a RandomState instance, return it.
Otherwise raise ValueError.
"""
if seed is None or seed is np.random:
return np.random.mtrand._rand
if isinstance(seed, numbers.Integral):
return np.random.RandomState(seed)
if isinstance(seed, np.random.RandomState):
return seed
raise ValueError('%r cannot be used to seed a numpy.random.RandomState'
' instance' % seed)
def _check_sample_weight(sample_weight, X, dtype=None):
"""Validate sample weights.
Note that passing sample_weight=None will output an array of ones.
Therefore, in some cases, you may want to protect the call with:
if sample_weight is not None:
sample_weight = _check_sample_weight(...)
Parameters
----------
sample_weight : {ndarray, Number or None}, shape (n_samples,)
Input sample weights.
X : nd-array, list or sparse matrix
Input data.
dtype : dtype, default=None
dtype of the validated `sample_weight`.
If None, and the input `sample_weight` is an array, the dtype of the
input is preserved; otherwise an array with the default numpy dtype
is allocated. If `dtype` is not one of `float32`, `float64`,
`None`, the output will be of dtype `float64`.
Returns
-------
sample_weight : ndarray, shape (n_samples,)
Validated sample weight. It is guaranteed to be "C" contiguous.
"""
n_samples = len(X)
if dtype is not None and dtype not in [np.float32, np.float64]:
dtype = np.float64
if sample_weight is None:
sample_weight = np.ones(n_samples, dtype=dtype)
elif isinstance(sample_weight, numbers.Number):
sample_weight = np.full(n_samples, sample_weight, dtype=dtype)
else:
if dtype is None:
dtype = [np.float64, np.float32]
if sample_weight.ndim != 1:
raise ValueError("Sample weights must be 1D array or scalar")
if sample_weight.shape != (n_samples,):
raise ValueError("sample_weight.shape == {}, expected {}!"
.format(sample_weight.shape, (n_samples,)))
return sample_weight
def compute_class_weight(class_weight, *, classes, y):
"""Estimate class weights for unbalanced datasets.
Parameters
----------
class_weight : dict, 'balanced' or None
If 'balanced', class weights will be given by
``n_samples / (n_classes * np.bincount(y))``.
If a dictionary is given, keys are classes and values
are corresponding class weights.
If None is given, the class weights will be uniform.
classes : ndarray
Array of the classes occurring in the data, as given by
``np.unique(y_org)`` with ``y_org`` the original class labels.
y : array-like, shape (n_samples,)
Array of original class labels per sample;
Returns
-------
class_weight_vect : ndarray, shape (n_classes,)
Array in which class_weight_vect[i] is the weight of the i-th class.
References
----------
The "balanced" heuristic is inspired by
Logistic Regression in Rare Events Data, King, Zeng, 2001.
"""
# In scikit-learn this import is kept local to avoid circular imports;
# in this standalone port, import LabelEncoder from scikit-learn directly.
from sklearn.preprocessing import LabelEncoder
if set(y) - set(classes):
raise ValueError("classes should include all valid labels that can "
"be in y")
if class_weight is None or len(class_weight) == 0:
# uniform class weights
weight = np.ones(classes.shape[0], dtype=np.float64, order='C')
elif class_weight == 'balanced':
# Find the weight of each class as present in y.
le = LabelEncoder()
y_ind = le.fit_transform(y)
if not all(np.in1d(classes, le.classes_)):
raise ValueError("classes should have valid labels that are in y")
recip_freq = len(y) / (len(le.classes_) *
np.bincount(y_ind).astype(np.float64))
weight = recip_freq[le.transform(classes)]
else:
# user-defined dictionary
weight = np.ones(classes.shape[0], dtype=np.float64, order='C')
if not isinstance(class_weight, dict):
raise ValueError("class_weight must be dict, 'balanced', or None,"
" got: %r" % class_weight)
for c in class_weight:
i = np.searchsorted(classes, c)
if i >= len(classes) or classes[i] != c:
raise ValueError("Class label {} not present.".format(c))
else:
weight[i] = class_weight[c]
return weight
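# Worked example of the "balanced" heuristic: with y = [0, 0, 0, 1] and
# classes = [0, 1], n_samples / (n_classes * np.bincount(y)) gives
# 4 / (2 * [3, 1]) = [0.67, 2.0], so the rare class 1 receives the larger
# weight.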
def compute_sample_weight(class_weight, y, *, indices=None):
"""Estimate sample weights by class for unbalanced datasets.
Parameters
----------
class_weight : dict, list of dicts, "balanced", or None, optional
Weights associated with classes in the form ``{class_label: weight}``.
If not given, all classes are supposed to have weight one. For
multi-output problems, a list of dicts can be provided in the same
order as the columns of y.
Note that for multioutput (including multilabel) weights should be
defined for each class of every column in its own dict. For example,
for four-class multilabel classification weights should be
[{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of
[{1:1}, {2:5}, {3:1}, {4:1}].
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data:
``n_samples / (n_classes * np.bincount(y))``.
For multi-output, the weights of each column of y will be multiplied.
y : array-like of shape (n_samples,) or (n_samples, n_outputs)
Array of original class labels per sample.
indices : array-like, shape (n_subsample,), or None
Array of indices to be used in a subsample. Can be of length less than
n_samples in the case of a subsample, or equal to n_samples in the
case of a bootstrap subsample with repeated indices. If None, the
sample weight will be calculated over the full sample. Only "balanced"
is supported for class_weight if this is provided.
Returns
-------
sample_weight_vect : ndarray, shape (n_samples,)
Array with sample weights as applied to the original y
"""
y = np.atleast_1d(y)
if y.ndim == 1:
y = np.reshape(y, (-1, 1))
n_outputs = y.shape[1]
if isinstance(class_weight, str):
if class_weight not in ['balanced']:
raise ValueError('The only valid preset for class_weight is '
'"balanced". Given "%s".' % class_weight)
elif (indices is not None and
not isinstance(class_weight, str)):
raise ValueError('The only valid class_weight for subsampling is '
'"balanced". Given "%s".' % class_weight)
elif n_outputs > 1:
if (not hasattr(class_weight, "__iter__") or
isinstance(class_weight, dict)):
raise ValueError("For multi-output, class_weight should be a "
"list of dicts, or a valid string.")
if len(class_weight) != n_outputs:
raise ValueError("For multi-output, number of elements in "
"class_weight should match number of outputs.")
expanded_class_weight = []
for k in range(n_outputs):
y_full = y[:, k]
classes_full = np.unique(y_full)
classes_missing = None
if class_weight == 'balanced' or n_outputs == 1:
class_weight_k = class_weight
else:
class_weight_k = class_weight[k]
if indices is not None:
# Get class weights for the subsample, covering all classes in
# case some labels that were present in the original data are
# missing from the sample.
y_subsample = y[indices, k]
classes_subsample = np.unique(y_subsample)
weight_k = np.take(compute_class_weight(class_weight_k,
classes=classes_subsample,
y=y_subsample),
np.searchsorted(classes_subsample,
classes_full),
mode='clip')
classes_missing = set(classes_full) - set(classes_subsample)
else:
weight_k = compute_class_weight(class_weight_k,
classes=classes_full,
y=y_full)
weight_k = weight_k[np.searchsorted(classes_full, y_full)]
if classes_missing:
# Make missing classes' weight zero
weight_k[np.in1d(y_full, list(classes_missing))] = 0.
expanded_class_weight.append(weight_k)
expanded_class_weight = np.prod(expanded_class_weight,
axis=0,
dtype=np.float64)
return expanded_class_weight
class Bunch(dict):
"""Container object exposing keys as attributes
Bunch objects are sometimes used as an output for functions and methods.
They extend dictionaries by enabling values to be accessed by key,
`bunch["value_key"]`, or by an attribute, `bunch.value_key`.
Examples
--------
>>> b = Bunch(a=1, b=2)
>>> b['b']
2
>>> b.b
2
>>> b.a = 3
>>> b['a']
3
>>> b.c = 6
>>> b['c']
6
"""
def __init__(self, **kwargs):
super().__init__(kwargs)
def __setattr__(self, key, value):
self[key] = value
def __dir__(self):
return self.keys()
def __getattr__(self, key):
try:
return self[key]
except KeyError:
raise AttributeError(key)
def __setstate__(self, state):
# Bunch pickles generated with scikit-learn 0.16.* have a non-empty
# __dict__. This causes surprising behaviour when loading these
# pickles in scikit-learn 0.17: reading bunch.key uses __dict__ but
# assigning to bunch.key uses __setattr__ and
# only changes bunch['key']. More details can be found at:
# https://github.com/scikit-learn/scikit-learn/issues/6196.
# Overriding __setstate__ to be a noop has the effect of
# ignoring the pickled __dict__
pass
# =============================================================================
# Types and constants
# =============================================================================
#DTYPE = Tree.DTYPE
#DOUBLE = Tree.DOUBLE
from numpy import float64 as DOUBLE
CRITERIA_CLF = {"gini": Gini, "entropy": Entropy}
DENSE_SPLITTERS = {"best": BestSplitter,
"random": RandomSplitter}
# =============================================================================
# Base decision tree
# =============================================================================
class BaseDecisionTree:
"""Base class for decision trees.
Warning: This class should not be used directly.
Use derived classes instead.
"""
def __init__(self, *,
criterion,
splitter,
max_depth,
min_samples_split,
min_samples_leaf,
min_weight_fraction_leaf,
max_features,
max_leaf_nodes,
random_state,
min_impurity_decrease,
min_impurity_split,
class_weight=None,
presort='deprecated',
ccp_alpha=0.0):
self.criterion = criterion
self.splitter = splitter
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.min_samples_leaf = min_samples_leaf
self.min_weight_fraction_leaf = min_weight_fraction_leaf
self.max_features = max_features
self.max_leaf_nodes = max_leaf_nodes
self.random_state = random_state
self.min_impurity_decrease = min_impurity_decrease
self.min_impurity_split = min_impurity_split
self.class_weight = class_weight
self.presort = presort
self.ccp_alpha = ccp_alpha
def get_depth(self):
"""Return the depth of the decision tree.
The depth of a tree is the maximum distance between the root
and any leaf.
Returns
-------
self.tree_.max_depth : int
The maximum depth of the tree.
"""
return self.tree_.max_depth
def get_n_leaves(self):
"""Return the number of leaves of the decision tree.
Returns
-------
self.tree_.n_leaves : int
Number of leaves.
"""
return self.tree_.n_leaves
def fit(self, X, y, sample_weight=None, check_input=True,
X_idx_sorted=None):
random_state = check_random_state(self.random_state)
if self.ccp_alpha < 0.0:
raise ValueError("ccp_alpha must be greater than or equal to 0")
# Determine output settings
n_samples, self.n_features_ = X.shape
is_classification = True
y = np.atleast_1d(y)
expanded_class_weight = None
if y.ndim == 1:
# reshape is necessary to preserve data contiguity, which
# [:, np.newaxis] would not.
y = np.reshape(y, (-1, 1))
self.n_outputs_ = y.shape[1]
if is_classification:
y = np.copy(y)
self.classes_ = []
self.n_classes_ = []
if self.class_weight is not None:
y_original = np.copy(y)
y_encoded = np.zeros(y.shape, dtype=int)  # np.int was removed in NumPy 1.24
for k in range(self.n_outputs_):
classes_k, y_encoded[:, k] = np.unique(y[:, k],
return_inverse=True)
self.classes_.append(classes_k)
self.n_classes_.append(classes_k.shape[0])
y = y_encoded
if self.class_weight is not None:
expanded_class_weight = compute_sample_weight(
self.class_weight, y_original)
self.n_classes_ = np.array(self.n_classes_, dtype=np.intp)
if getattr(y, "dtype", None) != DOUBLE or not y.flags.contiguous:
y = np.ascontiguousarray(y, dtype=DOUBLE)
# Check parameters
max_depth = (np.iinfo(np.int32).max if self.max_depth is None
else self.max_depth)
max_leaf_nodes = (-1 if self.max_leaf_nodes is None
else self.max_leaf_nodes)
if isinstance(self.min_samples_leaf, numbers.Integral):
if not 1 <= self.min_samples_leaf:
raise ValueError("min_samples_leaf must be at least 1 "
"or in (0, 0.5], got %s"
% self.min_samples_leaf)
min_samples_leaf = self.min_samples_leaf
else: # float
if not 0. < self.min_samples_leaf <= 0.5:
raise ValueError("min_samples_leaf must be at least 1 "
"or in (0, 0.5], got %s"
% self.min_samples_leaf)
min_samples_leaf = int(ceil(self.min_samples_leaf * n_samples))
if isinstance(self.min_samples_split, numbers.Integral):
if not 2 <= self.min_samples_split:
raise ValueError("min_samples_split must be an integer "
"greater than 1 or a float in (0.0, 1.0]; "
"got the integer %s"
% self.min_samples_split)
min_samples_split = self.min_samples_split
else: # float
if not 0. < self.min_samples_split <= 1.:
raise ValueError("min_samples_split must be an integer "
"greater than 1 or a float in (0.0, 1.0]; "
"got the float %s"
% self.min_samples_split)
min_samples_split = int(ceil(self.min_samples_split * n_samples))
min_samples_split = max(2, min_samples_split)
min_samples_split = max(min_samples_split, 2 * min_samples_leaf)
if isinstance(self.max_features, str):
if self.max_features == "auto":
if is_classification:
max_features = max(1, int(np.sqrt(self.n_features_)))
else:
max_features = self.n_features_
elif self.max_features == "sqrt":
max_features = max(1, int(np.sqrt(self.n_features_)))
elif self.max_features == "log2":
max_features = max(1, int(np.log2(self.n_features_)))
else:
raise ValueError("Invalid value for max_features. "
"Allowed string values are 'auto', "
"'sqrt' or 'log2'.")
elif self.max_features is None:
max_features = self.n_features_
elif isinstance(self.max_features, numbers.Integral):
max_features = self.max_features
else: # float
if self.max_features > 0.0:
max_features = max(1,
int(self.max_features * self.n_features_))
else:
max_features = 0
self.max_features_ = max_features
if len(y) != n_samples:
raise ValueError("Number of labels=%d does not match "
"number of samples=%d" % (len(y), n_samples))
if not 0 <= self.min_weight_fraction_leaf <= 0.5:
raise ValueError("min_weight_fraction_leaf must in [0, 0.5]")
if max_depth <= 0:
raise ValueError("max_depth must be greater than zero. ")
if not (0 < max_features <= self.n_features_):
raise ValueError("max_features must be in (0, n_features]")
if not isinstance(max_leaf_nodes, numbers.Integral):
raise ValueError("max_leaf_nodes must be integral number but was "
"%r" % max_leaf_nodes)
if -1 < max_leaf_nodes < 2:
raise ValueError(("max_leaf_nodes {0} must be either None "
"or larger than 1").format(max_leaf_nodes))
if sample_weight is not None:
sample_weight = _check_sample_weight(sample_weight, X, DOUBLE)
if expanded_class_weight is not None:
if sample_weight is not None:
sample_weight = sample_weight * expanded_class_weight
else:
sample_weight = expanded_class_weight
# Set min_weight_leaf from min_weight_fraction_leaf
if sample_weight is None:
min_weight_leaf = (self.min_weight_fraction_leaf *
n_samples)
else:
min_weight_leaf = (self.min_weight_fraction_leaf *
np.sum(sample_weight))
min_impurity_split = self.min_impurity_split
if min_impurity_split is not None:
warnings.warn("The min_impurity_split parameter is deprecated. "
"Its default value has changed from 1e-7 to 0 in "
"version 0.23, and it will be removed in 0.25. "
"Use the min_impurity_decrease parameter instead.",
FutureWarning)
if min_impurity_split < 0.:
raise ValueError("min_impurity_split must be greater than "
"or equal to 0")
else:
min_impurity_split = 0
if self.min_impurity_decrease < 0.:
raise ValueError("min_impurity_decrease must be greater than "
"or equal to 0")
if self.presort != 'deprecated':
warnings.warn("The parameter 'presort' is deprecated and has no "
"effect. It will be removed in v0.24. You can "
"suppress this warning by not passing any value "
"to the 'presort' parameter.",
FutureWarning)
# Build tree
criterion = self.criterion
if not isinstance(criterion, Criterion):
if is_classification:
criterion = CRITERIA_CLF[self.criterion](self.n_outputs_,
self.n_classes_)
SPLITTERS = DENSE_SPLITTERS
splitter = self.splitter
if not isinstance(self.splitter, Splitter):
splitter = SPLITTERS[self.splitter](criterion,
self.max_features_,
min_samples_leaf,
min_weight_leaf,
random_state)
self.tree_ = Tree(self.n_features_,
self.n_classes_, self.n_outputs_)
# Use BestFirst if max_leaf_nodes given; use DepthFirst otherwise
if max_leaf_nodes < 0:
builder = DepthFirstTreeBuilder(splitter, min_samples_split,
min_samples_leaf,
min_weight_leaf,
max_depth,
self.min_impurity_decrease,
min_impurity_split)
else:
builder = BestFirstTreeBuilder(splitter, min_samples_split,
min_samples_leaf,
min_weight_leaf,
max_depth,
max_leaf_nodes,
self.min_impurity_decrease,
min_impurity_split)
builder.build(self.tree_, X, y, sample_weight, X_idx_sorted)
if self.n_outputs_ == 1:
self.n_classes_ = self.n_classes_[0]
self.classes_ = self.classes_[0]
self._prune_tree()
return self
def _validate_X_predict(self, X, check_input):
"""Validate X whenever one tries to predict, apply, predict_proba"""
n_features = X.shape[1]
if self.n_features_ != n_features:
raise ValueError("Number of features of the model must "
"match the input. Model n_features is %s and "
"input n_features is %s "
% (self.n_features_, n_features))
return X
def predict(self, X, check_input=True):
"""Predict class or regression value for X.
For a classification model, the predicted class for each sample in X is
returned. For a regression model, the predicted value based on X is
returned.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples. Internally, it will be converted to
``dtype=np.float32`` and if a sparse matrix is provided
to a sparse ``csr_matrix``.
check_input : bool, default=True
Allows bypassing several input checks.
Don't use this parameter unless you know what you are doing.
Returns
-------
y : array-like of shape (n_samples,) or (n_samples, n_outputs)
The predicted classes, or the predict values.
"""
X = self._validate_X_predict(X, check_input)
proba = self.tree_.predict(X)
n_samples = X.shape[0]
# Classification
if self.n_outputs_ == 1:
return self.classes_.take(np.argmax(proba, axis=1), axis=0)
else:
class_type = self.classes_[0].dtype
predictions = np.zeros((n_samples, self.n_outputs_),
dtype=class_type)
for k in range(self.n_outputs_):
predictions[:, k] = self.classes_[k].take(
np.argmax(proba[:, k], axis=1),
axis=0)
return predictions
def apply(self, X, check_input=True):
"""Return the index of the leaf that each sample is predicted as.
.. versionadded:: 0.17
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples. Internally, it will be converted to
``dtype=np.float32`` and if a sparse matrix is provided
to a sparse ``csr_matrix``.
check_input : bool, default=True
Allows bypassing several input checks.
Don't use this parameter unless you know what you are doing.
Returns
-------
X_leaves : array-like of shape (n_samples,)
For each datapoint x in X, return the index of the leaf x
ends up in. Leaves are numbered within
``[0; self.tree_.node_count)``, possibly with gaps in the
numbering.
"""
X = self._validate_X_predict(X, check_input)
return self.tree_.apply(X)
def decision_path(self, X, check_input=True):
"""Return the decision path in the tree.
.. versionadded:: 0.18
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples. Internally, it will be converted to
``dtype=np.float32`` and if a sparse matrix is provided
to a sparse ``csr_matrix``.
check_input : bool, default=True
Allows bypassing several input checks.
Don't use this parameter unless you know what you are doing.
Returns
-------
indicator : sparse matrix of shape (n_samples, n_nodes)
Return a node indicator CSR matrix where non-zero elements
indicate that the sample passes through the corresponding nodes.
"""
X = self._validate_X_predict(X, check_input)
return self.tree_.decision_path(X)
def _prune_tree(self):
"""Prune tree using Minimal Cost-Complexity Pruning."""
if self.ccp_alpha < 0.0:
raise ValueError("ccp_alpha must be greater than or equal to 0")
if self.ccp_alpha == 0.0:
return
# build pruned tree
n_classes = np.atleast_1d(self.n_classes_)
pruned_tree = Tree(self.n_features_, n_classes, self.n_outputs_)
_build_pruned_tree_ccp(pruned_tree, self.tree_, self.ccp_alpha)
self.tree_ = pruned_tree
def cost_complexity_pruning_path(self, X, y, sample_weight=None):
"""Compute the pruning path during Minimal Cost-Complexity Pruning.
See :ref:`minimal_cost_complexity_pruning` for details on the pruning
process.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The training input samples. Internally, it will be converted to
``dtype=np.float32`` and if a sparse matrix is provided
to a sparse ``csc_matrix``.
y : array-like of shape (n_samples,) or (n_samples, n_outputs)
The target values (class labels) as integers or strings.
sample_weight : array-like of shape (n_samples,), default=None
Sample weights. If None, then samples are equally weighted. Splits
that would create child nodes with net zero or negative weight are
ignored while searching for a split in each node. Splits are also
ignored if they would result in any single class carrying a
negative weight in either child node.
Returns
-------
ccp_path : :class:`~sklearn.utils.Bunch`
Dictionary-like object, with the following attributes.
ccp_alphas : ndarray
Effective alphas of subtree during pruning.
impurities : ndarray
Sum of the impurities of the subtree leaves for the
corresponding alpha value in ``ccp_alphas``.
"""
est = clone(self).set_params(ccp_alpha=0.0)
est.fit(X, y, sample_weight=sample_weight)
return Bunch(**ccp_pruning_path(est.tree_))
@property
def feature_importances_(self):
"""Return the feature importances.
The importance of a feature is computed as the (normalized) total
reduction of the criterion brought by that feature.
It is also known as the Gini importance.
Warning: impurity-based feature importances can be misleading for
high cardinality features (many unique values). See
:func:`sklearn.inspection.permutation_importance` as an alternative.
Returns
-------
feature_importances_ : ndarray of shape (n_features,)
Normalized total reduction of criteria by feature
(Gini importance).
"""
return self.tree_.compute_feature_importances()
# =============================================================================
# Public estimators
# =============================================================================
class DecisionTreeClassifier(BaseDecisionTree):
"""A decision tree classifier.
Read more in the :ref:`User Guide <tree>`.
Parameters
----------
criterion : {"gini", "entropy"}, default="gini"
The function to measure the quality of a split. Supported criteria are
"gini" for the Gini impurity and "entropy" for the information gain.
splitter : {"best", "random"}, default="best"
The strategy used to choose the split at each node. Supported
strategies are "best" to choose the best split and "random" to choose
the best random split.
max_depth : int, default=None
The maximum depth of the tree. If None, then nodes are expanded until
all leaves are pure or until all leaves contain less than
min_samples_split samples.
min_samples_split : int or float, default=2
The minimum number of samples required to split an internal node:
- If int, then consider `min_samples_split` as the minimum number.
- If float, then `min_samples_split` is a fraction and
`ceil(min_samples_split * n_samples)` are the minimum
number of samples for each split.
.. versionchanged:: 0.18
Added float values for fractions.
min_samples_leaf : int or float, default=1
The minimum number of samples required to be at a leaf node.
A split point at any depth will only be considered if it leaves at
least ``min_samples_leaf`` training samples in each of the left and
right branches. This may have the effect of smoothing the model,
especially in regression.
- If int, then consider `min_samples_leaf` as the minimum number.
- If float, then `min_samples_leaf` is a fraction and
`ceil(min_samples_leaf * n_samples)` are the minimum
number of samples for each node.
.. versionchanged:: 0.18
Added float values for fractions.
min_weight_fraction_leaf : float, default=0.0
The minimum weighted fraction of the sum total of weights (of all
the input samples) required to be at a leaf node. Samples have
equal weight when sample_weight is not provided.
max_features : int, float or {"auto", "sqrt", "log2"}, default=None
The number of features to consider when looking for the best split:
- If int, then consider `max_features` features at each split.
- If float, then `max_features` is a fraction and
`int(max_features * n_features)` features are considered at each
split.
- If "auto", then `max_features=sqrt(n_features)`.
- If "sqrt", then `max_features=sqrt(n_features)`.
- If "log2", then `max_features=log2(n_features)`.
- If None, then `max_features=n_features`.
Note: the search for a split does not stop until at least one
valid partition of the node samples is found, even if it requires to
effectively inspect more than ``max_features`` features.
random_state : int, RandomState instance, default=None
Controls the randomness of the estimator. The features are always
randomly permuted at each split, even if ``splitter`` is set to
``"best"``. When ``max_features < n_features``, the algorithm will
select ``max_features`` at random at each split before finding the best
split among them. But the best found split may vary across different
runs, even if ``max_features=n_features``. That is the case, if the
improvement of the criterion is identical for several splits and one
split has to be selected at random. To obtain a deterministic behaviour
during fitting, ``random_state`` has to be fixed to an integer.
See :term:`Glossary <random_state>` for details.
max_leaf_nodes : int, default=None
Grow a tree with ``max_leaf_nodes`` in best-first fashion.
Best nodes are defined as relative reduction in impurity.
If None then unlimited number of leaf nodes.
min_impurity_decrease : float, default=0.0
A node will be split if this split induces a decrease of the impurity
greater than or equal to this value.
The weighted impurity decrease equation is the following::
N_t / N * (impurity - N_t_R / N_t * right_impurity
- N_t_L / N_t * left_impurity)
where ``N`` is the total number of samples, ``N_t`` is the number of
samples at the current node, ``N_t_L`` is the number of samples in the
left child, and ``N_t_R`` is the number of samples in the right child.
``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum,
if ``sample_weight`` is passed.
.. versionadded:: 0.19
min_impurity_split : float, default=0
Threshold for early stopping in tree growth. A node will split
if its impurity is above the threshold, otherwise it is a leaf.
.. deprecated:: 0.19
``min_impurity_split`` has been deprecated in favor of
``min_impurity_decrease`` in 0.19. The default value of
``min_impurity_split`` has changed from 1e-7 to 0 in 0.23 and it
will be removed in 0.25. Use ``min_impurity_decrease`` instead.
class_weight : dict, list of dict or "balanced", default=None
Weights associated with classes in the form ``{class_label: weight}``.
If None, all classes are supposed to have weight one. For
multi-output problems, a list of dicts can be provided in the same
order as the columns of y.
Note that for multioutput (including multilabel) weights should be
defined for each class of every column in its own dict. For example,
for four-class multilabel classification weights should be
[{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of
[{1:1}, {2:5}, {3:1}, {4:1}].
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data
as ``n_samples / (n_classes * np.bincount(y))``
For multi-output, the weights of each column of y will be multiplied.
Note that these weights will be multiplied with sample_weight (passed
through the fit method) if sample_weight is specified.
presort : deprecated, default='deprecated'
This parameter is deprecated and will be removed in v0.24.
.. deprecated:: 0.22
ccp_alpha : non-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The
subtree with the largest cost complexity that is smaller than
``ccp_alpha`` will be chosen. By default, no pruning is performed. See
:ref:`minimal_cost_complexity_pruning` for details.
.. versionadded:: 0.22
Attributes
----------
classes_ : ndarray of shape (n_classes,) or list of ndarray
The classes labels (single output problem),
or a list of arrays of class labels (multi-output problem).
feature_importances_ : ndarray of shape (n_features,)
The impurity-based feature importances.
The higher, the more important the feature.
The importance of a feature is computed as the (normalized)
total reduction of the criterion brought by that feature. It is also
known as the Gini importance [4]_.
Warning: impurity-based feature importances can be misleading for
high cardinality features (many unique values). See
:func:`sklearn.inspection.permutation_importance` as an alternative.
max_features_ : int
The inferred value of max_features.
n_classes_ : int or list of int
The number of classes (for single output problems),
or a list containing the number of classes for each
output (for multi-output problems).
n_features_ : int
The number of features when ``fit`` is performed.
n_outputs_ : int
The number of outputs when ``fit`` is performed.
tree_ : Tree
The underlying Tree object. Please refer to
``help(sklearn.tree._tree.Tree)`` for attributes of Tree object and
:ref:`sphx_glr_auto_examples_tree_plot_unveil_tree_structure.py`
for basic usage of these attributes.
See Also
--------
DecisionTreeRegressor : A decision tree regressor.
Notes
-----
The default values for the parameters controlling the size of the trees
(e.g. ``max_depth``, ``min_samples_leaf``, etc.) lead to fully grown and
unpruned trees which can potentially be very large on some data sets. To
reduce memory consumption, the complexity and size of the trees should be
controlled by setting those parameter values.
References
----------
.. [1] https://en.wikipedia.org/wiki/Decision_tree_learning
.. [2] L. Breiman, J. Friedman, R. Olshen, and C. Stone, "Classification
and Regression Trees", Wadsworth, Belmont, CA, 1984.
.. [3] T. Hastie, R. Tibshirani and J. Friedman. "Elements of Statistical
Learning", Springer, 2009.
.. [4] L. Breiman, and A. Cutler, "Random Forests",
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Examples
--------
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
... # doctest: +SKIP
...
array([ 1. , 0.93..., 0.86..., 0.93..., 0.93...,
0.93..., 0.93..., 1. , 0.93..., 1. ])
"""
def __init__(self, *,
criterion="gini",
splitter="best",
max_depth=None,
min_samples_split=2,
min_samples_leaf=1,
min_weight_fraction_leaf=0.,
max_features=None,
random_state=None,
max_leaf_nodes=None,
min_impurity_decrease=0.,
min_impurity_split=None,
class_weight=None,
presort='deprecated',
ccp_alpha=0.0):
super().__init__(
criterion=criterion,
splitter=splitter,
max_depth=max_depth,
min_samples_split=min_samples_split,
min_samples_leaf=min_samples_leaf,
min_weight_fraction_leaf=min_weight_fraction_leaf,
max_features=max_features,
max_leaf_nodes=max_leaf_nodes,
class_weight=class_weight,
random_state=random_state,
min_impurity_decrease=min_impurity_decrease,
min_impurity_split=min_impurity_split,
presort=presort,
ccp_alpha=ccp_alpha)
def fit(self, X, y, sample_weight=None, check_input=True,
X_idx_sorted=None):
"""Build a decision tree classifier from the training set (X, y).
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The training input samples. Internally, it will be converted to
``dtype=np.float32`` and if a sparse matrix is provided
to a sparse ``csc_matrix``.
y : array-like of shape (n_samples,) or (n_samples, n_outputs)
The target values (class labels) as integers or strings.
sample_weight : array-like of shape (n_samples,), default=None
Sample weights. If None, then samples are equally weighted. Splits
that would create child nodes with net zero or negative weight are
ignored while searching for a split in each node. Splits are also
ignored if they would result in any single class carrying a
negative weight in either child node.
check_input : bool, default=True
Allows bypassing several input checks.
Don't use this parameter unless you know what you are doing.
X_idx_sorted : array-like of shape (n_samples, n_features), \
default=None
The indexes of the sorted training input samples. If many trees
are grown on the same dataset, this allows the ordering to be
cached between trees. If None, the data will be sorted here.
Don't use this parameter unless you know what to do.
Returns
-------
self : DecisionTreeClassifier
Fitted estimator.
"""
super().fit(
X, y,
sample_weight=sample_weight,
check_input=check_input,
X_idx_sorted=X_idx_sorted)
return self
def predict_proba(self, X, check_input=True):
"""Predict class probabilities of the input samples X.
The predicted class probability is the fraction of samples of the same
class in a leaf.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples. Internally, it will be converted to
``dtype=np.float32`` and if a sparse matrix is provided
to a sparse ``csr_matrix``.
check_input : bool, default=True
Allows bypassing several input checks.
Don't use this parameter unless you know what you are doing.
Returns
-------
proba : ndarray of shape (n_samples, n_classes) or list of n_outputs \
such arrays if n_outputs > 1
The class probabilities of the input samples. The order of the
classes corresponds to that in the attribute :term:`classes_`.
"""
X = self._validate_X_predict(X, check_input)
proba = self.tree_.predict(X)
if self.n_outputs_ == 1:
proba = proba[:, :self.n_classes_]
normalizer = proba.sum(axis=1)[:, np.newaxis]
normalizer[normalizer == 0.0] = 1.0
proba /= normalizer
return proba
else:
all_proba = []
for k in range(self.n_outputs_):
proba_k = proba[:, k, :self.n_classes_[k]]
normalizer = proba_k.sum(axis=1)[:, np.newaxis]
normalizer[normalizer == 0.0] = 1.0
proba_k /= normalizer
all_proba.append(proba_k)
return all_proba
def predict_log_proba(self, X):
"""Predict class log-probabilities of the input samples X.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples. Internally, it will be converted to
``dtype=np.float32`` and if a sparse matrix is provided
to a sparse ``csr_matrix``.
Returns
-------
proba : ndarray of shape (n_samples, n_classes) or list of n_outputs \
such arrays if n_outputs > 1
The class log-probabilities of the input samples. The order of the
classes corresponds to that in the attribute :term:`classes_`.
"""
proba = self.predict_proba(X)
if self.n_outputs_ == 1:
return np.log(proba)
else:
for k in range(self.n_outputs_):
proba[k] = np.log(proba[k])
return proba
class ExtraTreeClassifier(DecisionTreeClassifier):
"""An extremely randomized tree classifier.
Extra-trees differ from classic decision trees in the way they are built.
When looking for the best split to separate the samples of a node into two
groups, random splits are drawn for each of the `max_features` randomly
selected features and the best split among those is chosen. When
`max_features` is set to 1, this amounts to building a totally random
decision tree.
Warning: Extra-trees should only be used within ensemble methods.
Read more in the :ref:`User Guide <tree>`.
Parameters
----------
criterion : {"gini", "entropy"}, default="gini"
The function to measure the quality of a split. Supported criteria are
"gini" for the Gini impurity and "entropy" for the information gain.
splitter : {"random", "best"}, default="random"
The strategy used to choose the split at each node. Supported
strategies are "best" to choose the best split and "random" to choose
the best random split.
max_depth : int, default=None
The maximum depth of the tree. If None, then nodes are expanded until
all leaves are pure or until all leaves contain less than
min_samples_split samples.
min_samples_split : int or float, default=2
The minimum number of samples required to split an internal node:
- If int, then consider `min_samples_split` as the minimum number.
- If float, then `min_samples_split` is a fraction and
`ceil(min_samples_split * n_samples)` are the minimum
number of samples for each split.
.. versionchanged:: 0.18
Added float values for fractions.
min_samples_leaf : int or float, default=1
The minimum number of samples required to be at a leaf node.
A split point at any depth will only be considered if it leaves at
least ``min_samples_leaf`` training samples in each of the left and
right branches. This may have the effect of smoothing the model,
especially in regression.
- If int, then consider `min_samples_leaf` as the minimum number.
- If float, then `min_samples_leaf` is a fraction and
`ceil(min_samples_leaf * n_samples)` are the minimum
number of samples for each node.
.. versionchanged:: 0.18
Added float values for fractions.
min_weight_fraction_leaf : float, default=0.0
The minimum weighted fraction of the sum total of weights (of all
the input samples) required to be at a leaf node. Samples have
equal weight when sample_weight is not provided.
max_features : int, float, {"auto", "sqrt", "log2"} or None, default="auto"
The number of features to consider when looking for the best split:
- If int, then consider `max_features` features at each split.
- If float, then `max_features` is a fraction and
`int(max_features * n_features)` features are considered at each
split.
- If "auto", then `max_features=sqrt(n_features)`.
- If "sqrt", then `max_features=sqrt(n_features)`.
- If "log2", then `max_features=log2(n_features)`.
- If None, then `max_features=n_features`.
Note: the search for a split does not stop until at least one
valid partition of the node samples is found, even if it requires to
effectively inspect more than ``max_features`` features.
random_state : int, RandomState instance, default=None
Used to pick randomly the `max_features` used at each split.
See :term:`Glossary <random_state>` for details.
max_leaf_nodes : int, default=None
Grow a tree with ``max_leaf_nodes`` in best-first fashion.
Best nodes are defined as relative reduction in impurity.
If None then unlimited number of leaf nodes.
min_impurity_decrease : float, default=0.0
A node will be split if this split induces a decrease of the impurity
greater than or equal to this value.
The weighted impurity decrease equation is the following::
N_t / N * (impurity - N_t_R / N_t * right_impurity
- N_t_L / N_t * left_impurity)
where ``N`` is the total number of samples, ``N_t`` is the number of
samples at the current node, ``N_t_L`` is the number of samples in the
left child, and ``N_t_R`` is the number of samples in the right child.
``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum,
if ``sample_weight`` is passed.
.. versionadded:: 0.19
min_impurity_split : float, default=0
Threshold for early stopping in tree growth. A node will split
if its impurity is above the threshold, otherwise it is a leaf.
.. deprecated:: 0.19
``min_impurity_split`` has been deprecated in favor of
``min_impurity_decrease`` in 0.19. The default value of
``min_impurity_split`` has changed from 1e-7 to 0 in 0.23 and it
will be removed in 0.25. Use ``min_impurity_decrease`` instead.
class_weight : dict, list of dict or "balanced", default=None
Weights associated with classes in the form ``{class_label: weight}``.
If None, all classes are supposed to have weight one. For
multi-output problems, a list of dicts can be provided in the same
order as the columns of y.
Note that for multioutput (including multilabel) weights should be
defined for each class of every column in its own dict. For example,
for four-class multilabel classification weights should be
[{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of
[{1:1}, {2:5}, {3:1}, {4:1}].
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data
as ``n_samples / (n_classes * np.bincount(y))``
For multi-output, the weights of each column of y will be multiplied.
Note that these weights will be multiplied with sample_weight (passed
through the fit method) if sample_weight is specified.
ccp_alpha : non-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The
subtree with the largest cost complexity that is smaller than
``ccp_alpha`` will be chosen. By default, no pruning is performed. See
:ref:`minimal_cost_complexity_pruning` for details.
.. versionadded:: 0.22
Attributes
----------
classes_ : ndarray of shape (n_classes,) or list of ndarray
The classes labels (single output problem),
or a list of arrays of class labels (multi-output problem).
max_features_ : int
The inferred value of max_features.
n_classes_ : int or list of int
The number of classes (for single output problems),
or a list containing the number of classes for each
output (for multi-output problems).
feature_importances_ : ndarray of shape (n_features,)
The impurity-based feature importances.
The higher, the more important the feature.
The importance of a feature is computed as the (normalized)
total reduction of the criterion brought by that feature. It is also
known as the Gini importance.
Warning: impurity-based feature importances can be misleading for
high cardinality features (many unique values). See
:func:`sklearn.inspection.permutation_importance` as an alternative.
n_features_ : int
The number of features when ``fit`` is performed.
n_outputs_ : int
The number of outputs when ``fit`` is performed.
tree_ : Tree
The underlying Tree object. Please refer to
``help(sklearn.tree._tree.Tree)`` for attributes of Tree object and
:ref:`sphx_glr_auto_examples_tree_plot_unveil_tree_structure.py`
for basic usage of these attributes.
See Also
--------
ExtraTreeRegressor : An extremely randomized tree regressor.
sklearn.ensemble.ExtraTreesClassifier : An extra-trees classifier.
sklearn.ensemble.ExtraTreesRegressor : An extra-trees regressor.
Notes
-----
The default values for the parameters controlling the size of the trees
(e.g. ``max_depth``, ``min_samples_leaf``, etc.) lead to fully grown and
unpruned trees which can potentially be very large on some data sets. To
reduce memory consumption, the complexity and size of the trees should be
controlled by setting those parameter values.
References
----------
.. [1] P. Geurts, D. Ernst., and L. Wehenkel, "Extremely randomized trees",
Machine Learning, 63(1), 3-42, 2006.
Examples
--------
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.tree import ExtraTreeClassifier
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, random_state=0)
>>> extra_tree = ExtraTreeClassifier(random_state=0)
>>> cls = BaggingClassifier(extra_tree, random_state=0).fit(
... X_train, y_train)
>>> cls.score(X_test, y_test)
0.8947...
"""
def __init__(self, *,
criterion="gini",
splitter="random",
max_depth=None,
min_samples_split=2,
min_samples_leaf=1,
min_weight_fraction_leaf=0.,
max_features="auto",
random_state=None,
max_leaf_nodes=None,
min_impurity_decrease=0.,
min_impurity_split=None,
class_weight=None,
ccp_alpha=0.0):
super().__init__(
criterion=criterion,
splitter=splitter,
max_depth=max_depth,
min_samples_split=min_samples_split,
min_samples_leaf=min_samples_leaf,
min_weight_fraction_leaf=min_weight_fraction_leaf,
max_features=max_features,
max_leaf_nodes=max_leaf_nodes,
class_weight=class_weight,
min_impurity_decrease=min_impurity_decrease,
min_impurity_split=min_impurity_split,
random_state=random_state,
ccp_alpha=ccp_alpha)
Put all of the code files above into a single folder named DT, then place the test script below alongside the DT folder, and it can be run directly.
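The assumed layout looks roughly like this (the module names Criterion.py, Splitter.py and Tree.py are taken from the import statements at the top of DecisionTree.py):

DT/
    Utils.py
    Criterion.py
    Splitter.py
    Tree.py
    DecisionTree.py
dtree_test.py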
dtree_test.py
from DT.DecisionTree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Number of mislabeled points out of a total %d points : %d" \
% (X_test.shape[0], (y_test != y_pred).sum()))
Running the script shows that it executes successfully and produces the same result as scikit-learn's built-in implementation. Note that, due to limited time, not all of the code has been checked: only the execution path of this simplest case was adapted and verified. The classification-related code has been kept where relevant, while all regression-related code has been removed. Training with other parameters may therefore raise errors or give incorrect results, so check and modify the code as needed before relying on it.
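As a closing illustration, the sketch below shows one way the pruning machinery above could be exercised, reusing the same data split as dtree_test.py. It comes with the same caveat as the rest of the port: the ccp_alpha value 0.02 is an arbitrary illustration, and the estimator-level cost_complexity_pruning_path depends on get_params/set_params (via clone), which do not appear in the listing, so the module-level ccp_pruning_path is called on the fitted tree instead. Expect to adjust it if anything in the pruning path still needs fixes.

from DT.DecisionTree import DecisionTreeClassifier
from DT.Tree import ccp_pruning_path
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=0)

# Unpruned tree: ccp_alpha=0.0 makes _prune_tree return immediately.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Inspect the candidate alphas of the cost-complexity pruning path.
path = ccp_pruning_path(full.tree_)
print("candidate ccp_alphas:", path['ccp_alphas'])

# Refit with a non-zero ccp_alpha: every branch whose effective alpha is
# not larger than ccp_alpha is collapsed into a leaf.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(
    X_train, y_train)

for name, clf in (("full", full), ("pruned", pruned)):
    acc = (clf.predict(X_test) == y_test).mean()
    print("%s tree: %d nodes, depth %d, test accuracy %.3f"
          % (name, clf.tree_.node_count, clf.get_depth(), acc))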