lookout.style.format.rules

Train and compile rules for multi-class classification using an sklearn base model.

Module Contents

lookout.style.format.rules.RuleAttribute

  • feature is the feature taken for comparison
  • cmp is the comparison type: True is “x > v”, False is “x <= v”
  • threshold is “v”, the threshold value

lookout.style.format.rules.RuleStats

  • cls is the predicted class
  • conf is the rule confidence in [0, 1]; “1” means fully confident

class lookout.style.format.rules.Rule

Bases: typing.NamedTuple

Decision rule which consists of a series of attribute comparisons, statistics, and a flag indicating whether the rule was created outside of training (notably, in Rules.harmonize_quotes()). The statistics contain the predicted class index.
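For orientation, a minimal sketch of how these named tuples compose. The field names (attrs, stats, artificial) and the exact RuleStats signature are assumptions inferred from the descriptions above, not a verified API:

    from lookout.style.format.rules import Rule, RuleAttribute, RuleStats

    # Two comparisons: "feature 3 > 0.5" (cmp=True) and "feature 7 <= 2.0" (cmp=False).
    attrs = (RuleAttribute(feature=3, cmp=True, threshold=0.5),
             RuleAttribute(feature=7, cmp=False, threshold=2.0))
    stats = RuleStats(cls=1, conf=0.9)  # predicts class 1 with confidence 0.9
    rule = Rule(attrs=attrs, stats=stats, artificial=False)  # created during training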

group_features(self, feature_extractor:FeatureExtractor)

Generate rule splits grouped by feature type.

Attribute indexes are from the original sequence before feature selection!

Parameters:feature_extractor – The FeatureExtractor used to create those rules.
Returns:generator
lookout.style.format.rules.QuotedNodeTriple
lookout.style.format.rules.QuotedNodeTripleMapping
class lookout.style.format.rules.Rules(rules:List[Rule], origin_config:Mapping[str, Any])

Store already trained rules for downstream prediction tasks.

CompiledNegatedRules

Each ndarray contains the rule indices which are false given the corresponding feature, threshold value and the comparison type (“false” and “true”).

CompiledFeatureRules
CompiledRulesType
_log
classification_report

Property for classification report with quality metrics.

Return an empty dict if unset. The report can be generated for a dataset with the generate_classification_report() method.

rules

Return the list of rules.

origin_config

Return the configuration used for the model training.

avg_rule_len

Compute the average length of the rules.

__str__(self)
__len__(self)
apply(self, X_csr:csr_matrix, return_winner_indices=False)

Evaluate the rules against the given features.

Parameters:
  • X_csr – input features.
  • return_winner_indices – whether to return the winning rule index for each sample.
Returns:

array of the same length as X with predictions, or a tuple of two arrays of the same length as X containing (predictions, winner rule indices). If no rule was triggered for a feature row, the corresponding result is -1.
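A hedged usage sketch; rules is a trained Rules instance and the feature rows are placeholders:

    import numpy as np
    from scipy.sparse import csr_matrix

    features = np.random.RandomState(0).rand(4, 10)  # placeholder feature rows
    X = csr_matrix(features)
    y_pred, winners = rules.apply(X, return_winner_indices=True)
    missed = y_pred == -1  # rows where no rule was triggered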

predict(self, X:csr_matrix, vnodes_y:Sequence[VirtualNode], vnodes:Sequence[VirtualNode], feature_extractor:FeatureExtractor)

Predict classes given the input features and metadata.

Parameters:
  • X – Sparse matrix of input features.
  • vnodes_y – Sequence of the labeled VirtualNode-s corresponding to labeled samples.
  • vnodes – Sequence of all the VirtualNode-s corresponding to the input.
  • feature_extractor – FeatureExtractor used to extract features.
Returns:

The predictions, the winning rules and the new Rules.

static fill_missing_predictions(y:numpy.ndarray, y_fallback:numpy.ndarray)

Fill missing predictions with original labels.

Parameters:
  • y – Array with predictions. Negative values are considered missing predictions.
  • y_fallback – Original labels. The vector must have the same length as y.
Returns:

Filled array of labels, the same size as the original.
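The behavior can be expressed as a short numpy sketch (an illustrative reimplementation, not the library’s code):

    import numpy as np

    def fill_missing_predictions(y: np.ndarray, y_fallback: np.ndarray) -> np.ndarray:
        # Negative entries in y mark missing predictions; take the original label there.
        return np.where(y < 0, y_fallback, y)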

filter_by_confidence(self, confidence_threshold:float)

Filter rules according to a confidence threshold.

Parameters:confidence_threshold – Minimum confidence value.
Returns:Filtered rules.
filter_by_support(self, support_threshold:int)

Filter rules according to a support threshold.

Parameters:support_threshold – Minimum support value.
Returns:Filtered rules.
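A hedged usage sketch for both filters; that each call returns a new Rules instance (rather than filtering in place) is an assumption based on the “Returns” descriptions:

    # Drop rules below 90% confidence, then those supported by fewer than 50 samples.
    strong = rules.filter_by_confidence(0.9)
    strong = strong.filter_by_support(50)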
generate_classification_report(self, X:csr_matrix, y:numpy.ndarray, dataset_type:str, target_names:Sequence[str])

Calculate and store classification report with quality metrics for given dataset.

Parameters:
  • X – Features matrix.
  • y – Target vector.
  • dataset_type – Either “train” or “test”; marks the passed data as the train or test set.
  • target_names – Class names in y.
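A usage sketch connecting this method to the classification_report property above; X_test, y_test and the class names are placeholders, and the exact shape of the stored report (e.g. whether it is keyed by dataset type) is an assumption:

    target_names = ["class_%d" % i for i in range(3)]  # illustrative class names
    rules.generate_classification_report(X_test, y_test, "test", target_names)
    report = rules.classification_report  # non-empty after generation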
static _get_composite(feature_extractor:FeatureExtractor, labels:Tuple[int, ...])
_group_quote_predictions(self, vnodes_y:Sequence[VirtualNode], vnodes:Sequence[VirtualNode])
harmonize_quotes(self, y_pred:numpy.ndarray, vnodes_y:Sequence[VirtualNode], vnodes:Sequence[VirtualNode], winners:numpy.ndarray, feature_extractor:FeatureExtractor, grouped_quote_predictions:QuotedNodeTripleMapping)

Post-process predictions to correct mismatched quotes.

To do so, we consider only the tuples (', STRING, ') or (", STRING, ") in the input. We then create fake rules as needed (because a rule going from the input to the corrected quote might not exist in the trained rules).

Parameters:
  • y_pred – Predictions to correct.
  • vnodes_y – Sequence of the predicted virtual nodes.
  • vnodes – Sequence of virtual nodes representing the input.
  • winners – Indices of the rules that were used to compute the predictions.
  • feature_extractor – FeatureExtractor used to extract features.
  • grouped_quote_predictions – Quote predictions (handled differently from the rest).
Returns:

Updated y, winners and new rules.

classmethod _compile(cls, rules:Sequence[Rule])
classmethod _compute_triggered(cls, compiled_rules:CompiledRulesType, rules:Sequence[Rule], x:numpy.ndarray)
lookout.style.format.rules.LabelScore
class lookout.style.format.rules.TrainableRules(*, base_model_name:str='sklearn.tree.DecisionTreeClassifier', prune_branches_algorithms=('reduced-error', 'top-down-greedy'), top_down_greedy_budget:Tuple[bool, Union[float, int]]=(False, 1.0), prune_attributes=True, confidence_threshold=0.8, attribute_similarity_threshold=0.98, prune_dataset_ratio=0.2, n_estimators=10, max_depth=None, max_features=None, min_samples_leaf=1, min_samples_split=2, random_state=42, origin_config=None)

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

Trainable rules model based on a decision tree or a random forest.

_log
base_model_name

Return the name of the base model used for training.

fitted

Return whether the model is fitted or not.

_check_fitted
rules

Return the list of rules.

fit(self, X:csr_matrix, y:numpy.ndarray)

Train the rules using the base tree model and the samples (X, y).

If the base model is already fitted, the samples may be different from the ones that were used to fit it.

Parameters:
  • X – input features.
  • y – input labels - the same length as X.
Returns:

self
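An end-to-end sketch of training and evaluation, assuming X_train/X_test are scipy csr_matrix features and y_train is a numpy label vector (all placeholders):

    from lookout.style.format.rules import TrainableRules

    trainable = TrainableRules(
        base_model_name="sklearn.tree.DecisionTreeClassifier",
        prune_branches_algorithms=("reduced-error",),
        confidence_threshold=0.8,
        prune_dataset_ratio=0.2,
    )
    trainable.fit(X_train, y_train)
    y_pred = trainable.predict(X_test)  # evaluate the extracted rules
    compiled = trainable.rules          # trained Rules for downstream prediction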

prune_categorical_attributes(self, feature_extractor:FeatureExtractor)

Remove “not in” categorical assertions which are overridden by strict equalities.

Parameters:feature_extractor – FeatureExtractor which created the train samples.
Returns:Nothing
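To see the redundancy this removes, consider one-hot encoded categorical features: a strict equality on one category already implies “not in” for its sibling categories. A hypothetical illustration, reusing the RuleAttribute fields assumed earlier:

    from lookout.style.format.rules import RuleAttribute

    # Hypothetical one-hot columns: 0 -> quote == '"', 1 -> quote == "'".
    attrs = (RuleAttribute(feature=0, cmp=True, threshold=0.5),   # quote is '"'
             RuleAttribute(feature=1, cmp=False, threshold=0.5))  # quote is not "'"
    # One-hot exclusivity makes the second comparison redundant, so pruning
    # keeps only the strict equality.
    pruned = attrs[:1]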
static _check_fitted(func)
predict(self, X:csr_matrix)

Evaluate the rules against the given features.

Parameters:X – Input features.
Returns:Array of the same length as X with predictions.
full_score(self, X:csr_matrix, y:numpy.ndarray)

Evaluate the trained rules and return the metrics.

Parameters:
  • X – Input data.
  • y – Output labels.
Returns:

Mapping from labels to LabelScore-s.

classmethod _tree_to_rules(cls, tree:DecisionTreeClassifier, offset:int=0, class_mapping:Optional[numpy.ndarray]=None)

Convert an sklearn decision tree to a set of rules.

Each rule is a branch in the tree.

Parameters:
  • tree – input decision tree.
  • offset – offset for the rules’ identifiers - used when there are several trees.
  • class_mapping – mapping for rules’ classes - used when there are several trees.
Returns:

list of extracted rules.
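The conversion amounts to enumerating every root-to-leaf path of the fitted tree. A simplified standalone sketch over sklearn’s public tree_ arrays, emitting (feature, cmp, threshold) triples in the spirit of RuleAttribute (an independent reimplementation of the idea, not the library’s code):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def tree_to_paths(model: DecisionTreeClassifier):
        """Yield (comparisons, predicted class) for every branch of the tree."""
        t = model.tree_
        def walk(node, comparisons):
            if t.children_left[node] == t.children_right[node]:  # leaf node
                yield tuple(comparisons), int(np.argmax(t.value[node]))
                return
            feat, thr = int(t.feature[node]), float(t.threshold[node])
            yield from walk(t.children_left[node], comparisons + [(feat, False, thr)])  # x <= thr
            yield from walk(t.children_right[node], comparisons + [(feat, True, thr)])  # x > thr
        yield from walk(0, [])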

classmethod _merge_rules(cls, rules:List[Rule])
classmethod _prune_reduced_error(cls, model:DecisionTreeClassifier, X:numpy.array, y:numpy.array, step_score_drop:float=0, max_score_drop:float=0)
_build_instances_index(self, base_model:Union[DecisionTreeClassifier, RandomForestClassifier], X:numpy.ndarray, leaf2rule:Sequence[Mapping[int, int]])
_prune_branches_top_down_greedy(self, base_model:Union[DecisionTreeClassifier, RandomForestClassifier], rules:Sequence[Rule], X:numpy.ndarray, Y:numpy.ndarray, leaf2rule:Sequence[Mapping[int, int]], budget:Tuple[bool, Union[float, int]])

Prune branches using a greedy top down algorithm.

Parameters:
  • base_model – Sklearn decision tree or random forest base model.
  • rules – Rules extracted from the base model.
  • X – Samples to use to evaluate the quality of subsets of branches.
  • Y – Labels to use to evaluate the quality of subsets of branches.
  • leaf2rule – Mapping from leaves in the base model to rules.
  • budget – Tuple describing the budget: a boolean indicating whether it is absolute (True) or relative (False), and a value. If the boolean is True (absolute budget), the value should be an integer giving the maximum number of rules to keep; if it is False (relative budget), the value should be a float between 0 and 1 giving the proportion of rules to keep. See the sketch after this entry.
Returns:

Pruned list of rules.
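The budget tuple decodes as in this small sketch (consistent with the parameter description above; the exact rounding is an assumption):

    def budget_to_max_rules(budget, n_rules):
        absolute, value = budget
        # Absolute: keep at most `value` rules; relative: keep that proportion.
        return int(value) if absolute else int(value * n_rules)

    budget_to_max_rules((True, 50), 200)     # -> 50
    budget_to_max_rules((False, 0.25), 200)  # -> 50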

classmethod _prune_attributes(cls, rules:Iterable[Rule], X:csr_matrix, Y:numpy.ndarray, sim_threshold:float)

Remove the attribute comparisons which do not influence the rule decision.

We treat two attribute comparisons as similar if the sets of samples on which they trigger and on which they err are similar by the Jaccard metric.

Parameters:
  • rules – List of rules to simplify.
  • X – Input features, used to exclude the irrelevant attributes.
  • Y – Input labels.
  • sim_threshold – Similarity threshold that controls how many attributes are pruned. Must be between 0 and 1; the closer to 0, the fewer attributes are left.
Returns:

New list of simplified rules.
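The similarity test described above can be sketched with boolean trigger masks and the Jaccard index (an illustrative reimplementation under assumed semantics):

    import numpy as np

    def jaccard(a: np.ndarray, b: np.ndarray) -> float:
        """Jaccard similarity of two boolean masks over the same samples."""
        union = np.logical_or(a, b).sum()
        return 1.0 if union == 0 else float(np.logical_and(a, b).sum() / union)

    # Two attribute comparisons whose trigger/mistake masks satisfy
    # jaccard(mask_a, mask_b) >= sim_threshold (e.g. 0.98) are treated as
    # duplicates, and one of them is pruned.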

static _sanitize_params(params:Dict[str, Any])

Normalize the parameters from get_params() so that they are suitable for serialization.

Parameters:params – Dictionary obtained from get_params().
Returns:Normalized dictionary.
classmethod _get_param_names(cls)