:mod:`lookout.style.format.rules`
=================================

.. py:module:: lookout.style.format.rules

.. autoapi-nested-parse::

   Train and compile rules for multi-class classification using an sklearn base model.


Module Contents
---------------

.. data:: RuleAttribute

   `feature` is the feature taken for comparison.
   `cmp` is the comparison type: True is "x > v", False is "x <= v".
   `threshold` is "v", the threshold value.


.. data:: RuleStats

   `cls` is the predicted class.
   `conf` is the rule confidence in [0, 1]; 1 means fully confident.


.. py:class:: Rule

   Bases: :class:`typing.NamedTuple()`

   Decision rule which consists of a series of attribute comparisons, statistics and a flag which indicates whether the rule was created outside of the training (notably, in Rules.harmonize_quotes()).

   The statistics contain the predicted class index.

   .. method:: group_features(self, feature_extractor:FeatureExtractor)

      Generate rule splits grouped by feature type.

      Attribute indexes are from the original sequence before feature selection!

      :param feature_extractor: The FeatureExtractor used to create those rules.
      :return: generator


.. data:: QuotedNodeTriple

.. data:: QuotedNodeTripleMapping

.. py:class:: Rules(rules:List[Rule], origin_config:Mapping[str, Any])

   Store already trained rules for downstream prediction tasks.

   .. attribute:: CompiledNegatedRules

      Each ndarray contains the rule indices which are **false** given the corresponding feature, threshold value and the comparison type ("false" and "true").

   .. attribute:: CompiledFeatureRules

   .. attribute:: CompiledRulesType

   .. attribute:: _log

   .. attribute:: classification_report

      Property for the classification report with quality metrics.

      Return an empty dict if unset. Can be set for a dataset with the generate_classification_report() method.

      :return: Classification report.

   .. attribute:: rules

      Return the list of rules.

   .. attribute:: origin_config

      Return the configuration used for the model training.

   .. attribute:: avg_rule_len

      Compute the average length of the rules.

   .. method:: __str__(self)

   .. method:: __len__(self)

   .. method:: apply(self, X_csr:csr_matrix, return_winner_indices=False)

      Evaluate the rules against the given features.

      :param X_csr: Input features.
      :param return_winner_indices: Whether to return the winning rule index for each sample.
      :return: Array of the same length as X with predictions, or a tuple of two arrays of the same length as X containing (predictions, winner rule indices). If no rule is triggered for a feature row, the corresponding result is -1.

   .. method:: predict(self, X:csr_matrix, vnodes_y:Sequence[VirtualNode], vnodes:Sequence[VirtualNode], feature_extractor:FeatureExtractor)

      Predict classes given the input features and metadata.

      :param X: Input features.
      :param vnodes_y: Sequence of the labeled `VirtualNode`-s corresponding to labeled samples.
      :param vnodes: Sequence of all the `VirtualNode`-s corresponding to the input.
      :param feature_extractor: FeatureExtractor used to extract features.
      :return: The predictions, the winning rules and the new Rules.

   .. staticmethod:: fill_missing_predictions(y:numpy.ndarray, y_fallback:numpy.ndarray)

      Fill missing predictions with original labels.

      :param y: Array with predictions. Negative values are considered missing predictions.
      :param y_fallback: Original labels. The vector should have the same length as `y`.
      :return: Filled array with labels. The array has the same size as the original.

   .. method:: filter_by_confidence(self, confidence_threshold:float)

      Filter rules according to a confidence threshold.

      :param confidence_threshold: Minimum confidence value.
      :return: Filtered rules.

   .. method:: filter_by_support(self, support_threshold:int)

      Filter rules according to a support threshold.

      :param support_threshold: Minimum support value.
      :return: Filtered rules.

   .. method:: generate_classification_report(self, X:csr_matrix, y:numpy.ndarray, dataset_type:str, target_names:Sequence[str])

      Calculate and store the classification report with quality metrics for the given dataset.

      :param X: Features matrix.
      :param y: Target vector.
      :param dataset_type: Can be set to "test" or "train" only. Marks the passed data as train or test.
      :param target_names: Class names in y.

   .. staticmethod:: _get_composite(feature_extractor:FeatureExtractor, labels:Tuple[int, ...])

   .. method:: _group_quote_predictions(self, vnodes_y:Sequence[VirtualNode], vnodes:Sequence[VirtualNode])

   .. method:: harmonize_quotes(self, y_pred:numpy.ndarray, vnodes_y:Sequence[VirtualNode], vnodes:Sequence[VirtualNode], winners:numpy.ndarray, feature_extractor:FeatureExtractor, grouped_quote_predictions:QuotedNodeTripleMapping)

      Post-process predictions to correct mismatched quotes.

      To do so, we consider only the tuples (', STRING, ') or (", STRING, ") in the input. We then create fake rules as needed (because a rule going from the input to the corrected quote might not exist in the trained rules).

      :param y_pred: Predictions to correct.
      :param vnodes_y: Sequence of the predicted virtual nodes.
      :param vnodes: Sequence of virtual nodes representing the input.
      :param winners: Indices of the rules that were used to compute the predictions.
      :param feature_extractor: FeatureExtractor used to extract features.
      :param grouped_quote_predictions: Quote predictions (handled differently from the rest).
      :return: Updated y, winners and new rules.

   .. classmethod:: _compile(cls, rules:Sequence[Rule])

   .. classmethod:: _compute_triggered(cls, compiled_rules:CompiledRulesType, rules:Sequence[Rule], x:numpy.ndarray)


.. data:: LabelScore

.. py:class:: TrainableRules(*, base_model_name:str='sklearn.tree.DecisionTreeClassifier', prune_branches_algorithms=('reduced-error', 'top-down-greedy'), top_down_greedy_budget:Tuple[bool, Union[float, int]]=(False, 1.0), prune_attributes=True, confidence_threshold=0.8, attribute_similarity_threshold=0.98, prune_dataset_ratio=0.2, n_estimators=10, max_depth=None, max_features=None, min_samples_leaf=1, min_samples_split=2, random_state=42, origin_config=None)

   Bases: :class:`sklearn.base.BaseEstimator`, :class:`sklearn.base.ClassifierMixin`

   Trainable rules model based on a decision tree or a random forest.

   .. attribute:: _log

   .. attribute:: base_model_name

      Return the name of the base model used for training.

   .. attribute:: fitted

      Return whether the model is fitted or not.

   .. attribute:: _check_fitted

   .. attribute:: rules

      Return the list of rules.

   .. method:: fit(self, X:csr_matrix, y:numpy.ndarray)

      Train the rules using the base tree model and the samples (X, y).

      If `base_model` is already fitted, the samples may be different from the ones that were used.

      :param X: Input features.
      :param y: Input labels, the same length as X.
      :return: self

   .. method:: prune_categorical_attributes(self, feature_extractor:FeatureExtractor)

      Remove "not in" categorical assertions which are overridden by strict equalities.

      :param feature_extractor: FeatureExtractor which created the train samples.
      :return: Nothing

   .. staticmethod:: _check_fitted(func)

   .. method:: predict(self, X:csr_matrix)

      Evaluate the rules against the given features.

      :param X: Input features.
      :return: Array of the same length as X with predictions.

   .. method:: full_score(self, X:csr_matrix, y:numpy.ndarray)

      Evaluate the trained rules and return the metrics.

      :param X: Input data.
      :param y: Output labels.
      :return: Mapping from labels to `ClassScore`-s.

   .. classmethod:: _tree_to_rules(cls, tree:DecisionTreeClassifier, offset:int=0, class_mapping:Optional[numpy.ndarray]=None)

      Convert an sklearn decision tree to a set of rules.

      Each rule is a branch in the tree.

      :param tree: Input decision tree.
      :param offset: Offset for the rules' identifiers, used when there are several trees.
      :param class_mapping: Mapping for rules' classes, used when there are several trees.
      :return: List of extracted rules.

   .. classmethod:: _merge_rules(cls, rules:List[Rule])

   .. classmethod:: _prune_reduced_error(cls, model:DecisionTreeClassifier, X:numpy.array, y:numpy.array, step_score_drop:float=0, max_score_drop:float=0)

   .. method:: _build_instances_index(self, base_model:Union[DecisionTreeClassifier, RandomForestClassifier], X:numpy.ndarray, leaf2rule:Sequence[Mapping[int, int]])

   .. method:: _prune_branches_top_down_greedy(self, base_model:Union[DecisionTreeClassifier, RandomForestClassifier], rules:Sequence[Rule], X:numpy.ndarray, Y:numpy.ndarray, leaf2rule:Sequence[Mapping[int, int]], budget:Tuple[bool, Union[float, int]])

      Prune branches using a greedy top-down algorithm.

      :param base_model: Sklearn decision tree or random forest base model.
      :param rules: Rules extracted from the base model.
      :param X: Samples to use to evaluate the quality of subsets of branches.
      :param Y: Labels to use to evaluate the quality of subsets of branches.
      :param leaf2rule: Mapping from leaves in the base model to rules.
      :param budget: Tuple describing the budget. The first value is a boolean indicating whether the budget is absolute (True) or relative (False). If it is True (absolute budget), the second value should be an integer giving the maximum number of rules to keep. If it is False (relative budget), the second value should be a float between 0 and 1 giving the proportion of rules to keep.
      :return: Pruned list of rules.

   .. classmethod:: _prune_attributes(cls, rules:Iterable[Rule], X:csr_matrix, Y:numpy.ndarray, sim_threshold:float)

      Remove the attribute comparisons which do not influence the rule decision.

      We treat two attribute comparisons as similar if the sets of samples on which they trigger and on which they err are similar according to the Jaccard metric.

      :param rules: List of rules to simplify.
      :param X: Input features, used to exclude the irrelevant attributes.
      :param Y: Input labels.
      :param sim_threshold: How many attributes to prune. Must be between 0 and 1. The closer to 0, the fewer attributes are left.
      :return: New list of simplified rules.

   .. staticmethod:: _sanitize_params(params:Dict[str, Any])

      Normalize the parameters from get_params() so that they are suitable for serialization.

      :param params: Dictionary obtained from get_params().
      :return: Normalized dictionary.

   .. classmethod:: _get_param_names(cls)
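
The comparison semantics of `RuleAttribute`, the -1 convention of `Rules.apply` and the fallback behaviour of `Rules.fill_missing_predictions` can be illustrated with a small, self-contained sketch. It is not the library's implementation: `ToyRule`, `rule_triggers`, `toy_apply` and `toy_fill_missing` are hypothetical names, plain Python lists stand in for `csr_matrix` and `numpy.ndarray`, and choosing the winner as the triggered rule with the highest confidence is an assumption made for illustration only.

```python
from typing import List, NamedTuple, Sequence, Tuple


class RuleAttribute(NamedTuple):
    feature: int      # index of the feature taken for comparison
    cmp: bool         # True is "x > v", False is "x <= v"
    threshold: float  # "v", the threshold value


class RuleStats(NamedTuple):
    cls: int     # predicted class
    conf: float  # rule confidence in [0, 1]


class ToyRule(NamedTuple):
    """Hypothetical, simplified stand-in for Rule (no "artificial" flag)."""
    attributes: Tuple[RuleAttribute, ...]
    stats: RuleStats


def rule_triggers(rule: ToyRule, x: Sequence[float]) -> bool:
    # A rule triggers only if every attribute comparison holds for the sample.
    return all((x[a.feature] > a.threshold) == a.cmp for a in rule.attributes)


def toy_apply(rules: Sequence[ToyRule],
              X: Sequence[Sequence[float]]) -> Tuple[List[int], List[int]]:
    """Return (predictions, winner rule indices); -1 where no rule triggers."""
    y = [-1] * len(X)
    winners = [-1] * len(X)
    for i, x in enumerate(X):
        best_conf = -1.0
        for j, rule in enumerate(rules):
            # Assumption: the most confident triggered rule wins.
            if rule.stats.conf > best_conf and rule_triggers(rule, x):
                best_conf = rule.stats.conf
                y[i], winners[i] = rule.stats.cls, j
    return y, winners


def toy_fill_missing(y: Sequence[int], y_fallback: Sequence[int]) -> List[int]:
    # Negative values are treated as missing and replaced by the fallback label.
    return [fb if pred < 0 else pred for pred, fb in zip(y, y_fallback)]


rules = [
    ToyRule((RuleAttribute(0, True, 0.5),), RuleStats(cls=1, conf=0.9)),
    ToyRule((RuleAttribute(0, False, 0.5), RuleAttribute(1, True, 2.0)),
            RuleStats(cls=2, conf=0.7)),
]
X = [[1.0, 0.0], [0.0, 3.0], [0.0, 0.0]]

y, winners = toy_apply(rules, X)
filled = toy_fill_missing(y, [5, 5, 5])
print(y, winners, filled)
```

The third sample matches neither rule, so it receives -1 from `toy_apply` and only gets a label once the fallback labels are applied.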