lookout.style.format.feature_extractor

Feature extraction module.

Module Contents

lookout.style.format.feature_extractor.IndexToFeature
lookout.style.format.feature_extractor.FEATURES_NUMPY_TYPE
lookout.style.format.feature_extractor.FEATURES_MIN
lookout.style.format.feature_extractor.FEATURES_MAX
class lookout.style.format.feature_extractor.FeatureExtractor(*, language:str, left_siblings_window:int, right_siblings_window:int, parents_depth:int, node_features:Sequence[str], left_features:Sequence[str], right_features:Sequence[str], parent_features:Sequence[str], no_labels_on_right:bool, select_features_number:Optional[int], debug_parsing:bool, return_sibling_indices:bool, selected_features:Optional[numpy.ndarray]=None, label_composites:Optional[List[Tuple[int, ...]]]=None, cutoff_label_support:int=0)

Extract features for downstream models.

_log
index_to_feature

Return the mapping from integer indices to the corresponding feature names.

feature_to_indices

Return the mapping from feature names to the corresponding integer indices.

features

Return the Feature-s used by this feature extractor.

feature_names

Return the names of the features.

A feature name uniquely identifies a feature. It reflects the feature structure: it consists of the feature group the feature belongs to, its sibling identifier, its feature name and, if applicable, its index.

Those names, as well as the feature layout, depend on the configuration used to launch the analyzer.

composite_class_representations

Return the class representations of composite classes.

Returns: Strings representing the composite classes.
composite_class_printables

Return the class printables of composite classes.

Returns: Strings that can be printed to represent the composite classes.
count_features(self, feature_group:Optional[FeatureGroup]=None, neighbour_index:Optional[int]=None)

Return the feature count of a given subset of features.

extract_features(self, files:Iterable[UnicodeFile], lines:Optional[List[List[int]]]=None)

Compute features and labels required by downstream models given a list of File-s.

Parameters:
  • files – the list of File-s (see service_data.proto) of the same language.
  • lines – the list of enabled line numbers per file. The lines which are not mentioned will not be extracted.
Returns:

Tuple of the features (2-dimensional numpy.ndarray), the labels (1-dimensional numpy.ndarray), the corresponding VirtualNode-s and the parents mapping, or None if no features were extracted.

select_features(self, X:csr_matrix, y:numpy.ndarray)

Select the most useful features based on sklearn’s univariate feature selection.

Parameters:
  • X – Scipy CSR 2-dimensional matrix of features to select.
  • y – Numpy 1-dimensional array of labels.
Returns:

Tuple of a CSR matrix with only the selected features (columns) kept and an array of the indices of the kept features for later reapplication.
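
The following is a minimal sketch of univariate feature selection on a CSR matrix, included to illustrate the shape of the inputs and outputs described above. It is not the module's implementation: the helper name select_top_features, the choice of SelectKBest and the chi2 score function are assumptions made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_selection import SelectKBest, chi2


def select_top_features(X: csr_matrix, y: np.ndarray, k: int):
    """Keep the k best columns of X (hypothetical helper, chi2 is an assumed score)."""
    selector = SelectKBest(chi2, k=k)
    X_selected = selector.fit_transform(X, y)
    # Indices of the kept columns, so the same selection can be reapplied later.
    kept = selector.get_support(indices=True)
    return X_selected, kept


# Toy data: only the middle column varies with the label.
X = csr_matrix(np.array([[1, 0, 3],
                         [1, 0, 0],
                         [1, 5, 3],
                         [1, 5, 0]], dtype=float))
y = np.array([0, 0, 1, 1])
X_selected, kept = select_top_features(X, y, k=1)

# Reapplying the selection to a matrix with the same layout is plain column indexing.
X_reapplied = X[:, kept]
```

Returning the kept indices alongside the reduced matrix is what makes it possible to apply the same column selection to data seen after training.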

label_to_str(self, label:int)

Convert a label to string.

_compute_feature_info(self)
_parse_vnodes(self, files:Iterable[UnicodeFile], lines:Optional[List[List[int]]]=None)
_convert_files_to_xy(self, parsed_files:List[Tuple[List[VirtualNode], Dict[int, bblfsh.Node], Set[int]]])
_create_neighbours(self, vnodes:Sequence[VirtualNode], vnodes_y:Sequence[VirtualNode], parents:Mapping[int, bblfsh.Node], return_sibling_indices:bool=False)
_classify_vnodes(self, nodes:Iterable[VirtualNode], path:str)

Fill the “y” attribute of the VirtualNode-s extracted by _parse_file().

The attribute holds the index of the corresponding class to predict. Indentation changes are detected, so several whitespace nodes are merged together.

Parameters:
  • nodes – sequence of VirtualNodes.
  • path – path to file.
Returns:

New list of VirtualNode-s, whose size can differ from the original.

_merge_classes_to_composite_labels(self, vnodes:Iterable[VirtualNode], path:str, index_labels:bool=False)

Pack successive predictable nodes into single “composite” labels.

Parameters:
  • vnodes – Iterable of VirtualNode-s to process.
  • path – Path to the file from which we are currently extracting features.
  • index_labels – Whether to index labels to define output classes or not.
Yield:

The sequence of VirtualNode-s which is identical to the input but the successive Y-nodes are merged together.
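
The merging described above can be sketched as a small generator. This is an illustration under stated assumptions, not the actual method: Node is a hypothetical stand-in for VirtualNode that carries only a text value and an optional tuple of labels, and merge_successive_labeled is a made-up name.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for VirtualNode: text plus an optional label tuple."""
    value: str
    y: Optional[Tuple[str, ...]] = None


def merge_successive_labeled(nodes: Iterable[Node]) -> Iterator[Node]:
    """Yield the input nodes, with each run of labeled nodes packed into one node."""
    pending: Optional[Node] = None
    for node in nodes:
        if node.y is None:
            # An unlabeled node ends the current run of labeled nodes, if any.
            if pending is not None:
                yield pending
                pending = None
            yield node
        elif pending is None:
            pending = Node(node.value, node.y)
        else:
            # Concatenate both the text and the labels into a composite.
            pending = Node(pending.value + node.value, pending.y + node.y)
    if pending is not None:
        yield pending


nodes = [Node("x"), Node("\n", ("newline",)), Node("  ", ("indent",)), Node("y")]
merged = list(merge_successive_labeled(nodes))
```

In this sketch the newline and the indentation end up in a single node labeled ("newline", "indent"), and the concatenated text of the sequence is unchanged.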

_add_noops(self, vnodes:Sequence[VirtualNode], path:str, index_labels:bool=False)

Add CLS_NOOP nodes in between tokens without labeled nodes to allow for insertions.

Parameters:
  • vnodes – The sequence of VirtualNode-s to augment with noop nodes.
  • path – path to file.
  • index_labels – Whether to index labels to define output classes or not.
Returns:

The augmented VirtualNode-s sequence.
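
A minimal sketch of the idea follows, assuming a simplified node type. Node, NOOP and add_noops_sketch are hypothetical names; the real method may also handle sequence boundaries and label indexing, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

NOOP = ("noop",)  # hypothetical label standing in for CLS_NOOP


@dataclass
class Node:
    """Hypothetical stand-in for VirtualNode: text plus an optional label tuple."""
    value: str
    y: Optional[Tuple[str, ...]] = None


def add_noops_sketch(nodes: Sequence[Node]) -> List[Node]:
    """Insert an empty NOOP node between two adjacent unlabeled tokens."""
    augmented: List[Node] = []
    for node in nodes:
        if augmented and augmented[-1].y is None and node.y is None:
            # Nothing labeled separates the two tokens: add an insertion point.
            augmented.append(Node("", NOOP))
        augmented.append(node)
    return augmented


tokens = [Node("a"), Node("b"), Node(" ", ("space",)), Node("c")]
augmented = add_noops_sketch(tokens)
```

Because the inserted nodes have an empty value, the text represented by the sequence stays the same while the model gains a position where it can predict an insertion.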

static _find_parent(vnode_index:int, vnodes:Sequence[VirtualNode], parents:Mapping[int, bblfsh.Node], closest_left_node_id:int)

Compute vnode parent as the LCA of the closest left and right babelfish nodes.

Parameters:
  • vnode_index – the index of the current node
  • vnodes – the sequence of VirtualNode-s being transformed into features
  • parents – the id of bblfsh node to parent bblfsh node mapping
  • closest_left_node_id – id of the closest bblfsh node on the left that has already been visited
Returns:

The bblfsh.Node of the found parent or None if no parent was found.
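
To illustrate how a lowest common ancestor (LCA) can be found with an id-to-parent mapping like the one described above, here is a small sketch. TreeNode, ancestors and lowest_common_ancestor are hypothetical names, and the sketch treats a node as its own ancestor, which may differ from the actual method.

```python
from typing import Dict, List, Optional


class TreeNode:
    """Hypothetical stand-in for bblfsh.Node."""

    def __init__(self, name: str) -> None:
        self.name = name


def ancestors(node: TreeNode, parents: Dict[int, TreeNode]) -> List[TreeNode]:
    """Return the node followed by its chain of parents up to the root."""
    chain = [node]
    while id(chain[-1]) in parents:
        chain.append(parents[id(chain[-1])])
    return chain


def lowest_common_ancestor(left: TreeNode, right: TreeNode,
                           parents: Dict[int, TreeNode]) -> Optional[TreeNode]:
    """Return the deepest node that is on both ancestor chains, or None."""
    left_ids = {id(n) for n in ancestors(left, parents)}
    for candidate in ancestors(right, parents):
        if id(candidate) in left_ids:
            return candidate
    return None


root, a, b, c = TreeNode("root"), TreeNode("a"), TreeNode("b"), TreeNode("c")
orphan = TreeNode("orphan")
# The mapping is keyed by id(node), as in the parents argument above.
parents = {id(a): root, id(b): a, id(c): a}
```

Keying the mapping by id(node) avoids requiring the nodes to be hashable, at the cost of keeping the nodes alive for as long as the mapping is used.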

_keep_sibling(self, sibling:VirtualNode, vnode:VirtualNode, include_labeled:bool)
_parse_file(self, contents:str, root:bblfsh.Node, path:str)

Parse a file into a sequence of VirtualNode-s and a mapping from VirtualNode to parent.

Given the source text and the corresponding UAST, this function compiles the list of VirtualNode-s and the parents mapping. Joining the node values with "".join(n.value for n in nodes) reproduces the original source text exactly. parents maps id(node) to the node's parent bblfsh.Node.

Parameters:
  • contents – source file text
  • root – UAST root node
  • path – path to the file, used for debugging
Returns:

list of VirtualNode-s and the parents.
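
The round-trip property can be sketched as follows. This is an illustration under assumptions, not the actual parser: cover_text is a hypothetical helper that takes non-overlapping token spans (start and end character offsets) and fills the gaps between them so that the pieces cover the whole text.

```python
from typing import List, Sequence, Tuple


def cover_text(contents: str, token_spans: Sequence[Tuple[int, int]]) -> List[str]:
    """Split contents into token and gap values that cover every character."""
    values: List[str] = []
    position = 0
    for start, end in sorted(token_spans):
        if start > position:
            # Text between tokens (whitespace, punctuation) becomes its own piece.
            values.append(contents[position:start])
        values.append(contents[start:end])
        position = end
    if position < len(contents):
        values.append(contents[position:])
    return values


contents = "a = 1\n"
# Assumed token offsets for "a" and "1"; a real UAST would provide positions.
values = cover_text(contents, [(0, 1), (4, 5)])
```

Checking that "".join(values) == contents is a cheap way to verify that no character of the source was lost or duplicated during parsing.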

_compute_labels_mappings(self, vnodes:Iterable[VirtualNode])

Calculate the label to class sequence and class sequence to label mappings.

Takes self.cutoff_label_support into account and discards the labels with too little support.

Parameters: vnodes – The virtual nodes extracted from all the files.
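
As a rough sketch of the support cutoff, the mappings could be built by counting how often each class sequence occurs and keeping only the frequent ones. The function name, the return layout and the use of >= for the threshold are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

ClassSequence = Tuple[int, ...]


def compute_label_mappings_sketch(
    class_sequences: Iterable[ClassSequence], cutoff_support: int = 0,
) -> Tuple[List[ClassSequence], Dict[ClassSequence, int]]:
    """Build label <-> class sequence mappings, dropping rare sequences."""
    support = Counter(class_sequences)
    # Assumption: a sequence is kept when it occurs at least cutoff_support times.
    kept = sorted(seq for seq, count in support.items() if count >= cutoff_support)
    labels_to_sequences = kept  # label i maps to kept[i]
    sequences_to_labels = {seq: label for label, seq in enumerate(kept)}
    return labels_to_sequences, sequences_to_labels


observed = [(0,), (0,), (1, 2), (3,)]
all_labels, all_index = compute_label_mappings_sketch(observed)
frequent_labels, frequent_index = compute_label_mappings_sketch(observed, cutoff_support=2)
```

Sorting the kept sequences before numbering them makes the label assignment deterministic for a given set of inputs.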
_fill_vnode_parents(self, file_parents:Mapping[int, bblfsh.Node], file_vnodes:List[VirtualNode], uast:bblfsh.Node, vnode_parents:Mapping[int, bblfsh.Node])