lookout.style.format.feature_extractor

Feature extraction module.

Module Contents

lookout.style.format.feature_extractor.IndexToFeature
lookout.style.format.feature_extractor.FEATURES_NUMPY_TYPE
lookout.style.format.feature_extractor.FEATURES_MIN
lookout.style.format.feature_extractor.FEATURES_MAX
class lookout.style.format.feature_extractor.FeatureExtractor(*, language:str, left_siblings_window:int, right_siblings_window:int, parents_depth:int, node_features:Sequence[str], left_features:Sequence[str], right_features:Sequence[str], parent_features:Sequence[str], no_labels_on_right:bool, select_features_number:Optional[int], debug_parsing:bool, return_sibling_indices:bool, selected_features:Optional[numpy.ndarray]=None, label_composites:Optional[List[Tuple[int, ...]]]=None, cutoff_label_support:int=0)

Extract features for downstream models.

_log
index_to_feature

Return the mapping from integer indices to the corresponding feature names.

feature_to_indices

Return the mapping from feature names to the corresponding integer indices.
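
The relationship between the two mappings can be sketched as a simple inversion. This is only an illustration under stated assumptions: the `(name, position)` layout and the feature names `diff_offset` and `internal_type` are hypothetical and are not taken from the library.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical layout: one entry per feature column, holding the feature name
# and the position of the column inside that feature (e.g. a one-hot slot).
index_to_feature: List[Tuple[str, int]] = [
    ("diff_offset", 0),
    ("internal_type", 0),
    ("internal_type", 1),
    ("internal_type", 2),
]


def invert(mapping: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group column indices by the feature name they belong to."""
    feature_to_indices: Dict[str, List[int]] = defaultdict(list)
    for column, (name, _position) in enumerate(mapping):
        feature_to_indices[name].append(column)
    return dict(feature_to_indices)


feature_to_indices = invert(index_to_feature)
```

A feature that spans several columns maps to several indices, which is why the second mapping is plural.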

features

Return the Feature-s used by this feature extractor.

feature_names

Return the names of the features.

A feature name uniquely identifies a feature. It reflects the feature structure: it consists of the feature group the feature belongs to, its sibling identifier, its feature name, and its index if applicable.

Those names, as well as the feature layout, depend on the configuration used to launch the analyzer.

composite_class_representations

Return the class representations of composite classes.

Returns: Strings representing the composite classes.
composite_class_printables

Return the class printables of composite classes.

Returns: Strings that can be printed to represent the composite classes.
count_features(self, feature_group:Optional[FeatureGroup]=None, neighbour_index:Optional[int]=None)

Return the feature count of a given subset of features.

extract_features(self, files:Iterable[UnicodeFile], lines:Optional[List[List[int]]]=None)

Compute features and labels required by downstream models given a list of File-s.

Parameters:
  • files – the list of File-s (see service_data.proto) of the same language.
  • lines – the list of enabled line numbers per file. Lines that are not listed are not extracted.
Returns:

Tuple of the features and labels as numpy.ndarray-s (2- and 1-dimensional respectively), the corresponding VirtualNode-s and the parents mapping, or None if no features were extracted.
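
The effect of the lines argument can be sketched as a per-file filter. This is a simplified stand-in, not the library's code: the `Token` tuple and `filter_by_lines` are hypothetical, and treating `lines=None` as "all lines enabled" is an assumption based on the parameter being optional.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical stand-in for a token: (1-based line number, token text).
Token = Tuple[int, str]


def filter_by_lines(
    files_tokens: Sequence[Sequence[Token]],
    lines: Optional[List[List[int]]] = None,
) -> List[List[Token]]:
    """Keep only the tokens that sit on an enabled line of their file."""
    if lines is None:
        # Assumption: no restriction means every line is enabled.
        return [list(tokens) for tokens in files_tokens]
    kept = []
    for tokens, enabled in zip(files_tokens, lines):
        enabled_set = set(enabled)
        kept.append([token for token in tokens if token[0] in enabled_set])
    return kept


files_tokens = [
    [(1, "a"), (2, "b"), (3, "c")],
    [(1, "x"), (2, "y")],
]
selected = filter_by_lines(files_tokens, [[1, 3], [2]])
```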

select_features(self, X:csr_matrix, y:numpy.ndarray)

Select the most useful features based on sklearn’s univariate feature selection.

Parameters:
  • X – Scipy CSR 2-dimensional matrix of features to select.
  • y – Numpy 1-dimensional array of labels.
Returns:

Tuple of a CSR matrix with only the selected features (columns) kept and an array of the indices of the kept features for later reapplication.
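
The idea of returning both the reduced matrix and the kept column indices can be sketched without any dependencies. Note the swap: this sketch does not use sklearn or CSR matrices. It scores each column independently with the variance of its per-class means, which is only a stand-in for whatever univariate score the library actually uses. The name `select_top_k` is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance
from typing import List, Sequence, Tuple


def select_top_k(
    X: Sequence[Sequence[float]], y: Sequence[int], k: int
) -> Tuple[List[List[float]], List[int]]:
    """Keep the k columns whose per-class means differ the most."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row[j])
        class_means = [mean(values) for values in by_class.values()]
        # Univariate score: each column is scored on its own.
        scores.append(pvariance(class_means))
    best = sorted(range(n_features), key=lambda j: -scores[j])[:k]
    kept = sorted(best)  # Preserve the original column order.
    return [[row[j] for j in kept] for row in X], kept


X = [[1, 0, 5], [1, 1, 5], [1, 0, 0], [1, 1, 0]]
y = [0, 0, 1, 1]
X_selected, kept = select_top_k(X, y, k=1)

# The stored indices let the same selection be reapplied to new rows later.
new_row = [7, 8, 9]
new_row_selected = [new_row[j] for j in kept]
```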

label_to_str(self, label:int)

Convert a label to string.

_compute_feature_info(self)
_annotate_files(self, files:Iterable[UnicodeFile], lines:Optional[List[List[int]]]=None)
_convert_files_to_xy(self, parsed_files:List[Tuple[List[VirtualNode], Dict[int, bblfsh.Node], Set[int]]])
_create_neighbours(self, vnodes:Sequence[VirtualNode], vnodes_y:Sequence[VirtualNode], parents:Mapping[int, bblfsh.Node], return_sibling_indices:bool=False)
_classify_vnodes(self, file:AnnotationManager)

Annotate source code with AtomicTokenAnnotation, ClassAnnotation and AccumulatedIndentationAnnotation.

ClassAnnotation contains the index of the corresponding class to predict. We detect indentation changes, so several whitespace nodes are merged together.

Parameters: file – Source code annotated with RawTokenAnnotation.
_merge_classes_to_composite_labels(self, file:AnnotationManager)

Build “composite” TokenAnnotation and LabelAnnotation from predictable atomic tokens.

Parameters: file – Source code annotated with AtomicTokenAnnotation, ClassAnnotation, AccumulatedIndentationAnnotation.
_add_noops(self, file:AnnotationManager)

Add zero-length TokenAnnotation-s between TokenAnnotation-s without labeled nodes.

Such a zero-length annotation means that a formatting sequence can be inserted at the annotation position.

Parameters:file – Source code annotated with TokenAnnotation and LabelAnnotation.
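
The noop insertion described above can be sketched on a flat token list. This is a minimal illustration, not the library's implementation: `Token`, `NOOP` and `add_noops` are hypothetical names, and the rule "insert between two adjacent unlabeled tokens" is an assumption drawn from the description.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    start: int  # Offset of the first character.
    stop: int  # Offset one past the last character.
    label: Optional[str]  # None means the token carries no formatting label.


NOOP = "<noop>"  # Hypothetical marker for "nothing is inserted here".


def add_noops(tokens: List[Token]) -> List[Token]:
    """Insert a zero-length token between adjacent unlabeled tokens."""
    result: List[Token] = []
    previous: Optional[Token] = None
    for token in tokens:
        if previous is not None and previous.label is None and token.label is None:
            # Zero length: start == stop, placed where the next token begins.
            result.append(Token(token.start, token.start, NOOP))
        result.append(token)
        previous = token
    return result


tokens = [Token(0, 3, None), Token(3, 4, " "), Token(4, 5, None), Token(5, 6, None)]
with_noops = add_noops(tokens)
```

Only the last two tokens are adjacent and unlabeled, so a single zero-length token is added at offset 5.
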
static _find_parent(search_start_offset:int, file:AnnotationManager, closest_left_node_id:int)

Compute the UAST parent of the TokenAnnotation as the LCA of the closest left and right Babelfish nodes.

Parameters:
  • search_start_offset – Offset of the current node.
  • file – Source code annotated with UASTAnnotation and TokenAnnotation.
  • closest_left_node_id – id of the bblfsh node of the closest parent already visited.
Returns:

The bblfsh.Node of the found parent or None if no parent was found.
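
The lowest common ancestor (LCA) step can be sketched over a plain child-to-parent mapping. This is a simplified sketch: it works on node ids only, whereas the documented mapping goes from ids to bblfsh.Node objects, and `lowest_common_ancestor` is a hypothetical helper name.

```python
from typing import Dict, Hashable, Optional


def lowest_common_ancestor(
    left: Hashable, right: Hashable, parents: Dict[Hashable, Hashable]
) -> Optional[Hashable]:
    """Return the deepest node that is an ancestor of both inputs, or None."""
    # Collect the left node and all of its ancestors up to the root.
    ancestors = set()
    node: Optional[Hashable] = left
    while node is not None:
        ancestors.add(node)
        node = parents.get(node)
    # Climb from the right node until the two paths meet.
    node = right
    while node is not None:
        if node in ancestors:
            return node
        node = parents.get(node)
    return None


# Tree: a is the root, b and c are its children, d and e are children of b.
parents = {"b": "a", "c": "a", "d": "b", "e": "b"}
lca_siblings = lowest_common_ancestor("d", "e", parents)
lca_cousins = lowest_common_ancestor("d", "c", parents)
lca_unknown = lowest_common_ancestor("d", "x", parents)
```

In this sketch a node counts as its own ancestor, so the LCA of a node and one of its descendants is the node itself.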

static _find_parent_old(vnode_index:int, vnodes:Sequence[VirtualNode], parents:Mapping[int, bblfsh.Node], closest_left_node_id:int)

Compute the vnode parent as the LCA of the closest left and right Babelfish nodes.

Parameters:
  • vnode_index – the index of the current node
  • vnodes – the sequence of VirtualNode-s being transformed into features
  • parents – the id of bblfsh node to parent bblfsh node mapping
  • closest_left_node_id – id of the bblfsh node of the closest parent already visited
Returns:

The bblfsh.Node of the found parent or None if no parent was found.

_keep_sibling(self, sibling:VirtualNode, vnode:VirtualNode, include_labeled:bool)
_parse_file(self, file:AnnotationManager)

Annotate source code with RawTokenAnnotation-s.

Given the source text and the corresponding UAST, this function covers all code with RawTokenAnnotation-s.

Parameters: file – Source code annotated with UASTAnnotation.
_compute_labels_mappings(self, vnodes:Iterable[VirtualNode])

Calculate the label to class sequence and class sequence to label mappings.

Takes self.cutoff_label_support into account and discards the labels whose support is too low.

Parameters: vnodes – The virtual nodes extracted from all the files.
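
The support cutoff can be sketched with a counter over class sequences. This is an illustration under assumptions: `compute_label_mappings` is a hypothetical name, class sequences are modelled as tuples of ints, and keeping sequences whose count is greater than or equal to the cutoff is a guess at the exact comparison.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

ClassSequence = Tuple[int, ...]


def compute_label_mappings(
    class_sequences: Sequence[ClassSequence], cutoff_label_support: int = 0
) -> Tuple[List[ClassSequence], Dict[ClassSequence, int]]:
    """Build label <-> class sequence mappings, dropping rare sequences."""
    support = Counter(class_sequences)
    # Assumption: a sequence survives if it occurs at least `cutoff` times.
    kept = sorted(
        sequence for sequence, count in support.items()
        if count >= cutoff_label_support
    )
    label_to_class_sequence = kept  # The label is the position in the list.
    class_sequence_to_label = {sequence: label for label, sequence in enumerate(kept)}
    return label_to_class_sequence, class_sequence_to_label


observed = [(0,), (0,), (1, 2), (0,), (3,)]
labels_all, mapping_all = compute_label_mappings(observed)
labels_cut, mapping_cut = compute_label_mappings(observed, cutoff_label_support=2)
```

With a cutoff of 2, only the sequence seen three times keeps a label; the two sequences seen once are discarded.
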
_fill_vnode_parents(self, file:AnnotationManager)
lookout.style.format.feature_extractor._to_position(raw_lines_data, _lines_start_offset, offset)
lookout.style.format.feature_extractor.file_to_old_parse_file_format(file:AnnotationManager)

Convert AnnotationManager instance to the deprecated output format of FeatureExtractor._parse_file().

The function exists for backward compatibility and should be removed after the refactoring is finished.

Parameters: file – file annotated with UASTAnnotation, PathAnnotation and RawTokenAnnotation. It is expected to be the output of FeatureExtractor._parse_file().
Returns: The old FeatureExtractor._parse_file() output format, that is, a Tuple with VirtualNode-s and the bblfsh.Node id to parent mapping.
lookout.style.format.feature_extractor._file_to_vnodes_and_parents(file:AnnotationManager)

Convert one AnnotationManager instance to the deprecated format of FeatureExtractor._annotate_files() (_parse_vnodes() before refactoring).

The old format is a sequence of vnodes and the vnodes parents mapping. Used by files_to_old_parse_vnodes_format() to generate the old _parse_vnodes-like output format for a sequence of AnnotationManager-s. This function differs from file_to_old_parse_file_format(): it exists for _parse_vnodes() backward compatibility, whereas file_to_old_parse_file_format() exists for _parse_file() backward compatibility.

The function exists for backward compatibility and should be removed after the refactoring is finished.

Parameters: file – file annotated with Path-, Token-, Label-, TokenParent- Annotation.
Returns: Tuple with VirtualNode-s and node id to parents mapping.
lookout.style.format.feature_extractor.files_to_old_parse_vnodes_format(files:Sequence[AnnotationManager])

Convert a sequence of AnnotationManager instances to the deprecated output format of FeatureExtractor._annotate_files() (_parse_vnodes() before refactoring).

In addition to _file_to_vnodes_and_parents() it provides the node_parents mapping.

The function exists for backward compatibility and should be removed after the refactoring is finished.

Parameters: files – Sequence of fully annotated files. It is expected to be the output of FeatureExtractor._parse_vnodes().
Returns: The old FeatureExtractor._parse_vnodes() output format, that is, a Tuple with VirtualNode-s, node parents mapping and vnode parents mapping.