NgramSet Pattern#

class paspailleur.bip.NgramSetPattern(value: PatternValueType)#

A class representing a set of n-grams as a pattern.

Attributes#

PatternValueType:

The type of the pattern’s value, which is a frozenset of tuples.

StopWords: set[str]

A set of stop words. N-grams made up exclusively of stop words are excluded from the pattern; an n-gram that mixes stop words with other words is kept in the analysis.
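The stop-word rule can be illustrated without the library. The following is a plain-Python sketch, not paspailleur's implementation: the helper `drop_stop_word_ngrams` and the sample stop-word set are hypothetical names chosen for illustration, and it assumes the rule is applied per n-gram.

```python
def drop_stop_word_ngrams(ngrams, stop_words):
    """Keep every n-gram that contains at least one non-stop word."""
    return frozenset(
        ngram for ngram in ngrams
        if not all(word in stop_words for word in ngram)
    )


stop_words = {"the", "of"}  # assumed sample value, not the library default
ngrams = {("the",), ("of", "the"), ("king", "of", "the", "hill")}

kept = drop_stop_word_ngrams(ngrams, stop_words)
print(kept)  # only the n-gram that mixes stop words with other words remains
```

Here `("the",)` and `("of", "the")` consist only of stop words and are dropped, while `("king", "of", "the", "hill")` is kept.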

Properties#

atomic_patterns

Return the set of all less precise patterns that cannot be obtained by intersection of other patterns.

min_pattern

Return the minimal possible pattern for the n-gram set pattern.

Examples

>>> p1 = NgramSetPattern({('hello', 'world')})  # explicit way to define a pattern with a single "hello world" ngram
>>> p2 = NgramSetPattern('hello world')  # simplified way to define {'hello world'} pattern
>>> print(p1 == p2)
True
>>> p3 = NgramSetPattern( {('hello', 'world'), ('foo',), ('bar',) } )  # explicit way to define a pattern with 3 ngrams
>>> p4 = NgramSetPattern(['hello world', 'foo', 'bar'])  # simplified way to define a pattern with 3 ngrams
>>> print(p3 == p4)
True

PatternValueType#

alias of frozenset[tuple[str, ...]]

classmethod parse_string_description(value: str) -> frozenset[tuple[str, ...]]#

Parse a string description into an n-gram set pattern value.

Parameters:

value (str) – The string description of the n-gram set pattern.

Returns:

parsed – The parsed n-gram set pattern value.

Return type:

PatternValueType

Raises:

ValueError – If the value cannot be parsed into an NgramSetPattern.
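One way such parsing might work is sketched below with the standard library's `ast.literal_eval`. This is an assumption about the general approach, not the library's code; `parse_ngram_set` is a hypothetical stand-in for `parse_string_description`.

```python
import ast


def parse_ngram_set(text):
    """Hypothetical stand-in: turn a string description into a frozenset of word tuples."""
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError) as err:
        # Malformed literals are reported uniformly as ValueError.
        raise ValueError(f"cannot parse {text!r} as an n-gram set") from err
    if isinstance(parsed, str):
        parsed = [parsed]  # a single quoted n-gram
    if not isinstance(parsed, (set, frozenset, list, tuple)):
        raise ValueError(f"cannot parse {text!r} as an n-gram set")
    # Each item is either a space-separated string or an explicit tuple of words.
    return frozenset(
        tuple(item.split()) if isinstance(item, str) else tuple(item)
        for item in parsed
    )


result = parse_ngram_set("{'hello world', 'foo bar'}")
print(result == frozenset({("hello", "world"), ("foo", "bar")}))  # True
```

An unterminated description such as `"{'hello"` raises `ValueError` in this sketch.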

Examples

>>> NgramSetPattern.parse_string_description("{'hello world', 'foo bar'}")
frozenset({('hello', 'world'), ('foo', 'bar')})

classmethod preprocess_value(value) -> frozenset[tuple[str, ...]]#

Preprocess the value before storing it in the n-gram set pattern.

Parameters:

value – The value to preprocess.

Returns:

value – The preprocessed value as a frozenset of tuples.

Return type:

PatternValueType
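The normalisation that makes the explicit and simplified constructor forms equal can be sketched in plain Python. The function name `to_ngram_set` is hypothetical, and the sketch covers only the input shapes shown in this page's examples; the library may perform further steps.

```python
def to_ngram_set(value):
    """Normalise a string, or an iterable of strings/tuples, to a frozenset of word tuples."""
    if isinstance(value, str):
        value = [value]  # a bare string is treated as a single n-gram
    return frozenset(
        tuple(item.split()) if isinstance(item, str) else tuple(item)
        for item in value
    )


explicit = to_ngram_set({("hello", "world"), ("foo",), ("bar",)})
simplified = to_ngram_set(["hello world", "foo", "bar"])
print(explicit == simplified)  # True
```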

Examples

>>> NgramSetPattern.preprocess_value(['hello world', 'foo'])
frozenset({('hello', 'world'), ('foo',)})

classmethod filter_max_ngrams(ngrams: frozenset[tuple[str, ...]]) -> frozenset[tuple[str, ...]]#

Filter maximal n-grams from the set of n-grams.

Parameters:

ngrams (PatternValueType) – The set of n-grams to filter.

Returns:

maximal_ngrams – The filtered set of maximal n-grams, excluding the rest.

Return type:

PatternValueType
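A minimal sketch of maximal n-gram filtering follows. It assumes that one n-gram is "contained" in another when it appears as a contiguous run of words inside it; `is_sub_ngram` and `keep_maximal` are hypothetical helpers, not library functions.

```python
def is_sub_ngram(small, big):
    """True if `small` appears as a contiguous run of words inside `big`."""
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))


def keep_maximal(ngrams):
    """Drop every n-gram that is contained in a different n-gram of the set."""
    return frozenset(
        a for a in ngrams
        if not any(a != b and is_sub_ngram(a, b) for b in ngrams)
    )


filtered = keep_maximal({("hello",), ("hello", "world"), ("foo",)})
print(filtered == frozenset({("hello", "world"), ("foo",)}))  # True
```

The pairwise comparison is quadratic in the number of n-grams, which is adequate for a sketch but may not reflect how the library does it.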

Examples

>>> NgramSetPattern.filter_max_ngrams({('hello',), ('hello', 'world')})  # ('hello',) is filtered out because it is contained in a longer n-gram
frozenset({('hello', 'world')})

atomise(atoms_configuration: Literal['min', 'max'] = 'min') -> set[Self]#

Split the pattern into atomic patterns, i.e. the singleton sets of ngrams.

Parameters:

atoms_configuration (Literal['min', 'max']) – If set to ‘min’, then return the set of individual ngrams from the value of the original pattern. If set to ‘max’, then return all sub-ngrams that can only be found in the original pattern. Defaults to ‘min’.

Returns:

atomic_patterns – The set of atomic patterns, i.e. the set of unsplittable patterns whose join equals the pattern.

Return type:

set[Self]

Notes

In terms of order theory, every pattern can be represented as the join of a subset of atomic patterns, which are the join-irreducible elements of the lattice of all patterns.

Considering the set of atomic patterns as a partially ordered set (where the order follows the order on patterns), every pattern can be represented by an antichain of atomic patterns (when atoms_configuration = ‘min’) or by an order ideal of atomic patterns (when atoms_configuration = ‘max’).
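The antichain and order-ideal properties can be checked on a small example. The sketch below works on raw tuples rather than on `NgramSetPattern` objects and assumes that the order between single n-grams is contiguous containment and that the pattern value holds only maximal n-grams; all helper names are hypothetical.

```python
def is_sub_ngram(small, big):
    """True if `small` appears as a contiguous run of words inside `big`."""
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))


def sub_ngrams(ngram):
    """All non-empty contiguous sub-n-grams of `ngram`, itself included."""
    n = len(ngram)
    return {ngram[i:j] for i in range(n) for j in range(i + 1, n + 1)}


value = {("hello", "world", "!"), ("foo",)}

# 'min' configuration: one atom per n-gram stored in the pattern value.
min_atoms = set(value)
# 'max' configuration: every sub-n-gram of every stored n-gram.
max_atoms = {sub for ngram in value for sub in sub_ngrams(ngram)}

# Antichain: no 'min' atom is contained in a different 'min' atom.
is_antichain = not any(
    a != b and is_sub_ngram(a, b) for a in min_atoms for b in min_atoms
)
# Order ideal: the 'max' atoms are closed under taking sub-n-grams.
is_order_ideal = all(sub_ngrams(a) <= max_atoms for a in max_atoms)
# The maximal elements of the ideal are exactly the 'min' atoms.
maximal = {
    a for a in max_atoms
    if not any(a != b and is_sub_ngram(a, b) for b in max_atoms)
}

print(is_antichain, is_order_ideal, maximal == min_atoms)  # True True True
```

Both representations carry the same information: the ideal is the downward closure of the antichain, and the antichain is the set of maximal elements of the ideal.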

property atomic_patterns: set[Self]#

Return the set of every individual sub-ngram for each ngram from the given pattern.

For an NgramSetPattern an atomic pattern is a set containing just one ngram.

Returns:

atoms – A set of atomic patterns derived from the n-grams.

Return type:

set[Self]
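Enumerating the sub-n-grams behind these atoms takes only a few lines of plain Python. This sketch assumes sub-n-grams are contiguous slices and uses the hypothetical helper names `sub_ngrams` and `atomic_ngrams`; it returns raw tuples, not pattern objects.

```python
from itertools import combinations


def sub_ngrams(ngram):
    """All non-empty contiguous slices ngram[i:j] with i < j."""
    return {ngram[i:j] for i, j in combinations(range(len(ngram) + 1), 2)}


def atomic_ngrams(ngrams):
    """Union of the sub-n-grams of every n-gram in the set."""
    return {sub for ngram in ngrams for sub in sub_ngrams(ngram)}


atoms = atomic_ngrams({("hello", "world")})
print(sorted(" ".join(a) for a in atoms))  # ['hello', 'hello world', 'world']
```

An n-gram of k distinct words yields k(k+1)/2 such slices, so `("hello", "world", "!")` together with `("foo",)` gives 6 + 1 = 7 atoms.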

Examples

>>> p1 = NgramSetPattern({('hello', 'world')})
>>> p1.atomic_patterns
{{'hello world'}, {'hello'}, {'world'}}
>>> p2 = NgramSetPattern(["hello world !", "foo"])
>>> p2.atomic_patterns
{{'hello world !'}, {'foo'}, {'hello'}, {'!'}, {'hello world'}, {'world !'}, {'world'}}

classmethod get_min_pattern() -> Self | None#

Return the minimal possible pattern for the n-gram set pattern, i.e. the empty NgramSet.

Returns:

min – The minimal n-gram set pattern, which is an empty NgramSet

Return type:

Optional[Self]

Examples

>>> NgramSetPattern.get_min_pattern()
NgramSetPattern(set())