NgramSet Pattern
- class paspailleur.bip.NgramSetPattern(value: PatternValueType)
A class representing a set of n-grams as a pattern.
Attributes
- PatternValueType:
The type of the pattern’s value, which is a frozenset of tuples.
- StopWords: set[str]
A set of stop words. N-grams that consist exclusively of stop words are excluded from the analysis; an n-gram that contains both stop words and non-stop words is kept.
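The stop-word rule can be illustrated with a small standalone sketch. It does not use the library: `STOP_WORDS` and `drop_stop_word_ngrams` are hypothetical names, and the rule is assumed to be "drop an n-gram only when every one of its words is a stop word".

```python
# Standalone illustration; not paspailleur code.
# STOP_WORDS and drop_stop_word_ngrams are hypothetical names.
STOP_WORDS = {"the", "of", "a"}


def drop_stop_word_ngrams(ngrams, stop_words):
    """Keep an n-gram unless every one of its words is a stop word."""
    return frozenset(
        ngram
        for ngram in ngrams
        if not all(word in stop_words for word in ngram)
    )


ngrams = {("the",), ("of", "the"), ("the", "world"), ("hello",)}
kept = drop_stop_word_ngrams(ngrams, STOP_WORDS)

# ("the",) and ("of", "the") contain only stop words, so they are dropped.
# ("the", "world") mixes stop words and non-stop words, so it is kept.
print(sorted(kept))
```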
Properties
- atomic_patterns
Return the set of all less precise patterns that cannot be obtained by intersection of other patterns.
- min_pattern
Return the minimal possible pattern for the n-gram set pattern.
Examples
>>> p1 = NgramSetPattern({('hello', 'world')})  # explicit way to define a pattern with a single "hello world" ngram
>>> p2 = NgramSetPattern('hello world')  # simplified way to define the {'hello world'} pattern
>>> print(p1 == p2)
True
>>> p3 = NgramSetPattern({('hello', 'world'), ('foo',), ('bar',)})  # explicit way to define a pattern with 3 ngrams
>>> p4 = NgramSetPattern(['hello world', 'foo', 'bar'])  # simplified way to define a pattern with 3 ngrams
>>> print(p3 == p4)
True
- PatternValueType
alias of frozenset[tuple[str, …]]
- classmethod parse_string_description(value: str) → frozenset[tuple[str, ...]]
Parse a string description into an n-gram set pattern value.
- Parameters:
value (str) – The string description of the n-gram set pattern.
- Returns:
parsed – The parsed n-gram set pattern value.
- Return type:
PatternValueType
- Raises:
ValueError – If the value cannot be parsed into an NgramSetPattern.
Examples
>>> NgramSetPattern.parse_string_description("{'hello world', 'foo bar'}")
frozenset({('hello', 'world'), ('foo', 'bar')})
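As a rough illustration of what such parsing might involve, the standalone sketch below evaluates the string with `ast.literal_eval` and splits each n-gram on whitespace. This is an assumption about the general approach, not the library's actual implementation; `parse_ngram_set_description` is a hypothetical name.

```python
import ast


def parse_ngram_set_description(value: str) -> frozenset:
    """Turn a string such as "{'hello world', 'foo bar'}" into a frozenset of word tuples.

    Hypothetical helper: the real parse_string_description may work differently.
    """
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"Cannot parse {value!r} as an n-gram set") from exc

    # Accept a single n-gram string as shorthand for a one-element collection.
    if isinstance(parsed, str):
        parsed = [parsed]
    if not isinstance(parsed, (set, frozenset, list, tuple)):
        raise ValueError(f"Expected a collection of strings, got {type(parsed).__name__}")

    ngrams = set()
    for ngram in parsed:
        if not isinstance(ngram, str):
            raise ValueError(f"Expected a string n-gram, got {ngram!r}")
        ngrams.add(tuple(ngram.split()))
    return frozenset(ngrams)


result = parse_ngram_set_description("{'hello world', 'foo bar'}")
print(result == frozenset({("hello", "world"), ("foo", "bar")}))
```

Using `ast.literal_eval` rather than `eval` keeps the sketch limited to Python literals, so arbitrary expressions in the description are rejected instead of executed.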
- classmethod preprocess_value(value) → frozenset[tuple[str, ...]]
Preprocess the value before storing it in the n-gram set pattern.
- Parameters:
value – The value to preprocess.
- Returns:
value – The preprocessed value as a frozenset of tuples.
- Return type:
PatternValueType
Examples
>>> NgramSetPattern.preprocess_value(['hello world', 'foo'])
frozenset({('hello', 'world'), ('foo',)})
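The equivalence between the explicit and the simplified input forms can be sketched as follows. This is a standalone approximation under the assumption that strings are split on whitespace; `normalise_ngrams` is a hypothetical name, and the library's `preprocess_value` may perform additional steps.

```python
def normalise_ngrams(value) -> frozenset:
    """Convert a string, or an iterable of strings or word tuples, into a frozenset of word tuples.

    Hypothetical helper that mimics the documented examples only.
    """
    # A bare string stands for a pattern with a single n-gram.
    if isinstance(value, str):
        value = [value]
    return frozenset(
        tuple(ngram.split()) if isinstance(ngram, str) else tuple(ngram)
        for ngram in value
    )


explicit = normalise_ngrams({("hello", "world"), ("foo",), ("bar",)})
simplified = normalise_ngrams(["hello world", "foo", "bar"])
print(explicit == simplified)
```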
- classmethod filter_max_ngrams(ngrams: frozenset[tuple[str, ...]]) → frozenset[tuple[str, ...]]
Filter maximal n-grams from the set of n-grams.
- Parameters:
ngrams (PatternValueType) – The set of n-grams to filter.
- Returns:
maximal_ngrams – The filtered set of maximal n-grams, excluding the rest.
- Return type:
PatternValueType
Examples
>>> NgramSetPattern.filter_max_ngrams({('hello',), ('hello', 'world')})  # ('hello',) is filtered out because it is contained in a longer n-gram
frozenset({('hello', 'world')})
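One way to picture the filtering is the standalone sketch below. It assumes that an n-gram is non-maximal when it occurs as a contiguous run of words inside another n-gram of the set, which is consistent with the examples on this page. `is_sub_ngram` and `keep_maximal_ngrams` are hypothetical names, and the quadratic scan is chosen for readability, not speed.

```python
def is_sub_ngram(small: tuple, big: tuple) -> bool:
    """True if `small` occurs as a contiguous run of words inside `big`."""
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))


def keep_maximal_ngrams(ngrams) -> frozenset:
    """Drop every n-gram that is contained in a different n-gram of the same set."""
    ngrams = frozenset(ngrams)
    return frozenset(
        ngram
        for ngram in ngrams
        if not any(ngram != other and is_sub_ngram(ngram, other) for other in ngrams)
    )


maximal = keep_maximal_ngrams({("hello",), ("hello", "world")})
print(maximal == frozenset({("hello", "world")}))

# Neither of these contains the other, so both are kept.
unrelated = keep_maximal_ngrams({("hello", "world"), ("world", "hello")})
print(len(unrelated))
```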
- atomise(atoms_configuration: Literal['min', 'max'] = 'min') → set[Self]
Split the pattern into atomic patterns, i.e. singleton sets of n-grams.
- Parameters:
atoms_configuration (Literal['min', 'max']) – If set to ‘min’, then return the set of individual ngrams from the value of the original pattern. If set to ‘max’, then return all sub-ngrams that can only be found in the original pattern. Defaults to ‘min’.
- Returns:
atomic_patterns – The set of atomic patterns, i.e. the set of unsplittable patterns whose join equals the pattern.
- Return type:
set[Self]
Notes
In order-theoretic terms, every pattern can be represented as the join of a subset of atomic patterns, which are the join-irreducible elements of the lattice of all patterns.
Considering the set of atomic patterns as a partially ordered set (where the order follows the order on patterns), every pattern can be represented by an antichain of atomic patterns (when atoms_configuration = ‘min’), and by an order ideal of atomic patterns (when atoms_configuration = ‘max’).
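The two representations can be made concrete with a standalone sketch on plain word tuples. It assumes that the order on atoms is "is a contiguous sub-ngram of"; `sub_ngrams` and `is_sub_ngram` are hypothetical helpers, and whether the library's ‘max’ mode returns exactly this set is an assumption.

```python
def sub_ngrams(ngram: tuple) -> set:
    """All non-empty contiguous sub-sequences of `ngram`, including `ngram` itself."""
    return {
        ngram[i:j]
        for i in range(len(ngram))
        for j in range(i + 1, len(ngram) + 1)
    }


def is_sub_ngram(small: tuple, big: tuple) -> bool:
    return small in sub_ngrams(big)


pattern = frozenset({("hello", "world", "!"), ("foo",)})

# 'min'-style atoms: one atom per n-gram of the pattern.
min_atoms = set(pattern)
# 'max'-style atoms: every sub-ngram of every n-gram of the pattern.
max_atoms = set().union(*(sub_ngrams(ngram) for ngram in pattern))

# Antichain: no atom is a sub-ngram of a different atom.
is_antichain = not any(
    a != b and is_sub_ngram(a, b) for a in min_atoms for b in min_atoms
)
# Order ideal: the set is closed under taking sub-ngrams.
is_order_ideal = all(sub_ngrams(a) <= max_atoms for a in max_atoms)

# Keeping only the maximal elements of the ideal gives back the antichain.
recovered = {
    a for a in max_atoms
    if not any(a != b and is_sub_ngram(a, b) for b in max_atoms)
}

print(is_antichain, is_order_ideal, recovered == min_atoms)
```

In this sketch both sets describe the same pattern: the antichain is the compact form, while the order ideal lists every atom that the pattern implies.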
- property atomic_patterns: set[Self]
Return the set of every individual sub-ngram for each ngram from the given pattern.
For an NgramSetPattern an atomic pattern is a set containing just one ngram.
- Returns:
atoms – A set of atomic patterns derived from the n-grams.
- Return type:
set[Self]
Examples
>>> p1 = NgramSetPattern({('hello', 'world')})
>>> p1.atomic_patterns
{{'hello world'}, {'hello'}, {'world'}}
>>> p2 = NgramSetPattern(["hello world !", "foo"])
>>> p2.atomic_patterns
{{'hello world !'}, {'foo'}, {'hello'}, {'!'}, {'hello world'}, {'world !'}, {'world'}}
- classmethod get_min_pattern() → Self | None
Return the minimal possible pattern for the n-gram set pattern, i.e. the empty NgramSet.
- Returns:
min – The minimal n-gram set pattern, which is the empty NgramSet.
- Return type:
Optional[Self]
Examples
>>> NgramSetPattern.get_min_pattern()
NgramSetPattern(set())