Titanic Study#

The first thing we need to do is to install paspailleur from its Git repository:

!pip install --quiet git+https://github.com/smartFCA/paspailleur.git
USE_TQDM = False  # set to False when used within documentation

Before we start: Download the data#

The second step is to load the dataset:

import pandas as pd

df_full = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/refs/heads/master/titanic.csv', index_col=0)
print(df_full.shape)
print(df_full.columns)
(891, 11)
Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Now we make a few modifications so that the results read better, and reorganize the table to keep only the columns needed for the study:

# change the values of the Embarked column into full port names instead of single letters
df_full['Embarked'] = df_full['Embarked'].map({'S': 'Southampton', 'C': 'Cherbourg', 'Q': 'Queenstown'})
# change the values of the Survived column into 'No' and 'Yes' instead of 0 and 1
df_full['Survived'] = df_full['Survived'].map(['No', 'Yes'].__getitem__)

# drop the Ticket and Cabin columns since they are not needed for the study
df = df_full.drop(columns=['Ticket', 'Cabin'])
print(df.shape)
df.head()
(891, 9)
            Survived  Pclass                                               Name     Sex   Age  SibSp  Parch     Fare     Embarked
PassengerId
1                 No       3                            Braund, Mr. Owen Harris    male  22.0      1      0   7.2500  Southampton
2                Yes       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  71.2833    Cherbourg
3                Yes       3                             Heikkinen, Miss. Laina  female  26.0      0      0   7.9250  Southampton
4                Yes       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0  53.1000  Southampton
5                 No       3                           Allen, Mr. William Henry    male  35.0      0      0   8.0500  Southampton

Part One: Describe the data with Patterns#

Now we need to describe how every column in the data will be treated.

  • CategorySetPattern is designed for categorical data. That is, every object is described by a categorical value, and a pattern is a subset of categories that covers the rows marked by any of the categories in the pattern;

  • IntervalPattern treats numerical data. Every row (marked by either a number or an interval of numbers) either lies inside a given interval pattern or not;

  • NgramSetPattern treats textual data. Every text is represented as an ngram (i.e. a sequence of words). The task here is to find subngrams that often occur in the data;

  • CartesianPattern combines independent dimensions of tabular data. Every dimension represents a column in the data, described by its own Pattern.

With this, we will be able to initialize custom patterns and do simple comparisons between them (a small comparison example follows the class definitions below).

import paspailleur as psp

# The classical way to inherit a new Pattern class
class SurvivedPattern(psp.bip.CategorySetPattern):
    # CategorySetPattern requires the definition of the Universe of categories,
    # that is the set of all possible categories that can be found in the data
    Universe = ('No', 'Yes')

# A simplified way to inherit Pattern classes
SexPattern = psp.pattern_factory(psp.bip.CategorySetPattern, Universe=('female', 'male'))

# Built-in patterns can be referred to by their names
EmbarkedPattern = psp.pattern_factory('CategorySetPattern', Universe=('Southampton', 'Cherbourg', 'Queenstown'))

PassengerClassPattern = psp.pattern_factory(psp.bip.IntervalPattern, BoundsUniverse=(1, 2, 3))
AgePattern = psp.pattern_factory('IntervalPattern', BoundsUniverse=(0, 20, 40, 60, 80))
NSiblingsPattern = psp.pattern_factory('IntervalPattern', BoundsUniverse=(0, 1, 2, 8))
NParentsPattern = psp.pattern_factory('IntervalPattern', BoundsUniverse=(0, 1, 2, 6))
FarePattern = psp.pattern_factory('IntervalPattern', BoundsUniverse=(0, 30, 100, 300, 515))

NamePattern = psp.pattern_factory(psp.bip.NgramSetPattern, StopWords=set())

# CartesianPattern combines Patterns for each column in the data
class PassengerPattern(psp.bip.CartesianPattern):
    DimensionTypes = {
        'Survived': SurvivedPattern, 
        'Sex': SexPattern,
        'Embarked': EmbarkedPattern,
        'Pclass': PassengerClassPattern,
        'Age': AgePattern,
        'SibSp': NSiblingsPattern,
        'Parch': NParentsPattern,
        'Fare': FarePattern,
        'Name': NamePattern
    }
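
Here is the small comparison example promised above. It is a minimal sketch, under two assumptions on our side that are not shown in this notebook: that a pattern can be parsed back from its string representation, and that p1 <= p2 reads "p1 is less precise than p2" while & computes the meet of two patterns. Check the paspailleur documentation before relying on these conventions.

# A hedged sketch of pattern comparisons.
# ASSUMPTIONS: patterns parse from their string representations;
# `p1 <= p2` means "p1 is less precise than p2"; `&` is the meet.
adult = AgePattern('[20.0, 60.0]')
anyone = AgePattern.get_min_pattern()  # [-inf, inf], the least precise age pattern
print(anyone <= adult)  # expected True: [-inf, inf] covers every passenger aged 20-60
print(adult & anyone)   # expected [-inf, inf]: the meet is the most precise pattern
                        # that is less precise than both operands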

For the moment, paspailleur cannot treat None and NaN values in the data, so we manually fill them with the corresponding minimal patterns.

for f in df.columns:
    if df[f].isna().any():
        p = PassengerPattern.DimensionTypes[f].get_min_pattern()
        df[f] = df[f].fillna(p)
        print(f"Found NaN values in dimension {f}. Replace them by pattern: {p}")
Found NaN values in dimension Age. Replace them by pattern: [-inf, inf]
Found NaN values in dimension Embarked. Replace them by pattern: {'Queenstown', 'Southampton', 'Cherbourg'}

Now, let us create a context dictionary, where the keys of the dictionary are objects (the names of rows in the data) and the values are the patterns describing these objects.

For every object there should be just one pattern.
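
Such a dictionary can be obtained directly from the DataFrame with df.to_dict('index'), as done in the fitting cell below. As a small illustration (added here, not part of the original analysis), the description of the first passenger looks as follows:

# Peek at the context dictionary: keys are PassengerIds, values are row descriptions
context = df.to_dict('index')
print(context[1])
# {'Survived': 'No', 'Pclass': 3, 'Name': 'Braund, Mr. Owen Harris', 'Sex': 'male',
#  'Age': 22.0, 'SibSp': 1, 'Parch': 0, 'Fare': 7.25, 'Embarked': 'Southampton'}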

Now we create a PatternStructure that will let us analyse the context.

Every pattern in the pattern structure is created by joining atomic patterns together, so a pattern describes the objects that are covered by all the atomic patterns it consists of.

%%time
ps = psp.PatternStructure(PassengerPattern)
ps.fit(df.to_dict('index'), min_atom_support=0.1, use_tqdm=USE_TQDM)
CPU times: user 869 ms, sys: 7.98 ms, total: 876 ms
Wall time: 876 ms
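
To make the notion of atomic patterns concrete, we can peek at the ones the structure has mined. This sketch assumes the fitted PatternStructure exposes them through an atomic_patterns mapping; consult the paspailleur documentation if the attribute differs:

# Hedged peek at the mined atomic patterns (ASSUMED attribute name)
for i, atomic in enumerate(ps.atomic_patterns):
    print(atomic)
    if i == 4:  # only show the first five
        break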

Part Two: Mining patterns#

Mining concepts#

Here we mine stable pattern concepts, where every concept can be treated as an individual cluster.

There are two important parameters of the function ps.mine_concepts:

  • min_support, which is the minimal number (or, as here, proportion) of objects covered by a concept;

  • min_delta_stability, which requires every more precise concept to cover fewer objects than the concept itself (by at least the given margin).

%%time
concepts = ps.mine_concepts(min_delta_stability=0.01, min_support=0.1, algorithm='gSofia', use_tqdm=USE_TQDM)
CPU times: user 5.42 s, sys: 44.9 ms, total: 5.46 s
Wall time: 5.46 s
print(f"# concepts: {len(concepts):,}")
extent, intent = concepts[50]
print("Concept #50")
print(f"* objects in the concept: {list(extent)[:10]} (some of them)")
print(f"* pattern of the concept: {intent}")
# concepts: 4,868
Concept #50
* objects in the concept: [1, 2, 3, 4, 5, 7, 8, 9, 10, 11] (some of them)
* pattern of the concept: {'Age': [0.0, 60.0], 'Embarked': NOT({'Queenstown'}), 'Fare': [0.0, 300.0], 'Parch': [0.0, 6.0], 'Pclass': [1.0, 3.0], 'SibSp': [0.0, 8.0]}
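
Given an intent, its support can be measured with the same helpers used later in this notebook (ps.measure_support and ps.measure_frequency); since the extent of a concept is exactly the set of objects its intent covers, the two counts should coincide:

# Support (object count) and frequency (share of the data) of the intent of concept #50
print(len(extent), ps.measure_support(intent))  # the two numbers should coincide
print(f"{ps.measure_frequency(intent):.0%} of the data")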

Mining implications#

Now we mine the implications:

%%time
implications = ps.mine_implications(min_support=0.2, max_key_length=3, 
                                    algorithm='Talky-GI',
                                    reduce_conclusions=True, use_tqdm=USE_TQDM)
print(len(implications))
47
CPU times: user 155 ms, sys: 23 μs, total: 155 ms
Wall time: 154 ms
print("Mined Implications:")
for premise, conclusion in implications.items():
    print(premise, f'=> ({ps.measure_support(premise)} examples)', conclusion, sep='\n')
    print()
Mined Implications:
{}
=> (891 examples)
{'Fare': [0.0, 515.0], 'Parch': [0.0, 6.0], 'Pclass': [1.0, 3.0], 'SibSp': [0.0, 8.0]}

{'Fare': < 515.0}
=> (888 examples)
{'Fare': <= 300.0}

{'Parch': < 6.0}
=> (876 examples)
{'Parch': <= 2.0}

{'SibSp': < 8.0}
=> (845 examples)
{'SibSp': <= 2.0}

{'Fare': < 300.0}
=> (838 examples)
{'Fare': <= 100.0}

{'SibSp': < 2.0}
=> (817 examples)
{'SibSp': <= 1.0}

{'Parch': < 2.0}
=> (796 examples)
{'Parch': <= 1.0}

{'Embarked': NOT({'Cherbourg'})}
=> (721 examples)
{'Fare': <= 300.0}

{'Age': <= 80.0}
=> (714 examples)
{'Age': >= 0.0}

{'Age': >= 0.0}
=> (714 examples)
{'Age': <= 80.0}

{'Age': < 80.0}
=> (692 examples)
{'Age': <= 60.0}

{'Parch': < 1.0}
=> (678 examples)
{'Parch': <= 0.0}

{'Pclass': > 1.0}
=> (675 examples)
{'Fare': <= 100.0, 'Pclass': >= 2.0}

{'Fare': < 100.0}
=> (657 examples)
{'Fare': <= 30.0}

{'SibSp': < 1.0}
=> (608 examples)
{'SibSp': <= 0.0}

{'Parch': < 1.0, 'Sex': {'male'}}
=> (484 examples)
{'SibSp': <= 2.0}

{'Sex': {'male'}, 'SibSp': < 1.0}
=> (434 examples)
{'Parch': <= 2.0}

{'Age': < 60.0}
=> (564 examples)
{'Age': <= 40.0}

{'Age': > 0.0}
=> (550 examples)
{'Age': >= 20.0}

{'Age': > 0.0, 'Sex': {'male'}}
=> (364 examples)
{'SibSp': <= 2.0}

{'Survived': {'No'}}
=> (549 examples)
{'Fare': <= 300.0}

{'Parch': < 1.0, 'Survived': {'No'}}
=> (445 examples)
{'SibSp': <= 2.0}

{'Age': > 0.0, 'Survived': {'No'}}
=> (339 examples)
{'SibSp': <= 2.0}

{'Name': {'Mr.'}}
=> (517 examples)
{'Sex': {'male'}}

{'Name': {'Mr.'}, 'Parch': < 1.0}
=> (465 examples)
{'SibSp': <= 2.0}

{'Fare': < 100.0, 'Name': {'Mr.'}}
=> (421 examples)
{'SibSp': <= 2.0}

{'Name': {'Mr.'}, 'SibSp': < 1.0}
=> (413 examples)
{'Parch': <= 2.0}

{'Age': > 0.0, 'Name': {'Mr.'}}
=> (345 examples)
{'SibSp': <= 2.0}

{'Pclass': > 2.0}
=> (491 examples)
{'Pclass': >= 3.0}

{'Age': > 0.0, 'Pclass': > 2.0, 'Sex': {'male'}}
=> (190 examples)
{'Name': {'Mr.'}}

{'Pclass': < 3.0}
=> (400 examples)
{'Pclass': <= 2.0}

{'Pclass': < 3.0, 'SibSp': < 1.0}
=> (257 examples)
{'Parch': <= 2.0}

{'Fare': < 300.0, 'Pclass': < 3.0, 'Sex': {'male'}}
=> (211 examples)
{'Parch': <= 2.0, 'SibSp': <= 2.0}

{'Parch': < 2.0, 'Pclass': < 3.0, 'Sex': {'male'}}
=> (216 examples)
{'SibSp': <= 2.0}

{'Age': < 80.0, 'Pclass': < 3.0, 'Sex': {'male'}}
=> (185 examples)
{'Parch': <= 2.0}

{'Age': < 60.0, 'Pclass': < 3.0, 'SibSp': < 2.0}
=> (235 examples)
{'Parch': <= 2.0}

{'Name': {'Mr.'}, 'Pclass': < 3.0}
=> (198 examples)
{'Embarked': NOT({'Queenstown'})}

{'Fare': < 300.0, 'Name': {'Mr.'}, 'Pclass': < 3.0}
=> (182 examples)
{'Parch': <= 2.0, 'SibSp': <= 2.0}

{'Name': {'Mr.'}, 'Parch': < 2.0, 'Pclass': < 3.0}
=> (189 examples)
{'SibSp': <= 2.0}

{'SibSp': > 0.0}
=> (283 examples)
{'Fare': <= 300.0, 'SibSp': >= 1.0}

{'Embarked': NOT({'Southampton'}), 'Parch': < 1.0}
=> (192 examples)
{'SibSp': <= 2.0}

{'Fare': > 0.0}
=> (240 examples)
{'Fare': >= 30.0}

{'Pclass': < 2.0}
=> (216 examples)
{'Pclass': <= 1.0}

{'Parch': < 2.0, 'Pclass': < 2.0}
=> (194 examples)
{'SibSp': <= 2.0}

{'Parch': > 0.0}
=> (213 examples)
{'Parch': >= 1.0}

{'Name': {'Miss.'}}
=> (182 examples)
{'Parch': <= 2.0, 'Sex': {'female'}}

{'Age': < 40.0}
=> (179 examples)
{'Age': <= 20.0, 'Fare': <= 300.0}

Mining subgroups#

Now we mine subgroups, with the set of survivors as the goal:

goal_objects = set(df[df['Survived'] == "Yes"].index)
subgroups_iterator = ps.iter_subgroups(
    goal_objects=goal_objects,
    quality_measure='Precision', quality_threshold=0.65,
    max_length=2,
    use_tqdm=USE_TQDM
)
%%time
subgroups = list(subgroups_iterator)
print(len(subgroups))
16
CPU times: user 3.9 ms, sys: 0 ns, total: 3.9 ms
Wall time: 3.9 ms
# Order subgroups 1) by simplicity of pattern, 2) by their quality
subgroups = sorted(subgroups, key=lambda sg_data: (len(sg_data.pattern), -sg_data.quality_value))
print("Subgroups for Survived Passengers:")
for pattern, objects, quality, quality_name in subgroups:
    print(f"Pattern: {pattern}")
    print(f"{quality_name}: {quality:.2%}, Support: {len(objects)} ({ps.measure_frequency(pattern):.0%} of data)")
    print()
Subgroups for Survived Passengers:
Pattern: {'Survived': {'Yes'}}
Precision: 100.00%, Support: 342 (38% of data)

Pattern: {'Name': {'Mrs.'}}
Precision: 79.20%, Support: 125 (14% of data)

Pattern: {'Sex': {'female'}}
Precision: 74.20%, Support: 314 (35% of data)

Pattern: {'Name': {'Miss.'}}
Precision: 69.78%, Support: 182 (20% of data)

Pattern: {'Fare': 30.0}
Precision: 83.33%, Support: 6 (1% of data)

Pattern: {'Age': <= 20.0, 'Pclass': <= 2.0}
Precision: 76.79%, Support: 56 (6% of data)

Pattern: {'Parch': >= 1.0, 'Pclass': <= 2.0}
Precision: 73.79%, Support: 103 (12% of data)

Pattern: {'Embarked': NOT({'Southampton'}), 'Fare': >= 30.0}
Precision: 72.62%, Support: 84 (9% of data)

Pattern: {'Pclass': <= 2.0, 'SibSp': >= 1.0}
Precision: 67.13%, Support: 143 (16% of data)

Pattern: {'Fare': >= 30.0, 'Pclass': <= 2.0}
Precision: 66.83%, Support: 199 (22% of data)

Pattern: {'Embarked': NOT({'Southampton'}), 'Pclass': <= 2.0}
Precision: 66.36%, Support: 107 (12% of data)

Pattern: {'Parch': >= 1.0, 'SibSp': <= 0.0}
Precision: 66.20%, Support: 71 (8% of data)

Pattern: {'Age': <= 80.0, 'Pclass': <= 1.0}
Precision: 65.59%, Support: 186 (21% of data)

Pattern: {'Age': >= 0.0, 'Pclass': <= 1.0}
Precision: 65.59%, Support: 186 (21% of data)

Pattern: {'Fare': >= 30.0, 'SibSp': <= 1.0}
Precision: 65.50%, Support: 200 (22% of data)

Pattern: {'Fare': >= 30.0, 'Parch': <= 0.0}
Precision: 65.22%, Support: 138 (15% of data)
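
As a final intuition check, the precision of a subgroup is simply the share of its members that reach the goal. Recomputing it for the {'Sex': {'female'}} subgroup with plain pandas (an extra check added here) reproduces the numbers reported above:

# Precision of the {'Sex': {'female'}} subgroup: the share of females who survived
females = df[df['Sex'] == 'female']
precision = (females['Survived'] == 'Yes').mean()
print(f"{precision:.2%} of {len(females)} female passengers survived")
# expected: 74.20% of 314, matching the subgroup report above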