The main interface

Consistency testing (`check`)

The test functions are implemented in the mlscorecheck.check module.

Binary classification

mlscorecheck.check.binary.check_1_testset_no_kfold(testset: dict, scores: dict, eps, *, numerical_tolerance: float = 1e-06, prefilter_by_pairs: bool = True) → dict

Use this check if the scores are calculated on one single test set with no kfolding. The test is performed by exhaustively testing all possible confusion matrices.

Parameters:

testset (dict) – the specification of a testset with p, n or its name
scores (dict(str,float)) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’; synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’; and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty (potentially for each score)
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
prefilter_by_pairs (bool) – whether to do a prefiltering based on the score-pair tp-tn solutions (faster)

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Raises:

ValueError – if the problem is not specified properly

Examples

>>> from mlscorecheck.check.binary import check_1_testset_no_kfold
>>> testset = {'p': 530, 'n': 902}
>>> scores = {'acc': 0.62, 'sens': 0.22, 'spec': 0.86, 'f1p': 0.3, 'fm': 0.32}
>>> result = check_1_testset_no_kfold(testset=testset,
                                        scores=scores,
                                        eps=1e-2)
>>> result['inconsistency']
# False

>>> testset = {'p': 530, 'n': 902}
>>> scores = {'acc': 0.92, 'sens': 0.22, 'spec': 0.86, 'f1p': 0.3, 'fm': 0.32}
>>> result = check_1_testset_no_kfold(testset=testset,
                                        scores=scores,
                                        eps=1e-2)
>>> result['inconsistency']
# True
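
The exhaustive test described above can be pictured with a short sketch. This is not the library's implementation: `consistent_confusion_matrices` is a hypothetical helper that supports only 'acc', 'sens' and 'spec' and skips the pair-based prefiltering. It enumerates every (tp, tn) pair permitted by p and n and keeps those whose scores fall within eps (plus a small tolerance) of the reported values.

```python
def consistent_confusion_matrices(p, n, scores, eps, numerical_tolerance=1e-6):
    """Return all (tp, tn) pairs that reproduce every reported score within eps.

    Hypothetical helper for illustration; only 'acc', 'sens' and 'spec' are supported.
    """
    formulas = {
        'acc': lambda tp, tn: (tp + tn) / (p + n),
        'sens': lambda tp, tn: tp / p,
        'spec': lambda tp, tn: tn / n,
    }
    matches = []
    for tp in range(p + 1):
        for tn in range(n + 1):
            # Keep the pair only if every reported score is matched.
            if all(abs(formulas[name](tp, tn) - value) <= eps + numerical_tolerance
                   for name, value in scores.items()):
                matches.append((tp, tn))
    return matches


testset = {'p': 530, 'n': 902}
plausible = consistent_confusion_matrices(
    testset['p'], testset['n'], {'acc': 0.62, 'sens': 0.22, 'spec': 0.86}, eps=1e-2)
implausible = consistent_confusion_matrices(
    testset['p'], testset['n'], {'acc': 0.92, 'sens': 0.22, 'spec': 0.86}, eps=1e-2)

# Scores are flagged as inconsistent when no confusion matrix reproduces them.
print('inconsistency:', len(plausible) == 0)
print('inconsistency:', len(implausible) == 0)
```

With acc = 0.92 no pair survives: sens and spec cap tp + tn at 905, while the accuracy interval requires at least 1304 correct predictions out of 1432.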

mlscorecheck.check.binary.check_1_dataset_kfold_som(dataset: dict, folding: dict, scores: dict, eps, *, numerical_tolerance: float = 1e-06, prefilter_by_pairs: bool = True) → dict

This function checks the consistency of scores calculated by applying k-fold cross validation to a single dataset and aggregating the figures over the folds in the score of means fashion. The test is performed by exhaustively testing all possible confusion matrices.

Parameters:

dataset (dict) – The dataset specification.
folding (dict) – The folding specification.
scores (dict(str,float)) – The scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’; synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’; and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – The numerical uncertainty(ies) of the scores.
numerical_tolerance (float, optional) – In practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity. Defaults to NUMERICAL_TOLERANCE.
prefilter_by_pairs (bool) – whether to do a prefiltering based on the score-pair tp-tn solutions (faster)

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Raises:

ValueError – if the problem is not specified properly

Examples

>>> from mlscorecheck.check.binary import check_1_dataset_kfold_som
>>> dataset = {'dataset_name': 'common_datasets.monk-2'}
>>> folding = {'n_folds': 4, 'n_repeats': 3, 'strategy': 'stratified_sklearn'}
>>> scores = {'spec': 0.668, 'npv': 0.744, 'ppv': 0.667,
                'bacc': 0.706, 'f1p': 0.703, 'fm': 0.704}
>>> result = check_1_dataset_kfold_som(dataset=dataset,
                                        folding=folding,
                                        scores=scores,
                                        eps=1e-3)
>>> result['inconsistency']
# False

>>> dataset = {'p': 10, 'n': 20}
>>> folding = {'n_folds': 5, 'n_repeats': 1}
>>> scores = {'acc': 0.428, 'npv': 0.392, 'bacc': 0.442, 'f1p': 0.391}
>>> result = check_1_dataset_kfold_som(dataset=dataset,
                                        folding=folding,
                                        scores=scores,
                                        eps=1e-3)
>>> result['inconsistency']
# True
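
Under score-of-means aggregation the confusion matrices of all folds are summed before any score is computed. Each sample lands in exactly one test fold per repetition, so the pooled class counts are p × n_repeats and n × n_repeats however the folds were drawn. The sketch below applies this to a single score, accuracy. `accuracy_is_attainable` is a hypothetical helper, not part of the package.

```python
import math


def accuracy_is_attainable(dataset, folding, acc, eps):
    """Check whether some integer tp + tn yields the reported accuracy within eps."""
    # Pooling the folds of every repetition multiplies the class counts by n_repeats.
    total = (dataset['p'] + dataset['n']) * folding.get('n_repeats', 1)
    # tp + tn must be an integer inside [(acc - eps) * total, (acc + eps) * total].
    lowest = math.ceil((acc - eps) * total - 1e-9)
    highest = math.floor((acc + eps) * total + 1e-9)
    return max(lowest, 0) <= min(highest, total)


dataset = {'p': 10, 'n': 20}
folding = {'n_folds': 5, 'n_repeats': 1}
print(accuracy_is_attainable(dataset, folding, acc=0.428, eps=1e-3))
```

With 30 pooled samples, an accuracy of 0.428 ± 0.001 would need between 12.81 and 12.87 correct predictions, and no integer lies in that interval.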

mlscorecheck.check.binary.check_1_dataset_known_folds_mos(dataset: dict, folding: dict, scores: dict, eps, fold_score_bounds: dict = None, *, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict

This function checks the consistency of scores calculated by applying k-fold cross validation to a single dataset and aggregating the figures over the folds in the mean of scores fashion.

The test operates by constructing a linear program describing the experiment and checking its feasibility.

The test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. For a stronger test, one can add fold_score_bounds when, for example, the minimum and the maximum scores over the folds are also provided. Full names in snake case, like ‘positive_predictive_value’; synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’; and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.

Parameters:

dataset (dict) – The dataset specification.
folding (dict) – The folding specification.
scores (dict(str,float)) – The scores to check.
eps (float|dict(str,float)) – The numerical uncertainty(ies) of the scores.
fold_score_bounds (None|dict(str,tuple(float,float))) – Bounds on the scores in the folds.
solver_name (None|str) – The solver to use.
timeout (None|int) – The timeout for the linear programming solver in seconds.
verbosity (int) – The verbosity level of the pulp linear programming solver. 0: silent, non-zero: verbose.
numerical_tolerance (float) – In practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Raises:

ValueError – if the problem is not specified properly

Examples

>>> from mlscorecheck.check.binary import check_1_dataset_known_folds_mos
>>> dataset = {'p': 126, 'n': 131}
>>> folding = {'folds': [{'p': 52, 'n': 94}, {'p': 74, 'n': 37}]}
>>> scores = {'acc': 0.573, 'sens': 0.768, 'bacc': 0.662}
>>> result = check_1_dataset_known_folds_mos(dataset=dataset,
                                            folding=folding,
                                            scores=scores,
                                            eps=1e-3)
>>> result['inconsistency']
# False

>>> dataset = {'p': 398, 'n': 569}
>>> folding = {'n_folds': 4, 'n_repeats': 2, 'strategy': 'stratified_sklearn'}
>>> scores = {'acc': 0.9, 'spec': 0.9, 'sens': 0.6}
>>> result = check_1_dataset_known_folds_mos(dataset=dataset,
                                            folding=folding,
                                            scores=scores,
                                            eps=1e-2)
>>> result['inconsistency']
# True

>>> dataset = {'dataset_name': 'common_datasets.glass_0_1_6_vs_2'}
>>> folding = {'n_folds': 4, 'n_repeats': 2, 'strategy': 'stratified_sklearn'}
>>> scores = {'acc': 0.9, 'spec': 0.9, 'sens': 0.6, 'bacc': 0.1, 'f1': 0.95}
>>> result = check_1_dataset_known_folds_mos(dataset=dataset,
                                            folding=folding,
                                            fold_score_bounds={'acc': (0.8, 1.0)},
                                            scores=scores,
                                            eps=1e-2,
                                            numerical_tolerance=1e-6)
>>> result['inconsistency']
# True
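
To make the mean-of-scores feasibility question concrete, the sketch below swaps the linear program for a brute-force search, which is only practical for very small folds. It is an illustration under that assumption, not the package's method. `mos_scores` and `mos_is_feasible` are hypothetical names, and the fold sizes are made up.

```python
from itertools import product


def mos_scores(folds, matrices):
    """Average the fold-level acc, sens and spec (mean-of-scores aggregation)."""
    k = len(folds)
    return {
        'acc': sum((tp + tn) / (f['p'] + f['n']) for f, (tp, tn) in zip(folds, matrices)) / k,
        'sens': sum(tp / f['p'] for f, (tp, _) in zip(folds, matrices)) / k,
        'spec': sum(tn / f['n'] for f, (_, tn) in zip(folds, matrices)) / k,
    }


def mos_is_feasible(folds, scores, eps):
    """Brute-force stand-in for the linear program: try every fold-level (tp, tn)."""
    per_fold = [product(range(f['p'] + 1), range(f['n'] + 1)) for f in folds]
    for matrices in product(*per_fold):
        candidate = mos_scores(folds, matrices)
        if all(abs(candidate[name] - value) <= eps for name, value in scores.items()):
            return True
    return False


folds = [{'p': 4, 'n': 12}, {'p': 6, 'n': 8}]
print(mos_is_feasible(folds, {'acc': 0.728, 'sens': 0.625, 'spec': 0.792}, eps=1e-3))
print(mos_is_feasible(folds, {'acc': 0.95, 'sens': 0.625, 'spec': 0.792}, eps=1e-3))
```

The first set of scores is reproduced by tp = (3, 3) and tn = (10, 6). The second is not attainable, because the reported mean sensitivity and specificity cap the mean accuracy well below 0.95.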

mlscorecheck.check.binary.check_1_dataset_unknown_folds_mos(dataset: dict, folding: dict, scores: dict, eps, fold_score_bounds: dict = None, *, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict

Checking the consistency of scores calculated in a k-fold cross validation on a single dataset, in a mean-of-scores fashion, without knowing the fold configuration. The function generates all possible fold configurations and tests the consistency of each. The scores are deemed inconsistent only if every k-fold configuration proves to be inconsistent.

The test operates by constructing a linear program describing the experiment and checking its feasibility.

The test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. For a stronger test, one can add fold_score_bounds when, for example, the minimum and the maximum scores over the folds are also provided. Full names in snake case, like ‘positive_predictive_value’; synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’; and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.

Note that depending on the size of the dataset (especially the number of minority instances) and the folding configuration, this test might lead to an intractable number of problems to be solved. Use the function estimate_n_evaluations to get an upper bound estimate on the number of fold combinations.

The evaluation of possible fold configurations stops when a feasible configuration is found.

Parameters:

dataset (dict) – the dataset specification
folding (dict) – the folding specification
scores (dict(str,float)) – the scores to check
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
fold_score_bounds (None|dict(str,tuple(float,float))) – bounds on the scores in the folds
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity level of the pulp linear programming solver, 0: silent, non-zero: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Raises:

ValueError – if the problem is not specified properly

Examples

>>> from mlscorecheck.check.binary import check_1_dataset_unknown_folds_mos
>>> dataset = {'p': 126, 'n': 131}
>>> folding = {'n_folds': 2, 'n_repeats': 1}
>>> scores = {'acc': 0.573, 'sens': 0.768, 'bacc': 0.662}
>>> result = check_1_dataset_unknown_folds_mos(dataset=dataset,
                                                folding=folding,
                                                scores=scores,
                                                eps=1e-3)
>>> result['inconsistency']
# False

>>> dataset = {'p': 19, 'n': 97}
>>> folding = {'n_folds': 3, 'n_repeats': 1}
>>> scores = {'acc': 0.9, 'spec': 0.9, 'sens': 0.6}
>>> result = check_1_dataset_unknown_folds_mos(dataset=dataset,
                                                folding=folding,
                                                scores=scores,
                                                eps=1e-4)
>>> result['inconsistency']
# True
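
The number of fold configurations mentioned above grows quickly, and a small enumeration shows why. The sketch below rests on two assumptions that may differ from the package's internals: fold sizes follow the usual convention of giving the first (p + n) mod n_folds folds one extra sample, and every fold holds at least one positive and one negative. `fold_sizes` and `fold_configurations` are hypothetical names.

```python
def fold_sizes(total, n_folds):
    """Split total samples into n_folds nearly equal fold sizes (assumed convention)."""
    base, extra = divmod(total, n_folds)
    return [base + 1 if i < extra else base for i in range(n_folds)]


def fold_configurations(p, n, n_folds):
    """Yield every assignment of the p positives to the folds as a list of {'p', 'n'} dicts."""
    sizes = fold_sizes(p + n, n_folds)

    def distribute(index, remaining_p):
        if index == len(sizes):
            if remaining_p == 0:
                yield []
            return
        size = sizes[index]
        # Assumption: at least one positive and one negative in each fold.
        for p_i in range(1, min(size - 1, remaining_p) + 1):
            for rest in distribute(index + 1, remaining_p - p_i):
                yield [{'p': p_i, 'n': size - p_i}] + rest

    return distribute(0, p)


print(list(fold_configurations(p=3, n=3, n_folds=2)))
print(sum(1 for _ in fold_configurations(p=19, n=97, n_folds=3)))
```

Under these assumptions the dataset with p = 19 and n = 97 split into 3 folds already gives 153 ordered configurations, each of which would need its own feasibility check. Configurations that differ only in the order of equally sized folds are equivalent and could be deduplicated.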

mlscorecheck.check.binary.check_n_testsets_mos_no_kfold(testsets: list, scores: dict, eps, testset_score_bounds: dict = None, *, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict

This function checks the consistency of scores calculated on multiple testsets with no k-fold and aggregating the figures over the testsets in the mean of scores fashion.

The test operates by constructing a linear program describing the experiment and checking its feasibility.

The test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. For a stronger test, one can add testset_score_bounds when, for example, the minimum and the maximum scores over the testsets are also provided. Full names in snake case, like ‘positive_predictive_value’; synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’; and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.

Parameters:

testsets (list(dict)) – the list of testset specifications
scores (dict(str,float)) – the scores to check
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
testset_score_bounds (None|dict(str,tuple(float,float))) – the potential bounds on the scores in the testsets
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Raises:

ValueError – if the problem is not specified properly

Examples

>>> from mlscorecheck.check.binary import check_n_testsets_mos_no_kfold
>>> testsets = [{'p': 349, 'n': 50},
                {'p': 478, 'n': 323},
                {'p': 324, 'n': 83},
                {'p': 123, 'n': 145}]
>>> scores = {'acc': 0.6441, 'sens': 0.6706, 'spec': 0.3796, 'bacc': 0.5251}
>>> result = check_n_testsets_mos_no_kfold(testsets=testsets,
                                            eps=1e-4,
                                            scores=scores)
>>> result['inconsistency']
# False

>>> scores['sens'] = 0.6756
>>> result = check_n_testsets_mos_no_kfold(testsets=testsets,
                                            eps=1e-4,
                                            scores=scores)
>>> result['inconsistency']
# True
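
The two aggregation modes used throughout this page can give different values for the same confusion matrices, which is why each mode has its own check. A minimal sketch for accuracy follows. The helper names and the (tp, tn) values are made up for illustration.

```python
def accuracy_mos(testsets, matrices):
    """Mean of scores: compute the accuracy per testset, then average."""
    per_testset = [(tp + tn) / (t['p'] + t['n']) for t, (tp, tn) in zip(testsets, matrices)]
    return sum(per_testset) / len(per_testset)


def accuracy_som(testsets, matrices):
    """Score of means: pool the counts first, then compute a single accuracy."""
    correct = sum(tp + tn for tp, tn in matrices)
    total = sum(t['p'] + t['n'] for t in testsets)
    return correct / total


testsets = [{'p': 10, 'n': 10}, {'p': 90, 'n': 10}]
matrices = [(5, 10), (90, 0)]  # hypothetical (tp, tn) per testset

print(accuracy_mos(testsets, matrices))
print(accuracy_som(testsets, matrices))
```

Here the mean of scores is (0.75 + 0.9) / 2 = 0.825, while the score of means is 105 / 120 = 0.875. The larger testset dominates the pooled figure but counts only once in the average.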

mlscorecheck.check.binary.check_n_testsets_som_no_kfold(testsets: list, scores: dict, eps, *, numerical_tolerance: float = 1e-06, prefilter_by_pairs: bool = True)

Checking the consistency of scores calculated by aggregating the figures over testsets in the score of means fashion, without k-folding.

The test is performed by exhaustively testing all possible confusion matrices.

Parameters:

testsets (list(dict)) – the specification of the testsets
scores (dict(str,float)) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’; synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’; and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
prefilter_by_pairs (bool) – whether to prefilter the solution space by pair solutions when possible to speed up the process

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Raises:

ValueError – if the problem is not specified properly

Examples

>>> from mlscorecheck.check.binary import check_n_testsets_som_no_kfold
>>> testsets = [{'p': 405, 'n': 223}, {'p': 3, 'n': 422}, {'p': 109, 'n': 404}]
>>> scores = {'acc': 0.4719, 'npv': 0.6253, 'f1p': 0.3091}
>>> result = check_n_testsets_som_no_kfold(testsets=testsets,
                                            scores=scores,
                                            eps=1e-3)
>>> result['inconsistency']
# False

>>> scores['npv'] = 0.6263
>>> result = check_n_testsets_som_no_kfold(testsets=testsets,
                                            scores=scores,
                                            eps=1e-3)
>>> result['inconsistency']
# True

mlscorecheck.check.binary.check_n_datasets_som_kfold_som(evaluations: list, scores: dict, eps, *, numerical_tolerance: float = 1e-06, prefilter_by_pairs: bool = True)

Checking the consistency of scores calculated by applying k-fold cross validation to multiple datasets and aggregating the figures over the folds and datasets in the score of means fashion. The test is performed by exhaustively testing all possible confusion matrices.

Parameters:

evaluations (list(dict)) – the specification of the evaluations
scores (dict(str,float)) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’; synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’; and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
prefilter_by_pairs (bool) – whether to do a prefiltering based on the score-pair tp-tn solutions (faster)

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Raises:

ValueError – if the problem is not specified properly

Examples

>>> from mlscorecheck.check.binary import check_n_datasets_som_kfold_som
>>> evaluation0 = {'dataset': {'p': 389, 'n': 630},
                    'folding': {'n_folds': 5, 'n_repeats': 2,
                                'strategy': 'stratified_sklearn'}}
>>> evaluation1 = {'dataset': {'dataset_name': 'common_datasets.saheart'},
                    'folding': {'n_folds': 5, 'n_repeats': 2,
                                'strategy': 'stratified_sklearn'}}
>>> evaluations = [evaluation0, evaluation1]
>>> scores = {'acc': 0.631, 'sens': 0.341, 'spec': 0.802, 'f1p': 0.406, 'fm': 0.414}
>>> result = check_n_datasets_som_kfold_som(scores=scores,
                                            evaluations=evaluations,
                                            eps=1e-3)
>>> result['inconsistency']
# False

>>> evaluation0 = {'dataset': {'p': 389, 'n': 630},
                    'folding': {'n_folds': 5, 'n_repeats': 2,
                                'strategy': 'stratified_sklearn'}}
>>> evaluation1 = {'dataset': {'dataset_name': 'common_datasets.saheart'},
                    'folding': {'n_folds': 5, 'n_repeats': 2,
                                'strategy': 'stratified_sklearn'}}
>>> evaluations = [evaluation0, evaluation1]
>>> scores = {'acc': 0.731, 'sens': 0.341, 'spec': 0.802, 'f1p': 0.406, 'fm': 0.414}
>>> result = check_n_datasets_som_kfold_som(scores=scores,
                                            evaluations=evaluations,
                                            eps=1e-3)
>>> result['inconsistency']
# True

mlscorecheck.check.binary.check_n_datasets_mos_kfold_som(evaluations: list, scores: dict, eps, dataset_score_bounds: dict = None, *, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict

This function checks the consistency of scores calculated on multiple datasets with k-fold cross-validation, applying score of means aggregation over the folds and mean of scores aggregation over the datasets.

The test operates by constructing a linear program describing the experiment and checking its feasibility.

The test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. For a stronger test, one can add dataset_score_bounds when, for example, the minimum and the maximum scores over the datasets are also provided. Full names in snake case, like ‘positive_predictive_value’; synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’; and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.

Parameters:

evaluations (list(dict)) – the list of evaluation specifications
scores (dict(str,float)) – the scores to check
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
dataset_score_bounds (None|dict(str,tuple(float,float))) – the potential bounds on the scores in the datasets
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Raises:

ValueError – if the problem is not specified properly

Examples

>>> from mlscorecheck.check.binary import check_n_datasets_mos_kfold_som
>>> evaluation0 = {'dataset': {'p': 39, 'n': 822},
                    'folding': {'n_folds': 5, 'n_repeats': 3,
                                'strategy': 'stratified_sklearn'}}
>>> evaluation1 = {'dataset': {'dataset_name': 'common_datasets.winequality-white-3_vs_7'},
                    'folding': {'n_folds': 5, 'n_repeats': 3,
                                'strategy': 'stratified_sklearn'}}
>>> evaluations = [evaluation0, evaluation1]
>>> scores = {'acc': 0.312, 'sens': 0.45, 'spec': 0.312, 'bacc': 0.381}
>>> result = check_n_datasets_mos_kfold_som(evaluations=evaluations,
                                            dataset_score_bounds={'acc': (0.0, 0.5)},
                                            eps=1e-4,
                                            scores=scores)
>>> result['inconsistency']
# False

>>> evaluation0 = {'dataset': {'p': 39, 'n': 822},
                    'folding': {'n_folds': 5, 'n_repeats': 3,
                                'strategy': 'stratified_sklearn'}}
>>> evaluation1 = {'dataset': {'dataset_name': 'common_datasets.winequality-white-3_vs_7'},
                    'folding': {'n_folds': 5, 'n_repeats': 3,
                                'strategy': 'stratified_sklearn'}}
>>> evaluations = [evaluation0, evaluation1]
>>> scores = {'acc': 0.412, 'sens': 0.45, 'spec': 0.312, 'bacc': 0.381}
>>> result = check_n_datasets_mos_kfold_som(evaluations=evaluations,
                                            dataset_score_bounds={'acc': (0.5, 1.0)},
                                            eps=1e-4,
                                            scores=scores)
>>> result['inconsistency']
# True
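
The effect of dataset_score_bounds can be illustrated with a brute-force search restricted to accuracy. This is a sketch, not the linear program the function builds. `mean_accuracy_is_feasible` is a hypothetical helper, and each entry of `totals` stands for the pooled number of evaluated samples of one dataset, that is (p + n) × n_repeats under score-of-means aggregation over the folds.

```python
from itertools import product


def mean_accuracy_is_feasible(totals, acc, eps, bounds=None):
    """Search for per-dataset correct counts whose mean accuracy matches acc within eps.

    totals: pooled number of evaluated samples per dataset
    bounds: optional (low, high) interval each dataset-level accuracy must respect
    """
    low, high = bounds if bounds is not None else (0.0, 1.0)
    for correct in product(*(range(total + 1) for total in totals)):
        accuracies = [c / total for c, total in zip(correct, totals)]
        # Discard configurations that violate the per-dataset bounds.
        if any(a < low or a > high for a in accuracies):
            continue
        if abs(sum(accuracies) / len(accuracies) - acc) <= eps:
            return True
    return False


totals = [10, 20]
print(mean_accuracy_is_feasible(totals, acc=0.5, eps=1e-3))
print(mean_accuracy_is_feasible(totals, acc=0.5, eps=1e-3, bounds=(0.6, 1.0)))
```

Without bounds a mean accuracy of 0.5 is easy to reproduce. Once every dataset-level accuracy is required to be at least 0.6, the mean cannot fall below 0.6, so the same reported score becomes infeasible.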

mlscorecheck.check.binary.check_n_datasets_mos_known_folds_mos(evaluations: list, scores: dict, eps, dataset_score_bounds: dict = None, *, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict

This function checks the consistency of scores calculated by applying k-fold cross validation to N datasets and aggregating the figures over the folds and datasets in the mean of scores fashion.

The test operates by constructing a linear program describing the experiment and checking its feasibility.

The test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. For a stronger test, one can add dataset_score_bounds when, for example, the minimum and the maximum scores over the datasets are also provided. Full names in snake case, like ‘positive_predictive_value’; synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’; and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.

Parameters:

evaluations (list) – The list of evaluation specifications.
scores (dict(str,float)) – The scores to check.
eps (float|dict(str,float)) – The numerical uncertainty(ies) of the scores.
dataset_score_bounds (None|dict(str,tuple(float,float))) – Bounds on the scores for the datasets.
solver_name (None|str) – The solver to use.
timeout (None|int) – The timeout for the linear programming solver in seconds.
verbosity (int) – The verbosity level of the pulp linear programming solver. 0: silent, non-zero: verbose.
numerical_tolerance (float) – In practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Raises:

ValueError – if the problem is not specified properly

Examples

>>> from mlscorecheck.check.binary import check_n_datasets_mos_known_folds_mos
>>> evaluation0 = {'dataset': {'p': 118, 'n': 95},
                'folding': {'folds': [{'p': 22, 'n': 23}, {'p': 96, 'n': 72}]}}
>>> evaluation1 = {'dataset': {'p': 781, 'n': 423},
                'folding': {'folds': [{'p': 300, 'n': 200}, {'p': 481, 'n': 223}]}}
>>> evaluations = [evaluation0, evaluation1]
>>> scores = {'acc': 0.61, 'sens': 0.709, 'spec': 0.461, 'bacc': 0.585}
>>> result = check_n_datasets_mos_known_folds_mos(evaluations=evaluations,
                                                    scores=scores,
                                                    eps=1e-3)
>>> result['inconsistency']
# False

>>> evaluation0 = {'dataset': {'p': 118, 'n': 95},
                'folding': {'folds': [{'p': 22, 'n': 23}, {'p': 96, 'n': 72}]}}
>>> evaluation1 = {'dataset': {'p': 781, 'n': 423},
                'folding': {'folds': [{'p': 300, 'n': 200}, {'p': 481, 'n': 223}]}}
>>> evaluations = [evaluation0, evaluation1]
>>> scores = {'acc': 0.71, 'sens': 0.709, 'spec': 0.461}
>>> result = check_n_datasets_mos_known_folds_mos(evaluations=evaluations,
                                                scores=scores,
                                                eps=1e-3)
>>> result['inconsistency']
# True

mlscorecheck.check.binary.check_n_datasets_mos_unknown_folds_mos(evaluations: list, scores: dict, eps, dataset_score_bounds: dict = None, *, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict

Checking the consistency of scores calculated in k-fold cross validation on multiple datasets, in mean-of-scores fashion, without knowing the fold configurations. The function generates all possible fold configurations and tests the consistency of each. The scores are deemed inconsistent only if every k-fold configuration proves to be inconsistent.

The test operates by constructing a linear program describing the experiment and checking its feasibility.

The test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. For a stronger test, one can add dataset_score_bounds when, for example, the minimum and the maximum scores over the datasets are also provided. Full names in snake case, like ‘positive_predictive_value’; synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’; and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.

Note that depending on the size of the dataset (especially the number of minority instances) and the folding configuration, this test might lead to an intractable number of problems to be solved. Use the function estimate_n_experiments to get an upper bound estimate on the number of fold combinations.

The evaluation of possible fold configurations stops when a feasible configuration is found.

Parameters:

evaluations (list(dict)) – the list of evaluation specifications
scores (dict(str,float)) – the scores to check
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
dataset_score_bounds (None|dict(str,tuple(float,float))) – bounds on the scores in the datasets
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the pulp linear programming solver, 0: silent, non-zero: verbose
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Raises:

ValueError – if the problem is not specified properly

Examples

>>> from mlscorecheck.check.binary import check_n_datasets_mos_unknown_folds_mos
>>> evaluation0 = {'dataset': {'p': 13, 'n': 73},
                'folding': {'n_folds': 4, 'n_repeats': 1}}
>>> evaluation1 = {'dataset': {'p': 7, 'n': 26},
                'folding': {'n_folds': 3, 'n_repeats': 1}}
>>> evaluations = [evaluation0, evaluation1]
>>> scores = {'acc': 0.357, 'sens': 0.323, 'spec': 0.362, 'bacc': 0.343}
>>> result = check_n_datasets_mos_unknown_folds_mos(evaluations=evaluations,
                                                    scores=scores,
                                                    eps=1e-3)
>>> result['inconsistency']
# False

>>> evaluation0 = {'dataset': {'p': 13, 'n': 73},
                'folding': {'n_folds': 4, 'n_repeats': 1}}
>>> evaluation1 = {'dataset': {'p': 7, 'n': 26},
                'folding': {'n_folds': 3, 'n_repeats': 1}}
>>> evaluations = [evaluation0, evaluation1]
>>> scores = {'acc': 0.357, 'sens': 0.323, 'spec': 0.362, 'bacc': 0.9}
>>> result = check_n_datasets_mos_unknown_folds_mos(evaluations=evaluations,
                                                    scores=scores,
                                                    eps=1e-3)
>>> result['inconsistency']
# True

Multiclass classification

mlscorecheck.check.multiclass.check_1_testset_no_kfold_macro(testset: dict, scores: dict, eps, *, class_score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict

The function tests the consistency of scores calculated by taking the macro average of class level scores on one single multiclass dataset.

The test operates by constructing a linear programming problem representing the experiment and checking its feasibility.

Note that this test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. Without bounds, if there is a large number of classes, it is likely that there will be a configuration matching the scores provided. To increase the strength of the test, one can add class_score_bounds when, for example, besides the average score, the minimum and the maximum scores over the classes are also provided. Full names in snake case, like ‘positive_predictive_value’; synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’; and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.

Parameters:

testset (dict) – the specification of the testset
scores (dict(str,float)) – the scores to check
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
class_score_bounds (None|dict(str,tuple(float,float))) – bounds on the scores in the classes
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity level of the pulp linear programming solver. 0: silent, non-zero: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Raises:

ValueError – if the problem is not specified properly

Examples

>>> from mlscorecheck.check.multiclass import check_1_testset_no_kfold_macro
>>> testset = {0: 10, 1: 100, 2: 80}
>>> scores = {'acc': 0.6, 'sens': 0.3417, 'spec': 0.6928, 'f1p': 0.3308}
>>> results = check_1_testset_no_kfold_macro(scores=scores, testset=testset, eps=1e-4)
>>> results['inconsistency']
# False

>>> scores['acc'] = 0.6020
>>> results = check_1_testset_no_kfold_macro(scores=scores, testset=testset, eps=1e-4)
>>> results['inconsistency']
# True
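
The difference between the macro average tested here and the micro average tested by the next function can be illustrated with a small sketch. It is not the library's implementation: the helper names and the confusion matrix below are hypothetical, and only the standard one-vs-rest definitions of ‘acc’, ‘sens’ and ‘spec’ are assumed. The macro average takes the mean of the class-level scores, while the micro average computes the scores from the class-level counts summed over all classes.

```python
from statistics import mean


def one_vs_rest_counts(cm):
    """Return per-class (tp, tn, fp, fn) from a square confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    counts = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                    # class k predicted as something else
        fp = sum(row[k] for row in cm) - tp     # other classes predicted as k
        tn = total - tp - fn - fp
        counts.append((tp, tn, fp, fn))
    return counts


def scores_from_counts(tp, tn, fp, fn):
    """Compute accuracy, sensitivity and specificity from the four counts."""
    return {'acc': (tp + tn) / (tp + tn + fp + fn),
            'sens': tp / (tp + fn),
            'spec': tn / (tn + fp)}


def macro_average(cm):
    """Mean of the class-level scores."""
    per_class = [scores_from_counts(*c) for c in one_vs_rest_counts(cm)]
    return {key: mean(s[key] for s in per_class) for key in per_class[0]}


def micro_average(cm):
    """Scores computed from the counts summed over the classes."""
    totals = [sum(column) for column in zip(*one_vs_rest_counts(cm))]
    return scores_from_counts(*totals)


# Hypothetical confusion matrix whose row sums match the testset {0: 10, 1: 100, 2: 80}
cm = [[5, 3, 2],
      [20, 50, 30],
      [10, 30, 40]]
macro = macro_average(cm)
micro = micro_average(cm)
```

In this hypothetical matrix the sensitivity is the same under both averages, while the specificity and the accuracy differ, which is why the two averaging schemes need separate tests.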

mlscorecheck.check.multiclass.check_1_testset_no_kfold_micro(testset: dict, scores: dict, eps, *, numerical_tolerance: float = 1e-06, prefilter_by_pairs: bool = True) → dict[source]

Checking the consistency of scores calculated by taking the micro average of class level scores on one single multiclass dataset.

The test operates by the exhaustive enumeration of all potential confusion matrices.

Parameters:

testset (dict) – the specification of the testset
scores (dict(str,float)) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in camel case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’ and complements, like ‘false_positive_rate’ for (1 - ‘spec’) can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
prefilter_by_pairs (bool) – whether to prefilter the solution space by pair solutions when possible to speed up the process

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Raises:

ValueError – if the problem is not specified properly

Examples

>>> from mlscorecheck.check.multiclass import check_1_testset_no_kfold_micro
>>> testset = {0: 10, 1: 100, 2: 80}
>>> scores = {'acc': 0.5158, 'sens': 0.2737, 'spec': 0.6368,
    'bacc': 0.4553, 'ppv': 0.2737, 'npv': 0.6368}
>>> results = check_1_testset_no_kfold_micro(testset=testset,
                                    scores=scores,
                                    eps=1e-4)
>>> results['inconsistency']
# False

>>> scores['acc'] = 0.5258
>>> results = check_1_testset_no_kfold_micro(testset=testset,
                                    scores=scores,
                                    eps=1e-4)
>>> results['inconsistency']
# True
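
The idea of exhaustive enumeration behind the micro-averaged test can be sketched in a few lines. This is a simplified illustration and not the library's algorithm: the function name consistent_micro and the tol argument are hypothetical, only ‘acc’, ‘sens’ and ‘spec’ are handled, and no prefiltering is done. The sketch assumes that in the one-vs-rest view every sample counts as a positive for one class and as a negative for all the others.

```python
def consistent_micro(testset, scores, eps, tol=1e-6):
    """Return True if some (tp, tn) pair reproduces all scores within eps.

    Hypothetical helper: supports only the 'acc', 'sens' and 'spec' keys.
    """
    total = sum(testset.values())
    n_classes = len(testset)
    p = total                       # each sample is a positive for exactly one class
    n = (n_classes - 1) * total     # and a negative for every other class
    for tp in range(p + 1):
        for tn in range(n + 1):
            candidate = {'acc': (tp + tn) / (p + n),
                         'sens': tp / p,
                         'spec': tn / n}
            # accept the pair if every reported score is matched within eps (+ tol)
            if all(abs(candidate[key] - value) <= eps + tol
                   for key, value in scores.items()):
                return True
    return False


testset = {0: 10, 1: 100, 2: 80}
scores = {'acc': 0.5158, 'sens': 0.2737, 'spec': 0.6368}
ok = consistent_micro(testset, scores, eps=1e-4)

# the same scores with the adjusted accuracy of the second example above
adjusted = dict(scores, acc=0.5258)
ok_adjusted = consistent_micro(testset, adjusted, eps=1e-4)
```

With the scores of the example above, the sketch finds a matching pair for the original scores and none once ‘acc’ is changed to 0.5258.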

mlscorecheck.check.multiclass.check_1_dataset_known_folds_mos_macro(dataset: dict, folding: dict, scores: dict, eps, *, class_score_bounds: dict = None, fold_score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict[source]

Checking the consistency of scores calculated by taking the macro average of class-level scores on one single multiclass dataset with k-fold cross-validation.

The test operates by constructing a linear program describing the experiment and checking its feasibility.

The test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. For a stronger test, one can add class_score_bounds or fold_score_bounds when, for example, the minimum and the maximum scores over the classes or folds are available. Full names in camel case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’ and complements, like ‘false_positive_rate’ for (1 - ‘spec’) can also be used.

Parameters:

dataset (dict) – the specification of the dataset
folding (dict) – the specification of the folding strategy
scores (dict(str,float)) – the scores to check
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
class_score_bounds (None|dict(str,tuple(float,float))) – the potential bounds on the scores for the classes
fold_score_bounds (None|dict(str,tuple(float,float))) – the potential bounds on the scores in the folds
solver_name (None|str, optional) – The solver to use. Defaults to None.
timeout (None|int, optional) – The timeout for the linear programming solver in seconds. Defaults to None.
verbosity (int, optional) – The verbosity level of the pulp linear programming solver. 0: silent, non-zero: verbose. Defaults to 1.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Raises:

ValueError – if the problem is not specified properly

Examples

>>> from mlscorecheck.check.multiclass import check_1_dataset_known_folds_mos_macro
>>> dataset = {0: 149, 1: 118, 2: 83, 3: 154}
>>> folding = {'n_folds': 4, 'n_repeats': 2, 'strategy': 'stratified_sklearn'}
>>> scores = {'acc': 0.626, 'sens': 0.2483, 'spec': 0.7509, 'f1p': 0.2469}
>>> result = check_1_dataset_known_folds_mos_macro(dataset=dataset,
                                                    folding=folding,
                                                    scores=scores,
                                                    eps=1e-4)
>>> result['inconsistency']
# False

>>> scores['acc'] = 0.656
>>> result = check_1_dataset_known_folds_mos_macro(dataset=dataset,
                                                folding=folding,
                                                scores=scores,
                                                eps=1e-4)
>>> result['inconsistency']
# True

mlscorecheck.check.multiclass.check_1_dataset_known_folds_mos_micro(dataset: dict, folding: dict, scores: dict, eps, *, fold_score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict[source]

This function checks the consistency of scores calculated by taking the micro average on a single multiclass dataset with known folds.

The test operates by constructing a linear program describing the experiment and checking its feasibility.

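The test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. For a stronger test, one can add fold_score_bounds when, for example, the minimum and the maximum scores over the folds are available. Full names in camel case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’ and complements, like ‘false_positive_rate’ for (1 - ‘spec’) can also be used.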

Parameters:

dataset (dict) – The specification of the dataset.
folding (dict) – The specification of the folding strategy.
scores (dict(str,float)) – The scores to check.
eps (float|dict(str,float)) – The numerical uncertainty(ies) of the scores.
fold_score_bounds (None|dict(str,tuple(float,float))) – the potential bounds on the scores in the folds
solver_name (None|str, optional) – The solver to use. Defaults to None.
timeout (None|int, optional) – The timeout for the linear programming solver in seconds. Defaults to None.
verbosity (int, optional) – The verbosity level of the pulp linear programming solver. 0: silent, non-zero: verbose. Defaults to 1.
numerical_tolerance (float, optional) – Beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, it might slightly decrease the sensitivity. Defaults to NUMERICAL_TOLERANCE.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Raises:

ValueError – if the problem is not specified properly

Examples

>>> from mlscorecheck.check.multiclass import check_1_dataset_known_folds_mos_micro
>>> dataset = {0: 66, 1: 178, 2: 151}
>>> folding = {'folds': [{0: 33, 1: 89, 2: 76}, {0: 33, 1: 89, 2: 75}]}
>>> scores = {'acc': 0.5646, 'sens': 0.3469, 'spec': 0.6734, 'f1p': 0.3469}
>>> result = check_1_dataset_known_folds_mos_micro(dataset=dataset,
                                        folding=folding,
                                        scores=scores,
                                        eps=1e-4)
>>> result['inconsistency']
# False

>>> scores['acc'] = 0.5746
>>> result = check_1_dataset_known_folds_mos_micro(dataset=dataset,
                                        folding=folding,
                                        scores=scores,
                                        eps=1e-4)
>>> result['inconsistency']
# True

mlscorecheck.check.multiclass.check_1_dataset_known_folds_som_macro(dataset: dict, folding: dict, scores: dict, eps, *, class_score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict[source]

This function checks the consistency of scores calculated by taking the macro average on a single multiclass dataset and averaging the scores across the folds in the SoM manner.

The test operates by constructing a linear program describing the experiment and checking its feasibility.

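The test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. For a stronger test, one can add class_score_bounds when, for example, the minimum and the maximum scores over the classes are available. Full names in camel case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’ and complements, like ‘false_positive_rate’ for (1 - ‘spec’) can also be used.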

Parameters:

dataset (dict) – The specification of the dataset.
folding (dict) – The specification of the folding.
scores (dict(str,float)) – The scores to check.
eps (float|dict(str,float)) – The numerical uncertainty(ies) of the scores.
class_score_bounds (None|dict(str,tuple(float,float))) – the potential bounds on the scores for the classes
solver_name (None|str, optional) – The solver to use. Defaults to None.
timeout (None|int, optional) – The timeout for the linear programming solver in seconds. Defaults to None.
verbosity (int, optional) – The verbosity level of the pulp linear programming solver. 0: silent, non-zero: verbose. Defaults to 1.
numerical_tolerance (float, optional) – In practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity. Defaults to NUMERICAL_TOLERANCE.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Raises:

ValueError – If the provided scores are not consistent with the dataset.

Examples

>>> from mlscorecheck.check.multiclass import check_1_dataset_known_folds_som_macro
>>> dataset = {0: 129, 1: 81, 2: 135}
>>> folding = {'n_folds': 2, 'n_repeats': 2, 'strategy': 'stratified_sklearn'}
>>> scores = {'acc': 0.5662, 'sens': 0.3577, 'spec': 0.6767, 'f1p': 0.3481}
>>> result = check_1_dataset_known_folds_som_macro(dataset=dataset,
                                        folding=folding,
                                        scores=scores,
                                        eps=1e-4)
>>> result['inconsistency']
# False

>>> scores['acc'] = 0.6762
>>> result = check_1_dataset_known_folds_som_macro(dataset=dataset,
                                        folding=folding,
                                        scores=scores,
                                        eps=1e-4)
>>> result['inconsistency']
# True
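
The difference between the mean-of-scores (MoS) and score-of-means (SoM) aggregations used by the functions in this section can be sketched as follows. The per-fold counts are hypothetical and the sketch is binary for brevity; it only assumes that MoS averages the scores computed in each fold, while SoM computes one score from the counts summed over the folds.

```python
from statistics import mean

# Hypothetical per-fold counts: p/n are the positives/negatives in the fold,
# tp/tn the correctly classified positives/negatives.
folds = [{'p': 10, 'n': 10, 'tp': 9, 'tn': 9},
         {'p': 50, 'n': 30, 'tp': 20, 'tn': 20}]


def accuracy(tp, tn, p, n):
    """Accuracy from the counts of one fold (or of the pooled folds)."""
    return (tp + tn) / (p + n)


def mean_of_scores(folds):
    """MoS: compute the score in each fold, then average the scores."""
    return mean(accuracy(**fold) for fold in folds)


def score_of_means(folds):
    """SoM: pool the counts over the folds, then compute one score."""
    totals = {key: sum(fold[key] for fold in folds)
              for key in ('tp', 'tn', 'p', 'n')}
    return accuracy(**totals)


mos = mean_of_scores(folds)
som = score_of_means(folds)
```

When the folds differ in size or in performance, as in this hypothetical case, the two aggregations give different values, so a score reported under one scheme has to be tested with the matching function.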

mlscorecheck.check.multiclass.check_1_dataset_known_folds_som_micro(dataset: dict, folding: dict, scores: dict, eps, *, numerical_tolerance: float = 1e-06, prefilter_by_pairs: bool = True) → dict[source]

This function checks the consistency of scores calculated by taking the micro average of class level scores on a single multiclass dataset and averaging across the folds in the SoM manner.

The test is performed by exhaustively testing all possible confusion matrices.

Parameters:

dataset (dict) – The specification of the dataset.
folding (dict) – The specification of the folding strategy.
scores (dict(str,float)) – The scores to check.
eps (float|dict(str,float)) – The numerical uncertainty(ies) of the scores.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
prefilter_by_pairs (bool) – whether to prefilter the solution space by pair solutions when possible to speed up the process

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Raises:

ValueError – If the provided scores are not consistent with the dataset.

Examples

>>> from mlscorecheck.check.multiclass import check_1_dataset_known_folds_som_micro
>>> dataset = {0: 86, 1: 96, 2: 59, 3: 105}
>>> folding = {'folds': [{0: 43, 1: 48, 2: 30, 3: 52}, {0: 43, 1: 48, 2: 29, 3: 53}]}
>>> scores =  {'acc': 0.6272, 'sens': 0.2543, 'spec': 0.7514, 'f1p': 0.2543}
>>> result = check_1_dataset_known_folds_som_micro(dataset=dataset,
                                                    folding=folding,
                                                    scores=scores,
                                                    eps=1e-4)
>>> result['inconsistency']
# False

>>> scores['sens'] = 0.2553
>>> result = check_1_dataset_known_folds_som_micro(dataset=dataset,
                                        folding=folding,
                                        scores=scores,
                                        eps=1e-4)
>>> result['inconsistency']
# True

Regression

mlscorecheck.check.regression.check_1_testset_no_kfold(var: float, n_samples: int, scores: dict, eps, numerical_tolerance: float = 1e-06) → dict[source]

The consistency test for regression scores calculated on a single test set with no k-folding

Parameters:

var (float) – the variance of the evaluation set
n_samples (int) – the number of samples in the evaluation set
scores (dict(str,float)) – the scores to check (‘mae’, ‘rmse’, ‘mse’, ‘r2’)
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – the numerical tolerance of the test

Returns:

a summary of the analysis, with the following entries:

Return type:

dict

Examples

>>> from mlscorecheck.check.regression import check_1_testset_no_kfold
>>> var = 0.08316192579267838
>>> n_samples = 100
>>> scores =  {'mae': 0.0254, 'r2': 0.9897}
>>> result = check_1_testset_no_kfold(var=var,
                                        n_samples=n_samples,
                                        scores=scores,
                                        eps=1e-4)
>>> result['inconsistency']
# False

>>> scores['mae'] = 0.03
>>> result = check_1_testset_no_kfold(var=var,
                                    n_samples=n_samples,
                                    scores=scores,
                                    eps=1e-4)
>>> result['inconsistency']
# True
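
One way to see why the second set of scores above cannot be consistent is a small interval check. This is a sketch under stated assumptions, not the library's implementation: the function name regression_consistent is hypothetical, only ‘mae’ and ‘r2’ are handled, and var is assumed to be the variance of the test targets such that r2 = 1 - mse / var. The check uses the general relations mae <= rmse <= sqrt(n_samples) * mae.

```python
import math


def regression_consistent(var, n_samples, scores, eps):
    """Check 'mae' and 'r2' against each other using interval bounds.

    Hypothetical helper: assumes r2 = 1 - mse / var and
    mae <= rmse <= sqrt(n_samples) * mae.
    """
    mae_lo, mae_hi = scores['mae'] - eps, scores['mae'] + eps
    r2_lo, r2_hi = scores['r2'] - eps, scores['r2'] + eps

    # r2 = 1 - mse / var  =>  mse = var * (1 - r2); a higher r2 means a lower mse
    mse_lo = max(var * (1.0 - r2_hi), 0.0)
    mse_hi = max(var * (1.0 - r2_lo), 0.0)
    rmse_lo, rmse_hi = math.sqrt(mse_lo), math.sqrt(mse_hi)

    # the mae and rmse intervals must admit mae <= rmse <= sqrt(n) * mae
    return mae_lo <= rmse_hi and rmse_lo <= math.sqrt(n_samples) * mae_hi


var = 0.08316192579267838
n_samples = 100
consistent = regression_consistent(var, n_samples,
                                   {'mae': 0.0254, 'r2': 0.9897}, eps=1e-4)
consistent_adjusted = regression_consistent(var, n_samples,
                                            {'mae': 0.03, 'r2': 0.9897}, eps=1e-4)
```

With ‘mae’ set to 0.03, the smallest admissible mean absolute error exceeds the largest root mean squared error implied by ‘r2’, so the sketch reports an inconsistency, in line with the example above.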

Test bundles (`bundles`)

The test bundles dedicated to specific problems in the mlscorecheck.check.bundles module.

Retina Image Processing

The test functions dedicated to retina image processing problems.

DRIVE

mlscorecheck.check.bundles.retina.check_drive_vessel_image(image_identifier: str, annotator: str, scores: dict, eps, *, numerical_tolerance: float = 1e-06)[source]

Testing the scores calculated for one image of the DRIVE dataset with both assumptions on the region of evaluation (‘fov’/’all’).

Parameters:

image_identifier (str) – the identifier of the image (like “21”)
annotator (int) – the annotation to use (1, 2) (typically annotator 1 is used in papers)
scores (dict(str,float)) – the scores to be tested (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in camel case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’ and complements, like ‘false_positive_rate’ for (1 - ‘spec’) can also be used.
eps (float) – the numerical uncertainty
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.

Returns:

A summary of the results, with the following entries:

Return type:

dict

Examples

>>> from mlscorecheck.check.bundles.retina import check_drive_vessel_image
>>> scores = {'acc': 0.9633, 'sens': 0.7406, 'spec': 0.9849}
>>> identifier = '01'
>>> k = 4
>>> results = check_drive_vessel_image(scores=scores,
                                        eps=10**(-k),
                                        image_identifier=identifier,
                                        annotator=1)
>>> results['inconsistency']
# {'inconsistency_fov': True, 'inconsistency_all': False}
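
Why the two assumptions on the region of evaluation can lead to different outcomes is sketched below. The pixel counts are hypothetical and are not those of any DRIVE image, the helper names are made up for the illustration, and the check handles only ‘acc’, ‘sens’ and ‘spec’. The sketch assumes that the two assumptions share the number of positive (vessel) pixels and differ in the number of negative pixels, since ‘all’ also counts the pixels outside the field of view.

```python
import math


def integer_candidates(score, eps, denominator):
    """Integer counts k with k / denominator within eps of the score."""
    lo = math.ceil((score - eps) * denominator)
    hi = math.floor((score + eps) * denominator)
    return range(max(lo, 0), min(hi, denominator) + 1)


def consistent(p, n, scores, eps):
    """True if some (tp, tn) matches 'sens', 'spec' and 'acc' within eps."""
    for tp in integer_candidates(scores['sens'], eps, p):
        for tn in integer_candidates(scores['spec'], eps, n):
            if abs((tp + tn) / (p + n) - scores['acc']) <= eps:
                return True
    return False


# Hypothetical pixel counts under the two assumptions
counts = {'fov': {'p': 1000, 'n': 5000},
          'all': {'p': 1000, 'n': 9000}}
scores = {'acc': 0.8667, 'sens': 0.7, 'spec': 0.9}

result = {f'inconsistency_{name}': not consistent(scores=scores, eps=1e-4, **c)
          for name, c in counts.items()}
```

For these made-up counts the scores are feasible only under the ‘fov’ assumption. Solving for the admissible integer counts from ‘sens’ and ‘spec’ first keeps the search small, which matters for image-sized pixel counts.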

mlscorecheck.check.bundles.retina.check_drive_vessel_image_assumption(image_identifier: str, assumption: str, annotator: str, scores: dict, eps, *, numerical_tolerance: float = 1e-06)[source]

Testing the scores calculated for one image of the DRIVE dataset with a particular assumption on the region of evaluation.

Parameters:

image_identifier (str) – the identifier of the image (like “21”)
assumption (str) – the assumption on the region of evaluation to test (‘fov’/’all’)
annotator (int) – the annotation to use (1, 2) (typically annotator 1 is used in papers)
scores (dict(str,float)) – the scores to be tested (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in camel case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’ and complements, like ‘false_positive_rate’ for (1 - ‘spec’) can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

mlscorecheck.check.bundles.retina.check_drive_vessel_aggregated(imageset, annotator: int, scores: dict, eps, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict[source]

Testing the scores calculated for the DRIVE dataset with both assumptions regarding the region of evaluation (using the FoV or all pixels of the images).

The strength of the test can be improved by specifying the score_bounds (minimum and maximum scores) for the images when available.

Parameters:

imageset (str|list) – ‘train’/’test’ for all images in the train or test set, or a list of identifiers of images (e.g. [‘21’, ‘22’])
annotator (int) – the annotation to use (1, 2) (typically annotator 1 is used in papers)
scores (dict(str,float)) – the scores to be tested
eps (float) – the numerical uncertainty
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.

Returns:

The summary of the results, with the following entries:

Return type:

dict

Examples

>>> from mlscorecheck.check.bundles.retina import check_drive_vessel_aggregated
>>> scores = {'acc': 0.9494, 'sens': 0.7450, 'spec': 0.9793}
>>> k = 4
>>> results = check_drive_vessel_aggregated(scores=scores,
                                            eps=10**(-k),
                                            imageset='test',
                                            annotator=1,
                                            verbosity=0)
>>> results['inconsistency']
# {'inconsistency_fov_mos': False,
#  'inconsistency_fov_som': False,
#  'inconsistency_all_mos': True,
#  'inconsistency_all_som': True}

mlscorecheck.check.bundles.retina.check_drive_vessel_aggregated_mos_assumption(imageset, assumption: str, annotator: int, scores: dict, eps, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict[source]

Checking the consistency of scores calculated for some images of the DRIVE dataset with the mean of scores aggregation and a particular assumption on the region of evaluation.

Parameters:

imageset (str|list) – ‘train’/’test’ for all images in the train or test set, or a list of identifiers of images (e.g. [‘21’, ‘22’])
assumption (str) – the assumption on the region of evaluation to test (‘fov’/’all’)
annotator (int) – the annotation to be used (1/2) (typically annotator 1 is used in papers)
scores (dict) – the scores to check (supports only ‘acc’, ‘sens’, ‘spec’, ‘bacc’)
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

mlscorecheck.check.bundles.retina.check_drive_vessel_aggregated_som_assumption(imageset, assumption: str, annotator: int, scores: dict, eps, *, numerical_tolerance=1e-06)[source]

Tests the consistency of scores calculated on the DRIVE dataset using the score of means aggregation and a particular assumption on the region of evaluation.

Parameters:

imageset (str|list) – ‘train’/’test’ for all images in the train or test set, or a list of identifiers of images (e.g. [‘21’, ‘22’])
assumption (str) – the assumption on the region of evaluation to test (‘fov’/’all’)
annotator (int) – the annotation to be used (1/2) (typically annotator 1 is used in papers)
scores (dict) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in camel case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’ and complements, like ‘false_positive_rate’ for (1 - ‘spec’) can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

STARE

mlscorecheck.check.bundles.retina.check_stare_vessel_image(image_identifier: str, annotator: str, scores: dict, eps, *, numerical_tolerance: float = 1e-06)[source]

Testing the scores calculated for one image of the STARE dataset

Parameters:

image_identifier (str) – the identifier of the image (like “im0235”)
annotator (str) – the annotation to use (‘ah’/’vk’)
scores (dict(str,float)) – the scores to be tested (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in camel case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’ and complements, like ‘false_positive_rate’ for (1 - ‘spec’) can also be used.
eps (float) – the numerical uncertainty
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Examples

>>> from mlscorecheck.check.bundles.retina import check_stare_vessel_image
>>> img_identifier = 'im0235'
>>> scores = {'acc': 0.4699, 'npv': 0.8993, 'f1p': 0.134}
>>> results = check_stare_vessel_image(image_identifier=img_identifier,
                                        annotator='ah',
                                        scores=scores,
                                        eps=1e-4)
>>> results['inconsistency']
# False

mlscorecheck.check.bundles.retina.check_stare_vessel_aggregated(imageset, annotator: str, scores: dict, eps, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict[source]

Testing the scores calculated for the STARE dataset

Parameters:

imageset (str|list) – ‘all’ if all images are used, or a list of identifiers of images (e.g. [‘im0082’, ‘im0235’])
annotator (str) – the annotation to be used (‘ah’/’vk’)
scores (dict(str,float)) – the scores to be tested
eps (float) – the numerical uncertainty
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.

Returns:

The summary of the results, with the following entries:

Return type:

dict

Examples

>>> from mlscorecheck.check.bundles.retina import check_stare_vessel_aggregated
>>> scores = {'acc': 0.4964, 'sens': 0.5793, 'spec': 0.4871, 'bacc': 0.5332}
>>> results = check_stare_vessel_aggregated(imageset='all',
                                            annotator='ah',
                                            scores=scores,
                                            eps=1e-4,
                                            verbosity=0)
>>> results['inconsistency']
# {'inconsistency_mos': False, 'inconsistency_som': True}

mlscorecheck.check.bundles.retina.check_stare_vessel_aggregated_mos(imageset, annotator: str, scores: dict, eps, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict[source]

Checking the consistency of scores calculated for some images of the STARE dataset with the mean of scores aggregation.

Parameters:

imageset (str|list) – ‘all’ if all images are used, or a list of identifiers of images (e.g. [‘im0082’, ‘im0235’])
annotator (str) – the annotation to be used (‘ah’/’vk’)
scores (dict) – the scores to check (supports only ‘acc’, ‘sens’, ‘spec’, ‘bacc’). Full names in camel case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’ and complements, like ‘false_positive_rate’ for (1 - ‘spec’) can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

mlscorecheck.check.bundles.retina.check_stare_vessel_aggregated_som(imageset, annotator, scores, eps, numerical_tolerance=1e-06)[source]

Tests the consistency of scores calculated on the STARE dataset using the score-of-means aggregation.

Parameters:

imageset (str|list) – ‘all’ if all images are used, or a list of identifiers of images (e.g. [‘im0082’, ‘im0235’])
annotator (str) – the annotation to be used (‘ah’/’vk’)
scores (dict) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in camel case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’ and complements, like ‘false_positive_rate’ for (1 - ‘spec’) can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

HRF

mlscorecheck.check.bundles.retina.check_hrf_vessel_image(image_identifier: str, scores: dict, eps, *, numerical_tolerance: float = 1e-06)[source]

Testing the scores calculated for one image of the HRF dataset with both assumptions on the region of evaluation (‘fov’/’all’)

Parameters:

image_identifier (str) – the identifier of the image (like “01_g”)
scores (dict(str,float)) – the scores to be tested (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float) – the numerical uncertainty
numerical_tolerance (float) – in practice, some further tolerance is applied beyond the numerical uncertainty of the scores. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

The summary of the results, with the following entries:

Return type:

dict

Examples

>>> from mlscorecheck.check.bundles.retina import check_hrf_vessel_image
>>> scores = {'acc': 0.5562, 'sens': 0.5049, 'spec': 0.5621}
>>> identifier = '13_h'
>>> k = 4
>>> results = check_hrf_vessel_image(scores=scores,
                                        eps=10**(-k),
                                        image_identifier=identifier)
>>> results['inconsistency']
# {'inconsistency_fov': False, 'inconsistency_all': True}
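
As a rough illustration of what testing both assumptions means, the sketch below enumerates all confusion matrices for two hypothetical pixel counts, one per assumption. The counts, the helper `is_consistent` and the restriction to ‘acc’, ‘sens’ and ‘spec’ are assumptions made for this example; the real function looks up the image statistics itself and supports many more scores.

```python
def is_consistent(p, n, scores, eps, numerical_tolerance=1e-6):
    """Return True if some integer (tp, tn) reproduces all scores within eps."""
    tol = eps + numerical_tolerance
    for tp in range(p + 1):
        if abs(tp / p - scores["sens"]) > tol:
            continue  # sensitivity already rules out this tp
        for tn in range(n + 1):
            spec_ok = abs(tn / n - scores["spec"]) <= tol
            acc_ok = abs((tp + tn) / (p + n) - scores["acc"]) <= tol
            if spec_ok and acc_ok:
                return True
    return False


# Hypothetical positive/negative pixel counts under the two assumptions
counts = {"fov": {"p": 50, "n": 200}, "all": {"p": 50, "n": 450}}
scores = {"acc": 0.72, "sens": 0.6, "spec": 0.75}

inconsistency = {
    f"inconsistency_{name}": not is_consistent(c["p"], c["n"], scores, eps=1e-4)
    for name, c in counts.items()
}
```

With these made-up counts the scores can be reproduced only under the ‘fov’ assumption, because no integer number of true negatives out of 450 gives a specificity of 0.75 to four decimals.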

mlscorecheck.check.bundles.retina.check_hrf_vessel_image_assumption(image_identifier: str, assumption: str, scores: dict, eps, *, numerical_tolerance: float = 1e-06)[source]

Testing the scores calculated for one image of the HRF dataset using an assumption on the region of evaluation.

Parameters:

image_identifier (str) – the identifier of the image (like “01_g”)
assumption (str) – the assumption on the region of evaluation to test (‘fov’/’all’)
scores (dict(str,float)) – the scores to be tested (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float) – the numerical uncertainty
numerical_tolerance (float) – the additional numerical tolerance

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

mlscorecheck.check.bundles.retina.check_hrf_vessel_aggregated(imageset, scores: dict, eps, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict[source]

Testing the scores calculated for the HRF dataset with both assumptions on the region of evaluation (‘fov’/‘all’) and both aggregation methods (‘mean of scores’, ‘score of means’).

Parameters:

imageset (str|list) – ‘all’ or the list of identifiers of images (e.g. [‘13_h’, ‘01_g’])
scores (dict(str,float)) – the scores to be tested
eps (float) – the numerical uncertainty
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, some further tolerance is applied beyond the numerical uncertainty of the scores. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

The summary of the results, with the following entries:

Return type:

dict

Examples

>>> from mlscorecheck.check.bundles.retina import check_hrf_vessel_aggregated
>>> scores = {'acc': 0.4841, 'sens': 0.5665, 'spec': 0.475}
>>> k = 4
>>> results = check_hrf_vessel_aggregated(scores=scores,
                                            eps=10**(-k),
                                            imageset='all',
                                            verbosity=0)
>>> results['inconsistency']
# {'inconsistency_fov_mos': False,
# 'inconsistency_fov_som': True,
# 'inconsistency_all_mos': False,
# 'inconsistency_all_som': True}
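
The two aggregation methods generally give different values for the same per-image results, which is why they are tested separately. A minimal sketch with made-up confusion counts for two images (the counts and the helper `accuracy` are illustrative only):

```python
# Hypothetical per-image counts: true positives, true negatives, positives, negatives
images = [
    {"tp": 40, "tn": 900, "p": 50, "n": 1000},
    {"tp": 10, "tn": 450, "p": 100, "n": 500},
]


def accuracy(tp, tn, p, n):
    return (tp + tn) / (p + n)


# Mean of scores (MoS): compute the score per image, then average the scores
mos_acc = sum(accuracy(**img) for img in images) / len(images)

# Score of means (SoM): pool the counts first, then compute one score
pooled = {key: sum(img[key] for img in images) for key in ("tp", "tn", "p", "n")}
som_acc = accuracy(**pooled)
```

Here the MoS accuracy is about 0.8310 and the SoM accuracy is about 0.8485, so a reported value may be consistent with one aggregation and not the other.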

mlscorecheck.check.bundles.retina.check_hrf_vessel_aggregated_mos_assumption(imageset, assumption: str, scores: dict, eps, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict[source]

Checking the consistency of scores calculated for some images of the HRF dataset with the mean of scores aggregation and an assumption on the region of evaluation.

Parameters:

imageset (str|list) – ‘all’ or the list of identifiers of images (e.g. [‘13_h’, ‘01_g’])
assumption (str) – the assumption on the region of evaluation to test (‘fov’/’all’)
scores (dict) – the scores to check (supports only ‘acc’, ‘sens’, ‘spec’, ‘bacc’). Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, some further tolerance is applied beyond the numerical uncertainty of the scores. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

mlscorecheck.check.bundles.retina.check_hrf_vessel_aggregated_som_assumption(imageset, assumption: str, scores: dict, eps, numerical_tolerance=1e-06)[source]

Tests the consistency of scores calculated on the HRF dataset using the score-of-means aggregation and an assumption on the region of evaluation.

Parameters:

imageset (str|list) – ‘all’ or the list of identifiers of images (e.g. [‘13_h’, ‘01_g’])
assumption (str) – the assumption on the region of evaluation to test (‘fov’/’all’)
scores (dict) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – in practice, some further tolerance is applied beyond the numerical uncertainty of the scores. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict
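
Since eps accepts either one float or a dictionary with one entry per score, scores reported with different precision can each get their own uncertainty. The sketch below derives eps as 10**(-k) from the number of reported decimals k, as the examples in this section do. The reported values and the helper `decimals` are made up, and whether 10**(-k) or half of it is appropriate depends on whether the published values were truncated or rounded.

```python
# Hypothetical scores as they might appear in a paper, kept as text to preserve precision
reported = {"acc": "0.4841", "sens": "0.57", "spec": "0.475"}


def decimals(text):
    """Number of digits after the decimal point in a reported value."""
    return len(text.split(".")[1]) if "." in text else 0


scores = {name: float(text) for name, text in reported.items()}
# One uncertainty per score: 10**(-k) for k reported decimals
eps = {name: 10 ** (-decimals(text)) for name, text in reported.items()}
```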

CHASE_DB1

mlscorecheck.check.bundles.retina.check_chasedb1_vessel_image(image_identifier: str, annotator: str, scores: dict, eps, *, numerical_tolerance: float = 1e-06)[source]

Testing the scores calculated for one image of the CHASEDB1 dataset

Parameters:

image_identifier (str) – the identifier of the image (like “11R”)
annotator (str) – the annotation to use (‘manual1’/’manual2’)
scores (dict(str,float)) – the scores to be tested (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float) – the numerical uncertainty
numerical_tolerance (float) – in practice, some further tolerance is applied beyond the numerical uncertainty of the scores. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Examples

>>> from mlscorecheck.check.bundles.retina import check_chasedb1_vessel_image
>>> img_identifier = '11R'
>>> scores = {'acc': 0.4457, 'sens': 0.0051, 'spec': 0.4706}
>>> results = check_chasedb1_vessel_image(image_identifier=img_identifier,
                                        annotator='manual1',
                                        scores=scores,
                                        eps=1e-4)
>>> results['inconsistency']
# False

mlscorecheck.check.bundles.retina.check_chasedb1_vessel_aggregated(imageset, annotator: str, scores: dict, eps, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict[source]

Testing the scores calculated for the CHASEDB1 dataset with both assumptions on the mode of aggregation.

Parameters:

imageset (str|list) – ‘all’ if all images are used, or a list of identifiers of images (e.g. [‘11R’, ‘07L’])
annotator (str) – the annotation to be used (‘manual1’/’manual2’)
scores (dict(str,float)) – the scores to be tested
eps (float) – the numerical uncertainty
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, some further tolerance is applied beyond the numerical uncertainty of the scores. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

The summary of the results, with the following entries:

Return type:

dict

Examples

>>> from mlscorecheck.check.bundles.retina import check_chasedb1_vessel_aggregated
>>> scores = {'acc': 0.5063, 'sens': 0.4147, 'spec': 0.5126}
>>> results = check_chasedb1_vessel_aggregated(imageset='all',
                                            annotator='manual1',
                                            scores=scores,
                                            eps=1e-4,
                                            verbosity=0)
>>> results['inconsistency']
# {'inconsistency_mos': False, 'inconsistency_som': True}

mlscorecheck.check.bundles.retina.check_chasedb1_vessel_aggregated_mos(imageset, annotator: str, scores: dict, eps, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict[source]

Checking the consistency of scores calculated for some images of the CHASEDB1 dataset with the mean of scores aggregation.

Parameters:

imageset (str|list) – ‘all’ if all images are used, or a list of identifiers of images (e.g. [‘11R’, ‘07L’])
annotator (str) – the annotation to be used (‘manual1’/’manual2’)
scores (dict) – the scores to check (supports only ‘acc’, ‘sens’, ‘spec’, ‘bacc’)
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, some further tolerance is applied beyond the numerical uncertainty of the scores. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict
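
The score_bounds argument narrows the mean-of-scores test by restricting the scores allowed for each individual image. A minimal sketch of that idea, assuming the bounds are inclusive (lower, upper) pairs keyed by score name; the helper `within_bounds` and all values are hypothetical and do not reproduce the linear programming formulation used by the library:

```python
# Hypothetical bounds: every image must reach at least 0.5 accuracy and 0.2 sensitivity
score_bounds = {"acc": (0.5, 1.0), "sens": (0.2, 1.0)}


def within_bounds(image_scores, score_bounds):
    """Check that each bounded score of one image lies inside its (lower, upper) pair."""
    return all(
        lower <= image_scores[name] <= upper
        for name, (lower, upper) in score_bounds.items()
        if name in image_scores
    )


ok_image = within_bounds({"acc": 0.91, "sens": 0.35, "spec": 0.97}, score_bounds)
bad_image = within_bounds({"acc": 0.91, "sens": 0.05, "spec": 0.97}, score_bounds)
```

An aggregated score that can only be reached when some image falls outside these bounds would then count as inconsistent.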

mlscorecheck.check.bundles.retina.check_chasedb1_vessel_aggregated_som(imageset, annotator, scores, eps, numerical_tolerance=1e-06)[source]

Tests the consistency of scores calculated on the CHASEDB1 dataset using the score-of-means aggregation.

Parameters:

imageset (str|list) – ‘all’ if all images are used, or a list of identifiers of images (e.g. [‘11R’, ‘07L’])
annotator (str) – the annotation to be used (‘manual1’/’manual2’)
scores (dict) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – in practice, some further tolerance is applied beyond the numerical uncertainty of the scores. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

DIARETDB0

mlscorecheck.check.bundles.retina.check_diaretdb0_class(subset: str, batch, class_name, scores: dict, eps, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict[source]

Testing the scores calculated for the DIARETDB0 dataset. The dataset is an image labeling dataset, where images are labeled by the lesions recognized in them. There are 5 different lesion labels, referred to as class_name in the arguments. The test treats the labeling of a certain lesion (class) as a binary classification problem: images with the label are positive samples and images without it are negative samples. Furthermore, there are multiple batches of train and test images (9); the list of batches used for the evaluation can be passed with the batch argument. The subset of the batches being evaluated is passed through the subset argument. The test assumes that the scores are aggregated across the batches and therefore executes the tests with both the SoM and MoS aggregation assumptions.

Parameters:

subset (str) – ‘train’/’test’
batch (str|list) – the list of batches used, ‘all’ for all batches, or a subset of [‘1’, ‘2’, …, ‘9’]
class_name (str|list) – the name of the class being evaluated (‘neovascularisation’| ‘hardexudates’|’softexudates’|’hemorrhages’|’redsmalldots’), a list if a list of classes is treated as positive
scores (dict(str,float)) – the scores to be tested
eps (float) – the numerical uncertainty
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, some further tolerance is applied beyond the numerical uncertainty of the scores. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

The summary of the results, with the following entries:

Return type:

dict

Examples

>>> from mlscorecheck.check.bundles.retina import check_diaretdb0_class
>>> scores = {'acc': 0.4271, 'sens': 0.406, 'spec': 0.4765}
>>> results = check_diaretdb0_class(subset='test',
                                    batch='all',
                                    class_name='hardexudates',
                                    scores=scores,
                                    eps=1e-4)
>>> results['inconsistency']
# {'inconsistency_som': True, 'inconsistency_mos': False}
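
The reduction of the multi-label problem to a binary one can be sketched as follows. The image names, their labels and the helper `binary_counts` are made up, and treating an image as positive when it carries any of the listed classes is an assumption about how a list-valued class_name is interpreted.

```python
# Hypothetical labeling of four images with the lesion names used by class_name
image_labels = {
    "image001": {"hardexudates", "redsmalldots"},
    "image002": set(),
    "image003": {"hemorrhages"},
    "image004": {"softexudates", "hardexudates"},
}


def binary_counts(image_labels, class_name):
    """Count positive and negative images for one class or a list of classes."""
    positive_classes = {class_name} if isinstance(class_name, str) else set(class_name)
    # Assumption: an image is positive if it has at least one of the positive classes
    p = sum(1 for labels in image_labels.values() if labels & positive_classes)
    return {"p": p, "n": len(image_labels) - p}


single = binary_counts(image_labels, "hardexudates")
combined = binary_counts(image_labels, ["hardexudates", "hemorrhages"])
```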

DIARETDB1

mlscorecheck.check.bundles.retina.check_diaretdb1_class(*, subset: str, class_name, confidence: float, scores: dict, eps, numerical_tolerance: float = 1e-06) → dict[source]

Tests the scores describing the labeling of images in DIARETDB1. The problem is a multi-labeling problem; this test function supports the testing of binary subproblems (for example, the ‘hardexudates’ class being treated as the positive label).

Parameters:

subset (str) – the subset to be used (‘train’/’test’), typically ‘test’
class_name (str|list) – the name or list of names of classes used as “positive”
confidence (float) – the confidence threshold, typically 0.75
scores (dict(str,float)) – the scores to be tested (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float) – the numerical uncertainty
numerical_tolerance (float) – in practice, some further tolerance is applied beyond the numerical uncertainty of the scores. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Examples

>>> from mlscorecheck.check.bundles.retina import check_diaretdb1_class
>>> scores = {'acc': 0.3115, 'sens': 1.0, 'spec': 0.0455, 'f1p': 0.4474}
>>> results = check_diaretdb1_class(subset='test',
                        class_name=['hardexudates', 'softexudates'],
                        confidence=0.75,
                        scores=scores,
                        eps=1e-4)
>>> results['inconsistency']
# False

mlscorecheck.check.bundles.retina.check_diaretdb1_segmentation_image(*, image_identifier: str, class_name, confidence: float, scores: dict, eps, numerical_tolerance: float = 1e-06) → dict[source]

Tests the scores describing the segmentation of images in DIARETDB1. This test function supports the testing of binary subproblems (for example, the pixels of the ‘hardexudates’ class being segmented in an image). The test evaluates both assumptions of using the FoV or all pixels for evaluation.

Parameters:

image_identifier (str) – the identifier of the image to be tested (e.g. ‘001’)
class_name (str|list) – the name or list of names of classes used as “positive”
confidence (float) – the confidence threshold, typically 0.75
scores (dict(str,float)) – the scores to be tested (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float) – the numerical uncertainty
numerical_tolerance (float) – in practice, some further tolerance is applied beyond the numerical uncertainty of the scores. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

The summary of the results, with the following entries:

Return type:

dict

Examples

>>> from mlscorecheck.check.bundles.retina import check_diaretdb1_segmentation_image
>>> scores = {'acc': 0.5753, 'sens': 0.0503, 'spec': 0.6187, 'f1p': 0.0178}
>>> results = check_diaretdb1_segmentation_image(image_identifier='005',
                        class_name=['hardexudates', 'softexudates'],
                        confidence=0.75,
                        scores=scores,
                        eps=1e-4)
>>> results['inconsistency']
# {'inconsistency_fov': True, 'inconsistency_all': False}

mlscorecheck.check.bundles.retina.check_diaretdb1_segmentation_aggregated(*, subset: str, class_name, confidence: float, only_valid: bool, scores: dict, eps, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict[source]

Tests the scores describing the segmentation of multiple images of DIARETDB1 in an aggregated way. This test function supports the testing of binary subproblems (for example, the pixels of the ‘hardexudates’ class being segmented in an image). The test evaluates both assumptions on the region of evaluation.

Parameters:

subset (str|list) – the subset of images to be used (‘train’/’test’) or the list of image identifiers to be tested (e.g. ‘001’)
class_name (str|list) – the name or list of names of classes used as “positive”
confidence (float) – the confidence threshold, typically 0.75
only_valid (bool) – if True, works with the subset of images where both positives and negatives are present (e.g. images where the class class_name=‘hardexudates’ is not present at the confidence=0.75 level are discarded). If False, and sensitivity is specified in scores while one of the images has 0 positives, the MoS test cannot be executed
scores (dict(str,float)) – the scores to be tested
eps (float) – the numerical uncertainty
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, some further tolerance is applied beyond the numerical uncertainty of the scores. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

The summary of the results, with the following entries:

Return type:

dict

Examples

>>> from mlscorecheck.check.bundles.retina import check_diaretdb1_segmentation_aggregated
>>> scores = {'acc': 0.7143, 'sens': 0.3775, 'spec': 0.7244}
>>> results = check_diaretdb1_segmentation_aggregated(subset='test',
                        class_name='hardexudates',
                        confidence=0.5,
                        only_valid=True,
                        scores=scores,
                        eps=1e-4)
>>> results['inconsistency']
# {'inconsistency_fov_som': True,
# 'inconsistency_all_som': True,
# 'inconsistency_fov_mos': False,
# 'inconsistency_all_mos': False}
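
The role of only_valid can be illustrated with a few made-up per-image pixel counts. Sensitivity is tp / p, so it is undefined for an image without positive pixels, and a mean of per-image sensitivities cannot be formed. The helper `select_images` and the counts below are hypothetical.

```python
# Hypothetical per-image counts of positive (p) and negative (n) pixels
images = [
    {"identifier": "a", "p": 120, "n": 5000},
    {"identifier": "b", "p": 0, "n": 5120},  # class not present in this image
    {"identifier": "c", "p": 300, "n": 4820},
]


def select_images(images, only_valid):
    """Keep all images, or only those with both positives and negatives."""
    if not only_valid:
        return list(images)
    return [img for img in images if img["p"] > 0 and img["n"] > 0]


valid_ids = [img["identifier"] for img in select_images(images, only_valid=True)]
all_ids = [img["identifier"] for img in select_images(images, only_valid=False)]
```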

DRISHTI_GS

mlscorecheck.check.bundles.retina.check_drishti_gs_segmentation_image(image_identifier: str, confidence: float, target: str, scores: dict, eps: float, *, numerical_tolerance: float = 1e-06)[source]

Testing the segmentation results on one image.

Parameters:

image_identifier (str) – the image identifier (e.g. ‘053’)
confidence (float) – the confidence level (in [0,1]), used for thresholding the soft segmentation ground truth image at confidence*255
target (str) – the target anatomical part (‘OD’/’OC’)
scores (dict) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – in practice, some further tolerance is applied beyond the numerical uncertainty of the scores. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Examples

>>> from mlscorecheck.check.bundles.retina import check_drishti_gs_segmentation_image
>>> scores = {'acc': 0.5966, 'sens': 0.3, 'spec': 0.6067, 'f1p': 0.0468}
>>> results = check_drishti_gs_segmentation_image(image_identifier='053',
                            confidence=0.75,
                            target='OD',
                            scores=scores,
                            eps=1e-4)
>>> results['inconsistency']
# False
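
The effect of the confidence parameter on the ground truth can be pictured with a tiny made-up soft segmentation. The pixel values, the helper `binarize` and the use of a strict comparison are assumptions; the documentation only states that the soft ground truth is thresholded at the confidence level times 255.

```python
# Hypothetical 2x3 soft ground truth with 8-bit intensities
soft_ground_truth = [
    [0, 64, 128],
    [191, 192, 255],
]


def binarize(soft, confidence):
    """Mark pixels above confidence * 255 as positive (strict '>' is an assumption)."""
    threshold = confidence * 255
    return [[pixel > threshold for pixel in row] for row in soft]


mask = binarize(soft_ground_truth, confidence=0.75)
p = sum(sum(row) for row in mask)        # number of positive pixels
n = sum(len(row) for row in mask) - p    # number of negative pixels
```

A different confidence changes p and n, and therefore which reported scores are attainable.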

mlscorecheck.check.bundles.retina.check_drishti_gs_segmentation_aggregated(subset: str, confidence: float, target: str, scores: dict, eps: float, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06)[source]

Testing the scores shared for a set of images with both the MoS and SoM aggregations.

Parameters:

subset (str|list) – the subset (‘test’/’train’) or the list of identifiers, e.g. [‘053’, ‘086’]
confidence (float) – the confidence level (in [0,1]), used for thresholding the soft segmentation ground truth image at confidence*255
target (str) – the target anatomical part (‘OD’/’OC’)
scores (dict(str,float)) – the scores to be tested
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, some further tolerance is applied beyond the numerical uncertainty of the scores. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

The summary of the results, with the following entries:

Return type:

dict

Examples

>>> from mlscorecheck.check.bundles.retina import check_drishti_gs_segmentation_aggregated
>>> scores = {'acc': 0.4767, 'sens': 0.4845, 'spec': 0.4765, 'f1p': 0.0512}
>>> results = check_drishti_gs_segmentation_aggregated(subset='test',
                            confidence=0.75,
                            target='OD',
                            scores=scores,
                            eps=1e-4)
>>> results['inconsistency']
# {'inconsistency_som': False, 'inconsistency_mos': False}

Preterm delivery prediction by EHG signals

The test bundle dedicated to the testing of electrohysterogram data.

mlscorecheck.check.bundles.ehg.check_tpehg(scores: dict, eps, n_folds: int, n_repeats: int, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict[source]

Checks the cross-validated TPEHG scores

Parameters:

scores (dict(str,float)) – the dictionary of scores (supports only ‘acc’, ‘sens’, ‘spec’, ‘bacc’). Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainties
n_folds (int) – the number of folds
n_repeats (int) – the number of repetitions
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the folds
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity level of the pulp linear programming solver, 0: silent, non-zero: verbose.
numerical_tolerance (float) – in practice, some further tolerance is applied beyond the numerical uncertainty of the scores. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Examples

>>> from mlscorecheck.check.bundles.ehg import check_tpehg
>>> # the 5-fold cross-validation scores reported in the paper
>>> scores = {'acc': 0.9447, 'sens': 0.9139, 'spec': 0.9733}
>>> eps = 0.0001
>>> results = check_tpehg(scores=scores,
                            eps=eps,
                            n_folds=5,
                            n_repeats=1)
>>> results['inconsistency']
# True
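
To make n_folds and n_repeats concrete, the sketch below lists the per-fold class counts that a repeated k-fold evaluation would produce if the folds were stratified as evenly as possible. The stratification, the helper `fold_counts` and the class counts are all assumptions for illustration; they are not taken from the TPEHG dataset or from the library.

```python
def fold_counts(p, n, n_folds, n_repeats):
    """Split p positives and n negatives as evenly as possible into folds."""
    folds = []
    for _ in range(n_repeats):
        for i in range(n_folds):
            folds.append({
                # the first (p % n_folds) folds receive one extra positive sample
                "p": p // n_folds + (1 if i < p % n_folds else 0),
                "n": n // n_folds + (1 if i < n % n_folds else 0),
            })
    return folds


# Hypothetical class counts
folds = fold_counts(p=42, n=258, n_folds=5, n_repeats=1)
repeated = fold_counts(p=42, n=258, n_folds=5, n_repeats=2)
```

Each fold then contributes one set of scores, and the reported cross-validated values are checked against what such folds can produce.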

Skin lesion classification

The test bundle dedicated to the testing of skin lesion classification.

ISIC2016

mlscorecheck.check.bundles.skinlesion.check_isic2016(*, scores: dict, eps: float, numerical_tolerance: float = 1e-06)[source]

Tests if the scores are consistent with the test set of the ISIC2016 melanoma classification dataset

Parameters:

scores (dict) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – in practice, some further tolerance is applied beyond the numerical uncertainty of the scores. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The dictionary includes the following keys:

Return type:

dict

Examples

>>> from mlscorecheck.check.bundles.skinlesion import check_isic2016
>>> scores = {'acc': 0.7916, 'sens': 0.2933, 'spec': 0.9145}
>>> results = check_isic2016(scores=scores, eps=1e-4)
>>> results['inconsistency']
# False

ISIC2017

mlscorecheck.check.bundles.skinlesion.check_isic2017(*, target, against, scores: dict, eps: float, numerical_tolerance: float = 1e-06)[source]

Tests if the scores are consistent with the test set of the ISIC2017 skin lesion classification dataset. The dataset contains three classes; the test covers the binary classification aspect of the problem, where one (or two) of the classes is classified against the other two (or one).

Parameters:

target (str|list) – the target (positive) class(es), with the encoding ‘M’ for melanoma, ‘SK’ for seborrheic keratosis and ‘N’ for nevus.
against (str|list) – specification of the negative classes, with the encoding ‘M’ for melanoma, ‘SK’ for seborrheic keratosis and ‘N’ for nevus.
scores (dict) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’; synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’; and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – in practice, some further tolerance is applied beyond the numerical uncertainty of the scores. This tolerance is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

A dictionary containing the results of the consistency check. The flag under the key ‘inconsistency’ indicates whether an inconsistency was identified.

Return type:

dict

Examples

>>> from mlscorecheck.check.bundles.skinlesion import check_isic2017
>>> scores = {'acc': 0.6183, 'sens': 0.4957, 'ppv': 0.2544, 'f1p': 0.3362}
>>> results = check_isic2017(target='M',
                    against=['SK', 'N'],
                    scores=scores,
                    eps=1e-4)
>>> results['inconsistency']
# False
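
The target and against arguments reduce the three-class problem to a binary one. A minimal sketch of that reduction follows; the helper name `binary_counts` is hypothetical, and the per-class counts (117 melanoma, 90 seborrheic keratosis, 393 nevus) are an assumption made for illustration, as this section does not list them.

```python
# assumed per-class counts of the test set (not stated in this section)
CLASS_COUNTS = {'M': 117, 'SK': 90, 'N': 393}


def binary_counts(target, against, counts=CLASS_COUNTS):
    """Map target/against class codes to the p and n of a binary problem."""
    target = [target] if isinstance(target, str) else list(target)
    against = [against] if isinstance(against, str) else list(against)
    if set(target) & set(against):
        raise ValueError('target and against must not overlap')
    return {'p': sum(counts[label] for label in target),
            'n': sum(counts[label] for label in against)}


# melanoma against the other two classes
testset = binary_counts('M', ['SK', 'N'])
```

Once p and n are known, the check proceeds as for any single binary test set.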

Experiments (`experiments`)

The predefined dataset and experiment statistics to look up are stored in the mlscorecheck.experiments module.

mlscorecheck.experiments.load_ml_datasets()[source]: Load the ML datasets

mlscorecheck.experiments.lookup_dataset(dataset: str) → dict[source]

Look up a dataset

Parameters: dataset (str) – the dataset to look up
Returns: the count statistics of the dataset
Return type: dict

mlscorecheck.experiments.load_drive() → dict[source]

Loads the drive experiments

Returns: the drive experiments
Return type: dict

The core modules

Score functions (`scores`)

mlscorecheck.scores.accuracy(*, tp, tn, p, n)[source]

The accuracy score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.error_rate(*, fp, fn, p, n)[source]

The error_rate score

Parameters:

fp (int|float|Interval|IntervalUnion) – The number of false positives
fn (int|float|Interval|IntervalUnion) – The number of false negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion
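
As a worked illustration of the two signatures above, the sketch below defines local stand-ins with the same keyword-only arguments, using the standard definitions of accuracy and error rate. These are not the library functions themselves, and the counts are made up for the example.

```python
def accuracy(*, tp, tn, p, n):
    """Fraction of correctly classified items."""
    return (tp + tn) / (p + n)


def error_rate(*, fp, fn, p, n):
    """Fraction of misclassified items."""
    return (fp + fn) / (p + n)


# illustrative counts: 10 positives, 20 negatives
p, n, tp, tn = 10, 20, 8, 15
fp, fn = n - tn, p - tp  # the remaining cells of the confusion matrix

acc = accuracy(tp=tp, tn=tn, p=p, n=n)
err = error_rate(fp=fp, fn=fn, p=p, n=n)
```

The two values are complements of each other, which is why a reported error rate carries the same information as a reported accuracy.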

mlscorecheck.scores.sensitivity(*, tp, p)[source]

The sensitivity score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
p (int|float|Interval|IntervalUnion) – The number of positives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.false_negative_rate(*, fn, p)[source]

The false_negative_rate score

Parameters:

fn (int|float|Interval|IntervalUnion) – The number of false negatives
p (int|float|Interval|IntervalUnion) – The number of positives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.false_positive_rate(*, fp, n)[source]

The false_positive_rate score

Parameters:

fp (int|float|Interval|IntervalUnion) – The number of false positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.specificity(*, tn, n)[source]

The specificity score

Parameters:

tn (int|float|Interval|IntervalUnion) – The number of true negatives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.positive_predictive_value(*, tp, fp)[source]

The positive_predictive_value score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
fp (int|float|Interval|IntervalUnion) – The number of false positives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.false_discovery_rate(*, tp, fp)[source]

The false_discovery_rate score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
fp (int|float|Interval|IntervalUnion) – The number of false positives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.false_omission_rate(*, tn, fn)[source]

The false_omission_rate score

Parameters:

tn (int|float|Interval|IntervalUnion) – The number of true negatives
fn (int|float|Interval|IntervalUnion) – The number of false negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.negative_predictive_value(*, tn, fn)[source]

The negative_predictive_value score

Parameters:

tn (int|float|Interval|IntervalUnion) – The number of true negatives
fn (int|float|Interval|IntervalUnion) – The number of false negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.f_beta_positive(*, tp, fp, p, beta_positive)[source]

The f_beta_positive score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
fp (int|float|Interval|IntervalUnion) – The number of false positives
p (int|float|Interval|IntervalUnion) – The number of positives
beta_positive (int|float|Interval|IntervalUnion) – the beta parameter

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.f_beta_negative(*, tn, fn, n, beta_negative)[source]

The f_beta_negative score

Parameters:

tn (int|float|Interval|IntervalUnion) – The number of true negatives
fn (int|float|Interval|IntervalUnion) – The number of false negatives
n (int|float|Interval|IntervalUnion) – The number of negatives
beta_negative (int|float|Interval|IntervalUnion) – the beta parameter

Returns:

the score

Return type:

int|float|Interval|IntervalUnion
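
The argument lists of the two f-beta functions can be read against the standard F-beta definition. The sketch below uses local stand-ins built from that textbook formula, with fn = p - tp and fp = n - tn substituted; the library's exact expressions are not shown in this section, so treat this as an assumption.

```python
def f_beta_positive(*, tp, fp, p, beta_positive):
    """F-beta of the positive class, written with tp, fp and p only."""
    beta2 = beta_positive ** 2
    # (1 + b^2) tp + b^2 fn + fp simplifies to tp + b^2 p + fp when fn = p - tp
    return (1 + beta2) * tp / (tp + beta2 * p + fp)


def f_beta_negative(*, tn, fn, n, beta_negative):
    """The same score with the negative class treated as the target."""
    beta2 = beta_negative ** 2
    return (1 + beta2) * tn / (tn + beta2 * n + fn)


# with beta = 1 the score reduces to the familiar F1 = 2 tp / (2 tp + fp + fn)
f1p = f_beta_positive(tp=8, fp=5, p=10, beta_positive=1)
f1n = f_beta_negative(tn=15, fn=2, n=20, beta_negative=1)
```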

mlscorecheck.scores.f1_positive(*, tp, fp, p)[source]

The f1_positive score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
fp (int|float|Interval|IntervalUnion) – The number of false positives
p (int|float|Interval|IntervalUnion) – The number of positives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.f1_negative(*, tn, fn, n)[source]

The f1_negative score

Parameters:

tn (int|float|Interval|IntervalUnion) – The number of true negatives
fn (int|float|Interval|IntervalUnion) – The number of false negatives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.unified_performance_measure(*, tp, tn, p, n)[source]

The unified_performance_measure score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.geometric_mean(*, tp, tn, p, n, sqrt=<built-in function sqrt>)[source]

The geometric_mean score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion
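
The role of the sqrt parameter is not documented in this section. One plausible reading, given that the arguments may be Interval objects, is that it lets the caller supply a square root matching the numeric type of the inputs. The sketch below illustrates that reading with a local stand-in and Python's Decimal type; it is an assumption, not the library's code.

```python
import math
from decimal import Decimal


def geometric_mean(*, tp, tn, p, n, sqrt=math.sqrt):
    """Geometric mean of sensitivity and specificity with a pluggable sqrt."""
    return sqrt((tp / p) * (tn / n))


# floats use the default square root
gm_float = geometric_mean(tp=8, tn=15, p=10, n=20)

# Decimal inputs bring their own square root
gm_decimal = geometric_mean(tp=Decimal(8), tn=Decimal(15),
                            p=Decimal(10), n=Decimal(20),
                            sqrt=lambda value: value.sqrt())
```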

mlscorecheck.scores.fowlkes_mallows_index(*, tp, fp, p, sqrt=<built-in function sqrt>)[source]

The fowlkes_mallows_index score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
fp (int|float|Interval|IntervalUnion) – The number of false positives
p (int|float|Interval|IntervalUnion) – The number of positives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.markedness(*, tp, tn, p, n)[source]

The markedness score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.positive_likelihood_ratio(*, tp, fp, p, n)[source]

The positive_likelihood_ratio score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
fp (int|float|Interval|IntervalUnion) – The number of false positives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.negative_likelihood_ratio(*, tn, fn, p, n)[source]

The negative_likelihood_ratio score

Parameters:

tn (int|float|Interval|IntervalUnion) – The number of true negatives
fn (int|float|Interval|IntervalUnion) – The number of false negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.matthews_correlation_coefficient(*, tp, tn, p, n, sqrt=<built-in function sqrt>)[source]

The matthews_correlation_coefficient score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion
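
Because fp = n - tn and fn = p - tp, the Matthews correlation coefficient can be expressed with the four arguments of the signature above. The following local stand-in uses the standard definition and is a sketch only; how the library handles a zero denominator is not covered here.

```python
import math


def matthews_correlation_coefficient(*, tp, tn, p, n, sqrt=math.sqrt):
    """Standard MCC written in terms of tp, tn, p and n."""
    fp, fn = n - tn, p - tp
    numerator = tp * tn - fp * fn
    # (tp + fn) equals p and (tn + fp) equals n
    denominator = sqrt((tp + fp) * p * n * (tn + fn))
    return numerator / denominator


mcc = matthews_correlation_coefficient(tp=8, tn=15, p=10, n=20)
mcc_perfect = matthews_correlation_coefficient(tp=10, tn=20, p=10, n=20)
```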

mlscorecheck.scores.bookmaker_informedness(*, tp, tn, p, n)[source]

The bookmaker_informedness score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.prevalence_threshold(*, tp, fp, p, n, sqrt=<built-in function sqrt>)[source]

The prevalence_threshold score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
fp (int|float|Interval|IntervalUnion) – The number of false positives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.diagnostic_odds_ratio(*, tp, tn, p, n)[source]

The diagnostic_odds_ratio score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.jaccard_index(*, tp, fp, p)[source]

The jaccard_index score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
fp (int|float|Interval|IntervalUnion) – The number of false positives
p (int|float|Interval|IntervalUnion) – The number of positives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.balanced_accuracy(*, tp, tn, p, n)[source]

The balanced_accuracy score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.cohens_kappa(*, tp, tn, p, n)[source]

The cohens_kappa score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.accuracy_standardized(*, tp, tn, p, n)[source]

The standardized accuracy score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.error_rate_standardized(*, tp, tn, p, n)[source]

The standardized error_rate score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.sensitivity_standardized(*, tp, p)[source]

The standardized sensitivity score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
p (int|float|Interval|IntervalUnion) – The number of positives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.false_negative_rate_standardized(*, tp, p)[source]

The standardized false_negative_rate score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
p (int|float|Interval|IntervalUnion) – The number of positives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.false_positive_rate_standardized(*, tn, n)[source]

The standardized false_positive_rate score

Parameters:

tn (int|float|Interval|IntervalUnion) – The number of true negatives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.specificity_standardized(*, tn, n)[source]

The standardized specificity score

Parameters:

tn (int|float|Interval|IntervalUnion) – The number of true negatives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.positive_predictive_value_standardized(*, tp, tn, n)[source]

The standardized positive_predictive_value score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.false_discovery_rate_standardized(*, tp, tn, n)[source]

The standardized false_discovery_rate score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.false_omission_rate_standardized(*, tp, tn, p)[source]

The standardized false_omission_rate score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.negative_predictive_value_standardized(*, tp, tn, p)[source]

The standardized negative_predictive_value score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.f_beta_positive_standardized(*, tp, tn, p, n, beta_positive)[source]

The standardized f_beta_positive score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives
beta_positive (int|float|Interval|IntervalUnion) – the beta parameter

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.f_beta_negative_standardized(*, tp, tn, p, n, beta_negative)[source]

The standardized f_beta_negative score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives
beta_negative (int|float|Interval|IntervalUnion) – the beta parameter

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.f1_positive_standardized(*, tp, tn, p, n)[source]

The standardized f1_positive score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.f1_negative_standardized(*, tp, tn, p, n)[source]

The standardized f1_negative score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.unified_performance_measure_standardized(*, tp, tn, p, n)[source]

The standardized unified_performance_measure score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.geometric_mean_standardized(*, tp, tn, p, n, sqrt=<built-in function sqrt>)[source]

The standardized geometric_mean score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.fowlkes_mallows_index_standardized(*, tp, tn, p, n, sqrt=<built-in function sqrt>)[source]

The standardized fowlkes_mallows_index score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.markedness_standardized(*, tp, tn, p, n)[source]

The standardized markedness score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.positive_likelihood_ratio_standardized(*, tp, tn, p, n)[source]

The standardized positive_likelihood_ratio score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.negative_likelihood_ratio_standardized(*, tp, tn, p, n)[source]

The standardized negative_likelihood_ratio score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.matthews_correlation_coefficient_standardized(*, tp, tn, p, n, sqrt=<built-in function sqrt>)[source]

The standardized matthews_correlation_coefficient score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.bookmaker_informedness_standardized(*, tp, tn, p, n)[source]

The standardized bookmaker_informedness score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.prevalence_threshold_standardized(*, tp, tn, p, n, sqrt=<built-in function sqrt>)[source]

The standardized prevalence_threshold score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.diagnostic_odds_ratio_standardized(*, tp, tn, p, n)[source]

The standardized diagnostic_odds_ratio score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.jaccard_index_standardized(*, tp, tn, p, n)[source]

The standardized jaccard_index score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.balanced_accuracy_standardized(*, tp, tn, p, n)[source]

The standardized balanced_accuracy score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

mlscorecheck.scores.cohens_kappa_standardized(*, tp, tn, p, n)[source]

The standardized cohens_kappa score

Parameters:

tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives

Returns:

the score

Return type:

int|float|Interval|IntervalUnion

Testing logic for individual scores (`individual`)

The main, low-level interface function of the module is check_scores_tptn_pairs.

mlscorecheck.individual.check_scores_tptn_pairs(p: int, n: int, scores: dict, eps, *, numerical_tolerance: float = 1e-06, solve_for: str = None, prefilter_by_pairs: bool = False) → dict[source]

Check scores by iteratively reducing the set of feasible tp, tn pairs.

Parameters:

p (int) – the number of positives
n (int) – the number of negatives
scores (dict) – the available reported scores
eps (float|dict(str,float)) – the numerical uncertainties for all scores or each score individually
numerical_tolerance (float) – the additional numerical tolerance
solve_for (str) – the figure to solve for (‘tp’ or ‘tn’); the other figure is iterated over. If None, the optimal one is used.
prefilter_by_pairs (bool) – whether to prefilter the tp and tn intervals by the pairwise solutions

Returns:

a summary of the results. When the inconsistency flag is True, the set of feasible tp, tn pairs is empty. The list under the key details holds the outcome of analyzing the scores one after the other. The key n_valid_tptn_pairs holds the number of tp, tn pairs compatible with all scores. The key prefiltering_details holds the results of the prefiltering based on the solutions for the score pairs.

Return type:

dict
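
The interplay of iterating over one figure and solving for the other can be sketched as follows. This is a simplified illustration, not the library's algorithm: it iterates over tp, supports only ‘acc’, ‘sens’ and ‘spec’, and the function name `solve_for_tn` and the key `feasible` are hypothetical. The counts p = 75 and n = 304 are assumed for the example.

```python
import math


def solve_for_tn(p, n, scores, eps, numerical_tolerance=1e-6):
    """Iterate over tp and narrow the feasible tn interval score by score."""
    tol = eps + numerical_tolerance
    feasible, n_valid = [], 0
    for tp in range(p + 1):
        # sensitivity depends on tp only, so it filters the iterated figure
        if 'sens' in scores and abs(tp / p - scores['sens']) > tol:
            continue
        lower, upper = 0.0, float(n)
        if 'acc' in scores:
            # acc = (tp + tn) / (p + n) solved for tn
            lower = max(lower, (scores['acc'] - tol) * (p + n) - tp)
            upper = min(upper, (scores['acc'] + tol) * (p + n) - tp)
        if 'spec' in scores:
            # spec = tn / n solved for tn
            lower = max(lower, (scores['spec'] - tol) * n)
            upper = min(upper, (scores['spec'] + tol) * n)
        tn_min, tn_max = math.ceil(lower), math.floor(upper)
        if tn_min <= tn_max:
            feasible.append((tp, tn_min, tn_max))
            n_valid += tn_max - tn_min + 1
    return {'inconsistency': n_valid == 0,
            'n_valid_tptn_pairs': n_valid,
            'feasible': feasible}


reported = {'acc': 0.7916, 'sens': 0.2933, 'spec': 0.9145}
result = solve_for_tn(75, 304, reported, eps=1e-4)
```

Solving for an interval of tn values per tp avoids visiting every (tp, tn) pair, which is the main appeal of this approach over a full enumeration.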

Testing logic for aggregated scores (`aggregated`)

The main, low-level interface function of the module is check_aggregated_scores.

mlscorecheck.aggregated.check_aggregated_scores(*, experiment: dict, scores: dict, eps, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) → dict[source]

Check aggregated scores

Parameters:

experiment (dict|Experiment) – the experiment specification
scores (dict) – the scores to match
eps (dict|float) – the numerical uncertainty
solver_name (str) – the name of the solver to be used; see pulp.listSolvers(onlyAvailable=True) for the list of available solvers
timeout (int) – the timeout of the solver in seconds
verbosity (int) – controls the verbosity level of the pulp based linear programming solver. 0: no output; non-zero: print output
numerical_tolerance (float) – in practice, some further tolerance is applied beyond the numerical uncertainty of the scores. This tolerance is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.

Returns:

the details of the test. The flag under the key ‘inconsistency’ indicates whether an inconsistency was identified.

Return type:

dict

Raises:

ValueError – if the problem is not specified properly
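
The function above relies on a pulp-based linear programming solver. For intuition only, the sketch below answers the same kind of question by brute force for a reported accuracy on a handful of tiny folds. The helper name `accuracy_feasible` is hypothetical, and the reading of ‘mos’ as the mean of the per-fold scores and ‘som’ as the score computed from the pooled figures is an assumption about the two aggregation codes.

```python
from itertools import product


def accuracy_feasible(fold_sizes, acc, eps, aggregation):
    """Check if some per-fold counts of correct items reproduce acc within eps.

    fold_sizes lists p + n for each fold; only tiny problems are practical.
    """
    if aggregation == 'som':
        # pooled figures: only the total number of correct items matters
        total = sum(fold_sizes)
        return any(abs(k / total - acc) <= eps for k in range(total + 1))
    if aggregation == 'mos':
        # mean of the per-fold accuracies: enumerate every combination
        for correct in product(*(range(size + 1) for size in fold_sizes)):
            mean_acc = sum(c / s for c, s in zip(correct, fold_sizes))
            mean_acc /= len(fold_sizes)
            if abs(mean_acc - acc) <= eps:
                return True
        return False
    raise ValueError(f'unknown aggregation: {aggregation}')


mos_ok = accuracy_feasible([10, 11], 0.85, 1e-3, 'mos')
som_ok = accuracy_feasible([10, 11], 0.85, 1e-3, 'som')
```

In this toy setup the same reported value is attainable under one aggregation and not under the other, which is why the aggregation mode has to be part of the experiment specification.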

class mlscorecheck.aggregated.Dataset(p: int = None, n: int = None, dataset_name: str = None, identifier: str = None)[source]

The abstract representation of a dataset

resolve_pn()[source]: Resolves the p and n values from the name of the dataset

to_dict() → dict[source]

Dictionary representation of the dataset

Returns: the dictionary representation of the dataset
Return type: dict

class mlscorecheck.aggregated.Folding(n_folds: int = None, n_repeats: int = None, folds: list = None, strategy: str = None)[source]

Abstract representation of a folding

generate_folds(dataset: Dataset, aggregation: str) → list[source]

Generates fold objects according to the folding

Parameters:

dataset (Dataset) – the dataset to generate folds for
aggregation (str) – the type of aggregation (‘mos’/‘som’)

Returns:

the list of fold objects

Return type:

list(Fold)

Raises:

ValueError – if the problem is not specified correctly

to_dict() → dict[source]

Dictionary representation of the folding

Returns: the representation of the folding
Return type: dict
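
How a folding turns dataset counts into per-fold counts depends on the strategy, which this section does not describe. As one simple possibility, the sketch below spreads the positives and negatives as evenly as possible across the folds. The function name `even_folds` is hypothetical, and the result may differ from what the library or a given stratification scheme produces.

```python
def even_folds(p, n, n_folds):
    """Split p positives and n negatives as evenly as possible into folds."""
    folds = []
    for idx in range(n_folds):
        folds.append({
            # the first (p % n_folds) folds receive one extra positive
            'p': p // n_folds + (1 if idx < p % n_folds else 0),
            'n': n // n_folds + (1 if idx < n % n_folds else 0),
            'identifier': f'fold_{idx}',
        })
    return folds


folds = even_folds(p=10, n=21, n_folds=3)
```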

class mlscorecheck.aggregated.Fold(p: int, n: int, identifier: str = None)[source]

Abstract representation of a fold

calculate_scores(rounding_decimals: int = None, score_subset: list = None) → dict[source]

Calculate the scores for the fold

Parameters:

rounding_decimals (int|None) – the number of decimals to round to
score_subset (list) – the subset of scores to calculate

Returns:

the scores

Return type:

dict

init_lp(scores: dict = None)[source]

Initialize a linear programming problem by creating the variables for the fold

Parameters: scores (dict|None) – the score values to be used to set initial values
Returns: the updated problem
Return type: pl.LpProblem

populate(lp_problem: LpProblem) → LpProblem[source]

Populate the fold with the tp and tn values from the linear program

Parameters: lp_problem (pl.LpProblem) – the linear programming problem
Returns: the self object populated with the tp and tn scores
Return type: obj

sample_figures(random_state=None)[source]

Samples the tp and tn figures

Parameters: random_state (None|int|np.random.RandomState) – the random state/seed to use
Returns: the self object after sampling
Return type: Fold

set_initial_values(scores)[source]

Sets the initial values for the tp and tn variables

Parameters: scores (dict) – the dictionary of scores

to_dict() → dict[source]

Dictionary representation of the fold

Returns: the dictionary representation
Return type: dict
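
The sampling and scoring methods of a fold can be pictured with a small stand-in class. `FoldSketch` below is hypothetical and far simpler than the documented class: it computes three scores only, and its random_state accepts None or an integer seed rather than a NumPy random state.

```python
import random
from dataclasses import dataclass


@dataclass
class FoldSketch:
    p: int
    n: int
    tp: int = None
    tn: int = None

    def sample_figures(self, random_state=None):
        """Draw tp and tn uniformly from their admissible ranges."""
        rng = random.Random(random_state)
        self.tp = rng.randint(0, self.p)
        self.tn = rng.randint(0, self.n)
        return self

    def calculate_scores(self, rounding_decimals=None):
        """Compute a few scores, optionally rounded like reported values."""
        scores = {'acc': (self.tp + self.tn) / (self.p + self.n),
                  'sens': self.tp / self.p,
                  'spec': self.tn / self.n}
        if rounding_decimals is not None:
            scores = {name: round(value, rounding_decimals)
                      for name, value in scores.items()}
        return scores


sampled = FoldSketch(p=10, n=20).sample_figures(random_state=5)
fixed_scores = FoldSketch(p=10, n=20, tp=8, tn=15).calculate_scores(4)
```

Sampling figures and rounding the resulting scores is a convenient way to generate consistent test cases for the checks.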

class mlscorecheck.aggregated.Evaluation(dataset: dict, folding: dict, aggregation: str, fold_score_bounds: dict = None)[source]

Abstract representation of an evaluation

calculate_scores(rounding_decimals: int = None, score_subset: list = None) → dict[source]

Calculates the scores

Parameters:

rounding_decimals (int|None) – the number of decimals to round the scores to
score_subset (list) – the list of scores to calculate scores for

Returns:

the calculated scores

Return type:

dict

check_bounds(numerical_tolerance: float = 1e-06) → dict[source]

Check the bounds in the problem

Parameters:

numerical_tolerance (float) – the additional numerical tolerance to be used

Returns:

a summary of the test, with the boolean flag under the key bounds_flag indicating the overall result

Return type:

dict
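
A bounds check of this kind can be sketched as follows. The format of the bounds, a mapping from score name to a (lower, upper) pair, and the meaning of the flag, True when every fold respects every bound, are assumptions made for illustration; the helper name `check_fold_bounds` and the key `details` are hypothetical.

```python
def check_fold_bounds(fold_scores, score_bounds, numerical_tolerance=1e-6):
    """Check every fold's scores against (lower, upper) bounds."""
    details = []
    for idx, scores in enumerate(fold_scores):
        for name, (lower, upper) in score_bounds.items():
            value = scores[name]
            within = (lower - numerical_tolerance <= value
                      <= upper + numerical_tolerance)
            details.append({'fold': idx, 'score': name,
                            'value': value, 'within_bounds': within})
    return {'bounds_flag': all(item['within_bounds'] for item in details),
            'details': details}


fold_scores = [{'acc': 0.9, 'sens': 0.8}, {'acc': 0.75, 'sens': 0.85}]
strict = check_fold_bounds(fold_scores, {'acc': (0.8, 1.0)})
loose = check_fold_bounds(fold_scores, {'acc': (0.7, 1.0)})
```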

init_lp(lp_problem: LpProblem, scores: dict = None) → LpProblem[source]

Initializes a linear programming problem

Parameters:

lp_problem (pl.LpProblem) – the linear programming problem to initialize
scores (dict(str,float)|None) – the scores used to estimate initial values

Returns:

the updated linear programming problem

Return type:

pl.LpProblem

populate(lp_problem: LpProblem)[source]

Populates the evaluation with the figures in the solved linear programming problem

Parameters: lp_problem (pl.LpProblem) – the linear programming problem with solve() executed
Returns: the updated self object
Return type: obj

sample_figures(random_state=None, score_subset: list = None)[source]

Samples the figures in the evaluation

Parameters: random_state (None|int|np.random.RandomState) – the random seed/state to use
Returns: the self object with the sampled figures
Return type: obj
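The random_state convention (None, an integer seed, or an existing generator) can be sketched with the standard library. Note the substitution: the library accepts np.random.RandomState, whereas this sketch uses random.Random as a stand-in to stay dependency-free. The helpers `make_rng` and `sample_fold_figures` are hypothetical.

```python
# Illustrative sketch only: uses stdlib random.Random in place of
# np.random.RandomState; not the mlscorecheck implementation.
import random

def make_rng(random_state=None):
    """Return a generator for None, an int seed, or an existing generator."""
    if isinstance(random_state, random.Random):
        return random_state  # reuse the caller's generator and its state
    return random.Random(random_state)  # None seeds from the system

def sample_fold_figures(folds, random_state=None):
    """Draw tp in [0, p] and tn in [0, n] uniformly for each fold."""
    rng = make_rng(random_state)
    return [{"p": fold["p"],
             "n": fold["n"],
             "tp": rng.randint(0, fold["p"]),
             "tn": rng.randint(0, fold["n"])}
            for fold in folds]

folds = [{"p": 10, "n": 20}, {"p": 5, "n": 40}]
first = sample_fold_figures(folds, random_state=5)
second = sample_fold_figures(folds, random_state=5)
```

Passing the same integer seed reproduces the same figures, while passing a shared generator lets several sampling calls advance one random stream.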

to_dict() → dict[source]

Returns the dictionary representation of the object

Returns: the dictionary representation
Return type: dict

class mlscorecheck.aggregated.Experiment(evaluations: list, aggregation: str, dataset_score_bounds: dict = None)[source]

Abstract representation of an experiment

calculate_scores(rounding_decimals: int = None, score_subset: list = None) → dict[source]

Calculates the scores

Parameters:

rounding_decimals (int|None) – the number of decimals to round the scores to
score_subset (list|None) – the subset of scores to return

Returns:

the scores

Return type:

dict(str,float)
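The effect of the rounding_decimals and score_subset parameters can be illustrated with a small post-processing helper. This is a sketch under the assumption that None means "no rounding" and "all scores" respectively; the function `select_and_round` is hypothetical and not part of mlscorecheck.

```python
# Illustrative sketch only: not the mlscorecheck implementation.

def select_and_round(scores, rounding_decimals=None, score_subset=None):
    """Restrict scores to a subset and round them to a number of decimals."""
    if score_subset is not None:
        # keep only the requested scores
        scores = {name: value for name, value in scores.items()
                  if name in score_subset}
    if rounding_decimals is not None:
        # round each remaining score
        scores = {name: round(value, rounding_decimals)
                  for name, value in scores.items()}
    return dict(scores)

raw = {"acc": 0.826666, "sens": 0.733333, "spec": 0.85}
reported = select_and_round(raw, rounding_decimals=2,
                            score_subset=["acc", "sens"])
```

Rounding to the number of decimals used in a publication mimics how reported scores lose precision, which is the uncertainty the consistency tests have to account for.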

check_bounds(numerical_tolerance: float = 1e-06) → dict[source]

Check the bounds in the problem

Parameters:

numerical_tolerance (float) – the additional numerical tolerance to be used

Returns:

a summary of the test, with the boolean flag under the bounds_flag key indicating the overall result

Return type:

dict

init_lp(lp_problem: LpProblem, scores: dict = None) → LpProblem[source]

Initializes a linear programming problem

Parameters:

lp_problem (pl.LpProblem) – the linear programming problem to initialize
scores (dict(str,float)|None) – the scores used to estimate initial values

Returns:

the updated linear programming problem

Return type:

pl.LpProblem

populate(lp_problem)[source]

Populates the experiment with the figures in the solved linear programming problem

Parameters: lp_problem (pl.LpProblem) – the linear programming problem with solve() executed
Returns: the updated self object
Return type: obj

sample_figures(random_state=None, score_subset: list = None)[source]

Samples the tp and tn figures

Parameters: random_state (None|int|np.random.RandomState) – the random seed/state to use
Returns: the sampled self object
Return type: obj

to_dict() → dict[source]

Returns a dictionary representation of the object

Returns: the dictionary representation of the object
Return type: dict
