The main interface
Consistency testing (check)
The test functions are implemented in the mlscorecheck.check module.
Binary classification
- mlscorecheck.check.binary.check_1_testset_no_kfold(testset: dict, scores: dict, eps, *, numerical_tolerance: float = 1e-06, prefilter_by_pairs: bool = True) -> dict
Use this check if the scores are calculated on one single test set with no kfolding. The test is performed by exhaustively testing all possible confusion matrices.
- Parameters:
testset (dict) – the specification of a testset with p, n or its name
scores (dict(str,float)) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty (potentially for each score)
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
prefilter_by_pairs (bool) – whether to do a prefiltering based on the score-pair tp-tn solutions (faster)
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency': A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, the provided scores are not consistent with the dataset.
'details': A list providing further details from the analysis of the scores one after the other.
'n_valid_tptn_pairs': The number of tp and tn pairs that are compatible with all scores.
'prefiltering_details': The results of the prefiltering by using the solutions for the score pairs.
'evidence': The evidence for satisfying the consistency constraints.
- Return type:
dict
- Raises:
ValueError – if the problem is not specified properly
Examples
>>> from mlscorecheck.check.binary import check_1_testset_no_kfold
>>> testset = {'p': 530, 'n': 902}
>>> scores = {'acc': 0.62, 'sens': 0.22, 'spec': 0.86, 'f1p': 0.3, 'fm': 0.32}
>>> result = check_1_testset_no_kfold(testset=testset, scores=scores, eps=1e-2)
>>> result['inconsistency']
# False

>>> testset = {'p': 530, 'n': 902}
>>> scores = {'acc': 0.92, 'sens': 0.22, 'spec': 0.86, 'f1p': 0.3, 'fm': 0.32}
>>> result = check_1_testset_no_kfold(testset=testset, scores=scores, eps=1e-2)
>>> result['inconsistency']
# True
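The idea of exhaustively testing confusion matrices can be illustrated with a minimal standalone sketch. It is not the library's implementation: the helper brute_force_check is hypothetical, it covers only ‘acc’, ‘sens’ and ‘spec’, and it applies no prefiltering or extra numerical tolerance. It simply enumerates every (tp, tn) pair for the given p and n and keeps those that reproduce all scores within eps.

```python
from itertools import product


def brute_force_check(p, n, scores, eps):
    """Hypothetical helper, not part of mlscorecheck.

    Enumerates every (tp, tn) pair and keeps those that reproduce
    all of the given scores within eps.
    """
    formulas = {
        "acc": lambda tp, tn: (tp + tn) / (p + n),
        "sens": lambda tp, tn: tp / p,
        "spec": lambda tp, tn: tn / n,
    }
    feasible = [
        (tp, tn)
        for tp, tn in product(range(p + 1), range(n + 1))
        if all(abs(formulas[name](tp, tn) - value) <= eps
               for name, value in scores.items())
    ]
    # The scores are inconsistent when no confusion matrix matches them.
    return len(feasible) == 0, feasible


consistent = {"acc": 0.62, "sens": 0.22, "spec": 0.86}
inconsistent = {"acc": 0.92, "sens": 0.22, "spec": 0.86}

flag_ok, pairs_ok = brute_force_check(530, 902, consistent, eps=1e-2)
flag_bad, pairs_bad = brute_force_check(530, 902, inconsistent, eps=1e-2)
print(flag_ok, flag_bad)
```

With sensitivity near 0.22 and specificity near 0.86, at most 905 of the 1432 items can be classified correctly, so an accuracy of 0.92 cannot be reached and the second call finds no feasible pair.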
- mlscorecheck.check.binary.check_1_dataset_kfold_som(dataset: dict, folding: dict, scores: dict, eps, *, numerical_tolerance: float = 1e-06, prefilter_by_pairs: bool = True) -> dict
This function checks the consistency of scores calculated by applying k-fold cross validation to a single dataset and aggregating the figures over the folds in the score of means fashion. The test is performed by exhaustively testing all possible confusion matrices.
- Parameters:
dataset (dict) – The dataset specification.
folding (dict) – The folding specification.
scores (dict(str,float)) – The scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – The numerical uncertainty(ies) of the scores.
numerical_tolerance (float, optional) – In practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity. Defaults to NUMERICAL_TOLERANCE.
prefilter_by_pairs (bool) – whether to do a prefiltering based on the score-pair tp-tn solutions (faster)
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency': A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, the provided scores are not consistent with the dataset.
'details': A list providing further details from the analysis of the scores one after the other.
'n_valid_tptn_pairs': The number of tp and tn pairs that are compatible with all scores.
'prefiltering_details': The results of the prefiltering by using the solutions for the score pairs.
'evidence': The evidence for satisfying the consistency constraints.
- Return type:
dict
- Raises:
ValueError – if the problem is not specified properly
Examples
>>> from mlscorecheck.check.binary import check_1_dataset_kfold_som
>>> dataset = {'dataset_name': 'common_datasets.monk-2'}
>>> folding = {'n_folds': 4, 'n_repeats': 3, 'strategy': 'stratified_sklearn'}
>>> scores = {'spec': 0.668, 'npv': 0.744, 'ppv': 0.667, 'bacc': 0.706, 'f1p': 0.703, 'fm': 0.704}
>>> result = check_1_dataset_kfold_som(dataset=dataset, folding=folding, scores=scores, eps=1e-3)
>>> result['inconsistency']
# False

>>> dataset = {'p': 10, 'n': 20}
>>> folding = {'n_folds': 5, 'n_repeats': 1}
>>> scores = {'acc': 0.428, 'npv': 0.392, 'bacc': 0.442, 'f1p': 0.391}
>>> result = check_1_dataset_kfold_som(dataset=dataset, folding=folding, scores=scores, eps=1e-3)
>>> result['inconsistency']
# True
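The difference between the two aggregation modes named in this module can be shown with a small sketch. It assumes the usual reading of the terms: score of means pools the confusion-matrix counts over the folds and computes the score once, while mean of scores computes the score in each fold and averages the results. The fold counts below are made up for illustration.

```python
# Made-up per-fold counts: p/n are class sizes, tp/tn are correct predictions.
folds = [
    {"p": 2, "n": 8, "tp": 1, "tn": 8},
    {"p": 8, "n": 2, "tp": 8, "tn": 0},
]

# Mean of scores (MoS): compute sensitivity per fold, then average.
mos_sens = sum(f["tp"] / f["p"] for f in folds) / len(folds)

# Score of means (SoM): pool the counts first, then compute sensitivity once.
total_tp = sum(f["tp"] for f in folds)
total_p = sum(f["p"] for f in folds)
som_sens = total_tp / total_p

print(mos_sens, som_sens)
```

Under this reading, a score-of-means figure depends only on the pooled tp and tn counts, which is why it can be tested by enumerating confusion matrices, whereas a mean-of-scores figure depends on how the counts are distributed over the folds.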
- mlscorecheck.check.binary.check_1_dataset_known_folds_mos(dataset: dict, folding: dict, scores: dict, eps, fold_score_bounds: dict = None, *, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) -> dict
This function checks the consistency of scores calculated by applying k-fold cross validation to a single dataset and aggregating the figures over the folds in the mean of scores fashion.
The test operates by constructing a linear program describing the experiment and checking its feasibility.
The test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. For a stronger test, one can add fold_score_bounds when, for example, the minimum and the maximum scores over the folds are also provided. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
- Parameters:
dataset (dict) – The dataset specification.
folding (dict) – The folding specification.
eps (float|dict(str,float)) – The numerical uncertainty(ies) of the scores.
fold_score_bounds (None|dict(str,tuple(float,float))) – Bounds on the scores in the folds.
solver_name (None|str) – The solver to use.
timeout (None|int) – The timeout for the linear programming solver in seconds.
verbosity (int) – The verbosity level of the pulp linear programming solver. 0: silent, non-zero: verbose.
numerical_tolerance (float) – In practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency': A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, the provided scores are not consistent with the experiment.
'lp_status': The status of the lp solver.
'lp_configuration_scores_match': A flag indicating if the scores from the lp configuration match the scores provided.
'lp_configuration_bounds_match': Indicates if the specified bounds match the actual figures.
'lp_configuration': Contains the actual configuration of the linear programming solver.
- Return type:
dict
- Raises:
ValueError – if the problem is not specified properly
Examples
>>> from mlscorecheck.check.binary import check_1_dataset_known_folds_mos
>>> dataset = {'p': 126, 'n': 131}
>>> folding = {'folds': [{'p': 52, 'n': 94}, {'p': 74, 'n': 37}]}
>>> scores = {'acc': 0.573, 'sens': 0.768, 'bacc': 0.662}
>>> result = check_1_dataset_known_folds_mos(dataset=dataset, folding=folding, scores=scores, eps=1e-3)
>>> result['inconsistency']
# False

>>> dataset = {'p': 398, 'n': 569}
>>> folding = {'n_folds': 4, 'n_repeats': 2, 'strategy': 'stratified_sklearn'}
>>> scores = {'acc': 0.9, 'spec': 0.9, 'sens': 0.6}
>>> result = check_1_dataset_known_folds_mos(dataset=dataset, folding=folding, scores=scores, eps=1e-2)
>>> result['inconsistency']
# True

>>> dataset = {'dataset_name': 'common_datasets.glass_0_1_6_vs_2'}
>>> folding = {'n_folds': 4, 'n_repeats': 2, 'strategy': 'stratified_sklearn'}
>>> scores = {'acc': 0.9, 'spec': 0.9, 'sens': 0.6, 'bacc': 0.1, 'f1': 0.95}
>>> result = check_1_dataset_known_folds_mos(dataset=dataset, folding=folding, fold_score_bounds={'acc': (0.8, 1.0)}, scores=scores, eps=1e-2, numerical_tolerance=1e-6)
>>> result['inconsistency']
# True
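To make the feasibility question concrete, the sketch below replaces the linear program with plain enumeration, which is only practical for toy fold sizes. It is an illustration under stated assumptions rather than the library's method: the helper mos_feasible and the fold sizes are hypothetical, and only ‘acc’, ‘sens’ and ‘spec’ are handled. It asks whether any assignment of integer tp and tn counts to the folds yields the reported mean-of-scores figures within eps.

```python
from itertools import product


def mos_feasible(folds, scores, eps):
    """Hypothetical brute-force stand-in for the linear program.

    Tries every per-fold (tp, tn) assignment and returns the first one
    whose fold-averaged scores match the targets within eps.
    """
    k = len(folds)
    tp_ranges = [range(f["p"] + 1) for f in folds]
    tn_ranges = [range(f["n"] + 1) for f in folds]
    for tps in product(*tp_ranges):
        for tns in product(*tn_ranges):
            means = {
                "acc": sum((tp + tn) / (f["p"] + f["n"])
                           for tp, tn, f in zip(tps, tns, folds)) / k,
                "sens": sum(tp / f["p"] for tp, f in zip(tps, folds)) / k,
                "spec": sum(tn / f["n"] for tn, f in zip(tns, folds)) / k,
            }
            if all(abs(means[name] - value) <= eps
                   for name, value in scores.items()):
                return True, (tps, tns)
    return False, None


toy_folds = [{"p": 5, "n": 5}, {"p": 3, "n": 7}]

ok, witness = mos_feasible(
    toy_folds, {"acc": 0.75, "sens": 0.733, "spec": 0.729}, eps=1e-3)
bad, _ = mos_feasible(
    toy_folds, {"acc": 0.95, "sens": 0.733, "spec": 0.729}, eps=1e-3)
print(ok, witness, bad)
```

In this toy case the sensitivity and specificity targets pin down a single assignment of counts, and that assignment gives a mean accuracy of 0.75, so the variant reporting 0.95 is infeasible.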
- mlscorecheck.check.binary.check_1_dataset_unknown_folds_mos(dataset: dict, folding: dict, scores: dict, eps, fold_score_bounds: dict = None, *, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) -> dict
This function checks the consistency of scores calculated in a k-fold cross validation on a single dataset, in a mean-of-scores fashion, without knowing the fold configuration. The function generates all possible fold configurations and tests the consistency of each. The scores are deemed inconsistent only if every fold configuration turns out to be inconsistent.
The test operates by constructing a linear program describing the experiment and checking its feasibility.
The test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. For a stronger test, one can add fold_score_bounds when, for example, the minimum and the maximum scores over the folds are also provided. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
Note that depending on the size of the dataset (especially the number of minority instances) and the folding configuration, this test might lead to an intractable number of problems to be solved. Use the function estimate_n_evaluations to get an upper bound estimate on the number of fold combinations. The evaluation of possible fold configurations stops when a feasible configuration is found.
- Parameters:
dataset (dict) – the dataset specification
folding (dict) – the folding specification
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
fold_score_bounds (None|dict(str,tuple(float,float))) – bounds on the scores in the folds
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity level of the pulp linear programming solver. 0: silent, non-zero: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency': A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, the provided scores are not consistent with the experiment.
'details': A list of dictionaries containing the details of the consistency tests. Each entry contains the specification of the folds being tested and the outcome of the check_1_dataset_known_folds_mos function.
- Return type:
dict
- Raises:
ValueError – if the problem is not specified properly
Examples
>>> from mlscorecheck.check.binary import check_1_dataset_unknown_folds_mos
>>> dataset = {'p': 126, 'n': 131}
>>> folding = {'n_folds': 2, 'n_repeats': 1}
>>> scores = {'acc': 0.573, 'sens': 0.768, 'bacc': 0.662}
>>> result = check_1_dataset_unknown_folds_mos(dataset=dataset, folding=folding, scores=scores, eps=1e-3)
>>> result['inconsistency']
# False

>>> dataset = {'p': 19, 'n': 97}
>>> folding = {'n_folds': 3, 'n_repeats': 1}
>>> scores = {'acc': 0.9, 'spec': 0.9, 'sens': 0.6}
>>> result = check_1_dataset_unknown_folds_mos(dataset=dataset, folding=folding, scores=scores, eps=1e-4)
>>> result['inconsistency']
# True
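Why the number of problems can grow quickly is easier to see with a rough sketch of how candidate fold configurations might be enumerated. This is not the library's enumeration: the helpers fold_sizes and candidate_foldings are hypothetical, fold sizes are assumed to differ by at most one item, and the sketch neither merges configurations that differ only in fold order nor discards folds in which a score would be undefined.

```python
from itertools import product


def fold_sizes(total, n_folds):
    """Split `total` items into n_folds sizes differing by at most one."""
    base, extra = divmod(total, n_folds)
    return [base + 1 if i < extra else base for i in range(n_folds)]


def candidate_foldings(p, n, n_folds):
    """Yield every way to place p positives into folds of fixed size."""
    sizes = fold_sizes(p + n, n_folds)
    for counts in product(*(range(size + 1) for size in sizes)):
        if sum(counts) == p:
            yield [{"p": c, "n": size - c} for c, size in zip(counts, sizes)]


configs = list(candidate_foldings(p=3, n=7, n_folds=2))
for config in configs:
    print(config)
```

Each yielded configuration has the same shape as the 'folds' list accepted in the folding specification of the known-folds test, so an unknown-folds check can be pictured as running that test on each candidate until one is feasible.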
- mlscorecheck.check.binary.check_n_testsets_mos_no_kfold(testsets: list, scores: dict, eps, testset_score_bounds: dict = None, *, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) -> dict
This function checks the consistency of scores calculated on multiple testsets with no k-fold and aggregating the figures over the testsets in the mean of scores fashion.
The test operates by constructing a linear program describing the experiment and checking its feasibility.
The test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. For a stronger test, one can add testset_score_bounds when, for example, the minimum and the maximum scores over the testsets are also provided. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
- Parameters:
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
testset_score_bounds (None|dict(str,tuple(float,float))) – the potential bounds on the scores in the testsets
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency': A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, the provided scores are not consistent with the experiment.
'lp_status': The status of the lp solver.
'lp_configuration_scores_match': A flag indicating if the scores from the lp configuration match the scores provided.
'lp_configuration_bounds_match': Indicates if the specified bounds match the actual figures.
'lp_configuration': Contains the actual configuration of the linear programming solver.
- Return type:
dict
- Raises:
ValueError – if the problem is not specified properly
Examples
>>> from mlscorecheck.check.binary import check_n_testsets_mos_no_kfold
>>> testsets = [{'p': 349, 'n': 50}, {'p': 478, 'n': 323}, {'p': 324, 'n': 83}, {'p': 123, 'n': 145}]
>>> scores = {'acc': 0.6441, 'sens': 0.6706, 'spec': 0.3796, 'bacc': 0.5251}
>>> result = check_n_testsets_mos_no_kfold(testsets=testsets, eps=1e-4, scores=scores)
>>> result['inconsistency']
# False

>>> scores['sens'] = 0.6756
>>> result = check_n_testsets_mos_no_kfold(testsets=testsets, eps=1e-4, scores=scores)
>>> result['inconsistency']
# True
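How score bounds strengthen a mean-of-scores test can be sketched on a toy case. The helper mos_acc_feasible below is hypothetical and uses enumeration instead of a linear program; it handles accuracy only, which is determined in each testset by the number of correct predictions, and it treats the bounds as a single closed interval applied to every testset.

```python
from itertools import product


def mos_acc_feasible(testsets, acc, eps, bounds=None):
    """Hypothetical brute-force check of a mean-of-scores accuracy.

    `bounds` is an optional (lower, upper) interval that the accuracy
    of every individual testset must fall into.
    """
    lower, upper = bounds if bounds is not None else (0.0, 1.0)
    sizes = [t["p"] + t["n"] for t in testsets]
    # Enumerate the number of correct predictions in each testset.
    for correct in product(*(range(size + 1) for size in sizes)):
        accs = [c / size for c, size in zip(correct, sizes)]
        if any(a < lower or a > upper for a in accs):
            continue
        if abs(sum(accs) / len(accs) - acc) <= eps:
            return True
    return False


toy_testsets = [{"p": 5, "n": 5}, {"p": 3, "n": 7}]

no_bounds = mos_acc_feasible(toy_testsets, acc=0.75, eps=1e-3)
loose_bounds = mos_acc_feasible(toy_testsets, acc=0.75, eps=1e-3, bounds=(0.7, 0.8))
tight_bounds = mos_acc_feasible(toy_testsets, acc=0.75, eps=1e-3, bounds=(0.9, 1.0))
print(no_bounds, loose_bounds, tight_bounds)
```

A mean accuracy of 0.75 is reachable on its own and with per-testset accuracies between 0.7 and 0.8, but not if every testset is claimed to score at least 0.9, which is the kind of contradiction the bounds arguments let the tests detect.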
- mlscorecheck.check.binary.check_n_testsets_som_no_kfold(testsets: list, scores: dict, eps, *, numerical_tolerance: float = 1e-06, prefilter_by_pairs: bool = True)
Checking the consistency of scores calculated by aggregating the figures over testsets in the score of means fashion, without k-folding.
The test is performed by exhaustively testing all possible confusion matrices.
- Parameters:
testsets (list(dict)) – the specification of the testsets
scores (dict(str,float)) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
prefilter_by_pairs (bool) – whether to prefilter the solution space by pair solutions when possible to speed up the process
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency': A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, the provided scores are not consistent with the experiment.
'details': A list providing further details from the analysis of the scores one after the other.
'n_valid_tptn_pairs': The number of tp and tn pairs that are compatible with all scores.
'prefiltering_details': The results of the prefiltering by using the solutions for the score pairs.
'evidence': The evidence for satisfying the consistency constraints.
- Return type:
dict
- Raises:
ValueError – if the problem is not specified properly
Examples
>>> from mlscorecheck.check.binary import check_n_testsets_som_no_kfold
>>> testsets = [{'p': 405, 'n': 223}, {'p': 3, 'n': 422}, {'p': 109, 'n': 404}]
>>> scores = {'acc': 0.4719, 'npv': 0.6253, 'f1p': 0.3091}
>>> result = check_n_testsets_som_no_kfold(testsets=testsets, scores=scores, eps=1e-3)
>>> result['inconsistency']
# False

>>> scores['npv'] = 0.6263
>>> result = check_n_testsets_som_no_kfold(testsets=testsets, scores=scores, eps=1e-3)
>>> result['inconsistency']
# True
- mlscorecheck.check.binary.check_n_datasets_som_kfold_som(evaluations: list, scores: dict, eps, *, numerical_tolerance: float = 1e-06, prefilter_by_pairs: bool = True)
Checking the consistency of scores calculated by applying k-fold cross validation to multiple datasets and aggregating the figures over the folds and datasets in the score of means fashion. The test is performed by exhaustively testing all possible confusion matrices.
- Parameters:
evaluations (list(dict)) – the specification of the evaluations
scores (dict(str,float)) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
prefilter_by_pairs (bool) – whether to do a prefiltering based on the score-pair tp-tn solutions (faster)
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency': A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, the provided scores are not consistent with the dataset.
'details': A list providing further details from the analysis of the scores one after the other.
'n_valid_tptn_pairs': The number of tp and tn pairs that are compatible with all scores.
'prefiltering_details': The results of the prefiltering by using the solutions for the score pairs.
'evidence': The evidence for satisfying the consistency constraints.
- Return type:
dict
- Raises:
ValueError – if the problem is not specified properly
Examples
>>> from mlscorecheck.check.binary import check_n_datasets_som_kfold_som
>>> evaluation0 = {'dataset': {'p': 389, 'n': 630}, 'folding': {'n_folds': 5, 'n_repeats': 2, 'strategy': 'stratified_sklearn'}}
>>> evaluation1 = {'dataset': {'dataset_name': 'common_datasets.saheart'}, 'folding': {'n_folds': 5, 'n_repeats': 2, 'strategy': 'stratified_sklearn'}}
>>> evaluations = [evaluation0, evaluation1]
>>> scores = {'acc': 0.631, 'sens': 0.341, 'spec': 0.802, 'f1p': 0.406, 'fm': 0.414}
>>> result = check_n_datasets_som_kfold_som(scores=scores, evaluations=evaluations, eps=1e-3)
>>> result['inconsistency']
# False

>>> evaluation0 = {'dataset': {'p': 389, 'n': 630}, 'folding': {'n_folds': 5, 'n_repeats': 2, 'strategy': 'stratified_sklearn'}}
>>> evaluation1 = {'dataset': {'dataset_name': 'common_datasets.saheart'}, 'folding': {'n_folds': 5, 'n_repeats': 2, 'strategy': 'stratified_sklearn'}}
>>> evaluations = [evaluation0, evaluation1]
>>> scores = {'acc': 0.731, 'sens': 0.341, 'spec': 0.802, 'f1p': 0.406, 'fm': 0.414}
>>> result = check_n_datasets_som_kfold_som(scores=scores, evaluations=evaluations, eps=1e-3)
>>> result['inconsistency']
# True
- mlscorecheck.check.binary.check_n_datasets_mos_kfold_som(evaluations: list, scores: dict, eps, dataset_score_bounds: dict = None, *, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) -> dict
This function checks the consistency of scores calculated on multiple datasets with k-fold cross-validation, applying score of means aggregation over the folds and mean of scores aggregation over the datasets.
The test operates by constructing a linear program describing the experiment and checking its feasibility.
The test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. For a stronger test, one can add dataset_score_bounds when, for example, the minimum and the maximum scores over the datasets are also provided. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
- Parameters:
evaluations (list(dict)) – the list of evaluation specifications
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
dataset_score_bounds (None|dict(str,tuple(float,float))) – the potential bounds on the scores in the datasets
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency': A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, the provided scores are not consistent with the experiment.
'lp_status': The status of the lp solver.
'lp_configuration_scores_match': A flag indicating if the scores from the lp configuration match the scores provided.
'lp_configuration_bounds_match': Indicates if the specified bounds match the actual figures.
'lp_configuration': Contains the actual configuration of the linear programming solver.
- Return type:
dict
- Raises:
ValueError – if the problem is not specified properly
Examples
>>> from mlscorecheck.check.binary import check_n_datasets_mos_kfold_som
>>> evaluation0 = {'dataset': {'p': 39, 'n': 822}, 'folding': {'n_folds': 5, 'n_repeats': 3, 'strategy': 'stratified_sklearn'}}
>>> evaluation1 = {'dataset': {'dataset_name': 'common_datasets.winequality-white-3_vs_7'}, 'folding': {'n_folds': 5, 'n_repeats': 3, 'strategy': 'stratified_sklearn'}}
>>> evaluations = [evaluation0, evaluation1]
>>> scores = {'acc': 0.312, 'sens': 0.45, 'spec': 0.312, 'bacc': 0.381}
>>> result = check_n_datasets_mos_kfold_som(evaluations=evaluations, dataset_score_bounds={'acc': (0.0, 0.5)}, eps=1e-4, scores=scores)
>>> result['inconsistency']
# False

>>> evaluation0 = {'dataset': {'p': 39, 'n': 822}, 'folding': {'n_folds': 5, 'n_repeats': 3, 'strategy': 'stratified_sklearn'}}
>>> evaluation1 = {'dataset': {'dataset_name': 'common_datasets.winequality-white-3_vs_7'}, 'folding': {'n_folds': 5, 'n_repeats': 3, 'strategy': 'stratified_sklearn'}}
>>> evaluations = [evaluation0, evaluation1]
>>> scores = {'acc': 0.412, 'sens': 0.45, 'spec': 0.312, 'bacc': 0.381}
>>> result = check_n_datasets_mos_kfold_som(evaluations=evaluations, dataset_score_bounds={'acc': (0.5, 1.0)}, eps=1e-4, scores=scores)
>>> result['inconsistency']
# True
- mlscorecheck.check.binary.check_n_datasets_mos_known_folds_mos(evaluations: list, scores: dict, eps, dataset_score_bounds: dict = None, *, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) -> dict
This function checks the consistency of scores calculated by applying k-fold cross validation to N datasets and aggregating the figures over the folds and datasets in the mean of scores fashion.
The test operates by constructing a linear program describing the experiment and checking its feasibility.
The test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. For a stronger test, one can add dataset_score_bounds when, for example, the minimum and the maximum scores over the datasets are also provided. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
- Parameters:
evaluations (list) – The list of evaluation specifications.
eps (float|dict(str,float)) – The numerical uncertainty(ies) of the scores.
dataset_score_bounds (None|dict(str,tuple(float,float))) – Bounds on the scores for the datasets.
solver_name (None|str) – The solver to use.
timeout (None|int) – The timeout for the linear programming solver in seconds.
verbosity (int) – The verbosity level of the pulp linear programming solver. 0: silent, non-zero: verbose.
numerical_tolerance (float) – In practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency': A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, the provided scores are not consistent with the experiment.
'lp_status': The status of the lp solver.
'lp_configuration_scores_match': A flag indicating if the scores from the lp configuration match the scores provided.
'lp_configuration_bounds_match': Indicates if the specified bounds match the actual figures.
'lp_configuration': Contains the actual configuration of the linear programming solver.
- Return type:
dict
- Raises:
ValueError – if the problem is not specified properly
Examples
>>> from mlscorecheck.check.binary import check_n_datasets_mos_known_folds_mos
>>> evaluation0 = {'dataset': {'p': 118, 'n': 95}, 'folding': {'folds': [{'p': 22, 'n': 23}, {'p': 96, 'n': 72}]}}
>>> evaluation1 = {'dataset': {'p': 781, 'n': 423}, 'folding': {'folds': [{'p': 300, 'n': 200}, {'p': 481, 'n': 223}]}}
>>> evaluations = [evaluation0, evaluation1]
>>> scores = {'acc': 0.61, 'sens': 0.709, 'spec': 0.461, 'bacc': 0.585}
>>> result = check_n_datasets_mos_known_folds_mos(evaluations=evaluations, scores=scores, eps=1e-3)
>>> result['inconsistency']
# False

>>> evaluation0 = {'dataset': {'p': 118, 'n': 95}, 'folding': {'folds': [{'p': 22, 'n': 23}, {'p': 96, 'n': 72}]}}
>>> evaluation1 = {'dataset': {'p': 781, 'n': 423}, 'folding': {'folds': [{'p': 300, 'n': 200}, {'p': 481, 'n': 223}]}}
>>> evaluations = [evaluation0, evaluation1]
>>> scores = {'acc': 0.71, 'sens': 0.709, 'spec': 0.461}
>>> result = check_n_datasets_mos_known_folds_mos(evaluations=evaluations, scores=scores, eps=1e-3)
>>> result['inconsistency']
# True
- mlscorecheck.check.binary.check_n_datasets_mos_unknown_folds_mos(evaluations: list, scores: dict, eps, dataset_score_bounds: dict = None, *, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) -> dict
This function checks the consistency of scores calculated in k-fold cross validation on multiple datasets, in a mean-of-scores fashion, without knowing the fold configurations. The function generates all possible fold configurations and tests the consistency of each. The scores are deemed inconsistent only if every fold configuration turns out to be inconsistent.
The test operates by constructing a linear program describing the experiment and checking its feasibility.
The test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. For a stronger test, one can add dataset_score_bounds when, for example, the minimum and the maximum scores over the datasets are also provided. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
Note that depending on the size of the dataset (especially the number of minority instances) and the folding configuration, this test might lead to an intractable number of problems to be solved. Use the function estimate_n_experiments to get an upper bound estimate on the number of fold combinations. The evaluation of possible fold configurations stops when a feasible configuration is found.
- Parameters:
evaluations (list(dict)) – the list of evaluation specifications
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
dataset_score_bounds (None|dict(str,tuple(float,float))) – bounds on the scores in the datasets
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the pulp linear programming solver, 0: silent, non-zero: verbose
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'details':A list of dictionaries containing the details of the consistency tests. Each entry contains the specification of the folds being tested and the outcome of the check_n_datasets_known_folds_mos function.
- Return type:
- Raises:
ValueError – if the problem is not specified properly
Examples
>>> from mlscorecheck.check.binary import check_n_datasets_mos_unknown_folds_mos
>>> evaluation0 = {'dataset': {'p': 13, 'n': 73},
...                'folding': {'n_folds': 4, 'n_repeats': 1}}
>>> evaluation1 = {'dataset': {'p': 7, 'n': 26},
...                'folding': {'n_folds': 3, 'n_repeats': 1}}
>>> evaluations = [evaluation0, evaluation1]
>>> scores = {'acc': 0.357, 'sens': 0.323, 'spec': 0.362, 'bacc': 0.343}
>>> result = check_n_datasets_mos_unknown_folds_mos(evaluations=evaluations,
...                                                 scores=scores,
...                                                 eps=1e-3)
>>> result['inconsistency']
# False

>>> evaluation0 = {'dataset': {'p': 13, 'n': 73},
...                'folding': {'n_folds': 4, 'n_repeats': 1}}
>>> evaluation1 = {'dataset': {'p': 7, 'n': 26},
...                'folding': {'n_folds': 3, 'n_repeats': 1}}
>>> evaluations = [evaluation0, evaluation1]
>>> scores = {'acc': 0.357, 'sens': 0.323, 'spec': 0.362, 'bacc': 0.9}
>>> result = check_n_datasets_mos_unknown_folds_mos(evaluations=evaluations,
...                                                 scores=scores,
...                                                 eps=1e-3)
>>> result['inconsistency']
# True
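When the folds are unknown, each dataset admits many possible distributions of its positive samples across the folds, and the check may have to try their combinations. The following minimal sketch counts such distributions under simplifying assumptions: folds of near-equal size, at least one positive and one negative sample in every fold, and folds treated as ordered. The helper names are hypothetical, and this is not the implementation of estimate_n_experiments.

```python
from functools import lru_cache


def fold_sizes(total, n_folds):
    """Split `total` samples into `n_folds` folds of near-equal size."""
    base, extra = divmod(total, n_folds)
    return [base + 1] * extra + [base] * (n_folds - extra)


def count_positive_distributions(p, n, n_folds):
    """Count the ordered ways to place p positives into the folds so that
    every fold holds at least one positive and one negative sample."""
    sizes = tuple(fold_sizes(p + n, n_folds))

    @lru_cache(maxsize=None)
    def count(fold, remaining):
        if fold == len(sizes):
            return 1 if remaining == 0 else 0
        # Positives in this fold: at least 1, at most (fold size - 1).
        upper = min(sizes[fold] - 1, remaining)
        return sum(count(fold + 1, remaining - k) for k in range(1, upper + 1))

    return count(0, p)


# The two evaluations from the example above (assumed counting scheme).
c0 = count_positive_distributions(p=13, n=73, n_folds=4)
c1 = count_positive_distributions(p=7, n=26, n_folds=3)
print(c0, c1, c0 * c1)
```

Because the per-dataset counts multiply, adding datasets or folds quickly inflates the number of configurations, which is why an estimate is worth computing before running the check.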
Multiclass classification
- mlscorecheck.check.multiclass.check_1_testset_no_kfold_macro(testset: dict, scores: dict, eps, *, class_score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) dict[source]
The function tests the consistency of scores calculated by taking the macro average of class level scores on one single multiclass dataset.
The test operates by constructing a linear programming problem representing the experiment and checking its feasibility.
Note that this test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. Without bounds, if there is a large number of classes, it is likely that some configuration will match the scores provided. To increase the strength of the test, one can add class_score_bounds when, for example, besides the average score, the minimum and the maximum scores over the classes are also provided. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
- Parameters:
testset (dict) – the specification of the testset
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
class_score_bounds (None|dict(str,tuple(float,float))) – bounds on the scores in the classes
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity level of the pulp linear programming solver, 0: silent, non-zero: verbose
numerical_tolerance (float) – beyond the numerical uncertainty of the scores, a further tolerance is applied in practice. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'lp_status':The status of the lp solver.
'lp_configuration_scores_match':A flag indicating if the scores from the lp configuration match the scores provided.
'lp_configuration_bounds_match':Indicates if the specified bounds match the actual figures.
'lp_configuration':Contains the actual configuration of the linear programming solver.
- Return type:
dict
- Raises:
ValueError – if the problem is not specified properly
Examples
>>> from mlscorecheck.check.multiclass import check_1_testset_no_kfold_macro
>>> testset = {0: 10, 1: 100, 2: 80}
>>> scores = {'acc': 0.6, 'sens': 0.3417, 'spec': 0.6928, 'f1p': 0.3308}
>>> results = check_1_testset_no_kfold_macro(scores=scores,
...                                          testset=testset,
...                                          eps=1e-4)
>>> results['inconsistency']
# False

>>> scores['acc'] = 0.6020
>>> results = check_1_testset_no_kfold_macro(scores=scores,
...                                          testset=testset,
...                                          eps=1e-4)
>>> results['inconsistency']
# True
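To make the macro-averaging step concrete, the sketch below derives one-vs-rest counts for each class from a multiclass confusion matrix and averages the class-level scores. The confusion matrix is made up for illustration (its row sums match the testset {0: 10, 1: 100, 2: 80} used above), and the function name is hypothetical rather than part of mlscorecheck.

```python
def macro_scores(confusion):
    """Macro-averaged acc, sens and spec from a square confusion matrix
    (rows: true classes, columns: predicted classes)."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    sums = {'acc': 0.0, 'sens': 0.0, 'spec': 0.0}
    for c in range(n_classes):
        tp = confusion[c][c]
        p = sum(confusion[c])                    # samples of class c
        fp = sum(row[c] for row in confusion) - tp
        n = total - p                            # samples of all other classes
        tn = n - fp
        sums['acc'] += (tp + tn) / total
        sums['sens'] += tp / p
        sums['spec'] += tn / n
    # Macro average: every class contributes with equal weight.
    return {name: value / n_classes for name, value in sums.items()}


# Illustrative confusion matrix (made-up values).
confusion = [[5, 3, 2],
             [20, 50, 30],
             [10, 30, 40]]
macro = macro_scores(confusion)
print(macro)
```

Since each class contributes equally regardless of its size, very different class-level scores can produce the same average, which is why class-level bounds strengthen the test.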
- mlscorecheck.check.multiclass.check_1_testset_no_kfold_micro(testset: dict, scores: dict, eps, *, numerical_tolerance: float = 1e-06, prefilter_by_pairs: bool = True) dict[source]
Checking the consistency of scores calculated by taking the micro average of class level scores on one single multiclass dataset.
The test operates by the exhaustive enumeration of all potential confusion matrices.
- Parameters:
testset (dict) – the specification of the testset
scores (dict(str,float)) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – beyond the numerical uncertainty of the scores, a further tolerance is applied in practice. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
prefilter_by_pairs (bool) – whether to prefilter the solution space by pair solutions when possible to speed up the process
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'details':A list providing further details from the analysis of the scores one after the other.
'n_valid_tptn_pairs':The number of tp and tn pairs that are compatible with all scores.
'prefiltering_details':The results of the prefiltering by using the solutions for the score pairs.
'evidence':The evidence for satisfying the consistency constraints.
- Return type:
dict
- Raises:
ValueError – if the problem is not specified properly
Examples
>>> from mlscorecheck.check.multiclass import check_1_testset_no_kfold_micro
>>> testset = {0: 10, 1: 100, 2: 80}
>>> scores = {'acc': 0.5158, 'sens': 0.2737, 'spec': 0.6368,
...           'bacc': 0.4553, 'ppv': 0.2737, 'npv': 0.6368}
>>> results = check_1_testset_no_kfold_micro(testset=testset,
...                                          scores=scores,
...                                          eps=1e-4)
>>> results['inconsistency']
# False

>>> scores['acc'] = 0.5258
>>> results = check_1_testset_no_kfold_micro(testset=testset,
...                                          scores=scores,
...                                          eps=1e-4)
>>> results['inconsistency']
# True
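The exhaustive enumeration behind this kind of test can be sketched in a few lines. The code below is a simplified illustration, not the library's implementation: it assumes that micro averaging pools the one-vs-rest counts of all classes, covers only ‘acc’, ‘sens’ and ‘spec’, ignores numerical_tolerance, and uses a hypothetical function name.

```python
def feasible_tptn_pairs(p, n, scores, eps):
    """Return every (tp, tn) pair whose acc, sens and spec all lie within
    eps of the reported values; other score names are ignored here."""
    formulas = {
        'acc': lambda tp, tn: (tp + tn) / (p + n),
        'sens': lambda tp, tn: tp / p,
        'spec': lambda tp, tn: tn / n,
    }
    return [(tp, tn)
            for tp in range(p + 1)
            for tn in range(n + 1)
            if all(abs(formulas[name](tp, tn) - value) <= eps
                   for name, value in scores.items() if name in formulas)]


testset = {0: 10, 1: 100, 2: 80}
total = sum(testset.values())
# Assumption: with pooled one-vs-rest counts, each sample is a positive
# once and a negative (n_classes - 1) times.
p = total
n = (len(testset) - 1) * total

scores = {'acc': 0.5158, 'sens': 0.2737, 'spec': 0.6368}
consistent = feasible_tptn_pairs(p, n, scores, eps=1e-4)
perturbed = feasible_tptn_pairs(p, n, {**scores, 'acc': 0.5258}, eps=1e-4)
print(consistent, perturbed)
```

An empty list corresponds to the 'inconsistency' flag being True: no confusion matrix can produce the reported scores.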
- mlscorecheck.check.multiclass.check_1_dataset_known_folds_mos_macro(dataset: dict, folding: dict, scores: dict, eps, *, class_score_bounds: dict = None, fold_score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) dict[source]
Checking the consistency of scores calculated by taking the macro average of class-level scores on one single multiclass dataset with k-fold cross-validation.
The test operates by constructing a linear program describing the experiment and checking its feasibility.
The test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. For a stronger test, one can add class_score_bounds or fold_score_bounds when, for example, the minimum and the maximum scores over the classes or folds are available. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
- Parameters:
dataset (dict) – the specification of the dataset
folding (dict) – the specification of the folding
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
class_score_bounds (None|dict(str,tuple(float,float))) – the potential bounds on the scores for the classes
fold_score_bounds (None|dict(str,tuple(float,float))) – the potential bounds on the scores in the folds
solver_name (None|str, optional) – The solver to use. Defaults to None.
timeout (None|int, optional) – The timeout for the linear programming solver in seconds. Defaults to None.
verbosity (int, optional) – The verbosity level of the pulp linear programming solver. 0: silent, non-zero: verbose. Defaults to 1.
numerical_tolerance (float) – beyond the numerical uncertainty of the scores, a further tolerance is applied in practice. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'lp_status':The status of the lp solver.
'lp_configuration_scores_match':A flag indicating if the scores from the lp configuration match the scores provided.
'lp_configuration_bounds_match':Indicates if the specified bounds match the actual figures.
'lp_configuration':Contains the actual configuration of the linear programming solver.
- Return type:
dict
- Raises:
ValueError – if the problem is not specified properly
Examples
>>> from mlscorecheck.check.multiclass import check_1_dataset_known_folds_mos_macro
>>> dataset = {0: 149, 1: 118, 2: 83, 3: 154}
>>> folding = {'n_folds': 4, 'n_repeats': 2, 'strategy': 'stratified_sklearn'}
>>> scores = {'acc': 0.626, 'sens': 0.2483, 'spec': 0.7509, 'f1p': 0.2469}
>>> result = check_1_dataset_known_folds_mos_macro(dataset=dataset,
...                                                folding=folding,
...                                                scores=scores,
...                                                eps=1e-4)
>>> result['inconsistency']
# False

>>> scores['acc'] = 0.656
>>> result = check_1_dataset_known_folds_mos_macro(dataset=dataset,
...                                                folding=folding,
...                                                scores=scores,
...                                                eps=1e-4)
>>> result['inconsistency']
# True
- mlscorecheck.check.multiclass.check_1_dataset_known_folds_mos_micro(dataset: dict, folding: dict, scores: dict, eps, *, fold_score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) dict[source]
This function checks the consistency of scores calculated by taking the micro average on a single multiclass dataset with known folds.
The test operates by constructing a linear program describing the experiment and checking its feasibility.
The test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. For a stronger test, one can add fold_score_bounds when, for example, the minimum and the maximum scores over the folds are available. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
- Parameters:
dataset (dict) – The specification of the dataset.
folding (dict) – The specification of the folding strategy.
eps (float|dict(str,float)) – The numerical uncertainty(ies) of the scores.
fold_score_bounds (None|dict(str,tuple(float,float))) – the potential bounds on the scores in the folds
solver_name (None|str, optional) – The solver to use. Defaults to None.
timeout (None|int, optional) – The timeout for the linear programming solver in seconds. Defaults to None.
verbosity (int, optional) – The verbosity level of the pulp linear programming solver. 0: silent, non-zero: verbose. Defaults to 1.
numerical_tolerance (float, optional) – Beyond the numerical uncertainty of the scores, a further tolerance is applied in practice. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity. Defaults to NUMERICAL_TOLERANCE (1e-06).
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'lp_status':The status of the lp solver.
'lp_configuration_scores_match':A flag indicating if the scores from the lp configuration match the scores provided.
'lp_configuration_bounds_match':Indicates if the specified bounds match the actual figures.
'lp_configuration':Contains the actual configuration of the linear programming solver.
- Return type:
dict
- Raises:
ValueError – if the problem is not specified properly
Examples
>>> from mlscorecheck.check.multiclass import check_1_dataset_known_folds_mos_micro
>>> dataset = {0: 66, 1: 178, 2: 151}
>>> folding = {'folds': [{0: 33, 1: 89, 2: 76}, {0: 33, 1: 89, 2: 75}]}
>>> scores = {'acc': 0.5646, 'sens': 0.3469, 'spec': 0.6734, 'f1p': 0.3469}
>>> result = check_1_dataset_known_folds_mos_micro(dataset=dataset,
...                                                folding=folding,
...                                                scores=scores,
...                                                eps=1e-4)
>>> result['inconsistency']
# False

>>> scores['acc'] = 0.5746
>>> result = check_1_dataset_known_folds_mos_micro(dataset=dataset,
...                                                folding=folding,
...                                                scores=scores,
...                                                eps=1e-4)
>>> result['inconsistency']
# True
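The library formulates this check as a linear program. Purely for illustration, the sketch below replaces the solver with a brute-force search over the fold-level counts, which is feasible only for a couple of small folds. It assumes that the micro-averaged fold statistics are the pooled one-vs-rest counts (p equal to the fold size, n equal to twice the fold size for three classes), covers only ‘acc’, ‘sens’ and ‘spec’, and uses hypothetical function names.

```python
from itertools import product


def mean_score_candidates(totals, target, eps):
    """All count tuples (one count per fold) whose mean per-fold ratio
    count / total lies within eps of the target score."""
    k = len(totals)
    return [counts
            for counts in product(*(range(t + 1) for t in totals))
            if abs(sum(c / t for c, t in zip(counts, totals)) / k - target) <= eps]


def mos_consistent(folds, scores, eps):
    """folds: list of (p, n) pairs; scores: dict with 'acc', 'sens', 'spec'.
    Returns True if some fold-level counts reproduce the mean of scores."""
    ps = [p for p, _ in folds]
    ns = [n for _, n in folds]
    tp_candidates = mean_score_candidates(ps, scores['sens'], eps)
    tn_candidates = mean_score_candidates(ns, scores['spec'], eps)
    for tps, tns in product(tp_candidates, tn_candidates):
        mean_acc = sum((tp + tn) / (p + n)
                       for tp, tn, p, n in zip(tps, tns, ps, ns)) / len(folds)
        if abs(mean_acc - scores['acc']) <= eps:
            return True
    return False


# Assumed micro-averaged (p, n) of the two folds in the example above:
# the folds hold 198 and 197 samples of 3 classes.
folds = [(198, 2 * 198), (197, 2 * 197)]
scores = {'acc': 0.5646, 'sens': 0.3469, 'spec': 0.6734}
consistent = mos_consistent(folds, scores, eps=1e-4)
perturbed = mos_consistent(folds, {**scores, 'acc': 0.5746}, eps=1e-4)
print(consistent, perturbed)
```

The search space grows exponentially with the number of folds, which is one reason a linear programming formulation is preferable for the mean-of-scores aggregation.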
- mlscorecheck.check.multiclass.check_1_dataset_known_folds_som_macro(dataset: dict, folding: dict, scores: dict, eps, *, class_score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) dict[source]
This function checks the consistency of scores calculated by taking the macro average on a single multiclass dataset and averaging the scores across the folds in the SoM manner.
The test operates by constructing a linear program describing the experiment and checking its feasibility.
The test can only check the consistency of the ‘acc’, ‘sens’, ‘spec’ and ‘bacc’ scores. For a stronger test, one can add class_score_bounds when, for example, the minimum and the maximum scores over the classes are available. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
- Parameters:
dataset (dict) – The specification of the dataset.
folding (dict) – The specification of the folding.
eps (float|dict(str,float)) – The numerical uncertainty(ies) of the scores.
class_score_bounds (None|dict(str,tuple(float,float))) – the potential bounds on the scores for the classes
solver_name (None|str, optional) – The solver to use. Defaults to None.
timeout (None|int, optional) – The timeout for the linear programming solver in seconds. Defaults to None.
verbosity (int, optional) – The verbosity level of the pulp linear programming solver. 0: silent, non-zero: verbose. Defaults to 1.
numerical_tolerance (float, optional) – Beyond the numerical uncertainty of the scores, a further tolerance is applied in practice. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity. Defaults to NUMERICAL_TOLERANCE (1e-06).
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'lp_status':The status of the lp solver.
'lp_configuration_scores_match':A flag indicating if the scores from the lp configuration match the scores provided.
'lp_configuration_bounds_match':Indicates if the specified bounds match the actual figures.
'lp_configuration':Contains the actual configuration of the linear programming solver.
- Return type:
dict
- Raises:
ValueError – If the provided scores are not consistent with the dataset.
Examples
>>> from mlscorecheck.check.multiclass import check_1_dataset_known_folds_som_macro
>>> dataset = {0: 129, 1: 81, 2: 135}
>>> folding = {'n_folds': 2, 'n_repeats': 2, 'strategy': 'stratified_sklearn'}
>>> scores = {'acc': 0.5662, 'sens': 0.3577, 'spec': 0.6767, 'f1p': 0.3481}
>>> result = check_1_dataset_known_folds_som_macro(dataset=dataset,
...                                                folding=folding,
...                                                scores=scores,
...                                                eps=1e-4)
>>> result['inconsistency']
# False

>>> scores['acc'] = 0.6762
>>> result = check_1_dataset_known_folds_som_macro(dataset=dataset,
...                                                folding=folding,
...                                                scores=scores,
...                                                eps=1e-4)
>>> result['inconsistency']
# True
- mlscorecheck.check.multiclass.check_1_dataset_known_folds_som_micro(dataset: dict, folding: dict, scores: dict, eps, *, numerical_tolerance: float = 1e-06, prefilter_by_pairs: bool = True) dict[source]
This function checks the consistency of scores calculated by taking the micro average of class level scores on a single multiclass dataset and averaging across the folds in the SoM manner.
The test is performed by exhaustively testing all possible confusion matrices.
- Parameters:
dataset (dict) – The specification of the dataset.
folding (dict) – The specification of the folding strategy.
eps (float|dict(str,float)) – The numerical uncertainty(ies) of the scores.
numerical_tolerance (float) – beyond the numerical uncertainty of the scores, a further tolerance is applied in practice. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
prefilter_by_pairs (bool) – whether to prefilter the solution space by pair solutions when possible to speed up the process
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'details':A list providing further details from the analysis of the scores one after the other.
'n_valid_tptn_pairs':The number of tp and tn pairs that are compatible with all scores.
'prefiltering_details':The results of the prefiltering by using the solutions for the score pairs.
'evidence':The evidence for satisfying the consistency constraints.
- Return type:
dict
- Raises:
ValueError – If the provided scores are not consistent with the dataset.
Examples
>>> from mlscorecheck.check.multiclass import check_1_dataset_known_folds_som_micro
>>> dataset = {0: 86, 1: 96, 2: 59, 3: 105}
>>> folding = {'folds': [{0: 43, 1: 48, 2: 30, 3: 52},
...                      {0: 43, 1: 48, 2: 29, 3: 53}]}
>>> scores = {'acc': 0.6272, 'sens': 0.2543, 'spec': 0.7514, 'f1p': 0.2543}
>>> result = check_1_dataset_known_folds_som_micro(dataset=dataset,
...                                                folding=folding,
...                                                scores=scores,
...                                                eps=1e-4)
>>> result['inconsistency']
# False

>>> scores['sens'] = 0.2553
>>> result = check_1_dataset_known_folds_som_micro(dataset=dataset,
...                                                folding=folding,
...                                                scores=scores,
...                                                eps=1e-4)
>>> result['inconsistency']
# True
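The difference between the two aggregations used by the fold-based checks, mean of scores (MoS) and score of means (SoM), can be seen on a small example. The fold-level counts below are made up for illustration, and the function names are hypothetical.

```python
def sensitivity_mos(folds):
    """Mean of scores: compute the sensitivity in each fold, then average."""
    return sum(f['tp'] / f['p'] for f in folds) / len(folds)


def sensitivity_som(folds):
    """Score of means: pool the counts over the folds, then compute once."""
    return sum(f['tp'] for f in folds) / sum(f['p'] for f in folds)


# Made-up fold-level counts with unequal numbers of positives.
folds = [{'tp': 8, 'p': 10}, {'tp': 1, 'p': 5}]
mos = sensitivity_mos(folds)   # (8/10 + 1/5) / 2
som = sensitivity_som(folds)   # (8 + 1) / (10 + 5)
print(mos, som)
```

With SoM the pooled counts behave like a single confusion matrix, so the exhaustive enumeration applies; with MoS every fold keeps its own counts, which leads to the linear programming formulation.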
Regression
- mlscorecheck.check.regression.check_1_testset_no_kfold(var: float, n_samples: int, scores: dict, eps, numerical_tolerance: float = 1e-06) dict[source]
The consistency test for regression scores calculated on a single test set with no k-folding
- Parameters:
var (float) – the variance of the evaluation set
n_samples (int) – the number of samples in the evaluation set
scores (dict(str,float)) – the scores to check (‘mae’, ‘rmse’, ‘mse’, ‘r2’)
eps (float|dict(str,float)) – the numerical uncertainty of the scores
numerical_tolerance (float) – the numerical tolerance of the test
- Returns:
a summary of the analysis, with the following entries:
'inconsistency' (bool): whether an inconsistency has been identified
'details' (list(dict)): the details of the analysis, with the following entries
- Return type:
dict
Examples
>>> from mlscorecheck.check.regression import check_1_testset_no_kfold
>>> var = 0.08316192579267838
>>> n_samples = 100
>>> scores = {'mae': 0.0254, 'r2': 0.9897}
>>> result = check_1_testset_no_kfold(var=var,
...                                   n_samples=n_samples,
...                                   scores=scores,
...                                   eps=1e-4)
>>> result['inconsistency']
# False

>>> scores['mae'] = 0.03
>>> result = check_1_testset_no_kfold(var=var,
...                                   n_samples=n_samples,
...                                   scores=scores,
...                                   eps=1e-4)
>>> result['inconsistency']
# True
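The kind of relation such a regression test can exploit is sketched below. This is an illustration under stated assumptions, not the library's implementation: var is taken to be the population variance of the targets, so that r2 = 1 - mse / var, and the general inequalities mae <= rmse <= sqrt(n_samples) * mae are checked on intervals of width eps. The function name is hypothetical.

```python
import math


def regression_consistent(var, n_samples, scores, eps):
    """Interval check of 'mae' against 'r2', using
    mse = var * (1 - r2), rmse = sqrt(mse) and
    mae <= rmse <= sqrt(n_samples) * mae."""
    r2_lo, r2_hi = scores['r2'] - eps, scores['r2'] + eps
    mae_lo, mae_hi = max(scores['mae'] - eps, 0.0), scores['mae'] + eps
    # A larger r2 means a smaller error, so the bounds swap.
    rmse_lo = math.sqrt(max(var * (1.0 - r2_hi), 0.0))
    rmse_hi = math.sqrt(max(var * (1.0 - r2_lo), 0.0))
    return mae_lo <= rmse_hi and rmse_lo <= math.sqrt(n_samples) * mae_hi


var = 0.08316192579267838
scores = {'mae': 0.0254, 'r2': 0.9897}
ok = regression_consistent(var, 100, scores, eps=1e-4)
bad = regression_consistent(var, 100, {**scores, 'mae': 0.03}, eps=1e-4)
print(ok, bad)
```

In the second call the reported ‘mae’ exceeds the largest ‘rmse’ compatible with the reported ‘r2’, which cannot happen for any set of predictions.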
Test bundles (bundles)
The test bundles dedicated to specific problems in the mlscorecheck.check.bundles module.
Retina Image Processing
The test functions dedicated to retina image processing problems.
DRIVE
- mlscorecheck.check.bundles.retina.check_drive_vessel_image(image_identifier: str, annotator: str, scores: dict, eps, *, numerical_tolerance: float = 1e-06)[source]
Testing the scores calculated for one image of the DRIVE dataset with both assumptions on the region of evaluation (‘fov’/’all’).
- Parameters:
image_identifier (str) – the identifier of the image (like “21”)
annotator (int) – the annotation to use (1, 2) (typically annotator 1 is used in papers)
scores (dict(str,float)) – the scores to be tested (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float) – the numerical uncertainty
numerical_tolerance (float) – beyond the numerical uncertainty of the scores, a further tolerance is applied in practice. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
- Returns:
A summary of the results, with the following entries:
'inconsistency':All findings.
'details*':The details of the analysis for the two assumptions.
- Return type:
Examples
>>> from mlscorecheck.check.bundles.retina import check_drive_vessel_image
>>> scores = {'acc': 0.9633, 'sens': 0.7406, 'spec': 0.9849}
>>> identifier = '01'
>>> k = 4
>>> results = check_drive_vessel_image(scores=scores,
...                                    eps=10**(-k),
...                                    image_identifier=identifier,
...                                    annotator=1)
>>> results['inconsistency']
# {'inconsistency_fov': True, 'inconsistency_all': False}
- mlscorecheck.check.bundles.retina.check_drive_vessel_image_assumption(image_identifier: str, assumption: str, annotator: str, scores: dict, eps, *, numerical_tolerance: float = 1e-06)[source]
Testing the scores calculated for one image of the DRIVE dataset with a particular assumption on the region of evaluation.
- Parameters:
image_identifier (str) – the identifier of the image (like “21”)
assumption (str) – the assumption on the region of evaluation to test (‘fov’/’all’)
annotator (int) – the annotation to use (1, 2) (typically annotator 1 is used in papers)
scores (dict(str,float)) – the scores to be tested (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – beyond the numerical uncertainty of the scores, a further tolerance is applied in practice. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'details':A list providing further details from the analysis of the scores one after the other.
'n_valid_tptn_pairs':The number of tp and tn pairs that are compatible with all scores.
'prefiltering_details':The results of the prefiltering by using the solutions for the score pairs.
'evidence':The evidence for satisfying the consistency constraints.
- Return type:
- mlscorecheck.check.bundles.retina.check_drive_vessel_aggregated(imageset, annotator: int, scores: dict, eps, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) dict[source]
Testing the scores calculated for the DRIVE dataset with both assumptions regarding the region of evaluation (using the FoV or all pixels of the images).
The strength of the test can be improved by specifying the score_bounds (minimum and maximum scores) for the images when available.
- Parameters:
imageset (str|list) – ‘train’/’test’ for all images in the train or test set, or a list of identifiers of images (e.g. [‘21’, ‘22’])
annotator (int) – the annotation to use (1, 2) (typically annotator 1 is used in papers)
eps (float) – the numerical uncertainty
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – beyond the numerical uncertainty of the scores, a further tolerance is applied in practice. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
- Returns:
The summary of the results, with the following entries:
'inconsistency':All findings.
'details*':The details of the analysis for the two assumptions.
- Return type:
dict
Examples
>>> from mlscorecheck.check.bundles.retina import check_drive_vessel_aggregated
>>> scores = {'acc': 0.9494, 'sens': 0.7450, 'spec': 0.9793}
>>> k = 4
>>> results = check_drive_vessel_aggregated(scores=scores,
...                                         eps=10**(-k),
...                                         imageset='test',
...                                         annotator=1,
...                                         verbosity=0)
>>> results['inconsistency']
# {'inconsistency_fov_mos': False,
#  'inconsistency_fov_som': False,
#  'inconsistency_all_mos': True,
#  'inconsistency_all_som': True}
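The 'inconsistency' entry of the aggregated check holds one flag per combination of evaluation region and aggregation. A small helper, hypothetical and not part of mlscorecheck, can list the combinations that remain feasible; the input below is the dictionary shown in the example above.

```python
def feasible_assumptions(flags):
    """Return the assumption names whose inconsistency flag is False."""
    prefix = 'inconsistency_'
    return sorted(key[len(prefix):]
                  for key, inconsistent in flags.items()
                  if key.startswith(prefix) and not inconsistent)


# The value of results['inconsistency'] from the example above.
flags = {'inconsistency_fov_mos': False,
         'inconsistency_fov_som': False,
         'inconsistency_all_mos': True,
         'inconsistency_all_som': True}
remaining = feasible_assumptions(flags)
print(remaining)
```

Here only the evaluations restricted to the field of view remain feasible, under either aggregation.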
- mlscorecheck.check.bundles.retina.check_drive_vessel_aggregated_mos_assumption(imageset, assumption: str, annotator: int, scores: dict, eps, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) dict[source]
Checking the consistency of scores calculated for some images of the DRIVE dataset with the mean of scores aggregation and a particular assumption on the region of evaluation.
- Parameters:
imageset (str|list) – ‘train’/’test’ for all images in the train or test set, or a list of identifiers of images (e.g. [‘21’, ‘22’])
assumption (str) – the assumption on the region of evaluation to test (‘fov’/’all’)
annotator (int) – the annotation to be used (1/2) (typically annotator 1 is used in papers)
scores (dict) – the scores to check (supports only ‘acc’, ‘sens’, ‘spec’, ‘bacc’)
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – beyond the numerical uncertainty of the scores, a further tolerance is applied in practice. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'lp_status':The status of the lp solver.
'lp_configuration_scores_match':A flag indicating if the scores from the lp configuration match the scores provided.
'lp_configuration_bounds_match':Indicates if the specified bounds match the actual figures.
'lp_configuration':Contains the actual configuration of the linear programming solver.
- Return type:
dict
- mlscorecheck.check.bundles.retina.check_drive_vessel_aggregated_som_assumption(imageset, assumption: str, annotator: int, scores: dict, eps, *, numerical_tolerance=1e-06)[source]
Tests the consistency of scores calculated on the DRIVE dataset using the score of means aggregation and a particular assumption on the region of evaluation.
- Parameters:
imageset (str|list) – ‘train’/’test’ for all images in the train or test set, or a list of identifiers of images (e.g. [‘21’, ‘22’])
assumption (str) – the assumption on the region of evaluation to test (‘fov’/’all’)
annotator (int) – the annotation to be used (1/2) (typically annotator 1 is used in papers)
scores (dict) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – beyond the numerical uncertainty of the scores, a further tolerance is applied in practice. It is orders of magnitude smaller than the uncertainty of the scores. It ensures that the specificity of the test is 1, although it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'details':A list providing further details from the analysis of the scores one after the other.
'n_valid_tptn_pairs':The number of tp and tn pairs that are compatible with all scores.
'prefiltering_details':The results of the prefiltering by using the solutions for the score pairs.
'evidence':The evidence for satisfying the consistency constraints.
- Return type:
STARE
- mlscorecheck.check.bundles.retina.check_stare_vessel_image(image_identifier: str, annotator: str, scores: dict, eps, *, numerical_tolerance: float = 1e-06)[source]
Testing the scores calculated for one image of the STARE dataset
- Parameters:
image_identifier (str) – the identifier of the image (like “im0235”)
annotator (str) – the annotation to use (‘ah’/’vk’)
scores (dict(str,float)) – the scores to be tested (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float) – the numerical uncertainty
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'details':A list providing further details from the analysis of the scores one after the other.
'n_valid_tptn_pairs':The number of tp and tn pairs that are compatible with all scores.
'prefiltering_details':The results of the prefiltering by using the solutions for the score pairs.
'evidence':The evidence for satisfying the consistency constraints.
- Return type:
dict
Examples
>>> from mlscorecheck.check.bundles.retina import check_stare_vessel_image
>>> img_identifier = 'im0235'
>>> scores = {'acc': 0.4699, 'npv': 0.8993, 'f1p': 0.134}
>>> results = check_stare_vessel_image(image_identifier=img_identifier,
...                                    annotator='ah',
...                                    scores=scores,
...                                    eps=1e-4)
>>> results['inconsistency']
# False
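The single-image tests report the number of feasible (tp, tn) pairs. The idea can be sketched in plain Python as an exhaustive enumeration. This is only an illustration, not the library's implementation: the function name `valid_tptn_pairs`, the three score formulas, and the pixel counts are assumptions made for the example.

```python
def valid_tptn_pairs(p, n, scores, eps, numerical_tolerance=1e-6):
    """Enumerate the (tp, tn) pairs that reproduce all reported scores.

    Hypothetical helper: p and n are the positive and negative counts,
    scores maps 'acc'/'sens'/'spec' to the reported values.
    """
    formulas = {
        "acc": lambda tp, tn: (tp + tn) / (p + n),
        "sens": lambda tp, tn: tp / p,
        "spec": lambda tp, tn: tn / n,
    }
    # Assumption: the tolerance is simply added to the uncertainty.
    tol = eps + numerical_tolerance
    return [
        (tp, tn)
        for tp in range(p + 1)
        for tn in range(n + 1)
        if all(abs(formulas[name](tp, tn) - value) <= tol
               for name, value in scores.items())
    ]


# Made-up image with 10 positive and 20 negative pixels.
consistent = valid_tptn_pairs(
    10, 20, {"acc": 0.8, "sens": 0.7, "spec": 0.85}, eps=1e-4)
inconsistent = valid_tptn_pairs(
    10, 20, {"acc": 0.9, "sens": 0.7, "spec": 0.85}, eps=1e-4)

print(len(consistent))    # number of feasible pairs
print(len(inconsistent))  # an empty list means inconsistency
```

An empty list corresponds to 'inconsistency' being True, and the length of the list plays the role of 'n_valid_tptn_pairs'.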
- mlscorecheck.check.bundles.retina.check_stare_vessel_aggregated(imageset, annotator: str, scores: dict, eps, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) dict[source]
Testing the scores calculated for the STARE dataset
- Parameters:
imageset (str|list) – ‘all’ if all images are used, or a list of identifiers of images (e.g. [‘im0082’, ‘im0235’])
annotator (str) – the annotation to be used (‘ah’/’vk’)
eps (float) – the numerical uncertainty
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
The summary of the results, with the following entries:
'inconsistency':All findings.
details*:The details of the analysis for the two assumptions.
- Return type:
dict
Examples
>>> from mlscorecheck.check.bundles.retina import check_stare_vessel_aggregated
>>> scores = {'acc': 0.4964, 'sens': 0.5793, 'spec': 0.4871, 'bacc': 0.5332}
>>> results = check_stare_vessel_aggregated(imageset='all',
...                                         annotator='ah',
...                                         scores=scores,
...                                         eps=1e-4,
...                                         verbosity=0)
>>> results['inconsistency']
# {'inconsistency_mos': False, 'inconsistency_som': True}
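The aggregated tests distinguish the mean-of-scores (MoS) and score-of-means (SoM) aggregations. A minimal sketch of the difference for accuracy is shown below; the per-image counts are invented for illustration and do not come from STARE.

```python
def accuracy(tp, tn, p, n):
    """Accuracy from true positives/negatives and class counts."""
    return (tp + tn) / (p + n)


# Hypothetical per-image figures: (tp, tn, p, n)
images = [(8, 80, 10, 90), (30, 30, 50, 250)]

# Mean of scores: compute the score per image, then average.
mos = sum(accuracy(*image) for image in images) / len(images)

# Score of means: pool the counts over all images, then compute the score once.
totals = [sum(column) for column in zip(*images)]
som = accuracy(*totals)

print(round(mos, 4), round(som, 4))
```

Because the two images differ in size, the two aggregations give different values, which is why a set of reported scores can be consistent under one assumption and inconsistent under the other.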
- mlscorecheck.check.bundles.retina.check_stare_vessel_aggregated_mos(imageset, annotator: str, scores: dict, eps, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) dict[source]
Checking the consistency of scores calculated for some images of the STARE dataset with the mean of scores aggregation.
- Parameters:
imageset (str|list) – ‘all’ if all images are used, or a list of identifiers of images (e.g. [‘im0082’, ‘im0235’])
annotator (str) – the annotation to be used (‘ah’/’vk’)
scores (dict) – the scores to check (supports only ‘acc’, ‘sens’, ‘spec’, ‘bacc’). Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'lp_status':The status of the lp solver.
'lp_configuration_scores_match':A flag indicating if the scores from the lp configuration match the scores provided.
'lp_configuration_bounds_match':Indicates if the specified bounds match the actual figures.
'lp_configuration':Contains the actual configuration of the linear programming solver.
- Return type:
dict
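The MoS test asks whether some per-image confusion matrices exist whose averaged scores match the reported ones, optionally with every per-image score inside the given bounds. The library solves this by linear programming; the sketch below is a brute-force stand-in that is only practical for tiny, made-up images. The function name `mos_feasible` and the argument `acc_bounds` are hypothetical.

```python
from itertools import product


def mos_feasible(images, acc, eps, acc_bounds=(0.0, 1.0)):
    """Brute-force MoS feasibility for accuracy (illustration only).

    images is a list of hypothetical (p, n) counts, one pair per image.
    """
    per_image = []
    for p, n in images:
        total = p + n
        # Accuracy depends only on tp + tn, which ranges over 0..total.
        per_image.append([
            correct / total
            for correct in range(total + 1)
            if acc_bounds[0] <= correct / total <= acc_bounds[1]
        ])
    # Feasible if any combination of per-image accuracies averages to acc.
    return any(
        abs(sum(combo) / len(images) - acc) <= eps
        for combo in product(*per_image)
    )


images = [(2, 3), (1, 3)]
print(mos_feasible(images, acc=0.475, eps=1e-4))
print(mos_feasible(images, acc=0.475, eps=1e-4, acc_bounds=(0.5, 1.0)))
```

The second call shows how bounds on the per-image scores can turn a feasible configuration into an infeasible one.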
- mlscorecheck.check.bundles.retina.check_stare_vessel_aggregated_som(imageset, annotator, scores, eps, numerical_tolerance=1e-06)[source]
Tests the consistency of scores calculated on the STARE dataset using the score-of-means aggregation.
- Parameters:
imageset (str|list) – ‘all’ if all images are used, or a list of identifiers of images (e.g. [‘im0082’, ‘im0235’])
annotator (str) – the annotation to be used (‘ah’/’vk’)
scores (dict) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'details':A list providing further details from the analysis of the scores one after the other.
'n_valid_tptn_pairs':The number of tp and tn pairs that are compatible with all scores.
'prefiltering_details':The results of the prefiltering by using the solutions for the score pairs.
'evidence':The evidence for satisfying the consistency constraints.
- Return type:
dict
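The `eps` argument may be a single float or a dictionary with one uncertainty per score, and `numerical_tolerance` widens the accepted range slightly further. One way to picture this is as an acceptance interval per score. The helper below is a sketch under the assumption that the tolerance is simply added to the uncertainty; the name `score_intervals` is hypothetical.

```python
def score_intervals(scores, eps, numerical_tolerance=1e-6):
    """Return the accepted (low, high) interval for each reported score."""
    # Normalise eps to a per-score mapping.
    if isinstance(eps, dict):
        eps_map = {name: eps[name] for name in scores}
    else:
        eps_map = {name: eps for name in scores}
    return {
        name: (value - eps_map[name] - numerical_tolerance,
               value + eps_map[name] + numerical_tolerance)
        for name, value in scores.items()
    }


scores = {"acc": 0.5, "sens": 0.25}
uniform = score_intervals(scores, eps=1e-4)
per_score = score_intervals(scores, eps={"acc": 1e-2, "sens": 1e-4})
print(per_score["acc"])
```

A per-score dictionary is useful when the scores are reported with different numbers of decimal places.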
HRF
- mlscorecheck.check.bundles.retina.check_hrf_vessel_image(image_identifier: str, scores: dict, eps, *, numerical_tolerance: float = 1e-06)[source]
Testing the scores calculated for one image of the HRF dataset with both assumptions on the region of evaluation (‘fov’/’all’)
- Parameters:
image_identifier (str) – the identifier of the image (like “01_g”)
scores (dict) – the scores to be tested (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float) – the numerical uncertainty
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
The summary of the results, with the following entries:
'inconsistency':All findings.
details*:The details of the analysis for the two assumptions.
- Return type:
dict
Examples
>>> from mlscorecheck.check.bundles.retina import check_hrf_vessel_image
>>> scores = {'acc': 0.5562, 'sens': 0.5049, 'spec': 0.5621}
>>> identifier = '13_h'
>>> k = 4
>>> results = check_hrf_vessel_image(scores=scores,
...                                  eps=10**(-k),
...                                  image_identifier=identifier)
>>> results['inconsistency']
# {'inconsistency_fov': False, 'inconsistency_all': True}
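Testing both assumptions on the region of evaluation amounts to running the same check with two different (p, n) counts: one for the pixels inside the field of view and one for all pixels. The sketch below uses invented counts and a hypothetical helper `is_inconsistent`; it only illustrates how the two flags arise.

```python
def is_inconsistent(p, n, scores, eps):
    """True if no (tp, tn) pair reproduces all scores within eps."""
    for tp in range(p + 1):
        for tn in range(n + 1):
            computed = {
                "acc": (tp + tn) / (p + n),
                "sens": tp / p,
                "spec": tn / n,
            }
            if all(abs(computed[name] - value) <= eps
                   for name, value in scores.items()):
                return False
    return True


# Made-up counts: the 'all' region adds background (negative) pixels only.
regions = {"fov": (20, 60), "all": (20, 80)}
scores = {"acc": 0.8, "sens": 0.6, "spec": 0.8667}

results = {
    f"inconsistency_{name}": is_inconsistent(p, n, scores, eps=1e-4)
    for name, (p, n) in regions.items()
}
print(results)
```

With these figures the scores are feasible only under the 'fov' assumption, so the two flags differ.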
- mlscorecheck.check.bundles.retina.check_hrf_vessel_image_assumption(image_identifier: str, assumption: str, scores: dict, eps, *, numerical_tolerance: float = 1e-06)[source]
Testing the scores calculated for one image of the HRF dataset using an assumption on the region of evaluation.
- Parameters:
image_identifier (str) – the identifier of the image (like “01_g”)
assumption (str) – the assumption on the region of evaluation to test (‘fov’/’all’)
scores (dict) – the scores to be tested (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float) – the numerical uncertainty
numerical_tolerance (float) – the additional numerical tolerance
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'details':A list providing further details from the analysis of the scores one after the other.
'n_valid_tptn_pairs':The number of tp and tn pairs that are compatible with all scores.
'prefiltering_details':The results of the prefiltering by using the solutions for the score pairs.
'evidence':The evidence for satisfying the consistency constraints.
- Return type:
dict
- mlscorecheck.check.bundles.retina.check_hrf_vessel_aggregated(imageset, scores: dict, eps, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) dict[source]
Testing the scores calculated for the HRF dataset with both assumptions on the region of evaluation (‘fov’/’all’) and both aggregation methods (‘mean of scores’, ‘score of means’).
- Parameters:
imageset (str|list) – ‘all’ or the list of identifiers of images (e.g. [‘13_h’, ‘01_g’])
eps (float) – the numerical uncertainty
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
The summary of the results, with the following entries:
'inconsistency':All findings.
details*:The details of the analysis for the two assumptions.
- Return type:
dict
Examples
>>> from mlscorecheck.check.bundles.retina import check_hrf_vessel_aggregated
>>> scores = {'acc': 0.4841, 'sens': 0.5665, 'spec': 0.475}
>>> k = 4
>>> results = check_hrf_vessel_aggregated(scores=scores,
...                                       eps=10**(-k),
...                                       imageset='all',
...                                       verbosity=0)
>>> results['inconsistency']
# {'inconsistency_fov_mos': False,
# 'inconsistency_fov_som': True,
# 'inconsistency_all_mos': False,
# 'inconsistency_all_som': True}
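The examples set `eps=10**(-k)` for scores reported to `k` decimal places. The short sketch below illustrates why this choice is safe whether the reported value was rounded or truncated; the helper `truncate` and the value 2/3 are only illustrative.

```python
import math


def truncate(value, k):
    """Cut a non-negative value to k decimal places without rounding."""
    factor = 10 ** k
    return math.floor(value * factor) / factor


k = 4
eps = 10 ** (-k)

true_value = 2 / 3
rounded = round(true_value, k)        # reported value if the authors rounded
truncated = truncate(true_value, k)   # reported value if the authors truncated

# Both reported values stay within eps of the underlying score.
print(abs(rounded - true_value) <= eps, abs(truncated - true_value) <= eps)
```

If it is known that the scores were rounded, half of this value would also cover the rounding error, but the larger uncertainty makes no assumption about how the digits were cut.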
- mlscorecheck.check.bundles.retina.check_hrf_vessel_aggregated_mos_assumption(imageset, assumption: str, scores: dict, eps, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) dict[source]
Checking the consistency of scores calculated for some images of the HRF dataset with the mean of scores aggregation and an assumption on the region of evaluation.
- Parameters:
imageset (str|list) – ‘all’ or the list of identifiers of images (e.g. [‘13_h’, ‘01_g’])
assumption (str) – the assumption on the region of evaluation to test (‘fov’/’all’)
scores (dict) – the scores to check (supports only ‘acc’, ‘sens’, ‘spec’, ‘bacc’). Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'lp_status':The status of the lp solver.
'lp_configuration_scores_match':A flag indicating if the scores from the lp configuration match the scores provided.
'lp_configuration_bounds_match':Indicates if the specified bounds match the actual figures.
'lp_configuration':Contains the actual configuration of the linear programming solver.
- Return type:
dict
- mlscorecheck.check.bundles.retina.check_hrf_vessel_aggregated_som_assumption(imageset, assumption: str, scores: dict, eps, numerical_tolerance=1e-06)[source]
Tests the consistency of scores calculated on the HRF dataset using the score-of-means aggregation and an assumption on the region of evaluation.
- Parameters:
imageset (str|list) – ‘all’ or the list of identifiers of images (e.g. [‘13_h’, ‘01_g’])
assumption (str) – the assumption on the region of evaluation to test (‘fov’/’all’)
scores (dict) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'details':A list providing further details from the analysis of the scores one after the other.
'n_valid_tptn_pairs':The number of tp and tn pairs that are compatible with all scores.
'prefiltering_details':The results of the prefiltering by using the solutions for the score pairs.
'evidence':The evidence for satisfying the consistency constraints.
- Return type:
dict
CHASE_DB1
- mlscorecheck.check.bundles.retina.check_chasedb1_vessel_image(image_identifier: str, annotator: str, scores: dict, eps, *, numerical_tolerance: float = 1e-06)[source]
Testing the scores calculated for one image of the CHASEDB1 dataset
- Parameters:
image_identifier (str) – the identifier of the image (like “11R”)
annotator (str) – the annotation to use (‘manual1’/’manual2’)
scores (dict) – the scores to be tested (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float) – the numerical uncertainty
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'details':A list providing further details from the analysis of the scores one after the other.
'n_valid_tptn_pairs':The number of tp and tn pairs that are compatible with all scores.
'prefiltering_details':The results of the prefiltering by using the solutions for the score pairs.
'evidence':The evidence for satisfying the consistency constraints.
- Return type:
dict
Examples
>>> from mlscorecheck.check.bundles.retina import check_chasedb1_vessel_image
>>> img_identifier = '11R'
>>> scores = {'acc': 0.4457, 'sens': 0.0051, 'spec': 0.4706}
>>> results = check_chasedb1_vessel_image(image_identifier=img_identifier,
...                                       annotator='manual1',
...                                       scores=scores,
...                                       eps=1e-4)
>>> results['inconsistency']
# False
- mlscorecheck.check.bundles.retina.check_chasedb1_vessel_aggregated(imageset, annotator: str, scores: dict, eps, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) dict[source]
Testing the scores calculated for the CHASEDB1 dataset with both assumptions on the mode of aggregation.
- Parameters:
imageset (str|list) – ‘all’ if all images are used, or a list of identifiers of images (e.g. [‘11R’, ‘07L’])
annotator (str) – the annotation to be used (‘manual1’/’manual2’)
eps (float) – the numerical uncertainty
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
The summary of the results, with the following entries:
'inconsistency':All findings.
details*:The details of the analysis for the two assumptions.
- Return type:
dict
Examples
>>> from mlscorecheck.check.bundles.retina import check_chasedb1_vessel_aggregated
>>> scores = {'acc': 0.5063, 'sens': 0.4147, 'spec': 0.5126}
>>> k = 4
>>> results = check_chasedb1_vessel_aggregated(imageset='all',
...                                            annotator='manual1',
...                                            scores=scores,
...                                            eps=1e-4,
...                                            verbosity=0)
>>> results['inconsistency']
# {'inconsistency_mos': False, 'inconsistency_som': True}
- mlscorecheck.check.bundles.retina.check_chasedb1_vessel_aggregated_mos(imageset, annotator: str, scores: dict, eps, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) dict[source]
Checking the consistency of scores calculated for some images of the CHASEDB1 dataset with the mean of scores aggregation.
- Parameters:
imageset (str|list) – ‘all’ if all images are used, or a list of identifiers of images (e.g. [‘11R’, ‘07L’])
annotator (str) – the annotation to be used (‘manual1’/’manual2’)
scores (dict) – the scores to check (supports only ‘acc’, ‘sens’, ‘spec’, ‘bacc’)
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'lp_status':The status of the lp solver.
'lp_configuration_scores_match':A flag indicating if the scores from the lp configuration match the scores provided.
'lp_configuration_bounds_match':Indicates if the specified bounds match the actual figures.
'lp_configuration':Contains the actual configuration of the linear programming solver.
- Return type:
dict
- mlscorecheck.check.bundles.retina.check_chasedb1_vessel_aggregated_som(imageset, annotator, scores, eps, numerical_tolerance=1e-06)[source]
Tests the consistency of scores calculated on the CHASEDB1 dataset using the score-of-means aggregation.
- Parameters:
imageset (str|list) – ‘all’ if all images are used, or a list of identifiers of images (e.g. [‘11R’, ‘07L’])
annotator (str) – the annotation to be used (‘manual1’/’manual2’)
scores (dict) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'details':A list providing further details from the analysis of the scores one after the other.
'n_valid_tptn_pairs':The number of tp and tn pairs that are compatible with all scores.
'prefiltering_details':The results of the prefiltering by using the solutions for the score pairs.
'evidence':The evidence for satisfying the consistency constraints.
- Return type:
dict
DIARETDB0
- mlscorecheck.check.bundles.retina.check_diaretdb0_class(subset: str, batch, class_name, scores: dict, eps, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) dict[source]
Testing the scores calculated for the DIARETDB0 dataset. The dataset is an image labeling dataset, where various images can be labeled by the lesion recognized on the images. There are 5 different lesion labels, referred to as class_name in the arguments. The test considers the labeling of a certain lesion (class) as a binary classification problem, with the images carrying the label treated as positive and the images without the label treated as negative samples. Furthermore, there are multiple batches of train and test images (9); the list of batches used for the evaluation can be passed with the batch argument. The actual subset from the batches being evaluated is passed through the subset argument. The test assumes that the scores are aggregated across the batches and therefore executes the tests with both the SoM and MoS aggregation assumptions.
- Parameters:
subset (str) – ‘train’/’test’
batch (str|list) – the list of batches used, ‘all’ for all batches, or a subset of [‘1’, ‘2’, …, ‘9’]
class_name (str|list) – the name of the class being evaluated (‘neovascularisation’| ‘hardexudates’|’softexudates’|’hemorrhages’|’redsmalldots’), a list if a list of classes is treated as positive
eps (float) – the numerical uncertainty
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
The summary of the results, with the following entries:
'inconsistency':All findings.
details*:The details of the analysis for the two assumptions.
- Return type:
dict
Examples
>>> from mlscorecheck.check.bundles.retina import check_diaretdb0_class
>>> scores = {'acc': 0.4271, 'sens': 0.406, 'spec': 0.4765}
>>> results = check_diaretdb0_class(subset='test',
...                                 batch='all',
...                                 class_name='hardexudates',
...                                 scores=scores,
...                                 eps=1e-4)
>>> results['inconsistency']
# {'inconsistency_som': True, 'inconsistency_mos': False}
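Treating one lesion label as a binary problem means counting the images that carry the label as positives and the rest as negatives. The sketch below shows this reduction on made-up labels. The helper `binary_counts` is hypothetical, and for a list of classes it assumes that an image is positive if it carries any of them; the library may define this differently.

```python
def binary_counts(image_labels, class_name):
    """Return (p, n) for the binary problem defined by class_name.

    image_labels maps an image identifier to its list of lesion labels.
    Assumption: with a list of classes, any match makes the image positive.
    """
    if isinstance(class_name, str):
        positive_classes = {class_name}
    else:
        positive_classes = set(class_name)
    p = sum(1 for labels in image_labels.values()
            if positive_classes & set(labels))
    return p, len(image_labels) - p


# Invented labels for four images.
image_labels = {
    "img1": ["hardexudates"],
    "img2": [],
    "img3": ["hemorrhages", "softexudates"],
    "img4": ["redsmalldots"],
}

print(binary_counts(image_labels, "hardexudates"))
print(binary_counts(image_labels, ["hardexudates", "softexudates"]))
```

The resulting (p, n) counts per batch are what the SoM and MoS tests then work with.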
DIARETDB1
- mlscorecheck.check.bundles.retina.check_diaretdb1_class(*, subset: str, class_name, confidence: float, scores: dict, eps, numerical_tolerance: float = 1e-06) dict[source]
Tests the scores describing the labeling of images in DIARETDB1. The problem is a multi-labeling problem; this test function supports the testing of binary subproblems (for example, the ‘hardexudates’ class being treated as the positive label).
- Parameters:
subset (str) – the subset to be used (‘train’/’test’), typically ‘test’
class_name (str|list) – the name or list of names of classes used as “positive”
confidence (float) – the confidence threshold, typically 0.75
scores (dict) – the scores to be tested (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float) – the numerical uncertainty
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'details':A list providing further details from the analysis of the scores one after the other.
'n_valid_tptn_pairs':The number of tp and tn pairs that are compatible with all scores.
'prefiltering_details':The results of the prefiltering by using the solutions for the score pairs.
'evidence':The evidence for satisfying the consistency constraints.
- Return type:
dict
Examples
>>> from mlscorecheck.check.bundles.retina import check_diaretdb1_class
>>> scores = {'acc': 0.3115, 'sens': 1.0, 'spec': 0.0455, 'f1p': 0.4474}
>>> results = check_diaretdb1_class(subset='test',
...                                 class_name=['hardexudates', 'softexudates'],
...                                 confidence=0.75,
...                                 scores=scores,
...                                 eps=1e-4)
>>> results['inconsistency']
# False
- mlscorecheck.check.bundles.retina.check_diaretdb1_segmentation_image(*, image_identifier: str, class_name, confidence: float, scores: dict, eps, numerical_tolerance: float = 1e-06) dict[source]
Tests the scores describing the segmentation of images in DIARETDB1. This test function supports the testing of binary subproblems (for example, the pixels of the ‘hardexudates’ class being segmented in an image). The test evaluates both assumptions of using the FoV or all pixels for evaluation.
- Parameters:
image_identifier (str) – the identifier of the image to be tested (e.g. ‘001’)
class_name (str|list) – the name or list of names of classes used as “positive”
confidence (float) – the confidence threshold, typically 0.75
scores (dict) – the scores to be tested (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’, and complements, like ‘false_positive_rate’ for (1 - ‘spec’), can also be used.
eps (float) – the numerical uncertainty
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
The summary of the results, with the following entries:
'inconsistency':All findings.
details*:The details of the analysis for the two assumptions.
- Return type:
dict
Examples
>>> from mlscorecheck.check.bundles.retina import check_diaretdb1_segmentation_image
>>> scores = {'acc': 0.5753, 'sens': 0.0503, 'spec': 0.6187, 'f1p': 0.0178}
>>> results = check_diaretdb1_segmentation_image(image_identifier='005',
...                                              class_name=['hardexudates', 'softexudates'],
...                                              confidence=0.75,
...                                              scores=scores,
...                                              eps=1e-4)
>>> results['inconsistency']
# {'inconsistency_fov': True, 'inconsistency_all': False}
- mlscorecheck.check.bundles.retina.check_diaretdb1_segmentation_aggregated(*, subset: str, class_name, confidence: float, only_valid: bool, scores: dict, eps, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) dict[source]
Tests the scores describing the segmentation of multiple images of DIARETDB1 in an aggregated way. This test function supports the testing of binary subproblems (for example, the pixels of the ‘hardexudates’ class being segmented in an image). The test evaluates both assumptions on the region of evaluation.
- Parameters:
subset (str|list) – the subset of images to be used (‘train’/’test’) or the list of image identifiers to be tested (e.g. ‘001’)
class_name (str|list) – the name or list of names of classes used as “positive”
confidence (float) – the confidence threshold, typically 0.75
only_valid (bool) – if True, works with the subset of images in which both positives and negatives are present (e.g. images where the class class_name=‘hardexudates’ is not present at the confidence=0.75 level are discarded). If False, sensitivity is specified in scores, and one of the images has 0 positives, the MoS test cannot be executed.
eps (float) – the numerical uncertainty
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
The summary of the results, with the following entries:
'inconsistency':All findings.
details*:The details of the analysis for the two assumptions.
- Return type:
Examples
>>> from mlscorecheck.check.bundles.retina import check_diaretdb1_segmentation_aggregated
>>> scores = {'acc': 0.7143, 'sens': 0.3775, 'spec': 0.7244}
>>> results = check_diaretdb1_segmentation_aggregated(subset='test',
...     class_name='hardexudates',
...     confidence=0.5,
...     only_valid=True,
...     scores=scores,
...     eps=1e-4)
>>> results['inconsistency']
# {'inconsistency_fov_som': True,
#  'inconsistency_all_som': True,
#  'inconsistency_fov_mos': False,
#  'inconsistency_all_mos': False}
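The role of only_valid can be illustrated with a short sketch. It assumes that an image is "valid" when it has at least one positive and one negative pixel. The image records and the helper names `keep_valid` and `sensitivity` are hypothetical.

```python
# Hypothetical per-image pixel counts, for illustration only.
images = [
    {"identifier": "a", "p": 120, "n": 9_880},
    {"identifier": "b", "p": 0, "n": 10_000},  # class not present in this image
    {"identifier": "c", "p": 45, "n": 9_955},
]

def keep_valid(images):
    """Keep the images that contain both positives and negatives."""
    return [img for img in images if img["p"] > 0 and img["n"] > 0]

def sensitivity(tp, p):
    """Per-image sensitivity; undefined (division by zero) when p == 0."""
    return tp / p

valid = keep_valid(images)
print([img["identifier"] for img in valid])
```

Image "b" shows why a mean of per-image sensitivities cannot be formed without filtering: its sensitivity is undefined.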
DRISHTI_GS
- mlscorecheck.check.bundles.retina.check_drishti_gs_segmentation_image(image_identifier: str, confidence: float, target: str, scores: dict, eps: float, *, numerical_tolerance: float = 1e-06)[source]
Testing the segmentation results on one image.
- Parameters:
image_identifier (str) – the image identifier (e.g. ‘053’)
confidence (float) – the confidence level (in [0,1]), used for thresholding the soft segmentation ground truth image at threshold*255
target (str) – the target anatomical part (‘OD’/’OC’)
scores (dict) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’ and complements, like ‘false_positive_rate’ for (1 - ‘spec’) can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'details':A list providing further details from the analysis of the scores one after the other.
'n_valid_tptn_pairs':The number of tp and tn pairs that are compatible with all scores.
'prefiltering_details':The results of the prefiltering by using the solutions for the score pairs.
'evidence':The evidence for satisfying the consistency constraints.
- Return type:
Examples
>>> from mlscorecheck.check.bundles.retina import check_drishti_gs_segmentation_image
>>> scores = {'acc': 0.5966, 'sens': 0.3, 'spec': 0.6067, 'f1p': 0.0468}
>>> results = check_drishti_gs_segmentation_image(image_identifier='053',
...     confidence=0.75,
...     target='OD',
...     scores=scores,
...     eps=1e-4)
>>> results['inconsistency']
# False
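How the confidence parameter turns a soft ground truth into the p and n counts can be sketched as follows. This is an illustration under two assumptions: the soft annotation is an 8-bit image, and a pixel is positive when its value is strictly greater than confidence*255 (the library may use a different comparison). The helper name `count_p_n` and the tiny image are hypothetical.

```python
def count_p_n(soft_ground_truth, confidence):
    """Threshold an 8-bit soft annotation at confidence * 255 and count classes."""
    threshold = confidence * 255
    flat = [value for row in soft_ground_truth for value in row]
    # Assumption: strictly greater than the threshold means positive.
    p = sum(1 for value in flat if value > threshold)
    return p, len(flat) - p

# Hypothetical 2x3 soft ground truth image.
soft = [[0, 64, 128],
        [192, 255, 255]]

print(count_p_n(soft, confidence=0.75))
print(count_p_n(soft, confidence=0.5))
```

Lowering the confidence level moves pixels from the negative to the positive class, so the same reported scores are tested against a different (p, n) pair.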
- mlscorecheck.check.bundles.retina.check_drishti_gs_segmentation_aggregated(subset: str, confidence: float, target: str, scores: dict, eps: float, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06)[source]
Testing the scores shared for a set of images with both the MoS and SoM aggregations.
- Parameters:
subset (str|list) – the subset (‘test’/’train’) or the list of identifiers, e.g. [‘053’, ‘086’]
confidence (float) – the confidence level (in [0,1]), used for thresholding the soft segmentation ground truth image at threshold*255
target (str) – the target anatomical part (‘OD’/’OC’)
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the images
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity of the linear programming solver, 0: silent, 1: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
The summary of the results, with the following entries:
'inconsistency':All findings.
details*:The details of the analysis for the two assumptions.
- Return type:
Examples
>>> from mlscorecheck.check.bundles.retina import check_drishti_gs_segmentation_aggregated
>>> scores = {'acc': 0.4767, 'sens': 0.4845, 'spec': 0.4765, 'f1p': 0.0512}
>>> results = check_drishti_gs_segmentation_aggregated(subset='test',
...     confidence=0.75,
...     target='OD',
...     scores=scores,
...     eps=1e-4)
>>> results['inconsistency']
# {'inconsistency_som': False, 'inconsistency_mos': False}
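The two aggregations generally give different values for the same per-image results, which is why both are tested. A minimal sketch, reading MoS as the mean of the per-image scores and SoM as the score computed from the summed counts. The counts are hypothetical.

```python
# Hypothetical per-image counts, for illustration only.
images = [
    {"tp": 8, "p": 10},
    {"tp": 1, "p": 40},
]

def sensitivity(tp, p):
    return tp / p

# MoS: compute the score per image, then average the scores.
mos_sens = sum(sensitivity(**img) for img in images) / len(images)

# SoM: sum the counts over the images, then compute the score once.
totals = {key: sum(img[key] for img in images) for key in ("tp", "p")}
som_sens = sensitivity(**totals)

print(mos_sens, som_sens)
```

Here the MoS sensitivity is about 0.41 while the SoM sensitivity is 0.18, so a reported value can be consistent with one aggregation and not the other.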
Preterm delivery prediction by EHG signals
The test bundle dedicated to the testing of electrohysterogram data.
- mlscorecheck.check.bundles.ehg.check_tpehg(scores: dict, eps, n_folds: int, n_repeats: int, *, score_bounds: dict = None, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) dict[source]
Checks the cross-validated TPEHG scores
- Parameters:
scores (dict) – the dictionary of scores (supports only ‘acc’, ‘sens’, ‘spec’, ‘bacc’). Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’ and complements, like ‘false_positive_rate’ for (1 - ‘spec’) can also be used.
n_folds (int) – the number of folds
n_repeats (int) – the number of repetitions
score_bounds (dict(str,tuple(float,float))) – the potential bounds on the scores of the folds
solver_name (None|str) – the solver to use
timeout (None|int) – the timeout for the linear programming solver in seconds
verbosity (int) – the verbosity level of the pulp linear programming solver 0: silent, non-zero: verbose.
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the experiment.
'details':A list of dictionaries containing the details of the consistency tests. Each entry contains the specification of the folds being tested and the outcome of the check_1_dataset_known_folds_mos function.
- Return type:
Examples
>>> from mlscorecheck.check.bundles.ehg import check_tpehg
>>> # the 5-fold cross-validation scores reported in the paper
>>> scores = {'acc': 0.9447, 'sens': 0.9139, 'spec': 0.9733}
>>> eps = 0.0001
>>> results = check_tpehg(scores=scores, eps=eps, n_folds=5, n_repeats=1)
>>> results['inconsistency']
# True
Skin lesion classification
The test bundle dedicated to the testing of skin lesion classification.
ISIC2016
- mlscorecheck.check.bundles.skinlesion.check_isic2016(*, scores: dict, eps: float, numerical_tolerance: float = 1e-06)[source]
Tests if the scores are consistent with the test set of the ISIC2016 melanoma classification dataset
- Parameters:
scores (dict) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’ and complements, like ‘false_positive_rate’ for (1 - ‘spec’) can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the dataset.
'details':A list providing further details from the analysis of the scores one after the other.
'n_valid_tptn_pairs':The number of tp and tn pairs that are compatible with all scores.
'prefiltering_details':The results of the prefiltering by using the solutions for the score pairs.
'evidence':The evidence for satisfying the consistency constraints.
- Return type:
Examples
>>> from mlscorecheck.check.bundles.skinlesion import check_isic2016
>>> scores = {'acc': 0.7916, 'sens': 0.2933, 'spec': 0.9145}
>>> results = check_isic2016(scores=scores, eps=1e-4)
>>> results['inconsistency']
# False
ISIC2017
- mlscorecheck.check.bundles.skinlesion.check_isic2017(*, target, against, scores: dict, eps: float, numerical_tolerance: float = 1e-06)[source]
Tests if the scores are consistent with the test set of the ISIC2017 skin lesion classification dataset. The dataset contains three classes; the test covers the binary classification aspect of the problem, in which one (or two) of the classes is classified against the other two (or one).
- Parameters:
target (str|list) – the target (positive) class(es), with the encoding ‘M’ for melanoma, ‘SK’ for seborrheic keratosis and ‘N’ for nevus.
against (str|list) – specification of the negative classes, with the encoding ‘M’ for melanoma, ‘SK’ for seborrheic keratosis and ‘N’ for nevus.
scores (dict) – the scores to check (‘acc’, ‘sens’, ‘spec’, ‘bacc’, ‘npv’, ‘ppv’, ‘f1’, ‘fm’, ‘f1n’, ‘fbp’, ‘fbn’, ‘upm’, ‘gm’, ‘mk’, ‘lrp’, ‘lrn’, ‘mcc’, ‘bm’, ‘pt’, ‘dor’, ‘ji’, ‘kappa’). When using f-beta positive or f-beta negative, also set ‘beta_positive’ and ‘beta_negative’. Full names in snake case, like ‘positive_predictive_value’, synonyms, like ‘true_positive_rate’ or ‘tpr’ instead of ‘sens’ and complements, like ‘false_positive_rate’ for (1 - ‘spec’) can also be used.
eps (float|dict(str,float)) – the numerical uncertainty(ies) of the scores
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
A dictionary containing the results of the consistency check. The dictionary includes the following keys:
'inconsistency':A boolean flag indicating whether the set of feasible true positive (tp) and true negative (tn) pairs is empty. If True, it indicates that the provided scores are not consistent with the dataset.
'details':A list providing further details from the analysis of the scores one after the other.
'n_valid_tptn_pairs':The number of tp and tn pairs that are compatible with all scores.
'prefiltering_details':The results of the prefiltering by using the solutions for the score pairs.
'evidence':The evidence for satisfying the consistency constraints.
- Return type:
Examples
>>> from mlscorecheck.check.bundles.skinlesion import check_isic2017
>>> scores = {'acc': 0.6183, 'sens': 0.4957, 'ppv': 0.2544, 'f1p': 0.3362}
>>> results = check_isic2017(target='M', against=['SK', 'N'], scores=scores, eps=1e-4)
>>> results['inconsistency']
# False
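The target and against arguments reduce the three-class problem to a binary one by fixing which classes are counted as positives and which as negatives. A sketch of that reduction, with hypothetical class counts (not the ISIC2017 test set statistics) and a hypothetical helper name `binary_counts`:

```python
def binary_counts(class_counts, target, against):
    """Return (p, n) for classifying the target classes against the others."""
    target = [target] if isinstance(target, str) else list(target)
    against = [against] if isinstance(against, str) else list(against)
    if set(target) & set(against):
        raise ValueError("target and against must be disjoint")
    p = sum(class_counts[name] for name in target)
    n = sum(class_counts[name] for name in against)
    return p, n

# Hypothetical class counts, for illustration only.
class_counts = {"M": 10, "SK": 20, "N": 70}

print(binary_counts(class_counts, "M", ["SK", "N"]))
print(binary_counts(class_counts, ["M", "SK"], "N"))
```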
Experiments (experiments)
The predefined dataset and experiment statistics to look up are stored in the mlscorecheck.experiments module.
The core modules
Score functions (scores)
- mlscorecheck.scores.accuracy(*, tp, tn, p, n)[source]
The accuracy score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.error_rate(*, fp, fn, p, n)[source]
The error_rate score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.positive_predictive_value(*, tp, fp)[source]
The positive_predictive_value score
- mlscorecheck.scores.negative_predictive_value(*, tn, fn)[source]
The negative_predictive_value score
- mlscorecheck.scores.f_beta_positive(*, tp, fp, p, beta_positive)[source]
The f_beta_positive score
- Parameters:
- Returns:
the score
- Return type:
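For reference, the textbook F-beta score can be written with the same arguments as the signature above. This is a sketch of the standard formula, not a copy of the library's code, and the function name `f_beta_positive_sketch` is hypothetical.

```python
def f_beta_positive_sketch(*, tp, fp, p, beta_positive):
    """Textbook F-beta for the positive class, from tp, fp and p."""
    fn = p - tp  # positives that were missed
    beta_sq = beta_positive ** 2
    return (1 + beta_sq) * tp / ((1 + beta_sq) * tp + beta_sq * fn + fp)

# With beta = 1 this is the harmonic mean of precision (0.75) and recall (0.6).
f1 = f_beta_positive_sketch(tp=30, fp=10, p=50, beta_positive=1)
f2 = f_beta_positive_sketch(tp=30, fp=10, p=50, beta_positive=2)
print(f1, f2)
```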
- mlscorecheck.scores.f_beta_negative(*, tn, fn, n, beta_negative)[source]
The f_beta_negative score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.unified_performance_measure(*, tp, tn, p, n)[source]
The unified_performance_measure score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.geometric_mean(*, tp, tn, p, n, sqrt=<built-in function sqrt>)[source]
The geometric_mean score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.fowlkes_mallows_index(*, tp, fp, p, sqrt=<built-in function sqrt>)[source]
The fowlkes_mallows_index score
- mlscorecheck.scores.markedness(*, tp, tn, p, n)[source]
The markedness score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.positive_likelihood_ratio(*, tp, fp, p, n)[source]
The positive_likelihood_ratio score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.negative_likelihood_ratio(*, tn, fn, p, n)[source]
The negative_likelihood_ratio score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.matthews_correlation_coefficient(*, tp, tn, p, n, sqrt=<built-in function sqrt>)[source]
The matthews_correlation_coefficient score
- Parameters:
- Returns:
the score
- Return type:
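The sqrt keyword in the signature above suggests that the square-root implementation can be swapped, presumably so the same formula can be evaluated on other numeric types. That reading is an assumption. The sketch below uses the textbook MCC formula with a replaceable sqrt; the name `mcc_sketch` is hypothetical.

```python
import math

def mcc_sketch(*, tp, tn, p, n, sqrt=math.sqrt):
    """Textbook Matthews correlation coefficient from tp, tn, p and n."""
    fn, fp = p - tp, n - tn
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator

value = mcc_sketch(tp=30, tn=60, p=50, n=70)
# The same formula with a different square-root callable.
value_alt = mcc_sketch(tp=30, tn=60, p=50, n=70, sqrt=lambda x: x ** 0.5)
print(value, value_alt)
```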
- mlscorecheck.scores.bookmaker_informedness(*, tp, tn, p, n)[source]
The bookmaker_informedness score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.prevalence_threshold(*, tp, fp, p, n, sqrt=<built-in function sqrt>)[source]
The prevalence_threshold score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.diagnostic_odds_ratio(*, tp, tn, p, n)[source]
The diagnostic_odds_ratio score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.balanced_accuracy(*, tp, tn, p, n)[source]
The balanced_accuracy score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.cohens_kappa(*, tp, tn, p, n)[source]
The cohens_kappa score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.accuracy_standardized(*, tp, tn, p, n)[source]
The standardized accuracy score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.error_rate_standardized(*, tp, tn, p, n)[source]
The standardized error_rate score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.false_negative_rate_standardized(*, tp, p)[source]
The standardized false_negative_rate score
- mlscorecheck.scores.false_positive_rate_standardized(*, tn, n)[source]
The standardized false_positive_rate score
- mlscorecheck.scores.positive_predictive_value_standardized(*, tp, tn, n)[source]
The standardized positive_predictive_value score
- mlscorecheck.scores.false_discovery_rate_standardized(*, tp, tn, n)[source]
The standardized false_discovery_rate score
- mlscorecheck.scores.false_omission_rate_standardized(*, tp, tn, p)[source]
The standardized false_omission_rate score
- mlscorecheck.scores.negative_predictive_value_standardized(*, tp, tn, p)[source]
The standardized negative_predictive_value score
- mlscorecheck.scores.f_beta_positive_standardized(*, tp, tn, p, n, beta_positive)[source]
The standardized f_beta_positive score
- Parameters:
tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives
beta_positive (int|float|Interval|IntervalUnion) – the beta parameter
- Returns:
the score
- Return type:
- mlscorecheck.scores.f_beta_negative_standardized(*, tp, tn, p, n, beta_negative)[source]
The standardized f_beta_negative score
- Parameters:
tp (int|float|Interval|IntervalUnion) – The number of true positives
tn (int|float|Interval|IntervalUnion) – The number of true negatives
p (int|float|Interval|IntervalUnion) – The number of positives
n (int|float|Interval|IntervalUnion) – The number of negatives
beta_negative (int|float|Interval|IntervalUnion) – the beta parameter
- Returns:
the score
- Return type:
- mlscorecheck.scores.f1_positive_standardized(*, tp, tn, p, n)[source]
The standardized f1_positive score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.f1_negative_standardized(*, tp, tn, p, n)[source]
The standardized f1_negative score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.unified_performance_measure_standardized(*, tp, tn, p, n)[source]
The standardized unified_performance_measure score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.geometric_mean_standardized(*, tp, tn, p, n, sqrt=<built-in function sqrt>)[source]
The standardized geometric_mean score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.fowlkes_mallows_index_standardized(*, tp, tn, p, n, sqrt=<built-in function sqrt>)[source]
The standardized fowlkes_mallows_index score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.markedness_standardized(*, tp, tn, p, n)[source]
The standardized markedness score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.positive_likelihood_ratio_standardized(*, tp, tn, p, n)[source]
The standardized positive_likelihood_ratio score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.negative_likelihood_ratio_standardized(*, tp, tn, p, n)[source]
The standardized negative_likelihood_ratio score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.matthews_correlation_coefficient_standardized(*, tp, tn, p, n, sqrt=<built-in function sqrt>)[source]
The standardized matthews_correlation_coefficient score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.bookmaker_informedness_standardized(*, tp, tn, p, n)[source]
The standardized bookmaker_informedness score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.prevalence_threshold_standardized(*, tp, tn, p, n, sqrt=<built-in function sqrt>)[source]
The standardized prevalence_threshold score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.diagnostic_odds_ratio_standardized(*, tp, tn, p, n)[source]
The standardized diagnostic_odds_ratio score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.jaccard_index_standardized(*, tp, tn, p, n)[source]
The standardized jaccard_index score
- Parameters:
- Returns:
the score
- Return type:
- mlscorecheck.scores.balanced_accuracy_standardized(*, tp, tn, p, n)[source]
The standardized balanced_accuracy score
- Parameters:
- Returns:
the score
- Return type:
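Judging by the signatures, the "standardized" variants take only tp, tn, p and n (or a subset of them), whereas the plain variants may also take fp and fn. One form can be rewritten into the other with the identities fp = n - tn and fn = p - tp. The sketch below illustrates this for two scores using textbook formulas; the function names are hypothetical and the code is not taken from the library.

```python
def error_rate_sketch(*, fp, fn, p, n):
    return (fp + fn) / (p + n)

def error_rate_standardized_sketch(*, tp, tn, p, n):
    # fp = n - tn and fn = p - tp
    return ((n - tn) + (p - tp)) / (p + n)

def ppv_sketch(*, tp, fp):
    return tp / (tp + fp)

def ppv_standardized_sketch(*, tp, tn, n):
    # fp = n - tn
    return tp / (tp + (n - tn))

tp, tn, p, n = 30, 60, 50, 70
fp, fn = n - tn, p - tp

print(error_rate_sketch(fp=fp, fn=fn, p=p, n=n),
      error_rate_standardized_sketch(tp=tp, tn=tn, p=p, n=n))
print(ppv_sketch(tp=tp, fp=fp), ppv_standardized_sketch(tp=tp, tn=tn, n=n))
```

Expressing every score in terms of tp and tn alone leaves exactly two unknowns once p and n are fixed, which is what the tp, tn pair enumeration in the individual tests relies on.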
Testing logic for individual scores (individual)
The main, low level interface function of the module is check_scores_tptn_pairs.
- mlscorecheck.individual.check_scores_tptn_pairs(p: int, n: int, scores: dict, eps, *, numerical_tolerance: float = 1e-06, solve_for: str = None, prefilter_by_pairs: bool = False) dict[source]
Check scores by iteratively reducing the set of feasible tp, tn pairs.
- Parameters:
p (int) – the number of positives
n (int) – the number of negatives
scores (dict) – the available reported scores
eps (float|dict(str,float)) – the numerical uncertainties for all scores or each score individually
numerical_tolerance (float) – the additional numerical tolerance
solve_for (str) – the figure to solve for (‘tp’/‘tn’); the other one is iterated over. If None, the optimal one is used.
prefilter_by_pairs (bool) – whether to prefilter the tp and tn intervals by the pairwise solutions
- Returns:
a summary of the results. When the 'inconsistency' flag is True, the set of feasible tp, tn pairs is empty. The list under the key 'details' provides further details from the analysis of the scores one after the other. The key 'n_valid_tptn_pairs' holds the number of tp and tn pairs compatible with all scores. The key 'prefiltering_details' holds the results of the prefiltering by using the solutions for the score pairs.
- Return type:
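The idea of reducing the set of feasible tp, tn pairs score by score can be sketched with a plain exhaustive filter. This is a simplified illustration, not the library's algorithm. It supports only three scores, and it assumes a reported score matches when it lies within eps + numerical_tolerance of the computed value. The function name `feasible_tptn_pairs` and the numbers are hypothetical.

```python
def feasible_tptn_pairs(p, n, scores, eps, numerical_tolerance=1e-6):
    """Filter all (tp, tn) pairs by each reported score in turn."""
    score_functions = {
        "acc": lambda tp, tn: (tp + tn) / (p + n),
        "sens": lambda tp, tn: tp / p,
        "spec": lambda tp, tn: tn / n,
    }
    tolerance = eps + numerical_tolerance
    pairs = [(tp, tn) for tp in range(p + 1) for tn in range(n + 1)]
    for name, reported in scores.items():
        func = score_functions[name]
        # Keep only the pairs still compatible with this score.
        pairs = [(tp, tn) for tp, tn in pairs
                 if abs(func(tp, tn) - reported) <= tolerance]
    return pairs

# Scores rounded to 4 decimals from tp=80, tn=150 with p=100, n=200.
consistent = feasible_tptn_pairs(
    100, 200, {"acc": 0.7667, "sens": 0.8, "spec": 0.75}, eps=1e-4)
# The same sens and spec with an accuracy they cannot produce.
inconsistent = feasible_tptn_pairs(
    100, 200, {"acc": 0.9, "sens": 0.8, "spec": 0.75}, eps=1e-4)
print(consistent, inconsistent)
```

An empty result corresponds to the 'inconsistency' flag being True: no confusion matrix on the given p and n reproduces all reported scores.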
Testing logic for aggregated scores (aggregated)
The main, low level interface function of the module is check_aggregated_scores.
- mlscorecheck.aggregated.check_aggregated_scores(*, experiment: dict, scores: dict, eps, solver_name: str = None, timeout: int = None, verbosity: int = 1, numerical_tolerance: float = 1e-06) dict[source]
Check aggregated scores
- Parameters:
experiment (dict|Experiment) – the experiment specification
scores (dict) – the scores to match
solver_name (str) – the name of the solver to be used; see pulp.listSolvers(onlyAvailable=True) for the list of available solvers
timeout (int) – the number of seconds to time out
verbosity (int) – controls the verbosity level of the pulp based linear programming solver. 0: no output; non-zero: print output
numerical_tolerance (float) – in practice, beyond the numerical uncertainty of the scores, some further tolerance is applied. This is orders of magnitude smaller than the uncertainty of the scores. It does ensure that the specificity of the test is 1, it might slightly decrease the sensitivity.
- Returns:
the details of the test, under the key ‘inconsistency’, one can find the flag indicating if inconsistency was identified
- Return type:
- Raises:
ValueError – if the problem is not specified properly
- class mlscorecheck.aggregated.Dataset(p: int = None, n: int = None, dataset_name: str = None, identifier: str = None)[source]
The abstract representation of a dataset
- class mlscorecheck.aggregated.Folding(n_folds: int = None, n_repeats: int = None, folds: list = None, strategy: str = None)[source]
Abstract representation of a folding
- class mlscorecheck.aggregated.Fold(p: int, n: int, identifier: str = None)[source]
Abstract representation of a fold
- calculate_scores(rounding_decimals: int = None, score_subset: list = None) dict[source]
Calculate the scores for the fold
- init_lp(scores: dict = None)[source]
Initialize a linear programming problem by creating the variables for the fold
- Parameters:
scores (dict|None) – the score values to be used to set initial values
- Returns:
the updated problem
- Return type:
pl.LpProblem
- populate(lp_problem: LpProblem) LpProblem[source]
Populate the fold with the tp and tn values from the linear program
- Parameters:
lp_problem (pl.LpProblem) – the linear programming problem
- Returns:
the self object populated with the tp and tn scores
- Return type:
obj
- class mlscorecheck.aggregated.Evaluation(dataset: dict, folding: dict, aggregation: str, fold_score_bounds: dict = None)[source]
Abstract representation of an evaluation
- calculate_scores(rounding_decimals: int = None, score_subset: list = None) dict[source]
Calculates the scores
- init_lp(lp_problem: LpProblem, scores: dict = None) LpProblem[source]
Initializes a linear programming problem
- populate(lp_problem: LpProblem)[source]
Populates the evaluation with the figures in the solved linear programming problem
- Parameters:
lp_problem (pl.LpProblem) – the linear programming problem with solve() executed
- Returns:
the updated self object
- Return type:
obj
- class mlscorecheck.aggregated.Experiment(evaluations: list, aggregation: str, dataset_score_bounds: dict = None)[source]
Abstract representation of an experiment
- calculate_scores(rounding_decimals: int = None, score_subset: list = None) dict[source]
Calculates the scores
- init_lp(lp_problem: LpProblem, scores: dict = None) LpProblem[source]
Initializes a linear programming problem
- populate(lp_problem)[source]
Populates the evaluation with the figures in the solved linear programming problem
- Parameters:
lp_problem (pl.LpProblem) – the linear programming problem with solve() executed
- Returns:
the updated self object
- Return type:
obj
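The constructor signatures above suggest how an experiment can be laid out as nested dictionaries: an experiment holds evaluations, and each evaluation holds a dataset and a folding. The sketch below assumes the dict form mirrors the constructor parameter names. The values 'mos', 'som' and 'stratified_sklearn' are assumptions, the counts are hypothetical, and the helper `unexpected_keys` only compares key names against the signatures; it does not call the library.

```python
# Parameter names copied from the constructor signatures above.
ALLOWED_KEYS = {
    "experiment": {"evaluations", "aggregation", "dataset_score_bounds"},
    "evaluation": {"dataset", "folding", "aggregation", "fold_score_bounds"},
    "dataset": {"p", "n", "dataset_name", "identifier"},
    "folding": {"n_folds", "n_repeats", "folds", "strategy"},
}

# Hypothetical experiment: two datasets, each evaluated with 5-fold CV.
experiment = {
    "evaluations": [
        {"dataset": {"p": 40, "n": 160},
         "folding": {"n_folds": 5, "n_repeats": 1,
                     "strategy": "stratified_sklearn"},  # assumed value
         "aggregation": "mos"},                          # assumed value
        {"dataset": {"p": 25, "n": 75},
         "folding": {"n_folds": 5, "n_repeats": 1,
                     "strategy": "stratified_sklearn"},
         "aggregation": "mos"},
    ],
    "aggregation": "som",                                # assumed value
}

def unexpected_keys(experiment):
    """List the keys that do not match any constructor parameter name."""
    problems = sorted(set(experiment) - ALLOWED_KEYS["experiment"])
    for evaluation in experiment.get("evaluations", []):
        problems += sorted(set(evaluation) - ALLOWED_KEYS["evaluation"])
        for part in ("dataset", "folding"):
            problems += sorted(set(evaluation.get(part, {})) - ALLOWED_KEYS[part])
    return problems

print(unexpected_keys(experiment))
```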