Preterm delivery prediction from electrohysterogram signals (TPEHG dataset)
---------------------------------------------------------------------------

Electrohysterogram (EHG) classification for predicting preterm delivery has become a popular application area for minority oversampling. However, overly optimistic classification results were reported because of systematic data leakage in the data preparation process [EHG]_. In [EHG]_, the implementations were replicated, and it was shown that there is a considerable gap in performance when the data is prepared properly. Data leakage also changes the statistics of the dataset being cross-validated, so the problematic scores can be identified with the tests implemented in the ``mlscorecheck`` package. To facilitate the use of the tools for this purpose, some functionalities have been prepared with the dataset already pre-populated.

The test bundle implemented in the ``mlscorecheck`` package is based on the TPEHG dataset [TPEHG]_, which contains 262 negative and 38 positive samples. In the absence of predefined train/test splits, the dataset is usually evaluated in a k-fold cross-validation scenario with unknown fold structures.

For illustration, given a set of scores reported in a real paper, the test below shows that they are not consistent with the dataset:

.. code-block:: Python

    >>> from mlscorecheck.check.bundles.ehg import check_tpehg
    >>> # the 5-fold cross-validation scores reported in the paper
    >>> scores = {'acc': 0.9447, 'sens': 0.9139, 'spec': 0.9733}
    >>> eps = 0.0001
    >>> results = check_tpehg(scores=scores,
    ...                       eps=eps,
    ...                       n_folds=5,
    ...                       n_repeats=1)
    >>> results['inconsistency']
    True

As the results show, the reported scores are inconsistent with the assumption that they were obtained from a 5-fold cross-validation experiment on the TPEHG dataset.
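
To make the notion of consistency more concrete, the following is a minimal sketch and not the algorithm used by ``check_tpehg``. It assumes, purely for simplicity, that the scores were computed from a single pooled confusion matrix over all 300 samples, whereas the package test accounts for scores averaged over folds with unknown fold structures. The helper ``consistent_counts`` is a hypothetical name introduced only for this illustration.

.. code-block:: Python

    def consistent_counts(scores, p, n, eps):
        """Return all (tp, tn) pairs that reproduce the scores within eps.

        Simplifying assumption: one pooled confusion matrix with
        p positive and n negative samples (no fold-wise averaging).
        """
        matches = []
        for tp in range(p + 1):
            for tn in range(n + 1):
                computed = {'acc': (tp + tn) / (p + n),
                            'sens': tp / p,
                            'spec': tn / n}
                # every reported score must be matched within eps
                if all(abs(computed[key] - value) <= eps
                       for key, value in scores.items()):
                    matches.append((tp, tn))
        return matches

    # the scores from the example above, on 38 positives and 262 negatives
    scores = {'acc': 0.9447, 'sens': 0.9139, 'spec': 0.9733}
    matches = consistent_counts(scores, p=38, n=262, eps=0.0001)
    print(matches)  # prints [] under the pooled-matrix assumption

Under this simplified assumption, sensitivity can only take values of the form ``tp / 38``. The two nearest candidates, 34/38 (about 0.8947) and 35/38 (about 0.9211), are both farther than ``eps`` from the reported 0.9139, so no pair of integer counts matches. The k-fold test in the package applies the same idea to the fold-level counts.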