Distribution scoring metrics

  • Predictive Power Score (PPS)
    • The improvement in an evaluation metric from a naive model to a model based on the selected feature, relative to the potential improvement achieved from a perfect feature
      • For classification, the naive model always predicts the modal class and the metric is the F1 score

      • For regression, the naive model always predicts the median and the metric is mean absolute error (MAE)

    • e.g. if a model that always predicts the modal class achieves an F1 score of 0.6 and a model using the selected feature achieves 0.9, then, with a perfect F1 score of 1, the PPS is (0.9 - 0.6) / (1 - 0.6) = 0.75
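
The PPS calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `predictive_power_score` and its parameters are assumptions made for this example, and the MAE values in the regression case are made up. The metric's perfect value is passed in explicitly (1.0 for F1, 0.0 for MAE) so that the same formula covers both cases.

```python
def predictive_power_score(naive_score: float,
                           feature_score: float,
                           perfect_score: float) -> float:
    """Improvement over the naive model as a fraction of the possible improvement.

    Hypothetical helper for illustration only. Works for metrics where higher
    is better (F1, perfect = 1.0) and where lower is better (MAE, perfect = 0.0).
    """
    possible_improvement = perfect_score - naive_score
    if possible_improvement == 0:
        raise ValueError("naive model is already perfect; PPS is undefined")
    return (feature_score - naive_score) / possible_improvement


# Classification example from the text: F1 of 0.6 (naive) vs 0.9 (feature model).
pps_f1 = predictive_power_score(0.6, 0.9, perfect_score=1.0)

# Regression example with made-up MAE values: 10.0 (naive) vs 4.0 (feature model).
pps_mae = predictive_power_score(10.0, 4.0, perfect_score=0.0)

print(round(pps_f1, 2), round(pps_mae, 2))  # prints: 0.75 0.6
```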

  • Categorical skew (categorical features)
    • The proportion of data points with the modal value

    • e.g. in a dataset with 6 points of class A, 3 of class B and 1 of class C, the modal value (class A) accounts for 60% of data points
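
As a rough sketch of the categorical skew calculation, the snippet below counts how often each value occurs and divides the largest count by the total. The function name `categorical_skew` is an assumption made for this example.

```python
from collections import Counter


def categorical_skew(values) -> float:
    """Proportion of data points that take the modal (most common) value."""
    values = list(values)
    if not values:
        raise ValueError("categorical skew is undefined for an empty dataset")
    # most_common(1) returns a one-element list of (value, count) pairs.
    modal_count = Counter(values).most_common(1)[0][1]
    return modal_count / len(values)


# Example from the text: 6 of class A, 3 of class B, 1 of class C.
sample = ["A"] * 6 + ["B"] * 3 + ["C"]
print(categorical_skew(sample))  # prints: 0.6
```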

  • Inter-quartile skew (numerical features)
    • The difference between the midpoint of the lower and upper quartiles and the median (midpoint minus median), as a proportion of half the inter-quartile range

    • e.g. if the lower and upper quartiles were 1 and 3 and the median was 1.5, the midpoint would be 2 and half the inter-quartile range would be 1, so the inter-quartile skew would be (2 - 1.5) / 1 = 0.5
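
The inter-quartile skew can be sketched as below. Two things here are assumptions rather than part of the definition above: the sign convention (midpoint minus median, chosen because it reproduces the 0.5 in the worked example and matches Bowley's quartile skewness) and the quartile estimation method used in the helper that works from raw data (`method="inclusive"` in Python's `statistics.quantiles`). Both function names are hypothetical.

```python
import statistics


def inter_quartile_skew(lower_quartile: float,
                        median: float,
                        upper_quartile: float) -> float:
    """(Quartile midpoint - median) divided by half the inter-quartile range.

    Assumed sign convention: positive when the median sits closer to the
    lower quartile.
    """
    half_iqr = (upper_quartile - lower_quartile) / 2
    if half_iqr == 0:
        raise ValueError("inter-quartile range is zero; skew is undefined")
    midpoint = (lower_quartile + upper_quartile) / 2
    return (midpoint - median) / half_iqr


def inter_quartile_skew_from_data(data) -> float:
    """Estimate the three quartiles from raw data, then compute the skew.

    The 'inclusive' method is an assumption; other quartile definitions
    can give slightly different values on small samples.
    """
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return inter_quartile_skew(q1, q2, q3)


# Example from the text: quartiles 1 and 3, median 1.5.
print(inter_quartile_skew(1, 1.5, 3))  # prints: 0.5
```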