Feature importance operators fail on datasets with features without any data

yzan Member Posts: 66 Unicorn
When an ExampleSet contains even a single feature that consists only of missing values, the following operators:
  • Weight by Information Gain Ratio
  • Weight by Information Gain
  • Weight by Gini
  • Weight by Uncertainty
fail with:
Exception: java.lang.ArrayIndexOutOfBoundsException
Message: 0
Similarly, Weight by Rules fails with:
Exception: com.rapidminer.example.AttributeTypeException
Message: Cannot map index of nominal attribute to nominal value: index 0 is out of bounds!
Known workaround: Apply Remove Useless Attributes first.

Expected result: Zero weight for features without any data.

Justification:
  1. Sometimes I want to report the relevance of all the features in the dataset.
  2. I dislike it when a time-consuming process fails because of some unlucky random seed in cross-validation...
Proposed action: Add a parameterized test that checks whether every feature weighting operator can handle a feature without any data (be it a nominal, numerical, or date column).

Reasoning: I didn't test all the operators, and there is a good chance that other operators share the same "halt the world" trait.
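
The proposed parameterized test could look roughly like the sketch below. It is written in Python for brevity and does not use the RapidMiner API: `weight_by_information_gain`, `weight_naive`, and `check_all_missing_features` are hypothetical stand-ins, and numerical values are treated as categories to keep the example short.

```python
import math
from collections import Counter

def is_missing(value):
    """Treat None and NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def entropy(labels):
    """Shannon entropy in bits; an empty list yields 0 (empty sum)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def weight_by_information_gain(column, labels):
    """Hypothetical stand-in: information gain that ignores missing values."""
    known = [(v, y) for v, y in zip(column, labels) if not is_missing(v)]
    if not known:
        return 0.0  # the "expected result" from the report
    groups = {}
    for v, y in known:
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / len(known) * entropy(g) for g in groups.values())
    return entropy([y for _, y in known]) - remainder

def weight_naive(column, labels):
    """Hypothetical stand-in that mimics the reported failure mode."""
    known = [v for v in column if not is_missing(v)]
    first = known[0]  # IndexError when every value is missing
    return sum(v == first for v in known) / len(column)

# One all-missing column per attribute type named in the proposal.
ALL_MISSING = {
    "nominal": [None] * 4,
    "numerical": [float("nan")] * 4,
    "date": [None] * 4,
}

def check_all_missing_features(weighters):
    """Run every weighter on every all-missing column; return failure descriptions."""
    labels = ["yes", "no", "yes", "no"]
    failures = []
    for name, weighter in weighters.items():
        for kind, column in ALL_MISSING.items():
            try:
                weight = weighter(column, labels)
            except Exception as exc:
                failures.append(f"{name}/{kind}: {type(exc).__name__}")
                continue
            if weight != 0.0:
                failures.append(f"{name}/{kind}: weight {weight}")
    return failures
```

Calling `check_all_missing_features({"naive": weight_naive})` reports one failure per column type, while the guarded information gain variant reports none.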







Answers

  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    I think your suggestions are not as simple as they may seem. Most of the operators you have mentioned rely on the notion of entropy. You are also assuming that a variable with all missing values is equivalent to an empty set. First of all, entropy is undefined for an empty set, and there are deep philosophical questions about this (often touching on the beginning and the end of the Universe and black holes). Some people suggest that the entropy of an empty set should be 0, as such a set has a single possible state. If, however, a variable with all missing values is interpreted as a set with unknown values, then perhaps its entropy is 1, as we have perfect uncertainty about every element of the set. And yet these are not the only possible interpretations: if, for instance, the missing values are fixed but were simply not collected, then the entropy of such a set can be any value between 0 and 1. So perhaps a Java failure is the best possible result for this conundrum!
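
The three interpretations above can be put into numbers with a small sketch. It assumes a binary attribute and Shannon entropy measured in bits (which is what bounds the result between 0 and 1); `shannon_entropy` is an illustrative helper, not a RapidMiner function.

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy in bits; outcomes with probability 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# Interpretation 1: a single possible state, so there is no uncertainty.
single_state = shannon_entropy([1.0])

# Interpretation 2: values exist but are unknown, so both outcomes are equally likely.
unknown_values = shannon_entropy([0.5, 0.5])

# Interpretation 3: values are fixed but were not collected; the entropy depends
# on the true (unobserved) share p of the first outcome.
uncollected = {p: shannon_entropy([p, 1 - p]) for p in (0.0, 0.1, 0.3, 0.5)}

# An empty distribution produces an empty sum; returning 0 here is a convention
# of this helper, not a resolution of the "undefined" case.
empty = shannon_entropy([])
```

Here `single_state` and `empty` evaluate to 0, `unknown_values` to 1, and the entries of `uncollected` range from 0 (for p = 0) up to 1 (for p = 0.5).
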
  • yzan Member Posts: 66 Unicorn
    If we want to stick with the theory, we may return NaN for a feature with all missing values. Then we just have to make sure that downstream operators that consume entropy or weights (like Select by Weight) can handle NaNs.
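
As a rough illustration of what "handling NaNs downstream" might involve, the sketch below selects and ranks features while keeping NaN weights out of the way. `select_by_weight` and `rank_by_weight` are hypothetical helpers written for this example; they do not describe how the Select by Weight operator is implemented.

```python
import math

def select_by_weight(weights, threshold):
    """Keep features whose weight is a real number >= threshold; NaN is never selected."""
    return [name for name, w in weights.items()
            if not math.isnan(w) and w >= threshold]

def rank_by_weight(weights):
    """Order features by descending weight, placing NaN weights last."""
    def key(name):
        w = weights[name]
        # NaN breaks ordinary comparisons, so sort on an explicit flag first.
        return (math.isnan(w), 0.0 if math.isnan(w) else -w)
    return sorted(weights, key=key)

# Made-up weights; "empty_column" stands for a feature with only missing values.
weights = {"income": 0.07, "empty_column": float("nan"), "age": 0.42}
selected = select_by_weight(weights, 0.1)
ranking = rank_by_weight(weights)
```

The explicit `isnan` checks matter mostly for ranking: comparisons against NaN are always false, so a plain sort on raw weights can leave NaN entries in arbitrary positions.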

    You also do not necessarily have to patch it at the entropy level; you might patch it at the level of Weight by Something (maybe a simple try-catch?). Then the proposal of returning 0 would make sense, as it is the only deterministic value that makes sense. Reasoning: We might generate infinitely many features with all missing values for any dataset. Using any (deterministic) value other than 0 would suggest that we might be able to extract information from thin air.
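
A minimal sketch of such a patch at the weighting level, written in Python rather than Java: a wrapper catches the failure and returns 0 only when the column really is all missing. `fragile_weight` and `with_zero_fallback` are hypothetical names, and the set of caught exception types is an assumption made for this example.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def fragile_weight(column, labels):
    """Hypothetical weighting function that breaks on an all-missing column."""
    known = [v for v in column if not is_missing(v)]
    most_common = max(set(known), key=known.count)  # ValueError on an empty set
    return known.count(most_common) / len(column)

def with_zero_fallback(weighter):
    """Wrap a weighter so that an all-missing column gets weight 0 instead of an error."""
    def wrapped(column, labels):
        try:
            return weighter(column, labels)
        except (IndexError, ValueError, ZeroDivisionError):
            if all(is_missing(v) for v in column):
                return 0.0
            raise  # a failure on real data is still a bug worth surfacing
    return wrapped

safe_weight = with_zero_fallback(fragile_weight)
```

Re-raising when the column contains real data keeps the wrapper from hiding unrelated bugs, which is the main risk of a blanket try-catch.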

    I definitely do not propose treating a missing value as a placeholder for any possible value, because then we would have to return ranges (or distributions) instead of point estimates whenever there is at least one missing value in a feature.

    Nevertheless, I would argue that a Java error is not the best possible result. If it were, operators like Decision Tree would also have to be modified to return a Java error whenever there is a variable with all missing values in the dataset.

  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    Yes, I agree with you that Java error is definitely not the best but I could not resist - it is a hard problem!