Feature importance operators fail on datasets with features without any data
When an ExampleSet contains even a single feature that consists only of missing values, the following operators:
- Weight by Information Gain Ratio
- Weight by Information Gain
- Weight by Gini
- Weight by Uncertainty
fail with:
Exception: java.lang.ArrayIndexOutOfBoundsException
Message: 0
Similarly, Weight by Rules fails with:
Exception: com.rapidminer.example.AttributeTypeException
Message: Cannot map index of nominal attribute to nominal value: index 0 is out of bounds!
Known workaround: Apply Remove Useless Attributes first.
Expected result: Zero weight for features without any data.
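To make the expected result concrete, here is a minimal sketch in Python of an information gain weight that returns zero for a feature without any data. It is not RapidMiner's implementation: the function names are hypothetical, missing values are represented as None, and ignoring rows with a missing value is an assumption made only for this illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(values, labels):
    """Information gain of a nominal feature with respect to the labels.

    Rows where the feature is missing (None) are ignored (an assumption
    of this sketch). A feature with no observed values gets weight 0.0
    instead of raising an error.
    """
    observed = [(v, y) for v, y in zip(values, labels) if v is not None]
    if not observed:
        # Hypothetical guard: a feature without any data carries no information.
        return 0.0

    # Group the labels by the observed feature value.
    groups = {}
    for v, y in observed:
        groups.setdefault(v, []).append(y)

    observed_labels = [y for _, y in observed]
    conditional = sum(
        len(g) / len(observed) * entropy(g) for g in groups.values()
    )
    return entropy(observed_labels) - conditional
```

With this guard, `information_gain([None, None, None], ["a", "b", "a"])` returns 0.0, while a feature that perfectly separates two balanced classes still gets a weight of 1 bit.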
Justification:
- Sometimes I want to report the relevance of all the features in the dataset.
- I dislike it when a time-consuming process fails because of some unlucky random seed in cross-validation...
Proposed action: Add a parameterized test that checks whether every feature weighting operator can handle a feature without any data (be it a nominal, numerical, or date column).
Reasoning: I didn't test all the operators. And there is a good chance other operators might share the same "halt the world" trait.
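The shape of such a parameterized test could look roughly like the Python sketch below. Everything in it is hypothetical: the two weighting functions are toy stand-ins rather than RapidMiner operators, and None or NaN stands for a missing value. The point is only the structure: one check, run against every weighting function and every column type.

```python
import math


def empty_feature_cases():
    """One all-missing column per value type (None or NaN means missing)."""
    n = 6
    return {
        "nominal": [None] * n,
        "numerical": [float("nan")] * n,
        "date": [None] * n,
    }


def is_missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))


def check_weighter(weighter):
    """Run one weighting function against every empty-feature case.

    Returns a dict mapping the case name to "ok" or a failure description.
    """
    labels = ["yes", "no", "yes", "no", "yes", "no"]
    results = {}
    for name, column in empty_feature_cases().items():
        try:
            weight = weighter(column, labels)
        except Exception as exc:  # the "halt the world" behavior
            results[name] = f"raised {type(exc).__name__}"
            continue
        results[name] = "ok" if weight == 0.0 else f"unexpected weight {weight}"
    return results


# Toy stand-ins (hypothetical): the weight is the share of distinct values.
def naive_weighter(column, labels):
    observed = [v for v in column if not is_missing(v)]
    return len(set(observed)) / len(observed)  # divides by zero if empty


def guarded_weighter(column, labels):
    observed = [v for v in column if not is_missing(v)]
    if not observed:
        return 0.0  # zero weight for a feature without any data
    return len(set(observed)) / len(observed)
```

Calling `check_weighter(naive_weighter)` reports a raised ZeroDivisionError for every column type, while `check_weighter(guarded_weighter)` reports "ok" for all three. In a real test suite, the list of weighting functions would be the full set of weighting operators.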
Answers
I definitely do not propose treating missing values as placeholders for any possible value, because then we would have to return ranges (or distributions) instead of point estimates whenever there is at least one missing value in a feature.
Nevertheless, I would argue that a Java error is not the best possible result. If it were, operators like Decision Tree would have to be modified to also return a Java error whenever the dataset contains a variable with all values missing.