Regression

Preliminaries: Mean versus median

Given a set of numbers, how do you define their numerical "center"?

A common answer to this question is to say that the "center" is given by the average (also called the mean). Depending on how your data is distributed, the average might or might not be a good way of representing the data set. When all the data is tightly clumped together, the average is usually an excellent choice.

When the data is more spread out, other representations of the data may be more appropriate. Compared to the average, the median is relatively insensitive to outliers. See the following two examples, where in the second data set the value 8 has been replaced by 108. The mean changes dramatically, but the median is unaffected.

Data set    {1, 2, 3, 6, 8}    {1, 2, 3, 6, 108}
Mean        4                  24
Median      3                  3

In a sorted list of numbers, the median is the number precisely in the middle of the list, with just as many smaller values as larger values.
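The comparison in the table above can be checked with a few lines of Python. This is only an illustrative sketch using the standard `statistics` module; the variable and function names (`tight`, `with_outlier`, `summarize`) are made up for this example.

```python
from statistics import mean, median

# The two data sets from the table above; only the largest value differs.
tight = [1, 2, 3, 6, 8]
with_outlier = [1, 2, 3, 6, 108]


def summarize(values):
    """Return (mean, median) for a list of numbers."""
    return mean(values), median(values)


tight_summary = summarize(tight)
outlier_summary = summarize(with_outlier)

# The mean jumps when 8 becomes 108; the median stays where it was.
print("tight:       ", tight_summary)
print("with outlier:", outlier_summary)
```

Replacing a single value moves the mean from 4 to 24, while the middle element of the sorted list is still 3.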

Should you choose the mean or the median to represent the data? So long as you don't have significant outliers, it shouldn't matter, but if you do, as in the second example, you have to make a decision, depending on the purpose of your investigation.

Suppose each of the data points represents household income for one house in a small village, and you want to compare the data with other villages.

  • The mean more accurately represents the income for the whole village (it's a sum!).
  • The median more accurately represents the income of a typical household, ignoring outliers.

If you're quarrelsome, you might argue that neither the mean nor the median is a good indicator for the second data set; why not ignore them both and instead show a chart with all the data? When the data already exists, that argument has some merit, but we're building a predictive model precisely because the (future) data does not yet exist! We can't escape the problem so easily.

Nobody will believe your predictive model unless it makes plausible predictions, and to do that it has to weave a path through the center of your current data (the training set), using its own definition of "center", and paying more or less attention to outliers than other models. Even if you didn't create the model, you can still exercise some control over the result by

  • (a) choosing an appropriate performance metric, and
  • (b) choosing the model with the best performance according to that metric.

Performance metrics

We need some notation. Assume that the test set has N rows, and let the index n identify one of the rows.

  • Σ_n - A sum over all the rows in the test set
  • Y_n - in the nth row of the test set, the value of the target column
  • X_n - in the nth row of the test set, the values of the non-target data used to predict Y_n
  • f(X_n) - the prediction generated by the model, using X_n as input. Compare with Y_n, the actual value.

The absolute difference between the actual value and the predicted value, |Y_n - f(X_n)|, is sometimes called the residual. A successful model should of course minimize the residuals, but since there is more than one way of combining the residuals, there is also a variety of performance metrics. For regression problems, RapidMiner Go provides the following metrics:

Performance metric               Formula
Root Mean Square Error (RMSE)    sqrt[ Σ_n (Y_n - f(X_n))^2 ] / sqrt(N)
Average Absolute Error           (1 / N) Σ_n |Y_n - f(X_n)|
Average Relative Error           (1 / N) Σ_n (|Y_n - f(X_n)| / |Y_n|)
Squared Correlation (R^2)        See Coefficient of determination
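To make the formulas concrete, here is a minimal Python sketch of the four metrics. It is not RapidMiner Go's implementation: the function names and the sample numbers are invented for illustration, and the squared correlation is computed here as the squared Pearson correlation between actual and predicted values, which is an assumption about how that metric is defined.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error: sqrt of the mean squared difference."""
    n = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)


def average_absolute_error(actual, predicted):
    """Mean of |Y_n - f(X_n)| over all rows."""
    n = len(actual)
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / n


def average_relative_error(actual, predicted):
    """Mean of |Y_n - f(X_n)| / |Y_n|; undefined if any actual value is 0."""
    n = len(actual)
    return sum(abs(y - p) / abs(y) for y, p in zip(actual, predicted)) / n


def squared_correlation(actual, predicted):
    """Squared Pearson correlation (assumed definition for this sketch)."""
    n = len(actual)
    mean_y = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((y - mean_y) * (p - mean_p) for y, p in zip(actual, predicted))
    var_y = sum((y - mean_y) ** 2 for y in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_y * var_p)


# Hypothetical test set: Y_n and the matching predictions f(X_n).
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print("RMSE:                  ", rmse(actual, predicted))
print("Average Absolute Error:", average_absolute_error(actual, predicted))
print("Average Relative Error:", average_relative_error(actual, predicted))
print("Squared Correlation:   ", squared_correlation(actual, predicted))
```

Note that the Average Relative Error divides by |Y_n|, so it is only meaningful when no actual value in the test set is zero.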

Let's convert these formulae into useful advice.

Performance metric               Description
Root Mean Square Error (RMSE)    Choose the model with the minimum value of the Root Mean Square Error if you prefer average values. Because the errors are squared, large errors count disproportionately, so this metric, like the mean, gives more weight to outliers.
Average Absolute Error           Choose the model with the minimum value of the Average Absolute Error if you prefer median values. Every error counts in proportion to its size, so this metric, like the median, gives less weight to outliers.
Average Relative Error           A variant of the Average Absolute Error, where the error is calculated as a percentage of the actual value.
Squared Correlation (R^2)        Look for a high value of R^2 (close to 1), indicating a high correlation between predicted values and actual values.
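The link between these metrics and the mean and median can be seen in a small experiment. Suppose a "model" is allowed to predict only a single constant value for the second data set from the preliminaries. The sketch below, with hypothetical names and a simple grid search chosen for illustration, finds the constant that minimizes the squared error and the constant that minimizes the absolute error.

```python
from statistics import mean, median

# Second data set from the preliminaries, including the outlier 108.
data = [1, 2, 3, 6, 108]


def squared_error(c):
    """Sum of squared residuals if every prediction is the constant c.

    Minimizing this sum is equivalent to minimizing the RMSE.
    """
    return sum((y - c) ** 2 for y in data)


def absolute_error(c):
    """Sum of absolute residuals if every prediction is the constant c.

    Minimizing this sum is equivalent to minimizing the Average Absolute Error.
    """
    return sum(abs(y - c) for y in data)


# Candidate constant predictions 0.0, 0.5, 1.0, ..., 110.0.
candidates = [i / 2 for i in range(221)]

best_for_squared = min(candidates, key=squared_error)
best_for_absolute = min(candidates, key=absolute_error)

print("minimizes squared error: ", best_for_squared, "| mean:  ", mean(data))
print("minimizes absolute error:", best_for_absolute, "| median:", median(data))
```

The squared-error winner lands on the mean and is pulled toward the outlier, while the absolute-error winner lands on the median and is not. A model selected by RMSE therefore tends to behave more like the mean, and one selected by Average Absolute Error more like the median.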

Performance Charts

Predicted vs Actual Values Chart: A simple scatter plot of predicted vs actual values shows the performance of the model when applied to the test set. Each point's x-coordinate is its actual value; each point's y-coordinate is its predicted value. The solid blue line y = x represents the placement of points in an ideal (perfect) model where all predictions are equal to their actual values. Dashed blue lines represent the boundaries for x and y of a 95% confidence interval. The closer the points are to the solid blue line, the better the model.

Distribution of Prediction Errors Chart: A frequency histogram of prediction error (difference between predicted and actual values) shows the performance of the model when applied to the test set. A prediction error of 0 represents an ideal (perfect) model where all predictions are equal to the actual values. The more prediction errors near 0 (i.e. the higher the frequency bars near 0), the better the model. Dashed blue lines represent the boundaries of a 95% confidence interval.
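As a rough text-only illustration of the error distribution chart, the sketch below bins prediction errors and prints a frequency bar per bin. The data, the bin width, and the sign convention (predicted minus actual) are all assumptions made for this example; it does not reproduce how RapidMiner Go draws the chart or computes its confidence interval.

```python
import math
from collections import Counter

# Hypothetical test-set values and model predictions.
actual = [10.0, 12.0, 15.0, 18.0, 20.0, 22.0]
predicted = [10.4, 11.5, 15.2, 19.6, 19.9, 20.8]

# Prediction error per row (assumed convention: predicted minus actual).
errors = [p - y for y, p in zip(actual, predicted)]

bin_width = 1.0  # width of each histogram bin, chosen arbitrarily


def bin_start(error):
    """Lower edge of the bin [start, start + bin_width) containing error."""
    return math.floor(error / bin_width) * bin_width


# Count how many errors fall into each bin.
histogram = Counter(bin_start(e) for e in errors)

# Print one bar per occupied bin, from the most negative error upwards.
for start in sorted(histogram):
    bar = "#" * histogram[start]
    print(f"[{start:+.1f}, {start + bin_width:+.1f})  {bar}")
```

The more of the bars' height sits in the bins next to 0, the better the model.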

Example: Predict sales from advertising data

As an example of regression analysis, we examine the data set Advertising.csv supplied by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani in connection with their book An Introduction to Statistical Learning. The purpose of this data set is to show that you can predict the volume of sales as a function of advertising budget in three different channels: TV, radio, and newspapers (yes, it's an old data set!).

After downloading the CSV file from the above link, follow the steps outlined in Build models:

  1. Upload Advertising.csv into RapidMiner Go.

  2. Choose "Sales" as the column to predict.

  3. Make sure to select "TV" as one of the inputs. TV advertising has a high correlation with sales, and that will help us to make better predictions.

  4. Select and run all the models.

Model comparison

In the Model Comparison, Decision Tree is the clear winner when compared to the Generalized Linear Model (GLM):

  • It has smaller errors according to each of the metrics Root mean squared error, Average absolute error, and Average relative error.
  • It has a larger value for Squared correlation (R^2).

The strong agreement between the metrics suggests that there are no significant outliers in this data set.

Decision Tree

By clicking on Decision Tree, you can see the Actual vs. Predicted Values chart. It resembles a straight line, because the predictions are good.

A recalculation: without the TV data

As before, with the Churn Predictive Data, let's do a recalculation where the highly correlated data is excluded. As the screenshot below clearly demonstrates, the results for the Decision Tree are much worse without the TV advertising data.

  • The Average Absolute Error quadruples, from 0.905 to 3.754
  • The Squared Correlation (R^2) plunges from 0.954 to 0.254

Although it's not good either, the performance of the Generalized Linear Model (GLM) is actually better than that of the Decision Tree.

Notice that the Actual vs. Predicted Values chart for the Decision Tree no longer looks like a straight line.

Conclusion? To get good results, make sure to include all the relevant data.