Training/testing ratio in sliding window validation
Hi all,
I've been struggling for months with a prediction problem. After a few optimization runs (which take days and days each), I've probably run into an overfitting problem.
As you can see in the picture below, the 6th column represents the performance of the model. The 3rd, 4th and 5th columns are the parameters of the sliding window validation (training width, step width, testing width). The ratio of training to testing is probably too high, but if I decrease the ratio, the performance decreases as well. So I don't know what the right ratio would be such that the performance is no longer suspect.
So could anyone advise me on a suitable ratio with respect to my dataset:
https://drive.google.com/open?id=12XjPKw2diSLnc9-MtAv_--SVfntA3nR-
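To make the three window parameters concrete, here is a minimal Python sketch of the splitting logic they describe. The row count of 100 and the parameter values are just illustrative numbers, not the real settings:

# Sketch of how training width / step width / testing width slice a series
# in a sliding (walk-forward) validation. Toy numbers only.
def sliding_windows(n_rows, training_width, step_width, testing_width):
    """Yield (train_range, test_range) index pairs."""
    start = 0
    while start + training_width + testing_width <= n_rows:
        train = range(start, start + training_width)
        test = range(start + training_width,
                     start + training_width + testing_width)
        yield train, test
        start += step_width

# With a large training width and a tiny test width, each fold is judged
# on only a handful of points, which can make performance look suspiciously good.
for train, test in sliding_windows(100, training_width=70, step_width=10, testing_width=5):
    print(f"train {train.start}-{train.stop - 1}  ->  test {test.start}-{test.stop - 1}")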
Below is the XML code of the process. I used the score object to combine these values against my test set in a score process.
<list key="additional_macros"/>
<operator activated="true" class="support_vector_machine" compatibility="7.6.001" expanded="true" height="124" name="SVM" width="90" x="112" y="34">
<parameter key="C" value="9000.0"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
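For anyone who wants to reproduce the idea outside RapidMiner, here is a rough Python analogue of the process above: an SVM with C = 9000 trained on each training window and scored on the following test window. This is only a sketch, not the actual process; the data is synthetic, the window sizes are arbitrary, and sklearn's SVR merely stands in for RapidMiner's SVM operator:

# Hypothetical analogue of the sliding window validation around an SVM
# with C=9000. Replace the synthetic series with the real data.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
y = np.sin(np.linspace(0, 20, 300)) + rng.normal(0, 0.1, 300)  # stand-in series
X = np.arange(len(y)).reshape(-1, 1)

train_w, step_w, test_w = 70, 10, 20  # example widths, not the real settings
rmses = []
start = 0
while start + train_w + test_w <= len(y):
    tr = slice(start, start + train_w)
    te = slice(start + train_w, start + train_w + test_w)
    model = SVR(C=9000.0).fit(X[tr], y[tr])
    rmses.append(np.sqrt(mean_squared_error(y[te], model.predict(X[te]))))
    start += step_w

print(f"mean RMSE over {len(rmses)} windows: {np.mean(rmses):.3f}")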
Answers
Hi @maurits_freriks,
For the ratio, I would say a training width of 0.7/0.8 and, respectively, a test width of 0.3/0.2, with an increased absolute value of the test width (a test width of 5 is too low in my opinion).
Alternatively, as said in the PM, you can use the RMSE of the Performance (Regression) operator to measure the performance of your model(s) in a more objective way.
Best regards,
Lionel
@lionelderkrikor
You mean change the Performance (Forecasting Performance) operator into Performance (Regression)?
Hi Maurits,
Exactly. The best model is the one that minimizes RMSE.
Best regards,
Lionel
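For reference, the RMSE reported by a regression performance measure is the standard root mean squared error:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where the $y_i$ are the actual values and the $\hat{y}_i$ the predictions; lower is better.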
Yep, RMSE is definitely another way to look at this. My main concern has been those lower spikes. Can they be removed, or is there a specific reason they must remain in?
@Thomas_Ott
Sorry for the late reply.
Yes, there is a specific reason why those spikes are in the dataset: they reflect the actual flow on those days in the past. The spikes are caused either by maintenance (planned) or by tripping (unpredictable). The final goal is to automate the prediction process, so you have to pay attention to those spikes. I also have a planning_dump where you can find what happens during the spikes.
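One possible way to use that planning dump is to flag the spike days so the model (or the evaluation) can treat them separately. A rough sketch, where "flow.csv", "planning_dump.csv" and the column names are placeholders rather than the actual files:

# Join planning events onto the flow series and flag the affected days.
import pandas as pd

flow = pd.read_csv("flow.csv", parse_dates=["date"])               # date, flow
planning = pd.read_csv("planning_dump.csv", parse_dates=["date"])  # date, event

# Days with a maintenance/trip entry get a flag; all other days are "normal".
flow = flow.merge(planning, on="date", how="left")
flow["event"] = flow["event"].fillna("normal")

# Either keep the flag as an extra attribute for the learner, or drop the
# known-maintenance days before validating so planned spikes don't dominate RMSE.
clean = flow[flow["event"] != "maintenance"]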
@Thomas_Ott Could I send you a PM so you could think about how to implement this in a RapidMiner process?
With kind regards,
Maurits Freriks
@maurits_freriks My suggestion is to ask your question in the community. I'm very crunched for time this week and won't be able to look at anything.