A colleague was saying the other day that deep learning models are like sports cars: they need a minimum distance to accelerate before they can reach their top speed. In the same way, deep learning models don't perform well on small datasets because there is no room (i.e. not enough data) for them to rev their engine. That's why a mountain bike (e.g. a CART decision tree) can navigate a forest trail where a Ferrari (e.g. a convolutional neural network) cannot.
I really like my colleague's analogy, but is there any mathematical theory to support it? Are complex models (e.g. neural networks, SVMs) inherently (through their mathematical architecture) more susceptible to overfitting than logistic regression or a decision tree when exposed to small datasets? I feel there is an unspoken rule of "in general, use complicated models on complicated data", but is there any mathematical justification for it?
I understand that deep learning models sometimes perform poorly because the analyst doesn't know how to use them properly (e.g. hyperparameter tuning), but that is a shortcoming of the analyst, not of the model itself.
I know there is a result called the "no free lunch theorem" which says, roughly, that there is no single best algorithm for all problems. Can this theorem be used to justify the claim that smaller datasets don't require complex models? That is, is there some way to show that more complex models (e.g. if we quantify model complexity through the VC dimension) don't necessarily achieve lower generalization error on smaller datasets?
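To make that concrete, the kind of result I have in mind is the classical VC generalization bound (the exact constants vary between textbooks, so treat this as a sketch): with probability at least $1-\delta$ over a sample of size $n$, every hypothesis $h$ in a class of VC dimension $d$ satisfies

$$
R(h) \;\le\; \hat{R}_n(h) \;+\; \sqrt{\frac{d\left(\ln\frac{2n}{d} + 1\right) + \ln\frac{4}{\delta}}{n}}.
$$

If I read it correctly, for a fixed sample size $n$ the complexity term only grows as $d$ grows, which seems to point in the direction of my colleague's analogy, but I'm not sure this is the right way to formalize the claim (it is only an upper bound, after all).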
So, given a very powerful computer that can simultaneously evaluate millions of hyperparameter combinations: can it be shown statistically that more complex models are not necessarily better for small datasets (e.g. the iris data)?
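To illustrate the kind of experiment I mean, here is a rough sklearn sketch (purely illustrative; the particular models and hyperparameters are arbitrary choices on my part, not a claim about the best setup):

```python
# Illustrative sketch only: compare a low-capacity tree with a
# high-capacity MLP on the iris data using cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Low-capacity model: a shallow CART tree.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)

# Higher-capacity model: a multi-layer perceptron
# (inputs are scaled to help it converge).
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(200, 200), max_iter=5000, random_state=0),
)

for name, model in [("shallow tree", tree), ("MLP", mlp)]:
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Even if the higher-capacity model matches the tree after extensive tuning, what I'm really asking is whether theory predicts that the extra capacity buys nothing (or is actively harmful) at this sample size.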
Thanks