Everything is automatic and effortless with machine learning algorithms, right? Big data is all you need, after all. You have a dataset, you split it when necessary, you pick a machine learning model, you train it, and the miracle of a correct classification or prediction shines its light on you, your name, your business. Artificial intelligence is easy, isn't it? No, it is not: that is only advertising.
“Most Machine Learning talks present beautiful cases of success, but in reality models often fail to deliver the desired performance”, stated Rafael Garcia-Dias in the introduction to his talk at Codemotion Milan 2019. “It is not uncommon to see developers blaming, and even blacklisting, certain models.”
Garcia-Dias is a research associate at King's College London whose main focus is developing machine learning models based on structural MRI to diagnose patients. In many cases, he has found that repeated trial and error is required to find a good data/algorithm combination, if one exists at all.
Data are nothing without a firm grasp of the problem you are facing. “Only when you know that can you think about your model”, Garcia-Dias clarifies. “Be sure you understand your problem”: if you don't have enough data, then generate it, even if this can prove expensive.
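The talk does not prescribe how to "generate" more data; as one hedged illustration (an assumption, not Garcia-Dias' method), a small tabular dataset can be enlarged with noise-perturbed copies of existing samples, while real projects may instead collect fresh data or rely on domain-specific simulation.

```python
# Illustrative sketch only: enlarge a small tabular dataset by adding
# noise-perturbed copies of existing rows (jittering). Whether this is
# appropriate depends entirely on the problem at hand.
import numpy as np

rng = np.random.default_rng(0)
X_small = rng.normal(size=(50, 4))          # stand-in for a small real dataset
y_small = (X_small[:, 0] > 0).astype(int)   # stand-in labels

noise = rng.normal(scale=0.05, size=X_small.shape)
X_augmented = np.vstack([X_small, X_small + noise])
y_augmented = np.concatenate([y_small, y_small])
print(X_augmented.shape)  # twice as many rows as before
```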
A good path from astrophysics to neuroscience
Machine learning can help in branches of knowledge that are fascinating but out of reach for unaided human analysis. Garcia-Dias offers striking examples from his own career. He spent time tracing the chemical history of galaxies: “With machine learning tools you can understand where the interstellar gas in each of them comes from”. Data constraints still limit performance, though: “not all clusters are distinguishable with today's approaches”.
The second example from Rafael Garcia-Dias’ work is an analysis of MRI scans. “We determine the brain age, then we compare it with the real age of the person”, explains the King’s College researcher; “the results can help diagnose some important diseases in time”.
Machine Learning: Linearity is broken
A common mistake researchers make is expecting the process to be linear. First of all, each model has its own limitations, and the coder must be aware of them in order to make sure reality can match the desired results.
One successful example from Rafael Garcia-Dias' experience is based on the k-means algorithm. It's important to understand the underlying assumptions of your model: k-means relies on Euclidean distance, and this is one constraint on where the model can be applied. It rarely works ‘as is’, and often needs substantial work on data and parameters. Moreover, there are viable alternatives, such as GMM and DBSCAN.
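A minimal sketch of that Euclidean-distance constraint on synthetic data (not from the talk): k-means handles compact, roughly spherical clusters well, but its labels tend to degrade once the same clusters are stretched.

```python
# Sketch: k-means assumes roughly spherical, Euclidean-separable clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three well-separated, roughly spherical clusters: k-means' favourite case.
X, y_true = make_blobs(n_samples=600, centers=3, random_state=0)
# Stretch the same data so the clusters become elongated.
X_stretched = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

for name, data in [("spherical blobs", X), ("stretched blobs", X_stretched)]:
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)
    print(name, round(adjusted_rand_score(y_true, labels), 2))
```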
Gaussian Mixture Modelling (GMM) can be seen as an extension of the k-means algorithm: it assumes that all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. The scikit-learn library offers GMM with several alternative covariance strategies.
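A small sketch of what those strategies look like in scikit-learn; here "alternative strategies" is read as, among other things, the covariance_type options of GaussianMixture, and the data are synthetic.

```python
# Sketch: GaussianMixture relaxes the k-means assumption by fitting Gaussian
# components; covariance_type controls which cluster shapes are allowed.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=600, centers=3, random_state=0)

# 'spherical' is closest to the k-means assumption, 'full' is the most flexible.
for cov in ("spherical", "diag", "tied", "full"):
    gmm = GaussianMixture(n_components=3, covariance_type=cov, random_state=0)
    gmm.fit(X)
    print(cov, round(gmm.bic(X), 1))  # lower BIC = better fit/complexity trade-off
```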
DBSCAN (density-based spatial clustering of applications with noise) groups together points that have many nearby neighbours, and marks points lying alone in low-density regions as noise.
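For reference, a brief sketch of DBSCAN in scikit-learn; the eps and min_samples values are illustrative, not taken from the talk.

```python
# Sketch: DBSCAN groups points with enough neighbours within eps;
# points that belong to no dense region are labelled -1 (noise).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: non-spherical shapes that defeat plain k-means.
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(sorted(set(labels)))  # cluster labels, plus -1 if any points are noise
```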
Garcia-Dias tested these three algorithms with different parameters, showing that very small changes can significantly alter the homogeneity score. If you have a feel for your data, you can limit the number of trials you need to run.
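A sketch of the kind of comparison described, using scikit-learn's homogeneity_score on synthetic data; the parameter grid is illustrative, not the one from the talk.

```python
# Sketch: small parameter changes can move the homogeneity score a lot.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import homogeneity_score
from sklearn.mixture import GaussianMixture

X, y_true = make_moons(n_samples=400, noise=0.08, random_state=0)

candidates = {
    "k-means, k=2": KMeans(n_clusters=2, n_init=10, random_state=0),
    "k-means, k=3": KMeans(n_clusters=3, n_init=10, random_state=0),
    "GMM, 2 components": GaussianMixture(n_components=2, random_state=0),
    "DBSCAN, eps=0.15": DBSCAN(eps=0.15, min_samples=5),
    "DBSCAN, eps=0.30": DBSCAN(eps=0.30, min_samples=5),
}

for name, model in candidates.items():
    labels = model.fit_predict(X)
    print(f"{name:20s} homogeneity = {homogeneity_score(y_true, labels):.2f}")
```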
Deciding what model suits you best: real-world apps
Many of the algorithms on the market are available through multiple libraries, and you have to know what lies behind their code in order to make good use of them. This great variety of tools can make choosing difficult. Coding for machine learning can look strange, but it is more or less like any other kind of programming: if you know one environment, you can learn any other.
Existing libraries can look inadequate for a particular goal, so a researcher may be tempted to write their own code. Is this usually a mistake?
“I never write new libraries myself”, answers Rafael Garcia-Dias, “because that code is highly optimized and strongly reviewed. But I often look for other libraries in different languages”. Python's libraries are often surpassed by R's equivalents, to give one example.
“Great programmers develop great libraries and algorithms, and all of that work flows into the open-source software pool sooner or later”. Each of these tools has its own limitations to study and understand, so that users can make the best choice. It's better to spend that time on simple baselines such as dummy classifiers and dummy regressors!
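A short sketch of the baseline idea behind that last remark: scikit-learn's DummyClassifier (and its regression counterpart, DummyRegressor) gives a floor that any real model should beat. The dataset and the comparison model here are assumptions for illustration.

```python
# Sketch: a dummy baseline tells you how much your real model actually adds.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)

print("dummy baseline accuracy:", round(baseline.score(X_te, y_te), 2))
print("real model accuracy:    ", round(model.score(X_te, y_te), 2))
```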
Conclusions
Bad models don't exist, to be crystal clear. There are, however, some simple but limiting mistakes to avoid: it's essential to be aware of the assumptions behind each model, and to really get a feel for your dataset.
The most important piece of advice, “never quit thinking”, describes a continuous process that suits AI algorithms and real-life activities alike.