DevOps is an IT mindset that encourages communication, collaboration, integration and automation among software developers and IT operations in order to improve the speed and quality of delivering software. Its tasks involve standardizing development environments and automating delivery processes to improve delivery predictability, efficiency, security and maintainability. DevOps encourages empowering teams with the autonomy to build, validate, deliver and support their own applications. Machine learning engineers can benefit from the frameworks and philosophies of DevOps.
Thiago de Faria is Head of Solutions Engineering at LINKIT. He’s fascinated by DevOps and by what machine learning engineers can learn from its ideology and practices.
He explained that DevOps is critically underpinned by culture change, “a culture change that involves creating product teams, with product and development working together, and with ITOps, with QA, with infosec. So everybody is working on it as a team.”
So what does this have to do with AI?
Thiago contends: “AI, to me, is making computers capable of doing things that, when done by a human, would be thought to require intelligence. That means intelligence that is out of proportion, that is automated and runs at scale. Machine learning is encompassed inside AI. It’s making computers find patterns without explicitly programming them to do so.
“With machine learning, you have to come to terms with the idea that it will never be 100% correct; it is a statistical thing. If you get a model to work with 95% accuracy, that is brilliant, that is amazing. It’s better than the majority of the models that I’ve seen in my life. So you have to convince the board, or the people responsible for it: are you okay with 5% error?” The answer may vary depending on whether you are testing HIV drugs or doing image recognition.
The machine learning lifecycle always starts with business questions:
- How many requests do you expect to come in?
- How are we going to log this?
- How are you going to monitor whether your machine learning model has been useful? (See the monitoring sketch after this list.)
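That last question is worth making concrete. Below is a minimal sketch, in Python, of one way to keep answering “is the model still useful?” after deployment: keep a rolling window of (prediction, observed outcome) pairs and flag when live accuracy drops. The class name, window size and threshold are all illustrative assumptions, not a prescribed tool.

```python
# Hypothetical sketch: track live accuracy over a rolling window of
# (prediction, observed outcome) pairs and flag when it degrades.
from collections import deque


class UsefulnessMonitor:
    def __init__(self, window=500, alert_below=0.90):
        self.pairs = deque(maxlen=window)   # keeps only the most recent pairs
        self.alert_below = alert_below

    def record(self, prediction, outcome):
        """Call once the real outcome for a prediction becomes known."""
        self.pairs.append(prediction == outcome)

    def live_accuracy(self):
        return sum(self.pairs) / len(self.pairs) if self.pairs else None

    def needs_attention(self):
        acc = self.live_accuracy()
        return acc is not None and acc < self.alert_below


# Example usage: monitor.record(model_prediction, observed_label) per request,
# then alert the team whenever monitor.needs_attention() returns True.
```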
Data scientists have to stop being “the PhD people sitting in marketing or finance, working on the model, making a PPT presentation, showing it to the CEO and saying my job here is done.”
Culture is the biggest problem
According to Thiago, “Culture is the biggest problem in every field, in every area. Technical things and computers are easy. People are hard. And usually, people are hard because they are a reflection of how the culture happens inside the company.”
He gives the example of being an on-call worker: “This is something that I passionately defend, because I think people only write good production code when they feel the pain of being on-call. When you can receive a pager-duty notification, you start to do proper testing and proper integration, because you don’t want that thing to go into production if it’s gonna harm your weekend.”
However, with a pager call, when the dev doesn’t know what to do, if they have a CI/CD pipeline, “they’re just going to roll back to the last working version and open a ticket for you to fix that thing. If this is hard for developers, imagine how hard it is for a data scientist who thinks that his job is making PPT presentations, not writing documentation or production-ready code.”
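To make that “roll back and open a ticket” reflex tangible, here is a minimal sketch assuming the model is served from a Kubernetes Deployment named `model-serving` with a `/health` endpoint; both names, and the plain print in place of a real ticketing system, are hypothetical.

```python
# Hypothetical sketch: if the model service is unhealthy, roll the
# Deployment back to its previous revision and "open a ticket".
import subprocess
import urllib.request

HEALTH_URL = "http://model-serving.internal/health"  # assumed endpoint


def healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


def rollback_and_open_ticket():
    # kubectl rollout undo reverts the Deployment to its previous revision.
    subprocess.run(
        ["kubectl", "rollout", "undo", "deployment/model-serving"],
        check=True,
    )
    # Stand-in for whatever ticketing system the team actually uses.
    print("Rolled back model-serving; ticket opened for the on-call dev.")


if __name__ == "__main__":
    if not healthy():
        rollback_and_open_ticket()
```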
Explainability plagues machine learning engineers
If you are building a machine learning model, you have to be able to pinpoint the reasons for its outcomes. “You cannot say ‘your mortgage was not accepted because my algorithm says so.’ I have to tell them: your mortgage was not accepted because these were the variables that influenced it most.”
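One simple way to surface “the variables that influenced it most” is to use a linear model, where each variable’s contribution is its coefficient times its (scaled) value. The sketch below does this with scikit-learn on made-up mortgage-style features; the feature names and data are purely illustrative, and tools such as SHAP or LIME serve the same purpose for more complex models.

```python
# Hypothetical sketch: per-feature contributions for one rejected applicant,
# using a logistic regression on invented data.
import numpy as np
from sklearn.linear_model import LogisticRegression

features = ["income", "debt_ratio", "years_employed"]   # assumed names
X = np.array([[60_000, 0.2, 8],
              [25_000, 0.7, 1],
              [48_000, 0.4, 5],
              [30_000, 0.6, 2]], dtype=float)
y = np.array([1, 0, 1, 0])   # 1 = mortgage accepted

# Standardize so coefficients are comparable across features.
mean, std = X.mean(axis=0), X.std(axis=0)
model = LogisticRegression().fit((X - mean) / std, y)

applicant = np.array([28_000, 0.65, 1.5])
contrib = model.coef_[0] * (applicant - mean) / std
for name, c in sorted(zip(features, contrib), key=lambda t: t[1]):
    # The most negative contributions pushed hardest toward rejection.
    print(f"{name}: {c:+.2f}")
```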
Data scientists are perfectionists plagued by time
Data scientists and machine learning engineers usually come into the field as master’s or PhD students. They are used to writing models to send to publications and traditionally spent six months working on one model.
“And now you come to them and say, can you build something in two weeks? They go insane. They start to work locally, so as not to show their failures to other people. I’ve seen a bunch of data scientists that have a GitHub or Bitbucket or GitLab, whatever, and only have one commit in three months. Because when they’re doing that, they feel: I am not going to show people this model that is 30% accurate. I’m just gonna show it when I get to 80% or 90%.”
This is a key cultural problem. Data scientists need to know it’s ok if something is not perfect, it’s ok to fail, and it’s better to have a model with 30% accuracy that you can deploy today than to have no model at all, because then it becomes a business decision.
As Thiago shares: “Somewhere we need to be able to know: this algorithm was tried on this data set with these parameters, and this was the result, so that everybody can continuously look around and see how the evaluation is running.”
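A minimal sketch of that kind of shared record is below: every run’s algorithm, data set, parameters and result appended to a JSON-lines file the whole team can read. The function and file name are assumptions for illustration; dedicated tools such as MLflow or DVC do the same job far more robustly.

```python
# Hypothetical sketch: append one experiment record per run so results
# stay visible to the whole team.
import datetime
import json


def log_run(algorithm, dataset, params, metrics, path="runs.jsonl"):
    with open(path, "a") as f:
        f.write(json.dumps({
            "when": datetime.datetime.utcnow().isoformat(),
            "algorithm": algorithm,
            "dataset": dataset,
            "params": params,
            "metrics": metrics,
        }) + "\n")


# Example usage (all values invented):
log_run("random_forest", "loans_2019_q1.csv",
        {"n_estimators": 200, "max_depth": 8},
        {"accuracy": 0.87})
```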
Also important is drift:
Drift is related to the assumption that, once deployed to production, a machine learning model will always get better. “If it gets more data, it will always become a better model, right? It’s a very, very common misconception that this happens. What actually needs to happen is: you always need to recalibrate your model. Otherwise, it will drift away. Because usually, your model is going to generate an interaction with the user, and the user is sometimes going to adapt to it. You have to add more features, create more features, retrain, change the algorithm, otherwise it will become worse. It’s a living organism that you have to continuously teach.”
Fragility is related to a model’s ability to cope with changes in important features.
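One simple way to watch for both problems is to compare the distribution of an important feature in production against the distribution the model was trained on. The sketch below uses a two-sample Kolmogorov–Smirnov test from SciPy on synthetic data; the 0.05 threshold and the “retrain on alert” reaction are illustrative assumptions, not a prescribed policy.

```python
# Hypothetical sketch: flag drift in one feature by comparing training-time
# and production distributions with a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_income = rng.normal(50_000, 10_000, size=5_000)   # seen at training time
live_income = rng.normal(55_000, 12_000, size=1_000)    # seen in production

stat, p_value = ks_2samp(train_income, live_income)
if p_value < 0.05:
    print(f"Feature has drifted (KS={stat:.3f}, p={p_value:.4f}): recalibrate or retrain.")
else:
    print("No significant drift detected for this feature.")
```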
Helpful tools for machine learning engineers
- Apache Spark 2.3 can run on top of Kubernetes.
- If you are on a public cloud, cloud-native machine learning tools like SageMaker on AWS, Databricks on AWS or Azure, Azure ML Studio, or Google AutoML are great.
- Try Kubeflow if you have a Kubernetes cluster; it can host your JupyterHub and TensorFlow Serving in a nice way.
- AWS SageMaker Build is a helpful CI/CD tool for your machine learning models.