I have written before about some of the important paradoxical tendencies in the Machine Learning (ML) and Intelligence (AI) industries. Let’s look at it from two perspectives:
- The Developer: the top engineer that is trying to help his or her organization move forward, by using ML and AI techniques.
- The CTO: the engineering manager that is trying to oversee these efforts, and who wants to ensure that they are pursued in a technologically and economically/budgetary sound way - while complying with all the new rules and regulations that are now forthcoming.
The Developer's Context
For developers working on these types of AI/ML software projects, I think we can state some universal truths:- developers don't like architectural overhauls
- developers are opinionated about the tools that they like and use: they want to be productive, and because of this they usually prefer to stick to the tools that they know.
The CTO's Context
For CTO’s that are working on AI/ML software projects, we can also consider some universal considerations:- The CTO is responsible for making sure that his organization is constantly innovating and adding value to the business. AI and ML seem to offer massive potential for innovation for many organizations: we want to seize the opportunity, before our competitors do so. If the CTO does not act proactively, there will be a CxO on the management team that will start asking questions about the potential of AI shortly - you can bet your house on that.
- Exceptionally, this is done with "big bang" solutions that will radically change an entire architecture from scratch, in a greenfield.
- However, most people don't work in a greenfield, they work in a brownfield. Therefore, big bang approaches are not possible, and most people and organizations turn to innovate in different small incremental steps.
So: in assessing both of these perspectives, we are clearly presented with an important question: how can we figure out a way to incrementally innovate using AI/ML, provide value to our organization in an economically sound manner, while respecting the productivity requirements and current tool choices of our engineers as much as possible?
The importance of Openness and Modularity
As I have tried to explain how both the individual engineer and their CTO management have at least slightly conflicting interests in the implementation of AI and ML projects. How can we make this work then? Here’s where I think we can take a look at the software industry in general, and look at some of the technological and methodological approaches that have made software engineering so much more productive in recent years.To do so, let’s think back to where we came from: as little as two or three decades ago, waterfall-based software engineering methodologies were the standard for how you would tackle large software projects. It took books like the Mythical Man-month and the Agile Manifesto to emerge for people to start changing their approach, and to start developing and then using iterative software engineering methodologies, DevOps practices, Infrastructure-as-Code and cloud computing as critical enablers of a more modern approach. This is what we need to apply in the world of the very specific domain of AI and ML now: we need to learn how to apply these techniques in the world of AI and ML, and that will require a specific architecture to enable this.
This architecture should therefore
- NOT be a closed, all-or-nothing environment,
- NOT limit the engineers to a specific toolset,
- NOT bound you to a specific deployment environment, offered by this that or the other cloud vendor
- NOT dictate the specific framework that you should or should not use for your AI and ML
Let’s discuss this vision in a bit more detail. We’ll call this the Incremental AI Architecture (AIAI), enabled by the AI Lakehouse.
The Incremental AI Architecture (IAIA)
Today, many AI and ML projects are characterized by structures that are very different from what we want in an innovative and incremental AI Architecture. Systems are developed with bespoke tools and processes that are specific to the project in which they are implemented, and do not allow for reuse and automation in a way that an IAIA enabled by an AI Lakehouse would. Therefore, we propose a combination of best practices and tools that will allow you to move in this direction.Let’s walk through the different steps in a few succinct paragraphs.
Start with modularization of pipelines:
At Hopsworks, we have argued many times before that one of the first steps that one should take when embarking on an AI/ML project, is to conceptually break up the AI/ML system into three parts. We refer to these three parts as the “F-T-I” pipelines, where we would distinguish three distinct phases and processes in the development and productionalization of these systems. At a high level, we would consider- The Feature Pipeline, where the different data sources of the input data of the AI/ML system are wrangled into the right format(s), and then eventually persisted in a way that would enable the next “T” phase of the system to proceed.
- The Training Pipeline, where the input features would be used to train and test a an AI/ML model that would be used for making the predictions that we need for our organization.
- The Inference Pipeline, where we deploy the model that was generated by the training pipeline, and start using it to allow systems and applications to use it’s capabilities and serve the predictions of the model to the outside world. In this pipeline, we would also want to monitor for any impactful changes inside or outside the system that would warrant us to revisit the process.
Once we have split the AI/ML system into these three functional and modular parts, we can proceed to take the next step to make the system into a true production-ready AI system. This will require some education of our team, and that would be the important second step in this journey.
Educate the team on the importance of MLOps for production AI systems
In order to maximize the benefits of AI systems in production, we will need to automate as many of the FTI pipeline steps as possible. This is what we call MLOps, short for “Machine Learning Operations” and analogous to DevOps.These automations will offer
- Significant Engineering and Technical benefits that will allow us to write better software systems
- Process benefits that will allow these improved systems to be seized and maintained
- Budgetary and Financial benefits: both our operational expenditures (OpEx) and our capital expenditures (CapEx) can be reduced, enabling more innovation at a lower cost.
Do a gradual implementation of an MLOps platform
Over the years, Hopsworks has gained a lot of experience with the implementation of AI/ML systems on top of an MLOps platform, and as a consequence we have developed and released the 4.0 release of the Hopsworks Enterprise platform. We have called this the “AI Lakehouse” release, as we believe that it allows for a comprehensive but gradual implementation of everything an AI/ML project needs to become successful.Based on this experience, and using the AI Lakehouse infrastructure, we recommend that you consider the following steps in sequence:
- Implement a feature store, using the appropriate taxonomy of feature transformations for offline, batch use cases, as well as for online, real time use cases. As I have written before and elsewhere, a Feature Store is a critical piece of software infrastructure that will serve as a forcing function for MLOps.
- Implement and automate the FTI pipelines on the Feature-store platform. As you can see from the graphic above, all of the pipelines of the recommended F-T-I approach will leverage this central infrastructure, and therefore drive all users and stakeholders of the AI/ML system towards a centralized approach that will be much easier for the Alpha Developer, manageable for the Empowered CTO, and governable for the architects and compliance teams that are to ensure that rules and regulations are appropriately followed.
- Implement a model registry, thereby accurately keeping track of all the impacts and changes that are required of our production AI/ML systems. If necessary, you could combine the registry with an experiment tracking system - but we have found this to be largely unnecessary if the above steps are duly implemented. Once you have the FTI pipelines connected to a feature store, tracking experiments becomes pure overhead.
- Implement centralized model deployment for offline, batch use cases, as well as for online, real time use cases.
- Finalize with centralized feature and model monitoring, iterating back to the FTI pipelines whenever such monitoring demonstrates the need for updating and adaptation.
No comments:
Post a Comment