Feature Stores: Components of a Data Science Factory

Several years ago, Uber found itself facing a dilemma familiar to any organization with a sophisticated machine learning operation: feature engineering that was expensive and inefficient. Feature engineering transforms raw data into a format that predictive models can use, but Uber's data engineers spent countless hours recreating the same curated modeling attributes for popular categories such as customer demographics, past purchase history, and digital channel interactions.

Identifying the problem

With parts of machine learning earning a reputation as a messy process, leaders across organizations recognized recurring bottlenecks:

1. Data engineers have no standard practice for accessing features during model serving.

2. Data scientists commonly work on projects in silos without collaboration.

3. Inconsistencies arise between features used for training and those used for serving.

4. When new data becomes available, the entire pipeline must be rerun instead of recomputing only the affected features.

5. The time it takes to reinvent the wheel for each project leads to inflated costs and hampered speed. 

All of these contributing factors slowed down projects; with teams unable to recreate predictive models or generate consistent outcomes, businesses risked the trust and support of clients and stakeholders invested in the results. 

Feature stores begin to gain traction

In 2017, Uber introduced Michelangelo, its machine learning platform. Among other capabilities, Michelangelo gave internal data scientists and engineers a feature store in which they could ingest, catalog, and share features with other teams and reuse them in future projects.
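
To make the idea of cataloging and sharing features concrete, here is a minimal sketch of an in-memory feature registry. It is not Michelangelo's design or any product's API; the class names, feature names, owners, and record fields are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FeatureDefinition:
    name: str
    owner: str
    description: str
    compute: Callable[[dict], float]  # maps one raw record to a feature value


class FeatureRegistry:
    """Hypothetical in-memory catalog; a real feature store would persist this."""

    def __init__(self) -> None:
        self._features: Dict[str, FeatureDefinition] = {}

    def register(self, feature: FeatureDefinition) -> None:
        # Refuse duplicates so two teams cannot silently redefine a feature.
        if feature.name in self._features:
            raise ValueError(f"feature {feature.name!r} is already registered")
        self._features[feature.name] = feature

    def list_features(self) -> List[str]:
        return sorted(self._features)

    def build_vector(self, names: List[str], raw: dict) -> Dict[str, float]:
        # Every caller computes a feature through the same registered definition.
        return {n: self._features[n].compute(raw) for n in names}


registry = FeatureRegistry()
registry.register(FeatureDefinition(
    name="avg_order_value",
    owner="growth-team",
    description="Total spend divided by number of orders",
    compute=lambda r: r["total_spend"] / r["order_count"] if r["order_count"] else 0.0,
))
registry.register(FeatureDefinition(
    name="is_mobile_user",
    owner="web-team",
    description="1.0 if the last session was on mobile",
    compute=lambda r: 1.0 if r["last_channel"] == "mobile" else 0.0,
))

customer = {"total_spend": 240.0, "order_count": 4, "last_channel": "mobile"}
vector = registry.build_vector(["avg_order_value", "is_mobile_user"], customer)
print(vector)  # {'avg_order_value': 60.0, 'is_mobile_user': 1.0}
```

Because each feature is defined once and looked up by name, a second team can reuse `avg_order_value` without rewriting the logic, which is the duplication problem described above.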

The concept was timely, and other companies followed suit, either building feature stores themselves or offering services to implement and maintain the technology for others. In early 2019, Gojek and Google Cloud announced Feast, an open-source feature store for machine learning teams.

With feature stores in operation, data teams faced a new challenge

While feature stores were still in their infancy, professionals encountered a new dilemma. Fluid operations require both online and offline analytic environments, and companies struggled to close the gap between the two. To operate a high-functioning data science factory, it’s imperative to bridge this gap and ensure efficient communication and operation across both environments.

With inconsistencies between the two platforms, questions arise regarding the quality, accuracy, and reliability of the outcomes. 

Understanding the powerful offline feature store

Offline feature stores run in large-scale, massively parallel environments that hold years of history and are tuned for analyzing many records at once. They may include discovery environments for revealing insights and patterns in large volumes of historical data. Queries can take anywhere from 10 seconds to several hours.

Common technologies include Hadoop, data lakes, cloud object storage (S3, Azure Blob, Google Cloud Storage), and data warehouses such as Snowflake, Redshift, Google BigQuery, Netezza, Vertica, and Teradata.

Leveraging the nimble online feature store

Online feature stores work differently, offering high availability in low-latency environments designed to run mission-critical applications, websites, and mobile apps. They serve feature values in milliseconds to the deployed models and applications that depend on them for near-real-time decisions.

Common technologies include RESTful services, relational databases (Oracle, MySQL, SQL Server), and NoSQL databases (MongoDB, Couchbase, Cassandra, etc.).
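
The difference between the two access patterns can be sketched in a few lines. This is an illustration only: plain Python lists and dicts stand in for a warehouse and a key-value database, and the schema (`customer_id`, `day`, `purchases`) is a hypothetical example.

```python
from collections import defaultdict
from datetime import date

# Offline store: full history, one row per customer per day (hypothetical schema).
offline_rows = [
    {"customer_id": "c1", "day": date(2024, 1, 1), "purchases": 2},
    {"customer_id": "c1", "day": date(2024, 1, 2), "purchases": 1},
    {"customer_id": "c2", "day": date(2024, 1, 1), "purchases": 5},
    {"customer_id": "c2", "day": date(2024, 1, 3), "purchases": 0},
]


def offline_total_purchases(rows):
    """Analytical scan: touches every historical row to aggregate per customer."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["customer_id"]] += row["purchases"]
    return dict(totals)


def build_online_view(rows):
    """Keep only the most recent row per customer so lookups are a single key access."""
    latest = {}
    for row in rows:
        current = latest.get(row["customer_id"])
        if current is None or row["day"] > current["day"]:
            latest[row["customer_id"]] = row
    return latest


online_view = build_online_view(offline_rows)

print(offline_total_purchases(offline_rows))  # {'c1': 3, 'c2': 5}
print(online_view["c2"]["purchases"])         # 0 (latest row for c2 is 2024-01-03)
```

The offline function scales with the size of the history, while the online lookup cost does not depend on how much history exists, which is why the two environments tend to use different technologies.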

Tips for bridging the gap between offline and online feature stores

1. Create data parity between offline and online feature stores so that the feature values a model is trained on offline match the values it receives online for real-time decision-making.

2. Use modern data catalogs to manage and govern feature sets so that fields, data types, automated quality evaluations, data owners, and subject matter experts (SMEs) are all well documented.

PRO TIP: Alation and Collibra are best-of-breed data catalogs.

3. Use the batch environment to create long-term aggregations and derivations that cannot be made in real time. Once created, publish them to the online feature store for real-time model serving.

4. Push real-time variables and contextual data from the online store to the offline store.

5. Create feedback loops from the online environment and log the model inputs, outputs, and business outcomes or responses to position yourself for automated model performance monitoring and, eventually, automated model retraining.
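
Several of these tips can be sketched together. The example below is a rough illustration under stated assumptions, not a reference implementation: a dict stands in for the online key-value store, a list stands in for the feedback log, and every function name, field name, and value is hypothetical.

```python
import json
from datetime import datetime, timezone

# Hypothetical offline history: (customer_id, order_amount) rows.
order_history = [("c1", 20.0), ("c1", 40.0), ("c2", 15.0)]


def compute_batch_features(history):
    """Tip 3: long-term aggregations computed in the batch environment."""
    features = {}
    for customer_id, amount in history:
        f = features.setdefault(customer_id, {"order_count": 0, "total_spend": 0.0})
        f["order_count"] += 1
        f["total_spend"] += amount
    return features


def publish(offline_features, online_store):
    """Tip 3: copy batch results into the online store (a dict stands in for a key-value DB)."""
    for key, values in offline_features.items():
        online_store[key] = dict(values)


def parity_mismatches(offline_features, online_store, tolerance=1e-9):
    """Tip 1: list every (key, feature) whose online value is missing or differs from offline."""
    mismatches = []
    for key, values in offline_features.items():
        online_values = online_store.get(key, {})
        for name, expected in values.items():
            actual = online_values.get(name)
            if actual is None or abs(actual - expected) > tolerance:
                mismatches.append((key, name))
    return mismatches


def log_prediction(log, customer_id, inputs, output, outcome=None):
    """Tip 5: record model inputs, output, and the eventual business outcome."""
    log.append(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "customer_id": customer_id,
        "inputs": inputs,
        "output": output,
        "outcome": outcome,
    }))


offline_features = compute_batch_features(order_history)
online_store = {}
publish(offline_features, online_store)
assert parity_mismatches(offline_features, online_store) == []

online_store["c2"]["total_spend"] = 99.0  # simulate drift in the online copy
print(parity_mismatches(offline_features, online_store))  # [('c2', 'total_spend')]

prediction_log = []
log_prediction(prediction_log, "c1", online_store["c1"], output=0.82, outcome="purchased")
```

Shipping the logged records back to the offline environment covers tip 4 as well, since the real-time inputs then become part of the history available for monitoring and retraining.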

With both stores in sync, teams can reach new levels of efficiency

Data teams are transforming from cost centers to profit centers. As leaders increasingly infuse strategy with data-backed insights, data teams take on more responsibility. Researchers have dubbed machine learning the “high-interest credit card of technical debt,” due to the high complexity involved in the creation pipeline. As new programs and techniques emerge to decrease the number of entanglements and dead-ends, efficiency is becoming the currency of the day. 

With the necessary resources, any team can operate a high-functioning data science factory. The introduction of feature stores moved the industry one step in the right direction while also exposing more problematic inefficiencies. Dedicating time and energy to synching online and offline stores increases reliability, accuracy, and efficiency and helps ensure continued success with clients and fellow team members. 

For more information on how Quickpath works, visit us here.