Abstract

Whatify is a cloud-based ML platform that uses AI to answer a large set of business questions. The system uses predictive and prescriptive analytics to provide the end user with future insights and recommended actions. Its intuitive interface enables end users to obtain the desired results easily, with no data science skills required. Whatify is built for business professionals who need data science to reach and exceed their KPIs and targets.

The large variety of Whatify analytics questions and the growth in data sources, coupled with the proven value of modeling this data, have highlighted a need for methods of rapid data preparation and model generation.

Manual data preparation and AI model generation require considerable effort. There are many data preprocessing steps and many model types, each with many internal parameters (hyperparameters) that a data scientist needs to try. The data scientist must prepare the data, select the algorithms to be used, implement the whole pipeline and test the outcome, then repeat this process again and again until the right model is found. In most cases, the process is done serially, without concurrent computation.

Behind the scenes, Whatify uses a concurrent cloud-based system that generates models automatically and quickly and supports the entire process from input data analysis to model generation and deployment to production. The system uses dynamic allocation of computation resources and hyperparameter optimization processes that greatly accelerate the search for the optimal ensemble of models.

The system is also capable of generating recommended actions based on its predictive analytics, allowing the user to take the right action on top of getting predictions.

I. Introduction

Machine learning has become a key element in modern computation and data analysis. It is used in many areas for various tasks, ranging from fraud detection [Ram P., Gray A.] to recommendation systems [Koren et al.], online retail pricing and stock management [Ferreira et al.] and analytical customer relationship management (CRM) [Gupta et al.].

The process of applying machine learning to a specific task involves collecting, arranging and preparing training data, then applying preprocessing steps such as imputation, feature selection, text analysis and feature embedding. Various algorithm types, defined by their hyperparameters, are then applied to the data during the search for optimal models and an ensemble. Later, after the model has been deployed to production, further steps are taken to ensure the robustness and reliability of the prediction process. In some applications the prediction is achieved by calling a remote prediction server, while in others it is done by integrating the model into another system. In some cases, there are severe time constraints on the prediction, so the predictor must be designed for a prompt response.

A machine learning platform that automates this long process could be very useful. It would have to generate high-accuracy models and shorten the time to solution considerably. It would also need to provide users with information about the search process and the algorithms and transformations applied to the data during the preprocessing, model building and ensemble phases. It would also need to provide model interpretability tools to allow model governance. In addition, such a platform would need to provide the means to deploy models into production.

II. Hyperparameter optimization

There are many machine learning algorithm types, and it is usually not obvious, even to an experienced data scientist, which ones to use. There are several reasons for this. First, the machine learning algorithms underlying each model are very different from one another. The algorithms vary in their generalization capabilities and in their number of degrees of freedom. For every algorithm, there are many variants based on the hyperparameters defining the model, such as the cost of a support vector machine (SVM) or the number of neighbors k in k-nearest neighbors (KNN).

The large space of possible algorithms and hyperparameters becomes even larger when adding in data preprocessing methods such as feature embedding, imputation methods and scaling.
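To make the combined space concrete, here is a minimal sketch in Python of a joint search over preprocessing choices and model hyperparameters; the candidate models, preprocessing steps and value ranges are illustrative examples, not the platform's actual search space:

```python
# A minimal, illustrative sketch of a joint algorithm/hyperparameter search
# space; the candidate models and ranges are examples, not Whatify's space.
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# Preprocessing choices (imputation strategy, scaling) are searched
# alongside the model hyperparameters.
pipe = Pipeline([
    ("impute", SimpleImputer()),
    ("scale", StandardScaler()),
    ("model", SVC()),
])

search_space = [
    {"impute__strategy": ["mean", "median"],
     "model": [SVC()],
     "model__C": [0.1, 1.0, 10.0, 100.0]},        # SVM cost
    {"impute__strategy": ["mean", "median"],
     "model": [KNeighborsClassifier()],
     "model__n_neighbors": [3, 5, 11, 21]},        # k in KNN
]

search = RandomizedSearchCV(pipe, search_space, n_iter=10, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Even this toy example already has dozens of distinct configurations; the full space, with feature engineering and embedding included, is far larger.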

III. Platform architecture

The automatic modeling platform underlying Whatify, described here, is a scalable cloud-based platform. The system can also be accessed through an API, bypassing the user interface; a sketch of such access appears after the component list below. The system's main components are:

  1. The front-end user interface allows the user to perform the entire process of model generation and deployment to production. It also contains input data analysis tools and data scientist tools for feature selection, feature analysis, data leakage detection and more. The user interface gives special attention to model interpretability: it is essential to make preprocessing and model building comprehensible to the user, allowing the user to analyze specific predictions, find key parameters that locally affect a specific prediction and more.
  2. The data ingestion module is responsible for the data loaded into the system, feature type inference, data analysis, data feature extraction, and, if needed, subsampling. This process is heavily parallelized using cluster processing based on Apache Spark.
  3. The modeler performs the search in the hyperparameter space for the optimal models and generates a golden ensemble of models.
  4. The prediction module uses the golden ensemble for production. Prediction can be done by sending data to the predictor or by integrating an exported model into an external system; the predictor supports both options. Both the prediction module and the modeler are scalable, employing a number of processing units that varies according to the current and predicted load.
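For illustration, here is a hypothetical sketch of programmatic access of the kind mentioned above; the base URL, endpoint paths, payload fields and authentication header are all assumptions, not Whatify's documented API:

```python
# Hypothetical sketch of API access bypassing the UI; the endpoint paths,
# payload fields and token header are illustrative assumptions, not
# Whatify's documented API.
import requests

BASE = "https://api.example.com/v1"          # placeholder base URL
HEADERS = {"Authorization": "Bearer <API_TOKEN>"}

# 1. Upload a dataset
with open("customers.csv", "rb") as f:
    ds = requests.post(f"{BASE}/datasets", headers=HEADERS,
                       files={"file": f}).json()

# 2. Start a modeling task on it
task = requests.post(f"{BASE}/tasks", headers=HEADERS, json={
    "dataset_id": ds["id"],
    "task_type": "classification",
    "target_column": "churned",
}).json()

# 3. Request predictions from the deployed ensemble
preds = requests.post(f"{BASE}/predict/{task['model_id']}", headers=HEADERS,
                      json={"rows": [{"age": 42, "plan": "basic"}]}).json()
print(preds)
```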

Figure 1. Whatify expert dashboard

Platform design

The platform described here was designed to address challenges involved in training and deploying machine learning models in production. There are challenges in the ingestion of large data input that require multi-machine orchestration. There are challenges in handling multiple machine learning tasks, in optimizing the model and preprocess hyperparameters, in generating an ensemble that will outperform the best models and more. Additional challenges are posed when storing the models and deploying them in production in a way that will ensure high availability and prediction speed. To cope with these challenges, the following capabilities are built into the platform.

  1. Handling multiple machine learning tasks. In many cases there are multiple ways to solve a prediction challenge. For example, predicting risk can be done using regression or classification into several risk labels. Recommendation can be done using classification or recommendation engines. To allow users to test several approaches, the platform supports several tasks: regression, time-series regression, classification, anomaly detection and recommendation. After uploading a dataset, the user defines the type of task to perform. Users can also select the types of algorithms they want included in the optimization process.
  2. Efficient searching over preprocessing and model parameters. There are many preprocessing algorithms that may be applied to data, ranging from simple operations like scaling and imputation to feature engineering, feature stacking and embedding. There is also a wide range of algorithms for learning tasks and each algorithm has its own hyperparameters. The platform runs efficient AutoML optimization processes that balance between exploration, exploitation and use of previous experiences collected by the platform.
  3. Loading large data sources is supported with cluster computing. Operations such as per-feature type inference, input data statistics calculation, data shuffling and data stratification are done in parallel. When loading data into the system, the system computes meta features of the data. These meta features are later used in the meta learning module, which, based on experience collected in the system, predicts which algorithms and preprocessing steps will perform well on the dataset. The meta feature computation is also executed in parallel on a cluster; a rough sketch follows this list.
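As a rough illustration of the cluster-based ingestion step, the PySpark sketch below infers feature types while loading, computes per-feature statistics in one distributed pass and derives a few simple dataset-level meta features; the specific statistics and meta features chosen here are assumptions for illustration:

```python
# A rough PySpark sketch of parallel ingestion: type inference, per-feature
# statistics and a few dataset-level meta features. The particular meta
# features computed here are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ingest").getOrCreate()

# inferSchema performs feature type inference while loading in parallel
df = spark.read.csv("s3://bucket/input.csv", header=True, inferSchema=True)

numeric_cols = [f.name for f in df.schema.fields
                if f.dataType.typeName() in ("integer", "long", "double", "float")]

# Per-feature statistics, computed in one distributed pass
stats = df.agg(*[F.mean(c).alias(f"{c}_mean") for c in numeric_cols],
               *[F.stddev(c).alias(f"{c}_std") for c in numeric_cols],
               *[F.count(F.when(F.col(c).isNull(), 1)).alias(f"{c}_missing")
                 for c in numeric_cols]).first().asDict()

# Simple dataset-level meta features for the meta-learning module
meta = {
    "n_rows": df.count(),
    "n_features": len(df.columns),
    "numeric_ratio": len(numeric_cols) / len(df.columns),
}
print(meta, stats)
```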

Figure 2 shows the prediction error of the best model found using a commonly used search method compared to the Whatify platform. The model predicts which customers are likely to buy services from a bank and is based on a public dataset provided by Santander Bank. The Whatify algorithm outperforms the other method by using previous experience, and it also improves at a faster rate over time.


Figure 2. Loss in the search process, comparing Whatify to a random search baseline

The model generated by the system for the Santander dataset, shown in Figure 3, is an ensemble of four models and contains many data preprocessing steps. These steps include data cleaning and removal of outliers, imputation of missing values, automatic sample generation, new feature generation, feature selection and more.


Figure 3. Graphic description of a model

IV. Whatify expert 

Whatify expert is used to generate the optimal model ensemble internally and automatically. After uploading the data and defining the input file structure and information concerning the machine learning task, users can run modeling tasks on the input data. In this phase, special care is given to “data leakage” detection and prevention.

Once the input file structure is defined, the system starts searching for the optimal ensemble. It is possible to alter a limited set of the search's default parameters, including the types of algorithms and preprocessing steps, optimization parameters and target metrics.

Data input

The input data, in many cases, is the result of a pipeline of systems and processes. Minor issues in the input data can cause a significant reduction in model performance. Monitoring of input data, as well as automatic repair and transformations, must be applied in order to verify data quality. The platform performs checks on input data, looking for issues such as wrong data formats, typos in text fields and missing values. It then displays warnings about possible faults and enables the user to identify these faults and use the system to fix them.
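A minimal sketch of this kind of input checking, using pandas; the specific heuristics (missing-value report, mixed-type detection, numeric-parsing check) are illustrative, not the platform's actual rules:

```python
# Illustrative input-data checks: missing values, mixed types in a column,
# and text columns that are mostly-but-not-fully numeric. These heuristics
# are examples, not Whatify's actual rules.
import pandas as pd

def check_input(df: pd.DataFrame) -> list[str]:
    """Scan a dataframe for common input-data faults and return warnings."""
    warnings = []
    for col in df.columns:
        # Missing values
        n_missing = int(df[col].isna().sum())
        if n_missing:
            warnings.append(f"{col}: {n_missing} missing values")
        if df[col].dtype == object:
            vals = df[col].dropna()
            # Mixed Python types in one column often indicate format problems
            if vals.map(type).nunique() > 1:
                warnings.append(f"{col}: mixed value types")
            # A mostly-numeric text column suggests typos or wrong formats
            n_bad = int(pd.to_numeric(vals, errors="coerce").isna().sum())
            if 0 < n_bad < len(vals):
                warnings.append(f"{col}: {n_bad} values fail numeric parsing")
    return warnings

df = pd.read_csv("input.csv")  # placeholder path
for w in check_input(df):
    print("WARNING:", w)
```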

Another key element of the data input phase is the computation of data features for the meta learning system. These features capture information on the nature of the modeling problem at hand and later direct the system to possible models and preprocessing procedures that would be optimal for it. Such an approach was demonstrated by [Smith-Miles et al.]. These statistical features include data derived from the distribution of the data, including moments and quantiles; cross-feature statistics; interclass and intraclass features for classification; and additional data used for tuning anomaly detection sensitivity parameters.
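The sketch below computes statistical meta features of the kind listed above (moments, quantiles, cross-feature statistics and a simple interclass measure); the exact feature set here is an illustrative assumption:

```python
# Illustrative statistical meta features: distribution moments, quantile
# spread, cross-feature correlation and interclass spread. The exact feature
# set is an assumption for illustration.
import numpy as np
from scipy import stats

def meta_features(X: np.ndarray, y: np.ndarray) -> dict:
    """Dataset-level statistical meta features (illustrative selection)."""
    iqr = np.percentile(X, 75, axis=0) - np.percentile(X, 25, axis=0)
    corr = np.corrcoef(X, rowvar=False)
    upper = corr[np.triu_indices(X.shape[1], k=1)]
    feats = {
        # Moments of the per-feature distributions
        "mean_skew": float(np.mean(stats.skew(X, axis=0))),
        "mean_kurtosis": float(np.mean(stats.kurtosis(X, axis=0))),
        # Quantile-based spread
        "median_iqr": float(np.median(iqr)),
        # Cross-feature statistics: mean absolute pairwise correlation
        "mean_abs_corr": float(np.mean(np.abs(upper))),
    }
    # Interclass feature: spread of the per-class mean vectors
    class_means = np.array([X[y == c].mean(axis=0) for c in np.unique(y)])
    feats["interclass_spread"] = float(class_means.std(axis=0).mean())
    return feats
```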

The above features and system checks on input data must be performed efficiently at scale. On large datasets, some of the computation components are performed using distributed streaming algorithms to achieve the required performance [Flajolet et al., Krishnan et al.].
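A self-contained example of the underlying idea: a mergeable streaming statistic (Welford-style mean and variance) lets each machine scan its shard once, after which a coordinator combines the partial results. This is a generic technique, not Whatify's specific implementation:

```python
# A generic sketch of a mergeable streaming statistic (Welford-style mean and
# variance): each worker scans its shard in one pass and partial results are
# merged. Not Whatify's specific implementation.
from dataclasses import dataclass

@dataclass
class RunningStats:
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        # Parallel combination of two partial results (Chan et al.'s formula)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta**2 * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self) -> float:
        return self.m2 / self.n if self.n else 0.0

# Each worker scans its shard once...
a, b = RunningStats(), RunningStats()
for x in [1.0, 2.0, 3.0]:
    a.update(x)
for x in [4.0, 5.0]:
    b.update(x)
# ...and the coordinator merges the partial results.
combined = a.merge(b)
print(combined.mean, combined.variance)  # 3.0 2.0
```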

Model optimization

The space of hyperparameters for preprocessing and modeling algorithms is high dimensional. Several open source approaches have been proposed for optimizing such a non-differentiable space with considerable cost for each point estimation. One approach uses Gaussian processes to model the function [Snoek et al.]; another has been implemented in MOE (Metric Optimization Engine) [Clark et al.]. One approach [Hutter et al.] uses the Random Forest algorithm to model the function, while another [Bergstra et al.] uses tree-structured Parzen estimators. We have compared the performance of some of the above approaches to the performance of optimization that uses meta learning and input from human data scientists, and found the optimization with meta learning to perform considerably better. Allowing the system to use experience gained from running other datasets, and allowing the user to indicate which algorithms to rule out of the search, improves platform performance considerably.
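For reference, here is a minimal example of one of the cited approaches, tree-structured Parzen estimators, using the open-source hyperopt library; the search space and objective are toy examples, not the platform's:

```python
# A minimal example of tree-structured Parzen estimator (TPE) optimization
# with the open-source hyperopt library; the search space and objective are
# toy examples, not the platform's.
from hyperopt import fmin, tpe, hp, Trials
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

space = {
    "n_estimators": hp.choice("n_estimators", [50, 100, 200]),
    "max_depth": hp.choice("max_depth", [3, 5, 10, None]),
}

def objective(params):
    clf = RandomForestClassifier(random_state=0, **params)
    # hyperopt minimizes, so return negative accuracy as the loss
    return -cross_val_score(clf, X, y, cv=3).mean()

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=25, trials=trials)
print(best)
```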

The space of machine learning algorithms is constantly changing. New algorithms are added, and existing algorithms change the way they work and their hyperparameters. For these reasons we have based our platform on open source machine learning modules, including TensorFlow, SKlearn and others, that continuously improve.

Figure 4: Hyperparameter search algorithm selection. This screen shows some of the advanced settings for hyperparameter search.

Production environment

Once an optimal ensemble is generated, it is ready for use in production in the cloud or on premises; the platform supports both. The user can apply the model to data available in the system and download the predictions. Alternatively, the user can export the ensemble generated by Firefly Lab and integrate it into an on-premises environment.
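The mechanics of such an export can be as simple as serializing the fitted ensemble and loading it inside the external system. A generic illustration with scikit-learn and joblib follows; Whatify's actual export format is not described here and may differ:

```python
# A generic illustration of exporting a fitted ensemble for on-premises use
# with joblib; Whatify's actual export format may differ.
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
ensemble = VotingClassifier([
    ("rf", RandomForestClassifier(random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
], voting="soft").fit(X, y)

joblib.dump(ensemble, "ensemble.joblib")      # export step

# ...later, inside the external, on-premises system:
model = joblib.load("ensemble.joblib")
print(model.predict_proba(X[:1]))
```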

V. Final ensemble and validation methods

The outcome of Firefly Lab is a weighted ensemble of machine learning models together with the preprocessing components that feed it.

Preprocessing components

Many types of operations are applied to the data before it is used as training data. Some of the operations “fix” the input data by imputing missing values, applying scaling or normalizing the data. Others reduce the number of input features, for example by entropy- or correlation-based feature selection or feature embedding. The platform also applies preprocessing algorithms for feature engineering and generates new features, for example by feature stacking based on k-means or other clustering algorithms. The search for preprocessing hyperparameters is part of the ensemble generation process.
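A compact sketch of such a preprocessing chain: imputation, scaling, mutual-information feature selection and k-means feature stacking (KMeans.transform yields distances to cluster centers, which serve as engineered features). The specific steps and settings are illustrative:

```python
# An illustrative preprocessing chain: imputation, scaling, entropy-style
# (mutual information) feature selection, and k-means feature stacking
# (cluster distances appended as engineered features). Examples only.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("features", FeatureUnion([
        # entropy-based feature selection (mutual information)
        ("select", SelectKBest(mutual_info_classif, k=10)),
        # feature stacking: distances to k-means cluster centers
        ("kmeans", KMeans(n_clusters=5, n_init=10, random_state=0)),
    ])),
])

model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))]).fit(X, y)
print(model.score(X, y))
```

In a search of the kind described above, choices such as the imputation strategy, k in SelectKBest and the number of clusters would themselves be preprocessing hyperparameters.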

ML Task

After preprocessing, the data is fed into the machine learning ensemble builder. There are five types of tasks supported by the platform:

  1. Regression – Predicting a value as a function of an input vector. The number of possible use cases is very large and includes price prediction, expected sales prediction and inventory management.
  2. Classification – Classifying a sample, using either binary or multiclass classification. The system also outputs an estimate of the probability that the classification is correct. Use cases are numerous and include risk assessment, customer churn, customer support challenges and more.
  3. Anomaly detection – Finding outliers in a dataset. The search for the right model is done using supervised and semi-supervised algorithms. The use cases include cyber security threat detection, fraud detection and medical results analysis.
  4. Time series – Predicting a value as a function of an input vector and a series of previous vectors and values. A common use case is price and sales predictions.
  5. Recommendation – Predicting customer preferences based on recent activity and recent preferences. The system can also utilize additional information about the customer and the items involved. The use cases are plentiful. The most obvious ones are in e-commerce and online services when additional items can be offered based on information collected thus far.

For each task there is a set of ML model types and preprocessing components defining the space of hyperparameters specific to the task. Some algorithms are shared between tasks with only a change in their hyperparameters.

Interim ensembles generated along the search process are validated using a large set of performance measures that are specific to each ML task. These validation measures include common measures like AUC, RMSE, MAE and F-measures, as well as some new measures developed in response to requests. The user defines the validation measures as well as the validation set extraction method, such as cross-validation, holdout and time-based partitions. Eventually, some of the best models are combined into the final ensemble.
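As an illustration of such validation, the snippet below evaluates a classifier with several of the measures named above, under both cross-validation and a holdout split; the metric and model choices are examples:

```python
# Illustrative validation: common classification measures (AUC, F1) under
# cross-validation and a holdout split. Metric and model choices are examples.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.metrics import roc_auc_score, f1_score

X, y = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(random_state=0)

# Cross-validation with several measures at once
cv = cross_validate(clf, X, y, cv=5, scoring=["roc_auc", "f1"])
print({k: v.mean() for k, v in cv.items() if k.startswith("test_")})

# Holdout validation
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]),
      "F1:", f1_score(y_te, clf.predict(X_te)))
```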

Figure 5: Performance validation measures

Data scientist tools and model interpretability

Throughout the model development process and during the model's use, it is essential to provide insights on the features, the search process and the golden ensemble generated. The platform incorporates various tools that provide this information.

At the beginning of the model generation process, the potential contribution of features is analyzed and presented along with statistical information about them. The feature effects and contributions to model accuracy are analyzed at the end of the process, providing the user with information on key features that have a global effect on the model. There are also additional tools and informative screens detailing the structure and parameters of the golden ensemble. These tools show the platform user which types of models and preprocessing steps were found to be the most effective for the task.
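One standard way to surface globally important features, in the spirit described above, is permutation importance; here is a minimal sketch with scikit-learn, which is not necessarily the platform's method:

```python
# A minimal sketch of global feature-effect analysis via permutation
# importance; a standard technique, not necessarily Whatify's method.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
# Features whose shuffling hurts accuracy the most are globally important
ranked = sorted(zip(data.feature_names, result.importances_mean),
                key=lambda t: -t[1])[:5]
for name, score in ranked:
    print(f"{name}: {score:.4f}")
```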

Prescriptive analytics tools

The basic level of analytics is descriptive analytics, which uses data aggregation and data mining to provide insight into the past and answer: “What has happened?”. Predictive analytics uses models and forecasting techniques to understand the future and answer: “What could happen?”. Prescriptive analytics uses optimization and simulation algorithms to advise on possible outcomes and answer: “What should we do?”.

The AI model built by the Whatify system provides predictions of the future and classification results. To turn this information into recommended actions, the Whatify system has a prescriptive analytics module that examines possible actions and their effect on the future. For example, when predicting which customers will respond to a marketing campaign, the system can analyze the parameters of that campaign and determine which are optimal for each potential customer. In this example, the system can recommend which channel to use to approach a customer, what type of product to put on sale for them, and so on.

The system also allows the user to manually examine the effect of a change in a parameter on the final prediction, using the Individual Conditional Expectation (ICE) algorithm [Goldstein et al. 2015].
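The ICE computation itself is simple: fix one sample, sweep one feature over a grid and record the model's prediction at each grid point. A minimal generic sketch following [Goldstein et al. 2015], not the platform's implementation:

```python
# A minimal, generic Individual Conditional Expectation (ICE) sketch following
# [Goldstein et al. 2015]: sweep one feature for one sample and record the
# model's prediction at each grid point.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(random_state=0).fit(X, y)

def ice_curve(model, x_row, feature_idx, grid):
    """Predicted probability as feature `feature_idx` varies, all else fixed."""
    rows = np.tile(x_row, (len(grid), 1))
    rows[:, feature_idx] = grid
    return model.predict_proba(rows)[:, 1]

feature = 0
grid = np.linspace(X[:, feature].min(), X[:, feature].max(), 20)
curve = ice_curve(clf, X[0], feature, grid)
for v, p in zip(grid, curve):
    print(f"{v:.2f} -> {p:.3f}")
```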

One size doesn’t fit all

A common practice among data scientists and machine learning experts is to divide the training set into smaller sets so that each smaller dataset contains data that is less diverse and has more in common compared to the original dataset. This allows the models generated on each smaller dataset to handle a subset of the general prediction problem and, in many cases, to do so with greater accuracy. A good example is [Oh, Makar], where prediction models built on subsets of the original data proved much more accurate than a single model built on the entire dataset.
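A minimal sketch of this segmentation idea: split the training data by a segment key, fit one model per segment and route predictions accordingly. The segmentation rule and models here are illustrative only:

```python
# A generic sketch of segment-specific modeling: one model per subset of the
# training data, with predictions routed by segment; illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
# Illustrative segmentation rule: split on the median of the first feature
threshold = np.median(X[:, 0])
segment = (X[:, 0] > threshold).astype(int)

models = {}
for s in np.unique(segment):
    mask = segment == s
    models[s] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])

def predict(rows):
    """Route each row to the model trained on its segment."""
    seg = (rows[:, 0] > threshold).astype(int)
    out = np.empty(len(rows), dtype=int)
    for s, m in models.items():
        idx = seg == s
        if idx.any():
            out[idx] = m.predict(rows[idx])
    return out

print(predict(X[:5]))
```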

The use of an automatic platform for the rapid generation, update and deployment of multiple models lessens the amount of work involved in building and using a large number of models and, in many cases, makes this approach practical.

Conclusions

In this white paper, we have described a platform used to automatically generate predictive models. The platform supports the entire process from input data analysis to model generation and deployment to production, using dynamic allocation of computation resources. These end-to-end processes incorporate hyperparameter optimization at every step to greatly accelerate the creation of a weighted ensemble of complementary machine learning models with broad coverage for high accuracy.

Kaggle results and benchmarking have shown that the models generated this way significantly outperform manual methods and, through meta learning, continually improve.

About Whatify:

Whatify is a crystal ball that allows anyone to see the future of their business and act on it. Powered by a next-gen AI engine, it lets you generate predictions and recommended actions in moments, with zero data skills required. Just ask your business question and see predictions and suggestions come your way. It's really that simple.

References

  1. Bergstra, James, Yamins, Daniel, and Cox, David. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning, pp. 115–123, 2013
  2. Clark, Scott, Liu, Eric, Frazier, Peter, Wang, JiaLei, Oktay, Deniz, and Vesdapunt, Norases. MOE: A global, black box optimization engine for real world metric optimization. https://github.com/Yelp/MOE, 2014
  3. Ferreira, Kris Johnson, Bin Hong Alex Lee, and David Simchi-Levi. “Analytics for an online retailer: Demand forecasting and price optimization.” Manufacturing & Service Operations Management 18.1 (2015): 69-88.
  4. Flajolet P., Fusy É., Gandouet O., and Meunier F. 2007. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. In Conference on Analysis of Algorithms (AofA).
  5. Goldstein, Alex, et al. “Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation.” Journal of Computational and Graphical Statistics 24.1 (2015): 44-65
  6. Gupta G. and Aggarwal H. Improving Customer Relationship Management Using Data Mining, International Journal of Machine Learning and Computing, Vol. 2, No. 6, December 2012
  7. Hutter, Frank, Hoos, Holger H, and Leyton-Brown, Kevin. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization, pp. 507–523. Springer, 2011
  8. James Bergstra, Yoshua Bengio. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research 13 (2012) 281-305
  9. Koren Y., Bell R., Volinsky C. Matrix factorization techniques for recommender systems. Computer, Vol. 42, No. 8, Aug. 2009
  10. Krishnan S., Wang J., Wu E., Franklin M. J., and Goldberg K. 2016. ActiveClean: Interactive Data Cleaning For Statistical Modeling. PVLDB 9, 12 (2016), 948–959.
  11. Oh, J., Makar, M., Fusco, C., McCaffrey, R., Rao, K.,… Wiens, J. (2018). A Generalizable, Data-Driven Approach to Predict Daily Risk of Clostridium Difficile Infection at Two Large Academic Health Centers. Infection Control & Hospital Epidemiology, 39(4), 425-433. doi:10.1017/ice.2018.16
  12. Ram P., Gray A. Fraud Detection with Density Estimation Trees. Proceedings of the KDD 2017:Workshop on Anomaly Detection in Finance, PMLR 71:85-94, 2018.
  13. Smith-Miles K., Baatar D., Wreford B., Lewis R. Towards objective measures of algorithm performance across instance space, Journal Computers and Operations Research archive Volume 45, May, 2014 Pages 12-24
  14. Snoek, Jasper, Swersky, Kevin, Larochelle, Hugo, Gelbart, Michael, Adams, Ryan P, and Zemel, Richard S. Spearmint: Bayesian optimization software. https://github.com/HIPS/Spearmint, 2014