The Problem: The Typical ML Workflow Is Riddled With Questions Answered By Costly Trial and Error
Which Variables to Use For Predictions?
If the input variables are not sufficiently informative about the label to predict, no predictive model trained on them will be successful.
Additionally, training a predictive model on all available inputs, informative and uninformative alike, takes longer, costs more, and does not necessarily yield better performance.
Finding the right variables to use is crucial, but doing so by trial and error can be very costly.
How Long Should A Model Be Trained?
Predictive models that are slow to train can be trained on a time budget. When doing so, the tradeoff one needs to get right is between time or compute resources on the one hand and performance on the other.
Getting an understanding of whether a model is still worth trying to improve is crucial, but doing so by trial and error can be very costly.
Which Models Are Worth Training?
The right model is the one that strikes the best compromise between performance and production-worthiness (e.g. simplicity, robustness, explainability, etc.).
At times, simpler and more explainable models are the most accurate. Occasionally, however, the most complex models yield a performance increase over simpler ones that is worth the decrease in production-worthiness.
It is important to work with a large library of models, but systematically training all models in the library is often unnecessary and can be very slow and costly.
How To Improve a Deployed Predictive Model?
Faced with improving a deployed machine learning model, should a data scientist use the same input variables but a more complex model (model-driven improvement), or look for additional and complementary input data (data-driven improvement)?
If the former, what variables did the model not fully exploit? If the latter, how to determine whether candidate new variables are complementary and should be expected to improve performance?
Trying to increase the performance of a deployed model by trial and error can be very slow and costly.
Our Solution: Empower Data Scientists To Focus on High ROI Experiments
We use novel information theory methods to anticipate experiments that are unlikely to be successful, and empower data scientists and machine learning engineers to focus on experiments that are expected to generate the highest business value.
Our work is both open-math and open-source. The math behind our solutions can be found in the reference section, and our source code is available on our GitHub page.
Model-Free Variable Selection Analysis
Our model-free variable selection analysis evaluates the highest performance (e.g. R2, log-likelihood, classification accuracy) achievable by any predictive model using a specific set of input variables, thereby allowing data scientists to choose inputs that can bring about sufficient business impact, prior to any modeling, and free of trial and error.
Additionally, for d input variables and for any k between 1 and d, our analysis identifies the best k variables to use, and the percentage of the overall achievable performance that could be obtained by using those k variables alone. This allows data scientists to focus on insightful variables, thereby reducing training time and compute cost, and often improving performance as well. Doing the same by trial and error would require running a number of experiments that grows exponentially with d, which is prohibitively costly.
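As a hedged illustration of the idea (not our production algorithm), ranking variables by how informative they are about the label can be sketched with scikit-learn's k-nearest-neighbor mutual information estimator on synthetic data. Note that this marginal ranking ignores redundancy between variables, which a proper analysis must account for:

```python
# Illustrative sketch only: rank input variables by an estimate of the
# mutual information I(y; x_i) between each variable and the label.
# This marginal ranking ignores redundancy between variables and is a
# stand-in for a full model-free variable selection analysis.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=(n, 4))
# The label depends (nonlinearly) on the first two variables only;
# x[:, 2] and x[:, 3] are pure noise.
y = x[:, 0] + x[:, 1] ** 2 + 0.1 * rng.normal(size=n)

mi = mutual_info_regression(x, y, random_state=0)  # one estimate per column
ranking = np.argsort(mi)[::-1]  # most informative variables first
print("variables ranked by informativeness:", ranking)
```

Keeping only the top-k variables of such a ranking is the trial-and-error-free analogue of choosing the best k variables.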
Model-Free Achievable Performance Analysis
Our achievable performance analysis evaluates the best performance (e.g. R2, log-likelihood, classification accuracy) achievable by any predictive model using a specific set of input variables.
Used as a stopping criterion while training predictive models, it empowers data scientists to strike the right balance between time and compute resource budgets and expected business impact, and it helps with early detection of overfitted models.
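To make this concrete, here is a minimal sketch of how a mutual information estimate translates into a performance ceiling. The identity R²max = 1 − e^(−2I) is exact for jointly Gaussian data; it is used below purely for illustration, not as a description of our estimators:

```python
# Minimal sketch: convert an estimate of the mutual information I(y; x)
# (in nats) into a ceiling on the R^2 that ANY model could achieve.
# The identity below is exact in the jointly Gaussian case and is used
# here only as an illustrative approximation.
import math

def achievable_r2(mi_nats: float) -> float:
    """Highest R^2 achievable by any predictive model given I(y; x) in nats."""
    return 1.0 - math.exp(-2.0 * mi_nats)

# Inputs carrying 0.5 nats of information about the label cap R^2 near 0.63.
# A model already scoring R^2 = 0.62 has little headroom left, so training
# can stop; a much higher validation score would hint at overfitting.
print(round(achievable_r2(0.5), 3))
```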
Model-Free Model Improvability Analysis
Our model improvability analysis empowers data scientists to explore vast libraries of models parsimoniously, starting with models with high production-worthiness (i.e. simpler, more robust and easier to explain), and only considering a less production-worthy model when the model improvability analysis suggests the potential increase in performance outweighs the decrease in production-worthiness.
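The escalation logic can be sketched as a simple guard. The helper and its `min_gain` threshold below are hypothetical, standing in for however a team prices the loss of production-worthiness:

```python
# Illustrative sketch: only escalate to a more complex, less
# production-worthy model when the estimated headroom (achievable minus
# achieved performance) exceeds the minimum gain required to justify it.
def worth_escalating(achieved_r2: float, achievable_r2: float,
                     min_gain: float = 0.05) -> bool:
    """True if the remaining performance headroom justifies trying a more
    complex model; the threshold is hypothetical, for illustration."""
    return (achievable_r2 - achieved_r2) > min_gain

# A linear model at R^2 = 0.60 with an achievable ceiling of 0.63 leaves
# only 0.03 of headroom: a deep network is unlikely to be worth it.
print(worth_escalating(0.60, 0.63))
```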
Model-Free Dataset Valuation
When adopting a data-driven approach to improving a deployed predictive model, our model-free dataset valuation provides data scientists with an estimate of the maximum performance increase (e.g. R2, log-likelihood, classification accuracy) that one or more additional input variables can bring about, prior to using them in any model.
This allows data scientists to cut down the number of dead-end experiments, and generate business value much quicker and cost-effectively.
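As a hedged sketch of the idea, under a joint-Gaussian assumption one can estimate I(y; inputs) from a linear fit and convert the information added by a candidate variable into a maximum achievable R² gain, all before the variable is used in any production model. Everything below (the Gaussian MI estimate, the synthetic data) is an illustrative stand-in, not our methodology:

```python
# Illustrative sketch: value a candidate input variable x_new by the
# increase in achievable R^2 it brings, before training any model on it.
# The mutual information estimate assumes jointly Gaussian data and is
# derived from an ordinary least-squares fit.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x_old = rng.normal(size=n)   # variable already used in production
x_new = rng.normal(size=n)   # candidate complementary variable
y = x_old + x_new + 0.1 * rng.normal(size=n)

def gaussian_mi(y, X):
    """I(y; X) in nats under a joint-Gaussian assumption:
    I = -0.5 * ln(1 - R^2), with R^2 from an OLS fit of y on X."""
    X = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    r2 = 1.0 - resid.var() / y.var()
    return -0.5 * np.log(1.0 - r2)

mi_old = gaussian_mi(y, x_old)
mi_both = gaussian_mi(y, np.column_stack([x_old, x_new]))

# Maximum achievable R^2 before and after adding the candidate variable.
r2_old = 1.0 - np.exp(-2.0 * mi_old)
r2_both = 1.0 - np.exp(-2.0 * mi_both)
print(f"estimated max R^2 gain from x_new: {r2_both - r2_old:.2f}")
```

A sizable estimated gain flags the candidate dataset as worth acquiring and experimenting with; a negligible one flags a dead end before any training cost is incurred.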