# AutoML Solutions

## The Problem: The Typical ML Workflow Is Riddled With Questions Answered By Costly Trial and Error

**Which Variables to Use For Predictions?**

If the input variables are not sufficiently informative about the label to predict, training any predictive model will be unsuccessful.

Additionally, training a predictive model with all available inputs, informative and uninformative alike, takes longer, costs more, and does not necessarily result in better performance.

Finding the right variables to use is crucial, but doing so by trial and error can be very costly.

**How Long Should A Model Be Trained?**

Predictive models that are slow to train can be trained on a time budget. When doing so, the tradeoff one needs to get right is between time and compute resources on the one hand, and performance on the other.

Understanding whether a model is still worth trying to improve is crucial, but finding out by trial and error can be very costly.

**Which Models Are Worth Training?**

The right model is the one that strikes the best compromise between performance and production-worthiness (e.g. simplicity, robustness, explainability).

At times, simpler and more explainable models are the most accurate. Occasionally, however, the most complex models yield a performance increase over simpler models that is worth the decrease in production-worthiness.

It is important to work with a large library of models, but systematically training every model in the library is often unnecessary and can be very slow and costly.

**How To Improve a Deployed Predictive Model?**

Faced with improving a deployed machine learning model, should a data scientist use the same input variables but a more complex model (model-driven improvement), or look for additional and complementary input data (data-driven improvement)?

If the former, which variables did the model not fully exploit? If the latter, how can one determine whether candidate new variables are complementary and can be expected to improve performance?

Trying to increase the performance of a deployed model by trial and error can be very slow and costly.

## Our Solution: Empower Data Scientists To Focus on High ROI Experiments

We use novel information theory methods to anticipate experiments that are unlikely to be successful, and empower data scientists and machine learning engineers to focus on experiments that are expected to generate the highest business value.

Our work is both open-math and open-source. The math behind our solutions can be found in the reference section, and our source code is available on our GitHub page.

**Model-Free Variable Selection Analysis**

Our model-free variable selection analysis evaluates the highest performance (e.g. R², log-likelihood, classification accuracy) achievable by any predictive model using a specific set of input variables. This allows data scientists to choose, prior to any modeling and free of trial and error, inputs that can bring about sufficient business impact.

Additionally, for d input variables and for any k between 1 and d, our analysis identifies the best k variables to use, along with the percentage of the overall achievable performance that could be obtained using those k variables alone. This allows data scientists to focus on insightful variables, thereby reducing training time and compute cost, and often improving performance as well. The trial-and-error alternative requires running an exponential number of experiments (up to 2^d, one per subset of variables), which is vastly more costly.
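To make the idea concrete, here is a minimal, self-contained sketch of this kind of model-free ranking, using synthetic data and scikit-learn's `mutual_info_regression`. It is an illustration only: it scores variables by their *individual* mutual information with the label, whereas the analysis described above relies on *joint* mutual information, which also accounts for redundancy between inputs.

```python
# Toy, model-free variable ranking via per-feature mutual information.
# Illustrative sketch only: a per-feature estimator ignores redundancy
# between inputs, unlike a joint mutual information estimate.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 5000
x_informative = rng.normal(size=(n, 3))
x_noise = rng.normal(size=(n, 2))                  # uninformative inputs
y = x_informative @ np.array([1.0, 0.5, 0.25]) + 0.1 * rng.normal(size=n)
X = np.hstack([x_informative, x_noise])

mi = mutual_info_regression(X, y)                  # MI estimates, in nats
order = np.argsort(mi)[::-1]                       # most informative first
share = np.cumsum(mi[order]) / mi.sum()            # naive cumulative share

for k, (idx, s) in enumerate(zip(order, share), start=1):
    print(f"rank {k}: x_{idx} (cumulative naive information share ~{100*s:.0f}%)")
```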

**Model-Free Achievable Performance Analysis**

Our achievable performance analysis evaluates the best performance (e.g. R², log-likelihood, classification accuracy) achievable by any predictive model using a specific set of input variables.

Used as a stopping criterion while training predictive models, it empowers data scientists to strike the right balance between time and compute budgets and expected business impact, and it helps with the early detection of overfitted models.
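As an illustration of how such a bound can serve as a stopping criterion, the sketch below uses the classical identity R² = 1 − exp(−2·I(y; x)), which is exact for jointly Gaussian variables, to turn a mutual information estimate into an upper bound on R². Estimating the joint mutual information well is the hard part and is not shown here; scikit-learn's estimator stands in for it on a one-dimensional example.

```python
# Sketch of an achievable-R^2 stopping criterion, assuming the Gaussian
# identity R^2 = 1 - exp(-2 * I(y; x)) as an illustrative performance bound.
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
n = 10000
x = rng.normal(size=(n, 1))
y = 2.0 * x[:, 0] + rng.normal(size=n)             # true R^2 = 4/5 = 0.8

mi = mutual_info_regression(x, y)[0]               # I(y; x), in nats
achievable_r2 = 1.0 - np.exp(-2.0 * mi)            # model-free upper bound

# In-sample fit for brevity; use held-out data in practice.
model_r2 = r2_score(y, LinearRegression().fit(x, y).predict(x))
print(f"achievable R^2 ~ {achievable_r2:.2f}, model R^2 = {model_r2:.2f}")
if achievable_r2 - model_r2 < 0.01:                # tolerance: a judgment call
    print("Model is near the bound; further tuning is likely wasted effort.")
```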

**Model-Free Model Improvability Analysis**

Our model improvability analysis empowers data scientists to explore vast libraries of models parsimoniously, starting with the most production-worthy models (i.e. simpler, more robust, and easier to explain). A less production-worthy model is only considered when the analysis suggests that the potential increase in performance outweighs the decrease in production-worthiness.
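A minimal sketch of this parsimonious exploration loop follows, assuming an achievable-performance estimate is already available (here, a hypothetical hard-coded value) and using an illustrative two-model library ordered by decreasing production-worthiness.

```python
# Sketch of parsimonious model exploration: train the most production-worthy
# model first and escalate only when the estimated achievable performance
# suggests meaningful headroom.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_friedman1(n_samples=2000, random_state=0)
achievable_r2 = 0.95                 # hypothetical output of the analysis above

# Library ordered by decreasing production-worthiness.
library = [("linear", LinearRegression()),
           ("boosted trees", GradientBoostingRegressor(random_state=0))]

for name, model in library:
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: cross-validated R^2 = {r2:.2f}")
    if achievable_r2 - r2 < 0.05:    # headroom threshold: a judgment call
        print(f"'{name}' is close enough to the achievable bound; stop here.")
        break
```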

**Model-Free Dataset Valuation**

When adopting a data-driven approach to improving a deployed predictive model, our model-free dataset valuation provides data scientists with an estimate of the maximum performance increase (e.g. R², log-likelihood, classification accuracy) that one or more additional input variables can bring about, prior to using them in any model.

This allows data scientists to cut down on dead-end experiments and generate business value much more quickly and cost-effectively.
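The sketch below illustrates the idea on synthetic data: estimate the achievable performance with and without the candidate variable, and read off the maximum possible uplift before any model is fit. It naively sums per-feature mutual information estimates, which ignores redundancy between inputs; it is a toy stand-in for a proper joint estimate.

```python
# Sketch of model-free dataset valuation: estimate how much a candidate
# variable could raise the achievable R^2 before fitting any model,
# reusing the illustrative identity R^2 = 1 - exp(-2 * I).
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(2)
n = 10000
x_old = rng.normal(size=(n, 1))                    # variables already deployed
x_new = rng.normal(size=(n, 1))                    # candidate new data
y = x_old[:, 0] + x_new[:, 0] + 0.5 * rng.normal(size=n)

def achievable_r2(X, y):
    """Naive bound: sum per-feature MI (ignores redundancy between inputs)."""
    return 1.0 - np.exp(-2.0 * mutual_info_regression(X, y).sum())

baseline = achievable_r2(x_old, y)
augmented = achievable_r2(np.hstack([x_old, x_new]), y)
print(f"max performance increase ~ {augmented - baseline:.2f} in R^2 terms")
```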

# Asset Management Solutions

We provide a wide range of model-free quantitative solutions for the asset management industry. Our methods are all free of rigid and arbitrary assumptions such as linearity and Gaussianity.

##### Performance Attribution

Modern quantitative investment management aggregates a large number of investment ideas (or alphas) into a meta-strategy by solving an optimization problem under various investment constraints, such as maximum risk factor exposures, maximum position sizes, and portfolio turnover. This results in a highly nonlinear relationship between investment performance, alpha performance, and risk factors.

Thus, any attempt at explaining the active performance of an investment fund using linear models, a popular example being the Brinson-Hood-Beebower model, is bound to be suboptimal and to exhibit large residual variances.

We provide a performance attribution solution that is model-free and copes with nonlinear associations.

##### Nonlinear Risk Analysis

Assuming linearity or Gaussianity in a risk model is like assuming that crises do not exist, debtors do not default, and markets do not overreact. A modern risk model should make no such assumptions.

We provide an information-theoretical risk analysis solution that is model-free and naturally copes with non-Gaussian and nonlinear associations.
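The toy example below shows why this matters: a purely nonlinear exposure to a risk factor is invisible to Pearson correlation, the workhorse of linear risk models, yet clearly detected by a mutual information estimate. The data and estimator choices are illustrative.

```python
# Sketch: why Gaussian/linear risk metrics can miss real dependence.
# A symmetric nonlinear relationship has near-zero Pearson correlation,
# yet mutual information detects the dependence.
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(3)
factor = rng.normal(size=20000)
asset = factor ** 2 + 0.1 * rng.normal(size=20000)  # pure nonlinear exposure

rho, _ = pearsonr(factor, asset)
mi = mutual_info_regression(factor.reshape(-1, 1), asset)[0]
print(f"Pearson correlation ~ {rho:.2f} (sees almost no risk)")
print(f"mutual information ~ {mi:.2f} nats (sees strong dependence)")
```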

##### Alternative Data Evaluation

Alternative data present a unique opportunity for investment managers and analysts to gain an information advantage.

To assess the value of alternative datasets, investment managers and analysts often plug them into their modeling pipeline, making format-dependent assumptions such as linearity, and monitor expected incremental performance for a while. Considering the explosion of datasets, vendors, and data formats, this approach can be very slow, costly and wasteful.

We provide an information-theoretical alternative dataset evaluation solution that decouples the question of *how much* insight can be extracted from an alternative dataset from the question of *how* to extract that insight.
This speeds up alternative dataset evaluation and empowers quants and analysts to focus on promising datasets.
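A sketch of what this triage might look like in practice: score each candidate dataset with a cheap information estimate, and only promote the most promising ones to the full modeling pipeline. All dataset names and the scoring choice below are hypothetical placeholders.

```python
# Sketch of an alternative-dataset triage loop: rank candidate datasets by a
# cheap, format-agnostic information estimate before any pipeline work.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(4)
n = 5000
target = rng.normal(size=n)                        # e.g. forward returns
candidates = {                                     # hypothetical vendor feeds
    "vendor_a_foot_traffic": target + rng.normal(size=n),
    "vendor_b_sentiment": 0.2 * target + rng.normal(size=n),
    "vendor_c_weather": rng.normal(size=n),        # pure noise
}

scores = {name: mutual_info_regression(x.reshape(-1, 1), target)[0]
          for name, x in candidates.items()}
for name, mi in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: ~{mi:.2f} nats")               # promote only the top feeds
```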