Recent advances in NVIDIA GPU computing technologies have allowed us to compute large feature sets on economic data, including macroeconomic, financial, and operating statistics. These features can be combined to produce reliable forecasting models with significant business applications. That said, identifying the most suitable forecasting model remains one of the most challenging tasks.
Model selection requires deep expertise in forecasting algorithms, algorithm engineering, and statistics. Even with this expertise, success is not guaranteed due to factors like selection bias stemming from personal preferences. Expert knowledge is costly, scarce, and not without errors, making meta-learning an increasingly promising alternative. Meta-learning automates the accumulation of experience based on the performance of multiple forecasting models, effectively “teaching machines to learn how to learn.” In forecasting, this means training a meta-learner to identify which models perform best for different data types, using the characteristics of each time series to guide model selection.
In this presentation, we introduce a novel framework for forecasting at scale based on the meta-learning approach in the time-series domain.
Data availability and quality are critical factors that directly impact the performance of forecasting models. To maximize predictive accuracy, we integrate internal data, such as sales of other products within the company, and relevant external data, including macroeconomic indicators and market conditions, into our prediction pipeline. From a bias-variance tradeoff perspective, including relevant external data introduces only a marginal increase in bias compared to the substantial decrease in variance achieved by expanding the dataset. This larger and more diverse dataset enhances model robustness during validation and testing stages, ultimately improving overall performance.
A crucial aspect of data engineering involves transforming raw data into the most informative form suitable for modeling. This process includes encoding variables, filtering outliers, adding time-series signatures, creating lag variables, and generating rolling statistics. These transformations enrich the feature set, ensuring that the models are equipped with sufficient and relevant information to capture the dynamics of the target variable. However, careful attention is required to avoid over-expanding the feature set, which could lead to an increased risk of overfitting. Domain knowledge plays an essential role here, guiding the creation of meaningful predictors that reflect the underlying patterns of the target variable without inflating the model complexity.
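As a concrete illustration, the sketch below shows what these transformations might look like with pandas; the DataFrame, file name, and column names (e.g. `sales`) are hypothetical placeholders rather than the production schema.

```python
import pandas as pd

# Hypothetical daily sales data indexed by date.
df = pd.read_parquet("sales.parquet").sort_index()

# Time-series signature: calendar features extracted from the index.
df["month"] = df.index.month
df["day_of_week"] = df.index.dayofweek
df["quarter"] = df.index.quarter

# Lag variables: past values of the target.
for lag in (1, 7, 28):
    df[f"sales_lag_{lag}"] = df["sales"].shift(lag)

# Rolling statistics computed on shifted values to avoid target leakage.
df["sales_roll_mean_28"] = df["sales"].shift(1).rolling(28).mean()
df["sales_roll_std_28"] = df["sales"].shift(1).rolling(28).std()

# Simple outlier filter: clip values beyond three standard deviations.
mean, std = df["sales"].mean(), df["sales"].std()
df["sales"] = df["sales"].clip(mean - 3 * std, mean + 3 * std)

df = df.dropna()
```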
The rising availability of high-dimensional datasets poses the challenge of overfitting, where models capture noise rather than meaningful signals. This makes careful feature selection critical to maintaining model generalizability. We address it with advanced techniques such as Boruta SHAP feature elimination, which ranks features by their contribution to model performance and iteratively removes the least important ones until validation performance stops improving. This ensures that the final feature set contains only the most informative predictors, screening out the noise and redundant variables introduced when incorporating external data or creating new features. The result is a more streamlined and interpretable set of predictors that improves both predictive accuracy and generalization to unseen test data.
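To make the idea concrete, here is a simplified backward-elimination sketch driven by SHAP importances with an XGBoost model. It is not the full Boruta SHAP procedure (which additionally tests each feature against randomized "shadow" copies), and the data splits passed in are hypothetical.

```python
import numpy as np
import shap
import xgboost as xgb
from sklearn.metrics import mean_absolute_error

def shap_backward_elimination(X_train, y_train, X_val, y_val, min_features=5):
    """Drop the least important feature (by mean |SHAP|) at each step and keep
    the feature subset with the best validation error. A simplified stand-in
    for Boruta SHAP, which also compares features against shadow copies."""
    features = list(X_train.columns)
    best_score, best_features = np.inf, list(features)

    while len(features) > min_features:
        model = xgb.XGBRegressor(n_estimators=300, max_depth=5)
        model.fit(X_train[features], y_train)

        score = mean_absolute_error(y_val, model.predict(X_val[features]))
        if score < best_score:
            best_score, best_features = score, list(features)

        # Rank features by mean absolute SHAP value and drop the weakest one.
        shap_values = shap.TreeExplainer(model).shap_values(X_val[features])
        importance = np.abs(shap_values).mean(axis=0)
        features.pop(int(np.argmin(importance)))

    return best_features, best_score
```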
After data preparation and feature engineering, various forecasting models, including SARIMAX, XGBoost, DLinear, Prophet, TiDE, and others, are run through a unified interface. The base machine learning models we selected are designed to handle exogenous inputs, allowing them to process not only the target time series but also other relevant variables that influence the forecast. The primary advantage of such models is their ability to incorporate additional contextual information affecting the target series, making predictions more accurate and actionable. Our unified interface is built on top of the Darts package, whose wrapper classes enable seamless integration of third-party models with minimal modification. This unified modeling interface standardizes the forecasting process, facilitates parallelized operations, and includes integrations with Optuna, Dask, and AWS for distributed parallelization, significantly boosting overall computational efficiency.
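A minimal sketch of this covariate-aware interface in Darts is shown below; the engineered DataFrame `df` (with a DatetimeIndex) and the covariate column names are hypothetical, and the hyperparameters are placeholders rather than tuned values.

```python
from darts import TimeSeries
from darts.models import TiDEModel, XGBModel

# Target series and exogenous (past) covariates from the engineered DataFrame.
target = TimeSeries.from_dataframe(df, value_cols="sales")
covariates = TimeSeries.from_dataframe(df, value_cols=["cpi", "unemployment_rate"])

# Different model families behind the same fit/predict interface.
models = {
    "xgb": XGBModel(lags=28, lags_past_covariates=28, output_chunk_length=7),
    "tide": TiDEModel(input_chunk_length=28, output_chunk_length=7, n_epochs=20),
}

forecasts = {}
for name, model in models.items():
    model.fit(target, past_covariates=covariates)
    forecasts[name] = model.predict(n=7, past_covariates=covariates)
```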
We implemented a stacking algorithm that trains several meta-learners on top of the base learners, performs hyperparameter optimization, and automatically selects the best meta-learner. This can be thought of as an automated expert that observes the performance of the various base models and adjusts their outputs before producing the final predictions. Our custom functions and classes are easily reused from the training/validation process in the stacking process, keeping the codebase understandable and maintainable. Additionally, we used the MAPIE package to perform Conformal Quantile Regression (CQR), providing statistically guaranteed coverage of our forecasts. Combined with interpretable ML tools like SHAP and DICE, this enhances prediction certainty, bolsters confidence in business decision-making, and provides actionable insights.
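The stacking step can be sketched with scikit-learn's StackingRegressor on a synthetic tabular dataset; this illustrates the idea rather than the production implementation, which wraps the Darts base models and layers MAPIE's conformal intervals on top.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import ElasticNetCV, RidgeCV
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Synthetic stand-in for the engineered feature matrix and target.
X, y = make_regression(n_samples=500, n_features=20, noise=0.5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, shuffle=False)

# Base learners whose out-of-fold predictions feed the meta-learner.
base_learners = [
    ("xgb", XGBRegressor(n_estimators=300, max_depth=5)),
    ("gbr", GradientBoostingRegressor(n_estimators=300)),
]

# Candidate meta-learners; each would be tuned, and the best one is kept.
candidates = {"ridge": RidgeCV(), "enet": ElasticNetCV()}

best_name, best_score = None, float("-inf")
for name, final_estimator in candidates.items():
    stack = StackingRegressor(estimators=base_learners,
                              final_estimator=final_estimator, cv=5)
    stack.fit(X_train, y_train)
    score = stack.score(X_val, y_val)   # R^2 on the held-out split
    if score > best_score:
        best_name, best_score = name, score
```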
In a forecasting pipeline, large datasets, a wide range of feasible models, and an extensive hyperparameter space can make the process memory-intensive and computationally expensive, slowing down automation. We address the sizeable computational time needed to tune hyperparameters using Optuna. This framework-agnostic, state-of-the-art hyperparameter optimization package supports parallelization and custom trial pruning capabilities for faster results.
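A minimal sketch of this setup is below, tuning a hypothetical XGBoost base model on synthetic data; note that a pruner only takes effect when the objective reports intermediate values.

```python
import optuna
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features and target.
X, y = make_regression(n_samples=1000, n_features=20, noise=0.5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, shuffle=False)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
    }
    model = xgb.XGBRegressor(**params)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_val, model.predict(X_val))

# MedianPruner stops unpromising trials early (once trials report intermediate
# values via trial.report); n_jobs runs several trials in parallel threads.
study = optuna.create_study(direction="minimize",
                            pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=100, n_jobs=4)
print(study.best_params)
```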
We leverage parallelized frameworks through the unified modeling interface to further reduce computational time.
First Level: Parallel Computing
Time series forecasting commonly relies on a rolling-window validation process, where each validation window requires a separate training run for every candidate model. For a validation set of size n and m different models, this results in n × m model training processes. To handle these extensive computations efficiently, we leverage the Dask package, which enables large-scale parallel computing through our unified modeling interface. Dask runs these n × m tasks simultaneously across available resources, significantly enhancing processing speed and scalability, an approach widely used in scientific and high-performance computing.
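A sketch of this fan-out with dask.delayed is shown below; `backtest_one`, the model list, and the number of windows are hypothetical placeholders.

```python
from dask import compute, delayed
from dask.distributed import Client

def backtest_one(model_name, window):
    """Hypothetical helper: fit `model_name` on rolling window `window`
    and return its out-of-sample error (placeholder value here)."""
    return 0.0

client = Client()  # local cluster; point at a remote scheduler in production

# One delayed task per (model, validation window) pair: n x m tasks in total.
tasks = [
    delayed(backtest_one)(model_name, window)
    for model_name in ["sarimax", "xgb", "tide"]   # m candidate models
    for window in range(36)                        # n rolling windows
]
results = compute(*tasks)  # executed in parallel across available workers
```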
Second Level: Distributed Parallel Computing
A next-generation forecasting pipeline must utilize cloud resources to support large-scale task processing. Our custom libraries and functions integrate AWS SageMaker for GPU tasks and AWS Fargate for CPU-bound tasks, and both are executed in parallel through Metaflow’s architecture. This allows computationally demanding models to access GPUs in the cloud while less demanding models run on cheaper compute, ensuring cost efficiency in a production setting. Using cloud resources in this way significantly reduces the pipeline’s overall computational time.
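The flow below is a hedged sketch of this pattern using Metaflow's @batch decorator as a stand-in for our custom SageMaker/Fargate integration; the resource requests and the mapping of queues to GPU or Fargate capacity are deployment-specific assumptions.

```python
from metaflow import FlowSpec, batch, step

class ForecastFlow(FlowSpec):

    @step
    def start(self):
        # Fan out: one branch per base model.
        self.models = ["tide", "dlinear", "xgb", "sarimax"]
        self.next(self.train_model, foreach="models")

    @batch(gpu=1, memory=16000)  # GPU-backed queue (stand-in for the SageMaker integration)
    @step
    def train_model(self):
        self.model_name = self.input
        # ... fit the model for this branch on the assigned compute ...
        self.next(self.join)

    @batch(cpu=2, memory=4000)   # cheaper CPU-only queue (e.g. Fargate-backed)
    @step
    def join(self, inputs):
        self.trained = [inp.model_name for inp in inputs]
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ForecastFlow()
```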
We use the RAPIDS framework to accelerate the forecasting pipeline, including training the meta-learner and predicting the most suitable algorithm for each time series. The meta-learning framework also provides valuable insight into how different features and their interactions affect each forecast.
Offline Phase: The application of this novel forecasting framework begins with training a meta-learner using historical data. In this phase, the meta-learner is fed various time series along with their extracted features, such as seasonality, trend, and volatility. The goal is for the meta-learner to understand how these features influence the performance of different forecasting models. The meta-learner builds a knowledge base of model performance across different scenarios through this training process.
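A hedged sketch of this offline phase is shown below, using the RAPIDS cuML library mentioned above as the GPU-accelerated meta-learner; the knowledge-base file, feature names, and label encoding are hypothetical.

```python
import cudf
from cuml.ensemble import RandomForestClassifier

# Hypothetical knowledge base: one row per historical series, with extracted
# features and the id of the model that achieved the lowest backtest error.
meta_df = cudf.read_parquet("series_features.parquet")
feature_cols = ["trend_strength", "seasonal_strength", "volatility", "series_length"]

X = meta_df[feature_cols].astype("float32")
y = meta_df["best_model_id"].astype("int32")   # e.g. 0=SARIMAX, 1=XGBoost, 2=TiDE

# GPU-accelerated meta-learner trained on the offline knowledge base.
meta_learner = RandomForestClassifier(n_estimators=500, max_depth=12)
meta_learner.fit(X, y)
```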
Online Phase: When new data becomes available, the trained meta-learner is deployed to quickly assess the features and identify the models most likely to perform well. The meta-learner assigns a probability to each model, indicating how well it fits the given time series. These probabilities are then used to construct an ensemble forecast: a weighted average in which each model’s contribution is determined by its predicted performance. This dynamic selection and weighting of models allows the framework to adapt to new data, leveraging the strengths of multiple forecasting approaches to improve overall accuracy and robustness.
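The online phase then reduces to a probability-weighted average, sketched below with the `meta_learner` from the previous snippet and placeholder base-model forecasts.

```python
import numpy as np

# Features extracted from the newly arrived series (hypothetical values,
# in the same order as feature_cols above).
new_features = np.array([[0.72, 0.31, 0.15, 260.0]], dtype=np.float32)

# Probability that each base model is the best fit for this series.
weights = meta_learner.predict_proba(new_features).ravel()   # sums to 1

# Placeholder point forecasts from the base models, in label order.
horizon = 7
base_forecasts = np.vstack([
    np.full(horizon, 100.0),   # model id 0: SARIMAX
    np.full(horizon, 105.0),   # model id 1: XGBoost
    np.full(horizon,  98.0),   # model id 2: TiDE
])

# Weighted-average ensemble: each model contributes according to its
# predicted probability of being the best performer.
ensemble_forecast = weights @ base_forecasts
```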
This two-phase approach provides a scalable and automated solution for large-scale forecasting, combining models based on the data's characteristics to optimize prediction outcomes.