Analysis of the most recent modelling techniques for big data

With particular attention to Bayesian ones

One of the key issues in large data models is that the number of available economic variables is very large, resulting in poor inference and forecasting performance of standard econometric techniques. Most of the statistical and econometric literature on large datasets attempts to reduce the data dimension by ‘penalising’ the model for complexity. Dimension reduction is accomplished either through penalty functions that shrink the coefficients of the large set of explanatory variables towards zero, or through compressing the explanatory variables into a much smaller set. The common idea behind the different approaches is to avoid overfitting and, as a result, considerably improve forecasting performance.
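As a minimal illustration of the shrinkage idea, the sketch below fits an L1-penalised ('lasso') regression on synthetic data with more candidate predictors than observations; the data-generating process and the penalty strength are illustrative assumptions, not taken from the survey. The penalty sets most coefficients exactly to zero, reducing the effective model dimension.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 200                 # more candidate predictors than observations
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]   # only 5 variables truly matter
y = X @ beta + 0.5 * rng.standard_normal(n)

# L1 penalty shrinks coefficients towards zero; many become exactly zero
model = Lasso(alpha=0.1).fit(X, y)
n_selected = int(np.sum(model.coef_ != 0))
print(f"non-zero coefficients: {n_selected} of {p}")
```

An ordinary least-squares fit would be ill-posed here (p > n); the penalty is what makes estimation feasible and curbs overfitting.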

In this paper, we provide a survey of different models for inference with big data, focusing on the most relevant methodological improvements in the field of Bayesian econometrics. For completeness, we also include a discussion of the methods previously surveyed in Kapetanios, Marcellino and Papailias (2016).

The rest of the document is organised as follows. Section 2 describes various methods suited for the analysis of linear models with a very large number of explanatory variables. We first review various penalised regression approaches, then show how they can be given a Bayesian interpretation when particular prior distributions are chosen for the model parameters, next introduce further methods based on yet other choices of prior distributions (such as spike and slab regressions and compressed regressions), and finally consider quantile and expectile regressions. We also discuss multivariate regression methods, mostly variants of Bayesian VARs, and a set of procedures for variable selection particularly suited for the big data context (the latter were already considered in Kapetanios, Marcellino and Papailias (2016)).
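The Bayesian interpretation of penalised regression mentioned above can be sketched concretely: under a Gaussian likelihood and an i.i.d. Gaussian prior beta ~ N(0, tau² I), the posterior mean coincides with the ridge estimate with penalty lambda = sigma²/tau². The variances below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n, p = 50, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

sigma2, tau2 = 1.0, 0.5        # noise and prior variances (illustrative)
lam = sigma2 / tau2            # implied ridge penalty

# Posterior mean under y ~ N(X beta, sigma2 I), beta ~ N(0, tau2 I):
# (X'X + (sigma2/tau2) I)^{-1} X'y
post_mean = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Classical ridge estimate with the same penalty
ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(post_mean, ridge))   # the two estimators coincide
```

A tighter prior (smaller tau²) implies a larger penalty and stronger shrinkage towards zero, which is the sense in which prior choice governs dimension reduction.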

In Section 3, we review some non-parametric and/or non-linear approaches suited for applications with big data, such as random trees, random forests, cluster analysis, deep learning and neural networks.
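As a small sketch of the non-linear methods listed above (with a synthetic, illustrative data-generating process), a random forest can capture thresholds and interactions that a linear model would miss, by averaging many regression trees fit on bootstrap samples.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n = 500
X = rng.uniform(-2, 2, size=(n, 3))
# Non-linear target with a threshold interaction a linear model cannot capture
y = np.sin(X[:, 0]) + (X[:, 1] > 0) * X[:, 2] + 0.1 * rng.standard_normal(n)

# Average of 200 regression trees fit on bootstrap samples of the data
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
r2 = forest.score(X, y)   # in-sample fit; real use needs a holdout sample
print(round(r2, 2))
```

The averaging over trees is itself a form of regularisation, in the same spirit as the penalisation ideas of Section 2.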

In Sections 4, 5 and 6, which for completeness are based on Kapetanios et al. (2016), we survey, respectively, methods for summarizing the information in large datasets, forecast combination approaches, and techniques for the analysis of mixed frequency datasets. The idea of summarizing the information in large datasets by means of a few constructed series, often called factors or indexes, has a long tradition in econometrics, and can be extended to the case of big data, where sparsity is often an additional problem. Another common approach that performs well in empirical applications based on economic data is pooling a large number of forecasts from very simple models rather than using a single forecast from one big model: rather than selecting or summarizing the many indicators, we select or, more frequently, directly combine the associated forecasts. Finally, approaches that can deal with mixed frequency data are relevant, as big data are typically available at a higher frequency than the target indicator, as discussed in previous reports.
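The pooling idea can be sketched as follows: fit one very simple model per indicator and average the resulting forecasts with equal weights. The data and the single-predictor specification are illustrative assumptions, not a method from the survey's empirical sections.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 120, 20
X = rng.standard_normal((n, p))
y = X @ (0.3 * np.ones(p)) + rng.standard_normal(n)

X_train, X_test = X[:100], X[100:]
y_train = y[:100]

# One bivariate regression (intercept + single indicator) per predictor
forecasts = []
for j in range(p):
    Z = np.column_stack([np.ones(100), X_train[:, j]])
    coef, *_ = np.linalg.lstsq(Z, y_train, rcond=None)
    Zt = np.column_stack([np.ones(20), X_test[:, j]])
    forecasts.append(Zt @ coef)

combined = np.mean(forecasts, axis=0)   # equal-weight forecast combination
print(combined.shape)
```

Equal weights are the simplest combination scheme; the survey's Section 5 covers more elaborate weighting rules.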

In Section 7, we compare the reviewed econometric methods for big data. Finally, in Section 8 we summarize the main results and conclude.
