Prediction bands using estimated conditional density and an LDA model with covariates
Abstract
Machine learning methods are divided into two main groups: supervised and unsupervised
methods. In the first part of this work, we develop a method for creating prediction bands
that can be applied to supervised problems. Our approach is based on conformal methods,
which are very appealing because they create prediction bands that control average coverage
assuming solely i.i.d. data. It is also often desirable to control conditional coverage, that
is, coverage for every new testing point. However, without strong assumptions, conditional
coverage is unachievable. Given this limitation, the literature has focused on methods with
asymptotic conditional coverage. To obtain this property, these methods require
strong conditions on the dependence between the target variable and the features. We
introduce two conformal methods based on conditional density estimators that do not
depend on this type of assumption to obtain asymptotic conditional coverage: Dist-split
and CD-split. While Dist-split asymptotically obtains optimal intervals, which are easier to
interpret than general regions, CD-split obtains optimal-size regions, which are smaller than
intervals. CD-split also obtains local coverage by creating prediction bands locally on a
partition of the feature space. This partition is data-driven and scales to high-dimensional
settings. In a wide variety of simulated scenarios, our methods control conditional coverage
better and produce smaller prediction bands than previously proposed methods.
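To make the conformal idea concrete, the following is a minimal, generic split-conformal sketch built on an estimated conditional CDF. It is not the Dist-split or CD-split implementation from this work. The Gaussian linear working model, the symmetric score |F(y|x) - 0.5|, and the names `split_conformal_band` and `_design` are all assumptions made for illustration; any conditional density estimator could take the place of the working model.

```python
import numpy as np
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal CDF / inverse CDF for the working model


def _design(X):
    # Add an intercept column to the feature matrix (hypothetical helper).
    return np.column_stack([np.ones(len(X)), X])


def split_conformal_band(X_train, y_train, X_cal, y_cal, X_new, alpha=0.1):
    """Return (lower, upper) arrays targeting marginal coverage 1 - alpha."""
    # 1. Fit the assumed working model y | x ~ Normal(x'beta, sigma^2)
    #    on the training split only.
    coef, *_ = np.linalg.lstsq(_design(X_train), y_train, rcond=None)
    sigma = float(np.std(y_train - _design(X_train) @ coef))

    # 2. Conformity scores on the calibration split: distance of the
    #    estimated conditional CDF value from 0.5.
    z_cal = (y_cal - _design(X_cal) @ coef) / sigma
    u_cal = np.array([STD_NORMAL.cdf(float(z)) for z in z_cal])
    scores = np.abs(u_cal - 0.5)

    # 3. Conformal cutoff: the ceil((n + 1)(1 - alpha))-th smallest score.
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = float(np.sort(scores)[k - 1]) if k <= n else 0.5

    # 4. Invert the estimated CDF at 0.5 - q and 0.5 + q. Clipping keeps the
    #    inverse CDF finite in the edge case q = 0.5.
    eps = 1e-12
    z_lo = STD_NORMAL.inv_cdf(max(0.5 - q, eps))
    z_hi = STD_NORMAL.inv_cdf(min(0.5 + q, 1 - eps))
    mu_new = _design(X_new) @ coef
    return mu_new + sigma * z_lo, mu_new + sigma * z_hi
```

The band is calibrated on data not used for fitting, so its average coverage relies only on the data being i.i.d., even if the working model is wrong. Conditional coverage, by contrast, depends on how well the conditional distribution is estimated.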
In the second part, in the context of unsupervised methods, we develop a new version of
the Latent Dirichlet Allocation (LDA) model. The LDA model is a popular method for
creating mixed-membership clusters. Despite having been originally developed for text
analysis, LDA has been used for a wide range of other applications. We propose a new
formulation of the LDA model that incorporates covariates. In this model, a negative
binomial regression is embedded within LDA, enabling straightforward interpretation of
the regression coefficients and the analysis of the quantity of cluster-specific elements in
each sampling unit (instead of the analysis being focused on modeling the proportion of
each cluster, as in Structural Topic Models). We use slice sampling within a Gibbs sampling
algorithm to estimate model parameters. Simulations show that our algorithm successfully
recovers the true parameter values. The model is illustrated using
real data sets from three different areas: text-mining of Coronavirus articles, analysis of
grocery shopping baskets, and ecology of tree species on Barro Colorado Island (Panama).
This model allows the identification of mixed-membership clusters in discrete data and
provides inference on the relationship between covariates and the abundance of these
clusters.
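As an illustration of the slice sampling step mentioned above, here is a generic univariate slice sampler with stepping-out and shrinkage. It is a sketch of the general technique, not the sampler used in this work. The function name `slice_sample`, its parameters, and the standard normal target in the usage note are assumptions. Within a Gibbs sampler, a routine like this would be called once per parameter, with `log_density` set to that parameter's full conditional log-density.

```python
import numpy as np


def slice_sample(log_density, x0, n_samples, width=1.0, max_steps=50, rng=None):
    """Draw n_samples from an unnormalized 1-D log-density via slice sampling."""
    rng = np.random.default_rng() if rng is None else rng
    x = float(x0)
    draws = np.empty(n_samples)
    for i in range(n_samples):
        # Slice level: log of a uniform height under the density at x.
        # 1 - uniform() lies in (0, 1], which avoids log(0).
        log_y = log_density(x) + np.log(1.0 - rng.uniform())

        # Stepping out: place an interval of size `width` around x at random,
        # then expand each end until it leaves the slice. The step budget is
        # split at random between the two sides.
        left = x - width * rng.uniform()
        right = left + width
        j = int(rng.integers(0, max_steps))
        k = max_steps - 1 - j
        while j > 0 and log_density(left) >= log_y:
            left -= width
            j -= 1
        while k > 0 and log_density(right) >= log_y:
            right += width
            k -= 1

        # Shrinkage: propose uniformly in the interval and shrink it toward
        # x whenever the proposal falls outside the slice.
        while True:
            proposal = rng.uniform(left, right)
            if log_density(proposal) >= log_y:
                x = proposal
                break
            if proposal < x:
                left = proposal
            else:
                right = proposal
        draws[i] = x
    return draws
```

For example, `slice_sample(lambda v: -0.5 * v * v, 0.0, 5000)` draws from a standard normal target. The sampler needs the density only up to a normalizing constant, which suits full conditionals that lack a closed form.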