Time series are often subject to abrupt changes in level, which are generally represented by Markov switching (MS) models under the assumption that the level is constant within each state (regime). This framework is not realistic, because the level can also change within a regime, with jumps that are minor relative to a change of state; this is typical of many economic time series, such as the Gross Domestic Product (GDP) or the volatility of financial markets. We propose to make the state flexible by introducing a very general model in which the level of the time series oscillates within each state of the MS model; these movements are driven by a forcing variable. The new model accommodates extreme jumps in a parsimonious way, without requiring a large number of regimes (in our examples, two-state MS models are used). Moreover, the model improves interpretability and, in particular, out-of-sample performance relative to the most widely used alternative models. The approach can be applied in several fields, also using unobservable data. We show its advantages in three distinct applications that extend particular MS models, involving macroeconomic variables, volatilities of financial markets and conditional correlations.
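As a point of reference, a minimal sketch of the baseline two-state MS model with a constant level per regime (the restriction this abstract's model relaxes) can be simulated as follows; all parameter values are illustrative, not taken from the article.

```python
import numpy as np

# Baseline two-state Markov switching model: constant level within each regime,
# Gaussian noise around it. All numbers are illustrative.
rng = np.random.default_rng(0)
P = np.array([[0.95, 0.05],   # transition probabilities; rows sum to 1
              [0.10, 0.90]])
mu = np.array([0.0, 3.0])     # state-specific levels
T = 300
states = np.empty(T, dtype=int)
y = np.empty(T)
s = 0
for t in range(T):
    s = rng.choice(2, p=P[s])           # draw next regime
    states[t] = s
    y[t] = mu[s] + 0.3 * rng.normal()   # constant level + noise
```

In the article's extension, `mu[s]` would itself move within a regime, driven by a forcing variable, rather than staying fixed as here.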
Sequential regression approaches can be used to analyze processes in which covariates are revealed in stages. Such processes occur widely, with examples including medical intervention, sports contests and political campaigns. The naïve sequential approach involves fitting regression models using the covariates revealed by the end of the current stage, but this is only practical if the number of covariates is not too large. An alternative approach is to incorporate the score (linear predictor) from the model developed at the previous stage as a covariate at the current stage. This score takes into account the history of the process prior to the stage under consideration. However, the score is a function of fitted parameter estimates and, therefore, contains measurement error. In this article, we propose a novel technique to account for error in the score. The approach is demonstrated with application to the sprint event in track cycling and is shown to reduce bias in the estimated effect of the score and avoid unrealistically extreme predictions.
Bayesian penalized splines (P-splines) assume an intrinsic Gaussian Markov random field prior on the spline coefficients, conditional on a precision hyper-parameter
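The intrinsic Gaussian Markov random field prior mentioned here has precision matrix proportional to `D'D` for a difference matrix `D`; a minimal numpy sketch (difference order and basis size chosen for illustration) shows why the prior is intrinsic, i.e., improper:

```python
import numpy as np

n = 10                               # number of spline coefficients (illustrative)
D = np.diff(np.eye(n), n=2, axis=0)  # second-order difference matrix, shape (n-2, n)
K = D.T @ D                          # prior precision is tau * K; tau is the hyper-parameter
# K is rank-deficient by the penalty order (here 2), so the prior is flat
# over constant and linear trends in the coefficients: an intrinsic GMRF.
rank = np.linalg.matrix_rank(K)      # equals n - 2
```

Larger values of the precision hyper-parameter shrink the fitted spline towards a polynomial of degree one below the penalty order.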
The evaluation of peritoneal dialysis (PD) programmes requires the use of statistical methods that suit the complexity of such programmes. Multi-state regression models taking competing risks into account are a good example of suitable approaches. In this work, multi-state structured additive regression (STAR) models combined with penalized splines (P-splines) are proposed to evaluate peritoneal dialysis programmes. These models are very flexible since they may consider smooth estimates of baseline transition intensities and the inclusion of time-varying and smooth covariate effects at each transition. A key issue in survival analysis is the quantification of the time-dependent predictive accuracy of a given regression model, which is typically assessed using receiver operating characteristic (ROC)’based methodologies. The main objective of the present study is to adapt the concept of time-dependent ROC curve, and their corresponding area under the curve (AUC), to a multi-state competing risks framework. All statistical methodologies discussed in this work were applied to PD survival data. Using a multi-state competing risks framework, this study explored the effects of major clinical covariates on survival such as age, sex, diabetes and previous renal replacement therapy. Such multi-state model was composed of one transient state (peritonitis) and several absorbing states (death, transfer to haemodialysis and renal transplantation). The application of STAR models combined with time-dependent ROC curves revealed important conclusions not previously reported in the nephrology literature when using standard statistical methodologies. For practical application, all the statistical methods proposed in this article were implemented in
The shared frailty model is a popular tool to analyze correlated right-censored time-to-event data. In the shared frailty model, the latent frailty is assumed to be shared by the members of a cluster and is assigned a parametric distribution, typically a gamma distribution due to its conjugacy. In the case of interval-censored time-to-event data, the inclusion of frailties results in complicated intractable likelihoods. Here, we propose a flexible frailty model for analyzing such data by assuming a smooth semi-parametric form for the conditional time-to-event distribution and a parametric or a flexible form for the frailty distribution. The results of a simulation study suggest that the estimation of regression parameters is robust to misspecification of the frailty distribution (even when the frailty distribution is multimodal or skewed). Given sufficiently large sample sizes and number of clusters, the flexible approach produces smooth and accurate posterior estimates for the baseline survival function and for the frailty density, and it can correctly detect and identify unusual frailty density forms. The methodology is illustrated using dental data from the Signal Tandmobiel
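To make the shared-frailty structure concrete, the following hedged sketch simulates clustered event times with a gamma frailty of mean 1 (the usual identifiability constraint); it is a data-generating illustration only, not the article's semi-parametric estimator, and all parameter values are invented.

```python
import numpy as np

rng = np.random.default_rng(42)
n_clusters, cluster_size = 5000, 2
theta = 0.25                         # frailty variance; frailty mean fixed at 1
# Gamma frailty with mean 1 and variance theta: shape 1/theta, scale theta
w = rng.gamma(shape=1 / theta, scale=theta, size=n_clusters)
base_rate = 0.1                      # baseline exponential hazard (illustrative)
# Conditional on w_i, all members of cluster i share the hazard w_i * base_rate
times = rng.exponential(scale=1 / (w[:, None] * base_rate),
                        size=(n_clusters, cluster_size))
```

Members of a high-frailty cluster all tend to fail early, which is exactly the positive within-cluster dependence the shared frailty model is designed to capture.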
Index measures are commonly used in medical research and clinical practice, primarily for quantification of health risks in individual subjects or patients. The utility of an index measure is ultimately contingent on its ability to predict health outcomes. Construction of medical indices has largely been based on heuristic arguments, although the acceptance of a new index typically requires objective validation, preferably with multiple outcomes. In this article, we propose an analytical tool for index development and validation. We use a multivariate single-index model to ascertain the best functional form for risk index construction. Methodologically, the proposed model represents a multivariate extension of the traditional single-index models. Such an extension is important because it assures that the resultant index simultaneously works for multiple outcomes. The model is developed in the general framework of longitudinal data analysis. We use penalized cubic splines to characterize the index components while leaving the other subject characteristics as additive components. The splines are estimated directly by penalized nonlinear least squares, and we show that the model can be implemented using existing software. To illustrate, we examine the formation of an adiposity index for prediction of systolic and diastolic blood pressure in children. We assess the performance of the method through a simulation study.
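The single-index idea underlying this abstract can be sketched in a deliberately simplified form: one outcome, a crude polynomial stand-in for the unknown link `g`, and a grid profile over the index direction (the article's actual model is multivariate, longitudinal and spline-based; everything below is an illustrative assumption).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
X = rng.normal(size=(n, 2))
alpha_true = np.array([0.6, 0.8])            # unit-norm index direction
y = np.sin(X @ alpha_true) + 0.1 * rng.normal(size=n)

def profile_rss(theta):
    """RSS after fitting g(.) to the 1-D index X @ alpha(theta)."""
    alpha = np.array([np.cos(theta), np.sin(theta)])
    B = np.vander(X @ alpha, 6)              # crude degree-5 polynomial basis for g
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return np.sum((y - B @ coef) ** 2)

grid = np.linspace(0.0, np.pi, 181)          # alpha is identified only up to sign
theta_hat = grid[np.argmin([profile_rss(t) for t in grid])]
alpha_hat = np.array([np.cos(theta_hat), np.sin(theta_hat)])
```

The profiled direction `alpha_hat` plays the role of the risk index: a single linear score that drives the outcome through an estimated, possibly nonlinear, function.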
In categorical data analysis, several regression models have been proposed for hierarchically structured responses, such as the nested logit model, the two-step model or the partitioned conditional model for partially ordered sets. The specifications of these models are heterogeneous and they have been formally defined for only two or three levels in the hierarchy. Here, we introduce the class of partitioned conditional generalized linear models (PCGLMs) that encompasses all these models and is defined for any number of levels in the hierarchy. The hierarchical structure of these models is fully specified by a partition tree of categories. Using the genericity of the recently introduced
To represent the complex structure of intensive longitudinal data of multiple individuals, we propose a hierarchical Bayesian Dynamic Model (BDM). This BDM is a generalized linear hierarchical model where the individual parameters do not necessarily follow a normal distribution. The model parameters can be estimated on the basis of relatively small sample sizes and in the presence of missing time points. We present the BDM and discuss model identification, convergence and selection. The use of the BDM is illustrated using data from a randomized clinical trial to study the differential effects of three treatments for panic disorder. The data involve the number of panic attacks experienced weekly (73 individuals, 10–52 time points) during treatment. Presuming that the counts are Poisson distributed, the BDM considered involves a linear trend model with an exponential link function. The final model included a moving average parameter and an external variable (duration of symptoms pre-treatment). Our results show that cognitive behavioural therapy is less effective in reducing panic attacks than selective serotonin re-uptake inhibitors or a combination of both. Post hoc analyses revealed that males show a slightly higher number of panic attacks at the onset of treatment than females.
In rating surveys, people are requested to express preferences on several aspects related to a topic by selecting a category in an ordered scale. For such data, we propose a model defined by a mixture of a uniform distribution and a Sarmanov distribution with CUB (combination of uniform and shifted binomial) marginal distributions (
This study examines the efficacy of tort reforms instituted throughout the country during the last decade, improving upon existing semiparametric density ratio estimation (DRE) methodologies in the process. DRE is a well-known semiparametric modelling technique that has been used for well over two decades. Although the approach has been demonstrated to be extremely useful in statistical modelling, it has suffered from one main limitation—the methodology has thus far not been capable of modelling individual-level heterogeneity. We address this issue by presenting a novel adaptation of DRE to model individual-level heterogeneity. We do so by marginalizing the associated empirical likelihood function involving density ratios to provide an overall distribution of the entire population despite having extremely limited initial information about each individual in the dataset. We apply this approach to medical malpractice loss data from the previous decade to quantify the probability of changes in tort losses. Our results demonstrate the success of a number of recently implemented malpractice reforms. Comparisons to existing DRE methods, as well as standard regression methods, illustrate the efficacy of our approach.
Representing the conditional mean in Poisson regression directly as a sum of smooth components can provide a realistic model of the data generating process. Here, we present an approach that allows such an additive decomposition of the expected values of counts. The model can be formulated as a penalized composite link model and can, therefore, be estimated by a modified iteratively weighted least-squares algorithm. Further shape constraints on the smooth additive components can be enforced by additional penalties, and the model is extended to two dimensions. We present two applications that motivate the model and demonstrate the versatility of the approach.
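When the composition matrix of the penalized composite link model is the identity, the approach reduces to penalized Poisson smoothing fitted by the modified iteratively weighted least-squares algorithm mentioned above; a minimal sketch of that special case (simulated counts, arbitrary penalty weight) follows.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(50)
y = rng.poisson(np.exp(1.0 + np.sin(x / 8.0)))     # simulated counts

n = len(y)
D = np.diff(np.eye(n), n=2, axis=0)                # second-order differences
P = 2.0 * (D.T @ D)                                # smoothness penalty (lambda = 2, arbitrary)
eta = np.log(y + 1.0)                              # starting linear predictor
for _ in range(50):                                # penalized IWLS iterations
    mu = np.exp(eta)
    W = np.diag(mu)                                # Poisson working weights
    z = eta + (y - mu) / mu                        # working response
    eta = np.linalg.solve(W + P, W @ z)            # penalized weighted least squares
fitted = np.exp(eta)                               # smooth expected counts
```

Shape constraints or a non-identity composition matrix would enter this scheme as additional penalty terms and a redefinition of the working design, respectively.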
In the last two decades, regularization techniques, in particular penalty-based methods, have become very popular in statistical modelling. Driven by technological developments, most approaches have been designed for high-dimensional problems with metric variables, whereas categorical data has largely been neglected. In recent years, however, it has become clear that regularization is also very promising when modelling categorical data. A specific trait of categorical data is that many parameters are typically needed to model the underlying structure. This results in complex estimation problems that call for structured penalties which are tailored to the categorical nature of the data. This article gives a systematic overview of penalty-based methods for categorical data developed so far and highlights some issues where further research is needed. We deal with categorical predictors as well as models for categorical response variables. The primary interest of this article is to give insight into basic properties of and differences between methods that are important with respect to statistical modelling in practice, without going into technical details or extensive discussion of asymptotic properties.
This is a discussion on the article ‘Regularized Regression for Categorical Data’ by Tutz and Gertheiss.
Oracle inequalities provide probability loss bounds for the lasso estimator at a deterministic choice of the regularization parameter and are commonly cited as theoretical justification for the lasso and its ability to handle high-dimensional settings. Unfortunately, in practice, the regularization parameter is not selected to be a deterministic quantity, but is instead chosen using a random, data-dependent procedure, often making these inequalities misleading in their implications. We discuss general results and demonstrate empirically for data using categorical predictors that the amount of deterioration in performance of the lasso as the number of unnecessary predictors increases can be far worse than the oracle inequalities suggest, but imposing structure on the form of the estimates can reduce this deterioration substantially.
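At the core of the lasso estimator discussed above is the soft-thresholding operator; a hedged, plain coordinate-descent sketch (no intercept, standardized columns assumed, names our own) makes the role of the regularization parameter explicit.

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of lam * |.|: shrink towards zero, clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]      # partial residual for column j
            num = X[:, j] @ r / n
            beta[j] = soft_threshold(num, lam) / (X[:, j] @ X[:, j] / n)
    return beta
```

For any `lam` above `max_j |x_j' y| / n` the solution is exactly zero; the data-dependent selection the abstract criticizes corresponds to wrapping a call like this in a cross-validated loop over a `lam` grid, which is what breaks the deterministic premise of the oracle inequalities.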
This is a discussion of the article ‘Regularized Regression for Categorical Data’ by Tutz and Gertheiss. As part of the discussion, I raise some questions that may suggest future research work.
<abstract>
An index for characterizing the separation of two distributions is introduced. It is applied to assessing whether mixture components are clusters. A related property of being a satellite and a partial ordering of the components are defined. A sequence of clustering structures is defined for a finite mixture with a continuum of thresholds that qualify a cluster. The approach is suitable for outcomes with arbitrary univariate or multivariate distributions and their mixtures. The properties of the index are explored through simulations and examples.
</abstract>
<abstract>
In the context of mixture models with random covariates, this article presents the polynomial Gaussian cluster-weighted model (CWM). It extends the linear Gaussian CWM, for bivariate data, in a twofold way. First, it allows for possible nonlinear dependencies in the mixture components by considering a polynomial regression. Second, it is not restricted to model-based clustering, being instead framed within the more general model-based classification framework. Maximum likelihood parameter estimates are derived using the EM algorithm, and model selection is carried out using the Bayesian information criterion (BIC) and the integrated completed likelihood (ICL). The article also investigates the conditions under which the posterior probabilities of component membership from a polynomial Gaussian CWM coincide with those of other well-established mixture models related to it. When applied to artificial and real data, the polynomial Gaussian CWM is shown to outperform the mixture of polynomial Gaussian regressions, its natural competitor in the class of mixture models with fixed covariates.
</abstract>
<abstract>
We present a simple and effective iterative procedure to estimate segmented mixed models in a likelihood-based framework. Random effects and covariates are allowed for each model parameter, including the changepoint. The method is practical and avoids the computational burdens related to estimation of nonlinear mixed effects models. A conventional linear mixed model with proper covariates that account for the changepoints is the key to our estimating algorithm. We illustrate the method via simulations and using data from a randomized clinical trial focused on change in depressive symptoms over time which characteristically show two separate phases of change.
</abstract>
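The key device of the preceding abstract, a linear model with working covariates that account for the changepoint, iterated to convergence, can be sketched in the fixed-effects case (a Muggeo-type update; all data and starting values below are illustrative, and the random effects of the actual model are omitted).

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 200)
psi_true = 4.0
# Segmented line: slope changes from 0.5 to 2.0 at the changepoint psi_true
y = 1.0 + 0.5 * x + 1.5 * np.maximum(x - psi_true, 0.0) + 0.2 * rng.normal(size=x.size)

psi = 2.0                                   # starting changepoint guess
for _ in range(30):
    U = np.maximum(x - psi, 0.0)            # slope-change basis at current psi
    V = -(x > psi).astype(float)            # working covariate: d U / d psi
    Z = np.column_stack([np.ones_like(x), x, U, V])
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    psi = np.clip(psi + b[3] / b[2],        # update: psi + gamma_hat / beta_hat
                  x.min() + 0.5, x.max() - 0.5)
```

At convergence the coefficient of `V` vanishes and `psi` sits at the estimated changepoint; the mixed-model version replaces the least-squares fit with a linear mixed model so that the changepoint itself may carry a random effect.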
This article deals with the analysis of sensitivity to non-ignorability of the dropout process in joint models (JMs). We investigate the behaviour of the maximum likelihood estimates for the longitudinal process in a neighbourhood of ignorability through the Index of Local Sensitivity to Non-Ignorability (ISNI). Some concerns may arise because the ISNI is an absolute measure of the change in parameter estimates induced by departures from the missing at random (MAR) assumption; for this reason, we introduce a relative index based on the ratio between the ISNI and a measure of its variability under the MAR assumption, highlighting the interpretation and potential drawbacks of this approach. The local sensitivity of the JM and the performance of the relative index are discussed in a simulation study, varying the number of repeated measurements per individual and the random effect covariance structure. The approach is also applied to a benchmark dataset on Primary Biliary Cirrhosis (PBC).