London  UK  Sep 2013  Stata 
19th London Stata Users Group Meeting Proceedings 

Date: Location: 
1213 September 2013 Cass Business School, City University London, UK (View map) 
Overview  Proceedings  Previous Meetings 
Overview
The 19th London Stata Users Group Meeting took place on 1213 September 2013 at Cass Business School, London, UK.
The Stata Users Group Meeting is a twoday international conference where the use of Stata is discussed across a wideranging breadth of fields and environments. Established in 1995, the UK meeting is the longestrunning series of Stata Users Group Meetings. The meeting is open to everyone. In past years, participants have travelled from around the world to attend the event. Representatives from StataCorp are also in attendance.
Back to top 
Proceedings
View the abstracts and download the presentations for the 19th London Stata Users Group Meeting below (you can also click here to download a PDF version of the abstracts).
Creating factor variables in resultssets and other datasets
Factor variables are defined as categorical variables with integer values, which may represent values of some other kind, specified by a value label. We frequently want to generate such variables in Stata datasets, especially resultssets, which are output Stata datasets produced by Stata programs such as the official Stata statsby command and the SSC packages parmest and xcontract. This is because categorical string variables can only be plotted after conversion to numeric variables and because these numeric variables are also frequently used in defining a key of variables, which identify observations in the resultsset uniquely in a sensible sort order. The sencode package is downloadable, and frequently downloaded, from SSC and is a “super” version of encode, which inputs a string variable and outputs a numeric factor variable. Its added features include a replace option allowing the output numeric variable to replace the input string variable, a gsort() option allowing the numeric values to be ordered in ways other than the alphabetical order of the input string values, and a manyto1 option allowing multiple output numeric values to map to the same input string value. The sencode package is well established and has existed since 2001. However, some tips will be given on ways of using it that are not immediately obvious but which the author has found very useful over the years when massproducing resultssets. These applications use sencode with other commands, such as the official Stata command split and the SSC packages factmerg, factext, and fvregen.
Additional materials:
uk13_newson.pdf
uk13_newson_examples1.do
treatrew: A userwritten Stata routine for estimating average treatment effects by reweighting on propensity score
Reweighting is a popular statistical technique to deal with inference in presence of a nonrandom sample. In the literature, various reweighting estimators have been proposed. This paper presents the authorwritten Stata routine treatrew, which implements the reweighting on the propensityscore estimator as proposed by Rosenbaum and Rubin (1983) in their seminal article, where they show that parameters’ standard errors can be obtained analytically (Wooldridge 2010, 920–930) or via bootstrapping. Because an implementation in Stata of this estimator with analytic standard errors was still missing, this paper, and the adofile and helpfile accompanying it, aims at filling this gap by providing an easytouse implementation of the reweighting on the propensityscore method, as a valuable tool for estimating treatmenteffects under “selectiononobservables” (or “overt bias”). Finally, a Monte Carlo experiment to check the reliability of treatrew and to compare its results with other treatment effect estimators will also be provided.
Additional materials:
uk13_cerulli.pdf
Multiple imputation of covariates in the presence of interactions and nonlinearities
Multiple imputation (MI) is a popular approach to handling missing data, and an extensive range of MI commands is now available in official Stata. A common problem is that of missing values in covariates of regression models. When the substantive model for the outcome contains nonlinear covariate effects or interactions, correctly specifying an imputation model for covariates becomes problematic. We present simulation results illustrating the biases that can occur when standard imputation models are used to impute covariates in linear regression models with a quadratic effect or interaction effect. We then describe a modification of the full conditional specification (FCS) or chained equations approach to MI, which ensures that covariates are imputed from a model which is compatible with a userspecified substantive model. We present the smcfcs Stata command, which implements substantive model compatible FCS and illustrate its application to a dataset.
Additional materials:
uk13_bartlett.pdf
A Monte Carlo analysis of multilevel binary logit model estimator performance
Social scientists are increasingly fitting multilevel models to datasets in which a large number of individuals (N ~ several thousands) are nested within each of a small number of countries (C ~ 25). The researchers are particularly interested in “country effects”, as summarized by either the coefficients on countrylevel predictors (or crosslevel interactions) or the variance of the countrylevel random effects. Although questions have been raised about the potentially poor performance of estimators of these “country effects” when C is “small”, this issue appears not to be widely appreciated by many social scientist researchers. Using Monte Carlo analysis, I examine the performance of two estimators of a binarydependent twolevel model using a design in which C = 5(5)50 100 and N = 1000 for each country. The results point to i) the superior performance of adaptive quadrature estimators compared with PQL2 estimators, and ii) poor coverage of estimates of “country effects” in models in which C ~ 25, regardless of estimator. The analysis makes extensive use of xtmelogit and simulate and userwritten commands such as runmlwin, parmby, and eclplot. Issues associated with having extremely long runtimes are also discussed.
Additional materials:
uk13_jenkins.pdf
Multilevel mixedeffects parametric survival analysis
Multilevel mixedeffects survival models are used in the analysis of clustered survival data, such as repeated events, multicenter clinical trials, or individual patient data metaanalyses, to investigate heterogeneity in baseline risk and treatment effects. I present the stmixed command for the parametric analysis of clustered survival data with two levels. Mixedeffects parametric survival models available include the exponential, Weibull and Gompertz proportionalhazards models, the Royston–Parmar flexibleparametric model, and the log–logistic, log–normal, and generalized gammaaccelerated failuretime models. Estimation is conducted using maximum likelihood, with both adaptive and nonadaptive Gauss–Hermite quadrature available. I will illustrate the command through simulation and application to clinical datasets.
Additional materials:
uk13_crowther.pdf
Multiple imputation of missing data in longitudinal health records
Electronic health records are increasingly used for epidemiological and health service research. However, missing data are often an issue when dealing with electronic records. Up to now, various approaches have been used to overcome these issues, including complete case analysis, last observation carried forward, and multiple imputation. In this presentation, we will first highlight the issues of missing data in longitudinal records and provide examples of the limitations of standard methods of multiple imputation. We will then demonstrate the new twofold userwritten Stata command that implements the twofold fully conditional specification (FCS) multipleimputation algorithm in Stata (Nevalainen, Kenward, and Virtanen, 2009. Stat Med. 28: 3657–3669.)
In the application of the twofold FCS algorithm, we divide time into equal size time blocks. The algorithm then imputes missing values in the longitudinal data, imputing one time block, and then the next. The defining characteristic is that when one imputes missing values at a particular time block, only measurements at that time block and adjacent time blocks are used. This obviates some of the principal difficulties that are typically encountered when attempting to apply a standard MI approach to imputing such longitudinal data.
We illustrate how the twofold FCS MI algorithm works in practice and maximizes the use of data available, even in situations where measurements are only made on a relatively small proportion of individuals in each time block. We discuss some of the strengths and limitations of the twofold FCS MI algorithm and contrast it with existing approaches to imputing longitudinal data. Lastly, we present results demonstrating the potential for gains in efficiency through use of the twofold approach compared with a more conventional “baseline MI” approach.
Additional materials:
uk13_welch.pptx
A multiple imputation and coarse data approach to residually confounded regression models
Residual confounding is a major problem in analysis of observational data, occurring when a confounding variable is measured coarsely (censored, heaped, missing, etc.) and hence cannot be fully adjusted to obtain a causal estimate by usual means such as multiple regression. The analysis of coarse data has been investigated by Heitjan and Rubin, but methods for coarse covariates are lacking.
A fully conditionalspecification multipleimputation approach is possible if we are able to model i) the confounding variable conditional on other information in the dataset and ii) the coarsening mechanism. This provides a very flexible framework for removing residual confounding under our assumptions, including sensitivity analysis. An added complexity over missing data is that it may not be known which observations are coarsened.
Programming this method is presented in Stata for various combinations of i) and ii) above, using the ml and mi functions. In the simplest case of a normally distributed confounder subject to known intervalcensoring, intreg and mi can be applied. The method is illustrated with simulated data and the true causal effect is recovered in each instance.
Additional materials:
uk13_grant.pdf
The stiqsp command: A nonparametric approach for the simultaneous analysis of qualityoflife and survival data using the integrated quality survival product
In medical research, particularly in the field of cancer, it is often important to evaluate the impact of treatments and other factors on a composite outcome based on survival and qualityoflife data, such as a Quality Adjusted Life Year (QALY). We present a Stata program, stiqsp, which determines the mean QALY using the Integrated Quality Survival Product. In this nonparametric approach, the survival function is estimated using the Kaplan–Meier method and the qualityoflife function is derived from the mean qualityoflife score at the unique death times. Confidence intervals for the QALY score are determined using the bootstrap method. We illustrate the features of the command with a large dataset of patients with lung cancer.
Additional materials:
uk13_gaunt.pdf
Estimating spatial panel models using unbalanced panels
Econometricians have begun to devote more attention to spatial interactions when carrying out applied econometric studies. In part, this is motivated by an explicit focus on spatial interactions in policy formulation or market behavior, but it may also reflect concern about the role of omitted variables that are or may be spatially correlated.
The Stata userwritten procedure xsmle has been designed to estimate a wide range of spatial panel models, including spatial autocorrelation, spatial Durbin, and spatial error models using maximum likelihood methods. It relies upon the availability of balanced panel data with no missing observations. This requirement is stringent, but it arises from the fact that in principle, the values of the dependent variable for any panel unit may depend upon the values of the dependent and independent variables for all the other panel units. Thus even a single missing data point may require that all data for a time period, panel unit, or variable be discarded.
The presence of missing data is an endemic problem for many types of applied work, often because of the creation or disappearance of panel units. At the macro level, the number and composition of countries in Europe or local government units in the United Kingdom has changed substantially over the last three decades. In longitudinal household surveys, new households are created and old ones disappear all the time. Restricting the analysis to a subset of panel units that have remained stable over time is a form of sample selection whose consequences are uncertain and that may have statistical implications that merit additional investigation.
The simplest mechanisms by which missing data may arise underpin the missingatrandom (MAR) assumption. When this is appropriate, it is possible to use two approaches to estimation with missing data. The first is either simple or, preferably, multiple imputation, which involves the replacement of missing data by stochastic imputed values. The Stata procedure mi can be combined with xsmle to implement a variety of estimates that rely upon multiple imputation. While the combination of procedures is relatively simple to estimate, practical experience suggests that the results can be quite sensitive to the specification that is adopted for the imputation phase of the analysis. Hence, this is not a onesizefitsall method of dealing with unbalanced panels, because the analyst must give serious consideration to the way in which imputed values are generated.
The second approach has been developed by Pfaffermayr. It relies upon the spatial interactions in the model, which means that the influence of the missing observations can be inferred from the values taken by nonmissing observations. In effect, the missing observations are treated as latent variables whose distribution can be derived from the values of the nonmissing data. This leads to a likelihood function that can be partitioned between missing and nonmissing data and thus used to estimate the coefficients of the full model. The merit of the approach is that it takes explicit account of the spatial structure of the model. However, the procedure becomes computationally demanding if the proportion of missing observations is too large and, as one would expect, the information provided by the spatial interactions is not sufficient to generate welldefined estimates of the structural coefficients. The missingatrandom assumption is crucial for both of these approaches, but it is not reasonable to rely upon it when dealing with the birth or death of distinct panel units.
A third approach, which is based on methods used in the literature on statistical signal processing, relies upon reducing the spatial interactions to immediate neighbors. Intuitively, the basic unit for the analysis becomes a block consisting of a central unit (the dependent variable) and its neighbors (the spatial interactions). Because spatial interactions are restricted to withinblock effects, the population of blocks can vary over time and standard nonspatial panel methods can be applied.
The presentation will describe and compare the three approaches to estimating spatial panel models as implemented in Stata as extensions to xsmle. It will be illustrated by analyses of i) state data on electricity consumption in the U.S. and ii) gridded historical data on temperature and precipitation to identify the effects of El Niño (ENSO) and other major weather oscillations.
Additional materials:
uk13_hughes.pdf
Repeated halfsample bootstrap resampling
This presentation illustrates Stata’s implementation of the repeated halfsample bootstrap proposed by Saigo et al. (2001, Survey Methodology). This resampling scheme is easy to implement and is appropriate for complex survey designs, even with small stratum sizes. The userwritten command rhsbsample mimicks the official bsample command and can be used for bootstrap inference in a wide range of settings.
Additional materials:
uk13_vankerm.pdf
Sample size by simulation for clinical trials with survival outcomes: The simsam package in action
The simsam package, released earlier this year, allows versatile samplesize calculation (calculating sample size required to achieve given statistical power) using simulation for any method of analysis under any statistical model that can be programmed in Stata. Simulation is particularly helpful in situations where formulae for sample size are approximate or unavailable. The usefulness of the simsam package will depend in part, however, on the quality and variety of simsam applications developed by Stata users.
I will discuss simsam applications for clinical trials with timetoevent or survival outcomes. Here sample size formulas are the subject of ongoing research (available methods for clusterrandomized trials with variable cluster size, for example, are only approximate). Using such examples, I will illustrate general simsam programming considerations such as dealing with analyses that fail, and the advantages of modular programming. Simulation forces us to think carefully about trial definitions—for example, whether recruitment ends after a fixed time or a fixed number of recruits—and to relate the samplesize calculation to a detailed or contingent analysis plan. Such careful attention to detail may be lost in a formulaic approach to samplesize calculation. Simulation also allows us to check for bias in a test as well as power. But simulation is not without problems: while a rare (say, one in a million) failure of an analysis in Stata may not worry the pragmatic statistician, a simsam application must anticipate this failure or risk stumbling on it in the course of many simulations.
Additional materials:
uk13_hooper.pdf
Strategy and tactics for graphic multiples in Stata
Many, perhaps most, useful graphs compare two or more sets of values. Examples are two or more groups or variables (as distributions, time series, etc.) or observed and fitted values for one or more model fits. Often there can be a fine line in such comparisons between richly detailed graphics and busy, unintelligible graphics that lead nowhere. In this presentation, I survey strategy and tactics for developing good graphic multiples in Stata.
Details include the use of over() and by() options and graph combine; the relative merits of super(im)posing and juxtaposing; backdrops of context; killing the key or losing the legend if you can; transforming scales for easier comparison; annotations and selfexplanatory markers; linear reference patterns; plotting both data and summaries; plotting different versions or reductions of the data.
Datasets visited or revisited include James Short’s collation of observations from the transit of Venus; John Snow’s data on mortality in relation to water supply in London; Florence Nightingale’s data on deaths in the Crimea; deaths from the Titanic sinking; admissions to Berkeley; hostility in response to insult; and advances and retreats of East Antarctic ice sheet glaciers.
Specific programs discussed include graph dot; graph bar; sparkline (SSC); qplot (SJ) and its relatives; devnplot (SSC); stripplot (SSC); and tabplot (SJ/SSC).
Additional materials:
uk13_cox.ppt
uk13_cox_materials.zip
A general approach to testing for autocorrelation
Testing for the presence of autocorrelation in a time series, either in the univariate setting or with the residuals from the estimation of some model, is one of the most common tasks researchers face in the timeseries setting. The standard Q test statistic is that introduced by Box and Pierce (1970), subsequently refined by Ljung and Box (1978). The original LBP test is applicable to univariate time series and to testing for residual autocorrelation under the assumption of strict exogeneity. Breusch (1978) and Godfrey (1978) in effect extended the LBP approach to testing for autocorrelations in residuals in models with weakly exogenous regressors. Both the LBP test and the Breusch–Godfrey test are available in Stata, the former for univariate time series via the wntestq command and the latter for postestimation testing following OLS via the estat bgodfrey command.
All the above tests have important limitations: i) the tests are for autocorrelation up to order p, where under the null hypothesis the series or residuals are i.i.d.; ii) when applied to residuals from singleequation estimation, the regressors must all be at least weakly exogenous; iii) the tests are for singleequation models and do not cover panel data.
We use the results of Cumby and Huizinga (1992) to extend the implementation of the Q test statistic of LBPBG to cover a much wider range of hypotheses and settings: i) tests for the presence of autocorrelation of order p through q, where under the null hypothesis, there may be autocorrelation of order p1 or less; ii) tests following estimation in which regressors are endogenous and estimation is by IV or GMM methods; iii) tests following estimation using panel data. For iii), we show that the Cumby–Huizinga test, developed for the largeT setting, is, when applied to the largeN paneldata setting and limited to testing for secondorder serial correlation, formally identical to the test presented by Arellano and Bond (1991) and available in Stata via Roodman’s abar command.
Additional materials:
uk13_baumschaffer.pdf
Semiparametric regression in Stata
Semiparametric regression deals with the introduction of some very general nonlinear functional forms in regression analyses. This class of regression models is generally used to fit a parametric model in which the functional form of a subset of the explanatory variables is not known and/or in which the distribution of the error term cannot be assumed of being of a specific type beforehand. To fix ideas, consider the partial linear model y = zb + f(x) + e, in which the shape of the potentially nonlinear function of predictor x is of particular interest. Two approaches to modeling f(x) are to use splines or fractional polynomials. This talk reviews other more general approaches, and the commands available in Stata to fit such models.
The main topic of the talk will be partial linear regression models, with some brief discussion also of socalled single index and generalized additive models. Though several semiparametric regression methods have been proposed and developed in the literature, these are probably the most popular ones.
The general idea of partial linear regression models is that a dependent variable is regressed on i) a set of explanatory variables entering the model linearly and ii) a set of variables entering the model nonlinearly but without assuming any specific functional form. Several estimators have been proposed in the literature and are available in Stata. For example, the semipar command makes available what is called the double residuals estimator introduced by Robinson (1988), which is consistent and efficient. Similarly, the plreg command fits an alternative differencebased estimator proposed by Yatchew (1998) that has similar statistical properties to Robinson’s estimator. These estimators will be briefly compared to identify some drawbacks and pitfalls of both methods.
A natural concern of researchers is how these estimators could be modified to deal with heteroskedasticity, serial correlation, and endogeneity in crosssectional data or how they could be adapted in the context of panel data to control for unobserved heterogeneity. As a consequence, a substantial part of the talk will be devoted to explaining i) how the plreg and semipar commands can be used to tackle these very common violations of the Gauss–Markov assumptions in crosssectional data and ii) how the userwritten xtsemipar command makes a semiparametric regression easy to fit in the context of panel data.
Because it is sometimes possible to move toward pure parametric models, a test proposed by Hardle and Mammen (1993) and built to check whether the nonparametric fit can be satisfactorily approximated by a parametric polynomial adjustment of order p will be described.
Additional materials:
uk13_verardi.pdf
Power of the power command
Stata 13’s new power command performs power and samplesize analysis. The power command expands the statistical methods that were previously available in Stata’s sampsi command. I will demonstrate the power command and its additional features, including the support of multiple study scenarios and automatic and customizable tables and graphs. I will also present new functionality allowing users to add their own methods to the power command.
Additional materials:
uk13_marchenko.pdf
Adaptive dosefinding designs to identify multiple doses that achieve multiple response targets
Within drug development, it is crucial to find the right dose that is going to be safe and efficacious; this is often done within early phase II clinical trials. The aim of the dosefinding trial is to understand the relationship between the dose of drug and the potential effect of the drug. Increasingly, adaptive designs are being used in this area because they allow greater flexibility for dose exploration in comparison with traditional fixeddose designs.
An adaptive dosefinding design usually assumes a true nonlinear dose–response model and select doses that either maximize the determinant of information matrix of the design (Doptimality) or minimize the variance of the predicted dose that gives a targeted response. Our design extends the predicted dose methodology, in a limited number of patients (40), to finding two targeted doses: a minimally effective dose and a therapeutic dose. In our trial, doses are given intravenously, so theoretically, doses are continuous and the response is assumed to be a normally distributed continuous outcome.
Our design has an initial learning phase where pairs of patients are assigned to five preassigned doses. The next phase is fully sequential with an interim analysis after each patient to determine the choice of dose based on the optimality criterion to minimize the determinant of the covariance of the estimated target doses. The dose–choice algorithm assumes that a specific parametric dose–response model is the true relationship, and so a variety of models are considered at the interim, and human judgment involved in the overall decision.
I will introduce a Mata command that uses the optimize function to find the estimated parameters of the model and to subsequently find the optimal design. Simulated results show that assuming a model with a small number of parameters (=3) leads to a choice of doses that are not near to the target doses and overrely on interpolation under the modeling assumptions. Fitting models with greater flexibility with more parameters (=4) results in a choice of doses near to the two target doses. Overall, the design is efficient and seamlessly combines the initial learning and subsequent confirmatory stages.
Additional materials:
uk13_mander.pdf
usecspro: Importing CSPro hierarchical datasets to Stata
Hierarchical datasets are commonly a product of the popular CSPro system developed by the U.S. Census Bureau. CSPro became widely popular and a de facto standard for data collection in many countries; some agencies supply data exclusively in CSPro format. While CSPro can export the data into Stata format on its own, the procedure compromises on some features and requires the user to run CSPro, which operates in MS Windows only. The new Stata module usecspro allows easy import of the hierarchical datasets into Stata by automatically parsing the CSPro dictionary files. The conversion is implemented in Mata and allows importing data from any level and any record of the hierarchical CSPro dataset. It also preserves the variable and value labels and takes care of the missing values and other common concerns during data conversion.
Additional materials:
uk13_radyakin.pdf
From Stata to aML
This presentation explains how to exploit Stata to run multilevel multiprocess regressions with aML (software downloadable for free from appliedml.com). I show how a single dofile can prepare the dataset, write the control files, input the starting values, and run the regressions without the need to manually open the aML’s Command Prompt window. In this sense, Stata helps to avoid the difficulties of running complicated regressions with aML by automatically generating the necessary files, which avoids typos and easily allows changes in model specification. The paper contains an example of how well Stata and aML work together.
Additional materials:
uk13_ayllon.pdf
Twostage individual participant data metaanalysis and forest plots
Stata has a wide range of tools for performing metaanalysis, but presently not individual participant data (IPD) metaanalysis, in which the analysis units are withinstudy observations (for example, patients) rather than aggregate study results.
I present ipdmetan, a command that facilitates twostage IPD metaanalysis by fitting a specified model to the data of each study in turn and storing the results in a matrix. Features include subgroups, inclusion of aggregate (for example, published) data, iterative estimates and confidence limits for the tausquared measure of heterogeneity, and the analysis of treatmentcovariate interactions. This last is a great benefit of IPD collection and is a subject on which my colleagues and I have published previously (Fisher et al., 2011, Journal of Clinical Epidemiology 64: 949–967). I shall discuss how ipdmetan facilitates our recommended approach and its strengths and weaknesses, in particular onestage versus twostage modeling and within and betweentrial information.
In addition, the graphics subroutine written for the metan package (Harris et al., 2008, Stata Journal 8: 3–28) has been greatly expanded to enable flexible, generalized forest plots for a variety of settings. I shall demonstrate some of the possibilities and encourage feedback on how this may be developed further.
Examples will be given using realworld IPD metaanalyses of survival data in cancer, although the programs are applicable generally.
Additional materials:
uk13_fisher.pptx
A suite of Stata programs for network metaanalysis
Network metaanalysis involves synthesising the scientific literature comparing several treatments. Typically, twoarm and threearm randomized trials are synthesized, and the aim is to compare treatments that have not been directly compared and often to rank the treatments. A difficulty is that the network may be inconsistent, and ways to assess this are required.
In the past, network metaanalysis models have been fitted using Bayesian methods, typically in WinBUGS. I have recently shown how they may be expressed as multivariate metaanalysis models and hence fitted using mvmeta. However, various challenges remain, including getting the dataset in the correct format, parameterizing the inconsistency model, and making good graphical displays of complex data and results. I will show how a new suite of Stata programs, network, meets these challenges, and I will illustrate its use with examples.
Additional materials:
uk13_white.pptx
Path diagrams as a notational formalism—noodling around with pictures
Math, when written carefully, can represent a model perfectly. Stata syntax, when written correctly, can represent a model perfectly. Everyone is supposed to understand math, but I have encountered those who do not. I personally think everyone should understand Stata syntax, but again, I have found a few who do not. Path diagrams provide an alternative formalism that is easily accessible to a wide audience. With a few conventions, we can use path diagrams to represent a variety of models and perhaps help explain those models. These diagrams are easy enough to create in Stata, so we will look at some models—both simple and difficult—to see what we think of this emerging formalism.
Additional materials:
uk13_wiggins.pdf
Estimating and modeling cumulative incidence functions using timedependent weights
Competing risks occur in survival analysis when a subject is at risk of more than one type of event. A classic example is when there is consideration of different causes of death. Interest may lie in the causespecific hazard rates, which can be estimated using standard survival techniques by censoring competing events. An alternative measure is the cumulative incidence function (CIF) which gives an estimate of absolute or crude risk of death accounting for the possibility that individuals may die of other causes. Geskus (2011Biometrics, 67: 39–49) has recently proposed an alternative way for the estimation and modeling of the CIF that can use weighted versions of standard estimators.
I will describe a Stata command, stcrprep, that restructures survival data and calculates weights based on the censoring distribution. The command is based on the R command crprep, but I will describe a number of extensions that enable the CIF to be modeled directly using parametric models on large datasets. After restructuring the data and incorporating the weights, one can use sts graph to plot the CIF and stcox can be used to fit a Fine and Gray model for the CIF. An advantage of fitting models in this way is that it is possible to use a number of the standard features of the Cox model, for example, using Schoenfeld residuals to visualize and test the proportional subhazards assumption.
I will also describe some additional options that are useful for fitting parametric models and useful for large datasets. In particular, I will describe how the flexible parametric survival models estimated with stpm2 can be used to directly model the cumulativeincidence function. An important advantage is that all the predictions built into stpm2 can be used to directly predict the CIF, subdistribution hazards, etc.
Additional materials:
uk13_lambert.pdf
Mixed logit modeling in Stata—an overview
The “workhorse” model for analysing discrete choice data, the conditional logit model, can be implemented in Stata using the official clogit and asclogit commands. While widely used, this model has several wellknown limitations that have led researchers in various disciplines to consider more flexible alternatives. The mixed logit model extends the standard conditional logit model by allowing one or more of the parameters in the model to be randomly distributed. When one models the choices of individuals (as is common in several disciplines, including economics, marketing, and transport), this allows for preference of heterogeneity among respondents. Other advantages of the mixed logit model include the ability to allow for correlations across observations in cases where an individual made more than one choice, and relaxing the restrictive independence from the irrelevant alternatives property of the conditional logit model.
There are a range of commands that can be used to estimate mixed logit models in Stata. With the exception of xtmelogit, the official Stata command for estimating binary mixed logit models, all of them are userwritten. The module that is probably best known is gllamm, but while very flexible, it can be slow when the model includes several random parameters. This talk will focus on alternative commands for estimating logit models, with focus on the mixlogit module. We will also look at alternatives and extensions to mixlogit, including the recent lclogit, bayesmlogit, and gmnl commands. The talk will review the theory behind the methods implemented by these commands and present examples of their use.
Additional materials:
uk13_hole.pdf
Back to top 
Proceedings of Past Meetings
Use the links below for proceedings and presentations from previous meetings:
 View all past User Group Meetings
 18th London Stata Users Group Meeting (September 2012)
 17th London Stata Users Group Meeting (September 2011)
Back to top 
Return to: Stata User Group Meetings  Stata  Home 