Using STATA’s firthlogit Function for Rare Events

Maximum likelihood estimates will produce bias in the presence of rare events. Firthlogit in STATA uses a penalized maximum likelihood estimator to overcome this bias.

Rare Events with Logistic Regression

With the presence of rare events, e.g., 20 events or successes out of 1000, maximum likelihood estimates will be biased. The reason for the bias stems not necessarily from the rarity but rather scarcity of the event. Rare events in a large sample may not necessarily be a problem, though in a small sample, the low frequency of events may cause a problem. This is because maximum likelihood suffers from small-sample bias. King and Zeng proposed the use of a penalized likelihood estimator to overcome sample-sample bias in maximum likelihood estimation. Here I will simulate a large, binomial outcome data set with rare events and a small dichotomous-outcome data set with rare events and compare it to regular logistic regression. Analyses for regular logistic regression are performed using R’s glm() function, and firthlogit analyses performed in STATA.

Simulating Large and Small Dichotomous Outcome Data Sets in R

In the following code I will create two artificial data sets consisting of rare, dichotomous outcomes. The first will be a large data set consisting of 1000 observations, and second a small data set consisting of 50 observations.

set.seed(324)
x1 = rnorm(1000)          
x2 = rnorm(1000)
z = -6 + 2*x1 + 3*x2        
pr = 1/(1+exp(-z))         # we create probabilities for 'rbinom()' using inverse logit
y = rbinom(1000,1,pr)      
df_Large = data.frame(y=y,x1=x1,x2=x2)
glm( y~x1+x2,data=df_Large,family="binomial") #Observe how well glm() estimated intercept and betas
## 
## Call:  glm(formula = y ~ x1 + x2, family = "binomial", data = df_Large)
## 
## Coefficients:
## (Intercept)           x1           x2  
##      -7.002        2.129        3.480  
## 
## Degrees of Freedom: 999 Total (i.e. Null);  997 Residual
## Null Deviance:       470.3 
## Residual Deviance: 195.9     AIC: 201.9

The estimates in the large data set are fairly accurate. The intercept is estimated at -7, and the true intercept is -6; x1 and x2 are 2 and 3, respectively, which are pretty well recovered with glm() estimation. Now I will create a small data set with 50 observations. Notice how the glm() estimates are severely biased.

set.seed(324)
x1 = rnorm(50)          
x2 = rnorm(50)
z = -6 + 2*x1 + 3*x2        
pr = 1/(1+exp(-z))         # we create probabilities for 'rbinom()' using inverse logit
y = rbinom(50,1,pr)      
df_small = data.frame(y=y,x1=x1,x2=x2)
glm( y~x1+x2,data=df_small,family="binomial") #Observe how well glm() estimated intercept and betas with small sample
## 
## Call:  glm(formula = y ~ x1 + x2, family = "binomial", data = df_small)
## 
## Coefficients:
## (Intercept)           x1           x2  
##     -10.261        4.380        6.475  
## 
## Degrees of Freedom: 49 Total (i.e. Null);  47 Residual
## Null Deviance:       36.69 
## Residual Deviance: 8.843     AIC: 14.84

Here we see severe bias of the estimates. The intercept was estimated as -10, and the estimates for x1 and x2 are also extremely biased. Now compare these estimates with those using the firthlogit function in STATA.

Using the firthlogit function in STATA

Now we will perform logistic regression using the firthlogit function in STATA. The first will be using the large data set. Here is a frequency table of response variable and code for firthlogit function.

firthlogit y x1 x2
#to create odds ratio with confidence intervals, use this code:
firthlogit y x1 x2, or

Here are the results from firthlogit with the large sample. It performs slightly better than regular glm() in recovering true parameters in the large sample.

Now let’s examine firthlogit with the small sample and rare event. Here is a frequency table of the response variable:

Here are the results from firthlogit with the small sample and rare events.

The firthlogit function did a good job here recovering the parameters, even with a small sample and rare events.