Methodology for the Model-based Small Area Estimates of Cancer Risk Factors and Screening Behaviors

A novel statistical methodology was developed to produce estimates of the prevalence rates of cancer risk factors and screening behaviors at the state, health service area, and county levels.

On this page:

Model-Based Estimation Procedure for the Main Outcomes
Model-Based Estimation Procedure for the Different Phone Components
Changes of the Methodology for years 2011+

This model-based approach uses data from both the Behavioral Risk Factor Surveillance System (BRFSS) and the National Health Interview Survey (NHIS) (see Data Sources). The model-based estimates correct for potential non-coverage bias and nonresponse bias in the BRFSS and reduce the variability in the estimates due to small sample sizes. For each time period, the same general methodology was applied to both counties and states. The methods and models used for the two older data periods, 1997-1999 and 2000-2003, are given in Raghunathan et al., 2007. The methodology was extended to incorporate cell phone only households for the two newer data periods, 2004-2007 and 2008-2010; Liu et al., 2019 describes the updated methodology, and the model details are given in the Appendix. For the data period 2011-2016, the methodology was further updated by removing the stratification by phone status for the NHIS data in the model. Because county and state level direct estimates from the NHIS are not publicly available, they were obtained by running these models at the National Center for Health Statistics as part of our collaboration with center researchers. The health service area prevalence estimates were obtained by aggregating the prevalence estimates of the component counties. The county level analysis is described below.

For each outcome, the following landline telephone, cell phone only and non-telephone (estimated) prevalence rates are utilized from every available county, along with characteristics describing the county:

(1) The NHIS prevalence rate for households with a landline telephone
(2) The NHIS prevalence rate for households with cell phone only
(3) The NHIS prevalence rate for households without a telephone
(4) The BRFSS prevalence rate (all BRFSS households have landline telephones)
(5) Demographic and socio-economic information describing the county

For each outcome, the NHIS prevalence rate estimates (1), (2) and (3) are assumed to be unbiased estimates for households with landline telephones, with cell phone only and without telephones, respectively. Because the BRFSS was a telephone survey through 2010, it did not fully represent cell phone only and non-telephone households, even after poststratification or raking in the weighting step. This results in potential non-coverage bias.

Ideally, every sampled person would respond to the survey; in practice, sampled persons fail to respond for a variety of reasons. If the nonrespondents differ from the respondents, the direct estimate from either survey can be biased. This is the so-called nonresponse bias. Its magnitude depends on the survey nonresponse rate and on the difference between the respondents and the nonrespondents. Since the response rate for the in-person NHIS is higher than for the BRFSS, we assume the NHIS nonresponse bias is ignorable. Thus, the difference between estimates (1) and (4), which both describe households with landline telephones, is assumed to measure the BRFSS nonresponse bias.
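The dependence of the bias on these two quantities can be written out under the standard deterministic view of nonresponse, in which every person in the population is either a respondent or a nonrespondent. The symbols below are introduced here for illustration only and are not taken from the cited papers: m is the nonresponse rate, and the subscripts R and NR denote the respondent and nonrespondent means.

```latex
\bar{Y} = (1 - m)\,\bar{Y}_{R} + m\,\bar{Y}_{NR}
\quad\Longrightarrow\quad
\bar{Y}_{R} - \bar{Y} = m\,\bigl(\bar{Y}_{R} - \bar{Y}_{NR}\bigr)
```

The bias of the respondent mean is therefore zero when either the nonresponse rate is zero or respondents and nonrespondents have the same mean.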

From this data, the model estimates four (population) parameters for every county in the country as a function of the county characteristics:

(A) The model-based prevalence rate for those households with a landline telephone
(B) The model-based prevalence rate for those households with cell phone only
(C) The model-based prevalence rate for those households without a telephone
(D) A nonresponse correction factor

The statistical model uses estimates for (A), (B), (C) and (D), together with the demographic information, to obtain model-based county prevalence estimates for telephone and non-telephone households. The model-based prevalence rate for all households is then estimated as a weighted average of the estimates for landline telephone, cell phone only and non-telephone households, where the weights are the estimated proportions of households in the county with a landline telephone, with cell phone only, and without a telephone, respectively (estimated using two-step small area estimation models).

For the two older data periods covering 1997-2003, the cell phone only component was ignorable and was not included in the models. The model-based prevalence rate for all households is estimated through a weighted average of the landline telephone and non-telephone household estimates, where the weight is the proportion of households in the county with a telephone (from the U.S. census).
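The weighted averages described above can be sketched in a few lines. This is a minimal illustration, not the production code: the function name, argument names and all numbers are hypothetical, and the shares are assumed to sum to one.

```python
def combined_prevalence(prev_landline, prev_cell_only, prev_no_phone,
                        share_landline, share_cell_only, share_no_phone):
    """Weighted average of phone-status prevalence rates for one county."""
    shares = (share_landline, share_cell_only, share_no_phone)
    if abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("phone-status shares must sum to 1")
    return (share_landline * prev_landline
            + share_cell_only * prev_cell_only
            + share_no_phone * prev_no_phone)


# Hypothetical county for the newer periods (made-up numbers).
overall_new = combined_prevalence(0.20, 0.26, 0.35, 0.70, 0.27, 0.03)

# Older periods: no cell phone only component, so its share is set to zero.
overall_old = combined_prevalence(0.20, 0.0, 0.35, 0.95, 0.0, 0.05)
```

Setting the cell phone only share to zero reduces the formula to the two-component average used for 1997-2003.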

Model-Based Estimation Procedure for the Main Outcomes

In this section we sketch the model-based estimation procedure. For a more complete explanation, including the mathematical formulation of the methodology used for the two older data periods, see Raghunathan et al., 2007.

The county level prevalence estimates are based on a statistical model with three levels, a so-called hierarchical model. In the first level, we use the direct estimates of county prevalence rates for landline telephone, cell phone only and non-telephone households from the NHIS and the direct estimates from the BRFSS, (1)-(4) above. The distribution of these direct estimates is modeled in terms of unknown county parameters. An example of a parameter is the county prevalence rate of male current smokers who live in households with landline telephones (see (A) above). These unknown county parameters may be related to each other and predicted by county demographic and socioeconomic characteristics (see (5) above). For example, smoking in the U.S. is currently negatively correlated with socio-economic status (SES), so in general high SES counties will have low smoking rates. We might therefore predict a low smoking rate in a high SES county even if we had little sample data from either survey. Similarly, cancer screening is positively correlated with SES, so we might predict a high screening rate in a high SES county even if we had little sample data.

In the second level of the model, the parameters of the first level model are entered in a regression model as dependent variables. The independent variables for the regression model are twenty-six economic, demographic, and educational attainment measures obtained for all counties from the 2000 and 2010 Census, the American Community Survey, and other sources. A list of these variables is given below. This second level of the model allows the estimates for counties with inadequate data to "borrow strength" from other similar counties; for example, high SES counties can draw on the results from other high SES counties. In this way, the county prevalence estimates are smoothed.
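The idea of "borrowing strength" can be illustrated with a deliberately simplified, single-outcome version of such a model. The sketch below assumes a normal direct estimate with known sampling variance and a normal second level with known between-county variance; the function name and all numbers are hypothetical, and the actual model is multivariate.

```python
def shrinkage_estimate(direct, sampling_var, predicted, between_var):
    """Combine a direct estimate with a regression prediction.

    The weight on the direct estimate grows as its sampling variance
    shrinks, so well-sampled counties are smoothed less.
    """
    weight = between_var / (between_var + sampling_var)
    estimate = weight * direct + (1.0 - weight) * predicted
    return estimate, weight


# Same direct estimate and prediction, different sample sizes (made-up values).
small_sample, w_small = shrinkage_estimate(0.30, 0.004, 0.20, 0.001)
large_sample, w_large = shrinkage_estimate(0.30, 0.0001, 0.20, 0.001)
```

With little data the estimate sits close to the regression prediction from similar counties; with ample data it stays close to the direct estimate.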

County Level Covariates Used in the Analysis

Total population
Percent black
Percent Hispanic
Percentage in urban areas
Percent blue collar
Population Density
Percent high school grads
Percent college grads
Percentage 1 person households
Percentage elderly (65+)
Percent with children under 18
Percent commute time >=30 min
Indicator for time periods
Percent below poverty
Unemployment rate
Median home value
Per capita income
Per capita property taxes
Federal funds per capita
Social security benefits/person
Buying power index
Total retail sales/household
Serious crime rate
Social service establishments/capita
Newspaper readership rate (M-Fr)
MSA indicator

We use a Bayesian approach to estimate the parameters of the statistical model. In the Bayesian approach, a prior distribution is assumed for the unknown model parameters and combined with the data using Bayes' rule. Here, the prior distribution of the second level regression parameters (such as the regression coefficients and the variance components) could be thought of as a third level of the model. Alternatively, the approach could be thought of as a Bayesian analysis of a two level model.

Since we know little a priori about the second level parameters, we assume a vague prior distribution. That is, we assume a prior distribution that is approximately constant over a wide range of (second level) parameter values. We have performed sensitivity analysis to verify that the choice of the prior distribution does not unduly influence the prevalence estimation results.

A challenge in applying Bayesian methods is that, for many problems, a difficult multi-dimensional integration is required to estimate the unknown parameters of the distribution. Some type of numerical approximation must therefore be carried out. Markov chain Monte Carlo (MCMC) methods have been developed to approximate these intractable integrals and to make the Bayesian approach to this type of model feasible. We refer to the Bayesian approach to the three level hierarchical model as a hierarchical Bayes (HB) approach.
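To give a sense of how MCMC replaces the integration, the sketch below runs a Gibbs sampler on a toy two-level normal model. It is not the model used for these estimates: it assumes a single outcome, known sampling variances, a flat prior on the overall mean and a flat prior on the between-county standard deviation. The function name and the data are hypothetical.

```python
import random
import statistics


def gibbs_two_level(y, v, n_iter=4000, burn_in=1000, seed=1):
    """Gibbs sampler for y[i] ~ N(theta[i], v[i]), theta[i] ~ N(mu, tau2).

    Assumed priors: flat on mu and flat on tau = sqrt(tau2).
    Returns the posterior mean of each theta[i].
    """
    rng = random.Random(seed)
    n = len(y)
    mu = statistics.fmean(y)
    tau2 = statistics.variance(y)
    sums = [0.0] * n
    for it in range(n_iter):
        # 1. County parameters given mu and tau2 (precision-weighted normal).
        theta = []
        for yi, vi in zip(y, v):
            precision = 1.0 / vi + 1.0 / tau2
            mean = (yi / vi + mu / tau2) / precision
            theta.append(rng.gauss(mean, precision ** -0.5))
        # 2. Overall mean given the county parameters.
        mu = rng.gauss(statistics.fmean(theta), (tau2 / n) ** 0.5)
        # 3. Between-county variance: inverse-gamma full conditional.
        ss = sum((t - mu) ** 2 for t in theta)
        tau2 = (ss / 2.0) / rng.gammavariate((n - 1) / 2.0, 1.0)
        if it >= burn_in:
            for i, t in enumerate(theta):
                sums[i] += t
    return [s / (n_iter - burn_in) for s in sums]


# Made-up direct estimates; the fifth county has a very noisy estimate.
y = [0.18, 0.22, 0.25, 0.21, 0.35, 0.19, 0.23, 0.20]
v = [0.0004, 0.0004, 0.0004, 0.0004, 0.01, 0.0004, 0.0004, 0.0004]
post_means = gibbs_two_level(y, v)
```

Each pass draws every unknown from its distribution given the others, and averaging the retained draws approximates the posterior means. In this toy data the noisy fifth county is pulled much further toward the other counties than the well-measured ones are.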

We utilized the BRFSS and the NHIS survey design in several ways. First, we used the statistical weights from each survey to calculate the direct estimates (1)-(4) above for each county. Second, we used the survey design to calculate the variance (and covariance) of these estimates and used these quantities in the first level model.
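The first use of the survey design, a weighted direct estimate, can be sketched as a ratio of weighted sums. The function name and the respondent records below are hypothetical, and the design-based variance calculation, which depends on each survey's strata and clusters, is not shown.

```python
def weighted_prevalence(outcomes, weights):
    """Design-weighted prevalence: sum of w*y divided by sum of w.

    outcomes: 1 if the respondent has the characteristic, else 0.
    weights:  survey weights for the same respondents.
    """
    if len(outcomes) != len(weights):
        raise ValueError("outcomes and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * y for y, w in zip(outcomes, weights)) / total_weight


# Five hypothetical respondents in one county.
direct_estimate = weighted_prevalence([1, 0, 0, 1, 0], [120, 300, 80, 200, 300])
```

The unweighted share in this example would be 2/5, so the weights change the estimate noticeably.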

The state prevalence estimates were obtained from a state level model using the same general approach as described above for counties. A principal component analysis was used to reduce the number of covariates in the state level model.
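A principal component reduction of this kind can be sketched with NumPy as follows. The standardization step, the number of components kept and the random stand-in data are all assumptions made for illustration; the source does not state these details.

```python
import numpy as np


def principal_component_scores(covariates, n_components):
    """Return component scores and the share of variance each explains.

    Columns are standardized first, so no column may be constant.
    """
    x = np.asarray(covariates, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # Rows of vt are the principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    scores = z @ vt[:n_components].T
    explained = s ** 2 / np.sum(s ** 2)
    return scores, explained[:n_components]


# Random stand-in for a table of 51 areas by 26 covariates.
rng = np.random.default_rng(0)
fake_covariates = rng.normal(size=(51, 26))
scores, explained = principal_component_scores(fake_covariates, n_components=5)
```

The component scores would then replace the original covariates as predictors in the second level regression.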

For both counties and states, prevalence estimates were made independently for the four time periods (1997-1999, 2000-2003, 2004-2007, and 2008-2010).

Model-Based Estimation Procedure for the Different Phone Components

For both state and county level models, there is no data source that can provide reliable (or acceptable) estimates of the proportions of households in the county with a landline telephone, with cell phone only, and without a telephone. A two-step small area modeling approach was therefore developed to estimate these phone coverage rates for the data periods 2004-2007 and 2008-2010.

Step 1 estimates the proportion of households without a telephone in the county using a Fay-Herriot model after applying an arcsine square root transformation to the direct estimates. Step 2 estimates the ratio of the proportion of households in the county with a landline telephone to the sum of the proportions of households with a landline telephone and with cell phone only, using a modeling approach similar to that of Step 1. Finally, the model-based estimates of the proportions of households with a landline telephone and with cell phone only in the county are computed from the results of Steps 1 and 2.
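The transformation and the final combination step can be sketched as follows. The Fay-Herriot fitting itself is not shown. The combination formula is derived here from the definitions above under the assumption that the three proportions sum to one, and the function names and numbers are hypothetical.

```python
import math


def arcsine_sqrt(p):
    """Variance-stabilizing transformation for a proportion in [0, 1]."""
    return math.asin(math.sqrt(p))


def inverse_arcsine_sqrt(z):
    """Back-transform; valid for z between 0 and pi/2."""
    return math.sin(z) ** 2


def phone_shares(p_no_phone, landline_ratio):
    """Combine the Step 1 and Step 2 results into three shares.

    p_no_phone:     Step 1 estimate, share of households without a phone.
    landline_ratio: Step 2 estimate, landline / (landline + cell only).
    """
    with_phone = 1.0 - p_no_phone
    landline = landline_ratio * with_phone
    cell_only = (1.0 - landline_ratio) * with_phone
    return landline, cell_only, p_no_phone


# Hypothetical county: 3% without a phone, 75% of phone households landline.
landline, cell_only, no_phone = phone_shares(0.03, 0.75)
round_trip = inverse_arcsine_sqrt(arcsine_sqrt(0.03))
```

By construction the three shares sum to one, so they can be used directly as weights.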

Changes of the Methodology for years 2011+

Because of the design change in the BRFSS starting in 2011 and the small proportion of households without a phone (<=2.5%), for 2011 and later years we updated the small area model by removing the stratification by phone status, so the dependent variable in the model contains just two entries: the estimate from the NHIS and the estimate from the BRFSS. The corresponding variance-covariance matrix of the direct estimates in the level 1 model is 2x2. The covariates were also updated to 2011+ whenever possible. Mallows' Cp is used as the main criterion for selecting the subset of covariates included in the Bayesian model. Details of the updated methodology are described in Liu et al. (2024).
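The structure described above can be written compactly. The notation is introduced here for illustration and may differ from that of Liu et al. (2024): for county i, the two direct estimates and their variance-covariance matrix are shown first, followed by the usual textbook definition of Mallows' Cp for a candidate subset with p regression parameters (including the intercept), fitted to n observations.

```latex
\hat{\mathbf{y}}_i =
\begin{pmatrix} \hat{p}_i^{\mathrm{NHIS}} \\ \hat{p}_i^{\mathrm{BRFSS}} \end{pmatrix},
\qquad
\mathbf{V}_i =
\begin{pmatrix} v_{i,11} & v_{i,12} \\ v_{i,12} & v_{i,22} \end{pmatrix},
\qquad
C_p = \frac{\mathrm{SSE}_p}{\hat{\sigma}^2} - n + 2p
```

Here SSE_p is the residual sum of squares of the candidate subset model and \hat{\sigma}^2 is the residual mean square of the model containing all candidate covariates. Subsets with a small Cp that is close to p are generally preferred.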
