Graduation Year


Document Type




Degree Name

Doctor of Philosophy (Ph.D.)

Degree Granting Department

Public Health

Major Professor

Henian Chen, M.D., Ph.D.

Co-Major Professor

Wei Wang, Ph.D.

Committee Member

Getachew Dagne, Ph.D.

Committee Member

Ellen Daley, Ph.D.


Keywords

confidentiality, data privacy, generalized linear models, hypothesis testing, statistical disclosure limitation


Abstract

Background: There is a need for rigorous and standardized methods of privacy protection for shared data in the health sciences. Differential privacy is one such method that has gained much popularity due to its versatility and robustness. This study evaluates differential privacy for explanatory regression modeling in the context of health research.

Methods: Surveyed and newly proposed algorithms were evaluated with respect to the accuracy (bias and RMSE) of coefficient estimates, the empirical coverage probability of confidence intervals, and the power and type I error rates of hypothesis tests. Evaluations were carried out on both simulated data and real data from a study of adolescent behavioral health.
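
The evaluation criteria named above (bias, RMSE, and empirical coverage probability) can be illustrated with a minimal sketch. It is not one of the dissertation's algorithms: as a stand-in, it assumes the simplest differentially private statistic, a sample mean of values bounded in [0, 1] released with Laplace noise of scale 1/(n * epsilon). The function names, the uniform data-generating model, and the naive interval that ignores the privacy noise are all assumptions made for illustration.

```python
import math
import random


def bias(estimates, truth):
    """Mean of the estimates minus the true value."""
    return sum(estimates) / len(estimates) - truth


def rmse(estimates, truth):
    """Root mean squared error of the estimates around the true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


def coverage(intervals, truth):
    """Fraction of (lower, upper) intervals that contain the true value."""
    hits = sum(1 for lo, hi in intervals if lo <= truth <= hi)
    return hits / len(intervals)


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as a difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def simulate(epsilon, n=200, reps=1000, seed=0):
    """Hypothetical simulation: private mean of n Uniform(0, 1) draws.

    Assumptions (not from the source): data are bounded in [0, 1], so
    replacing one record changes the mean by at most 1/n, and the Laplace
    scale is therefore 1 / (n * epsilon). The interval is deliberately
    naive: it uses the known sampling standard deviation and ignores the
    added privacy noise.
    """
    rng = random.Random(seed)
    truth = 0.5
    half_width = 1.96 * math.sqrt((1.0 / 12.0) / n)
    estimates, intervals = [], []
    for _ in range(reps):
        sample_mean = sum(rng.random() for _ in range(n)) / n
        private_mean = sample_mean + laplace_noise(1.0 / (n * epsilon), rng)
        estimates.append(private_mean)
        intervals.append((private_mean - half_width, private_mean + half_width))
    return {
        "bias": bias(estimates, truth),
        "rmse": rmse(estimates, truth),
        "coverage": coverage(intervals, truth),
    }
```

Calling `simulate(0.1)` and `simulate(10.0)` and comparing the returned dictionaries shows the general pattern such evaluations look for: smaller values of epsilon (more privacy) add more noise, which raises RMSE and pulls the coverage of an interval that ignores that noise below its nominal level.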

Results: For coefficient estimation, the simulation found the objective and output perturbation algorithms to be the most accurate for logistic models, and subsample-and-aggregate emerged as the most accurate for linear and log-linear models. However, only objective and output perturbation had sufficiently low noise at reasonable settings of the privacy parameter epsilon. The empirical coverage probability of confidence intervals neared the nominal 95% rate only for the output perturbation algorithm, and only at less private settings of epsilon. Of the available algorithms for hypothesis testing, only the Noisy Aggregated Censored z-test maintained an appropriate type I error rate, though power was satisfactory only at the least private settings of epsilon.

Conclusions: The objective and output perturbation algorithms emerged as the most promising for differentially private regression statistics. Further work is needed to derive corresponding algorithms for statistical inference.

Included in

Biostatistics Commons