Graduation Year


Document Type




Degree Granting Department

Chemical and Biomedical Engineering

Major Professor

Steven A. Eschrich, Ph.D.

Co-Major Professor

Dmitry Goldgof, Ph.D.

Committee Member

John Heine, Ph.D.

Committee Member

Rangachar Kasturi, Ph.D.

Committee Member

Ji-Hyun Lee, Dr.PH.


quantization, survival analysis, random subspaces, cost-sensitive analysis, biological covariates


Cancer can develop through a series of genetic events in combination with external factors that alter the progression of the disease. Gene expression studies are designed to provide an enhanced understanding of the progression of cancer and to develop clinically relevant biomarkers of disease, prognosis and response to treatment. One of the main aims of microarray gene expression analyses is to develop signatures that are highly predictive of specific biological states, such as the molecular stage of cancer. This dissertation analyzes the classification complexity inherent in gene expression studies, proposing both techniques for measuring complexity and algorithms for reducing it. Classifier algorithms that generate predictive signatures of cancer models must generalize to independent datasets for successful translation to clinical practice. The predictive performance of classifier models is shown to depend on the inherent complexity of the gene expression data. Three specific quantitative measures of classification complexity are proposed, and one measure (φ) is shown to correlate highly (R² = 0.82) with classifier accuracy in experimental data. Three quantization methods are proposed to enhance contrast in gene expression data and reduce classification complexity. The accuracy of cancer prognosis prediction is shown to improve with quantization in the two datasets studied: from 67% to 90% in lung cancer and from 56% to 68% in colorectal cancer. A corresponding reduction in classification complexity is also observed. A random-subspace-based multivariable feature selection approach using cost-sensitive analysis is proposed to model the underlying heterogeneous cancer biology and to address complexity due to multiple molecular pathways and the unbalanced distribution of samples across classes. The technique is shown to be more accurate than the univariate t-test method: classifier accuracy improves from 56% to 68% for colorectal cancer prognosis prediction. A published gene expression signature to predict radiosensitivity of tumor cells is augmented with clinical indicators to enhance modeling of the data and represent the underlying biology more closely. Statistical tests and experiments indicate that the improvement in model fit is a result of modeling the underlying biology rather than statistical over-fitting of the data, thereby accommodating classification complexity through the use of additional variables.
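
The abstract does not specify the three quantization methods it proposes. As a rough illustration of what quantizing gene expression data can look like, the sketch below applies a generic per-gene equal-frequency binning. This is an assumed, swapped-in technique, not the dissertation's method; the function name `quantize_per_gene`, the samples-by-genes array layout, and the default of three levels are all hypothetical choices made for the example.

```python
import numpy as np


def quantize_per_gene(expr, n_levels=3):
    """Map each gene (column) to integer levels 0..n_levels-1.

    Generic equal-frequency binning, used here only as an illustration.
    expr: 2-D array, rows = samples, columns = genes (assumed layout).
    """
    expr = np.asarray(expr, dtype=float)
    # Interior quantile cut points, e.g. 1/3 and 2/3 for three levels.
    probs = np.linspace(0.0, 1.0, n_levels + 1)[1:-1]
    cuts = np.quantile(expr, probs, axis=0)  # shape: (n_levels - 1, n_genes)

    levels = np.empty(expr.shape, dtype=int)
    for g in range(expr.shape[1]):
        # side="right" places a value equal to a cut point in the upper bin.
        levels[:, g] = np.searchsorted(cuts[:, g], expr[:, g], side="right")
    return levels


if __name__ == "__main__":
    # Toy matrix: 6 samples x 2 genes, invented purely for demonstration.
    toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
                    [4.0, 40.0], [5.0, 50.0], [6.0, 60.0]])
    print(quantize_per_gene(toy))
```

Binning each gene separately keeps genes with very different dynamic ranges comparable after quantization; whether the dissertation's methods work per gene or across the whole array is not stated in the abstract.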