Graduation Year

2010

Document Type

Dissertation

Degree

Ph.D.

Degree Granting Department

Mathematics and Statistics

Major Professor

Kandethody M. Ramachandran, Ph.D.

Committee Member

G. S. Ladde, Ph.D.

Committee Member

Marcus M. McWaters, Ph.D.

Committee Member

Tapas K. Das, Ph.D.

Keywords

Genes, False Discovery Rate, Multiple Testing, Correlation, Classification

Abstract

The aim of the present study is to identify the di®erentially expressed genes be- tween two di®erent conditions and apply it in predicting the class of new samples using the microarray data. Microarray data analysis poses many challenges to the statis- ticians because of its high dimensionality and small sample size, dubbed as "small n large p problem". Microarray data has been extensively studied by many statisticians and geneticists. Generally, it is said to follow a normal distribution with equal vari- ances in two conditions, but it is not true in general. Since the number of replications is very small, the sample estimates of variances are not appropriate for the testing. Therefore, we have to consider the Bayesian approach to approximate the variances in two conditions. Because the number of genes to be tested is usually large and the test is to be repeated thousands of times, there is a multiplicity problem. To remove the defect arising from multiple comparison, we use the False Discovery Rate (FDR) correction. Applying the hypothesis test repeatedly gene by gene for several thousands of genes, there is a great chance of selecting false genes as di®erentially expressed, even though the signi¯cance level is set very small. For the test to be reliable, the probability of selecting true positive should be high. To control the false positive rate, we have applied the FDR correction, in which the p -values for each of the gene is compared with its corresponding threshold. A gene is, then, said to be di®erentially expressed if the p-value is less than the threshold.

We have developed a new method of selecting informative genes based on the Bayesian Version of Behrens-Fisher distribution which assumes the unequal variances in two conditions. Since the assumption of equal variances fail in most of the situation and the equal variance is a special case of unequal variance, we have tried to solve the problem of ¯nding di®erentially expressed genes in the unequal variance cases. We have found that the developed method selects the actual expressed genes in the simulated data and compared this method with the recent methods such as Fox and Dimmic’s t-test method, Tusher and Tibshirani’s SAM method among others.

The next step of this research is to check whether the genes selected by the pro- posed Behrens -Fisher method is useful for the classi¯cation of samples. Using the genes selected by the proposed method that combines the Behrens Fisher gene se- lection method with some other statistical learning methods, we have found better classi¯cation result. The reason behind it is the capability of selecting the genes based on the knowledge of prior and data. In the case of microarray data due to the small sample size and the large number of variables, the variances obtained by the sample is not reliable in the sense that it is not positive de¯nite and not invertible. So, we have derived the Bayesian version of the Behrens Fisher distribution to remove that insu±ciency. The e±ciency of this established method has been demonstrated by ap- plying them in three real microarray data and calculating the misclassi¯cation error rates on the corresponding test sets. Moreover, we have compared our result with some of the other popular methods, such as Nearest Shrunken Centroid and Support Vector Machines method, found in the literature.

We have studied the classi¯cation performance of di®erent classi¯ers before and after taking the correlation between the genes. The classi¯cation performance of the classi¯er has been signi¯cantly improved once the correlation was accounted. The classi¯cation performance of di®erent classi¯ers have been measured by the misclas- si¯cation rates and the confusion matrix.

The another problem in the multiple testing of large number of hypothesis is the correlation among the test statistics. we have taken the correlation between the test statistics into account. If there were no correlation, then it will not a®ect the shape of the normalized histogram of the test statistics. As shown by Efron, the degree of the correlation among the test statistics either widens or shrinks the tail of the histogram of the test statistics. Thus the usual rejection region as obtained by the signi¯cance level is not su±cient. The rejection region should be rede¯ned accordingly and depends on the degree of correlation. The e®ect of the correlation in selecting the appropriate rejection region have also been studied.

Share

COinS