# Clinical covariates such as age, gender, tumor grade, and smoking history

Clinical covariates such as age, gender, tumor grade, and smoking history have been extensively used in prediction of disease occurrence and progression. or genomic covariates alone. be the clinical outcome of interest. Let = (vector of covariates. Specifically, let be the length be the length is associated with through the model and unknown regression coeffcient is the categorical variable denoting the disease status. For simplicity of notations, we focus on binary classification only. Suppose that = 1 representsthe presence and = 0 indicates the absence of disease. We assume the commonly used logistic regression model, where the logit of the conditional probability is is the length vector of regression coeffcient and is the intercept. Based on a random Tacalcitol monohydrate sample of iid observations (= 1, …, is usually of secondary interest, we simply write = (= ( and denote the event and censoring times, respectively. The most widely used model for censored survival data is the Cox model (Cox, 1972) which assumes that the conditional hazard function is the unknown regression coeffcient. Based on a random sample of iid observations (= 1, …, {that maximizes = { can be estimated by maximizing the continuously differentiable likelihood or partial likelihood functions,|that maximizes = can be estimated by maximizing the differentiable likelihood or partial likelihood functions continuously, which depend on only. The proposed Cov-TGDR is generally applicable if other parametric or semiparametric models are assumed, provided that smooth objective functions are available. 3.?Cov-TGDR 3.1. Algorithm The proposed Cov-TGDR is a gradient searching approach. We refer to Friedman and Popescu (2004) for background and general discussions on such an approach. Let be a small positive increment. In the implementation of our approach, we choose = 1 10?3. Denote = as the index for the point along the parameter path after steps. Let ((0) = 0 and component of {| threshold vector ((+ ( (by + and is component-wise. Steps 2C4 are repeated times. The number of iterations is determined by cross validation. The Cov-TGDR uses a thresholding and variable selection scheme quite different from the TGDR in Friedman and Popescu (2004). Particularly in Step 3, thresholding is carried out for different sets of covariates separately. The rationale is that different type of covariates are not directly comparableone unit increase in gene expressions may have quite different implications from one unit increase in clinical covariates. In addition, genomic covariates usually have a much higher dimensionality than clinical covariates. Variable selection is much more important for genomic covariates than for clinical covariates, which demands a higher degree of regularization for genomic covariates. A fair approach should consider thresholding comparisons within each type of covariates separately, as in Step 3. Loosely speaking, the Cov-TGDR carries out TGDR for each type of covariates separately. The properties of are determined jointly by and (and (for any fixed (non-overlapping subsets of equal sizes. Choose to maximize the cross-validated objective function based on data without the and evaluated Tacalcitol monohydrate without the = 5 in our study. After cross validation over = 1, …, ? 1, carry out the V-fold cross validation and Cov-TGDR estimation. Denote this estimate as for the removed subject. A prediction index can then be computed. For binary classification, class probabilities can be computed from the prediction scores and the logistic model. We use probability 0.5 as the cutoff and predict disease status for each subject. The prediction index can be chosen as the prediction error. For censored survival data, we dichotomize the prediction scores at their median and create two hypothetical risk groups. We then compare the survival functions of the two risk groups. The logrank statistic, which has a Chi-squared distribution with degree of freedom one, is Tacalcitol monohydrate taken as the prediction index. 4.?Breast Cancer Study Breast cancer is the second leading cause of deaths from cancer among women Tacalcitol monohydrate in the United States. Despite major progresses in breast cancer treatment, the ability to predict the metastatic behavior of tumor remains limited. The Breast Cancer Tacalcitol monohydrate study was first reported in vant Veer et al. (2002). 97 lymph node-negative breast cancer patients 55 years old or younger participated in this study. Among them, 46 developed distant metastases within 5 years (metastatic outcome coded as 1) and 51 remained metastases free for at least 5 years (metastatic outcome coded as 0). Clinical covariates collected include age, tumor size, histological grade, angioinvasion, lymphocytic infiltration, estrogen receptor (ER), NNT1 and progesterone receptor (PR) status. Expression levels for 24481 gene probes were collected. We refer to vant Veer et al. (2002).