







Serum Proteomic Profiling and Analysis
Richard Pelikan, et al.
pathological conditions. The initial discussion of the SELDI-TOF technology is followed by a discussion of technical limitations that affect the interpretive analysis of the profiles. Next we focus on the description of some statistical methods used to decipher the profiles and their usage in diagnosing or predicting the condition of patients. We illustrate the potential of these methods on the task of differentiating between individuals with and without cancer. Finally, the review concludes with some insight on the direction of using proteomic data analysis for the benefit of the medical community.
The ProteinChip© Biology System developed by Ciphergen Biosystems, Inc. uses SELDI-TOF MS to ionize proteins specifically retained on a chromatographic surface, which are then detected by time-of-flight mass spectrometry. The system can be used for the mass analysis of compounds such as proteins, peptides and nucleic acids within a range of 0–200 kDa. The procedure begins with the reaction of a biological sample (e.g. bodily fluid, cell lysate or a fraction thereof) with the chromatographic surface (or 'spot') of a ProteinChip, which possesses a defined affinity characteristic: anionic/cationic, hydrophobic, metal-binding, or biologically derivatized (e.g. antibody-coupled). The ProteinChips, composed of 8 or 16 of these spots, retain only those analytes that match the surface's physical affinity characteristics; non-binding species are washed away under appropriate conditions. The spots are then overlaid with an energy-absorbing 'matrix' compound, which co-crystallizes around the retained analyte molecules. The spots are 'shot' multiple times by a pulsed nitrogen laser. The laser desorption process results in ionization of matrix molecules and protonation of intact analyte molecules. The ions produced are differentially accelerated in an electrical field and then detected after passing through a field-free, evacuated 'drift' tube. The time of flight across the tube provides information on the molecule's mass-to-charge ratio (m/z): because all ions of the same charge acquire the same kinetic energy during acceleration, heavier molecules travel more slowly and take longer to cover the same distance. The detected ions are then represented as a 'spectrum' with peaks of varying intensities, and molecular weight assignments are made relative to known calibrant species. Figure 57.1 displays a summary of the SELDI-TOF MS process.

Early studies and first applications (Paweletz et al., 2001; Petricoin and Ornstein, 2002; Petricoin et al., 2002) assembled SELDI-TOF MS proteomic profiles of patients with various types of cancer. The primary goal of such studies was to determine whether it is possible to detect peptide markers of the presence of disease by contrasting the profiles of those with cancer and those without. For example, Petricoin and coworkers' April 2002 study (Petricoin et al., 2002) compared profiles of 200 patients in order to determine a discriminating pattern between patients with ovarian cancer and those with a variety of non-cancer conditions. A sample profile from this set is shown in Figure 57.2. It consists of intensities measured over 15,154 mass/charge (m/z) values.
The profiles obtained by the SELDI-TOF MS system manifest a number of attributes which can complicate analysis.
Figure 57.1 Diagram of the Ciphergen SELDI-TOF MS system. Samples are reacted with ProteinChip spots, coated with 'matrix' and pulsed with a laser. The ionized species created during this process are differentially accelerated in an electric field and drift through the vacuum tube, where their arrival times and quantities (intensities) are measured by a detector. Heavier ions demonstrate a longer time of flight, which is a unique indicator for each molecular species. The plot of ion intensity versus time of flight (or the corresponding mass-to-charge value) constitutes the mass spectrum for a given sample.
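For concreteness, the time-to-mass conversion mentioned in the caption follows from a standard derivation (added here for clarity; it is implied rather than spelled out by the chapter): an ion of charge ze accelerated through a potential V acquires kinetic energy zeV = (1/2)mv^2, so its drift time over a field-free tube of length L is

t = L / v = L * sqrt(m / (2zeV)) ∝ sqrt(m/z),

which is why the measured time of flight maps monotonically to the mass-to-charge ratio m/z.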
Figure 57.3 compares two unprocessed profiles from the same reference serum. Differences are readily visible and illustrate stochastic variations that obscure what would ideally be identical profiles. Many causes may be responsible for the differences: variation in the SELDI-TOF MS instrumentation condition over time, fluctuation in the intensity of the laser, and even surface irregularities on the protein chip spots or in the matrix crystallization. All stochastic variations in profiles show up as differences in intensity readings. However, these differences are the result of two intertwined problems: mass inaccuracy and intensity measurement errors. Mass inaccuracy refers to the misalignment of readings for different m/z values. The mass inaccuracy for Ciphergen's SELDI-TOF MS system is reported to be approximately 0.1 per cent for externally calibrated experiments. The intensity measurement error may arise from imperfect performance of the ion detector in registering the abundance of ions at a given time point (detector saturation). Both types of errors are illustrated in the right panel of Figure 57.3. In addition, the left panel of Figure 57.3 illustrates baseline variation, a systematic intensity measurement error in which the measurements of the profile deviate from zero. Note that the baseline shifts differ between the two samples despite the fact that the same serum is being analyzed. The mass inaccuracy and intensity measurement errors can lead to significant fluctuation in profile readings. In addition, if we analyze samples from multiple individuals, a natural biological variation in sera is observed. This can show up as differences in intensity values or as the presence or absence of peaks in the profile. The peaks are believed to indicate the presence of peptides or their fragments. These problems lead to serious challenges in
Figure 57.2 A sample SELDI-TOF MS profile. The x-axis plots the mass-to-charge ratio. The y-axis plots the relative intensity (flux) of analyte species with the identified mass-to-charge ratio.
Figure 57.3 Two SELDI-TOF profiles obtained for the same pooled reference serum. The differences in the baseline (left panel), intensity measurements (top right panel) and mass inaccuracies (bottom right panel) are apparent.
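Baseline variation of the kind visible in the left panel of Figure 57.3 (and corrected in Figure 57.5 below) is typically handled by estimating the slowly varying additive component and subtracting it. The chapter does not give its exact algorithm; the following is a minimal sketch of one common approach, a smoothed rolling-minimum baseline, where the window size and the synthetic profile are illustrative assumptions only:

```python
import numpy as np
from scipy.ndimage import minimum_filter1d, uniform_filter1d

def correct_baseline(intensities, window=201):
    """Estimate a slowly varying baseline as a smoothed rolling minimum
    and subtract it, clipping the result at zero intensity."""
    baseline = minimum_filter1d(intensities, size=window)
    baseline = uniform_filter1d(baseline, size=window)  # smooth the estimate
    return np.clip(intensities - baseline, 0.0, None)

# Example on a synthetic profile with an additive exponential drift:
m_z = np.linspace(0.0, 1.0, 5000)
drift = 5.0 * np.exp(-3.0 * m_z)                        # baseline component
signal = np.random.default_rng(0).random(5000) + drift
corrected = correct_baseline(signal)                    # drift removed
```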
One of the objectives of SELDI-TOF MS data analysis is to build a predictive model that is able to determine the target condition (case or control, cancer or non-cancer) for a given patient's profile. The predictive model is built from a set of SELDI-TOF profiles (samples) assembled during the study. Each sample in the dataset is associated with a class label determining the target patient condition (case or control, cancer or non-cancer) we would like to recognize automatically. Our objective is to exploit the information in the data and to construct (or learn) a classifier model that is able to predict accurately the label of any new profile. Such a model can then be used to predict the labels of new profile samples (diagnosis). The ultimate goal is to build (learn) the best possible classifier, i.e. a model that achieves the highest possible accuracy on future, yet-to-be-seen proteomic profile samples.

Many types of classification models exist. These include classic statistical models such as logistic regression (Kleinbaum, 1994), linear and quadratic discriminant analysis (Duda et al., 2000; Hastie et al., 2001), and modern statistical approaches such as support vector machines (Vapnik, 1995; Burges, 1998; Scholkopf and Smola, 2002). In general, the model defines a decision boundary – a surface in the high-dimensional space of profile measurements – that attempts to separate case and control profiles in the best possible way. The left panel of Figure 57.6 illustrates a linear surface (a hyperplane) that separates 193 of the 200 samples from the ovarian study using the intensity information at three profile positions (0.0735, 0.0786, 0.4153 m/z). Note, however, that a perfect separation of the two classes via a linear surface over just three positions may not be possible in general. This is illustrated in the right panel of Figure 57.6, where a linear surface allowing the perfect separation of case and control profiles does not exist. This scenario leads to sample misclassification – the assignment of incorrect labels to some samples relative to the decision boundary.
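As a concrete illustration (a minimal sketch, not taken from the chapter), a linear support vector machine can be fitted to profile intensities in a few lines; the data below are synthetic stand-ins for real case/control profiles restricted to three selected m/z positions:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in for 200 profiles at 3 selected m/z positions:
# 100 controls and 100 cases drawn from slightly shifted Gaussians.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 3)),   # controls
               rng.normal(1.0, 1.0, size=(100, 3))])  # cases
y = np.array([0] * 100 + [1] * 100)                   # 0 = control, 1 = case

clf = SVC(kernel="linear")   # learns a linear decision boundary (hyperplane)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
print("hyperplane normal:", clf.coef_[0])
```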
To evaluate the quality of a classification model, one must determine the ability of the model to generate accurate predictions for future unseen profile data. Since such data are unavailable, we can simulate this scenario by splitting the available data (obtained during the study) into a 'training' set and a 'testing' set. The training set is used to learn (construct) the classifier(s). The learning process adjusts the predictive model (classifier) so that the examples in the training set are classified with high accuracy. The ability of the model to predict case and control samples in the future is then evaluated on the test set. Figure 57.7 illustrates the basic evaluation setup.

The quality of a binary (case versus control) classification model may be assessed with many different metrics. For the purposes of this review, the classification models are evaluated using statistics computed from the confusion matrix, a two-by-two grid that represents the types and percentages of correct and incorrect classifications. A sample confusion matrix is shown in Figure 57.8. The following useful measures can be derived from the confusion matrix:
● Error (misclassification) rate: E = FP + FN
● Sensitivity (SN): SN = TP / (TP + FN)
● Specificity (SP): SP = TN / (TN + FP)
● Positive predictive value (PPV): PPV = TP / (TP + FP)
● Negative predictive value (NPV): NPV = TN / (TN + FN)
Figure 57.5 An example of baseline correction. Left panel: a profile with a baseline drift. Right panel: the corrected profile. The additive component in the signal is removed and the baseline is shifted to the zero intensity level.
A confusion matrix, and thus also the above measures, can be generated for both the training and the testing set. However, the evaluation on the testing set matters more; it is an indicator of how well the classification model generalizes to unseen data. Sensitivity and specificity on the testing set are very useful if we consider adopting the classification model for the purpose of clinical screening. Sensitivity reports the percentage of samples with a condition that were correctly classified as having the condition. Specificity, on the other hand, reports the percentage of samples without a condition that were correctly classified as not having the condition. Very high values of these statistics indicate a good screening test. Moreover, one type of error can often be reduced at the expense of the other, which matters if we care more about one type of error than the other. For example, we may want a low number of patients incorrectly classified as suffering from the disease, while caring somewhat less about missing a patient with the disease during screening. In the simplest evaluation framework, the statistics recorded in the confusion matrix are based on a single train/test split. To eliminate a potential bias due to a lucky or an unlucky train/test split, the average of these statistics over multiple random splits is typically reported.
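A minimal sketch of this evaluation loop follows; the synthetic data are stand-ins, and only the split proportions (70:30) and the number of splits (40) mirror the experiments reported later in the chapter:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 25)),   # controls
               rng.normal(0.5, 1.0, size=(100, 25))])  # cases
y = np.array([0] * 100 + [1] * 100)

errors, sens, spec = [], [], []
for split in range(40):                       # 40 random 70:30 splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=split)
    pred = SVC(kernel="linear").fit(X_tr, y_tr).predict(X_te)
    tp = np.mean((pred == 1) & (y_te == 1))   # fractions of all test samples,
    tn = np.mean((pred == 0) & (y_te == 0))   # so TP + FP + FN + TN = 1
    fp = np.mean((pred == 1) & (y_te == 0))
    fn = np.mean((pred == 0) & (y_te == 1))
    errors.append(fp + fn)                    # E  = FP + FN
    sens.append(tp / (tp + fn))               # SN = TP / (TP + FN)
    spec.append(tn / (tn + fp))               # SP = TN / (TN + FP)

print("mean E, SN, SP:", np.mean(errors), np.mean(sens), np.mean(spec))
```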
Classification models come in different guises. They may use a different decision boundary type or optimize slightly different learning criteria. Deeper analyses of many existing
Figure 57.6 (Left panel) An example of a perfect separation of cases (X) versus controls (O) using a hyperplane in a 2-dimensional projection of the 3-dimensional space defined by m/z positions (0.0735, 0.0786, 0.4153) in the ovarian cancer study. Note that all control samples are above the hyperplane while all cases are below. (Right panel) A perfect separation of the two groups does not exist. Some case and control samples appear on opposite sides of the hyperplane.
Figure 57.7 The basic evaluation setup. The dataset of samples (case and control profiles) is divided into a training set and a testing set. The training set is used to learn (fit) the predictive model; the testing set serves as a surrogate sample of future 'unseen' examples.
                    Actual case    Actual control
Predicted case          TP              FP
Predicted control       FN              TN
Figure 57.8 Diagram of a confusion matrix. The four areas are labeled as follows:
● True positive (TP): the percentage of profiles which the classifier predicts correctly as belonging to the group of cases (cancerous)
● False positive (FP): the percentage of profiles which the classifier predicts as belonging to cases, but which truly belong to controls (healthy)
● False negative (FN): the percentage of profiles which the classifier predicts as belonging to controls, but which truly belong to cases
● True negative (TN): the percentage of profiles which the classifier predicts correctly as belonging to the group of controls.
Note that the numbers should always sum to 1. An ideal classifier will have a value of 0 for both false negatives and false positives.
amplified over many related positions and is thus less sensitive to random fluctuations (noise). An example of such a transformation is principal component analysis (PCA) (Jolliffe, 1986). Intuitively, PCA identifies orthogonal sets of correlated features and constructs aggregate features (or components) which are uncorrelated, yet are responsible for most of the variance in the original data. Retaining the variance is important; it is often helpful to explore parts of the data where the features spread out across a large space, which gives more room to find a decision boundary between classes. These methods often construct useful features due to their inherent ability to utilize the entire signal, rather than a limited number of positions. Another popular solution that attempts to alleviate the problem of a noisy signal assumes that all relevant information is carried by peaks (Adam et al., 2002). The subsequent discriminative analysis is restricted to the identified peaks and/or their additional characteristics. Numerous versions of peak selection strategies exist (Adam et al., 2002). However, the utility and advantage of these strategies over other feature selection methods for the purpose of discriminant analysis has not been demonstrated.
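A minimal sketch of PCA-based feature construction follows; the profile matrix is random stand-in data whose dimensions merely mirror the ovarian study (200 samples, 15,154 m/z positions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
profiles = rng.random((200, 15154))      # 200 samples x 15,154 m/z intensities

pca = PCA(n_components=25)               # keep up to 25 aggregate features
features = pca.fit_transform(profiles)   # 200 x 25 matrix of component scores
print("variance explained:", pca.explained_variance_ratio_.sum())
```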
The goal of this section is to illustrate the potential of the SELDI-TOF profiling technology on the analysis of the ovarian cancer data from April 2002 (Petricoin et al., 2002). The dataset² consists of 200 profile samples: 100 samples represent cancer, 100 samples are controls. We rely on concepts and approaches discussed in previous sections and show that it is indeed possible to build predictive models that can classify test samples taken from the SELDI-TOF dataset with high accuracy. The predictive model used in all our experiments is the support vector machine (see above). The model builds a linear decision boundary that separates the cancer and control samples provided in the training set. In all experiments the model is learned using a limited number of features (5–25). We try different feature selection approaches. Finding a good set of features is a highly exploratory process and often remains the bottleneck of discovery.
Table 57.1 illustrates the quality of the predictive model learned using the top 5–25 m/z positions according to the Fisher score criterion. To remove highly correlated features we used a maximum allowed correlation (MAC) threshold of 0.8. The table shows the test errors (E), sensitivities (SN) and specificities (SP) for different numbers of features. All statistics listed are averages over 40 different splits of the data into training and testing sets, which assures that the results are not biased by a single lucky or unlucky train/test split. The split proportions are 70:30, i.e. 70 per cent of the samples are assigned to the training set and 30 per cent to the test set.

Using the univariate statistical analysis approach discussed above, the results are quite impressive. The best result occurs when the classifier is allowed to use the top 25 features selected by Fisher score. Under this condition, the classification model achieves 96.6 per cent sensitivity and 97.36 per cent specificity. On average, 2.99 per cent of the samples seen during the testing phase were misclassified. Varying the number of features reveals a tradeoff between sensitivity and specificity: sensitivity is highest when using only five features, yet specificity is highest when using 25.

Instead of narrowly examining a small number of individual positions, we can examine the effectiveness of aggregate features. Table 57.2 illustrates the performance statistics of our classification model using features constructed with PCA. The results are included over a range of 5 to 25 principal component features, which amplify patterns found in the profile signal.
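The chapter does not spell out the exact selection procedure; one plausible minimal sketch of Fisher-score ranking with a MAC filter is given below. The function names and the greedy de-correlation step are assumptions for illustration:

```python
import numpy as np

def fisher_score(X, y):
    """Per-position Fisher score: (mean1 - mean0)^2 / (var1 + var0)."""
    X0, X1 = X[y == 0], X[y == 1]
    return (X1.mean(axis=0) - X0.mean(axis=0)) ** 2 \
           / (X1.var(axis=0) + X0.var(axis=0) + 1e-12)

def select_top_features(X, y, k=25, max_corr=0.8):
    """Greedily keep the highest-scoring positions whose correlation with
    every already-selected position stays below the MAC threshold."""
    selected = []
    for j in np.argsort(fisher_score(X, y))[::-1]:   # best score first
        if all(abs(np.corrcoef(X[:, j], X[:, i])[0, 1]) < max_corr
               for i in selected):
            selected.append(j)
        if len(selected) == k:
            break
    return np.array(selected)

# Usage: X is an (n_samples x n_positions) intensity matrix, y a 0/1 vector:
#   top = select_top_features(X, y, k=25, max_corr=0.8)
```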
Figure 57.10 Thirty of the top 100 positions selected by the Fisher score, shown on the mean cancer profile. The positions (marked) accumulate on only two peak complexes, and many features are highly correlated, thus carrying minimal additional discriminative information.
Table 57.1 Predictive statistics for the linear SVM model on the ovarian cancer dataset. The features are selected according to the Fisher score criterion. The maximum allowed correlation (MAC) threshold is 0.8. Test errors range between 2.9 and 4.1 per cent. Sensitivities and specificities are between 94.9 and 97.6 per cent
No. of features    Testing error    Sensitivity    Specificity
5                  0.0352           0.9764         0.
10                 0.0402           0.9698         0.
15                 0.0406           0.9584         0.
20                 0.0332           0.9641         0.
25                 0.0299           0.9666         0.9736
² The ovarian cancer dataset is available at http://ncifdaproteomics.com/
Again, the resulting statistics are averages over the same 40 train/test splits as used in the univariate analysis, to allow a more direct comparison of behavior. The predictive performance of our classifier falls when used in conjunction with principal component features. The reason may be that there are too many independencies between positions in the profile. Such a condition causes a problem for PCA, which attempts to find signal-wide relationships between multiple positions. If the signal-wide relationships that do exist are weak, then the benefit of using these features will be minimal. In addition, including many of these features to compensate for their weakness complicates the process of discovering biomarkers.

As a final example of feature selection, we illustrate the behavior of the aforementioned peak selection strategy. We refer to a peak as a local maximum over a region of the profile. The signal is smoothed before detecting the maxima on the mean case and control profiles. Peak positions from both mean profiles are then used as the primary set of features (a minimal sketch of this step appears after Table 57.3). Table 57.3 displays performance statistics of the above model using peak selection prior to selecting the top Fisher score features. Unfortunately, the performance falls below what was achieved without the peak selection strategy: testing error is relatively high compared with the results presented earlier in Table 57.1. The likely reason is that informative features are found not only on peaks, but in valleys as well. Another possibility is that the particular behavior of the peak detection algorithm is not optimal. As mentioned before, many methods for performing peak selection exist, and different criteria for selecting peaks will undoubtedly yield differing results.

The results presented above show that it is possible to learn predictive models that achieve a very low classification error on SELDI-TOF samples. To support the significance of these results, in particular the fact that the sample profiles carry useful discriminative signals, one may want to perform additional statistical validation. The goal of one such test, the random class-permutation test, is to verify that the discriminative signal captured by the classifier model is unlikely to be the result of a random case versus control labeling. Figure 57.11 shows the result of the random permutation test for the classifiers analyzed in Table 57.1. The figure plots the estimate of the mean test error one would obtain by learning the classifier on 5–25 features with randomly assigned class labels, together with estimates of the 95 per cent and 99 per cent test error bounds. The estimates are obtained using 100 random class-label permutations of the original ovarian dataset. The results illustrate a large gap between the classification errors achieved on the data and the classification errors under the null (random class-label) hypothesis. This shows that our achieved error results are not a coincidence.
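A hedged sketch of such a permutation test follows. The chapter reports 100 label permutations; the data, the number of evaluation splits, and the percentile comparison are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def test_error(X, y, n_splits=10):
    """Mean test error of a linear SVM over random 70:30 splits."""
    errs = []
    for s in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=s)
        errs.append(1.0 - SVC(kernel="linear").fit(X_tr, y_tr)
                                              .score(X_te, y_te))
    return np.mean(errs)

rng = np.random.default_rng(0)
X = rng.random((200, 25))                 # stand-in feature matrix
y = np.array([0] * 100 + [1] * 100)       # true class labels

observed = test_error(X, y)
# Null distribution: repeat the evaluation with randomly permuted labels.
null_errors = [test_error(X, rng.permutation(y)) for _ in range(100)]
# A real signal shows up as observed error far below the null errors.
print(observed, np.percentile(null_errors, 5))
```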
This review deals solely with clinical proteomics, but the analysis techniques reported here could typically be applied in other domains, including immunologic factors determined in the serum (cytokines, chemokines and antibodies) or microarray/transcription profiling of peripheral blood. The profiles generated by SELDI-TOF MS are a rich source of information which surfaces only after careful analysis. Although there are many ways to analyze and evaluate proteomic profile data, a simple framework such as the one presented above serves as a foothold for future data analysis work. Using proper feature selection techniques, proteomic profiling can be a valuable discovery tool for locating protein expression patterns that separate case and control populations. As seen above, by comparing expected generalization results on an unseen testing set, one can evaluate the performance of many feature selection strategies. The resulting classification models each contribute knowledge about the profiles, whether subsequent test sets bring success or failure. In the case explored above, the peak selection strategy was not effective because important information is also expressed in the 'valleys' of the profile; when these were taken into account, the predictive ability of the classification model was higher. When features were found to cluster with one another because correlation tracked proximity of m/z values, de-correlation became an important step in the process. Ultimately, the classification model was able to obtain a testing error of 3 per cent when using the top 25 m/z intensities ranked by Fisher score and having intra-correlation coefficients less than 0.8.
Table 57.2 Predictive statistics for the linear SVM model on the ovarian cancer dataset. The features are constructed using principal component analysis (PCA). Test errors range between 8.9 and 19.9 per cent. Sensitivities and specificities range between 85.3 and 91.1 per cent
No. of features    Testing error    Sensitivity    Specificity
5                  0.1992           0.8533         0.
10                 0.1111           0.9161         0.
15                 0.1078           0.8998         0.
20                 0.0926           0.9038         0.
25                 0.0898           0.9087         0.
Table 57.3 Performance statistics of the linear SVM classifier after using peak detection. A MAC threshold of 0.8 was enforced before selecting the top 5–25 features using the Fisher score. Testing error ranges from 9.8 to 11.7 per cent, while sensitivity and specificity range from 85 to 93.4 per cent
No. of features    Testing error    Sensitivity    Specificity
5                  0.1168           0.9152         0.
10                 0.1049           0.9169         0.
15                 0.1033           0.9209         0.
20                 0.0984           0.934          0.
25                 0.1012           0.934          0.
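The peak detection step underlying Table 57.3 is not specified in detail in the chapter; a minimal sketch of one plausible reading (Gaussian smoothing followed by local-maximum picking, with an assumed smoothing width and random stand-in profiles) is:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks

def peak_positions(mean_profile, sigma=5):
    """Smooth the mean profile, then return indices of its local maxima."""
    smoothed = gaussian_filter1d(mean_profile, sigma=sigma)
    peaks, _ = find_peaks(smoothed)
    return peaks

# Candidate features: union of peaks from the mean case and control profiles.
rng = np.random.default_rng(0)
mean_case, mean_control = rng.random(15154), rng.random(15154)
features = np.union1d(peak_positions(mean_case), peak_positions(mean_control))
print(len(features), "candidate peak positions")
```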
Scholkopf, B. and Smola, A. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Steel, L.F., Shumpert, D. and Trotter, M. (2003). A strategy for the comparative analysis of serum proteomes for the discovery of biomarkers for hepatocellular carcinoma. Proteomics 3, 601–609.
Vapnik, V.N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Wadsworth, J.T., Somers, K. and Stack, B. (2004). Identification of patients with head and neck cancer using serum protein profiles. Arch Otolaryngol Head Neck Surg 130, 98–104.
Watkins, B., Szaro, R., Shannon, B. et al. (2001). Detection of early stage cancer by serum protein analysis. Am Lab, 32–36.
Zeindl-Eberhart, E., Haraida, S. and Liebmann, S. (2004). Detection and identification of tumor-associated protein variants in human hepatocellular carcinomas. Hepatology 39, 540–549.
Zhukov, T.A., Johanson, R.A., Cantor, A.B., Clark, R.A. and Tockman, M.S. (2003). Discovery of distinct protein profiles specific for lung tumors and pre-malignant lung lesions by SELDI mass spectrometry. Lung Cancer 40, 267–279.