My Publications
ARTICLES
This paper asks whether susceptibility to early-onset (diagnosis before age 40) of a particularly deadly form of cancer, Multiple Myeloma, can be predicted from single-nucleotide polymorphism (SNP) profiles with an accuracy greater than chance. Specifically, given SNP profiles for 80 Multiple Myeloma patients - of which we believe 40 to have high susceptibility and 40 to have lower susceptibility - we train a support vector machine (SVM) to predict age at diagnosis. We chose SVMs for this task because they are well suited to deal with interactions among features and redundant features. The accuracy of the trained SVM estimated by leave-one-out cross-validation is 71%, significantly greater than random guessing. This result is particularly encouraging since only 3000 SNPs were used in profiling, whereas several million SNPs are known.
Motivation: Standard laboratory classification of the plasma cell dyscrasia monoclonal gammopathy of undetermined significance (MGUS) and the overt plasma cell neoplasm multiple myeloma (MM) is quite accurate, yet, for the most part, biologically uninformative. Most, if not all, cancers are caused by inherited or acquired genetic mutations that manifest themselves in altered gene expression patterns in the clonally related cancer cells. Microarray technology allows for qualitative and quantitative measurements of the expression levels of thousands of genes simultaneously, and it has now been used both to classify cancers that are morphologically indistinguishable and to predict response to therapy. It is anticipated that this information can also be used to develop molecular diagnostic models and to provide insight into mechanisms of disease progression, e.g., transition from healthy to benign hyperplasia or conversion of a benign hyperplasia to overt malignancy. However, standard data analysis techniques are not trivial to employ on these large data sets. Methodology designed to handle large data sets (or modified to do so) is needed to access the vital information contained in the genetic samples, which in turn can be used to develop more robust and accurate methods of clinical diagnostics and prognostics.
Results: Here we report on the application of a panel of statistical and data mining methodologies to classify groups of samples based on expression of 12,000 genes derived from a high density oligonucleotide microarray analysis of highly purified plasma cells from newly diagnosed MM, MGUS, and normal healthy donors. The three groups of samples are each tested against each other. The methods are found to be similar in their ability to predict group membership; all do quite well at predicting MM vs. normal and MGUS vs. normal. However, no method appears to be able to distinguish explicitly the genetic mechanisms between MM and MGUS. We believe this might be due to the lack of genetic differences between these two conditions, and may not be due to the failure of the models. We report the prediction errors for each of the models and each of the methods. Additionally, we report ROC curves for the comparisons and results on predicting MGUS from a model that distinguishes MM samples from normal samples.
Gene-expression microarrays, commonly called 'gene chips,' make it possible to simultaneously measure the rate at which a cell or tissue is expressing - translating into a protein - each of its thousands of genes. One can use these comprehensive snapshots of biological activity to infer regulatory pathways in cells, identify novel targets for drug design, and improve the diagnosis, prognosis, and treatment planning for those suffering from disease. However, the amount of data this new technology produces is more than one can manually analyze. Hence, the need for automated analysis of microarray data offers an opportunity for machine learning to have a significant impact on biology and medicine. This article describes microarray technology, the data it produces, and the types of machine-learning tasks that naturally arise with this data. It also reviews some of the recent prominent applications of machine learning to gene-chip data, points to related tasks where machine learning may have a further impact on biology and medicine, and describes additional types of interesting data that recent advances in biotechnology allow biomedical researchers to collect.
Large-scale applications that require executing very large numbers of tasks are only feasible through parallelism. In this work we present a system that automatically handles large numbers of experiments and data in the context of machine learning. Our system controls all experiments, including re-submission of failed jobs and relies on available resource managers to spawn jobs through pools of machines. Our results show that we can manage a very large number of experiments, using a reasonable amount of idle CPU cycles, with very little user intervention.
Collagen-like peptides of the type (Pro-Pro-Gly)(10) fold into stable triple helices. An electron-withdrawing substituent at the H(gamma)(3) ring position of the second proline residue stabilizes these triple helices. The aim of this study was to reveal the structural and energetic origins of this effect. The approach was to obtain experimental NMR data on model systems and to use these results to validate computational chemical analyses of these systems. The most striking effects of an electron-withdrawing substituent are on the ring pucker of the substituted proline (Pro(i)) and on the trans/cis ratio of the Xaa(i-1)-Pro(i) peptide bond. NMR experiments demonstrated that N-acetylproline methyl ester (AcProOMe) exists in both the C(gamma)-endo and C(gamma)-exo conformations (with the endo conformation slightly preferred), N-acetyl-4(R)-fluoroproline methyl ester (Ac-4R-FlpOMe) exists almost exclusively in the C(gamma)-exo conformation, and N-acetyl-4(S)-fluoroproline methyl ester (Ac-4S-FlpOMe) exists almost exclusively in the C(gamma)-endo conformation. In dioxane, the K(trans/cis) values for AcProOMe, Ac-4R-FlpOMe, and Ac-4S-FlpOMe are 3.0, 4.0, and 1.2, respectively. Density functional theory (DFT) calculations with the (hybrid) B3LYP method were in good agreement with the experimental data. Computational analysis with the natural bond orbital (NBO) paradigm shows that the pucker preference of the substituted prolyl ring is due to the gauche effect. The backbone torsional angles, phi and psi, were shown to correlate with ring pucker, which in turn correlates with the known phi and psi angles in collagen-like peptides. The difference in K(trans/cis) between AcProOMe and Ac-4R-FlpOMe is due to an n->pi interaction associated with the Burg-Dunitz trajectory. The decrease in K(trans/cis) for Ac-4S-FlpOMe can be explained by destabilization of the trans isomer because of unfavorable electronic and steric interactions. Analysis of the results herein along with the structures of collagen-like peptides has led to a theory that links collagen stability to the interplay between the pyrrolidine ring pucker, phi and psi torsional angles, and peptide bond trans/cis ratio of substituted proline residues.
Supervised machine learning and data mining tools have become popular for the analysis of gene expression microarray data. They have the potential to uncover new therapeutic targets for diseases, to predict how patients will respond to specific treatments, and to uncover regulatory relationships among genes in normal and disease situations. Comparative experiments are needed to identify the advantages of the leading supervised learning algorithms for microarray data, as well as to give direction in methodological decisions. This paper compares support vector machines, Bayesian networks, decision trees, boosted decision trees, and voting (ensembles of decision stumps) on a new microarray data set for cancer with over 100 samples. The paper provides evidence for several important lessons for mining microarray data, including: (1) Bayes nets and ensembles perform at least as well as other approaches but arguably provide more direct insight; (2) the common practice of throwing out low or negative average differences, or those accompanied by an absent call, is a mistake; (3) looking for consistent differences in expression may be more important than large differences.
SEMINARS
This paper asks whether susceptibility to early-onset (diagnosis before age 40) of a particularly deadly form of cancer, Multiple Myeloma, can be predicted from single-nucleotide polymorphism (SNP) profiles with an accuracy greater than chance. Specifically, given SNP profiles for 80 Multiple Myeloma patients - of which we believe 40 to have high susceptibility and 40 to have lower susceptibility - we train a support vector machine (SVM) to predict age at diagnosis. We chose SVMs for this task because they are well suited to deal with interactions among features and redundant features. The accuracy of the trained SVM estimated by leave-one-out cross-validation is 71%, significantly greater than random guessing. This result is particularly encouraging since only 3000 SNPs were used in profiling, whereas several million SNPs are known.
The past two decades have witnessed the identification of genes responsible for a number of inherited human disorders. However, most of these successes were with disorders that are caused by single genes. Attempts to identify groups of genes that result in inherited disorders through their collective action have been largely unsuccessful. One reason for this difficulty is that standard genetics techniques are more sensitive to large, consistent changes in single genes than to consistent patterns of small changes in groups of genes.
In order to find the groups of genes that result in these more complex disorders, researchers need to identify groups of genes whose collective action is consistent, even though individual genes may not be. We propose that standard machine learning algorithms can be utilized to address this goal. In this seminar, I will discuss using support vector machines (SMVs) to model patterns in single nucleotide polymorphisms (SNPs) that are associated with early versus late onset of a particularly deadly form of cancer, Multiple Myeloma. The goal of building accurate models is not only to assess risk, but to provide insight into the disease and potentially offer novel drug targets.
Standard laboratory classification of the plasma cell dyscrasia monoclonal gammopathy of undetermined significance (MGUS) and the overt plasma cell neoplasm multiple myeloma (MM) is quite accurate, yet, for the most part, prognostically uninformative. Most, if not all, cancers are caused by inherited or acquired genetic mutations that manifest themselves in altered gene expression patterns in the clonally related cancer cells. Microarray technology allows for qualitative and quantitative measurements of the expression levels of thousands of genes simultaneously, and it has now been used both to classify cancers that are morphologically indistinguishable and to predict response to therapy. However, standard data analysis techniques are not trivial to employ on these large data sets. We report on the application of a panel of statistical and data mining methodologies to classify groups of samples based on expression of 12,000 genes derived from a high density oligonucleotide microarray analysis of highly purified plasma cells from newly diagnosed MM, MGUS, and normal healthy donors and the prediction errors for each of the models and each of the methods. Additionally, we report ROC curves for the comparisons of MM versus MGUS and results on predicting MGUS from a model that distinguishes MM samples from normal samples.
POSTERS
This poster describes the results of the first part of a two-part study to measure the effectiveness of providing users with learned hypotheses when performing a learning task. The first part of the study uses a simple domain based on the “East-West Challenge.” In this study, subjects are given a set of pictures of cartoon trains where half are labeled eastbound and half are labeled “westbound.” The subjects' task is to correctly label a second, unlabeled set of trains using the hypothesis they learned from the first set. Some subjects will receive the output of a relational learning system to aid them in this task. In the second part of this study, we will conduct a similar task using mass spectrometry data. These preliminary studies will lay the groundwork for the validation of other types of collaborative machine learning systems. This type of validation is not routinely done when working with machine learning systems, but instead it is assumed that any help a system can provide is beneficial. This study will challenge that assumption and propose instead that the type of assistance given by a system is as important as the quality of its learned hypotheses.
One of the technologies being used at Abbott for discovering biomarkers for disease states, toxicity and treatment is Surface Enhanced Laser Desorption and Ionization (SELDI) combined with time of flight (TOF) mass spectrometry. SELDI combines highly specific sample enrichment with a sensitive mass measurement, thereby allowing researchers to discover very low abundant (down to 10 fmol) biomarkers. However, the volume of data obtained with SELDI can quickly overwhelm the researcher's ability to analyze manually.
SELDI Filter is a program that I have developed during my internship at Abbott to automate much of this tedious analysis process. It integrates with the Ciphergen ProteinChip(c)software that the researchers are using to collect the SELDI data and presents its analysis results in either Microsoft Word(c)or Microsoft PowerPoint(c)for ease of integration into papers and presentations.
SELDI Filter currently automates the following analysis methods: ANOVA, Discriminant Analysis and Partition Analysis. However, due to its modular design, other techniques can quickly and easily be added.
These studies compare SVMs, Bayesian networks, decision trees, boosted decision trees and voting (ensembles of decision stumps) on a new microarray data set for cancer (multiple myeloma) with over 100 samples. They provide evidence for several important lessons about how these techniques should be used for mining microarray data.
![[Waddell Informatics]](http://www.waddellinformatics.com/images/header.jpg)