Data Mining for Genomics and Proteomics uses pragmatic examples and a complete case study to demonstrate step-by-step how biomedical studies can be used to maximize the chance of extracting new and useful biomedical knowledge from data. It is an excellent resource for students and professionals involved with gene or protein expression data in a variety of settings.
Preface. Acknowledgments. 1 Introduction. 1.1 Basic Terminology.
1.1.1 The Central Dogma of Molecular Biology. 1.1.2 Genome. 1.1.3 Proteome.
1.1.4 DNA (Deoxyribonucleic Acid). 1.1.5 RNA (Ribonucleic Acid). 1.1.6 mRNA
(messenger RNA). 1.1.7 Genetic Code. 1.1.8 Gene. 1.1.9 Gene Expression and
the Gene Expression Level. 1.1.10 Protein. 1.2 Overlapping Areas of
Research. 1.2.1 Genomics. 1.2.2 Proteomics. 1.2.3 Bioinformatics. 1.2.4
Transcriptomics and Other -omics. 1.2.5 Data Mining. 2 Basic Analysis of
Gene Expression Microarray Data. 2.1 Introduction. 2.2 Microarray
Technology. 2.2.1 Spotted Microarrays. 2.2.2 Affymetrix GeneChip®
Microarrays. 2.2.3 Bead-Based Microarrays. 2.3 Low-Level Preprocessing of
Affymetrix Microarrays. 2.3.1 MAS5. 2.3.2 RMA. 2.3.3 GCRMA. 2.3.4 PLIER.
2.4 Public Repositories of Microarray Data. 2.4.1 Microarray Gene Expression
Data Society (MGED) Standards. 2.4.2 Public Databases. 2.4.2.1 Gene
Expression Omnibus (GEO). 2.4.2.2 ArrayExpress. 2.5 Gene Expression Matrix.
2.5.1 Elements of Gene Expression Microarray Data Analysis. 2.6 Additional
Preprocessing, Quality Assessment, and Filtering. 2.6.1 Quality Assessment.
2.6.2 Filtering. 2.7 Basic Exploratory Data Analysis. 2.7.1 t Test. 2.7.1.1
t Test for Equal Variances. 2.7.1.2 t Test for Unequal Variances. 2.7.2 ANOVA
F Test. 2.7.3 SAM t Statistic. 2.7.4 Limma. 2.7.5 Adjustment for Multiple
Comparisons. 2.7.5.1 Single-Step Bonferroni Procedure. 2.7.5.2 Single-Step
Sidak Procedure. 2.7.5.3 Step-Down Holm Procedure. 2.7.5.4 Step-Up Benjamini
and Hochberg Procedure. 2.7.5.5 Permutation Based Multiplicity Adjustment.
2.8 Unsupervised Learning (Taxonomy-Related Analysis). 2.8.1 Cluster
Analysis. 2.8.1.1 Measures of Similarity or Distance. 2.8.1.2 K-Means
Clustering. 2.8.1.3 Hierarchical Clustering. 2.8.1.4 Two-Way Clustering and
Related Methods. 2.8.2 Principal Component Analysis. 2.8.3 Self-Organizing
Maps. Exercises. 3 Biomarker Discovery and Classification. 3.1 Overview.
3.1.1 Gene Expression Matrix ... Again. 3.1.2 Biomarker Discovery. 3.1.3
Classification Systems. 3.1.3.1 Parametric and Nonparametric Learning
Algorithms. 3.1.3.2 Terms Associated with Common Assumptions Underlying
Parametric Learning Algorithms. 3.1.3.3 Visualization of Classification
Results. 3.1.4 Validation of the Classification Model. 3.1.4.1
Reclassification. 3.1.4.2 Leave-One-Out and K-Fold Cross-Validation. 3.1.4.3
External and Internal Cross-Validation. 3.1.4.4 Holdout Method of Validation.
3.1.4.5 Ensemble-Based Validation (Using Out-of-Bag Samples). 3.1.4.6
Validation on an Independent Data Set. 3.1.5 Reporting Validation Results.
3.1.5.1 Binary Classifiers. 3.1.5.2 Multiclass Classifiers. 3.1.6 Identifying
Biological Processes Underlying the Class Differentiation. 3.2 Feature
Selection. 3.2.1 Introduction. 3.2.2 Univariate Versus Multivariate
Approaches. 3.2.3 Supervised Versus Unsupervised Methods. 3.2.4 Taxonomy of
Feature Selection Methods. 3.2.4.1 Filters, Wrappers, Hybrid, and Embedded
Models. 3.2.4.2 Strategy: Exhaustive, Complete, Sequential, Random, and
Hybrid Searches. 3.2.4.3 Subset Evaluation Criteria. 3.2.4.4 Search-Stopping
Criteria. 3.2.5 Feature Selection for Multiclass Discrimination. 3.2.6
Regularization and Feature Selection. 3.2.7 Stability of Biomarkers. 3.3
Discriminant Analysis. 3.3.1 Introduction. 3.3.2 Learning Algorithm. 3.3.3 A
Stepwise Hybrid Feature Selection with T². 3.4 Support Vector Machines.
3.4.1 Hard-Margin Support Vector Machines. 3.4.2 Soft-Margin Support Vector
Machines. 3.4.3 Kernels. 3.4.4 SVMs and Multiclass Discrimination. 3.4.4.1
One-Versus-the-Rest Approach. 3.4.4.2 Pairwise Approach. 3.4.4.3
All-Classes-Simultaneously Approach. 3.4.5 SVMs and Feature Selection:
Recursive Feature Elimination. 3.4.6 Summary. 3.5 Random Forests. 3.5.1
Introduction. 3.5.2 Random Forests Learning Algorithm. 3.5.3 Random Forests
and Feature Selection. 3.5.4 Summary. 3.6 Ensemble Classifiers, Bootstrap
Methods, and the Modified Bagging Schema. 3.6.1 Ensemble Classifiers.
3.6.1.1 Parallel Approach. 3.6.1.2 Serial Approach. 3.6.1.3 Ensemble
Classifiers and Biomarker Discovery. 3.6.2 Bootstrap Methods. 3.6.3 Bootstrap
and Linear Discriminant Analysis. 3.6.4 The Modified Bagging Schema. 3.7
Other Learning Algorithms. 3.7.1 k-Nearest Neighbor Classifiers. 3.7.2
Artificial Neural Networks. 3.7.2.1 Perceptron. 3.7.2.2 Multilayer
Feedforward Neural Networks. 3.7.2.3 Training the Network (Supervised
Learning). 3.8 Eight Commandments of Gene Expression Analysis (for Biomarker
Discovery). 4 The Informative Set of Genes. 4.1 Introduction. 4.2
Definitions. 4.3 The Method. 4.3.1 Identification of the Informative Set
of Genes. 4.3.2 Primary Expression Patterns of the Informative Set of Genes.
4.3.3 The Most Frequently Used Genes of the Primary Expression Patterns. 4.4
Using the Informative Set of Genes to Identify Robust Multivariate
Biomarkers. 4.5 Summary. 5 Analysis of Protein Expression Data. 5.1
Introduction. 5.2 Protein Chip Technology. 5.2.1 Antibody Microarrays.
5.2.2 Peptide Microarrays. 5.2.3 Protein Microarrays. 5.2.4 Reverse Phase
Microarrays. 5.3 Two-Dimensional Gel Electrophoresis. 5.4 MALDI-TOF and
SELDI-TOF Mass Spectrometry. 5.4.1 MALDI-TOF Mass Spectrometry. 5.4.2
SELDI-TOF Mass Spectrometry. 5.5 Preprocessing of Mass Spectrometry Data.
5.5.1 Introduction. 5.5.2 Elements of Preprocessing of SELDI-TOF Mass
Spectrometry Data. 5.5.2.1 Quality Assessment. 5.5.2.2 Calibration. 5.5.2.3
Baseline Correction. 5.5.2.4 Noise Reduction and Smoothing. 5.5.2.5 Peak
Detection. 5.5.2.6 Intensity Normalization. 5.5.2.7 Peak Alignment Across
Spectra. 5.6 Analysis of Protein Expression Data. 5.6.1 Additional
Preprocessing. 5.6.2 Basic Exploratory Data Analysis. 5.6.3 Unsupervised
Learning. 5.6.4 Supervised Learning: Feature Selection and Biomarker
Discovery. 5.6.5 Supervised Learning: Classification Systems. 5.7 Associating
Biomarker Peaks with Proteins. 5.7.1 Introduction. 5.7.2 The Universal
Protein Resource (UniProt). 5.7.3 Search Programs. 5.7.4 Tandem Mass
Spectrometry. 5.8 Summary. 6 Sketches for Selected Exercises. 6.1
Introduction. 6.2 Multiclass Discrimination (Exercise 3.2). 6.2.1 Data Set
Selection, Downloading, and Consolidation. 6.2.2 Filtering Probe Sets. 6.2.3
Designing a Multistage Classification Schema. 6.3 Identifying the
Informative Set of Genes (Exercises 4.2-4.6). 6.3.1 The Informative Set of
Genes. 6.3.2 Primary Expression Patterns of the Informative Set. 6.3.3 The
Most Frequently Used Genes of the Primary Expression Patterns. 6.4 Using the
Informative Set of Genes to Identify Robust Multivariate Markers (Exercise
4.8). 6.5 Validating Biomarkers on an Independent Test Data Set (Exercise
4.8). 6.6 Using a Training Set that Combines More than One Data Set
(Exercises 3.5 and 4.1-4.8). 6.6.1 Combining the Two Data Sets into a Single
Training Set. 6.6.2 Filtering Probe Sets of the Combined Data. 6.6.3
Assessing the Discriminatory Power of the Biomarkers and Their
Generalization. 6.6.4 Identifying the Informative Set of Genes. 6.6.5 Primary
Expression Patterns of the Informative Set of Genes. 6.6.6 The Most
Frequently Used Genes of the Primary Expression Patterns. 6.6.7 Using the
Informative Set of Genes to Identify Robust Multivariate Markers. 6.6.8
Validating Biomarkers on an Independent Test Data Set. References. Index.
Darius M. Dziuda, PhD, is Associate Professor of Data Mining and Statistics in the Department of Mathematical Sciences at Central Connecticut State University (CCSU). His research and professional activities have focused on efficient data mining of biomedical data and on methods for identifying parsimonious multivariate biomarkers for medical diagnosis, prognosis, personalized medicine, and drug discovery. For CCSU's data mining program, Dr. Dziuda developed and teaches graduate-level courses on Data Mining for Genomics and Proteomics and on Biomarker Discovery.