E-book: Data Mining for Genomics and Proteomics: Analysis of Gene and Protein Expression Data [Wiley Online]

  • Wiley Online
  • Price: 116,25 €*
  • * price that provides access for an unlimited number of concurrent users for an unlimited time
Proper analysis and mining of the rapidly growing amount of available genomic and proteomic data is vital for advances in biomedical research. Data Mining for Genomics and Proteomics describes efficient methods for analysis of gene and protein expression data. Dr. Darius Dziuda demonstrates step by step how biomedical studies can and should be performed to maximize the chance of extracting new and useful biomedical knowledge from available data. Readers receive clear guidance on when to use particular data mining methods and why, along with the reasons why some popular approaches can lead to inferior results.

This book covers all aspects of gene and protein expression analysis, from technology, data preprocessing, quality assessment, and basic exploratory analysis to unsupervised and supervised learning algorithms, feature selection, and biomarker discovery. Also presented is a novel method for identification of the Informative Set of Genes, defined as a set containing all information significant for the differentiation of classes represented in training data. Special attention is given to multivariate biomarker discovery leading to parsimonious and generalizable classifiers. In addition, exercises and examples of hands-on analysis of real-world gene expression data sets give readers an opportunity to put the methods they have learned to practical use.

Data Mining for Genomics and Proteomics is an excellent resource for data mining specialists, bioinformaticians, computational biologists, biomedical scientists, computer scientists, molecular biologists, and life scientists. It is also ideal for upper-level undergraduate and graduate-level students of bioinformatics, data mining, computational biology, and biomedical sciences, as well as anyone interested in efficient methods of knowledge discovery based on high-dimensional data.

Data Mining for Genomics and Proteomics uses pragmatic examples and a complete case study to demonstrate step-by-step how biomedical studies can be used to maximize the chance of extracting new and useful biomedical knowledge from data. It is an excellent resource for students and professionals involved with gene or protein expression data in a variety of settings.
Preface
Acknowledgments
1 Introduction
1.1 Basic Terminology
1.1.1 The Central Dogma of Molecular Biology
1.1.2 Genome
1.1.3 Proteome
1.1.4 DNA (Deoxyribonucleic Acid)
1.1.5 RNA (Ribonucleic Acid)
1.1.6 mRNA (messenger RNA)
1.1.7 Genetic Code
1.1.8 Gene
1.1.9 Gene Expression and the Gene Expression Level
1.1.10 Protein
1.2 Overlapping Areas of Research
1.2.1 Genomics
1.2.2 Proteomics
1.2.3 Bioinformatics
1.2.4 Transcriptomics and Other -omics
1.2.5 Data Mining
2 Basic Analysis of Gene Expression Microarray Data
2.1 Introduction
2.2 Microarray Technology
2.2.1 Spotted Microarrays
2.2.2 Affymetrix GeneChip® Microarrays
2.2.3 Bead-Based Microarrays
2.3 Low-Level Preprocessing of Affymetrix Microarrays
2.3.1 MAS5
2.3.2 RMA
2.3.3 GCRMA
2.3.4 PLIER
2.4 Public Repositories of Microarray Data
2.4.1 Microarray Gene Expression Data Society (MGED) Standards
2.4.2 Public Databases
2.4.2.1 Gene Expression Omnibus (GEO)
2.4.2.2 ArrayExpress
2.5 Gene Expression Matrix
2.5.1 Elements of Gene Expression Microarray Data Analysis
2.6 Additional Preprocessing, Quality Assessment, and Filtering
2.6.1 Quality Assessment
2.6.2 Filtering
2.7 Basic Exploratory Data Analysis
2.7.1 t Test
2.7.1.1 t Test for Equal Variances
2.7.1.2 t Test for Unequal Variances
2.7.2 ANOVA F Test
2.7.3 SAM t Statistic
2.7.4 Limma
2.7.5 Adjustment for Multiple Comparisons
2.7.5.1 Single-Step Bonferroni Procedure
2.7.5.2 Single-Step Sidak Procedure
2.7.5.3 Step-Down Holm Procedure
2.7.5.4 Step-Up Benjamini and Hochberg Procedure
2.7.5.5 Permutation-Based Multiplicity Adjustment
2.8 Unsupervised Learning (Taxonomy-Related Analysis)
2.8.1 Cluster Analysis
2.8.1.1 Measures of Similarity or Distance
2.8.1.2 k-Means Clustering
2.8.1.3 Hierarchical Clustering
2.8.1.4 Two-Way Clustering and Related Methods
2.8.2 Principal Component Analysis
2.8.3 Self-Organizing Maps
Exercises
3 Biomarker Discovery and Classification
3.1 Overview
3.1.1 Gene Expression Matrix...Again
3.1.2 Biomarker Discovery
3.1.3 Classification Systems
3.1.3.1 Parametric and Nonparametric Learning Algorithms
3.1.3.2 Terms Associated with Common Assumptions Underlying Parametric Learning Algorithms
3.1.3.3 Visualization of Classification Results
3.1.4 Validation of the Classification Model
3.1.4.1 Reclassification
3.1.4.2 Leave-One-Out and K-Fold Cross-Validation
3.1.4.3 External and Internal Cross-Validation
3.1.4.4 Holdout Method of Validation
3.1.4.5 Ensemble-Based Validation (Using Out-of-Bag Samples)
3.1.4.6 Validation on an Independent Data Set
3.1.5 Reporting Validation Results
3.1.5.1 Binary Classifiers
3.1.5.2 Multiclass Classifiers
3.1.6 Identifying Biological Processes Underlying the Class Differentiation
3.2 Feature Selection
3.2.1 Introduction
3.2.2 Univariate Versus Multivariate Approaches
3.2.3 Supervised Versus Unsupervised Methods
3.2.4 Taxonomy of Feature Selection Methods
3.2.4.1 Filters, Wrappers, Hybrid, and Embedded Models
3.2.4.2 Strategy: Exhaustive, Complete, Sequential, Random, and Hybrid Searches
3.2.4.3 Subset Evaluation Criteria
3.2.4.4 Search-Stopping Criteria
3.2.5 Feature Selection for Multiclass Discrimination
3.2.6 Regularization and Feature Selection
3.2.7 Stability of Biomarkers
3.3 Discriminant Analysis
3.3.1 Introduction
3.3.2 Learning Algorithm
3.3.3 A Stepwise Hybrid Feature Selection with T²
3.4 Support Vector Machines
3.4.1 Hard-Margin Support Vector Machines
3.4.2 Soft-Margin Support Vector Machines
3.4.3 Kernels
3.4.4 SVMs and Multiclass Discrimination
3.4.4.1 One-Versus-the-Rest Approach
3.4.4.2 Pairwise Approach
3.4.4.3 All-Classes-Simultaneously Approach
3.4.5 SVMs and Feature Selection: Recursive Feature Elimination
3.4.6 Summary
3.5 Random Forests
3.5.1 Introduction
3.5.2 Random Forests Learning Algorithm
3.5.3 Random Forests and Feature Selection
3.5.4 Summary
3.6 Ensemble Classifiers, Bootstrap Methods, and the Modified Bagging Schema
3.6.1 Ensemble Classifiers
3.6.1.1 Parallel Approach
3.6.1.2 Serial Approach
3.6.1.3 Ensemble Classifiers and Biomarker Discovery
3.6.2 Bootstrap Methods
3.6.3 Bootstrap and Linear Discriminant Analysis
3.6.4 The Modified Bagging Schema
3.7 Other Learning Algorithms
3.7.1 k-Nearest Neighbor Classifiers
3.7.2 Artificial Neural Networks
3.7.2.1 Perceptron
3.7.2.2 Multilayer Feedforward Neural Networks
3.7.2.3 Training the Network (Supervised Learning)
3.8 Eight Commandments of Gene Expression Analysis (for Biomarker Discovery)
Exercises
4 The Informative Set of Genes
4.1 Introduction
4.2 Definitions
4.3 The Method
4.3.1 Identification of the Informative Set of Genes
4.3.2 Primary Expression Patterns of the Informative Set of Genes
4.3.3 The Most Frequently Used Genes of the Primary Expression Patterns
4.4 Using the Informative Set of Genes to Identify Robust Multivariate Biomarkers
4.5 Summary
Exercises
5 Analysis of Protein Expression Data
5.1 Introduction
5.2 Protein Chip Technology
5.2.1 Antibody Microarrays
5.2.2 Peptide Microarrays
5.2.3 Protein Microarrays
5.2.4 Reverse Phase Microarrays
5.3 Two-Dimensional Gel Electrophoresis
5.4 MALDI-TOF and SELDI-TOF Mass Spectrometry
5.4.1 MALDI-TOF Mass Spectrometry
5.4.2 SELDI-TOF Mass Spectrometry
5.5 Preprocessing of Mass Spectrometry Data
5.5.1 Introduction
5.5.2 Elements of Preprocessing of SELDI-TOF Mass Spectrometry Data
5.5.2.1 Quality Assessment
5.5.2.2 Calibration
5.5.2.3 Baseline Correction
5.5.2.4 Noise Reduction and Smoothing
5.5.2.5 Peak Detection
5.5.2.6 Intensity Normalization
5.5.2.7 Peak Alignment Across Spectra
5.6 Analysis of Protein Expression Data
5.6.1 Additional Preprocessing
5.6.2 Basic Exploratory Data Analysis
5.6.3 Unsupervised Learning
5.6.4 Supervised Learning: Feature Selection and Biomarker Discovery
5.6.5 Supervised Learning: Classification Systems
5.7 Associating Biomarker Peaks with Proteins
5.7.1 Introduction
5.7.2 The Universal Protein Resource (UniProt)
5.7.3 Search Programs
5.7.4 Tandem Mass Spectrometry
5.8 Summary
6 Sketches for Selected Exercises
6.1 Introduction
6.2 Multiclass Discrimination (Exercise 3.2)
6.2.1 Data Set Selection, Downloading, and Consolidation
6.2.2 Filtering Probe Sets
6.2.3 Designing a Multistage Classification Schema
6.3 Identifying the Informative Set of Genes (Exercises 4.2-4.6)
6.3.1 The Informative Set of Genes
6.3.2 Primary Expression Patterns of the Informative Set
6.3.3 The Most Frequently Used Genes of the Primary Expression Patterns
6.4 Using the Informative Set of Genes to Identify Robust Multivariate Markers (Exercise 4.8)
6.5 Validating Biomarkers on an Independent Test Data Set (Exercise 4.8)
6.6 Using a Training Set that Combines More than One Data Set (Exercises 3.5 and 4.1-4.8)
6.6.1 Combining the Two Data Sets into a Single Training Set
6.6.2 Filtering Probe Sets of the Combined Data
6.6.3 Assessing the Discriminatory Power of the Biomarkers and Their Generalization
6.6.4 Identifying the Informative Set of Genes
6.6.5 Primary Expression Patterns of the Informative Set of Genes
6.6.6 The Most Frequently Used Genes of the Primary Expression Patterns
6.6.7 Using the Informative Set of Genes to Identify Robust Multivariate Markers
6.6.8 Validating Biomarkers on an Independent Test Data Set
References
Index
Darius M. Dziuda, PhD, is Associate Professor of Data Mining and Statistics in the Department of Mathematical Sciences at Central Connecticut State University (CCSU). His research and professional activities have been focused on efficient data mining of biomedical data and on methods for identification of parsimonious multivariate biomarkers for medical diagnosis, prognosis, personalized medicine, and drug discovery. For CCSU's data mining program, Dr. Dziuda developed and teaches graduate-level courses on Data Mining for Genomics and Proteomics and on Biomarker Discovery.