E-book: Data Mining for Genomics and Proteomics: Analysis of Gene and Protein Expression Data [Wiley Online]

Darius M. Dziuda

Other formats

Other digital carrier (Price: 92,58 €) - 04-Aug-2010

Format: 328 pages
Series: Wiley Series on Methods and Applications in Data Mining
Publication date: 20-Jul-2010
Publisher: Wiley-Blackwell
ISBN-10: 0470593415
ISBN-13: 9780470593417


Wiley Online
Price: 116,25 €*
* price that provides access for an unlimited number of simultaneous users for an unlimited time


Book homepage: https://onlinelibrary.wiley.com/doi/book/10.1002/9780470593417

Proper analysis and mining of the rapidly growing amount of available genomic and proteomic data is vital for advances in biomedical research. Data Mining for Genomics and Proteomics describes efficient methods for analysis of gene and protein expression data. Dr. Darius Dziuda demonstrates step by step how biomedical studies can and should be performed to maximize the chance of extracting new and useful biomedical knowledge from available data. Readers receive clear guidance on when to use particular data mining methods and why, along with the reasons why some popular approaches can lead to inferior results.

This book covers all aspects of gene and protein expression analysis, from technology, data preprocessing, quality assessment, and basic exploratory analysis to unsupervised and supervised learning algorithms, feature selection, and biomarker discovery. Also presented is a novel method for identification of the Informative Set of Genes, defined as a set containing all information significant for the differentiation of classes represented in training data. Special attention is given to multivariate biomarker discovery leading to parsimonious and generalizable classifiers. In addition, exercises and examples of hands-on analysis of real-world gene expression data sets give readers an opportunity to put the methods they have learned to practical use.

Data Mining for Genomics and Proteomics is an excellent resource for data mining specialists, bioinformaticians, computational biologists, biomedical scientists, computer scientists, molecular biologists, and life scientists. It is also ideal for upper-level undergraduate and graduate-level students of bioinformatics, data mining, computational biology, and biomedical sciences, as well as anyone interested in efficient methods of knowledge discovery based on high-dimensional data.

Data Mining for Genomics and Proteomics uses pragmatic examples and a complete case study to demonstrate step-by-step how biomedical studies can be used to maximize the chance of extracting new and useful biomedical knowledge from data. It is an excellent resource for students and professionals involved with gene or protein expression data in a variety of settings.

Preface
Acknowledgments

1 Introduction
1.1 Basic Terminology
1.1.1 The Central Dogma of Molecular Biology
1.1.2 Genome
1.1.3 Proteome
1.1.4 DNA (Deoxyribonucleic Acid)
1.1.5 RNA (Ribonucleic Acid)
1.1.6 mRNA (messenger RNA)
1.1.7 Genetic Code
1.1.8 Gene
1.1.9 Gene Expression and the Gene Expression Level
1.1.10 Protein
1.2 Overlapping Areas of Research
1.2.1 Genomics
1.2.2 Proteomics
1.2.3 Bioinformatics
1.2.4 Transcriptomics and Other -omics
1.2.5 Data Mining

2 Basic Analysis of Gene Expression Microarray Data
2.1 Introduction
2.2 Microarray Technology
2.2.1 Spotted Microarrays
2.2.2 Affymetrix GeneChip® Microarrays
2.2.3 Bead-Based Microarrays
2.3 Low-Level Preprocessing of Affymetrix Microarrays
2.3.1 MAS5
2.3.2 RMA
2.3.3 GCRMA
2.3.4 PLIER
2.4 Public Repositories of Microarray Data
2.4.1 Microarray Gene Expression Data Society (MGED) Standards
2.4.2 Public Databases
2.4.2.1 Gene Expression Omnibus (GEO)
2.4.2.2 ArrayExpress
2.5 Gene Expression Matrix
2.5.1 Elements of Gene Expression Microarray Data Analysis
2.6 Additional Preprocessing, Quality Assessment, and Filtering
2.6.1 Quality Assessment
2.6.2 Filtering
2.7 Basic Exploratory Data Analysis
2.7.1 t Test
2.7.1.1 t Test for Equal Variances
2.7.1.2 t Test for Unequal Variances
2.7.2 ANOVA F Test
2.7.3 SAM t Statistic
2.7.4 Limma
2.7.5 Adjustment for Multiple Comparisons
2.7.5.1 Single-Step Bonferroni Procedure
2.7.5.2 Single-Step Sidak Procedure
2.7.5.3 Step-Down Holm Procedure
2.7.5.4 Step-Up Benjamini and Hochberg Procedure
2.7.5.5 Permutation Based Multiplicity Adjustment
2.8 Unsupervised Learning (Taxonomy-Related Analysis)
2.8.1 Cluster Analysis
2.8.1.1 Measures of Similarity or Distance
2.8.1.2 k-Means Clustering
2.8.1.3 Hierarchical Clustering
2.8.1.4 Two-Way Clustering and Related Methods
2.8.2 Principal Component Analysis
2.8.3 Self-Organizing Maps
Exercises

3 Biomarker Discovery and Classification
3.1 Overview
3.1.1 Gene Expression Matrix...Again
3.1.2 Biomarker Discovery
3.1.3 Classification Systems
3.1.3.1 Parametric and Nonparametric Learning Algorithms
3.1.3.2 Terms Associated with Common Assumptions Underlying Parametric Learning Algorithms
3.1.3.3 Visualization of Classification Results
3.1.4 Validation of the Classification Model
3.1.4.1 Reclassification
3.1.4.2 Leave-One-Out and K-Fold Cross-Validation
3.1.4.3 External and Internal Cross-Validation
3.1.4.4 Holdout Method of Validation
3.1.4.5 Ensemble-Based Validation (Using Out-of-Bag Samples)
3.1.4.6 Validation on an Independent Data Set
3.1.5 Reporting Validation Results
3.1.5.1 Binary Classifiers
3.1.5.2 Multiclass Classifiers
3.1.6 Identifying Biological Processes Underlying the Class Differentiation
3.2 Feature Selection
3.2.1 Introduction
3.2.2 Univariate Versus Multivariate Approaches
3.2.3 Supervised Versus Unsupervised Methods
3.2.4 Taxonomy of Feature Selection Methods
3.2.4.1 Filters, Wrappers, Hybrid, and Embedded Models
3.2.4.2 Strategy: Exhaustive, Complete, Sequential, Random, and Hybrid Searches
3.2.4.3 Subset Evaluation Criteria
3.2.4.4 Search-Stopping Criteria
3.2.5 Feature Selection for Multiclass Discrimination
3.2.6 Regularization and Feature Selection
3.2.7 Stability of Biomarkers
3.3 Discriminant Analysis
3.3.1 Introduction
3.3.2 Learning Algorithm
3.3.3 A Stepwise Hybrid Feature Selection with T²
3.4 Support Vector Machines
3.4.1 Hard-Margin Support Vector Machines
3.4.2 Soft-Margin Support Vector Machines
3.4.3 Kernels
3.4.4 SVMs and Multiclass Discrimination
3.4.4.1 One-Versus-the-Rest Approach
3.4.4.2 Pairwise Approach
3.4.4.3 All-Classes-Simultaneously Approach
3.4.5 SVMs and Feature Selection: Recursive Feature Elimination
3.4.6 Summary
3.5 Random Forests
3.5.1 Introduction
3.5.2 Random Forests Learning Algorithm
3.5.3 Random Forests and Feature Selection
3.5.4 Summary
3.6 Ensemble Classifiers, Bootstrap Methods, and The Modified Bagging Schema
3.6.1 Ensemble Classifiers
3.6.1.1 Parallel Approach
3.6.1.2 Serial Approach
3.6.1.3 Ensemble Classifiers and Biomarker Discovery
3.6.2 Bootstrap Methods
3.6.3 Bootstrap and Linear Discriminant Analysis
3.6.4 The Modified Bagging Schema
3.7 Other Learning Algorithms
3.7.1 k-Nearest Neighbor Classifiers
3.7.2 Artificial Neural Networks
3.7.2.1 Perceptron
3.7.2.2 Multilayer Feedforward Neural Networks
3.7.2.3 Training the Network (Supervised Learning)
3.8 Eight Commandments of Gene Expression Analysis (for Biomarker Discovery)
Exercises

4 The Informative Set of Genes
4.1 Introduction
4.2 Definitions
4.3 The Method
4.3.1 Identification of the Informative Set of Genes
4.3.2 Primary Expression Patterns of the Informative Set of Genes
4.3.3 The Most Frequently Used Genes of the Primary Expression Patterns
4.4 Using the Informative Set of Genes to Identify Robust Multivariate Biomarkers
4.5 Summary
Exercises

5 Analysis of Protein Expression Data
5.1 Introduction
5.2 Protein Chip Technology
5.2.1 Antibody Microarrays
5.2.2 Peptide Microarrays
5.2.3 Protein Microarrays
5.2.4 Reverse Phase Microarrays
5.3 Two-Dimensional Gel Electrophoresis
5.4 MALDI-TOF and SELDI-TOF Mass Spectrometry
5.4.1 MALDI-TOF Mass Spectrometry
5.4.2 SELDI-TOF Mass Spectrometry
5.5 Preprocessing of Mass Spectrometry Data
5.5.1 Introduction
5.5.2 Elements of Preprocessing of SELDI-TOF Mass Spectrometry Data
5.5.2.1 Quality Assessment
5.5.2.2 Calibration
5.5.2.3 Baseline Correction
5.5.2.4 Noise Reduction and Smoothing
5.5.2.5 Peak Detection
5.5.2.6 Intensity Normalization
5.5.2.7 Peak Alignment Across Spectra
5.6 Analysis of Protein Expression Data
5.6.1 Additional Preprocessing
5.6.2 Basic Exploratory Data Analysis
5.6.3 Unsupervised Learning
5.6.4 Supervised Learning: Feature Selection and Biomarker Discovery
5.6.5 Supervised Learning: Classification Systems
5.7 Associating Biomarker Peaks with Proteins
5.7.1 Introduction
5.7.2 The Universal Protein Resource (UniProt)
5.7.3 Search Programs
5.7.4 Tandem Mass Spectrometry
5.8 Summary

6 Sketches for Selected Exercises
6.1 Introduction
6.2 Multiclass Discrimination (Exercise 3.2)
6.2.1 Data Set Selection, Downloading, and Consolidation
6.2.2 Filtering Probe Sets
6.2.3 Designing a Multistage Classification Schema
6.3 Identifying the Informative Set of Genes (Exercises 4.2-4.6)
6.3.1 The Informative Set of Genes
6.3.2 Primary Expression Patterns of the Informative Set
6.3.3 The Most Frequently Used Genes of the Primary Expression Patterns
6.4 Using the Informative Set of Genes to Identify Robust Multivariate Markers (Exercise 4.8)
6.5 Validating Biomarkers on an Independent Test Data Set (Exercise 4.8)
6.6 Using a Training Set that Combines More than One Data Set (Exercises 3.5 and 4.1-4.8)
6.6.1 Combining the Two Data Sets into a Single Training Set
6.6.2 Filtering Probe Sets of the Combined Data
6.6.3 Assessing the Discriminatory Power of the Biomarkers and Their Generalization
6.6.4 Identifying the Informative Set of Genes
6.6.5 Primary Expression Patterns of the Informative Set of Genes
6.6.6 The Most Frequently Used Genes of the Primary Expression Patterns
6.6.7 Using the Informative Set of Genes to Identify Robust Multivariate Markers
6.6.8 Validating Biomarkers on an Independent Test Data Set

References
Index

Darius M. Dziuda, PhD, is Associate Professor of Data Mining and Statistics in the Department of Mathematical Sciences at Central Connecticut State University (CCSU). His research and professional activities have been focused on efficient data mining of biomedical data and on methods for identification of parsimonious multivariate biomarkers for medical diagnosis, prognosis, personalized medicine, and drug discovery. For CCSU's data mining program, Dr. Dziuda developed and teaches graduate-level courses on Data Mining for Genomics and Proteomics and on Biomarker Discovery.

Permalink: https://www.kriso.ee/db/9780470593417_pe.html
