E-book: Feature Engineering for Machine Learning and Data Analytics

Edited by Guozhu Dong (Wright State University, Ohio, USA) and Huan Liu (Arizona State University, Arizona, USA)
  • Format: EPUB+DRM
  • Price: 58,49 €*
  • * The price is final; no other discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means you must install special software to read it. You must also create an Adobe ID. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), you must install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, you must install Adobe Digital Editions (a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is probably already installed on your computer).

    This e-book cannot be read on an Amazon Kindle.

Feature engineering plays a vital role in big data analytics. Machine learning and data mining algorithms cannot work without data. Little can be achieved if there are few features to represent the underlying data objects, and the quality of results of those algorithms largely depends on the quality of the available features. Feature Engineering for Machine Learning and Data Analytics provides a comprehensive introduction to feature engineering, including feature generation, feature extraction, feature transformation, feature selection, and feature analysis and evaluation.

The book presents key concepts, methods, examples, and applications, as well as chapters on feature engineering for major data types such as texts, images, sequences, time series, graphs, streaming data, software engineering data, Twitter data, and social media data. It also contains generic feature generation approaches, as well as methods for generating tried-and-tested, hand-crafted, domain-specific features.

The first chapter defines the concepts of features and feature engineering, offers an overview of the book, and provides pointers to topics not covered in this book. The next six chapters are devoted to feature engineering, including feature generation, for specific data types. The subsequent four chapters cover generic approaches for feature engineering, namely feature selection, feature-transformation-based feature engineering, deep-learning-based feature engineering, and pattern-based feature generation and engineering. The last three chapters discuss feature engineering for social bot detection, software management, and Twitter-based applications, respectively.

This book can be used as a reference for data analysts, big data scientists, data preprocessing workers, project managers, project developers, prediction modelers, professors, researchers, graduate students, and upper-level undergraduate students. It can also be used as the primary text for courses on feature engineering, or as a supplement for courses on machine learning, data mining, and big data analytics.

Preface
Contributors
1 Preliminaries and Overview
Guozhu Dong
Huan Liu
1.1 Preliminaries
1.1.1 Features
1.1.2 Feature Engineering
1.1.3 Machine Learning and Data Analytic Tasks
1.2 Overview of the Chapters
1.3 Beyond this Book
1.3.1 Feature Engineering for Specific Data Types
1.3.2 Feature Engineering on Non-Data-Specific Topics
I Feature Engineering for Various Data Types
2 Feature Engineering for Text Data
Chase Geigle
Qiaozhu Mei
ChengXiang Zhai
2.1 Introduction
2.2 Overview of Text Representation
2.3 Text as Strings
2.4 Sequence of Words Representation
2.5 Bag of Words Representation
2.5.1 Term Weighting
2.5.2 Beyond Single Words
2.6 Structural Representation of Text
2.6.1 Semantic Structure Features
2.7 Latent Semantic Representation
2.7.1 Latent Semantic Analysis
2.7.2 Probabilistic Latent Semantic Analysis
2.7.3 Latent Dirichlet Allocation
2.8 Explicit Semantic Representation
2.9 Embeddings for Text Representation
2.9.1 Matrix Factorization for Word Embeddings
2.9.2 Neural Networks for Word Embeddings
2.9.3 Document Representations from Word Embeddings
2.10 Context-Sensitive Text Representation
2.11 Summary
3 Feature Extraction and Learning for Visual Data
Parag S. Chandakkar
Ragav Venkatesan
Baoxin Li
3.1 Classical Visual Feature Representations
3.1.1 Color Features
3.1.2 Texture Features
3.1.3 Shape Features
3.2 Latent Feature Extraction
3.2.1 Principal Component Analysis
3.2.2 Kernel Principal Component Analysis
3.2.3 Multidimensional Scaling
3.2.4 Isomap
3.2.5 Laplacian Eigenmaps
3.3 Deep Image Features
3.3.1 Convolutional Neural Networks
3.3.1.1 The Dot-Product Layer
3.3.1.2 The Convolution Layer
3.3.2 CNN Architecture Design
3.3.3 Fine-Tuning Off-the-Shelf Neural Networks
3.3.4 Summary and Conclusions
4 Feature-Based Time-Series Analysis
Ben D. Fulcher
4.1 Introduction
4.1.1 The Time Series Data Type
4.1.2 Time-Series Characterization
4.1.3 Applications of Time-Series Analysis
4.2 Feature-Based Representations of Time Series
4.3 Global Features
4.3.1 Examples of Global Features
4.3.2 Massive Feature Vectors and Highly Comparative Time-Series Analysis
4.4 Subsequence Features
4.4.1 Interval Features
4.4.2 Shapelets
4.4.3 Pattern Dictionaries
4.5 Combining Time-Series Representations
4.6 Feature-Based Forecasting
4.7 Summary and Outlook
5 Feature Engineering for Data Streams
Yao Ma
Jiliang Tang
Charu Aggarwal
5.1 Introduction
5.2 Streaming Settings
5.3 Linear Methods for Streaming Feature Construction
5.3.1 Principal Component Analysis for Data Streams
5.3.2 Linear Discriminant Analysis for Data Streams
5.4 Non-Linear Methods for Streaming Feature Construction
5.4.1 Locally Linear Embedding for Data Streams
5.4.2 Kernel Learning for Data Streams
5.4.3 Neural Networks for Data Streams
5.4.4 Discussion
5.5 Feature Selection for Data Streams with Streaming Features
5.5.1 The Grafting Algorithm
5.5.2 The Alpha-Investing Algorithm
5.5.3 The Online Streaming Feature Selection Algorithm
5.5.4 Unsupervised Streaming Feature Selection in Social Media
5.6 Feature Selection for Data Streams with Streaming Instances
5.6.1 Online Feature Selection
5.6.2 Unsupervised Feature Selection on Data Streams
5.7 Discussions and Challenges
5.7.1 Stability
5.7.2 Number of Features
5.7.3 Heterogeneous Streaming Data
6 Feature Generation and Feature Engineering for Sequences
Guozhu Dong
Lei Duan
Jyrki Nummenmaa
Peng Zhang
6.1 Introduction
6.2 Basics on Sequence Data and Sequence Patterns
6.3 Approaches to Using Patterns in Sequence Features
6.4 Traditional Pattern-Based Sequence Features
6.5 Mined Sequence Patterns for Use in Sequence Features
6.5.1 Frequent Sequence Patterns
6.5.2 Closed Sequential Patterns
6.5.3 Gap Constraints for Sequence Patterns
6.5.4 Partial Order Patterns
6.5.5 Periodic Sequence Patterns
6.5.6 Distinguishing Sequence Patterns
6.5.7 Pattern Matching for Sequences
6.6 Factors for Selecting Sequence Patterns as Features
6.7 Sequence Features Not Defined by Patterns
6.8 Sequence Databases
6.9 Concluding Remarks
7 Feature Generation for Graphs and Networks
Yuan Yao
Hanghang Tong
Feng Xu
Jian Lu
7.1 Introduction
7.2 Feature Types
7.3 Feature Generation
7.3.1 Basic Models
7.3.2 Extensions
7.3.3 Summary
7.4 Feature Usages
7.4.1 Multi-Label Classification
7.4.2 Link Prediction
7.4.3 Anomaly Detection
7.4.4 Visualization
7.5 Conclusions and Future Directions
7.6 Glossary
II General Feature Engineering Techniques
8 Feature Selection and Evaluation
Yun Li
Tao Li
8.1 Introduction
8.2 Feature Selection Frameworks
8.2.1 Search-Based Feature Selection Framework
8.2.2 Correlation-Based Feature Selection Framework
8.3 Advanced Topics for Feature Selection
8.3.1 Stable Feature Selection
8.3.2 Sparsity-Based Feature Selection
8.3.3 Multi-Source Feature Selection
8.3.4 Distributed Feature Selection
8.3.5 Multi-View Feature Selection
8.3.6 Multi-Label Feature Selection
8.3.7 Online Feature Selection
8.3.8 Privacy-Preserving Feature Selection
8.3.9 Adversarial Feature Selection
8.4 Future Work and Conclusion
9 Automating Feature Engineering in Supervised Learning
Udayan Khurana
9.1 Introduction
9.1.1 Challenges in Performing Feature Engineering
9.2 Terminology and Problem Definition
9.3 A Few Simple Approaches
9.4 Hierarchical Exploration of Feature Transformations
9.4.1 Transformation Graph
9.4.2 Transformation Graph Exploration
9.5 Learning Optimal Traversal Policy
9.5.1 Feature Exploration through Reinforcement Learning
9.6 Finding Effective Features without Model Training
9.6.1 Learning to Predict Useful Transformations
9.7 Miscellaneous
9.7.1 Other Related Work
9.7.2 Research Opportunities
9.7.3 Resources
10 Pattern-Based Feature Generation
Yunzhe Jia
James Bailey
Ramamohanarao Kotagiri
Christopher Leckie
10.1 Introduction
10.2 Preliminaries
10.2.1 Data and Patterns
10.2.2 Patterns for Non-Transactional Data
10.3 Framework of Pattern-Based Feature Generation
10.3.1 Pattern Mining
10.3.2 Pattern Selection
10.3.3 Feature Generation
10.4 Pattern Mining Algorithms
10.4.1 Frequent Pattern Mining
10.4.2 Contrast Pattern Mining
10.5 Pattern Selection Approaches
10.5.1 Post-Processing Pruning
10.5.2 In-Processing Pruning
10.6 Pattern-Based Feature Generation
10.6.1 Unsupervised Mapping Functions
10.6.2 Supervised Mapping Functions
10.6.3 Feature Generation for Sequence Data and Graph Data
10.6.4 Comparison with Similar Techniques
10.7 Pattern-Based Feature Generation for Classification
10.7.1 Problem Statement
10.7.2 Direct Classification in the Pattern Space
10.7.3 Indirect Classification in the Pattern Space
10.7.4 Connection with Stacking Technique
10.8 Pattern-Based Feature Generation for Clustering
10.8.1 Clustering in the Pattern Space
10.8.2 Subspace Clustering
10.9 Conclusion
11 Deep Learning for Feature Representation
Suhang Wang
Huan Liu
11.1 Introduction
11.2 Restricted Boltzmann Machine
11.2.1 Deep Belief Networks and Deep Boltzmann Machine
11.2.2 RBM for Real-Valued Data
11.3 AutoEncoder
11.3.1 Sparse Autoencoder
11.3.2 Denoising Autoencoder
11.3.3 Stacked Autoencoder
11.4 Convolutional Neural Networks
11.4.1 Transfer Feature Learning of CNN
11.5 Word Embedding and Recurrent Neural Networks
11.5.1 Word Embedding
11.5.2 Recurrent Neural Networks
11.5.3 Gated Recurrent Unit
11.5.4 Long Short-Term Memory
11.6 Generative Adversarial Networks and Variational Autoencoder
11.6.1 Generative Adversarial Networks
11.6.2 Variational Autoencoder
11.7 Discussion and Further Readings
III Feature Engineering in Special Applications
12 Feature Engineering for Social Bot Detection
Onur Varol
Clayton A. Davis
Filippo Menczer
Alessandro Flammini
12.1 Introduction
12.2 Social Bot Detection
12.2.1 Holistic Approach
12.2.2 Pairwise Account Comparison
12.2.3 Egocentric Analysis
12.3 Online Bot Detection Framework
12.3.1 Feature Extraction
12.3.1.1 User-Based Features
12.3.1.2 Friend Features
12.3.1.3 Network Features
12.3.1.4 Content and Language Features
12.3.1.5 Sentiment Features
12.3.1.6 Temporal Features
12.3.2 Possible Directions for Feature Engineering
12.3.3 Feature Analysis
12.3.4 Feature Selection
12.3.4.1 Feature Classes
12.3.4.2 Top Individual Features
12.4 Conclusions
12.5 Glossary
13 Feature Generation and Engineering for Software Analytics
Xin Xia
David Lo
13.1 Introduction
13.2 Features for Defect Prediction
13.2.1 File-level Defect Prediction
13.2.1.1 Code Features
13.2.1.2 Process Features
13.2.2 Just-in-time Defect Prediction
13.2.3 Prediction Models and Results
13.3 Features for Crash Release Prediction for Apps
13.3.1 Complexity Dimension
13.3.2 Time Dimension
13.3.3 Code Dimension
13.3.4 Diffusion Dimension
13.3.5 Commit Dimension
13.3.6 Text Dimension
13.3.7 Prediction Models and Results
13.4 Features from Mining Monthly Reports to Predict Developer Turnover
13.4.1 Working Hours
13.4.2 Task Report
13.4.3 Project
13.4.4 Prediction Models and Results
13.5 Summary
14 Feature Engineering for Twitter-Based Applications
Sanjaya Wijeratne
Amit Sheth
Shreyansh Bhatt
Lakshika Balasuriya
Hussein S. Al-Olimat
Manas Gaur
Amir Hossein Yazdavar
Krishnaprasad Thirunarayan
14.1 Introduction
14.2 Data Present in a Tweet
14.2.1 Tweet Text-Related Data
14.2.2 Twitter User-Related Data
14.2.3 Other Metadata
14.3 Common Types of Features Used in Twitter-Based Applications
14.3.1 Textual Features
14.3.2 Image and Video Features
14.3.3 Twitter Metadata-Related Features
14.3.4 Network Features
14.4 Twitter Feature Engineering in Selected Twitter-Based Studies
14.4.1 Twitter User Profile Classification
14.4.2 Assisting Coordination during Crisis Events
14.4.3 Location Extraction from Tweets
14.4.4 Studying the Mental Health Conditions of Depressed Twitter Users
14.4.5 Sentiment and Emotion Analysis on Twitter
14.5 Twitris: A Real-Time Social Media Analysis Platform
14.6 Conclusion
14.7 Acknowledgment
Index
Dr. Guozhu Dong is a professor of Computer Science and Engineering at Wright State University. He obtained his Ph.D. in Computer Science from the University of Southern California and his B.S. in Mathematics from Shandong University. Before joining Wright State University, he was a faculty member at Flinders University and then at the University of Melbourne. At Wright State University, he was recognized for Excellence in Research in the College of Engineering and Computer Science. His research interests are in data mining, machine learning, databases, data science, and artificial intelligence. He co-authored a book on Sequence Data Mining and co-edited a book on Contrast Data Mining. He has served on numerous conference program committees.

Dr. Huan Liu is a professor of Computer Science and Engineering at Arizona State University. He obtained his Ph.D. in Computer Science from the University of Southern California and his B.Eng. in Computer Science and Electrical Engineering from Shanghai JiaoTong University. Before he joined ASU, he worked at Telecom Australia Research Labs and was on the faculty at the National University of Singapore. At Arizona State University, he was recognized for excellence in teaching and research in Computer Science and Engineering and received the 2014 President's Award for Innovation. His research interests are in data mining, machine learning, social computing, and artificial intelligence, investigating interdisciplinary problems that arise in many real-world, data-intensive applications with high-dimensional data of disparate forms such as social media. His well-cited publications include books, book chapters, and encyclopedia entries, as well as conference and journal papers. He is a co-author of Social Media Mining: An Introduction, published by Cambridge University Press. He serves on journal editorial boards and numerous conference program committees, and is a founding organizer of the International Conference Series on Social Computing, Behavioral-Cultural Modeling, and Prediction. He is an IEEE Fellow. More can be found at http://www.public.asu.edu/~huanliu.