Data Wrangler's Handbook: Simple Tools for Powerful Results [Paperback]

  • Format: Paperback / softback, 176 pages, height x width x thickness: 228x152x10 mm, weight: 257 g
  • Publication date: 30-Aug-2019
  • Publisher: Association of College & Research Libraries
  • ISBN-10: 083891909X
  • ISBN-13: 9780838919095

Data manipulation and analysis are far easier than you might imagine—in fact, using tools that come standard with your desktop computer, you can learn how to extract, manipulate, and analyze data (and metadata) of any size and complexity. In this handbook, data wizard Banerjee will familiarize you with easily digestible but powerful concepts that will enable you to feel confident working with data. With his expert guidance, you'll learn how to

  • use a single-word command to sort files of any size by any criteria, identify duplicates, and perform numerous other common library tasks;
  • understand data formats, delimited text and CSV files, XML, JSON, scripting, and other key components of data;
  • undertake more sophisticated tasks such as comparing files, converting data from one format to another, reformatting values, combining data from multiple files, and communicating with APIs (Application Programming Interfaces);
  • save time and stress through simple techniques for transforming text, recognizing symbols that perform important tasks, a Regular Expression cheat sheet, a glossary, and other tools.

Library technologists and those involved in maintaining and analyzing data and metadata will find Banerjee’s resource essential.
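
To give a flavor of the sorting, counting, and duplicate-finding tasks listed above, here is a minimal sketch. It is not taken from the book, which works with standard command-line tools; Python is used here purely for illustration, and the function names `count_entries` and `duplicated_entries` and the sample call numbers are hypothetical.

```python
from collections import Counter


def count_entries(lines):
    """Return (entry, count) pairs, most frequent first.

    Blank lines are ignored and surrounding whitespace is stripped.
    Ties are broken alphabetically so the order is deterministic.
    """
    cleaned = [line.strip() for line in lines if line.strip()]
    counts = Counter(cleaned)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def duplicated_entries(lines):
    """Return only the entries that occur more than once."""
    return [entry for entry, n in count_entries(lines) if n > 1]


# Hypothetical sample data: one entry per line, as in a plain text file.
sample = ["QA76.9", "Z699", "QA76.9", "Z678", "Z699", "QA76.9"]

for entry, n in count_entries(sample):
    print(n, entry)
```

For a real file, the same functions could be fed an open file handle, for example `count_entries(open("entries.txt", encoding="utf-8"))`, where `entries.txt` is a placeholder name.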



Banerjee, who has worked with data in academic, government, and nonprofit settings, offers a handbook for librarians that explains simple methods that can be used on any computer for managing, extracting, or analyzing text-based data. Focusing on the most essential information, he describes the computer environment and basic concepts for navigating it, including finding the command line; how to apply command line concepts; formats and performing sophisticated operations on delimited text, XML (eXtensible Markup Language), JSON (JavaScript Object Notation), and other formats; how to simplify complicated data problems; tools and techniques for delimited texts, XML, and JSON; scripting; and solving common problems, such as viewing large files, locating files with particular data or characteristics, working with internal metadata or APIs (Application Programming Interfaces), and combining data from different sources. The book ends with commands and functions useful in a library context. Annotation ©2019 Ringgold, Inc., Portland, OR (protoview.com)
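
As an illustration of the kind of delimited-text problem described above (commas, quotation marks, and line breaks inside CSV fields), the following sketch converts CSV to tab-delimited text. It is an assumption-laden example rather than the book's own method: the book relies on command-line tools, while this uses Python's standard `csv` module, and the function name `csv_to_tsv` and the sample records are invented for the example.

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Convert CSV text to tab-delimited text.

    The csv module handles quoted fields, so commas and line breaks
    inside quotation marks stay within their field. Any run of
    whitespace inside a field (including tabs and newlines) is
    collapsed to a single space so it cannot break the output rows.
    """
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO()
    for row in reader:
        cleaned = [" ".join(field.split()) for field in row]
        out.write("\t".join(cleaned) + "\n")
    return out.getvalue()


# Hypothetical sample: a quoted comma and a multiline field.
sample_csv = (
    "title,author,year\n"
    '"Handbook, The","Doe, Jane",2019\n'
    '"Multi\nline title",Roe,2020\n'
)

print(csv_to_tsv(sample_csv), end="")
```

Collapsing embedded whitespace is a design choice made for this sketch; a workflow that must preserve line breaks inside fields would need a different output convention.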

Reviews

"I highly recommend The Data Wrangler's Handbook for anyone who now manipulates data or may need to do so in the future. In Banerjee's words, 'If these tasks [that require data wrangling] sound intimidating, this book is for you. You will understand everything in this book even if you have no special technical knowledge or programming experience.'" Technicalities

List of Figures and Tables
Acknowledgments
Introduction
Chapter 1 Getting Started with the Command Line
    Finding the Command Line
        Mac
        Windows
    Meet the Command Line
Chapter 2 Command Line Concepts
    Two Powerful Symbols
        Direct Output to a File (Greater than Symbol)
        Direct Output to Another Program (Pipe Symbol)
    Command Substitution
    Regular Expressions: The Swiss Army Knife for Data
        Literal Characters
        Special Characters
        Wildcard Characters
        Logical Operators
        Grouping
    Scripting
Chapter 3 Understanding Formats (David Forero)
Chapter 4 Simplify Complicated Problems
    Isolating Specific Data Elements
    Converting Data into Formats That Are Easier to Work With
Chapter 5 Delimited Text
    CSV (Comma Separated Values)
        Commas and Quotation Marks in CSV Files
        Multiline Fields in CSV Files
    Multivalued Fields in Delimited Files
Chapter 6 XML
    So What Is XML, Really?
    What Makes XML So Useful?
    Why Is XML So Easy?
        DOM (Document Object Model)
        XPath
        XSLT (eXtensible Stylesheet Language Transformations)
    Working with Large XML Files
    Working with Complex XML Files
    XmlStarlet
    Installing XmlStarlet
    Converting XML Documents
Chapter 7 JSON (JavaScript Object Notation)
Chapter 8 Scripting
    Variables
    Arguments
    Conditional Execution
    Loops
Chapter 9 Solving Common Problems
    Viewing Large Files
    Locating Files That Contain Particular Data
    Finding Files with Specific Characteristics
    Working with Internal Metadata
    Working with APIs
    Combining Data from Different Sources
    Other Tasks
Chapter 10 Conclusions
    One-Line Wonders
    Locating, Viewing, and Performing Basic File Operations
        Combine Information from Multiple Files into a Single File
        Combine Three Files, Each Consisting of a Single Column, into a Three-Column Table
        Extract 1,000 Random Lines or Records from a File
        Find Files with Specific Characteristics
        Find All Lines in All Files in the Current Directory as Well as All Subdirectories Containing a Regular Expression
        Identify All Files in Current Directories and Subdirectories That Contain a Value
        List All Files in Current Directory and Subdirectories over a 100 MB in Order of Decreasing Size
        List the Names, Pixel Dimensions, and File Sizes of All Files in the Current Directory and Subdirectories in Tab Delimited Format
        Print Line Number of File That Match Occurred On
        Split Large Files into Smaller Chunks with Each File Breaking on a Line
        View 200 Characters Starting at Position 385621 in a File
        View Lines 4369-4374 of a File
    Retrieving and Sending Information over a Network
        Retrieve a Document from the Web and Send It to a File
        Send an XML Document to an API Requiring HTTP Authentication
    Sorting, Counting, Deduplication, and File Comparison
        Combine Two Files on a Common Field
        Compare Two Sorted Files
        Count Occurrences for Each Entry in a File, Listed in Order of Decreasing Frequency
        Count Records Containing an Expression
        Count Words, Lines, and Characters in File
        Identify All Unique Entries and Supply a Count of How Many Times Each Occurs
        Sort a File and Remove Duplicates, Show Only Duplicated Entries, or Show Only Unique Entries
    Useful Scripting Operations
        Capture Parameters Passed to a Script
        Divide a Line into Parameters
        Iterate through Every Item in Parameter List
        Perform a Loop
        Perform an Operation Conditionally
        Run a Script on Every Line of a File
        Send the Output of a Command as Arguments to Another Command
        Send the Output of a Command to Another Command
        Send the Output of a Command to a File
        Store the Output of a Command in a Variable
        Use Foreign Character Sets in a Terminal Window
    Transforming Text
        Convert File of Dates to YYYY-MM-DD Format
        Convert to Title Case
        Convert to Upper Case
        Convert List of Names from Direct Order to Indirect Order
        Extract and Manipulate All Lines in a File That Match a Complex Pattern
        Extract and Manipulate All Entries in All Files in an Entire Directory Hierarchy That Match a Pattern
        Remove Lines from a File That Match a Pattern
        Remove Carriage Return Characters Inserted by Windows Programs from a File
        Remove Newline Characters from a File
        Replace Newlines in a File with Character 7 (Bell)
        Replace Search_Expr with Replace_Expr Only on Lines That Contain Condition_Expr
        Replace Search_Expr with Replace_Expr Except on Lines That Contain Condition_Expr
        Replace Smart Quotes with Straight Quotes
    Working with Delimited Files
        Convert Comma Delimited File Where Some Values Are Quoted and Some Values Are Not to Tab Delimited
        Convert Multiline Records to Table
        Extract Individual Fields from Files
        Find the Most Common Values in the Second Field of a File
        Find All Lines in Tab Delimited File Not Containing Six Fields
        Fix Delimited File That Contains Line Breaks in Fields
        Remove Trailing and Leading Whitespace from Tab Delimited Data Fields
        Reorder Fields in a Tab Delimited File
    Working with JSON and XML
        Add an Attribute to an XML Document
        Add an Element to an XML Document
        Apply XSLT Stylesheet to XML Document
        Convert JSON to Tab Delimited Format
        Delete Elements, Attributes, or Values Based on XPath Expressions
        Display Structure of XML File
        Pretty Print JSON Document
        Pretty Print XML Document
Glossary
Symbols That Perform Important Tasks
Useful Commands
Regular Expression Cheat Sheet
Index

Kyle Banerjee has wrangled data for diverse purposes in academic, government, and nonprofit environments since 1996. A firm believer that understanding people is the key to building services of the future from the systems and data of the past, his professional interests revolve around understanding workflows and identifying opportunities in data previously thought inconsistent or incomplete. He has published several books and numerous articles on a variety of topics related to applying technology in library settings.