
Big Data Mining 3/e
Book Introduction
The development of the web, social media, mobile activity, sensors, e-commerce, and many other applications is generating massive amounts of data, and data mining can extract useful information from this data.
This book addresses core challenges in data mining and focuses on practical algorithms applicable to large-scale data.
Table of Contents
Chapter 1.
Data Mining
1.1 What is data mining?
1.1.1 Modeling
1.1.2 Statistical Modeling
1.1.3 Machine Learning
1.1.4 Computational Approach to Modeling
1.1.5 Summary
1.1.6 Feature Extraction
1.2 Statistical Limitations of Data Mining
1.2.1 Total Information Awareness
1.2.2 Bonferroni's Principle
1.2.3 An Example of Bonferroni's Principle
1.2.4 Section 1.2 Practice Problems
1.3 Useful Facts to Know
1.3.1 Word importance in documents
1.3.2 Hash Function
1.3.3 Index
1.3.4 Auxiliary storage devices
1.3.5 Base of natural logarithms
1.3.6 Power Law
1.3.7 Section 1.3 Practice Problems
1.4 Overview of this book
1.5 Summary
1.6 References
Chapter 2.
MapReduce and a New Software Stack
2.1 Distributed File System
2.1.1 Physical structure of nodes
2.1.2 Large File System Structure
2.2 MapReduce
2.2.1 Map Task
2.2.2 Grouping by key
2.2.3 Reduce Task
2.2.4 Combiner
2.2.5 More detailed explanation of MapReduce execution
2.2.6 Handling Node Failures
2.2.7 Section 2.2 Practice Problems
2.3 Algorithms using MapReduce
2.3.1 Matrix-vector multiplication using MapReduce
2.3.2 When the Vector v Cannot Fit in Main Memory
2.3.3 Relational Algebra Operations
2.3.4 Selection Operations Using MapReduce
2.3.5 Projection Operations Using MapReduce
2.3.6 Union, intersection, and difference operations using MapReduce
2.3.7 Natural Join Operations Using MapReduce
2.3.8 Grouping and Aggregation Operations Using MapReduce
2.3.9 Matrix Multiplication
2.3.10 Matrix multiplication using one-step MapReduce
2.3.11 Section 2.3 Practice Problems
2.4 Extending MapReduce
2.4.1 Workflow System
2.4.2 Spark
2.4.3 Spark Implementation
2.4.4 TensorFlow
2.4.5 Recursive Extension of MapReduce
2.4.6 Bulk Synchronous System
2.4.7 Section 2.4 Practice Problems
2.5 Communication Cost Model
2.5.1 Communication Costs in Task Networks
2.5.2 Wall-Clock Time
2.5.3 Multiple Joins
2.5.4 Section 2.5 Practice Problems
2.6 Complexity Theory for MapReduce
2.6.1 Reducer size and replication rate
2.6.2 Example: Similarity Join
2.6.3 Graph Models for MapReduce Problems
2.6.4 Mapping Schema
2.6.5 When Not All Inputs Are Present
2.6.6 Lower bound on replication rate
2.6.7 Case Study: Matrix Multiplication
2.6.8 Section 2.6 Practice Problems
2.7 Summary
2.8 References
Chapter 3.
Finding Similar Items
3.1 Applications of set similarity
3.1.1 Jaccard similarity of sets
3.1.2 Document Similarity
3.1.3 Collaborative Filtering as a Similar-Sets Problem
3.1.4 Section 3.1 Practice Problems
3.2 Shingling of documents
3.2.1 k-shingle
3.2.2 Choosing the size of the shingle
3.2.3 Hashing of shingles
3.2.4 Shingles based on words
3.2.5 Section 3.2 Practice Problems
3.3 Similarity-Preserving Summaries of Sets
3.3.1 Matrix representation of sets
3.3.2 Minhashing
3.3.3 Minhashing and Jaccard Similarity
3.3.4 Minhash Signatures
3.3.5 Computing Minhash Signatures in Practice
3.3.6 Improving Minhashing Speed
3.3.7 Speed improvement using hash functions
3.3.8 Section 3.3 Practice Problems
3.4 Locality-Sensitive Hashing for Documents
3.4.1 LSH for Minhash Signatures
3.4.2 Analysis of the Banding Technique
3.4.3 Combining techniques
3.4.4 Section 3.4 Practice Problems
3.5 Distance Measurement
3.5.1 Definition of distance measurement method
3.5.2 Euclidean distance
3.5.3 Jaccard distance
3.5.4 Cosine distance
3.5.5 Edit distance
3.5.6 Hamming distance
3.5.7 Section 3.5 Practice Problems
3.6 Theory of locality-sensitive functions
3.6.1 Locality-sensitive functions
3.6.2 Locality-sensitive functions for Jaccard distance
3.6.3 Extending locality-sensitive functions
3.6.4 Section 3.6 Practice Problems
3.7 LSH function families for other distance measures
3.7.1 LSH function family for Hamming distance
3.7.2 Random hyperplanes and cosine distances
3.7.3 Sketch
3.7.4 LSH family of Euclidean distance functions
3.7.5 A more detailed description of the family of LSH functions in Euclidean space
3.7.6 Section 3.7 Practice Problems
3.8 Applications of Locality-Sensitive Hashing
3.8.1 Entity Resolution
3.8.2 An Entity-Resolution Example
3.8.3 Validating Record Matches
3.8.4 Fingerprint Reading
3.8.5 LSH function family for fingerprint reading
3.8.6 Similar Newspaper Articles
3.8.7 Section 3.8 Practice Problems
3.9 Methods for High Degrees of Similarity
3.9.1 Finding identical items
3.9.2 String representation of sets
3.9.3 Length-based filtering
3.9.4 Prefix Indexing
3.9.5 Use of location information
3.9.6 Using Index Position and Length
3.9.7 Section 3.9 Practice Problems
3.10 Summary
3.11 References
Chapter 4.
Stream Data Mining
4.1 Stream Data Model
4.1.1 Data Stream Management System
4.1.2 Example of a stream source
4.1.3 Stream Queries
4.1.4 Issues in Stream Processing
4.2 Sampling stream data
4.2.1 Examples for Motivation
4.2.2 Representative sampling
4.2.3 General Sampling Problems
4.2.4 Sample size verification
4.2.5 Section 4.2 Practice Problems
4.3 Stream Filtering
4.3.1 Examples for Motivation
4.3.2 Bloom Filter
4.3.3 Bloom Filtering Analysis
4.3.4 Section 4.3 Practice Problems
4.4 Counting Distinct Elements in a Stream
4.4.1 The Count-Distinct Problem
4.4.2 The Flajolet-Martin Algorithm
4.4.3 Combination of approximations
4.4.4 Space Requirements
4.4.5 Section 4.4 Practice Problems
4.5 Moment approximation
4.5.1 Definition of Moment
4.5.2 Alon-Matias-Szegedy Algorithm for Second Moments
4.5.3 How the Alon-Matias-Szegedy Algorithm Works
4.5.4 Higher-Order Moments
4.5.5 Handling Infinite Streams
4.5.6 Section 4.5 Practice Problems
4.6 Counting Ones in a Window
4.6.1 Cost of counting accurately
4.6.2 Datar-Gionis-Indyk-Motwani Algorithm
4.6.3 Space Requirements for the DGIM Algorithm
4.6.4 Query Answering with the DGIM Algorithm
4.6.5 Maintaining DGIM Conditions
4.6.6 Reducing Errors
4.6.7 Extension to general counting
4.6.8 Section 4.6 Practice Problems
4.7 Decaying Windows
4.7.1 Problem of finding frequently appearing elements
4.7.2 Definition of the decaying window
4.7.3 Finding the most popular elements
4.8 Summary
4.9 References
Chapter 5.
Link Analysis
5.1 PageRank
5.1.1 Early Search Engines and Term Spam
5.1.2 Definition of PageRank
5.1.3 Structure of the Web
5.1.4 Avoiding Dead Ends
5.1.5 Spider Traps and Taxation
5.1.6 Using PageRank in Search Engines
5.1.7 Section 5.1 Practice Problems
5.2 Efficient Operation of PageRank
5.2.1 Representation of the transition matrix
5.2.2 PageRank Iteration Using MapReduce
5.2.3 Using a combiner to sum the result vectors
5.2.4 Block representation of the transition matrix
5.2.5 Other Efficient Approaches to Iterative PageRank Computation
5.2.6 Section 5.2 Practice Problems
5.3 Topic-based PageRank
5.3.1 The Need for Topic-Based PageRank
5.3.2 Biased Random Walk
5.3.3 Using Topic-Based PageRank
5.3.4 Inferring topics from words
5.3.5 Section 5.3 Practice Problems
5.4 Link Spam
5.4.1 Spam Farm Structure
5.4.2 Spam Farm Analysis
5.4.3 Fighting Link Spam
5.4.4 TrustRank
5.4.5 Spam Mass
5.4.6 Section 5.4 Practice Problems
5.5 Hubs and Authorities
5.5.1 Intuitive Understanding of HITS
5.5.2 Formulation of Hub Index and Authority Index
5.5.3 Section 5.5 Practice Problems
5.6 Summary
5.7 References
Chapter 6.
Frequent Itemsets
6.1 Market Basket Model
6.1.1 Definition of frequent itemsets
6.1.2 Applications of frequent itemsets
6.1.3 Association Rules
6.1.4 Finding High-Confidence Association Rules
6.1.5 Section 6.1 Practice Problems
6.2 Market Baskets and the A-Priori Algorithm
6.2.1 Representation of Market Basket Data
6.2.2 Use of main memory to count itemsets
6.2.3 Monotonicity of itemsets
6.2.4 The Tyranny of Counting Pairs
6.2.5 The A-Priori Algorithm
6.2.6 A-Priori for All Frequent Itemsets
6.2.7 Section 6.2 Practice Problems
6.3 Processing larger datasets in main memory
6.3.1 PCY Algorithm
6.3.2 The Multistage Algorithm
6.3.3 The Multihash Algorithm
6.3.4 Section 6.3 Practice Problems
6.4 Limited-Pass Algorithms
6.4.1 The Simple, Randomized Algorithm
6.4.2 Preventing Errors in Sampling Algorithms
6.4.3 SON Algorithm
6.4.4 SON Algorithm and MapReduce
6.4.5 Toivonen's Algorithm
6.4.6 Why Toivonen's Algorithm Works
6.4.7 Section 6.4 Practice Problems
6.5 Counting frequent items in a stream
6.5.1 Sampling Methods from Streams
6.5.2 Frequent itemsets in decaying windows
6.5.3 Combining Techniques
6.5.4 Section 6.5 Practice Problems
6.6 Summary
6.7 References
Chapter 7.
Clustering
7.1 Overview of Clustering Techniques
7.1.1 Points, Spaces, and Distances
7.1.2 Clustering Strategy
7.1.3 The Curse of Dimensionality
7.1.4 Section 7.1 Practice Problems
7.2 Hierarchical clustering
7.2.1 Hierarchical clustering in Euclidean space
7.2.2 Efficiency of Hierarchical Clustering
7.2.3 Other hierarchical clustering processing rules
7.2.4 Hierarchical clustering in non-Euclidean spaces
7.2.5 Section 7.2 Practice Problems
7.3 K-means algorithm
7.3.1 Basics of k-means
7.3.2 Cluster Initialization for k-Means
7.3.3 Choosing an appropriate k value
7.3.4 BFR Algorithm
7.3.5 Data Processing of the BFR Algorithm
7.3.6 Section 7.3 Practice Problems
7.4 CURE Algorithm
7.4.1 Initialization in CURE
7.4.2 Termination of the CURE Algorithm
7.4.3 Section 7.4 Practice Problems
7.5 Clustering in non-Euclidean spaces
7.5.1 Cluster Representation Method of GRGPF Algorithm
7.5.2 Cluster Tree Initialization
7.5.3 Adding Points in the GRGPF Algorithm
7.5.4 Splitting and Merging Clusters
7.5.5 Section 7.5 Practice Problems
7.6 Clustering and Parallel Processing for Streams
7.6.1 Stream Operation Model
7.6.2 Stream-Clustering Algorithm
7.6.3 Bucket Initialization
7.6.4 Bucket Merging
7.6.5 Answering Queries
7.6.6 Clustering in Distributed Environments
7.6.7 Section 7.6 Practice Problems
7.7 Summary
7.8 References
Chapter 8.
Advertising on the Web
8.1 Topics Related to Online Advertising
8.1.1 Advertising Opportunities
8.1.2 Direct Ad Placement
8.1.3 Problems with Display Advertising
8.2 Online Algorithm
8.2.1 Online and Offline Algorithms
8.2.2 Greedy Algorithm
8.2.3 The Competitive Ratio
8.2.4 Section 8.2 Practice Problems
8.3 The Matching Problem
8.3.1 Matchings and Perfect Matchings
8.3.2 The Greedy Algorithm for Maximal Matching
8.3.3 Competitive Ratio of the Greedy Matching Algorithm
8.3.4 Section 8.3 Practice Problems
8.4 The AdWords Problem
8.4.1 History of Search Advertising
8.4.2 Definition of the AdWords Problem
8.4.3 A Greedy Approach to the AdWords Problem
8.4.4 Balance Algorithm
8.4.5 Lower bound on the competitive ratio of the balance algorithm
8.4.6 Balance Algorithm for Multiple Bidders
8.4.7 Generalization of the Balance Algorithm
8.4.8 Final Observations on the AdWords Problem
8.4.9 Section 8.4 Practice Problems
8.5 AdWords Implementation
8.5.1 Matching Bids and Search Queries
8.5.2 More complex matching problems
8.5.3 A matching algorithm for documents and bid advertisements
8.6 Summary
8.7 References
Chapter 9.
Recommendation Systems
9.1 Recommender System Model
9.1.1 The Utility Matrix
9.1.2 Long Tail
9.1.3 Applications of Recommender Systems
9.1.4 Populating the Utility Matrix
9.2 Content-based recommendations
9.2.1 Item Profile
9.2.2 Feature Extraction from Documents
9.2.3 Item characteristics obtained from tags
9.2.4 Item Profile Representation
9.2.5 User Profiles
9.2.6 Content-based item recommendations
9.2.7 Classification Algorithm
9.2.8 Section 9.2 Practice Problems
9.3 Collaborative Filtering
9.3.1 Similarity Measurement
9.3.2 Duality of Similarity
9.3.3 User and Item Clustering
9.3.4 Section 9.3 Practice Problems
9.4 Dimensionality Reduction
9.4.1 UV Decomposition
9.4.2 Root Mean Square Error
9.4.3 Incremental Computation of a UV-Decomposition
9.4.4 Optimizing an Arbitrary Element
9.4.5 Building a Complete UV-Decomposition Algorithm
9.4.6 Section 9.4 Practice Problems
9.5 Netflix Challenge
9.6 Summary
9.7 References
Chapter 10.
Social Network Graph Mining
10.1 Social Network Graph
10.1.1 What is a social network?
10.1.2 Social Networks as Graphs
10.1.3 Various social networks
10.1.4 Networks with different types of nodes
10.1.5 Section 10.1 Practice Problems
10.2 Social Network Graph Clustering
10.2.1 Distance Metrics in Social Network Graphs
10.2.2 Application of standard clustering methods
10.2.3 Betweenness
10.2.4 The Girvan-Newman Algorithm
10.2.5 Using Betweenness to Find Communities
10.2.6 Section 10.2 Practice Problems
10.3 Direct Discovery of Communities
10.3.1 Finding Cliques
10.3.2 Complete Bipartite Graphs
10.3.3 Finding Complete Bipartite Subgraphs
10.3.4 Why Complete Bipartite Graphs Must Exist
10.3.5 Section 10.3 Practice Problems
10.4 Graph Partitioning
10.4.1 What is a good way to partition?
10.4.2 Normalized Cuts
10.4.3 Matrices that describe graphs
10.4.4 Eigenvalues of the Laplacian matrix
10.4.5 Another partitioning method
10.4.6 Section 10.4 Practice Problems
10.5 Finding Overlapping Communities
10.5.1 The Nature of Community
10.5.2 Maximum likelihood estimation
10.5.3 Affiliation-Graph Model
10.5.4 Discrete Optimization of Community Allocation
10.5.5 How to avoid discrete membership changes
10.5.6 Section 10.5 Practice Problems
10.6 SimRank
10.6.1 Random Walkers on a Social Graph
10.6.2 Random Walks with Restart
10.6.3 Approximate SimRank
10.6.4 Why Approximate SimRank Works
10.6.5 Applying SimRank to Find Communities
10.6.6 Section 10.6 Practice Problems
10.7 Counting the number of triangles
10.7.1 Why Count Triangles?
10.7.2 Algorithm for finding triangles
10.7.3 Efficiency of the triangle finding algorithm
10.7.4 Finding Triangles Using MapReduce
10.7.5 Using Fewer Reduce Tasks
10.7.6 Section 10.7 Practice Problems
10.8 Neighborhood Properties of Graphs
10.8.1 Directed Graphs and Neighborhoods
10.8.2 Diameter of the graph
10.8.3 Transitive closure and reachability
10.8.4 Reachability via MapReduce
10.8.5 Semi-naive evaluation
10.8.6 Linear Transitive Closure
10.8.7 Transitive closure by recursive doubling
10.8.8 Intelligent Transitive Closure
10.8.9 Method Comparison
10.8.10 Transitive closure by graph reduction
10.8.11 Estimating the Sizes of Neighborhoods
10.8.12 Section 10.8 Practice Problems
10.9 Summary
10.10 References
Chapter 11.
Dimensionality Reduction
11.1 Eigenvalues and eigenvectors of symmetric matrices
11.1.1 Definition
11.1.2 Calculating Eigenvalues and Eigenvectors
11.1.3 Finding Eigenpairs by Power Iteration
11.1.4 Matrix of Eigenvectors
11.1.5 Section 11.1 Practice Problems
11.2 Principal Component Analysis
11.2.1 Examples to help explain
11.2.2 Using Eigenvectors for Dimensionality Reduction
11.2.3 Distance matrix
11.2.4 Section 11.2 Practice Problems
11.3 Singular value decomposition
11.3.1 Definition of SVD
11.3.2 Interpretation of SVD
11.3.3 Dimensionality Reduction Using SVD
11.3.4 Why Zeroing Low Singular Values Works
11.3.5 Queries using concepts
11.3.6 Computing the SVD of a Matrix
11.3.7 Section 11.3 Practice Problems
11.4 CUR decomposition
11.4.1 Definition of CUR
11.4.2 Appropriate selection of rows and columns
11.4.3 Intermediate matrix configuration
11.4.4 The Complete CUR Decomposition
11.4.5 Removing duplicate rows and columns
11.4.6 Section 11.4 Practice Problems
11.5 Summary
11.6 References
Chapter 12.
Large-Scale Machine Learning
12.1 Machine Learning Models
12.1.1 Training Set
12.1.2 Examples to help explain
12.1.3 Machine Learning Techniques
12.1.4 Machine Learning Structure
12.1.5 Section 12.1 Practice Problems
12.2 Perceptron
12.2.1 Training a Perceptron with a Threshold of 0
12.2.2 Convergence of the perceptron
12.2.3 Winnow Algorithm
12.2.4 Allowing changes to the threshold
12.2.5 Multiclass Perceptron
12.2.6 Training Set Transformation
12.2.7 Problems with the Perceptron
12.2.8 Parallel Implementation of Perceptron
12.2.9 Section 12.2 Practice Problems
12.3 Support Vector Machines
12.3.1 How SVM Works
12.3.2 Normalizing the Hyperplane
12.3.3 Finding Optimal Approximate Separators
12.3.4 SVM solution by gradient descent
12.3.5 Stochastic Gradient Descent
12.3.6 Parallel Implementation of SVM
12.3.7 Section 12.3 Practice Problems
12.4 Nearest Neighbor Learning
12.4.1 Framework for Computing Nearest Neighbors
12.4.2 One Nearest Neighbor Learning
12.4.3 Learning One-Dimensional Functions
12.4.4 Kernel Regression Analysis
12.4.5 High-Dimensional Euclidean Data Processing
12.4.6 Non-Euclidean Distance Processing
12.4.7 Section 12.4 Practice Problems
12.5 Decision Tree
12.5.1 Using Decision Trees
12.5.2 Impurity Measurement
12.5.3 Design of Decision Tree Nodes
12.5.4 Test Selection Using Numeric Features
12.5.5 Test Selection Using Categorical Features
12.5.6 Parallel Design of Decision Trees
12.5.7 Node Pruning
12.5.8 Decision Forest
12.5.9 Section 12.5 Practice Problems
12.6 Comparison of Learning Methods
12.7 Summary
12.8 References
Chapter 13.
Neural Networks and Deep Learning
13.1 Introduction to Neural Networks
13.1.1 Neural Networks
13.1.2 Interconnection between nodes
13.1.3 Convolutional Neural Networks
13.1.4 Neural Network Design Problems
13.1.5 Section 13.1 Practice Problems
13.2 Dense Feedforward Networks
13.2.1 Linear Algebra Notation
13.2.2 Activation Function
13.2.3 Sigmoid
13.2.4 Hyperbolic Tangent
13.2.5 Softmax
13.2.6 Rectified Linear Unit
13.2.7 Loss Function
13.2.8 Regression Loss
13.2.9 Classification Loss
13.2.10 Section 13.2 Practice Problems
13.3 Backpropagation and Gradient Descent
13.3.1 Computational Graph
13.3.2 Gradients, Jacobians, and the Chain Rule
13.3.3 Backpropagation Algorithm
13.3.4 Repeating gradient descent
13.3.5 Tensors
13.3.6 Section 13.3 Practice Problems
13.4 Convolutional Neural Networks
13.4.1 Convolutional Layer
13.4.2 Convolution and Cross-Correlation
13.4.3 Pooling Layer
13.4.4 CNN Architecture
13.4.5 Implementation and Learning
13.4.6 Section 13.4 Practice Problems
13.5 Recurrent Neural Networks
13.5.1 Training the RNN
13.5.2 Vanishing and Exploding Gradients
13.5.3 Long Short-Term Memory (LSTM)
13.5.4 Section 13.5 Practice Problems
13.6 Regularization
13.6.1 Norm Penalties
13.6.2 Dropout
13.6.3 Early Stopping
13.6.4 Dataset Augmentation
13.7 Summary
13.8 References
Publisher's Review
What this book covers
- Distributed file systems and MapReduce, tools for creating parallel algorithms that can process large amounts of data
- Core technologies of locality-based hashing algorithms and similarity search
- Data stream processing and algorithms specialized for handling data that is input very quickly and would otherwise be lost if not processed immediately.
- Search engine technologies including Google's PageRank, link spam detection, and hub and authority techniques
- Association rules, market basket models, the A-Priori algorithm and its improvements, and frequent itemset mining
- An algorithm for clustering large, high-dimensional data sets.
- Two problems related to web applications: advertising and recommendation systems.
- Algorithms for analyzing and mining very large structures, such as social network graphs.
- Techniques for extracting important attributes from large-scale data through singular value decomposition, latent semantic indexing, and dimensionality reduction.
- Machine learning algorithms applicable to large-scale data, such as perceptron, support vector machine, and gradient descent.
- Neural networks and deep learning, including special cases such as convolutional neural networks, recurrent neural networks, and long short-term memory networks
Target audience for this book
Written by leading scholars in database and web technologies, this book is a must-read for students and practitioners alike.
This book is suitable for readers who have completed the following coursework.
- Introduction to database systems, covering SQL and related programming systems
- Data structures, algorithms, and discrete mathematics at the college sophomore level
- Software systems, software engineering, and programming languages at the college sophomore level
Author's Note
This book began as a series of lectures given over several years at Stanford University by professors Anand Rajaraman and Jeff Ullman.
Although the CS345A course, titled "Web Mining," was offered as an advanced graduate course, it attracted many excellent undergraduate students as well.
After Professor Jure Leskovec joined the Stanford faculty, the content was significantly revised.
He created a new network analysis course, CS224W, and supplemented the material in CS345A, which was renamed CS246.
Additionally, the three professors opened a course on large-scale data mining projects, CS341.
This book is based on the materials from these three courses.
Translator's Note
By now, even remarking on how tiresome the term "big data" has become feels as tiresome as the term itself.
However, this book explains how to apply data mining techniques to big data in a realistic, unpretentious way.
It offers practical guidance by treating each technique in two cases: when the data fits in main memory and when it does not.
Gradually, data science is becoming essential common sense, rather than optional knowledge, for statisticians and engineers in related industries.
Perhaps we have opened this book to study the common sense of the future.
This book covers statistics, data mining, and computer science simultaneously, yet it presents these three fields in detail and in a harmonious manner.
Thanks to this, it has the advantage of being in-depth enough to be helpful in practical work, even though it is a college textbook (http://www.mmds.org/).
At the same time, it has the disadvantage of being a difficult book for both statisticians and engineers.
So, let me first share some tips that will help you study this book.
1. The original text can be downloaded for free from the URL below. If there is a part you do not fully understand from the translation alone, find that part in the original text and read it calmly three times.
http://infolab.stanford.edu/~ullman/mmds/book0n.pdf
2. Since it is a college textbook, its exposition is deductive and rigid, which can make it hard to follow. Even if you don't understand the opening of a section, read through it quickly and then study the examples. After the examples, the theory at the beginning will be easier to grasp on a second reading.
Statisticians and engineers approach the field of data science from different perspectives.
While statisticians are more interested in confidence intervals and uncertainty measures, programmers are more interested in quick implementation and results through machine learning.
To sum it up, Josh Wills (https://twitter.com/josh_wills/) said:
"A data scientist is a software engineer who understands statistics better than most, and a statistician who understands software engineering better than most."
But this alone is not enough to discuss the qualities of an analyst.
When analyzing data in the field, you realize that knowledge of the data domain and analysis know-how are most important.
And sometimes, to understand the domain well, there comes a point where a humanities background is needed.
The quality and quantity of the data itself are more important than theories or techniques, and creating value from data ultimately depends on the analyst's skills.
-Park Hyo-gyun
As the "big" in big data has become increasingly massive, the demands for processing and analyzing it have grown, and no single technology can address this, making interdisciplinary integration essential.
Attempts at interdisciplinary integration have been around for a long time, but never before has such an effort been so effective.
Considering that the disciplines of statistics, computer science, and data mining are based on mathematics, the current phenomenon of solving difficult problems through their integration may be an inevitable result.
As the technology that made this possible, we cannot help but mention Hadoop.
No one can deny the importance of Hadoop, which is a core technology for big data processing and still exerts influence today.
This is why this book explains data processing methods based on MapReduce.
If you don't have an academic foundation in statistics, computer science, and data mining techniques, you'll often find yourself stumped by unfamiliar terms that suddenly appear.
In such cases, it is good to look up the relevant term, understand the content, and then move on, or it is good to first understand the overall context and then organize the detailed terms.
In any case, I encourage all readers who have opened this book to study big data mining, and I hope that through it they will grow as students, engineers, and practitioners.
As a translator, I had a hard time choosing the terminology.
The same is true of statistics, but most of the terms used in computer science and data mining lose their meaning or become more difficult to understand when translated into Korean.
Therefore, Korean terms are given priority, but when a term is more commonly used in English in practice, it is transliterated rather than translated into Korean.
A representative example is 'clustering', which we kept as the transliterated English term rather than rendering it as a native Korean translation; in practice, no one uses the translated term.
I would like to express my deepest gratitude to my co-translator, Park Hyo-gyun, who has been a long-time friend and colleague in the same industry, for his unwavering support and advice.
-Lee Mi-jeong
GOODS SPECIFICS
- Publication date: April 29, 2021
- Format: Hardcover
- Page count, weight, size: 786 pages | 180*254*37mm
- ISBN13: 9791161755137
- ISBN10: 1161755136