Machine Learning with Statistics 2/e
Description
Book Introduction
This book explains how models learn from data, presenting machine learning within a statistical framework.
It explores the statistical theory behind machine learning algorithms ranging from regression to neural networks and shows why that theory matters.
It is recommended for readers who want to go beyond simply applying machine learning models, understand the models' theoretical background, and gain deeper insights from data.
Table of Contents
Chapter 1.
Introduction


Chapter 2.
Overview of Supervised Learning
__2.1 Introduction
__2.2 Variable types and terminology
__2.3 Two Simple Approaches to Prediction: Least Squares and Nearest Neighbor
____2.3.1 Linear Models and Least Squares
____2.3.2 Nearest neighbor method
____2.3.3 From least squares to nearest neighbors
__2.4 Statistical Decision Theory
__2.5 Local methods in high dimensions
__2.6 Statistical Models, Supervised Learning, and Function Approximation
____2.6.1 Statistical model for the joint distribution Pr(X, Y)
____2.6.2 Supervised Learning
____2.6.3 Function Approximation
__2.7 Structured Regression Model
____2.7.1 Difficulty of the problem
__2.8 Types of restricted estimators
____2.8.1 Roughness Penalty and Bayesian Methods
____2.8.2 Kernel method and local regression
____2.8.3 Basis Functions and Dictionary Methods
__2.9 Model Selection and Bias - Variance Tradeoff
__References
__Practice Problems


Chapter 3.
Linear methods for regression
__3.1 Introduction
__3.2 Linear Regression Model and Least Squares
____3.2.1 Example: Prostate Cancer
____3.2.2 Gauss-Markov theorem
____3.2.3 Multiple regression from simple univariate regression
____3.2.4 Multiple Outputs
__3.3 Subset selection
____3.3.1 Selecting the Best Subset
____3.3.2 Selection by forward and backward steps
____3.3.3 Forward-Stagewise Regression
____3.3.4 Prostate Cancer Data Example (Continued)
__3.4 Shrinkage Methods
____3.4.1 Ridge regression
____3.4.2 Lasso
____3.4.3 Discussion: Subset Selection, Ridge Regression, and Lasso
____3.4.4 Least Angle Regression
__3.5 Methods using derived input directions
____3.5.1 Principal Component Regression
____3.5.2 Partial least squares
__3.6 Discussion: A Comparison of Selection and Shrinkage Methods
__3.7 Multiple Outcome Shrinkage and Selection
__3.8 Additional Notes on Lasso and Related Path Algorithms
____3.8.1 Incremental Forward Stagewise Regression
____3.8.2 Piecewise-Linear Path Algorithms
____3.8.3 The Dantzig Selector
____3.8.4 Grouped Lasso
____3.8.5 Additional Properties of Lasso
____3.8.6 Path-specific coordinate optimization
__3.9 Computational Considerations
__References
__Practice Problems


Chapter 4.
Linear methods for classification
__4.1 Introduction
__4.2 Linear regression of the indicator matrix
__4.3 Linear Discriminant Analysis
____4.3.1 Regularized Discriminant Analysis
____4.3.2 Computations for LDA
____4.3.3 Reduced-rank linear discriminant analysis
__4.4 Logistic Regression
____4.4.1 Fitting a Logistic Regression Model
____4.4.2 Example: Heart disease in South Africans
____4.4.3 Quadratic approximation and inference
____4.4.4 L1 regularized logistic regression
____4.4.5 Logistic Regression or LDA?
__4.5 Separating Hyperplanes
____4.5.1 Rosenblatt's Perceptron Learning Algorithm
____4.5.2 Optimal Separating Hyperplanes
__References
__Practice Problems


Chapter 5.
Basis Expansions and Regularization
__5.1 Introduction
__5.2 Piecewise Polynomials and Splines
____5.2.1 Natural cubic splines
____5.2.2 Example: South African Heart Disease (continued)
____5.2.3 Example: Phoneme Recognition
__5.3 Filtering and Feature Extraction
__5.4 Smoothing Splines
____5.4.1 Degrees of Freedom and Smoother Matrices
__5.5 Automatic selection of smoothing parameters
____5.5.1 Fixing degrees of freedom
____5.5.2 Bias-Variance Tradeoff
__5.6 Nonparametric logistic regression
__5.7 Multidimensional Splines
__5.8 Regularization and Reproducing Kernel Hilbert Spaces
____5.8.1 Space of functions generated by the kernel
____5.8.2 RKHS Example
__5.9 Wavelet smoothing
____5.9.1 Wavelet basis and wavelet transform
____5.9.2 Adaptive Wavelet Filtering
__References
__Practice Problems
__Appendix: Spline Operations
____B-splines
____Computations for Smoothing Splines


Chapter 6.
Kernel Smoothing Methods
__6.1 One-dimensional kernel smoother
____6.1.1 Local linear regression
____6.1.2 Local polynomial regression
__6.2 Choosing the width of the kernel
__6.3 Local Regression in ℝ^p
__6.4 Structured Local Regression Models in ℝ^p
____6.4.1 Structured Kernel
____6.4.2 Structured Regression Function
__6.5 Local likelihood and other models
__6.6 Kernel Density Estimation and Classification
____6.6.1 Kernel Density Estimation
____6.6.2 Kernel Density Classification
____6.6.3 Naive Bayes Classifier
__6.7 Radial Basis Functions and Kernels
__6.8 Mixture Models for Density Estimation and Classification
__6.9 Computational Considerations
__References
__Practice Problems


Chapter 7.
Model Assessment and Selection
__7.1 Introduction
__7.2 Bias, Variance, and Model Complexity
__7.3 Bias-Variance Decomposition
____7.3.1 Example: Bias-Variance Tradeoff
__7.4 Optimism of the Training Error Rate
__7.5 Estimates of In-Sample Prediction Error
__7.6 The Effective Number of Parameters
__7.7 The Bayesian Approach and BIC
__7.8 Minimum Description Length
__7.9 The Vapnik-Chervonenkis Dimension
____7.9.1 Example (continued)
__7.10 Cross-validation
____7.10.1 K-fold cross-validation
____7.10.2 Wrong and Right Ways to Do Cross-Validation
____7.10.3 Does Cross-Validation Really Work?
__7.11 Bootstrap method
____7.11.1 Example (continued)
__7.12 Conditional or Expected Test Error
__References
__Practice Problems


Chapter 8.
Model inference and averaging
__8.1 Introduction
__8.2 Bootstrap and maximum likelihood methods
____8.2.1 Smoothing Example
____8.2.2 Maximum likelihood estimation
____8.2.3 Bootstrap vs. Maximum Likelihood
__8.3 Bayesian method
__8.4 Relationship between Bootstrap and Bayesian Estimation
__8.5 EM algorithm
____8.5.1 Two-Component Mixture Model
____8.5.2 General EM Algorithm
____8.5.3 EM as a Maximization-Maximization Procedure
__8.6 MCMC for Sampling from the Posterior Distribution
__8.7 Bagging
____8.7.1 Example: Tree with simulated data
__8.8 Model Averaging and Stacking
__8.9 Stochastic Search: Bumping
__References
__Practice Problems


Chapter 9.
Additive models, trees, and related methods
__9.1 Generalized Additive Model
____9.1.1 Fitting the additive model
____9.1.2 Example: Additive Logistic Regression
____9.1.3 Summary
__9.2 Tree-based method
____9.2.1 Background
____9.2.2 Regression Trees
____9.2.3 Classification Tree
____9.2.4 Other issues
____9.2.5 Spam Example (continued)
__9.3 PRIM: Bump Hunting
____9.3.1 Spam Example (continued)
__9.4 MARS: Multivariate Adaptive Regression Splines
____9.4.1 Spam Data (continued)
____9.4.2 Example (simulated data)
____9.4.3 Other issues
__9.5 Hierarchical Mixtures of Experts
__9.6 Missing data
__9.7 Computational Considerations
__References
__Practice Problems


Chapter 10.
Boosting and additive trees
__10.1 Boosting Method
____10.1.1 Overview
__10.2 Boosting Fits an Additive Model
__10.3 Forward Stagewise Additive Modeling
__10.4 Exponential Loss and AdaBoost
__10.5 Why exponential loss?
__10.6 Loss Function and Robustness
__10.7 “Off-the-Shelf” Procedures for Data Mining
__10.8 Example: Spam Data
__10.9 Boosting Tree
__10.10 Numerical Optimization via Gradient Boosting
____10.10.1 Steepest Descent
____10.10.2 Gradient Boosting
____10.10.3 Implementation of Gradient Boosting
__10.11 Appropriately sized tree for boosting
__10.12 Regularization
____10.12.1 Shrinkage
____10.12.2 Subsampling
__10.13 Interpretation
____10.13.1 Relative Importance of Predictors
____10.13.2 Partial Dependence Plots
__10.14 Illustration
____10.14.1 California Housing
____10.14.2 New Zealand Fish
____10.14.3 Demographic Data
__References
__Practice Problems


Chapter 11.
Neural Networks
__11.1 Introduction
__11.2 Projection Pursuit Regression
__11.3 Neural Networks
__11.4 Fitting a Neural Network
__11.5 Problems when training neural networks
____11.5.1 Starting Value
____11.5.2 Overfitting
____11.5.3 Scaling of input variables
____11.5.4 Number of Hidden Units and Layers
____11.5.5 Multiple Minima
__11.6 Example: Simulation Data
__11.7 Example: Zip Code Data
__11.8 Discussion
__11.9 Bayesian Neural Networks and the NIPS 2003 Challenge
____11.9.1 Bayes, Boosting, and Bagging
____11.9.2 Performance Comparison
__11.10 Computational Considerations
__References
__Practice Problems


Chapter 12.
Support Vector Machines and Flexible Discriminants
__12.1 Introduction
__12.2 Support Vector Classifier
____12.2.1 Computing the Support Vector Classifier
____12.2.2 Mixture Example (continued)
__12.3 Support Vector Machines and Kernels
____12.3.1 Computing the SVM for Classification
____12.3.2 The SVM as a Penalization Method
____12.3.3 Function Estimation and Reproducing Kernels
____12.3.4 SVMs and the Curse of Dimensionality
____12.3.5 Path Algorithm for SVM Classifier
____12.3.6 Support Vector Machines for Regression
____12.3.7 Regression and Kernels
____12.3.8 Discussion
__12.4 Generalized Linear Discriminant Analysis
__12.5 Flexible Discriminant Analysis
____12.5.1 Calculating FDA Estimates
__12.6 Penalized Discriminant Analysis
__12.7 Mixture Discriminant Analysis
____12.7.1 Example: Waveform Data
__12.8 Computational Considerations
__References
__Practice Problems


Chapter 13.
Prototype Methods and Nearest Neighbors
__13.1 Overview
__13.2 Prototype method
____13.2.1 K-means clustering
____13.2.2 Learning Vector Quantization
____13.2.3 Gaussian Mixture
__13.3 K-Nearest Neighbor Classifier
____13.3.1 Example: Comparative Study
____13.3.2 Example: K-Nearest Neighbors and Image Scene Classification
____13.3.3 Invariant metrics and tangent distance
__13.4 Adaptive Nearest Neighbor Method
____13.4.1 Example
____13.4.2 Global dimensionality reduction for nearest neighbors
__13.5 Computational Considerations
__References
__Practice Problems


Chapter 14.
Unsupervised Learning
__14.1 Overview
__14.2 Association Rules
____14.2.1 Market Basket Analysis
____14.2.2 Apriori Algorithm
____14.2.3 Example: Market Basket Analysis
____14.2.4 Unsupervised Learning as Supervised Learning
____14.2.5 Generalized Association Rules
____14.2.6 Choosing a Supervised Learning Method
____14.2.7 Example: Market Basket Analysis (continued)
__14.3 Cluster analysis
____14.3.1 Proximity matrix
____14.3.2 Dissimilarity based on attributes
____14.3.3 Object Dissimilarity
____14.3.4 Clustering Algorithms
____14.3.5 Combinatorial Algorithms
____14.3.6 K-means
____14.3.7 Gaussian Mixtures as Soft K-means Clustering
____14.3.8 Example: Human Tumor Microarray Data
____14.3.9 Vector Quantization
____14.3.10 K-medoids
____14.3.11 Practical Issues
____14.3.12 Hierarchical clustering
__14.4 Self-organizing map
__14.5 Principal Components, Principal Curves, and Principal Surfaces
____14.5.1 Principal Components
____14.5.2 Principal curves and principal surfaces
____14.5.3 Spectral Clustering
____14.5.4 Kernel Principal Components
____14.5.5 Sparse Principal Components
__14.6 Non-negative Matrix Factorization
____14.6.1 Archetypal Analysis
__14.7 Independent Component Analysis and Exploratory Projection Pursuit
____14.7.1 Latent Variables and Factor Analysis
____14.7.2 Independent component analysis
____14.7.3 Exploratory Projection Pursuit
____14.7.4 Direct Approach to ICA
__14.8 Multidimensional Scaling
__14.9 Nonlinear Dimensionality Reduction and Local Multidimensional Scaling
__14.10 Google PageRank Algorithm
__References
__Practice Problems


Chapter 15.
Random Forest
__15.1 Overview
__15.2 Definition of Random Forest
__15.3 Random Forest Details
____15.3.1 Out-of-bag sample
____15.3.2 Variable Importance
____15.3.3 Proximity Plots
____15.3.4 Random Forests and Overfitting
__15.4 Analysis of Random Forests
____15.4.1 Variance and Correlation Effects
____15.4.2 Bias
____15.4.3 Adaptive Nearest Neighbor
__References
__Practice Problems


Chapter 16.
Ensemble Learning
__16.1 Overview
__16.2 Boosting and regularization paths
____16.2.1 Penalized Regression
____16.2.2 The “Bet on Sparsity” Principle
____16.2.3 Regularization paths, overfitting, and margins
__16.3 Learning Ensemble
____16.3.1 Learning Good Ensembles
____16.3.2 Rule Ensemble
__References
__Practice Problems


Chapter 17.
Undirected Graphical Models
__17.1 Overview
__17.2 Markov graphs and their properties
__17.3 Undirected Graphical Models for Continuous Variables
____17.3.1 Estimation of parameters when the graph structure is known
____17.3.2 Estimating Graph Structure
__17.4 Undirected Graphical Models for Discrete Variables
____17.4.1 Estimation of parameters when the graph structure is known
____17.4.2 Hidden Nodes
____17.4.3 Estimating Graph Structure
____17.4.4 Restricted Boltzmann Machines
__References
__Practice Problems


Chapter 18.
High-Dimensional Problems: p ≫ N
__18.1 When p is much larger than N
__18.2 Diagonal Linear Discriminant Analysis and Nearest Shrunken Centroids
__18.3 Linear Classifiers with Quadratic Regularization
____18.3.1 Regularized Discriminant Analysis
____18.3.2 Logistic regression with quadratic regularization
____18.3.3 Support Vector Classifier
____18.3.4 Feature Selection
____18.3.5 Computational Shortcuts When p ≫ N
__18.4 Linear Classifiers with L1 Regularization
____18.4.1 Application of Lasso to Protein Mass Spectrometry
____18.4.2 Fused Lasso for Functional Data
__18.5 Classification When Features Are Unavailable
____18.5.1 Example: String Kernel and Protein Classification
____18.5.2 Classification and other methods using inner product kernel and pairwise distances
____18.5.3 Example: Abstract Classification
__18.6 High-Dimensional Regression: Supervised Principal Components
____18.6.1 Connection with Latent Variable Modeling
____18.6.2 Relationship with Partial Least Squares
____18.6.3 Preconditioning for feature selection
__18.7 Feature Evaluation and Multiple Testing Problems
____18.7.1 False discovery rate
____18.7.2 Asymmetric Cutpoints and the SAM Procedure
____18.7.3 Bayesian interpretation of FDR
__References
__Practice Problems
","
Detailed image
Detailed Image 1
","
Publisher's Review
★ Target audience for this book ★

This book was written for researchers and students in a wide range of fields, including statistics, artificial intelligence, engineering, and finance.
We expect readers of this book to have taken at least one introductory statistics course covering fundamental topics, including linear regression.

Rather than writing a comprehensive guide to every learning method, we wanted to explain a selection of the most important techniques.
We also explain the underlying concepts and considerations that help researchers evaluate learning methods.
It is written in an intuitive manner, emphasizing concepts rather than mathematical details.

Our exposition naturally reflects our backgrounds and expertise as statisticians.
However, over the past eight years we have also attended conferences on neural networks, data mining, and machine learning, and have been greatly influenced by these exciting fields.



★ Structure of this book ★

Before trying to fully understand a complex method, you must first understand the simple method.
Therefore, we provide an overview of supervised learning problems in Chapter 2, and then discuss linear methods for regression and classification in Chapters 3 and 4.
Chapter 5 explains splines, wavelets, and regularization/penalization methods for single predictors, and Chapter 6 covers kernel methods and local regression.
All of these methods form an important foundation for higher-dimensional learning techniques.
Model assessment and selection are the topics of Chapter 7, covering concepts such as bias and variance, overfitting, and cross-validation for model selection (see the short sketch below).
Chapter 8 discusses model inference and averaging, including an overview of maximum likelihood, Bayesian inference and bootstrapping, the EM algorithm, Gibbs sampling, and bagging.
A process called boosting is covered in detail in Chapter 10.
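To make the model-selection idea concrete, here is a minimal, illustrative sketch (not taken from the book) of K-fold cross-validation used to pick a ridge penalty on simulated data; the data, variable names, and the choice of ridge regression are assumptions made purely for illustration.

```python
# A minimal, illustrative sketch (not from the book): choosing the ridge penalty
# for a linear model by K-fold cross-validation, using NumPy only on simulated data.
import numpy as np

rng = np.random.default_rng(1)
N, p, K = 80, 10, 5
X = rng.normal(size=(N, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=N)  # only two signal variables

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Split the observations once into K roughly equal folds.
folds = np.array_split(rng.permutation(N), K)

def cv_error(lam):
    """Average squared prediction error over the K held-out folds."""
    errs = []
    for test in folds:
        train = np.setdiff1d(np.arange(N), test)
        beta = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return np.mean(errs)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_error(lam) for lam in lambdas}
print({lam: round(err, 3) for lam, err in scores.items()})
print("selected lambda:", min(scores, key=scores.get))
```

The penalty with the smallest average held-out error is selected, which is the essence of using cross-validation for model selection as discussed in Chapter 7.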

Chapters 9 through 13 describe a series of structural methods for supervised learning.
Chapters 9 and 11 in particular deal with regression, while Chapters 12 and 13 focus on classification.
Chapter 14 describes methods for unsupervised learning.
More recent techniques such as random forests and ensemble learning are discussed in Chapters 15 and 16.
Undirected graph models are discussed in Chapter 17, and finally, high-dimensional problems are studied in Chapter 18.

At the end of each chapter, we discuss computational considerations important to data mining applications, including how operations scale with the number of observations and predictors.
Each chapter concludes with a bibliography providing background references for the material.

We recommend first reading Chapters 1 through 4 in order.
Chapter 7 should also be considered essential reading, as it covers key concepts relevant to all learning methods.
The rest of the book can be read in order or selectively according to the reader's interest.


★ Author's Note ★

The field of statistics is constantly challenged by problems from both science and industry.
Initially, these problems arose from agricultural and industrial experiments and were relatively narrow in scope.
With the advent of the computer and information age, statistical problems have exploded in both size and complexity.
The challenges of data storage, organization, and retrieval have led to a new field called 'data mining'.
Statistical and computational problems in biology and pharmacology gave rise to 'bioinformatics'.
Massive amounts of data are being generated in many fields, and it is the job of statisticians to make sense of it all.
Extracting important patterns and trends and understanding what the data is saying is called learning from data.

The challenge of learning from data has led to a revolution in statistical science.
Given the central role that computation plays, it is not surprising that many of these new advances have been made by researchers in other fields, such as computer science and engineering.

The learning problems we consider can be roughly categorized into supervised learning and unsupervised learning.
The goal in supervised learning is to predict the value of an output measurement based on a number of input measurements.
In unsupervised learning, there is no output measurement, and the goal is to describe the associations and patterns among a set of input measurements.
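To make this distinction concrete, the following minimal sketch (not from the book; the toy data and variable names are invented for illustration) fits a least-squares model when an output y is observed, and runs a bare-bones K-means when only inputs are available.

```python
# Illustrative sketch (not from the book): supervised vs. unsupervised learning
# on toy data, using only NumPy.
import numpy as np

rng = np.random.default_rng(0)

# --- Supervised learning: inputs X and an observed output y -----------------
# Goal: predict y from X; here, ordinary least squares as in Chapter 3.
N, p = 100, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

X1 = np.column_stack([np.ones(N), X])            # add an intercept column
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("least-squares coefficients:", np.round(beta_hat, 2))

# --- Unsupervised learning: inputs only, no output y ------------------------
# Goal: find structure among the inputs; here, a bare-bones K-means (Chapter 14).
Z = np.vstack([rng.normal(loc=-2, size=(50, 2)), rng.normal(loc=2, size=(50, 2))])
centers = Z[rng.choice(len(Z), size=2, replace=False)]  # initialize at two data points
for _ in range(10):                                     # alternate assignment and update
    labels = np.argmin(((Z[:, None, :] - centers) ** 2).sum(-1), axis=1)
    centers = np.array([Z[labels == k].mean(axis=0) for k in range(2)])
print("cluster centers:", np.round(centers, 2))
```

In the supervised case the fit is judged by how well it predicts y; in the unsupervised case there is no y, and the output is simply a description of structure among the inputs.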

The intention of this book is to bring together many important new ideas in learning and explain them within a statistical framework.
Although some mathematical detail is required, we emphasize the methods and their conceptual underpinnings rather than their theoretical properties.
Accordingly, we hope that this book will attract the interest of not only statisticians but also researchers and practitioners in various fields.

As we have learned much from researchers outside of statistics, a statistical perspective may help readers better understand other aspects of learning.


★ Translator's Note ★

This book is a translation of 『The Elements of Statistical Learning, Second Edition』 published by Springer.
The three co-authors of the original book are all professors of statistics at Stanford University, renowned for their outstanding academic achievements, and the book has been cited in numerous academic papers.


If you're interested enough in machine learning to have even opened the introduction to this book, you've probably seen a funny meme about statistics being packaged and called machine learning.
I'm sure I'm not alone in thinking these memes aren't just jokes.
I think the fundamental reason I took on the task of translating this book is that statistics is indispensable for a better understanding of machine learning.


These days there is a strong tendency to reach for deep learning to solve every problem.
But as the authors note in Chapter 1, it is important to understand the simple methods before attempting the more complex ones.
Of course, applying machine learning models to data is not difficult even if you lack statistical and mathematical knowledge.
But this book goes further, building a broad understanding of the concepts underlying the models, so that you can develop the practical skills to solve the problems you face and gain deeper insight from your data.
I believe that this book will be a great help in studying various topics in the future, including statistical theory, regression and classification, kernels and bases, regularization, and additive models.


The authors expect readers to have at least a basic understanding of statistics, but that is unlikely to be sufficient to fully understand this book.
I recommend that you study calculus, linear algebra, probability theory, statistics, and other areas that you feel are lacking while reading the book.
The errata applied to this translation are those listed on the original book's website (https://web.stanford.edu/~hastie/ElemStatLearn/) under 'Errata for the 2nd Edition, after the 12th printing (January 2017) and not yet reflected in the online version.'
Terminology was standardized against the glossaries of the Korean Statistical Society (http://www.kss.or.kr/) and the Korean Mathematical Society (http://www.kms.or.kr/main.html); for other terms, we tried to use the forms most frequently found through Internet searches.
"]
GOODS SPECIFICS
- Publication date: November 30, 2020
- Page count, weight, size: 844 pages | 1,185g | 155*235*40mm
- ISBN13: 9791161754727
- ISBN10: 1161754725

카테고리