Korean embedding
Description
Book Introduction
The Key to Improving the Performance of Natural Language Processing Models: Korean Embedding

An embedding is the result of converting natural language into a vector, that is, a list of numbers; the term also refers to the conversion process as a whole.
The name comes from the idea of converting each word or sentence into a vector and 'embedding' it in a vector space.
To enable computers to process natural language, natural language must be converted into a computable form called an embedding.
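
As a minimal sketch of what "converting words into vectors" makes possible, the snippet below treats an embedding as a simple lookup table and compares words by cosine similarity. The three-dimensional vectors and the word list are invented for illustration and do not come from the book or from any trained model.

```python
import math

# Toy 3-dimensional vectors; the values are invented for illustration only.
EMBEDDINGS = {
    "king": [0.8, 0.1, 0.3],
    "queen": [0.7, 0.2, 0.4],
    "apple": [0.1, 0.9, 0.2],
}


def embed(word):
    """Look up the vector for a word (raises KeyError for unknown words)."""
    return EMBEDDINGS[word]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Once words are vectors, relatedness becomes an ordinary numeric computation.
sim_royal = cosine_similarity(embed("king"), embed("queen"))
sim_fruit = cosine_similarity(embed("king"), embed("apple"))
print(sim_royal > sim_fruit)
```

Real embeddings typically have hundreds of dimensions and are learned from a corpus rather than written by hand, but the lookup-then-compare pattern is the same.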


Embeddings play a very important role as the first gateway for computers to understand natural language.
It is no exaggeration to say that the performance of a natural language processing model is determined by embedding.
This book provides a comprehensive overview of various embedding techniques and introduces the entire process, from Korean data preprocessing to embedding construction, in a tutorial format.
It covers everything from word-level techniques such as Word2Vec to sentence-level embeddings such as ELMo and BERT.

Table of Contents
Chapter 1.
Introduction
1.1 What is embedding?
1.2 The Role of Embedding
1.2.1 Calculating word/sentence relevance
1.2.2 Encoding semantic/grammatical information
1.2.3 Transfer Learning
1.3 History and types of embedding techniques
1.3.1 From statistical to neural network based
1.3.2 From word level to sentence level
1.3.3 Rules → End-to-End → Pretraining/Fine-Tuning
1.3.4 Types and Performance of Embeddings
1.4 Development Environment
1.4.1 Introduction to the Environment
1.4.2 AWS Configuration
1.4.3 Code Execution
1.4.4 Bug Reports and Q&A
1.4.5 Open-source software used in this book
1.5 Data and Key Terms Covered in This Book
1.6 Summary of this chapter
1.7 References

Chapter 2.
How Vectors Gain Meaning

2.1 Natural language computation and understanding
2.2 Which words are used the most?
2.2.1 The Bag-of-Words Assumption
2.2.2 TF-IDF
2.2.3 Deep Averaging Network
2.3 In what order are the words written?
2.3.1 Statistics-Based Language Models
2.3.2 Neural Network-Based Language Model
2.4 Which words are used together?
2.4.1 The Distributional Hypothesis
2.4.2 Distribution and Meaning (1): Morphemes
2.4.3 Distribution and Meaning (2): Parts of Speech
2.4.4 Pointwise Mutual Information
2.4.5 Word2Vec
2.5 Summary of this chapter
2.6 References

Chapter 3.
Korean preprocessing

3.1 Data Acquisition
3.1.1 Korean Wikipedia
3.1.2 KorQuAD
3.1.3 Naver Movie Review Corpus
3.1.4 Downloading preprocessed data
3.2 Morphological analysis based on supervised learning
3.2.1 How to use KoNLPy
3.2.2 Analysis of performance differences by analyzer within KoNLPy
3.2.3 How to use Khaiii
3.2.4 Adding a User Dictionary to Eunjeonhannip (mecab-ko)
3.3 Morphological analysis based on unsupervised learning
3.3.1 soynlp morphological analyzer
3.3.2 Google SentencePiece
3.3.3 Spacing Correction
3.3.4 Downloading completed morphological analysis data
3.4 Summary of this chapter
3.5 References

Chapter 4.
Word-level embeddings

4.1 NPLM
4.1.1 Model Basic Structure
4.1.2 Learning of NPLM
4.1.3 NPLM and Semantic Information
4.2 Word2Vec
4.2.1 Model Basic Structure
4.2.2 Building training data
4.2.3 Model Training
4.2.4 Tutorial
4.3 FastText
4.3.1 Model Basic Structure
4.3.2 Tutorial
4.3.3 Korean Characters and FastText
4.4 Latent Semantic Analysis
4.4.1 PPMI matrix
4.4.2 Understanding Latent Semantics through Matrix Decomposition
4.4.3 Understanding Word2Vec through Matrix Decomposition
4.4.4 Tutorial
4.5 GloVe
4.5.1 Model Basic Structure
4.5.2 Tutorial
4.6 Swivel
4.6.1 Model Basic Structure
4.6.2 Tutorial
4.7 Which word embeddings to use
4.7.1 Downloading word embeddings
4.7.2 Word similarity evaluation
4.7.3 Word analogy evaluation
4.7.4 Visualizing Word Embeddings
4.8 Weighted Embedding
4.8.1 Model Overview
4.8.2 Model Implementation
4.8.3 Tutorial
4.9 Summary of this chapter
4.10 References

Chapter 5.
Sentence-level embeddings

5.1 Latent Semantic Analysis
5.2 Doc2Vec
5.2.1 Model Overview
5.2.2 Tutorial
5.3 Latent Dirichlet Allocation
5.3.1 Model Overview
5.3.2 Architecture
5.3.3 LDA and Gibbs Sampling
5.3.4 Tutorial
5.4 ELMo
5.4.1 Character-level convolutional layer
5.4.2 Bidirectional LSTM, Score Layer
5.4.3 ELMo Layer
5.4.4 Pretraining Tutorial
5.5 Transformer Network
5.5.1 Scaled Dot-Product Attention
5.5.2 Multihead Attention
5.5.3 Position-wise Feed-Forward Networks
5.5.4 Transformer Learning Strategies
5.6 BERT
5.6.1 BERT, ELMo, GPT
5.6.2 Pretraining Tasks and Building Training Data
5.6.3 BERT Model Structure
5.6.4 Pretraining Tutorial
5.7 Summary of this chapter
5.8 References

Chapter 6.
Fine-Tuning Embeddings

6.1 Pretraining and Fine-Tuning
6.2 Creating a Pipeline for Classification
6.3 Using word embeddings
6.3.1 Network Overview
6.3.2 Network Implementation
6.3.3 Tutorial
6.4 Using ELMo
6.4.1 Network Overview
6.4.2 Network Implementation
6.4.3 Tutorial
6.5 Using BERT
6.5.1 Network Overview
6.5.2 Network Implementation
6.5.3 Tutorial
6.6 Which sentence embeddings to use
6.7 Summary of this chapter
6.8 References

Appendices
Appendix A.
Fundamentals of Linear Algebra
1.1 Vector and matrix operations
1.2 Inner product and covariance
1.3 Inner product and projection
1.4 Inner products and linear transformations
1.5 Matrix factorization-based dimensionality reduction (1): Principal component analysis (PCA)
1.6 Matrix factorization-based dimensionality reduction (2): Singular value decomposition (SVD)

Appendix B.
Probability Theory Fundamentals

2.1 Random variables and probability distributions
2.2 Bayesian probability theory

Appendix C.
Neural Network Basics

3.1 Understanding Neural Networks with DAG
3.2 Neural networks are probabilistic models
3.3 Maximum likelihood estimation and learning loss
3.4 Gradient descent
3.5 Backpropagation by computational node
3.6 CNN and RNN

Appendix D.
Fundamentals of Korean Linguistics

4.1 Syntactic units
4.2 Sentence Types
4.3 Parts of speech
4.4 Aspect and tense
4.5 Topic
4.6 Honorifics
4.7 Modality
4.8 Semantic roles
4.9 Passive
4.10 Causative
4.11 Negation

Appendix E.
References


Publisher's Review
What this book covers

■ An introduction to the concept, types, and history of embedding, the first gateway to natural language processing
■ The theoretical background explaining how embeddings capture the meaning of natural language
■ Practical know-how for preprocessing Korean corpora, including Wikipedia and KorQuAD
■ A guide to the KoNLPy, soynlp, and Google SentencePiece packages
■ Word-level embeddings such as Word2Vec, GloVe, FastText, and Swivel
■ Sentence-level embeddings such as LDA, Doc2Vec, ELMo, and BERT
■ Code-level explanations of how each model is trained and how it operates, each followed by a tutorial
■ Hands-on fine-tuning of embeddings, focusing on document classification tasks

This book introduces various embedding techniques.
It broadly covers word-level and sentence-level embeddings.
These are techniques for converting words and sentences, respectively, into vectors.
Word-level embeddings described here include Word2Vec, GloVe, FastText, and Swivel.
Sentence-level embeddings include ELMo and BERT.
This book examines the theoretical background of each embedding technique and then explains the process of building actual embeddings using a Korean corpus.
When explaining each technique, we try to follow the formulas and notation of the original paper as closely as possible.
The code is likewise drawn from the official repositories of each paper's authors.

Corpus preprocessing and embedding fine-tuning are also important topics covered in this book.
The former must be done before building embeddings, and the latter after they are built.
For preprocessing, we explain how to use open-source tools such as KoNLPy, soynlp, and Google SentencePiece.
We practice fine-tuning embeddings on a document classification task that predicts the polarity of a document, that is, whether it is positive or negative.

The main contents of each chapter are as follows.

Chapter 1, "Introduction," examines the definition, history, and types of embedding.
It also explains how to set up the development environment, including Docker.

Chapter 2, "How Vectors Gain Meaning," explains how embeddings come to encode the meaning of natural language.
Although the individual embedding techniques differ, they share a common characteristic: each reflects statistical patterns found in the corpus.
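
One simple example of such a statistical pattern is how often two words appear near each other. The sketch below counts co-occurrences within a fixed window. The function name, the window size, and the two-sentence toy corpus are assumptions made for illustration; this is not code from the book.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count (target, context) pairs that occur within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a token is not its own context
                    counts[(target, tokens[j])] += 1
    return counts


# Hypothetical pre-tokenized corpus, for illustration only.
corpus = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
]
counts = cooccurrence_counts(corpus, window=1)
print(counts[("cat", "drinks")], counts[("dog", "drinks")])
```

In this toy corpus "cat" and "dog" both occur next to "drinks", which is the kind of shared context that count-based and prediction-based methods alike can exploit.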

Chapter 3, "Korean Preprocessing," covers how to preprocess Korean data for embedding training.
It explains how to convert data such as web documents or JSON files into plain text files and how to perform morphological analysis on them.
Spacing correction is also introduced.

Chapter 4, "Word-Level Embedding," describes various word-level embedding models. NPLM, Word2Vec, and FastText are prediction-based models, while LSA, GloVe, and Swivel are matrix factorization-based techniques.
Weighted embedding is a method that extends word embedding to the sentence level.
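
To make the idea of extending word embeddings to the sentence level concrete, here is a minimal sketch that builds a sentence vector as a weighted average of word vectors. The weighting a / (a + p(w)), which down-weights frequent words, is one common choice and is an assumption here; the description above does not specify the scheme used in the book. All names, vectors, and probabilities are hypothetical.

```python
def sentence_embedding(tokens, word_vectors, word_prob, a=1e-3):
    """Weighted average of word vectors; frequent words get smaller weights.

    The weight a / (a + p(w)) is an assumed scheme, not taken from the book.
    Tokens without a vector are skipped.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in word_vectors:
            continue
        weight = a / (a + word_prob.get(tok, 0.0))
        for k in range(dim):
            total[k] += weight * word_vectors[tok][k]
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # no known tokens: return the zero vector
    return [x / weight_sum for x in total]


# Hypothetical 2-dimensional vectors and unigram probabilities.
word_vectors = {"the": [1.0, 0.0], "movie": [0.0, 1.0]}
word_prob = {"the": 0.05, "movie": 0.001}
vec = sentence_embedding(["the", "movie"], word_vectors, word_prob)
print(vec)
```

Because "the" is far more frequent than "movie" in this toy setup, the resulting sentence vector lies much closer to the vector for "movie".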

Chapter 5, "Sentence-Level Embeddings," covers sentence-level embeddings.
We introduce three types: matrix factorization, probabilistic models, and neural network-based models.
Latent semantic analysis (LSA) is a matrix factorization method, latent Dirichlet allocation (LDA) is a probabilistic model, and Doc2Vec, ELMo, and BERT are neural network-based methods.
BERT, in particular, is built on the Transformer network, which relies on self-attention.

Chapter 6, "Fine-Tuning Embeddings," covers fine-tuning word- and sentence-level embeddings.
We carry out a polarity classification task using the Naver movie review corpus.

The appendices briefly review the background knowledge needed to understand this book.
They explain key concepts from linear algebra, probability theory, neural networks, and Korean linguistics.
Product Details
- Date of publication: September 26, 2019
- Page count, weight, size: 348 pages | 188*235*30mm
- ISBN13: 9791161753508
- ISBN10: 1161753508
