LLM Production Engineering
Description
Book Introduction
Understanding the structure of models and designing accurate and reliable generative AI systems

As LLM commercialization accelerates, the ability to build production-grade systems that are accurate, reliable, and scalable is becoming a key skill.
Developers now need a structural understanding of the entire LLM technology stack.
This book systematically guides you through the core concepts of generative AI, from system construction to deployment. It covers the operating principles of Transformer-based models, a range of prompting strategies, RAG design, fine-tuning techniques, and the use of frameworks such as LangChain and LlamaIndex. It will serve as a guide for developers seeking to connect LLM technologies organically, from concept to application.

Table of Contents
About the Author and Translator xi
Translator's Preface xiii
Endorsements xv
Beta Reader Reviews xvii
Endorsements xix
Prologue xxii
Acknowledgments xxvii
About This Book xxviii

CHAPTER 1 Introduction to LLM 1
1.1 A Brief History of Language Models 1
1.2 What is an LLM? 2
1.3 Basic Components of LLM 3
1.4 Practice ① Translation using LLM (GPT-3.5 API) 19
1.5 Practice ② LLM Output Control through Few-Shot Learning 20
1.6 Summary 22

CHAPTER 2 LLM Architecture and Environment 23
2.1 Understanding Transformers 23
2.2 Design and Selection of Transformer Models 33
2.3 Transformer Architecture Optimization Techniques 41
2.4 GPT Architecture 43
2.5 Introducing the Large Multimodal Model 46
2.6 Commercial vs. Public vs. Open Source Language Models 52
2.7 Applications and Use Cases of LLM 59
2.8 Summary 67

CHAPTER 3 Practical Applications of the LLM 69
3.1 Understanding Hallucinations and Bias 69
3.2 How to Reduce Hallucinations in LLM Output 71
3.3 LLM Performance Evaluation 79
3.4 Summary 84

CHAPTER 4 Introduction to Prompt Engineering 86
4.1 Prompting and Prompt Engineering 86
4.2 Prompt Techniques 91
4.3 Prompt Injection and Security 97
4.4 Summary 100

CHAPTER 5 RAG 102
5.1 Why RAG? 102
5.2 Building a Basic RAG Pipeline from Scratch 106
5.3 Summary 119

CHAPTER 6 Introducing LangChain and LlamaIndex 120
6.1 LLM Framework 120
6.2 Introduction to LangChain 121
6.3 Practice ① Building an LLM-based application using LangChain 126
6.4 Practice ② Building a News Article Summarizer 130
6.5 Introduction to LlamaIndex 137
6.6 LangChain vs. LlamaIndex vs. OpenAI Assistant 145
6.7 Summary 147

CHAPTER 7 Writing Prompts Using LangChain 148
7.1 What is a LangChain prompt template? 148
7.2 Few-shot prompts and example selectors 156
7.3 What is a Chain in LangChain? 163
7.4 Practice ① Output Management Using Output Parser 171
7.5 Practice ② Improving the News Article Summarizer 183
7.6 Practice ③ Creating a Knowledge Graph Using Text Data: Discovering Hidden Links 191
7.7 Summary 197

CHAPTER 8 Indexes, Searchers, and Data Preparation 199
8.1 LangChain's Index and Search Engine 199
8.2 Data Collection 205
8.3 Text Splitter 209
8.4 Similarity Search and Vector Embedding 219
8.5 Practice ① Customer Support Q&A Chatbot 225
8.6 Practice ② YouTube Video Summarizer Using Whisper and LangChain 232
8.7 Practice ③ Voice Assistant for Knowledge Base 243
8.8 Practice ④ Preventing Unwanted Output Using a Self-Criticism Chain 255
8.9 Practice ⑤ Preventing Inappropriate Output in Customer Service Chatbots 260
8.10 Summary 265

CHAPTER 9 Advanced RAG 268
9.1 From Proof of Concept to Product: Challenges of the RAG System 268
9.2 Advanced RAG Techniques and LlamaIndex 269
9.3 RAG Metrics and Evaluation 284
9.4 LangChain, LangSmith, and LangChain Hub 299
9.5 Summary 304

CHAPTER 10 Agents 306
10.1 Agents: Large Models as Reasoning Engines 306
10.2 AutoGPT and BabyAGI at a Glance 312
10.3 LangChain's Agent Simulation Project 327
10.4 Practice ① Building an Analysis Report Writing Agent 332
10.5 Practice ② Database Queries and Summarization Using LlamaIndex 340
10.6 Practice ③ Building an Agent Using the OpenAI Assistant 350
10.7 Practice ④ LangChain OpenGPTs 354
10.8 Practice ⑤ Analyzing PDF Files with the Multimodal Financial Document Analyzer 357
10.9 Summary 371

CHAPTER 11 Fine-Tuning 372
11.1 Understanding Fine-Tuning 372
11.2 LoRA 373
11.3 Practice ① SFT Using LoRA 376
11.4 Practice ② Financial Sentiment Analysis Using SFT and LoRA 389
11.5 Practice ③ Cohere LLM Fine-Tuning Using Medical Data 398
11.6 RLHF 408
11.7 Practice ④ Improving LLM Performance through RLHF 411
11.8 Summary 433

CHAPTER 12 Deployment and Optimization 435
12.1 Model Distillation and the Teacher-Student Model 435
12.2 LLM Deployment Optimization: Quantization, Pruning, and Speculative Decoding 441
12.3 Practice: Deploying a Quantized LLM on CPUs on GCP 452
12.4 Deploying Open Source LLM in a Cloud Environment 461
12.5 Summary 463

Epilogue 465
Glossary 468
Index 472


Inside the Book
There are several versions of the GPT architecture, built for different purposes.
In later chapters, we'll cover other libraries better suited to production environments, but here we'll introduce minGPT, a simplified version of OpenAI's GPT-2 model developed by Andrej Karpathy.
minGPT is a lightweight version of the GPT model that you can run and experiment with directly from its repository.

minGPT is an educational tool developed to explain the GPT structure simply. It is condensed to about 300 lines of code and uses the PyTorch library.
Its simple structure makes it useful for deeply understanding the inner workings of GPT-family models, and the code includes clear explanations of each step, which aids learning.

--- p.45
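
To make the excerpt above concrete, here is a minimal sketch of instantiating a minGPT model. It follows the configuration API of the public minGPT repository (github.com/karpathy/minGPT); the specific values below are illustrative assumptions, not code from the book.

# Minimal sketch of instantiating minGPT; values are illustrative assumptions.
from mingpt.model import GPT

config = GPT.get_default_config()
config.model_type = 'gpt2'    # selects GPT-2 sized layers, heads, and embeddings
config.vocab_size = 50257     # GPT-2 BPE vocabulary size
config.block_size = 1024      # maximum context length
model = GPT(config)           # a plain PyTorch nn.Module, roughly 300 lines of code

Because the result is an ordinary PyTorch module, you can step through its forward pass in a debugger, which is exactly the kind of inspection the excerpt recommends.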

ICL (in-context learning) is an approach in which a model learns from examples or demonstrations included in the prompt.
Few-shot prompting is a form of in-context learning that provides the model with a small set of relevant examples or demonstrations.
This strategy helps the model generalize and improves its performance on more complex tasks.
Few-shot prompting allows language models to learn from a small number of samples, and this adaptability lets the model handle a wide range of tasks with only a handful of training examples.

In zero-shot prompting, the model generates output for a completely new task without any examples, whereas few-shot prompting improves performance by leveraging examples in context.
With this technique, the prompt typically consists of several sample inputs followed by their corresponding answers.
The language model learns from these examples and applies the pattern to answer similar questions.

--- p.92
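
A minimal sketch of the pattern described above, using the OpenAI Python SDK; the sentiment-labeling task, the example reviews, and the model name are illustrative placeholders, not examples from the book.

# Few-shot prompting sketch (illustrative). Assumes the openai package is
# installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

# A few input -> answer demonstrations followed by a new input; the model
# infers the task (sentiment labeling) from the examples in context.
prompt = """Review: The plot was predictable and dull.
Sentiment: negative

Review: A moving story with superb acting.
Sentiment: positive

Review: I couldn't stop watching it.
Sentiment:"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # expected: positive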

The final step in setting up the RAG pipeline is to prepare prompts that encourage the LLM to leverage the retrieved information rather than relying on its own internal knowledge.
At this stage, the model acts as an editor: it reviews the given information and organizes or generates an answer that fits the prompt.
It is similar to how a lawyer who has not memorized every answer looks up documents, books, and databases to answer a question, "digesting" that information to arrive at an answer.
Like the lawyer, the LLM reduces errors (hallucinations) by referencing the resources it is given.

For this to work, you need to adjust two arguments: system_prompt and user_prompt.
The main change to system_prompt is that it instructs the model to answer the question using the specific chunks of information provided.
user_prompt signals the model to respond based only on the data provided between the 〈START_OF_CONTEXT〉 and 〈END_OF_CONTEXT〉 tags.
Here, we use the .join() method to concatenate the retrieved chunks into a single long string, and the .format() method to replace the first and second { } placeholders in the prompt variable with the combined context and the user's question, respectively.

--- pp.115-116
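
A minimal sketch of the assembly step the excerpt describes; the variable names mirror the excerpt, but the prompt wording, the chunks, and the question are illustrative placeholders rather than the book's code.

# Illustrative reconstruction of the prompt-assembly step described above.
system_prompt = (
    "Answer the question using only the chunks of information provided "
    "in the context. Do not rely on your own prior knowledge."
)

user_prompt = """Answer based only on the data between the tags.
〈START_OF_CONTEXT〉
{}
〈END_OF_CONTEXT〉
Question: {}"""

retrieved_chunks = [               # placeholder chunks from the retriever
    "RAG grounds answers in retrieved text.",
    "Grounding the model reduces hallucinations.",
]
question = "How does RAG reduce hallucinations?"

# .join() concatenates the chunks into one long string; .format() fills the
# first placeholder with the context and the second with the user's question.
context = "\n\n".join(retrieved_chunks)
final_prompt = user_prompt.format(context, question)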

In LlamaIndex, after data collection, documents are transformed within a processing framework.
This process converts each document into smaller, more granular units called Node objects.
Nodes are derived from the original document and contain the main content, metadata, and contextual details.
LlamaIndex includes a NodeParser class that automatically converts document content into structured nodes; here, the list of Document objects is converted into Node objects using SimpleNodeParser.

--- p.139
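
A minimal sketch of the conversion the excerpt describes. Import paths have moved between LlamaIndex versions; the paths below assume a recent release where the parser lives under llama_index.core, and the document text is a placeholder.

# Sketch of document-to-node parsing in LlamaIndex (import paths vary by version).
from llama_index.core import Document
from llama_index.core.node_parser import SimpleNodeParser

documents = [Document(text="LLM systems combine retrieval, prompting, and generation.")]

parser = SimpleNodeParser.from_defaults()           # default chunk size and overlap
nodes = parser.get_nodes_from_documents(documents)  # Document objects -> Node objects

# Each node carries a slice of the source text plus metadata and relationships.
print(len(nodes), nodes[0].get_content())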

In RAG, query construction is the process of converting a user's question into a format compatible with various data sources.
For unstructured data, this means converting the question into a vector, comparing it with the vector representations of the source documents, and identifying the most relevant chunks.
It also applies to structured data such as databases, by writing queries in a language such as SQL.

The core idea is to leverage the inherent structure of the data to answer user queries.
For example, the query 'movies about aliens in the year 1980' combines a semantic element, 'aliens' (better handled by a vector store), with a structural element, 'year == 1980'.
The process involves translating the natural-language query into the target database's query language, such as SQL (the structured query language for relational databases) or Cypher (a query language for graph databases).

--- pp.272-273
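
As a toy illustration of that idea, the function and output schema below are hypothetical, hand-rolled for the example query in the excerpt; they are not a library API.

# Hypothetical sketch: split a natural-language query into a semantic part
# (matched in the vector store) and a structural filter (applied as SQL/metadata).
def construct_query(user_query: str) -> dict:
    # In a real system an LLM performs this decomposition; it is hard-coded
    # here for the example query from the text.
    if "year 1980" in user_query:
        return {
            "semantic": "movies about aliens",  # embedded for vector search
            "filter": "year == 1980",           # structured filter / WHERE clause
        }
    return {"semantic": user_query, "filter": None}

print(construct_query("movies about aliens in the year 1980"))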

The final step of RLHF is to integrate the previously developed models.
In this step, the reward model is used to align the fine-tuned model more closely with human feedback.
During the training loop, user-defined prompts elicit responses from the fine-tuned OPT model, which are then evaluated by the reward model.
Scores are assigned based on how similar each response is to what a human would likely produce.

--- p.423
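
A heavily simplified sketch of the shape of such a loop follows. It substitutes a REINFORCE-style update and a stub reward function for the book's full setup (a trained reward model and PPO with a KL penalty), so treat it only as an illustration of generate-score-update.

# Simplified policy-gradient sketch of the RLHF training loop described above.
# The stub reward and the REINFORCE update are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def reward_fn(text: str) -> float:
    # Stub: a trained reward model would score human-likeness here.
    return 1.0 if "sorry" in text.lower() else -0.1

prompt = "Write a polite reply to a customer complaint:"
inputs = tokenizer(prompt, return_tensors="pt")

# 1) The fine-tuned model generates a response to the prompt.
gen = model.generate(**inputs, max_new_tokens=30, do_sample=True)
n_prompt = inputs["input_ids"].shape[1]
response_text = tokenizer.decode(gen[0][n_prompt:], skip_special_tokens=True)

# 2) The reward model (here, the stub) scores the response.
reward = reward_fn(response_text)

# 3) Scale the log-probability of the generated tokens by the reward (REINFORCE):
#    logits at position t-1 predict the token at position t.
logits = model(gen).logits[0, :-1]
logp = torch.log_softmax(logits, dim=-1)
token_logp = logp[torch.arange(gen.shape[1] - 1), gen[0][1:]]
loss = -reward * token_logp[n_prompt - 1:].sum()  # response tokens only

optimizer.zero_grad()
loss.backward()
optimizer.step()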

Publisher's Review
Developing Practical AI Services with LLMs

While LLMs are evolving rapidly, with new models and techniques constantly emerging, the development tools and practices in use today form the foundation for working with more advanced models.
Those who deeply understand this foundation will be best positioned to take advantage of the more powerful models that emerge in the future. AI is already being applied in diverse fields, including natural language processing, algorithm explanation, software development, academic concept explanation, and image generation, and it is driving innovation across industries.

This book introduces the latest trends in LLMs and natural language processing, explains in depth how the models work, and presents practical, immediately applicable methods.
In particular, through a RAG pipeline construction project, you will work directly with cutting-edge techniques for text processing and contextual interaction. Focusing on prompt engineering, fine-tuning, and RAG, the essential technology stack for making specific LLM applications accurate and reliable, the book walks through the process of building products ready for real-world services.
Beyond conceptual explanations, it provides strategies for overcoming the models' limitations and concrete implementation methods, helping developers complete their own applications and products.

This book, consisting of 12 chapters, systematically covers everything from the core concepts of LLM to practical application.
Chapter 1 explores why LLMs are powerful, covering scaling laws, context size, and emergent capabilities, while Chapter 2 describes various model designs, focusing on the Transformer architecture and each of its layer components.
Chapter 3 analyzes limitations such as hallucination, latency, and computational constraints, and Chapter 4 practices prompting techniques such as few-shot learning and chained prompts through code examples.
Chapter 5 covers the basic principles of RAG, vector database concepts, and methods for storing and retrieving data, and Chapter 6 explains how to simplify LLM tasks with LangChain and LlamaIndex.


Chapter 7 covers various prompt types, response control, and tracking techniques, while Chapter 8 covers search optimization, including index creation, data partitioning, and storage.
Chapter 9 covers advanced RAG techniques, solutions to the problems that arise in practice, and chatbot performance evaluation, and also introduces how to use LangSmith.
Chapter 10 then covers intelligent agents that interact with the external environment, and Chapter 11 covers fine-tuning strategies using LoRA and QLoRA.
Finally, Chapter 12 proposes optimization methods that reduce cost while maintaining performance, such as model distillation, quantization, and pruning.
Across its chapters, the book offers 19 hands-on projects, including a RAG-based news summarizer, a customer support Q&A chatbot, a YouTube video summarizer using Whisper and LangChain, a PDF financial document analyzer, and LoRA-based financial sentiment analysis, so you can learn the concepts through practice and apply them directly to your work.

Even as models and implementations change over time, the principles and approaches covered in this book remain valid.
They are not only the practical knowledge needed now, but will also carry over to the more advanced models that emerge in the future.

Key Contents

● Understanding the LLM structure and model selection strategy
● Prompt engineering and response control techniques
● Building a RAG pipeline based on vector search
● Using LangChain and LlamaIndex
● LoRA, QLoRA-based fine-tuning
● Agent technologies such as AutoGPT and BabyAGI
● Evaluation and debugging using LangSmith
● Quantization, model lightweighting, optimization, and deployment strategies

19 practical LLM projects you can try yourself in this book.

● Translation using an LLM
● LLM output control through few-shot learning
● Building an LLM-based application using LangChain
● Building a news article summarizer
● Output management using an output parser
● Improving the news article summarizer
● Creating a knowledge graph using text data
● Customer support Q&A chatbot
● YouTube video summarizer using Whisper and LangChain
● Preventing unwanted output using a self-criticism chain
● Preventing inappropriate output in customer service chatbots
● Building an analysis report writing agent
● Database queries and summarization using LlamaIndex
● Building an agent using the OpenAI Assistant
● LangChain OpenGPTs
● Analyzing PDF files with the multimodal financial document analyzer
● SFT using LoRA
● Financial sentiment analysis using SFT and LoRA
● Cohere LLM fine-tuning using medical data
● Improving LLM performance through RLHF
Product Details
- Publication date: September 11, 2025
- Pages and size: 516 pages | 188×245×25 mm
- ISBN-13: 9791194587347
