
CUDA-Based GPU Parallel Processing Programming
Description
Book Introduction
Process more data, faster, with GPGPU technology and CUDA!
In an era when octa-core CPUs are commonplace, GPGPU lets the GPU take over computations traditionally handled by the CPU, so larger amounts of data can be processed faster.
"CUDA-Based Parallel Processing Programming" is built on CUDA, NVIDIA's GPGPU platform, and teaches the parallel processing techniques that are transforming speed and efficiency across many fields.
Starting with how to write CUDA programs in Visual Studio, the book examines parallel processing code line by line and uses computer architecture diagrams to explain in detail how that code actually runs on the hardware.
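To give a sense of the kind of code the book builds up in its early chapters, here is a minimal CUDA C/C++ vector-add sketch (an illustrative example written for this description, not code taken from the book): allocate device memory, copy the inputs to the GPU, launch a kernel, copy the result back, and free the device memory.

// Minimal sketch (illustrative only, not from the book) of a CUDA vector-add program.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the leftover threads
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;                  // 1,048,576 elements (assumed size)
    const size_t bytes = n * sizeof(float);

    // Host memory and input data
    float* hA = (float*)malloc(bytes);
    float* hB = (float*)malloc(bytes);
    float* hC = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Device memory allocation and host-to-device copies
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Thread layout: enough blocks of 256 threads to cover n elements
    const int blockSize = 256;
    const int gridSize = (n + blockSize - 1) / blockSize;
    vecAdd<<<gridSize, blockSize>>>(dA, dB, dC, n);

    // Copy the result back and check one value
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %.1f (expected 3.0)\n", hC[0]);

    // Free device and host memory
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}

The execution configuration in the <<<...>>> launch and the global-index calculation inside the kernel correspond to the thread layout and indexing topics treated in depth in Chapters 4 and 5 of the table of contents below.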
Table of Contents
Chapter 1: Overview of GPGPU and Parallel Processing
_1.1 GPGPU and GPU Programming
_1.2 Concept and necessity of parallel processing
__1.2.1 Concept of parallel processing
__1.2.2 The Need for Parallel Processing
__1.2.3 The Need for Parallel Processing Programming
_1.3 Parallel Processing Hardware
__1.3.1 Flynn's Taxonomy
__1.3.2 Shared memory systems and distributed memory systems
__1.3.3 GPU SIMT architecture
_1.4 CPU and GPU comparison
__1.4.1 Background and development direction of GPU
__1.4.2 CPU vs. GPU
_1.5 Performance of parallel processing
__1.5.1 Parallel Processing Performance Metrics
__1.5.2 Amdahl's Law
Chapter 2: CUDA Overview
_2.1 Introduction to CUDA
__2.1.1 Driver API and Runtime API
__2.1.2 CUDA-enabled GPU
__2.1.3 GPU Performance
__2.1.4 CUDA compute capability
__2.1.5 Checking my GPU
_2.2 Setting up the CUDA development environment
__2.2.1 Installing the CUDA Toolkit
__2.2.2 CUDA program writing and compilation environment
_2.3 Hello CUDA
__2.3.1 Host and Device
__2.3.2 CUDA program
__2.3.3 Hello CUDA - Your First CUDA Program
__2.3.4 CUDA C/C++ Keywords
__2.3.5 Kernel Execution and Execution Configuration
Chapter 3: Basic Flow of CUDA Programs
_3.1 Structure and flow of CUDA programs
_3.2 CUDA Basic Memory API
__3.2.1 Device memory space allocation and initialization API
__3.2.2 Host-Device Memory Data Copy API
_3.3 Vector sum program written in CUDA
__3.3.1 Device Memory Allocation
__3.3.2 Copy input vector (host memory → device memory)
__3.3.3 Calling the vector sum kernel
__3.3.4 Copy result vector (device memory → host memory)
__3.3.5 Freeing device memory
__3.3.6 CUDA-based vector sum program full code
_3.4 Performance Measurement of CUDA Algorithms
__3.4.1 Kernel execution time
__3.4.2 Data transfer time
__3.4.3 Performance Measurement and Analysis of CUDA-Based Vector Sum Program
Chapter 4: CUDA Thread Hierarchy
_4.1 CUDA Thread Hierarchy
__4.1.1 CUDA Thread Hierarchy
__4.1.2 Built-in variables for the CUDA thread hierarchy
__4.1.3 Maximum size limits for grids and blocks
_4.2 CUDA thread structure and kernel calls
__4.2.1 Setting up thread layout and kernel calls
__4.2.2 Example of setting and checking thread layout
_4.3 Vector sum CUDA program for large vectors - Thread layout
Chapter 5: Thread Layout and Indexing
_5.1 Finding the sum of vectors larger than 1,024 elements
__5.1.1 Determining thread layout
__5.1.2 Calculating the index of data to be accessed by each thread
__5.1.3 Writing a kernel that reflects the calculated index
_5.2 Thread Indexing
__5.2.1 Appearance of array in memory
__5.2.2 Thread Indexing Practice I - Global Thread Numbers
__5.2.3 Thread Indexing Exercise II - Indexing Two-Dimensional Data
_5.3 CUDA-based large-scale matrix sum program
__5.3.1 2D Grid, 2D Block Layout
__5.3.2 1D Grid, 1D Block Layout
__5.3.3 2D Grid, 1D Block Layout
Chapter 6: CUDA Execution Model
_6.1 NVIDIA GPU Architecture
__6.1.1 Streaming Multiprocessor
__6.1.2 CUDA cores
_6.2 CUDA Thread Hierarchy and GPU Hardware
__6.2.1 Grid → GPU
__6.2.2 Thread block → SM
__6.2.3 Warp & Thread → CUDA Cores in SM
__6.2.4 Zero context switch overhead
__6.2.5 Warp divergence
_6.3 Strategies for Hiding Memory Access Latency
_6.4 Checking GPU information
Chapter 7: CUDA-based matrix multiplication program
_7.1 What is matrix multiplication?
_7.2 Setting thread layout
__7.2.1 Thread layout based on input matrices A and B
__7.2.2 Thread layout based on result matrix C
_7.3 Thread Indexing
__7.3.1 When the size of matrix C is smaller than the maximum block size (1,024 threads)
__7.3.2 When the size of matrix C is larger than the maximum block size (1,024 threads)
_7.4 Implementation and Performance Evaluation
__7.4.1 Implementation Details
__7.4.2 Performance Evaluation
__7.4.3 Floating-point operation precision issues
Chapter 8: CUDA Memory Hierarchy
_8.1 Memory Hierarchy of Computer Systems
_8.2 CUDA Memory Hierarchy
__8.2.1 Thread-level memory
__8.2.2 Block-level memory
__8.2.3 Grid-level memory
__8.2.4 GPU Cache
__8.2.5 CUDA Memory Summary
_8.3 CUDA Memory Model and Performance
__8.3.1 Maximizing Parallelism
__8.3.2 Active Warp Rate
Chapter 9: CUDA Shared Memory
_9.1 How to use shared memory
__9.1.1 Storage of shared data among threads within a thread block
__9.1.2 L1 cache (HW managed cache)
__9.1.3 User-managed cache
_9.2 Example of using shared memory - Multiplication of matrices smaller than 1,024 elements
Chapter 10: Matrix Multiplication Program Using Shared Memory
_10.1 Problem Definition and Base Code
_10.2 Algorithm Design and Implementation
__10.2.1 Strategy 1: Load some rows of matrix A and some columns of matrix B into shared memory
__10.2.2 Strategy 2: Divide rows and columns into blocks and load them into shared memory.
_10.3 Performance Evaluation
Chapter 11: Optimizing Memory Access Performance
_11.1 Optimizing Global Memory Access
__11.1.1 Aligned and coalesced memory access
__11.1.2 Example: Thread layout of a matrix multiplication kernel
__11.1.3 Array of Structures vs. Structure of Arrays
_11.2 Optimizing Shared Memory Access
__11.2.1 Memory Bank and Bank Conflicts
__11.2.2 Example: Matrix Multiplication Kernel Utilizing Shared Memory
Chapter 12: Synchronization and Concurrency
_12.1 Synchronization
__12.1.1 Synchronization in CUDA
_12.2 Concurrent execution with CUDA streams
__12.2.1 Definition and Characteristics of CUDA Streams
__12.2.2 Concurrent execution of CUDA commands
__12.2.3 Example: Hiding Data Transfer Overhead
__12.2.4 Stream Synchronization
_12.3 CUDA Events
__12.3.1 CUDA Event API
__12.3.2 Measuring execution time by kernel and stream using CUDA events
_12.4 Multi-GPU and Heterogeneous Parallel Computing
__12.4.1 Using multiple GPUs
__12.4.2 Heterogeneous Parallel Computing

Publisher's Review
Parallel Processing Programming at a Glance, with Computer Architecture Images
If you only ever learn single-threaded programming, it is hard to scale up good models and techniques and apply them across different fields.
This is where GPGPU comes in: it can be put to innovative use in areas such as high-end game development, artificial intelligence, big data analysis, and data mining.
"CUDA-Based Parallel Processing Programming" covers parallel processing thoroughly, from the key terms and definitions surrounding GPGPU and parallel processors, to how they behave at the level of computer architecture, to how to use them to speed up computation.
Readers who need this book
ㆍThose who want to experience what GPGPU technology is
ㆍIntermediate C/C++ programmers who want to understand parallel processing programming
ㆍThose who want to learn not only algorithms but also how the computation is actually processed
ㆍThose who need CUDA-based parallel processing optimization
Product Details
- Publication date: May 23, 2023
- Page count and size: 312 pages | 188*245*30 mm
- ISBN13: 9791165922238
- ISBN10: 1165922231