LLM & NLP Learning Guide
Introduction
This folder contains materials for learning Natural Language Processing (NLP) and Large Language Models (LLMs). The material progresses step by step from NLP fundamentals to modern LLM applications.
Target Audience: Learners who have completed the Deep_Learning folder (an understanding of the Transformer architecture and attention is required)
Learning Roadmap
[NLP Basics]                [Pre-trained Models]    [LLM Applications]
      │                             │                       │
      ▼                             ▼                       ▼
Tokenization/Embedding ───▶ BERT Understanding ───▶ Prompt Engineering
      │                             │                       │
      ▼                             ▼                       ▼
Word2Vec/GloVe ───────────▶ GPT Understanding ────▶ RAG Systems
      │                             │                       │
      ▼                             ▼                       ▼
Transformer Review ───────▶ HuggingFace ──────────▶ LangChain
                                    │                       │
                                    ▼                       ▼
                            Fine-Tuning ──────────▶ Practical Chatbot
File List
NLP Basics
Pre-trained Models
LLM Applications
Advanced LLM
| File | Difficulty | Key Topics |
|------|------------|------------|
| 13_Model_Quantization.md | ⭐⭐⭐ | INT8/INT4, GPTQ, AWQ, bitsandbytes, QLoRA |
| 14_RLHF_Alignment.md | ⭐⭐⭐⭐ | PPO, Reward Model, DPO, Constitutional AI |
| 15_LLM_Agents.md | ⭐⭐⭐⭐ | ReAct, Tool Use, AutoGPT, LangChain Agent |
| 16_Evaluation_Metrics.md | ⭐⭐⭐ | BLEU, ROUGE, BERTScore, Human Eval, Benchmarks |
Key Concepts Preview
NLP Pipeline
# Basic NLP Pipeline
Text → Tokenization → Embedding → Model → Output
# HuggingFace Pipeline
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")
print(result)  # a list of dicts, each with "label" and "score" keys
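To make the Text → Tokenization → Embedding → Model → Output stages concrete, here is a minimal sketch in pure Python. Everything in it is an assumption made for illustration: the six-word vocabulary, the hand-picked 2-dimensional embeddings, and the helper names `tokenize`, `embed`, `model`, and `run_pipeline` are hypothetical and do not come from any library. A real pipeline would use a trained tokenizer and a trained network instead.

```python
import math

# Toy vocabulary and 2-dimensional embeddings (made-up values for illustration).
VOCAB = {"<unk>": 0, "i": 1, "love": 2, "hate": 3, "this": 4, "product": 5}
EMBEDDINGS = [
    [0.0, 0.0],   # <unk>
    [0.0, 0.1],   # i
    [1.0, 0.2],   # love
    [-1.0, 0.2],  # hate
    [0.0, 0.1],   # this
    [0.1, 0.3],   # product
]


def tokenize(text):
    """Tokenization: lowercase, replace punctuation with spaces, split on whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return cleaned.split()


def embed(tokens):
    """Embedding: map tokens to ids, then ids to vectors (unknown words map to <unk>)."""
    ids = [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokens]
    return [EMBEDDINGS[i] for i in ids]


def model(vectors):
    """Model: mean-pool the vectors, then squash the first dimension with a sigmoid."""
    if not vectors:
        return 0.5  # no evidence either way for empty input
    pooled = [sum(col) / len(vectors) for col in zip(*vectors)]
    return 1.0 / (1.0 + math.exp(-4.0 * pooled[0]))


def run_pipeline(text):
    """Output: run all stages and turn the score into a label."""
    tokens = tokenize(text)
    score = model(embed(tokens))
    label = "POSITIVE" if score >= 0.5 else "NEGATIVE"
    return {"tokens": tokens, "label": label, "score": score}


print(run_pipeline("I love this product!"))
```

The HuggingFace `pipeline` call performs the same stages internally, with a learned subword tokenizer and a pre-trained Transformer in place of the toy lookup table and sigmoid.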
BERT vs GPT
| Item | BERT | GPT |
|------|------|-----|
| Direction | Bidirectional (encoder) | Unidirectional (decoder) |
| Training | MLM + NSP | Next token prediction |
| Use Cases | Classification, QA, NER | Generation, dialogue |
| Features | Context understanding | Text generation |
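The Direction and Training rows can be illustrated with a small sketch. It shows the attention mask each model family uses and how training examples are formed. This is a simplification under stated assumptions: the function names are hypothetical, the MLM helper hides a single chosen token rather than sampling positions the way BERT pre-training does, and NSP is not shown.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def mlm_example(tokens, masked_index, mask_token="[MASK]"):
    """Masked language modeling: hide one token; the model predicts it from both sides."""
    inputs = list(tokens)
    target = inputs[masked_index]
    inputs[masked_index] = mask_token
    return inputs, target


def next_token_examples(tokens):
    """Next-token prediction: pair each prefix with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["the", "cat", "sat", "down"]

print("bidirectional:", bidirectional_mask(3))
print("causal:       ", causal_mask(3))
print("MLM:          ", mlm_example(tokens, 2))
for prefix, target in next_token_examples(tokens):
    print("next-token:   ", prefix, "->", target)
```

The lower-triangular causal mask is what makes a decoder suitable for generation: each position is predicted from the left context only, so the model can emit text one token at a time.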
RAG System
Question → Retrieval (Vector DB) → Relevant Docs → LLM + Docs → Answer
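The flow above can be sketched without any external services. The example below is a toy under stated assumptions: bag-of-words cosine similarity stands in for a learned embedding model and a vector database, the three documents are made up, and the helper names `bow`, `cosine`, `retrieve`, and `build_prompt` are hypothetical. The final LLM call is left out; the sketch stops at the prompt that would be sent.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, docs, k=1):
    """Retrieval: rank documents by similarity to the question, keep the top k."""
    q = bow(question)
    ranked = sorted(docs, key=lambda d: cosine(q, bow(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_docs):
    """LLM + Docs: place the retrieved documents in front of the question."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up mini corpus for illustration.
DOCS = [
    "bert is a bidirectional encoder trained with masked language modeling",
    "gpt is a decoder trained with next token prediction",
    "faiss is a library for vector similarity search",
]

question = "how is gpt trained"
top_docs = retrieve(question, DOCS, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

In a real system, `bow` and `cosine` would be replaced by an embedding model and a vector store such as Chroma or FAISS, and `prompt` would be passed to an LLM to produce the answer.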
Prerequisites
- Deep_Learning folder (required)
- Attention mechanism
- Transformer architecture
- Text classification basics
- Advanced Python
- PyTorch basics
Environment Setup
Required Packages
# PyTorch
pip install torch torchvision torchaudio
# HuggingFace
pip install transformers datasets tokenizers accelerate
# LangChain
pip install langchain langchain-community langchain-openai
# Vector Databases
pip install chromadb faiss-cpu sentence-transformers
# Others
pip install openai tiktoken numpy pandas
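After installing, it can help to confirm which packages are actually present in the active environment. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the package list is a subset of the install commands above.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as they appear in the pip commands.
PACKAGES = [
    "torch", "transformers", "datasets", "langchain",
    "chromadb", "faiss-cpu", "sentence-transformers", "openai", "tiktoken",
]

for name, version in installed_versions(PACKAGES).items():
    print(f"{name:24s} {version or 'NOT INSTALLED'}")
```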
API Key Setup
# OpenAI
export OPENAI_API_KEY="your-api-key"
# HuggingFace (for model downloads)
# HF_TOKEN is the variable name that huggingface_hub reads automatically
export HF_TOKEN="your-token"
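Inside Python, it is worth checking that the keys are visible before making any API call, since a missing key otherwise surfaces later as an authentication error. A minimal sketch, assuming only the OpenAI key is mandatory; the helper name `missing_keys` is hypothetical, and other variable names can be appended to `REQUIRED` as needed.

```python
import os


def missing_keys(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


REQUIRED = ["OPENAI_API_KEY"]

missing = missing_keys(REQUIRED)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required keys are set.")
```

Passing `env` explicitly makes the helper easy to test without touching the real environment.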
Recommended Learning Order
- NLP Basics (3 days): 01 → 02 → 03
  - Solidify understanding of tokenization and embedding concepts
- Pre-trained Models (5 days): 04 → 05 → 06 → 07
  - Focus on HuggingFace hands-on practice
- LLM Applications (7 days): 08 → 09 → 10 → 11 → 12
  - Project-based learning
- Advanced LLM (5 days): 13 → 14 → 15 → 16
  - Quantization, RLHF, Agents, evaluation metrics
Reference Links