08_xgboost_lightgbm.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-0",
   "metadata": {},
   "source": [
    "# 08. XGBoost & LightGBM\n",
    "\n",
    "## Learning Objectives\n",
    "- Understand the concept of gradient boosting\n",
    "- Use XGBoost and its hyperparameters\n",
    "- Learn LightGBM's features and how to tune it\n",
    "- Get an overview of CatBoost\n",
    "- Compare the models and choose between them"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV\n",
    "from sklearn.metrics import accuracy_score, classification_report, mean_squared_error, r2_score\n",
    "from sklearn.datasets import make_classification, load_breast_cancer, fetch_california_housing\n",
    "import time\n",
    "\n",
    "plt.rcParams['font.family'] = 'DejaVu Sans'\n",
    "plt.rcParams['axes.unicode_minus'] = False"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-2",
   "metadata": {},
   "source": [
    "## 1. Gradient Boosting Concepts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import GradientBoostingRegressor\n",
    "\n",
    "print(\"\"\"\n",
    "Gradient boosting algorithm:\n",
    "\n",
    "1. Initialize: F_0(x) = argmin_γ Σ L(y_i, γ)\n",
    "\n",
    "2. Repeat for m = 1, 2, ..., M:\n",
    "   a. Compute the pseudo-residuals:\n",
    "      r_im = -[∂L(y_i, F(x_i))/∂F(x_i)]_{F=F_{m-1}}\n",
    "\n",
    "   b. Fit a weak learner h_m(x) to the pseudo-residuals\n",
    "\n",
    "   c. Compute the optimal step size:\n",
    "      γ_m = argmin_γ Σ L(y_i, F_{m-1}(x_i) + γ * h_m(x_i))\n",
    "\n",
    "   d. Update the model:\n",
    "      F_m(x) = F_{m-1}(x) + learning_rate * γ_m * h_m(x)\n",
    "\n",
    "Key ideas:\n",
    "- Each stage fits the errors (residuals) left by the previous model\n",
    "- The model moves along the negative gradient of the loss function\n",
    "- learning_rate shrinks each step, which helps limit overfitting\n",
    "\"\"\")\n",
    "\n",
    "# Simple demo data\n",
    "np.random.seed(42)\n",
    "X_demo = np.linspace(0, 10, 100).reshape(-1, 1)\n",
    "y_demo = np.sin(X_demo).ravel() + np.random.randn(100) * 0.3\n",
    "\n",
    "# Fit a small boosted model so that intermediate stages can be drawn\n",
    "gbr_demo = GradientBoostingRegressor(\n",
    "    n_estimators=50, learning_rate=0.1, max_depth=2, random_state=42\n",
    ")\n",
    "gbr_demo.fit(X_demo, y_demo)\n",
    "# staged_preds[m - 1] holds the prediction after m trees\n",
    "staged_preds = list(gbr_demo.staged_predict(X_demo))\n",
    "\n",
    "plt.figure(figsize=(12, 4))\n",
    "plt.subplot(1, 2, 1)\n",
    "plt.scatter(X_demo, y_demo, alpha=0.5)\n",
    "plt.xlabel('X')\n",
    "plt.ylabel('y')\n",
    "plt.title('Sample Data for Gradient Boosting')\n",
    "plt.grid(True, alpha=0.3)\n",
    "\n",
    "plt.subplot(1, 2, 2)\n",
    "stages = [0, 1, 5, 20, 50]\n",
    "colors = ['red', 'orange', 'yellow', 'green', 'blue']\n",
    "for stage, color in zip(stages, colors):\n",
    "    if stage == 0:\n",
    "        # Stage 0 is the constant initial prediction (the mean for squared error)\n",
    "        plt.axhline(y=np.mean(y_demo), color=color, label=f'Stage {stage}', alpha=0.7)\n",
    "    else:\n",
    "        plt.plot(X_demo.ravel(), staged_preds[stage - 1], color=color, label=f'Stage {stage}', alpha=0.7)\n",
    "plt.scatter(X_demo, y_demo, alpha=0.3, color='gray')\n",
    "plt.xlabel('X')\n",
    "plt.ylabel('y')\n",
    "plt.title('Gradient Boosting: Sequential Learning')\n",
    "plt.legend()\n",
    "plt.grid(True, alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
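  {
   "cell_type": "markdown",
   "id": "cell-3a",
   "metadata": {},
   "source": [
    "The steps listed above can be sketched by hand for the squared-error loss, where the pseudo-residuals reduce to `y - F`. This is a minimal illustration under stated assumptions, not how any library implements boosting: the names `X_gb`, `y_gb`, `n_rounds`, and `mse_history` are hypothetical, the weak learner is a depth-2 `DecisionTreeRegressor`, and γ_m is fixed at 1 because a regression tree fit to the residuals by least squares already takes the optimal step for this loss."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-3b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "# Synthetic 1-D regression data (illustrative only)\n",
    "rng = np.random.default_rng(0)\n",
    "X_gb = np.linspace(0, 10, 200).reshape(-1, 1)\n",
    "y_gb = np.sin(X_gb).ravel() + rng.normal(scale=0.3, size=200)\n",
    "\n",
    "learning_rate = 0.1\n",
    "n_rounds = 50\n",
    "\n",
    "# Step 1: for squared error, the best constant prediction is the mean\n",
    "F = np.full(y_gb.shape, y_gb.mean())\n",
    "mse_history = [np.mean((y_gb - F) ** 2)]\n",
    "\n",
    "for m in range(n_rounds):\n",
    "    # Step 2a: pseudo-residuals = negative gradient of 0.5 * (y - F)^2\n",
    "    residuals = y_gb - F\n",
    "    # Step 2b: fit a weak learner to the residuals\n",
    "    tree = DecisionTreeRegressor(max_depth=2, random_state=0)\n",
    "    tree.fit(X_gb, residuals)\n",
    "    # Steps 2c-2d: gamma_m = 1 here, so the update is a shrunken tree prediction\n",
    "    F = F + learning_rate * tree.predict(X_gb)\n",
    "    mse_history.append(np.mean((y_gb - F) ** 2))\n",
    "\n",
    "print(f\"Training MSE after 0 rounds:  {mse_history[0]:.4f}\")\n",
    "print(f\"Training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}\")"
   ]
  },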
  {
   "cell_type": "markdown",
   "id": "cell-4",
   "metadata": {},
   "source": [
    "## 2. XGBoost (eXtreme Gradient Boosting)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install XGBoost with: pip install xgboost\n",
    "import xgboost as xgb\n",
    "from xgboost import XGBClassifier, XGBRegressor\n",
    "\n",
    "print(\"\"\"\n",
    "XGBoost features:\n",
    "\n",
    "1. Regularization:\n",
    "   - L1 and L2 penalties on the leaf weights limit overfitting\n",
    "   - Objective: Σ L(y_i, ŷ_i) + Σ Ω(f_k)\n",
    "   - Ω(f) = γT + 0.5λ||w||² + α||w||₁\n",
    "     (T: number of leaves, w: leaf weights, λ: reg_lambda, α: reg_alpha)\n",
    "\n",
    "2. Efficient computation:\n",
    "   - Second-order Taylor expansion of the loss\n",
    "   - Histogram-based split finding\n",
    "   - Cache-aware optimization\n",
    "\n",
    "3. Missing values:\n",
    "   - Learns the best default split direction automatically\n",
    "\n",
    "4. Parallelism:\n",
    "   - Split points are searched in parallel across features\n",
    "\"\"\")\n",
    "\n",
    "print(f\"XGBoost version: {xgb.__version__}\")"
   ]
  },
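  {
   "cell_type": "markdown",
   "id": "cell-5a",
   "metadata": {},
   "source": [
    "With the second-order expansion and the L2 part of the regularizer above, the best weight of a leaf has the closed form w* = -G / (H + λ), and the gain of a split is 0.5 * [G_L² / (H_L + λ) + G_R² / (H_R + λ) - G² / (H + λ)] - γ, where G and H are the sums of first and second derivatives of the loss over the samples in a node. The sketch below evaluates both with NumPy for the logistic loss on four toy samples. It is an illustration only: the helpers `leaf_weight` and `split_gain` are hypothetical names, not XGBoost functions, and the L1 term is ignored."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-5b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def leaf_weight(g, h, lam=1.0):\n",
    "    \"\"\"Optimal leaf weight w* = -G / (H + lambda).\"\"\"\n",
    "    return -g.sum() / (h.sum() + lam)\n",
    "\n",
    "def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):\n",
    "    \"\"\"Gain of splitting one node into left and right children.\"\"\"\n",
    "    def score(gs, hs):\n",
    "        return gs.sum() ** 2 / (hs.sum() + lam)\n",
    "    left = score(g[left_mask], h[left_mask])\n",
    "    right = score(g[~left_mask], h[~left_mask])\n",
    "    return 0.5 * (left + right - score(g, h)) - gamma\n",
    "\n",
    "# Four toy samples, all starting from raw score 0 (predicted probability 0.5)\n",
    "y_toy = np.array([0, 0, 1, 1])\n",
    "p_toy = np.full(4, 0.5)\n",
    "g_toy = p_toy - y_toy          # first derivative of the logistic loss\n",
    "h_toy = p_toy * (1 - p_toy)    # second derivative of the logistic loss\n",
    "\n",
    "# Candidate split that separates the two classes\n",
    "left_mask = np.array([True, True, False, False])\n",
    "print(f\"Split gain:        {split_gain(g_toy, h_toy, left_mask):.4f}\")\n",
    "print(f\"Left leaf weight:  {leaf_weight(g_toy[left_mask], h_toy[left_mask]):.4f}\")\n",
    "print(f\"Right leaf weight: {leaf_weight(g_toy[~left_mask], h_toy[~left_mask]):.4f}\")"
   ]
  },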
  {
   "cell_type": "markdown",
   "id": "cell-6",
   "metadata": {},
   "source": [
    "### 2.1 Classification with XGBoost"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the breast cancer dataset\n",
    "cancer = load_breast_cancer()\n",
    "X, y = cancer.data, cancer.target\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "print(f\"Data shape: {X.shape}\")\n",
    "print(f\"Classes: {cancer.target_names}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# XGBoost classifier\n",
    "xgb_clf = XGBClassifier(\n",
    "    n_estimators=100,\n",
    "    learning_rate=0.1,\n",
    "    max_depth=6,\n",
    "    min_child_weight=1,     # minimum sum of instance weights (Hessian) in a child\n",
    "    gamma=0,                # minimum loss reduction required to split\n",
    "    subsample=1.0,          # fraction of rows sampled per tree\n",
    "    colsample_bytree=1.0,   # fraction of columns sampled per tree\n",
    "    reg_alpha=0,            # L1 regularization\n",
    "    reg_lambda=1,           # L2 regularization\n",
    "    random_state=42,\n",
    "    eval_metric='logloss'\n",
    ")\n",
    "\n",
    "# Train\n",
    "start_time = time.time()\n",
    "xgb_clf.fit(X_train, y_train)\n",
    "train_time = time.time() - start_time\n",
    "\n",
    "# Predict and evaluate\n",
    "y_pred = xgb_clf.predict(X_test)\n",
    "accuracy = accuracy_score(y_test, y_pred)\n",
    "\n",
    "print(\"=== XGBoost classification results ===\")\n",
    "print(f\"Train accuracy: {xgb_clf.score(X_train, y_train):.4f}\")\n",
    "print(f\"Test accuracy: {accuracy:.4f}\")\n",
    "print(f\"Training time: {train_time:.4f}s\")\n",
    "print(\"\\nClassification report:\")\n",
    "print(classification_report(y_test, y_pred, target_names=cancer.target_names))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-9",
   "metadata": {},
   "source": [
    "### 2.2 Early Stopping"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-10",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split off a validation set\n",
    "X_train_sub, X_val, y_train_sub, y_val = train_test_split(\n",
    "    X_train, y_train, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "# Train with early stopping\n",
    "xgb_early = XGBClassifier(\n",
    "    n_estimators=1000,\n",
    "    learning_rate=0.1,\n",
    "    max_depth=6,\n",
    "    random_state=42,\n",
    "    early_stopping_rounds=10,  # stop if the metric does not improve for 10 rounds\n",
    "    eval_metric='logloss'\n",
    ")\n",
    "\n",
    "xgb_early.fit(\n",
    "    X_train_sub, y_train_sub,\n",
    "    eval_set=[(X_val, y_val)],\n",
    "    verbose=False\n",
    ")\n",
    "\n",
    "print(\"=== Early stopping results ===\")\n",
    "print(f\"Best iteration: {xgb_early.best_iteration}\")\n",
    "print(f\"Best score (validation logloss): {xgb_early.best_score:.4f}\")\n",
    "print(f\"Test accuracy: {xgb_early.score(X_test, y_test):.4f}\")"
   ]
  },
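  {
   "cell_type": "markdown",
   "id": "cell-10a",
   "metadata": {},
   "source": [
    "To see what early stopping reacts to, it can help to plot the training and validation loss per boosting round. The sketch below refits a separate model without early stopping and reads the recorded metric back through `evals_result()`. The variable names (`X_tr`, `X_va`, `xgb_curve`, and so on) are hypothetical, and 200 rounds is an arbitrary choice for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-10b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.datasets import load_breast_cancer\n",
    "from sklearn.model_selection import train_test_split\n",
    "from xgboost import XGBClassifier\n",
    "\n",
    "X_all, y_all = load_breast_cancer(return_X_y=True)\n",
    "X_tr, X_va, y_tr, y_va = train_test_split(X_all, y_all, test_size=0.2, random_state=42)\n",
    "\n",
    "# No early stopping here, so the full loss curves are recorded\n",
    "xgb_curve = XGBClassifier(\n",
    "    n_estimators=200, learning_rate=0.1, max_depth=6,\n",
    "    random_state=42, eval_metric='logloss'\n",
    ")\n",
    "xgb_curve.fit(X_tr, y_tr, eval_set=[(X_tr, y_tr), (X_va, y_va)], verbose=False)\n",
    "\n",
    "# evals_result() is keyed by eval_set position: validation_0, validation_1, ...\n",
    "history = xgb_curve.evals_result()\n",
    "train_loss = history['validation_0']['logloss']\n",
    "val_loss = history['validation_1']['logloss']\n",
    "best_round = int(np.argmin(val_loss))\n",
    "\n",
    "plt.figure(figsize=(8, 4))\n",
    "plt.plot(train_loss, label='Train')\n",
    "plt.plot(val_loss, label='Validation')\n",
    "plt.axvline(best_round, color='gray', linestyle='--', label=f'Lowest validation loss (round {best_round})')\n",
    "plt.xlabel('Boosting round')\n",
    "plt.ylabel('Log loss')\n",
    "plt.legend()\n",
    "plt.grid(True, alpha=0.3)\n",
    "plt.show()"
   ]
  },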
  {
   "cell_type": "markdown",
   "id": "cell-11",
   "metadata": {},
   "source": [
    "### 2.3 Feature Importance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-12",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot feature importance for each importance type\n",
    "importance_types = ['weight', 'gain', 'cover']\n",
    "\n",
    "fig, axes = plt.subplots(1, 3, figsize=(18, 5))\n",
    "\n",
    "for ax, imp_type in zip(axes, importance_types):\n",
    "    importance_dict = xgb_clf.get_booster().get_score(importance_type=imp_type)\n",
    "\n",
    "    if importance_dict:\n",
    "        # Keep only the top 10\n",
    "        sorted_importance = sorted(importance_dict.items(), key=lambda x: x[1], reverse=True)[:10]\n",
    "        # The model was fit on a NumPy array, so keys look like 'f0', 'f1', ...;\n",
    "        # map them back to the dataset's feature names where possible\n",
    "        features = [\n",
    "            cancer.feature_names[int(k[1:])] if k[1:].isdigit() else k\n",
    "            for k, _ in sorted_importance\n",
    "        ]\n",
    "        values = [v for _, v in sorted_importance]\n",
    "\n",
    "        ax.barh(range(len(features)), values)\n",
    "        ax.set_yticks(range(len(features)))\n",
    "        ax.set_yticklabels(features)\n",
    "        ax.invert_yaxis()  # most important feature on top\n",
    "        ax.set_xlabel('Importance')\n",
    "        ax.set_title(f'Feature Importance ({imp_type})')\n",
    "        ax.grid(True, alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"\"\"\n",
    "Importance types:\n",
    "- weight: number of times the feature is used in a split\n",
    "- gain: average gain of the splits that use the feature\n",
    "- cover: average coverage (samples affected) of the splits that use the feature\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-13",
   "metadata": {},
   "source": [
    "### 2.4 XGBoost Hyperparameter Tuning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-14",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parameter grid\n",
    "param_grid_xgb = {\n",
    "    'max_depth': [3, 5, 7],\n",
    "    'learning_rate': [0.01, 0.1, 0.3],\n",
    "    'n_estimators': [100, 200],\n",
    "    'min_child_weight': [1, 3],\n",
    "    'subsample': [0.8, 1.0],\n",
    "    'colsample_bytree': [0.8, 1.0]\n",
    "}\n",
    "\n",
    "# Grid search (it can be slow, so the grid is kept small)\n",
    "grid_search_xgb = GridSearchCV(\n",
    "    XGBClassifier(random_state=42, eval_metric='logloss'),\n",
    "    param_grid_xgb,\n",
    "    cv=3,\n",
    "    scoring='accuracy',\n",
    "    n_jobs=-1,\n",
    "    verbose=0\n",
    ")\n",
    "\n",
    "# Search on a 30% subsample of the training data to save time\n",
    "X_sample, _, y_sample, _ = train_test_split(X_train, y_train, train_size=0.3, random_state=42)\n",
    "grid_search_xgb.fit(X_sample, y_sample)\n",
    "\n",
    "print(\"=== XGBoost grid search results ===\")\n",
    "print(f\"Best parameters: {grid_search_xgb.best_params_}\")\n",
    "print(f\"Best CV score: {grid_search_xgb.best_score_:.4f}\")\n",
    "print(f\"Test score: {grid_search_xgb.score(X_test, y_test):.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-15",
   "metadata": {},
   "source": [
    "## 3. LightGBM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-16",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install LightGBM with: pip install lightgbm\n",
    "import lightgbm as lgb\n",
    "from lightgbm import LGBMClassifier, LGBMRegressor\n",
    "\n",
    "print(\"\"\"\n",
    "LightGBM features:\n",
    "\n",
    "1. Leaf-wise growth:\n",
    "   - Conventional: level-wise (all nodes at one depth are split together)\n",
    "   - LightGBM: leaf-wise (split the leaf with the largest loss reduction)\n",
    "   - Often faster and more accurate, but more prone to overfitting\n",
    "\n",
    "2. Histogram-based splitting:\n",
    "   - Discretizes continuous values into bins\n",
    "   - Memory-efficient and fast to train\n",
    "\n",
    "3. GOSS (Gradient-based One-Side Sampling):\n",
    "   - Keeps samples with large gradients and subsamples those with small gradients\n",
    "\n",
    "4. EFB (Exclusive Feature Bundling):\n",
    "   - Bundles mutually exclusive features together\n",
    "   - Effective for sparse features\n",
    "\"\"\")\n",
    "\n",
    "print(f\"LightGBM version: {lgb.__version__}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-17",
   "metadata": {},
   "source": [
    "### 3.1 Classification with LightGBM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-18",
   "metadata": {},
   "outputs": [],
   "source": [
    "# LightGBM classifier\n",
    "lgb_clf = LGBMClassifier(\n",
    "    n_estimators=100,\n",
    "    learning_rate=0.1,\n",
    "    max_depth=-1,           # -1: no limit\n",
    "    num_leaves=31,          # maximum number of leaves per tree\n",
    "    min_child_samples=20,   # minimum number of samples per leaf\n",
    "    subsample=1.0,          # row sampling (only takes effect when subsample_freq > 0)\n",
    "    colsample_bytree=1.0,   # column sampling\n",
    "    reg_alpha=0,            # L1 regularization\n",
    "    reg_lambda=0,           # L2 regularization\n",
    "    random_state=42,\n",
    "    verbose=-1\n",
    ")\n",
    "\n",
    "# Train\n",
    "start_time = time.time()\n",
    "lgb_clf.fit(X_train, y_train)\n",
    "train_time_lgb = time.time() - start_time\n",
    "\n",
    "# Evaluate\n",
    "y_pred_lgb = lgb_clf.predict(X_test)\n",
    "accuracy_lgb = accuracy_score(y_test, y_pred_lgb)\n",
    "\n",
    "print(\"=== LightGBM classification results ===\")\n",
    "print(f\"Train accuracy: {lgb_clf.score(X_train, y_train):.4f}\")\n",
    "print(f\"Test accuracy: {accuracy_lgb:.4f}\")\n",
    "print(f\"Training time: {train_time_lgb:.4f}s\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-19",
   "metadata": {},
   "source": [
    "### 3.2 num_leaves vs max_depth"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-20",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\"\"\n",
    "Relationship between num_leaves and max_depth:\n",
    "- A tree with max_depth = d has at most 2^d leaves\n",
    "- num_leaves = 31 is roughly comparable to max_depth = 5\n",
    "- To limit overfitting, keep num_leaves < 2^max_depth\n",
    "\n",
    "Recommended settings:\n",
    "- Large datasets: num_leaves of at most 2^max_depth - 1\n",
    "- Small datasets: keep num_leaves small (15 to 31)\n",
    "\"\"\")\n",
    "\n",
    "# Accuracy as a function of num_leaves\n",
    "# Note: with about 455 training rows and the default min_child_samples=20,\n",
    "# a tree cannot grow much beyond 22 leaves, so large values may score identically\n",
    "num_leaves_range = [15, 31, 63, 127, 255]\n",
    "train_scores_lgb = []\n",
    "test_scores_lgb = []\n",
    "\n",
    "for num_leaves in num_leaves_range:\n",
    "    lgb_temp = LGBMClassifier(\n",
    "        n_estimators=100,\n",
    "        num_leaves=num_leaves,\n",
    "        random_state=42,\n",
    "        verbose=-1\n",
    "    )\n",
    "    lgb_temp.fit(X_train, y_train)\n",
    "    train_scores_lgb.append(lgb_temp.score(X_train, y_train))\n",
    "    test_scores_lgb.append(lgb_temp.score(X_test, y_test))\n",
    "\n",
    "plt.figure(figsize=(10, 6))\n",
    "plt.plot(num_leaves_range, train_scores_lgb, 'o-', label='Train')\n",
    "plt.plot(num_leaves_range, test_scores_lgb, 's-', label='Test')\n",
    "plt.xlabel('num_leaves')\n",
    "plt.ylabel('Accuracy')\n",
    "plt.title('LightGBM: num_leaves Effect')\n",
    "plt.legend()\n",
    "plt.grid(True, alpha=0.3)\n",
    "plt.show()"
   ]
  },
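  {
   "cell_type": "markdown",
   "id": "cell-20a",
   "metadata": {},
   "source": [
    "The relationship above is simple arithmetic, and it can be tabulated. The sketch below computes a rough upper bound on the leaves one tree can have from three limits: `num_leaves`, the depth cap 2^max_depth, and the data cap implied by `min_child_samples`. The training-set size of 455 is an assumption (80% of the 569 breast cancer rows), and LightGBM's other constraints can lower the real number further, so treat the result as a ceiling rather than a prediction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-20b",
   "metadata": {},
   "outputs": [],
   "source": [
    "n_train = 455            # assumed training-set size (80% of 569 rows)\n",
    "min_child_samples = 20   # LightGBM default\n",
    "\n",
    "# Each leaf needs at least min_child_samples rows\n",
    "data_cap = n_train // min_child_samples\n",
    "\n",
    "print(f\"{'max_depth':>9} {'num_leaves':>10} {'2^depth':>8} {'data cap':>8} {'ceiling':>8}\")\n",
    "for max_depth in [3, 5, 7, 10]:\n",
    "    depth_cap = 2 ** max_depth  # a binary tree of depth d has at most 2^d leaves\n",
    "    for num_leaves in [15, 31, 255]:\n",
    "        ceiling = min(num_leaves, depth_cap, data_cap)\n",
    "        print(f\"{max_depth:>9} {num_leaves:>10} {depth_cap:>8} {data_cap:>8} {ceiling:>8}\")"
   ]
  },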
  {
   "cell_type": "markdown",
   "id": "cell-21",
   "metadata": {},
   "source": [
    "### 3.3 Feature Importance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-22",
   "metadata": {},
   "outputs": [],
   "source": [
    "# LightGBM feature importance (split counts by default)\n",
    "importance_lgb = pd.DataFrame({\n",
    "    'Feature': cancer.feature_names,\n",
    "    'Importance': lgb_clf.feature_importances_\n",
    "}).sort_values('Importance', ascending=True).tail(15)\n",
    "\n",
    "plt.figure(figsize=(10, 8))\n",
    "plt.barh(importance_lgb['Feature'], importance_lgb['Importance'])\n",
    "plt.xlabel('Importance')\n",
    "plt.title('LightGBM Feature Importance - Top 15')\n",
    "plt.grid(True, alpha=0.3)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-23",
   "metadata": {},
   "source": [
    "## 4. CatBoost Overview"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-24",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\"\"\n",
    "CatBoost features:\n",
    "\n",
    "1. Automatic handling of categorical features:\n",
    "   - Applies target encoding automatically\n",
    "   - Ordered target statistics prevent target leakage\n",
    "\n",
    "2. Ordered boosting:\n",
    "   - Uses random permutations of the training data to reduce bias\n",
    "   - Helps limit overfitting\n",
    "\n",
    "3. Symmetric trees:\n",
    "   - All nodes at the same level use the same split condition\n",
    "   - Faster prediction\n",
    "\n",
    "Install: pip install catboost\n",
    "\n",
    "Basic usage:\n",
    "from catboost import CatBoostClassifier\n",
    "\n",
    "cat_clf = CatBoostClassifier(\n",
    "    iterations=100,\n",
    "    learning_rate=0.1,\n",
    "    depth=6,\n",
    "    l2_leaf_reg=3,\n",
    "    random_state=42,\n",
    "    verbose=False\n",
    ")\n",
    "\n",
    "cat_clf.fit(X_train, y_train)\n",
    "\"\"\")"
   ]
  },
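  {
   "cell_type": "markdown",
   "id": "cell-24a",
   "metadata": {},
   "source": [
    "The categorical handling described above is what most distinguishes CatBoost in practice, so a small runnable sketch may help. It passes a string column directly through the `cat_features` argument of `fit`, with no manual encoding. The dataset (`df_cat`, `y_cat`) is made up for illustration, the hyperparameters are arbitrary, and the import is guarded because catboost is an optional dependency of this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-24b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "try:\n",
    "    from catboost import CatBoostClassifier\n",
    "except ImportError:\n",
    "    CatBoostClassifier = None  # catboost is optional here\n",
    "\n",
    "# Tiny made-up dataset: one categorical column and one numeric column\n",
    "df_cat = pd.DataFrame({\n",
    "    'city': ['seoul', 'busan', 'seoul', 'daegu', 'busan', 'seoul', 'daegu', 'busan'] * 5,\n",
    "    'income': [5.1, 3.2, 6.0, 2.8, 3.5, 5.7, 2.9, 3.1] * 5,\n",
    "})\n",
    "y_cat = [1, 0, 1, 0, 0, 1, 0, 0] * 5\n",
    "\n",
    "if CatBoostClassifier is not None:\n",
    "    cat_demo = CatBoostClassifier(iterations=50, depth=2, random_seed=42, verbose=False)\n",
    "    # The string column is passed as-is; CatBoost encodes it internally\n",
    "    cat_demo.fit(df_cat, y_cat, cat_features=['city'])\n",
    "    print(cat_demo.predict(df_cat.head(3)))\n",
    "else:\n",
    "    print(\"catboost is not installed; skipping this example\")"
   ]
  },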
  {
   "cell_type": "markdown",
   "id": "cell-25",
   "metadata": {},
   "source": [
    "## 5. Comparing Boosting Algorithms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-26",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import GradientBoostingClassifier\n",
    "\n",
    "# Define the models\n",
    "models = {\n",
    "    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),\n",
    "    'XGBoost': XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'),\n",
    "    'LightGBM': LGBMClassifier(n_estimators=100, random_state=42, verbose=-1)\n",
    "}\n",
    "\n",
    "# Compare\n",
    "print(\"Boosting algorithm comparison:\")\n",
    "print(\"-\" * 70)\n",
    "print(f\"{'Model':<20} {'Train accuracy':>15} {'Test accuracy':>15} {'Train time (s)':>15}\")\n",
    "print(\"-\" * 70)\n",
    "\n",
    "results = {}\n",
    "for name, model in models.items():\n",
    "    start_time = time.time()\n",
    "    model.fit(X_train, y_train)\n",
    "    train_time = time.time() - start_time\n",
    "\n",
    "    train_acc = model.score(X_train, y_train)\n",
    "    test_acc = model.score(X_test, y_test)\n",
    "\n",
    "    results[name] = {\n",
    "        'train_accuracy': train_acc,\n",
    "        'test_accuracy': test_acc,\n",
    "        'time': train_time\n",
    "    }\n",
    "\n",
    "    print(f\"{name:<20} {train_acc:>15.4f} {test_acc:>15.4f} {train_time:>15.4f}\")\n",
    "\n",
    "print(\"-\" * 70)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-27",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visual comparison\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# Test accuracy\n",
    "names = list(results.keys())\n",
    "test_accuracies = [results[n]['test_accuracy'] for n in names]\n",
    "axes[0].barh(names, test_accuracies, color=['skyblue', 'salmon', 'lightgreen'])\n",
    "axes[0].set_xlabel('Test Accuracy')\n",
    "axes[0].set_title('Accuracy Comparison')\n",
    "axes[0].set_xlim([0.9, 1.0])\n",
    "axes[0].grid(True, alpha=0.3)\n",
    "\n",
    "# Training time\n",
    "times = [results[n]['time'] for n in names]\n",
    "axes[1].barh(names, times, color=['skyblue', 'salmon', 'lightgreen'])\n",
    "axes[1].set_xlabel('Training Time (seconds)')\n",
    "axes[1].set_title('Training Time Comparison')\n",
    "axes[1].grid(True, alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-28",
   "metadata": {},
   "source": [
    "## 6. Regression Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-29",
   "metadata": {},
   "outputs": [],
   "source": [
    "# California Housing data\n",
    "housing = fetch_california_housing()\n",
    "X_reg, y_reg = housing.data, housing.target\n",
    "\n",
    "X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(\n",
    "    X_reg, y_reg, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "print(f\"Data shape: {X_reg.shape}\")\n",
    "print(f\"Features: {housing.feature_names}\")\n",
    "print(\"Target: Median house value (in $100,000s)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-30",
   "metadata": {},
   "outputs": [],
   "source": [
    "# XGBoost regression\n",
    "xgb_reg = XGBRegressor(\n",
    "    n_estimators=100,\n",
    "    learning_rate=0.1,\n",
    "    max_depth=6,\n",
    "    random_state=42\n",
    ")\n",
    "xgb_reg.fit(X_train_reg, y_train_reg)\n",
    "y_pred_xgb_reg = xgb_reg.predict(X_test_reg)\n",
    "\n",
    "# LightGBM regression\n",
    "lgb_reg = LGBMRegressor(\n",
    "    n_estimators=100,\n",
    "    learning_rate=0.1,\n",
    "    num_leaves=31,\n",
    "    random_state=42,\n",
    "    verbose=-1\n",
    ")\n",
    "lgb_reg.fit(X_train_reg, y_train_reg)\n",
    "y_pred_lgb_reg = lgb_reg.predict(X_test_reg)\n",
    "\n",
    "# Evaluate\n",
    "print(\"=== XGBoost regression ===\")\n",
    "print(f\"R² Score: {r2_score(y_test_reg, y_pred_xgb_reg):.4f}\")\n",
    "print(f\"RMSE: {np.sqrt(mean_squared_error(y_test_reg, y_pred_xgb_reg)):.4f}\")\n",
    "\n",
    "print(\"\\n=== LightGBM regression ===\")\n",
    "print(f\"R² Score: {r2_score(y_test_reg, y_pred_lgb_reg):.4f}\")\n",
    "print(f\"RMSE: {np.sqrt(mean_squared_error(y_test_reg, y_pred_lgb_reg)):.4f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-31",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Predicted vs. actual values\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# XGBoost\n",
    "axes[0].scatter(y_test_reg, y_pred_xgb_reg, alpha=0.5)\n",
    "axes[0].plot([y_test_reg.min(), y_test_reg.max()], \n",
    "             [y_test_reg.min(), y_test_reg.max()], 'r--', lw=2)\n",
    "axes[0].set_xlabel('Actual')\n",
    "axes[0].set_ylabel('Predicted')\n",
    "axes[0].set_title(f'XGBoost (R²={r2_score(y_test_reg, y_pred_xgb_reg):.4f})')\n",
    "axes[0].grid(True, alpha=0.3)\n",
    "\n",
    "# LightGBM\n",
    "axes[1].scatter(y_test_reg, y_pred_lgb_reg, alpha=0.5, color='green')\n",
    "axes[1].plot([y_test_reg.min(), y_test_reg.max()], \n",
    "             [y_test_reg.min(), y_test_reg.max()], 'r--', lw=2)\n",
    "axes[1].set_xlabel('Actual')\n",
    "axes[1].set_ylabel('Predicted')\n",
    "axes[1].set_title(f'LightGBM (R²={r2_score(y_test_reg, y_pred_lgb_reg):.4f})')\n",
    "axes[1].grid(True, alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-32",
   "metadata": {},
   "source": [
    "## 7. Hyperparameter Guide"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-33",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hyperparameter comparison table\n",
    "params_comparison = pd.DataFrame({\n",
    "    'Parameter': ['Learning rate', 'Number of trees', 'Depth', 'Number of leaves', 'L1 regularization', 'L2 regularization', 'Row sampling', 'Column sampling'],\n",
    "    'XGBoost': ['learning_rate', 'n_estimators', 'max_depth', 'max_leaves', 'reg_alpha', 'reg_lambda', 'subsample', 'colsample_bytree'],\n",
    "    'LightGBM': ['learning_rate', 'n_estimators', 'max_depth', 'num_leaves', 'reg_alpha', 'reg_lambda', 'subsample', 'colsample_bytree'],\n",
    "    'Effect': ['Lower is more stable', 'More trees fit more closely', 'Deeper is more complex', 'More leaves is more complex', 'Limits overfitting', 'Limits overfitting', 'Reduces variance', 'Increases diversity']\n",
    "})\n",
    "\n",
    "print(\"Hyperparameter guide:\")\n",
    "print(params_comparison.to_string(index=False))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-34",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\"\"\n",
    "Recommended tuning order:\n",
    "\n",
    "1. Tree structure parameters:\n",
    "   - max_depth, num_leaves\n",
    "   - min_child_weight, min_child_samples\n",
    "\n",
    "2. Sampling parameters:\n",
    "   - subsample\n",
    "   - colsample_bytree\n",
    "\n",
    "3. Regularization parameters:\n",
    "   - reg_alpha, reg_lambda\n",
    "\n",
    "4. Learning rate:\n",
    "   - Lower learning_rate\n",
    "   - Raise n_estimators to compensate\n",
    "\n",
    "Strategies to limit overfitting:\n",
    "- Early stopping (early_stopping_rounds)\n",
    "- Regularization (reg_alpha, reg_lambda)\n",
    "- Sampling (subsample, colsample_bytree)\n",
    "- Tree constraints (max_depth, min_child_weight)\n",
    "- A lower learning rate (learning_rate)\n",
    "\"\"\")"
   ]
  },
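  {
   "cell_type": "markdown",
   "id": "cell-34a",
   "metadata": {},
   "source": [
    "Step 4 and the early-stopping strategy above combine naturally: set a low learning rate, allow many trees, and let a validation set decide when to stop. The sketch below shows one way to do this with LightGBM's scikit-learn API, using the `lgb.early_stopping` callback. It assumes a reasonably recent LightGBM release that provides this callback; the split, the variable names (`X_tr`, `X_va`, `lgb_es`), and the hyperparameter values are illustrative choices, not recommendations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-34b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import lightgbm as lgb\n",
    "from lightgbm import LGBMClassifier\n",
    "from sklearn.datasets import load_breast_cancer\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_all, y_all = load_breast_cancer(return_X_y=True)\n",
    "X_tr, X_va, y_tr, y_va = train_test_split(X_all, y_all, test_size=0.2, random_state=42)\n",
    "\n",
    "# Low learning rate with a generous tree budget; early stopping picks the cutoff\n",
    "lgb_es = LGBMClassifier(\n",
    "    n_estimators=1000,\n",
    "    learning_rate=0.05,\n",
    "    num_leaves=15,\n",
    "    random_state=42,\n",
    "    verbose=-1\n",
    ")\n",
    "lgb_es.fit(\n",
    "    X_tr, y_tr,\n",
    "    eval_set=[(X_va, y_va)],\n",
    "    eval_metric='binary_logloss',\n",
    "    callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=False)]\n",
    ")\n",
    "\n",
    "print(f\"Best iteration: {lgb_es.best_iteration_}\")\n",
    "print(f\"Validation accuracy: {lgb_es.score(X_va, y_va):.4f}\")"
   ]
  },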
  {
   "cell_type": "markdown",
   "id": "cell-35",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "### Algorithm Comparison\n",
    "\n",
    "| Algorithm | Key idea | Strengths | Weaknesses |\n",
    "|-----------|----------|-----------|------------|\n",
    "| Gradient Boosting | Fits residuals | High accuracy | Slow training |\n",
    "| XGBoost | Regularization + parallelism | Fast, accurate | Memory usage |\n",
    "| LightGBM | Leaf-wise growth | Very fast, handles large data | Overfitting risk |\n",
    "| CatBoost | Categorical handling | Needs little tuning | Slower training |\n",
    "\n",
    "### Selection Guide\n",
    "\n",
    "- **Small data (<10K rows)**: XGBoost or scikit-learn GradientBoosting\n",
    "- **Medium data (10K-100K rows)**: XGBoost\n",
    "- **Large data (>100K rows)**: LightGBM\n",
    "- **Many categorical features**: CatBoost\n",
    "- **Fast training needed**: LightGBM\n",
    "- **Best accuracy**: try all of them, then ensemble\n",
    "\n",
    "### Key Hyperparameters\n",
    "\n",
    "**Common:**\n",
    "- `n_estimators`: number of trees\n",
    "- `learning_rate`: learning rate\n",
    "- `max_depth`: tree depth\n",
    "\n",
    "**XGBoost:**\n",
    "- `min_child_weight`: minimum sum of instance weights in a child\n",
    "- `gamma`: minimum loss reduction required to split\n",
    "\n",
    "**LightGBM:**\n",
    "- `num_leaves`: maximum number of leaves per tree\n",
    "- `min_child_samples`: minimum number of samples per leaf\n",
    "\n",
    "### Next Steps\n",
    "- Stacking and blending\n",
    "- Automated hyperparameter optimization (Optuna, Hyperopt)\n",
    "- Enter a real Kaggle competition"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}