{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-0",
   "metadata": {},
   "source": [
    "# 08. XGBoost & LightGBM\n",
    "\n",
    "## Learning Objectives\n",
    "- Understand the concept of Gradient Boosting\n",
    "- Learn XGBoost usage and hyperparameters\n",
    "- Learn LightGBM features and optimization\n",
    "- Get an overview of CatBoost\n",
    "- Compare and select models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV\n",
    "from sklearn.metrics import accuracy_score, classification_report, mean_squared_error, r2_score\n",
    "from sklearn.datasets import make_classification, load_breast_cancer, fetch_california_housing\n",
    "import time\n",
    "\n",
    "plt.rcParams['font.family'] = 'DejaVu Sans'\n",
    "plt.rcParams['axes.unicode_minus'] = False"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-2",
   "metadata": {},
   "source": [
    "## 1. Gradient Boosting Concepts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-3",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\"\"\n",
    "Gradient Boosting algorithm:\n",
    "\n",
    "1. Initialize: F_0(x) = argmin_γ Σ L(y_i, γ)\n",
    "\n",
    "2. Iterate (m = 1, 2, ..., M):\n",
    "   a. Compute the pseudo-residuals:\n",
    "      r_im = -[∂L(y_i, F(x_i))/∂F(x_i)]_{F=F_{m-1}}\n",
    "   \n",
    "   b. Fit a weak learner h_m(x) to the residuals\n",
    "   \n",
    "   c. Compute the optimal step size:\n",
    "      γ_m = argmin_γ Σ L(y_i, F_{m-1}(x_i) + γ * h_m(x_i))\n",
    "   \n",
    "   d. Update the model:\n",
    "      F_m(x) = F_{m-1}(x) + learning_rate * γ_m * h_m(x)\n",
    "\n",
    "Key points:\n",
    "- Each stage fits the errors (residuals) of the previous model\n",
    "- Optimization follows the negative gradient of the loss function\n",
    "- learning_rate shrinks each update, which helps limit overfitting\n",
    "\"\"\")\n",
    "\n",
    "# Simple data for visualization\n",
    "np.random.seed(42)\n",
    "X_demo = np.linspace(0, 10, 100).reshape(-1, 1)\n",
    "y_demo = np.sin(X_demo).ravel() + np.random.randn(100) * 0.3\n",
    "\n",
    "plt.figure(figsize=(12, 4))\n",
    "plt.subplot(1, 2, 1)\n",
    "plt.scatter(X_demo, y_demo, alpha=0.5)\n",
    "plt.xlabel('X')\n",
    "plt.ylabel('y')\n",
    "plt.title('Sample Data for Gradient Boosting')\n",
    "plt.grid(True, alpha=0.3)\n",
    "\n",
    "plt.subplot(1, 2, 2)\n",
    "# Fit a small sklearn model so that every listed stage can be drawn\n",
    "from sklearn.ensemble import GradientBoostingRegressor\n",
    "gbr_demo = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1, max_depth=2, random_state=42)\n",
    "gbr_demo.fit(X_demo, y_demo)\n",
    "staged_preds = list(gbr_demo.staged_predict(X_demo))  # index k holds the prediction after k + 1 trees\n",
    "stages = [0, 1, 5, 20, 50]\n",
    "colors = ['red', 'orange', 'yellow', 'green', 'blue']\n",
    "for stage, color in zip(stages, colors):\n",
    "    if stage == 0:\n",
    "        # Stage 0 is the constant initial prediction (the mean for squared loss)\n",
    "        plt.axhline(y=np.mean(y_demo), color=color, label=f'Stage {stage}', alpha=0.7)\n",
    "    else:\n",
    "        plt.plot(X_demo, staged_preds[stage - 1], color=color, label=f'Stage {stage}', alpha=0.7)\n",
    "plt.scatter(X_demo, y_demo, alpha=0.3, color='gray')\n",
    "plt.xlabel('X')\n",
    "plt.ylabel('y')\n",
    "plt.title('Gradient Boosting: Sequential Learning')\n",
    "plt.legend()\n",
    "plt.grid(True, alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
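The algorithm printed in the cell above can be sketched in plain Python for the squared-error loss, where the pseudo-residual is simply `y - F(x)`. This is a minimal illustration, not how any library implements boosting: the helper names `fit_stump` and `gradient_boost` and the one-split "stump" weak learner are assumptions made for this example. Step (c) is skipped because, for squared loss, a stump whose leaves predict the mean residual already has the optimal step size of 1.

```python
import math

def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue  # no threshold separates identical x values
        left = [residuals[i] for i in order[:k]]
        right = [residuals[i] for i in order[k:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, (lo + hi) / 2, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best

    def stump(x):
        return left_mean if x <= threshold else right_mean
    return stump

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    f0 = sum(ys) / len(ys)  # step 1: the mean minimises squared loss
    preds = [f0] * len(xs)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # step 2a
        h = fit_stump(xs, residuals)                    # step 2b
        stumps.append(h)
        preds = [p + learning_rate * h(x) for p, x in zip(preds, xs)]  # step 2d
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return f0 + learning_rate * sum(h(x) for h in stumps)
    return predict, history

xs = [i / 10 for i in range(100)]
ys = [math.sin(x) for x in xs]
predict, history = gradient_boost(xs, ys)
print(f"Training MSE after 1 round: {history[0]:.4f}, after 50 rounds: {history[-1]:.4f}")
```

The training error falls with every round because each stump is the least-squares fit to the current residuals and the update is scaled by a learning rate below 1.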
  {
   "cell_type": "markdown",
   "id": "cell-4",
   "metadata": {},
   "source": [
    "## 2. XGBoost (eXtreme Gradient Boosting)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install XGBoost: pip install xgboost\n",
    "import xgboost as xgb\n",
    "from xgboost import XGBClassifier, XGBRegressor\n",
    "\n",
    "print(\"\"\"\n",
    "XGBoost features:\n",
    "\n",
    "1. Regularization:\n",
    "   - L1 and L2 regularization to limit overfitting\n",
    "   - Objective: Σ L(y_i, ŷ_i) + Σ Ω(f_k)\n",
    "   - Ω(f) = γT + 0.5λ||w||²  (T: number of leaves, w: leaf weights)\n",
    "\n",
    "2. Efficient computation:\n",
    "   - Uses a second-order Taylor expansion of the loss\n",
    "   - Histogram-based splitting\n",
    "   - Cache optimization\n",
    "\n",
    "3. Missing value handling:\n",
    "   - Automatically learns the best default direction at each split\n",
    "\n",
    "4. Parallel processing:\n",
    "   - Split-point search is parallelized across features\n",
    "\"\"\")\n",
    "\n",
    "print(f\"XGBoost version: {xgb.__version__}\")"
   ]
  },
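To make the regularized objective in the cell above concrete, the sketch below evaluates the closed-form leaf weight and split gain that follow from the second-order Taylor expansion with Ω(f) = γT + 0.5λ||w||². The function names `leaf_weight` and `split_gain` and the toy data are assumptions for illustration; this is not XGBoost's code, only the arithmetic behind it, and the L1 term is left out as in the formula above.

```python
def leaf_weight(grads, hessians, lam=1.0):
    """Optimal leaf value w* = -G / (H + lambda), with G and H the summed gradients and Hessians."""
    return -sum(grads) / (sum(hessians) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Loss reduction of a split: 0.5 * [score(L) + score(R) - score(parent)] - gamma."""
    def score(g, h):
        return g * g / (h + lam)
    gl, hl = sum(g_left), sum(h_left)
    gr, hr = sum(g_right), sum(h_right)
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr)) - gamma

# Squared-error loss L = 0.5 * (y - pred)^2 gives gradient (pred - y) and Hessian 1.
ys = [1.0, 1.2, 3.0, 3.4]
pred = 0.0
grads = [pred - y for y in ys]
hess = [1.0] * len(ys)

w_all = leaf_weight(grads, hess)  # a single leaf holding all four samples
gain = split_gain(grads[:2], hess[:2], grads[2:], hess[2:])
gain_penalized = split_gain(grads[:2], hess[:2], grads[2:], hess[2:], gamma=1.0)

print(f"leaf weight: {w_all:.4f}")
print(f"gain (gamma=0): {gain:.4f}, gain (gamma=1): {gain_penalized:.4f}")
```

With `lam=0` the leaf weight reduces to the mean residual. A positive `gamma` turns a marginally useful split into a negative-gain split, which is how the `gamma` hyperparameter prunes.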
  {
   "cell_type": "markdown",
   "id": "cell-6",
   "metadata": {},
   "source": [
    "### 2.1 XGBoost Classification"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the Breast Cancer dataset\n",
    "cancer = load_breast_cancer()\n",
    "X, y = cancer.data, cancer.target\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "print(f\"Data shape: {X.shape}\")\n",
    "print(f\"Classes: {cancer.target_names}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# XGBoost classifier\n",
    "xgb_clf = XGBClassifier(\n",
    "    n_estimators=100,\n",
    "    learning_rate=0.1,\n",
    "    max_depth=6,\n",
    "    min_child_weight=1,     # minimum sum of instance weights (Hessian) in a child\n",
    "    gamma=0,                # minimum loss reduction required to split\n",
    "    subsample=1.0,          # row sampling ratio\n",
    "    colsample_bytree=1.0,   # column sampling ratio per tree\n",
    "    reg_alpha=0,            # L1 regularization\n",
    "    reg_lambda=1,           # L2 regularization\n",
    "    random_state=42,\n",
    "    eval_metric='logloss'\n",
    ")\n",
    "\n",
    "# Train\n",
    "start_time = time.time()\n",
    "xgb_clf.fit(X_train, y_train)\n",
    "train_time = time.time() - start_time\n",
    "\n",
    "# Predict and evaluate\n",
    "y_pred = xgb_clf.predict(X_test)\n",
    "accuracy = accuracy_score(y_test, y_pred)\n",
    "\n",
    "print(\"=== XGBoost Classification Results ===\")\n",
    "print(f\"Train accuracy: {xgb_clf.score(X_train, y_train):.4f}\")\n",
    "print(f\"Test accuracy: {accuracy:.4f}\")\n",
    "print(f\"Training time: {train_time:.4f} s\")\n",
    "print(\"\\nClassification report:\")\n",
    "print(classification_report(y_test, y_pred, target_names=cancer.target_names))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-9",
   "metadata": {},
   "source": [
    "### 2.2 Early Stopping"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-10",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split off a validation set\n",
    "X_train_sub, X_val, y_train_sub, y_val = train_test_split(\n",
    "    X_train, y_train, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "# Use early stopping\n",
    "xgb_early = XGBClassifier(\n",
    "    n_estimators=1000,\n",
    "    learning_rate=0.1,\n",
    "    max_depth=6,\n",
    "    random_state=42,\n",
    "    early_stopping_rounds=10,  # stop if there is no improvement for 10 rounds\n",
    "    eval_metric='logloss'\n",
    ")\n",
    "\n",
    "xgb_early.fit(\n",
    "    X_train_sub, y_train_sub,\n",
    "    eval_set=[(X_val, y_val)],\n",
    "    verbose=False\n",
    ")\n",
    "\n",
    "print(\"=== Early Stopping Results ===\")\n",
    "print(f\"Best iteration: {xgb_early.best_iteration}\")\n",
    "print(f\"Best score: {xgb_early.best_score:.4f}\")\n",
    "print(f\"Test accuracy: {xgb_early.score(X_test, y_test):.4f}\")"
   ]
  },
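The stopping rule used above (halt when the validation metric has not improved for a fixed number of rounds) can be sketched independently of XGBoost. The function name `early_stopping` and the hand-written loss curve are hypothetical; the sketch only mirrors the idea behind `early_stopping_rounds` and does not reproduce the library's internals.

```python
def early_stopping(val_losses, patience=10):
    """Return (best_iteration, best_score, stopped_at) for a lower-is-better metric."""
    best_iter, best_score = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_score:
            best_iter, best_score = i, loss      # new best: reset the patience window
        elif i - best_iter >= patience:
            return best_iter, best_score, i      # no improvement for `patience` rounds
    return best_iter, best_score, len(val_losses) - 1

# Hypothetical validation log-loss per boosting round.
losses = [0.60, 0.50, 0.45, 0.44, 0.46, 0.47, 0.48, 0.49]
best_iter, best_score, stopped_at = early_stopping(losses, patience=3)
print(f"best iteration: {best_iter}, best score: {best_score}, stopped at round: {stopped_at}")
```

Training stops a few rounds after the best iteration, so the model kept for prediction should be the one at the best iteration rather than the last one trained.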
  {
   "cell_type": "markdown",
   "id": "cell-11",
   "metadata": {},
   "source": [
    "### 2.3 Feature Importance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-12",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize feature importance\n",
    "importance_types = ['weight', 'gain', 'cover']\n",
    "\n",
    "fig, axes = plt.subplots(1, 3, figsize=(18, 5))\n",
    "\n",
    "for ax, imp_type in zip(axes, importance_types):\n",
    "    importance_dict = xgb_clf.get_booster().get_score(importance_type=imp_type)\n",
    "    \n",
    "    if importance_dict:\n",
    "        # Show only the top 10\n",
    "        sorted_importance = sorted(importance_dict.items(), key=lambda x: x[1], reverse=True)[:10]\n",
    "        features = [x[0] for x in sorted_importance]\n",
    "        values = [x[1] for x in sorted_importance]\n",
    "        \n",
    "        ax.barh(range(len(features)), values)\n",
    "        ax.set_yticks(range(len(features)))\n",
    "        ax.set_yticklabels(features)\n",
    "        ax.set_xlabel('Importance')\n",
    "        ax.set_title(f'Feature Importance ({imp_type})')\n",
    "        ax.grid(True, alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"\"\"\n",
    "Importance types:\n",
    "- weight: number of times the feature is used in a split\n",
    "- gain: average gain of the splits that use the feature\n",
    "- cover: average number of samples covered by splits that use the feature\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-13",
   "metadata": {},
   "source": [
    "### 2.4 XGBoost Hyperparameter Tuning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-14",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parameter grid\n",
    "param_grid_xgb = {\n",
    "    'max_depth': [3, 5, 7],\n",
    "    'learning_rate': [0.01, 0.1, 0.3],\n",
    "    'n_estimators': [100, 200],\n",
    "    'min_child_weight': [1, 3],\n",
    "    'subsample': [0.8, 1.0],\n",
    "    'colsample_bytree': [0.8, 1.0]\n",
    "}\n",
    "\n",
    "# Grid Search (a reduced grid, since the search can take a long time)\n",
    "grid_search_xgb = GridSearchCV(\n",
    "    XGBClassifier(random_state=42, eval_metric='logloss'),\n",
    "    param_grid_xgb,\n",
    "    cv=3,\n",
    "    scoring='accuracy',\n",
    "    n_jobs=-1,\n",
    "    verbose=0\n",
    ")\n",
    "\n",
    "# Use a 30% sample of the training data to save time\n",
    "X_sample, _, y_sample, _ = train_test_split(X_train, y_train, train_size=0.3, random_state=42)\n",
    "grid_search_xgb.fit(X_sample, y_sample)\n",
    "\n",
    "print(\"=== XGBoost Grid Search Results ===\")\n",
    "print(f\"Best parameters: {grid_search_xgb.best_params_}\")\n",
    "print(f\"Best CV score: {grid_search_xgb.best_score_:.4f}\")\n",
    "print(f\"Test score: {grid_search_xgb.score(X_test, y_test):.4f}\")"
   ]
  },
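Before launching a grid search like the one above, it helps to count how many models it will train. The sketch below enumerates the same grid with `itertools.product`; the variable names are assumptions for this example, and the count covers the cross-validation fits only (with the default `refit=True`, `GridSearchCV` trains one more model on the full data at the end).

```python
from itertools import product

# Same grid as the notebook cell above.
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [100, 200],
    "min_child_weight": [1, 3],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}
cv_folds = 3

# Every combination of values is one candidate; each candidate is fitted once per fold.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(candidates)
n_cv_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_cv_fits} cross-validation fits")
```

Because the count is a product over all parameters, adding one more value to any list multiplies the cost, which is why random or Bayesian search is often preferred for larger spaces.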
  {
   "cell_type": "markdown",
   "id": "cell-15",
   "metadata": {},
   "source": [
    "## 3. LightGBM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-16",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install LightGBM: pip install lightgbm\n",
    "import lightgbm as lgb\n",
    "from lightgbm import LGBMClassifier, LGBMRegressor\n",
    "\n",
    "print(\"\"\"\n",
    "LightGBM features:\n",
    "\n",
    "1. Leaf-wise growth:\n",
    "   - Conventional: level-wise (all nodes at a depth are split together)\n",
    "   - LightGBM: leaf-wise (splits the leaf with the largest loss reduction)\n",
    "   - Faster and often more accurate, but with a higher risk of overfitting\n",
    "\n",
    "2. Histogram-based splitting:\n",
    "   - Discretizes continuous values into bins\n",
    "   - Memory efficient, fast training\n",
    "\n",
    "3. GOSS (Gradient-based One-Side Sampling):\n",
    "   - Samples mainly the instances with large gradients\n",
    "\n",
    "4. EFB (Exclusive Feature Bundling):\n",
    "   - Bundles mutually exclusive features together\n",
    "   - Effective for sparse features\n",
    "\"\"\")\n",
    "\n",
    "print(f\"LightGBM version: {lgb.__version__}\")"
   ]
  },
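The GOSS idea listed above can be sketched in a few lines: keep the fraction `a` of instances with the largest absolute gradients, randomly sample a fraction `b` of the total from the remainder, and up-weight the sampled small-gradient instances by `(1 - a) / b` so the gradient statistics stay approximately unbiased. The function name `goss_sample`, the default rates, and the synthetic gradients are assumptions for illustration, following the description in the LightGBM paper rather than the library's source.

```python
import random

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Return {index: weight} for the instances kept by a GOSS-style sampler."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n, rand_n = round(a * n), round(b * n)
    top, rest = order[:top_n], order[top_n:]
    sampled = random.Random(seed).sample(rest, rand_n)  # uniform sample of small gradients
    weights = {i: 1.0 for i in top}                      # large gradients are always kept
    weights.update({i: (1 - a) / b for i in sampled})    # compensate for the sub-sampling
    return weights

# Synthetic gradients: 100 distinct values between -5.0 and 4.9 in a shuffled order.
gradients = [((i * 37) % 100 - 50) / 10 for i in range(100)]
weights = goss_sample(gradients)

print(f"kept {len(weights)} of {len(gradients)} instances")
print(f"total weight: {sum(weights.values()):.1f}")
```

Only 30% of the rows are used to build the tree, yet the total weight still matches the original row count, which is the point of the re-weighting.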
  {
   "cell_type": "markdown",
   "id": "cell-17",
   "metadata": {},
   "source": [
    "### 3.1 LightGBM Classification"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-18",
   "metadata": {},
   "outputs": [],
   "source": [
    "# LightGBM classifier\n",
    "lgb_clf = LGBMClassifier(\n",
    "    n_estimators=100,\n",
    "    learning_rate=0.1,\n",
    "    max_depth=-1,           # -1: no limit\n",
    "    num_leaves=31,          # maximum number of leaves per tree\n",
    "    min_child_samples=20,   # minimum number of samples in a leaf\n",
    "    subsample=1.0,          # row sampling (only active when subsample_freq > 0)\n",
    "    colsample_bytree=1.0,   # column sampling\n",
    "    reg_alpha=0,            # L1 regularization\n",
    "    reg_lambda=0,           # L2 regularization\n",
    "    random_state=42,\n",
    "    verbose=-1\n",
    ")\n",
    "\n",
    "# Train\n",
    "start_time = time.time()\n",
    "lgb_clf.fit(X_train, y_train)\n",
    "train_time_lgb = time.time() - start_time\n",
    "\n",
    "# Evaluate\n",
    "y_pred_lgb = lgb_clf.predict(X_test)\n",
    "accuracy_lgb = accuracy_score(y_test, y_pred_lgb)\n",
    "\n",
    "print(\"=== LightGBM Classification Results ===\")\n",
    "print(f\"Train accuracy: {lgb_clf.score(X_train, y_train):.4f}\")\n",
    "print(f\"Test accuracy: {accuracy_lgb:.4f}\")\n",
    "print(f\"Training time: {train_time_lgb:.4f} s\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-19",
   "metadata": {},
   "source": [
    "### 3.2 num_leaves vs max_depth"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-20",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\"\"\n",
    "Relationship between num_leaves and max_depth:\n",
    "- With max_depth = d, a tree has at most 2^d leaves\n",
    "- num_leaves = 31 roughly corresponds to max_depth = 5\n",
    "- To limit overfitting: num_leaves < 2^max_depth\n",
    "\n",
    "Recommended settings:\n",
    "- Large datasets: num_leaves at most 2^max_depth - 1\n",
    "- Small datasets: keep num_leaves small (15 to 31)\n",
    "\"\"\")\n",
    "\n",
    "# Performance as a function of num_leaves\n",
    "num_leaves_range = [15, 31, 63, 127, 255]\n",
    "train_scores_lgb = []\n",
    "test_scores_lgb = []\n",
    "\n",
    "for num_leaves in num_leaves_range:\n",
    "    lgb_temp = LGBMClassifier(\n",
    "        n_estimators=100,\n",
    "        num_leaves=num_leaves,\n",
    "        random_state=42,\n",
    "        verbose=-1\n",
    "    )\n",
    "    lgb_temp.fit(X_train, y_train)\n",
    "    train_scores_lgb.append(lgb_temp.score(X_train, y_train))\n",
    "    test_scores_lgb.append(lgb_temp.score(X_test, y_test))\n",
    "\n",
    "plt.figure(figsize=(10, 6))\n",
    "plt.plot(num_leaves_range, train_scores_lgb, 'o-', label='Train')\n",
    "plt.plot(num_leaves_range, test_scores_lgb, 's-', label='Test')\n",
    "plt.xlabel('num_leaves')\n",
    "plt.ylabel('Accuracy')\n",
    "plt.title('LightGBM: num_leaves Effect')\n",
    "plt.legend()\n",
    "plt.grid(True, alpha=0.3)\n",
    "plt.show()"
   ]
  },
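The arithmetic behind the `num_leaves` and `max_depth` rule of thumb above is easy to check directly. The helper names below are assumptions for this sketch; it computes the leaf capacity of a full binary tree at each depth and the smallest depth able to hold each `num_leaves` value tried in the cell above.

```python
def max_leaves_for_depth(depth):
    """A binary tree of the given depth has at most 2**depth leaves."""
    return 2 ** depth

def min_depth_for_leaves(num_leaves):
    """Smallest depth d with 2**d >= num_leaves, using integer arithmetic only."""
    return (num_leaves - 1).bit_length()

for num_leaves in [15, 31, 63, 127, 255]:
    depth = min_depth_for_leaves(num_leaves)
    print(f"num_leaves={num_leaves:>3} fits in depth {depth} "
          f"(capacity {max_leaves_for_depth(depth)} leaves)")
```

This is only the lower bound on depth. A leaf-wise tree with `n` leaves can be as deep as `n - 1` when it keeps splitting the same branch, which is why LightGBM models are usually constrained through `num_leaves` and, if needed, an explicit `max_depth`.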
  {
   "cell_type": "markdown",
   "id": "cell-21",
   "metadata": {},
   "source": [
    "### 3.3 Feature Importance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-22",
   "metadata": {},
   "outputs": [],
   "source": [
    "# LightGBM feature importance (split counts by default)\n",
    "importance_lgb = pd.DataFrame({\n",
    "    'Feature': cancer.feature_names,\n",
    "    'Importance': lgb_clf.feature_importances_\n",
    "}).sort_values('Importance', ascending=True).tail(15)\n",
    "\n",
    "plt.figure(figsize=(10, 8))\n",
    "plt.barh(importance_lgb['Feature'], importance_lgb['Importance'])\n",
    "plt.xlabel('Importance')\n",
    "plt.title('LightGBM Feature Importance - Top 15')\n",
    "plt.grid(True, alpha=0.3)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-23",
   "metadata": {},
   "source": [
    "## 4. CatBoost Overview"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-24",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\"\"\n",
    "CatBoost features:\n",
    "\n",
    "1. Automatic handling of categorical features:\n",
    "   - Target encoding is applied automatically\n",
    "   - Ordered Target Statistics prevent data leakage\n",
    "\n",
    "2. Ordered Boosting:\n",
    "   - Randomizes the training order to reduce bias\n",
    "   - Helps limit overfitting\n",
    "\n",
    "3. Symmetric trees:\n",
    "   - All nodes at the same level use the same split condition\n",
    "   - Faster prediction\n",
    "\n",
    "Install: pip install catboost\n",
    "\n",
    "Basic usage:\n",
    "from catboost import CatBoostClassifier\n",
    "\n",
    "cat_clf = CatBoostClassifier(\n",
    "    iterations=100,\n",
    "    learning_rate=0.1,\n",
    "    depth=6,\n",
    "    l2_leaf_reg=3,\n",
    "    random_state=42,\n",
    "    verbose=False\n",
    ")\n",
    "\n",
    "cat_clf.fit(X_train, y_train)\n",
    "\"\"\")"
   ]
  },
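The Ordered Target Statistics idea mentioned above can be illustrated with a small sketch: each row's category is encoded using only the targets of rows that come before it in some ordering, smoothed toward a prior, so a row never sees its own label. The function name `ordered_target_stats`, the smoothing weight `a`, and the use of the given row order in place of a random permutation are simplifying assumptions; CatBoost's actual procedure uses random permutations and its own defaults.

```python
def ordered_target_stats(categories, targets, prior, a=1.0):
    """Encode each category from the targets of earlier rows only."""
    sums, counts = {}, {}
    encoded = []
    for cat, y in zip(categories, targets):
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded.append((s + a * prior) / (n + a))  # smoothed mean of earlier targets
        sums[cat] = s + y                          # the current target is added only afterwards
        counts[cat] = n + 1
    return encoded

cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 0, 1, 1]
prior = sum(ys) / len(ys)  # global target mean used as the prior

encoded = ordered_target_stats(cats, ys, prior)
for cat, y, value in zip(cats, ys, encoded):
    print(f"category={cat} target={y} encoded={value:.4f}")
```

The first occurrence of each category falls back to the prior, and later occurrences drift toward the running mean of that category. A plain (unordered) target mean would instead leak each row's own label into its feature.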
  {
   "cell_type": "markdown",
   "id": "cell-25",
   "metadata": {},
   "source": [
    "## 5. Comparing Boosting Algorithms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-26",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import GradientBoostingClassifier\n",
    "\n",
    "# Define the models\n",
    "models = {\n",
    "    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),\n",
    "    'XGBoost': XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'),\n",
    "    'LightGBM': LGBMClassifier(n_estimators=100, random_state=42, verbose=-1)\n",
    "}\n",
    "\n",
    "# Compare\n",
    "print(\"Boosting algorithm comparison:\")\n",
    "print(\"-\" * 70)\n",
    "print(f\"{'Model':<20} {'Train accuracy':>15} {'Test accuracy':>15} {'Train time (s)':>15}\")\n",
    "print(\"-\" * 70)\n",
    "\n",
    "results = {}\n",
    "for name, model in models.items():\n",
    "    start_time = time.time()\n",
    "    model.fit(X_train, y_train)\n",
    "    train_time = time.time() - start_time\n",
    "    \n",
    "    train_acc = model.score(X_train, y_train)\n",
    "    test_acc = model.score(X_test, y_test)\n",
    "    \n",
    "    results[name] = {\n",
    "        'train_accuracy': train_acc,\n",
    "        'test_accuracy': test_acc,\n",
    "        'time': train_time\n",
    "    }\n",
    "    \n",
    "    print(f\"{name:<20} {train_acc:>15.4f} {test_acc:>15.4f} {train_time:>15.4f}\")\n",
    "\n",
    "print(\"-\" * 70)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-27",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visual comparison\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# Accuracy comparison\n",
    "names = list(results.keys())\n",
    "test_accuracies = [results[n]['test_accuracy'] for n in names]\n",
    "axes[0].barh(names, test_accuracies, color=['skyblue', 'salmon', 'lightgreen'])\n",
    "axes[0].set_xlabel('Test Accuracy')\n",
    "axes[0].set_title('Accuracy Comparison')\n",
    "axes[0].set_xlim([0.9, 1.0])\n",
    "axes[0].grid(True, alpha=0.3)\n",
    "\n",
    "# Training time comparison\n",
    "times = [results[n]['time'] for n in names]\n",
    "axes[1].barh(names, times, color=['skyblue', 'salmon', 'lightgreen'])\n",
    "axes[1].set_xlabel('Training Time (seconds)')\n",
    "axes[1].set_title('Training Time Comparison')\n",
    "axes[1].grid(True, alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-28",
   "metadata": {},
   "source": [
    "## 6. Regression Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-29",
   "metadata": {},
   "outputs": [],
   "source": [
    "# California Housing dataset\n",
    "housing = fetch_california_housing()\n",
    "X_reg, y_reg = housing.data, housing.target\n",
    "\n",
    "X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(\n",
    "    X_reg, y_reg, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "print(f\"Data shape: {X_reg.shape}\")\n",
    "print(f\"Features: {housing.feature_names}\")\n",
    "print(\"Target: median house value (in $100,000s)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-30",
   "metadata": {},
   "outputs": [],
   "source": [
    "# XGBoost regression\n",
    "xgb_reg = XGBRegressor(\n",
    "    n_estimators=100,\n",
    "    learning_rate=0.1,\n",
    "    max_depth=6,\n",
    "    random_state=42\n",
    ")\n",
    "xgb_reg.fit(X_train_reg, y_train_reg)\n",
    "y_pred_xgb_reg = xgb_reg.predict(X_test_reg)\n",
    "\n",
    "# LightGBM regression\n",
    "lgb_reg = LGBMRegressor(\n",
    "    n_estimators=100,\n",
    "    learning_rate=0.1,\n",
    "    num_leaves=31,\n",
    "    random_state=42,\n",
    "    verbose=-1\n",
    ")\n",
    "lgb_reg.fit(X_train_reg, y_train_reg)\n",
    "y_pred_lgb_reg = lgb_reg.predict(X_test_reg)\n",
    "\n",
    "# Evaluate\n",
    "print(\"=== XGBoost Regression ===\")\n",
    "print(f\"R² Score: {r2_score(y_test_reg, y_pred_xgb_reg):.4f}\")\n",
    "print(f\"RMSE: {np.sqrt(mean_squared_error(y_test_reg, y_pred_xgb_reg)):.4f}\")\n",
    "\n",
    "print(\"\\n=== LightGBM Regression ===\")\n",
    "print(f\"R² Score: {r2_score(y_test_reg, y_pred_lgb_reg):.4f}\")\n",
    "print(f\"RMSE: {np.sqrt(mean_squared_error(y_test_reg, y_pred_lgb_reg)):.4f}\")"
   ]
  },
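The two metrics reported above are simple enough to compute by hand, which can help when reading the numbers. The sketch below implements RMSE and R² directly from their definitions on a small made-up example; the function names and the four data points are assumptions for illustration and are unrelated to the California Housing results.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(f"RMSE: {rmse(y_true, y_pred):.4f}")
print(f"R2:   {r2(y_true, y_pred):.4f}")
```

RMSE is expressed in the units of the target (here that would be $100,000s for the housing data), while R² is unitless and compares the model against always predicting the mean.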
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-31",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Predicted vs actual visualization\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# XGBoost\n",
    "axes[0].scatter(y_test_reg, y_pred_xgb_reg, alpha=0.5)\n",
    "axes[0].plot([y_test_reg.min(), y_test_reg.max()], \n",
    "             [y_test_reg.min(), y_test_reg.max()], 'r--', lw=2)\n",
    "axes[0].set_xlabel('Actual')\n",
    "axes[0].set_ylabel('Predicted')\n",
    "axes[0].set_title(f'XGBoost (R²={r2_score(y_test_reg, y_pred_xgb_reg):.4f})')\n",
    "axes[0].grid(True, alpha=0.3)\n",
    "\n",
    "# LightGBM\n",
    "axes[1].scatter(y_test_reg, y_pred_lgb_reg, alpha=0.5, color='green')\n",
    "axes[1].plot([y_test_reg.min(), y_test_reg.max()], \n",
    "             [y_test_reg.min(), y_test_reg.max()], 'r--', lw=2)\n",
    "axes[1].set_xlabel('Actual')\n",
    "axes[1].set_ylabel('Predicted')\n",
    "axes[1].set_title(f'LightGBM (R²={r2_score(y_test_reg, y_pred_lgb_reg):.4f})')\n",
    "axes[1].grid(True, alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-32",
   "metadata": {},
   "source": [
    "## 7. Hyperparameter Guide"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-33",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hyperparameter comparison table\n",
    "params_comparison = pd.DataFrame({\n",
    "    'Parameter': ['Learning rate', 'Number of trees', 'Depth', 'Number of leaves', 'L1 regularization', 'L2 regularization', 'Row sampling', 'Column sampling'],\n",
    "    'XGBoost': ['learning_rate', 'n_estimators', 'max_depth', '-', 'reg_alpha', 'reg_lambda', 'subsample', 'colsample_bytree'],\n",
    "    'LightGBM': ['learning_rate', 'n_estimators', 'max_depth', 'num_leaves', 'reg_alpha', 'reg_lambda', 'subsample', 'colsample_bytree'],\n",
    "    'Effect': ['Lower is more stable', 'More is more accurate', 'Deeper is more complex', 'More is more complex', 'Limits overfitting', 'Limits overfitting', 'Reduces variance', 'Increases diversity']\n",
    "})\n",
    "\n",
    "print(\"Hyperparameter guide:\")\n",
    "print(params_comparison.to_string(index=False))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-34",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\"\"\n",
    "Recommended tuning order:\n",
    "\n",
    "1. Tree structure parameters:\n",
    "   - max_depth, num_leaves\n",
    "   - min_child_weight, min_child_samples\n",
    "\n",
    "2. Sampling parameters:\n",
    "   - subsample\n",
    "   - colsample_bytree\n",
    "\n",
    "3. Regularization parameters:\n",
    "   - reg_alpha, reg_lambda\n",
    "\n",
    "4. Learning rate adjustment:\n",
    "   - Lower learning_rate\n",
    "   - Increase n_estimators\n",
    "\n",
    "Strategies to limit overfitting:\n",
    "- Early stopping (early_stopping_rounds)\n",
    "- Regularization (reg_alpha, reg_lambda)\n",
    "- Sampling (subsample, colsample_bytree)\n",
    "- Tree constraints (max_depth, min_child_weight)\n",
    "- Lower learning rate (learning_rate)\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-35",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "### Algorithm Comparison\n",
    "\n",
    "| Algorithm | Key idea | Strengths | Weaknesses |\n",
    "|----------|------|------|------|\n",
    "| Gradient Boosting | Residual learning | High accuracy | Slow training |\n",
    "| XGBoost | Regularization + parallelization | Fast, accurate | Memory usage |\n",
    "| LightGBM | Leaf-wise growth | Very fast, handles large data | Overfitting risk |\n",
    "| CatBoost | Categorical handling | Needs little tuning | Slow start |\n",
    "\n",
    "### Selection Guide\n",
    "\n",
    "- **Small data (<10K)**: XGBoost or sklearn GradientBoosting\n",
    "- **Medium data (10K-100K)**: XGBoost\n",
    "- **Large data (>100K)**: LightGBM\n",
    "- **Many categorical features**: CatBoost\n",
    "- **Fast training needed**: LightGBM\n",
    "- **Best accuracy**: try all of them, then ensemble\n",
    "\n",
    "### Key Hyperparameters\n",
    "\n",
    "**Common:**\n",
    "- `n_estimators`: number of trees\n",
    "- `learning_rate`: learning rate\n",
    "- `max_depth`: tree depth\n",
    "\n",
    "**XGBoost-specific:**\n",
    "- `min_child_weight`: minimum sum of instance weights in a leaf\n",
    "- `gamma`: minimum loss reduction required to split\n",
    "\n",
    "**LightGBM-specific:**\n",
    "- `num_leaves`: maximum number of leaves\n",
    "- `min_child_samples`: minimum number of samples in a leaf\n",
    "\n",
    "### Next Steps\n",
    "- Stacking and Blending\n",
    "- AutoML (Optuna, Hyperopt)\n",
    "- Taking part in real Kaggle competitions"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}