1{
2 "cells": [
3 {
4 "cell_type": "markdown",
5 "id": "cell-0",
6 "metadata": {},
7 "source": [
8 "# 08. XGBoost & LightGBM\n",
9 "\n",
"## Learning Goals\n",
"- Understand the Gradient Boosting concept\n",
"- XGBoost usage and hyperparameters\n",
"- LightGBM characteristics and optimization\n",
"- CatBoost overview\n",
"- Model comparison and selection"
16 ]
17 },
18 {
19 "cell_type": "code",
20 "execution_count": null,
21 "id": "cell-1",
22 "metadata": {},
23 "outputs": [],
24 "source": [
25 "import numpy as np\n",
26 "import pandas as pd\n",
27 "import matplotlib.pyplot as plt\n",
28 "import seaborn as sns\n",
29 "from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV\n",
30 "from sklearn.metrics import accuracy_score, classification_report, mean_squared_error, r2_score\n",
31 "from sklearn.datasets import make_classification, load_breast_cancer, fetch_california_housing\n",
32 "import time\n",
33 "\n",
34 "plt.rcParams['font.family'] = 'DejaVu Sans'\n",
35 "plt.rcParams['axes.unicode_minus'] = False"
36 ]
37 },
38 {
39 "cell_type": "markdown",
40 "id": "cell-2",
41 "metadata": {},
42 "source": [
"## 1. Gradient Boosting Concept"
44 ]
45 },
46 {
47 "cell_type": "code",
48 "execution_count": null,
49 "id": "cell-3",
50 "metadata": {},
51 "outputs": [],
52 "source": [
"print(\"\"\"\n",
"Gradient Boosting algorithm:\n",
"\n",
"1. Initialize: F_0(x) = argmin_γ Σ L(y_i, γ)\n",
"\n",
"2. For m = 1, 2, ..., M:\n",
"   a. Compute pseudo-residuals:\n",
"      r_im = -[∂L(y_i, F(x_i))/∂F(x_i)]_{F=F_{m-1}}\n",
"   \n",
"   b. Fit a weak learner h_m(x) to the residuals\n",
"   \n",
"   c. Compute the optimal step size:\n",
"      γ_m = argmin_γ Σ L(y_i, F_{m-1}(x_i) + γ * h_m(x_i))\n",
"   \n",
"   d. Update the model:\n",
"      F_m(x) = F_{m-1}(x) + learning_rate * γ_m * h_m(x)\n",
"\n",
"Key points:\n",
"- Each stage fits the errors (residuals) of the previous model\n",
"- Optimization follows the negative gradient of the loss function\n",
"- learning_rate guards against overfitting\n",
"\"\"\")\n",
"\n",
"# Toy data for visualization\n",
"np.random.seed(42)\n",
"X_demo = np.linspace(0, 10, 100).reshape(-1, 1)\n",
"y_demo = np.sin(X_demo).ravel() + np.random.randn(100) * 0.3\n",
"\n",
"# Fit a small boosted regressor so each stage's prediction can actually be drawn\n",
"from sklearn.ensemble import GradientBoostingRegressor\n",
"gbr_demo = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1, max_depth=2, random_state=42)\n",
"gbr_demo.fit(X_demo, y_demo)\n",
"staged_preds = list(gbr_demo.staged_predict(X_demo))\n",
"\n",
"plt.figure(figsize=(12, 4))\n",
"plt.subplot(1, 2, 1)\n",
"plt.scatter(X_demo, y_demo, alpha=0.5)\n",
"plt.xlabel('X')\n",
"plt.ylabel('y')\n",
"plt.title('Sample Data for Gradient Boosting')\n",
"plt.grid(True, alpha=0.3)\n",
"\n",
"plt.subplot(1, 2, 2)\n",
"plt.scatter(X_demo, y_demo, alpha=0.3, color='gray')\n",
"plt.axhline(y=np.mean(y_demo), color='red', label='Stage 0 (mean)', alpha=0.7)\n",
"stages = [1, 5, 20, 50]\n",
"colors = ['orange', 'gold', 'green', 'blue']\n",
"for stage, color in zip(stages, colors):\n",
"    plt.plot(X_demo, staged_preds[stage - 1], color=color, label=f'Stage {stage}', alpha=0.8)\n",
"plt.xlabel('X')\n",
"plt.ylabel('y')\n",
"plt.title('Gradient Boosting: Sequential Learning')\n",
"plt.legend()\n",
"plt.grid(True, alpha=0.3)\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
104 ]
105 },
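  {
   "cell_type": "markdown",
   "id": "cell-3-gb-sketch-md",
   "metadata": {},
   "source": [
    "The update rule above can also be written out by hand. Below is a minimal from-scratch sketch, assuming squared-error loss (so the pseudo-residuals reduce to `y - F(x)`) and shallow regression trees as the weak learners; names like `X_toy` and `F` are illustrative only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-3-gb-sketch",
   "metadata": {},
   "outputs": [],
   "source": [
    "# From-scratch gradient boosting sketch for squared-error loss.\n",
    "# Assumption: with L(y, F) = 0.5 * (y - F)^2, the pseudo-residual is simply y - F.\n",
    "import numpy as np\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "rng = np.random.RandomState(42)\n",
    "X_toy = np.linspace(0, 10, 100).reshape(-1, 1)\n",
    "y_toy = np.sin(X_toy).ravel() + rng.randn(100) * 0.3\n",
    "\n",
    "learning_rate = 0.1\n",
    "F = np.full_like(y_toy, y_toy.mean())  # F_0: constant minimizing squared loss\n",
    "for m in range(100):\n",
    "    residuals = y_toy - F              # step (a): pseudo-residuals\n",
    "    h = DecisionTreeRegressor(max_depth=2, random_state=42)\n",
    "    h.fit(X_toy, residuals)            # step (b): fit weak learner to residuals\n",
    "    F = F + learning_rate * h.predict(X_toy)  # step (d): additive update\n",
    "\n",
    "print(f\"MSE of F_0 (mean only): {np.mean((y_toy - y_toy.mean()) ** 2):.4f}\")\n",
    "print(f\"MSE after 100 stages:   {np.mean((y_toy - F) ** 2):.4f}\")"
   ]
  },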
106 {
107 "cell_type": "markdown",
108 "id": "cell-4",
109 "metadata": {},
110 "source": [
111 "## 2. XGBoost (eXtreme Gradient Boosting)"
112 ]
113 },
114 {
115 "cell_type": "code",
116 "execution_count": null,
117 "id": "cell-5",
118 "metadata": {},
119 "outputs": [],
120 "source": [
"# Install XGBoost: pip install xgboost\n",
"import xgboost as xgb\n",
"from xgboost import XGBClassifier, XGBRegressor\n",
"\n",
"print(\"\"\"\n",
"XGBoost features:\n",
"\n",
"1. Regularization:\n",
"   - L1/L2 regularization curbs overfitting\n",
"   - Objective: Σ L(y_i, ŷ_i) + Σ Ω(f_k)\n",
"   - Ω(f) = γT + 0.5λ||w||²\n",
"\n",
"2. Efficient computation:\n",
"   - Second-order Taylor expansion of the loss\n",
"   - Histogram-based split finding\n",
"   - Cache-aware optimization\n",
"\n",
"3. Missing-value handling:\n",
"   - Learns the best default direction automatically\n",
"\n",
"4. Parallelism:\n",
"   - Parallel split finding across features\n",
"\"\"\")\n",
"\n",
"print(f\"XGBoost version: {xgb.__version__}\")"
146 ]
147 },
148 {
149 "cell_type": "markdown",
150 "id": "cell-6",
151 "metadata": {},
152 "source": [
"### 2.1 XGBoost Classification"
154 ]
155 },
156 {
157 "cell_type": "code",
158 "execution_count": null,
159 "id": "cell-7",
160 "metadata": {},
161 "outputs": [],
162 "source": [
"# Load the Breast Cancer dataset\n",
164 "cancer = load_breast_cancer()\n",
165 "X, y = cancer.data, cancer.target\n",
166 "\n",
167 "X_train, X_test, y_train, y_test = train_test_split(\n",
168 " X, y, test_size=0.2, random_state=42\n",
169 ")\n",
170 "\n",
"print(f\"Data shape: {X.shape}\")\n",
"print(f\"Classes: {cancer.target_names}\")"
173 ]
174 },
175 {
176 "cell_type": "code",
177 "execution_count": null,
178 "id": "cell-8",
179 "metadata": {},
180 "outputs": [],
181 "source": [
"# XGBoost classifier\n",
"xgb_clf = XGBClassifier(\n",
"    n_estimators=100,\n",
"    learning_rate=0.1,\n",
"    max_depth=6,\n",
"    min_child_weight=1,      # minimum sum of instance weight in a leaf\n",
"    gamma=0,                 # minimum loss reduction required to split\n",
"    subsample=1.0,           # row sampling ratio\n",
"    colsample_bytree=1.0,    # column sampling ratio per tree\n",
"    reg_alpha=0,             # L1 regularization\n",
"    reg_lambda=1,            # L2 regularization\n",
"    random_state=42,\n",
"    eval_metric='logloss'\n",
")\n",
"\n",
"# Train\n",
"start_time = time.time()\n",
"xgb_clf.fit(X_train, y_train)\n",
"train_time = time.time() - start_time\n",
"\n",
"# Predict and evaluate\n",
"y_pred = xgb_clf.predict(X_test)\n",
"accuracy = accuracy_score(y_test, y_pred)\n",
"\n",
"print(\"=== XGBoost Classification Results ===\")\n",
"print(f\"Train accuracy: {xgb_clf.score(X_train, y_train):.4f}\")\n",
"print(f\"Test accuracy: {accuracy:.4f}\")\n",
"print(f\"Training time: {train_time:.4f}s\")\n",
"print(\"\\nClassification report:\")\n",
"print(classification_report(y_test, y_pred, target_names=cancer.target_names))"
212 ]
213 },
214 {
215 "cell_type": "markdown",
216 "id": "cell-9",
217 "metadata": {},
218 "source": [
"### 2.2 Early Stopping"
220 ]
221 },
222 {
223 "cell_type": "code",
224 "execution_count": null,
225 "id": "cell-10",
226 "metadata": {},
227 "outputs": [],
228 "source": [
"# Split off a validation set\n",
"X_train_sub, X_val, y_train_sub, y_val = train_test_split(\n",
"    X_train, y_train, test_size=0.2, random_state=42\n",
")\n",
"\n",
"# Use early stopping\n",
"xgb_early = XGBClassifier(\n",
"    n_estimators=1000,\n",
"    learning_rate=0.1,\n",
"    max_depth=6,\n",
"    random_state=42,\n",
"    early_stopping_rounds=10,  # stop after 10 rounds without improvement\n",
"    eval_metric='logloss'\n",
")\n",
"\n",
"xgb_early.fit(\n",
"    X_train_sub, y_train_sub,\n",
"    eval_set=[(X_val, y_val)],\n",
"    verbose=False\n",
")\n",
"\n",
"print(\"=== Early Stopping Results ===\")\n",
"print(f\"Best iteration: {xgb_early.best_iteration}\")\n",
"print(f\"Best score: {xgb_early.best_score:.4f}\")\n",
"print(f\"Test accuracy: {xgb_early.score(X_test, y_test):.4f}\")"
254 ]
255 },
256 {
257 "cell_type": "markdown",
258 "id": "cell-11",
259 "metadata": {},
260 "source": [
"### 2.3 Feature Importance"
262 ]
263 },
264 {
265 "cell_type": "code",
266 "execution_count": null,
267 "id": "cell-12",
268 "metadata": {},
269 "outputs": [],
270 "source": [
"# Visualize feature importance\n",
272 "importance_types = ['weight', 'gain', 'cover']\n",
273 "\n",
274 "fig, axes = plt.subplots(1, 3, figsize=(18, 5))\n",
275 "\n",
276 "for ax, imp_type in zip(axes, importance_types):\n",
277 " importance_dict = xgb_clf.get_booster().get_score(importance_type=imp_type)\n",
278 " \n",
279 " if importance_dict:\n",
"        # show the top 10 only\n",
281 " sorted_importance = sorted(importance_dict.items(), key=lambda x: x[1], reverse=True)[:10]\n",
282 " features = [x[0] for x in sorted_importance]\n",
283 " values = [x[1] for x in sorted_importance]\n",
284 " \n",
285 " ax.barh(range(len(features)), values)\n",
286 " ax.set_yticks(range(len(features)))\n",
287 " ax.set_yticklabels(features)\n",
288 " ax.set_xlabel('Importance')\n",
289 " ax.set_title(f'Feature Importance ({imp_type})')\n",
290 " ax.grid(True, alpha=0.3)\n",
291 "\n",
292 "plt.tight_layout()\n",
293 "plt.show()\n",
294 "\n",
"print(\"\"\"\n",
"Importance types:\n",
"- weight: number of times a feature is used in splits\n",
"- gain: average gain when splitting on the feature\n",
"- cover: average number of samples covered by splits on the feature\n",
"\"\"\")"
301 ]
302 },
303 {
304 "cell_type": "markdown",
305 "id": "cell-13",
306 "metadata": {},
307 "source": [
"### 2.4 XGBoost Hyperparameter Tuning"
309 ]
310 },
311 {
312 "cell_type": "code",
313 "execution_count": null,
314 "id": "cell-14",
315 "metadata": {},
316 "outputs": [],
317 "source": [
"# Parameter grid\n",
319 "param_grid_xgb = {\n",
320 " 'max_depth': [3, 5, 7],\n",
321 " 'learning_rate': [0.01, 0.1, 0.3],\n",
322 " 'n_estimators': [100, 200],\n",
323 " 'min_child_weight': [1, 3],\n",
324 " 'subsample': [0.8, 1.0],\n",
325 " 'colsample_bytree': [0.8, 1.0]\n",
326 "}\n",
327 "\n",
"# Grid search (can take a while, so keep the grid small)\n",
329 "grid_search_xgb = GridSearchCV(\n",
330 " XGBClassifier(random_state=42, eval_metric='logloss'),\n",
331 " param_grid_xgb,\n",
332 " cv=3,\n",
333 " scoring='accuracy',\n",
334 " n_jobs=-1,\n",
335 " verbose=0\n",
336 ")\n",
337 "\n",
"# Use a subsample of the training data to save time\n",
339 "X_sample, _, y_sample, _ = train_test_split(X_train, y_train, train_size=0.3, random_state=42)\n",
340 "grid_search_xgb.fit(X_sample, y_sample)\n",
341 "\n",
"print(\"=== XGBoost Grid Search Results ===\")\n",
"print(f\"Best parameters: {grid_search_xgb.best_params_}\")\n",
"print(f\"Best CV score: {grid_search_xgb.best_score_:.4f}\")\n",
"print(f\"Test score: {grid_search_xgb.score(X_test, y_test):.4f}\")"
346 ]
347 },
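  {
   "cell_type": "markdown",
   "id": "cell-14-random-search-md",
   "metadata": {},
   "source": [
    "The full grid above has 3·3·2·2·2·2 = 144 combinations, i.e. 432 fits with 3-fold CV. A hedged sketch of a cheaper alternative: `RandomizedSearchCV` samples a fixed budget of configurations from the same space (`n_iter=10` below is an illustrative setting, not a tuned one)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-14-random-search",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Randomized search: evaluate only n_iter sampled configurations instead of the full grid\n",
    "from sklearn.model_selection import RandomizedSearchCV\n",
    "\n",
    "random_search_xgb = RandomizedSearchCV(\n",
    "    XGBClassifier(random_state=42, eval_metric='logloss'),\n",
    "    param_grid_xgb,        # same search space as the grid search above\n",
    "    n_iter=10,             # budget: 10 sampled configurations\n",
    "    cv=3,\n",
    "    scoring='accuracy',\n",
    "    random_state=42,\n",
    "    n_jobs=-1\n",
    ")\n",
    "random_search_xgb.fit(X_sample, y_sample)\n",
    "\n",
    "print(f\"Best parameters: {random_search_xgb.best_params_}\")\n",
    "print(f\"Best CV score: {random_search_xgb.best_score_:.4f}\")"
   ]
  },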
348 {
349 "cell_type": "markdown",
350 "id": "cell-15",
351 "metadata": {},
352 "source": [
353 "## 3. LightGBM"
354 ]
355 },
356 {
357 "cell_type": "code",
358 "execution_count": null,
359 "id": "cell-16",
360 "metadata": {},
361 "outputs": [],
362 "source": [
"# Install LightGBM: pip install lightgbm\n",
364 "import lightgbm as lgb\n",
365 "from lightgbm import LGBMClassifier, LGBMRegressor\n",
366 "\n",
"print(\"\"\"\n",
"LightGBM features:\n",
"\n",
"1. Leaf-wise growth:\n",
"   - Conventional GBMs: level-wise (balanced, level-by-level splits)\n",
"   - LightGBM: leaf-wise (splits the leaf with the largest loss reduction)\n",
"   - Faster and often more accurate, but at higher risk of overfitting\n",
"\n",
"2. Histogram-based splitting:\n",
"   - Discretizes continuous values into bins\n",
"   - Memory-efficient and fast to train\n",
"\n",
"3. GOSS (Gradient-based One-Side Sampling):\n",
"   - Keeps samples with large gradients, subsamples the rest\n",
"\n",
"4. EFB (Exclusive Feature Bundling):\n",
"   - Bundles mutually exclusive features together\n",
"   - Effective for sparse features\n",
"\"\"\")\n",
"\n",
"print(f\"LightGBM version: {lgb.__version__}\")"
388 ]
389 },
390 {
391 "cell_type": "markdown",
392 "id": "cell-17",
393 "metadata": {},
394 "source": [
"### 3.1 LightGBM Classification"
396 ]
397 },
398 {
399 "cell_type": "code",
400 "execution_count": null,
401 "id": "cell-18",
402 "metadata": {},
403 "outputs": [],
404 "source": [
"# LightGBM classifier\n",
"lgb_clf = LGBMClassifier(\n",
"    n_estimators=100,\n",
"    learning_rate=0.1,\n",
"    max_depth=-1,            # -1: no depth limit\n",
"    num_leaves=31,           # maximum number of leaves\n",
"    min_child_samples=20,    # minimum samples per leaf\n",
"    subsample=1.0,           # row sampling\n",
"    colsample_bytree=1.0,    # column sampling\n",
"    reg_alpha=0,             # L1 regularization\n",
"    reg_lambda=0,            # L2 regularization\n",
"    random_state=42,\n",
"    verbose=-1\n",
")\n",
"\n",
"# Train\n",
"start_time = time.time()\n",
"lgb_clf.fit(X_train, y_train)\n",
"train_time_lgb = time.time() - start_time\n",
"\n",
"# Evaluate\n",
"y_pred_lgb = lgb_clf.predict(X_test)\n",
"accuracy_lgb = accuracy_score(y_test, y_pred_lgb)\n",
"\n",
"print(\"=== LightGBM Classification Results ===\")\n",
"print(f\"Train accuracy: {lgb_clf.score(X_train, y_train):.4f}\")\n",
"print(f\"Test accuracy: {accuracy_lgb:.4f}\")\n",
"print(f\"Training time: {train_time_lgb:.4f}s\")"
433 ]
434 },
435 {
436 "cell_type": "markdown",
437 "id": "cell-19",
438 "metadata": {},
439 "source": [
440 "### 3.2 num_leaves vs max_depth"
441 ]
442 },
443 {
444 "cell_type": "code",
445 "execution_count": null,
446 "id": "cell-20",
447 "metadata": {},
448 "outputs": [],
449 "source": [
"print(\"\"\"\n",
"Relationship between num_leaves and max_depth:\n",
"- With max_depth = d, a tree has at most 2^d leaves\n",
"- num_leaves = 31 roughly corresponds to max_depth = 5\n",
"- To limit overfitting: keep num_leaves < 2^max_depth\n",
"\n",
"Recommended settings:\n",
"- Large datasets: num_leaves up to 2^max_depth - 1\n",
"- Small datasets: keep num_leaves small (15-31)\n",
"\"\"\")\n",
"\n",
"# Performance as num_leaves varies\n",
462 "num_leaves_range = [15, 31, 63, 127, 255]\n",
463 "train_scores_lgb = []\n",
464 "test_scores_lgb = []\n",
465 "\n",
466 "for num_leaves in num_leaves_range:\n",
467 " lgb_temp = LGBMClassifier(\n",
468 " n_estimators=100,\n",
469 " num_leaves=num_leaves,\n",
470 " random_state=42,\n",
471 " verbose=-1\n",
472 " )\n",
473 " lgb_temp.fit(X_train, y_train)\n",
474 " train_scores_lgb.append(lgb_temp.score(X_train, y_train))\n",
475 " test_scores_lgb.append(lgb_temp.score(X_test, y_test))\n",
476 "\n",
477 "plt.figure(figsize=(10, 6))\n",
478 "plt.plot(num_leaves_range, train_scores_lgb, 'o-', label='Train')\n",
479 "plt.plot(num_leaves_range, test_scores_lgb, 's-', label='Test')\n",
480 "plt.xlabel('num_leaves')\n",
481 "plt.ylabel('Accuracy')\n",
482 "plt.title('LightGBM: num_leaves Effect')\n",
483 "plt.legend()\n",
484 "plt.grid(True, alpha=0.3)\n",
485 "plt.show()"
486 ]
487 },
488 {
489 "cell_type": "markdown",
490 "id": "cell-21",
491 "metadata": {},
492 "source": [
"### 3.3 Feature Importance"
494 ]
495 },
496 {
497 "cell_type": "code",
498 "execution_count": null,
499 "id": "cell-22",
500 "metadata": {},
501 "outputs": [],
502 "source": [
"# LightGBM feature importance\n",
504 "importance_lgb = pd.DataFrame({\n",
505 " 'Feature': cancer.feature_names,\n",
506 " 'Importance': lgb_clf.feature_importances_\n",
507 "}).sort_values('Importance', ascending=True).tail(15)\n",
508 "\n",
509 "plt.figure(figsize=(10, 8))\n",
510 "plt.barh(importance_lgb['Feature'], importance_lgb['Importance'])\n",
511 "plt.xlabel('Importance')\n",
512 "plt.title('LightGBM Feature Importance - Top 15')\n",
513 "plt.grid(True, alpha=0.3)\n",
514 "plt.tight_layout()\n",
515 "plt.show()"
516 ]
517 },
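  {
   "cell_type": "markdown",
   "id": "cell-22-perm-importance-md",
   "metadata": {},
   "source": [
    "Split-count and gain importances can favor high-cardinality features. As a model-agnostic cross-check, `sklearn.inspection.permutation_importance` measures the accuracy drop when each feature is shuffled on held-out data. This sketch reuses the fitted `lgb_clf` and the earlier train/test split; `n_repeats=10` is an illustrative setting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-22-perm-importance",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Permutation importance: accuracy drop per shuffled feature, on held-out data\n",
    "from sklearn.inspection import permutation_importance\n",
    "\n",
    "perm = permutation_importance(lgb_clf, X_test, y_test, n_repeats=10, random_state=42)\n",
    "top5 = np.argsort(perm.importances_mean)[::-1][:5]\n",
    "for i in top5:\n",
    "    print(f\"{cancer.feature_names[i]:<25} {perm.importances_mean[i]:.4f} +/- {perm.importances_std[i]:.4f}\")"
   ]
  },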
518 {
519 "cell_type": "markdown",
520 "id": "cell-23",
521 "metadata": {},
522 "source": [
"## 4. CatBoost Overview"
524 ]
525 },
526 {
527 "cell_type": "code",
528 "execution_count": null,
529 "id": "cell-24",
530 "metadata": {},
531 "outputs": [],
532 "source": [
"print(\"\"\"\n",
"CatBoost features:\n",
"\n",
"1. Native handling of categorical features:\n",
"   - Applies target encoding automatically\n",
"   - Ordered target statistics prevent target leakage\n",
"\n",
"2. Ordered boosting:\n",
"   - Randomizes the training order to reduce prediction bias\n",
"   - Helps prevent overfitting\n",
"\n",
"3. Symmetric (oblivious) trees:\n",
"   - All nodes at the same depth share one split condition\n",
"   - Faster inference\n",
"\n",
"Install: pip install catboost\n",
"\n",
"Basic usage:\n",
"from catboost import CatBoostClassifier\n",
"\n",
"cat_clf = CatBoostClassifier(\n",
"    iterations=100,\n",
"    learning_rate=0.1,\n",
"    depth=6,\n",
"    l2_leaf_reg=3,\n",
"    random_state=42,\n",
"    verbose=False\n",
")\n",
"\n",
"cat_clf.fit(X_train, y_train)\n",
"\"\"\")"
564 ]
565 },
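  {
   "cell_type": "markdown",
   "id": "cell-24-catboost-run-md",
   "metadata": {},
   "source": [
    "The CatBoost snippet above is only printed. The guarded cell below actually runs it when `catboost` is installed and skips gracefully otherwise; `random_state` is accepted as CatBoost's sklearn-style alias for `random_seed`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-24-catboost-run",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run CatBoost on the same split if the package is available\n",
    "try:\n",
    "    from catboost import CatBoostClassifier\n",
    "\n",
    "    cat_clf = CatBoostClassifier(\n",
    "        iterations=100,\n",
    "        learning_rate=0.1,\n",
    "        depth=6,\n",
    "        l2_leaf_reg=3,\n",
    "        random_state=42,\n",
    "        verbose=False\n",
    "    )\n",
    "    cat_clf.fit(X_train, y_train)\n",
    "    print(f\"CatBoost test accuracy: {cat_clf.score(X_test, y_test):.4f}\")\n",
    "except ImportError:\n",
    "    print(\"catboost is not installed; run `pip install catboost` to try this cell\")"
   ]
  },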
566 {
567 "cell_type": "markdown",
568 "id": "cell-25",
569 "metadata": {},
570 "source": [
"## 5. Boosting Algorithm Comparison"
572 ]
573 },
574 {
575 "cell_type": "code",
576 "execution_count": null,
577 "id": "cell-26",
578 "metadata": {},
579 "outputs": [],
580 "source": [
581 "from sklearn.ensemble import GradientBoostingClassifier\n",
582 "\n",
"# Define models\n",
584 "models = {\n",
585 " 'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),\n",
586 " 'XGBoost': XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'),\n",
587 " 'LightGBM': LGBMClassifier(n_estimators=100, random_state=42, verbose=-1)\n",
588 "}\n",
589 "\n",
"# Compare\n",
"print(\"Boosting algorithm comparison:\")\n",
"print(\"-\" * 70)\n",
"print(f\"{'Model':<20} {'Train Acc':>15} {'Test Acc':>15} {'Time (s)':>15}\")\n",
"print(\"-\" * 70)\n",
595 "\n",
596 "results = {}\n",
597 "for name, model in models.items():\n",
598 " start_time = time.time()\n",
599 " model.fit(X_train, y_train)\n",
600 " train_time = time.time() - start_time\n",
601 " \n",
602 " train_acc = model.score(X_train, y_train)\n",
603 " test_acc = model.score(X_test, y_test)\n",
604 " \n",
605 " results[name] = {\n",
606 " 'train_accuracy': train_acc,\n",
607 " 'test_accuracy': test_acc,\n",
608 " 'time': train_time\n",
609 " }\n",
610 " \n",
611 " print(f\"{name:<20} {train_acc:>15.4f} {test_acc:>15.4f} {train_time:>15.4f}\")\n",
612 "\n",
613 "print(\"-\" * 70)"
614 ]
615 },
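  {
   "cell_type": "markdown",
   "id": "cell-26-cv-compare-md",
   "metadata": {},
   "source": [
    "A single train/test split gives one noisy number per model. The sketch below reuses the `models` dict and the already-imported `cross_val_score` to add a mean and spread; 5 folds is an illustrative choice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-26-cv-compare",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 5-fold cross-validation: mean accuracy with spread for each booster\n",
    "for name, model in models.items():\n",
    "    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')\n",
    "    print(f\"{name:<20} CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}\")"
   ]
  },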
616 {
617 "cell_type": "code",
618 "execution_count": null,
619 "id": "cell-27",
620 "metadata": {},
621 "outputs": [],
622 "source": [
"# Visual comparison\n",
624 "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
625 "\n",
"# Accuracy comparison\n",
627 "names = list(results.keys())\n",
628 "test_accuracies = [results[n]['test_accuracy'] for n in names]\n",
629 "axes[0].barh(names, test_accuracies, color=['skyblue', 'salmon', 'lightgreen'])\n",
630 "axes[0].set_xlabel('Test Accuracy')\n",
631 "axes[0].set_title('Accuracy Comparison')\n",
632 "axes[0].set_xlim([0.9, 1.0])\n",
633 "axes[0].grid(True, alpha=0.3)\n",
634 "\n",
"# Training time comparison\n",
636 "times = [results[n]['time'] for n in names]\n",
637 "axes[1].barh(names, times, color=['skyblue', 'salmon', 'lightgreen'])\n",
638 "axes[1].set_xlabel('Training Time (seconds)')\n",
639 "axes[1].set_title('Training Time Comparison')\n",
640 "axes[1].grid(True, alpha=0.3)\n",
641 "\n",
642 "plt.tight_layout()\n",
643 "plt.show()"
644 ]
645 },
646 {
647 "cell_type": "markdown",
648 "id": "cell-28",
649 "metadata": {},
650 "source": [
"## 6. Regression Example"
652 ]
653 },
654 {
655 "cell_type": "code",
656 "execution_count": null,
657 "id": "cell-29",
658 "metadata": {},
659 "outputs": [],
660 "source": [
"# California Housing dataset\n",
662 "housing = fetch_california_housing()\n",
663 "X_reg, y_reg = housing.data, housing.target\n",
664 "\n",
665 "X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(\n",
666 " X_reg, y_reg, test_size=0.2, random_state=42\n",
667 ")\n",
668 "\n",
"print(f\"Data shape: {X_reg.shape}\")\n",
"print(f\"Features: {housing.feature_names}\")\n",
"print(\"Target: median house value (in $100,000s)\")"
672 ]
673 },
674 {
675 "cell_type": "code",
676 "execution_count": null,
677 "id": "cell-30",
678 "metadata": {},
679 "outputs": [],
680 "source": [
"# XGBoost regression\n",
682 "xgb_reg = XGBRegressor(\n",
683 " n_estimators=100,\n",
684 " learning_rate=0.1,\n",
685 " max_depth=6,\n",
686 " random_state=42\n",
687 ")\n",
688 "xgb_reg.fit(X_train_reg, y_train_reg)\n",
689 "y_pred_xgb_reg = xgb_reg.predict(X_test_reg)\n",
690 "\n",
"# LightGBM regression\n",
692 "lgb_reg = LGBMRegressor(\n",
693 " n_estimators=100,\n",
694 " learning_rate=0.1,\n",
695 " num_leaves=31,\n",
696 " random_state=42,\n",
697 " verbose=-1\n",
698 ")\n",
699 "lgb_reg.fit(X_train_reg, y_train_reg)\n",
700 "y_pred_lgb_reg = lgb_reg.predict(X_test_reg)\n",
701 "\n",
"# Evaluate\n",
"print(\"=== XGBoost Regression ===\")\n",
"print(f\"R² Score: {r2_score(y_test_reg, y_pred_xgb_reg):.4f}\")\n",
"print(f\"RMSE: {np.sqrt(mean_squared_error(y_test_reg, y_pred_xgb_reg)):.4f}\")\n",
"\n",
"print(\"\\n=== LightGBM Regression ===\")\n",
"print(f\"R² Score: {r2_score(y_test_reg, y_pred_lgb_reg):.4f}\")\n",
"print(f\"RMSE: {np.sqrt(mean_squared_error(y_test_reg, y_pred_lgb_reg)):.4f}\")"
710 ]
711 },
712 {
713 "cell_type": "code",
714 "execution_count": null,
715 "id": "cell-31",
716 "metadata": {},
717 "outputs": [],
718 "source": [
"# Predicted vs. actual\n",
720 "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
721 "\n",
722 "# XGBoost\n",
723 "axes[0].scatter(y_test_reg, y_pred_xgb_reg, alpha=0.5)\n",
724 "axes[0].plot([y_test_reg.min(), y_test_reg.max()], \n",
725 " [y_test_reg.min(), y_test_reg.max()], 'r--', lw=2)\n",
726 "axes[0].set_xlabel('Actual')\n",
727 "axes[0].set_ylabel('Predicted')\n",
"axes[0].set_title(f'XGBoost (R²={r2_score(y_test_reg, y_pred_xgb_reg):.4f})')\n",
729 "axes[0].grid(True, alpha=0.3)\n",
730 "\n",
731 "# LightGBM\n",
732 "axes[1].scatter(y_test_reg, y_pred_lgb_reg, alpha=0.5, color='green')\n",
733 "axes[1].plot([y_test_reg.min(), y_test_reg.max()], \n",
734 " [y_test_reg.min(), y_test_reg.max()], 'r--', lw=2)\n",
735 "axes[1].set_xlabel('Actual')\n",
736 "axes[1].set_ylabel('Predicted')\n",
"axes[1].set_title(f'LightGBM (R²={r2_score(y_test_reg, y_pred_lgb_reg):.4f})')\n",
738 "axes[1].grid(True, alpha=0.3)\n",
739 "\n",
740 "plt.tight_layout()\n",
741 "plt.show()"
742 ]
743 },
744 {
745 "cell_type": "markdown",
746 "id": "cell-32",
747 "metadata": {},
748 "source": [
"## 7. Hyperparameter Guide"
750 ]
751 },
752 {
753 "cell_type": "code",
754 "execution_count": null,
755 "id": "cell-33",
756 "metadata": {},
757 "outputs": [],
758 "source": [
"# Hyperparameter comparison table\n",
"params_comparison = pd.DataFrame({\n",
"    'Parameter': ['Learning rate', 'Number of trees', 'Depth', 'Number of leaves', 'L1 regularization', 'L2 regularization', 'Row sampling', 'Column sampling'],\n",
"    'XGBoost': ['learning_rate', 'n_estimators', 'max_depth', '-', 'reg_alpha', 'reg_lambda', 'subsample', 'colsample_bytree'],\n",
"    'LightGBM': ['learning_rate', 'n_estimators', 'max_depth', 'num_leaves', 'reg_alpha', 'reg_lambda', 'subsample', 'colsample_bytree'],\n",
"    'Effect': ['Lower is more stable', 'More is more accurate', 'Deeper is more complex', 'More is more complex', 'Curbs overfitting', 'Curbs overfitting', 'Reduces variance', 'Adds diversity']\n",
"})\n",
"\n",
"print(\"Hyperparameter guide:\")\n",
"print(params_comparison.to_string(index=False))"
769 ]
770 },
771 {
772 "cell_type": "code",
773 "execution_count": null,
774 "id": "cell-34",
775 "metadata": {},
776 "outputs": [],
777 "source": [
"print(\"\"\"\n",
"Recommended tuning order:\n",
"\n",
"1. Tree-structure parameters:\n",
"   - max_depth, num_leaves\n",
"   - min_child_weight, min_child_samples\n",
"\n",
"2. Sampling parameters:\n",
"   - subsample\n",
"   - colsample_bytree\n",
"\n",
"3. Regularization parameters:\n",
"   - reg_alpha, reg_lambda\n",
"\n",
"4. Learning-rate adjustment:\n",
"   - lower learning_rate\n",
"   - increase n_estimators\n",
"\n",
"Overfitting-prevention strategies:\n",
"- Early stopping (early_stopping_rounds)\n",
"- Regularization (reg_alpha, reg_lambda)\n",
"- Sampling (subsample, colsample_bytree)\n",
"- Tree limits (max_depth, min_child_weight)\n",
"- Lower learning rate (learning_rate)\n",
"\"\"\")"
803 ]
804 },
805 {
806 "cell_type": "markdown",
807 "id": "cell-35",
808 "metadata": {},
809 "source": [
"## Summary\n",
"\n",
"### Algorithm Comparison\n",
"\n",
"| Algorithm | Characteristic | Strengths | Weaknesses |\n",
"|----------|------|------|------|\n",
"| Gradient Boosting | Sequential learning | High accuracy | Slow training |\n",
"| XGBoost | Regularization + parallelism | Fast, accurate | Memory usage |\n",
"| LightGBM | Leaf-wise growth | Very fast, scales to large data | Overfitting risk |\n",
"| CatBoost | Categorical handling | Less tuning needed | Slower training |\n",
"\n",
"### Selection Guide\n",
"\n",
"- **Small data (<10K)**: XGBoost or sklearn GradientBoosting\n",
"- **Medium data (10K-100K)**: XGBoost\n",
"- **Large data (>100K)**: LightGBM\n",
"- **Many categorical features**: CatBoost\n",
"- **Fast training required**: LightGBM\n",
"- **Best accuracy**: try them all, then ensemble\n",
"\n",
"### Key Hyperparameters\n",
"\n",
"**Common:**\n",
"- `n_estimators`: number of trees\n",
"- `learning_rate`: learning rate\n",
"- `max_depth`: tree depth\n",
"\n",
"**XGBoost-specific:**\n",
"- `min_child_weight`: minimum leaf weight\n",
"- `gamma`: minimum loss reduction to split\n",
"\n",
"**LightGBM-specific:**\n",
"- `num_leaves`: maximum number of leaves\n",
"- `min_child_samples`: minimum samples per leaf\n",
"\n",
"### Next Steps\n",
"- Stacking and blending\n",
"- AutoML (Optuna, Hyperopt)\n",
"- Enter a real Kaggle competition"
849 ]
850 }
851 ],
852 "metadata": {
853 "kernelspec": {
854 "display_name": "Python 3",
855 "language": "python",
856 "name": "python3"
857 },
858 "language_info": {
859 "codemirror_mode": {
860 "name": "ipython",
861 "version": 3
862 },
863 "file_extension": ".py",
864 "mimetype": "text/x-python",
865 "name": "python",
866 "nbconvert_exporter": "python",
867 "pygments_lexer": "ipython3",
868 "version": "3.8.0"
869 }
870 },
871 "nbformat": 4,
872 "nbformat_minor": 5
873}