{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-0",
   "metadata": {},
   "source": [
    "# 10. k-Nearest Neighbors (kNN) and Naive Bayes\n",
    "\n",
    "## Learning Objectives\n",
    "- Understand kNN's distance-based classification principle\n",
    "- Learn the main distance metrics (Euclidean, Manhattan, Minkowski)\n",
    "- Master methods for selecting the optimal k\n",
    "- Understand Naive Bayes' probability-based classification\n",
    "- Compare Gaussian, Multinomial, and Bernoulli NB\n",
    "- Apply Naive Bayes to text classification"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import libraries\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor\n",
    "from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB\n",
    "from sklearn.datasets import (\n",
    "    load_iris, load_breast_cancer, load_diabetes, load_digits,\n",
    "    make_classification, fetch_20newsgroups\n",
    ")\n",
    "from sklearn.model_selection import train_test_split, cross_val_score\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.metrics import accuracy_score, classification_report, mean_squared_error, r2_score\n",
    "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n",
    "from scipy.spatial.distance import euclidean, cityblock, minkowski, chebyshev\n",
    "from time import time\n",
    "\n",
    "# Plot settings\n",
    "plt.rcParams['font.family'] = 'DejaVu Sans'\n",
    "plt.rcParams['axes.unicode_minus'] = False\n",
    "np.random.seed(42)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-2",
   "metadata": {},
   "source": [
    "## 1. k-Nearest Neighbors (kNN) Concepts\n",
    "\n",
    "kNN is a lazy learning algorithm.\n",
    "\n",
    "**How it works**:\n",
    "1. When a new data point arrives,\n",
    "2. find the k closest neighbors in the training data, and\n",
    "3. predict by majority vote (classification) or averaging (regression) over those k neighbors.\n",
    "\n",
    "**Characteristics**:\n",
    "- No model is built at training time (all data is simply stored)\n",
    "- Non-parametric (no assumptions about the data distribution)\n",
    "- Prediction is slow"
   ]
  },
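  {
   "cell_type": "markdown",
   "id": "cell-2a",
   "metadata": {},
   "source": [
    "The three steps above can be sketched in a few lines of NumPy. This is a minimal illustrative helper (the function `knn_predict` and the toy arrays are hypothetical, not part of scikit-learn; use `KNeighborsClassifier` in practice), assuming Euclidean distance and a simple majority vote:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-2b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def knn_predict(X_tr, y_tr, x_new, k=5):\n",
    "    # 1) distances from the new point to every training point\n",
    "    dists = np.sqrt(((X_tr - x_new) ** 2).sum(axis=1))\n",
    "    # 2) indices of the k nearest neighbors\n",
    "    nearest = np.argsort(dists)[:k]\n",
    "    # 3) majority vote over the neighbors' labels\n",
    "    values, counts = np.unique(y_tr[nearest], return_counts=True)\n",
    "    return values[np.argmax(counts)]\n",
    "\n",
    "# Tiny hypothetical demo: two clusters of two points each\n",
    "X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])\n",
    "y_demo = np.array([0, 0, 1, 1])\n",
    "print(knn_predict(X_demo, y_demo, np.array([4.8, 5.1]), k=3))  # -> 1"
   ]
  },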
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize kNN on 2D data\n",
    "X, y = make_classification(\n",
    "    n_samples=100, n_features=2, n_redundant=0,\n",
    "    n_informative=2, n_clusters_per_class=1, random_state=42\n",
    ")\n",
    "\n",
    "# Compare several values of k\n",
    "fig, axes = plt.subplots(1, 3, figsize=(15, 5))\n",
    "k_values = [1, 5, 15]\n",
    "\n",
    "for ax, k in zip(axes, k_values):\n",
    "    knn = KNeighborsClassifier(n_neighbors=k)\n",
    "    knn.fit(X, y)\n",
    "\n",
    "    # Decision boundary\n",
    "    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5\n",
    "    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5\n",
    "    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),\n",
    "                         np.linspace(y_min, y_max, 100))\n",
    "\n",
    "    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])\n",
    "    Z = Z.reshape(xx.shape)\n",
    "\n",
    "    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')\n",
    "    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', edgecolors='black')\n",
    "    ax.set_title(f'k = {k}\\nAccuracy = {knn.score(X, y):.3f}')\n",
    "    ax.set_xlabel('Feature 1')\n",
    "    ax.set_ylabel('Feature 2')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-4",
   "metadata": {},
   "source": [
    "## 2. Basic kNN Usage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load data\n",
    "iris = load_iris()\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    iris.data, iris.target, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "# kNN classifier\n",
    "knn = KNeighborsClassifier(\n",
    "    n_neighbors=5,        # k\n",
    "    weights='uniform',    # weighting: 'uniform' or 'distance'\n",
    "    algorithm='auto',     # search algorithm: 'auto', 'ball_tree', 'kd_tree', 'brute'\n",
    "    metric='minkowski',   # distance metric: 'euclidean', 'manhattan', 'minkowski'\n",
    "    p=2                   # Minkowski p (2=euclidean, 1=manhattan)\n",
    ")\n",
    "\n",
    "knn.fit(X_train, y_train)\n",
    "y_pred = knn.predict(X_test)\n",
    "\n",
    "print(\"kNN classification results:\")\n",
    "print(f\"  Accuracy: {accuracy_score(y_test, y_pred):.4f}\")\n",
    "print(\"\\nClassification report:\")\n",
    "print(classification_report(y_test, y_pred, target_names=iris.target_names))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-6",
   "metadata": {},
   "source": [
    "## 3. Distance Metrics\n",
    "\n",
    "Distance computation is the core of kNN.\n",
    "\n",
    "Major distance metrics:\n",
    "- **Euclidean (L2)**: d = √(Σ(xᵢ - yᵢ)²)\n",
    "- **Manhattan (L1)**: d = Σ|xᵢ - yᵢ|\n",
    "- **Minkowski**: d = (Σ|xᵢ - yᵢ|^p)^(1/p)\n",
    "- **Chebyshev (L∞)**: d = max|xᵢ - yᵢ|"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distance computation examples\n",
    "point1 = np.array([1, 2, 3])\n",
    "point2 = np.array([4, 5, 6])\n",
    "\n",
    "print(\"Distance examples:\")\n",
    "print(f\"  Point 1: {point1}\")\n",
    "print(f\"  Point 2: {point2}\")\n",
    "print()\n",
    "print(f\"  Euclidean distance: {euclidean(point1, point2):.4f}\")\n",
    "print(f\"  Manhattan distance: {cityblock(point1, point2):.4f}\")\n",
    "print(f\"  Minkowski (p=3):    {minkowski(point1, point2, p=3):.4f}\")\n",
    "print(f\"  Chebyshev distance: {chebyshev(point1, point2):.4f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare distance metrics\n",
    "metrics = ['euclidean', 'manhattan', 'chebyshev']\n",
    "\n",
    "print(\"Accuracy by distance metric (Iris):\")\n",
    "print(\"-\" * 40)\n",
    "for metric in metrics:\n",
    "    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)\n",
    "    knn.fit(X_train, y_train)\n",
    "    acc = knn.score(X_test, y_test)\n",
    "    print(f\"  {metric:12s}: {acc:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-9",
   "metadata": {},
   "source": [
    "## 4. Selecting the Optimal k\n",
    "\n",
    "If k is too small the model overfits; if it is too large, it underfits.\n",
    "We find the optimal k with cross-validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-10",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Performance as a function of k\n",
    "k_range = range(1, 31)\n",
    "train_scores = []\n",
    "test_scores = []\n",
    "\n",
    "for k in k_range:\n",
    "    knn = KNeighborsClassifier(n_neighbors=k)\n",
    "    knn.fit(X_train, y_train)\n",
    "    train_scores.append(knn.score(X_train, y_train))\n",
    "    test_scores.append(knn.score(X_test, y_test))\n",
    "\n",
    "# Visualization\n",
    "plt.figure(figsize=(10, 6))\n",
    "plt.plot(k_range, train_scores, 'o-', label='Train')\n",
    "plt.plot(k_range, test_scores, 's-', label='Test')\n",
    "plt.xlabel('k (Number of Neighbors)')\n",
    "plt.ylabel('Accuracy')\n",
    "plt.title('kNN: k vs Accuracy')\n",
    "plt.legend()\n",
    "plt.grid(True, alpha=0.3)\n",
    "plt.xticks(k_range[::2])\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# Find the best k\n",
    "best_k = k_range[np.argmax(test_scores)]\n",
    "print(f\"Best k: {best_k}\")\n",
    "print(f\"Best test accuracy: {max(test_scores):.4f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-11",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select k with cross-validation\n",
    "k_range = range(1, 31)\n",
    "cv_scores = []\n",
    "\n",
    "for k in k_range:\n",
    "    knn = KNeighborsClassifier(n_neighbors=k)\n",
    "    scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')\n",
    "    cv_scores.append(scores.mean())\n",
    "\n",
    "# Visualization\n",
    "plt.figure(figsize=(10, 6))\n",
    "plt.plot(k_range, cv_scores, 'o-', color='green')\n",
    "plt.xlabel('k')\n",
    "plt.ylabel('Cross-Validation Accuracy')\n",
    "plt.title('kNN: k Selection with 5-Fold Cross-Validation')\n",
    "plt.grid(True, alpha=0.3)\n",
    "plt.xticks(k_range[::2])\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "best_k_cv = k_range[np.argmax(cv_scores)]\n",
    "print(f\"Best k by cross-validation: {best_k_cv}\")\n",
    "print(f\"Best CV accuracy: {max(cv_scores):.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-12",
   "metadata": {},
   "source": [
    "## 5. Weighted kNN\n",
    "\n",
    "Neighbors' influence can be adjusted by their distance.\n",
    "\n",
    "- **uniform**: every neighbor gets the same weight\n",
    "- **distance**: closer neighbors get larger weight (weight = 1/distance)"
   ]
  },
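  {
   "cell_type": "markdown",
   "id": "cell-12a",
   "metadata": {},
   "source": [
    "The inverse-distance vote can be checked on a tiny hypothetical 1-D example: take the k neighbors with `kneighbors`, weight each neighbor's label by 1/distance, and compare the winning class against the model's own prediction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-12b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Tiny hypothetical check of weights='distance'\n",
    "X_w = np.array([[0.0], [1.0], [3.0], [4.0]])\n",
    "y_w = np.array([0, 0, 1, 1])\n",
    "x_q = np.array([[2.2]])\n",
    "\n",
    "knn_w = KNeighborsClassifier(n_neighbors=3, weights='distance')\n",
    "knn_w.fit(X_w, y_w)\n",
    "\n",
    "dists, idx = knn_w.kneighbors(x_q)  # the 3 nearest neighbors\n",
    "w = 1.0 / dists[0]                  # weight = 1/distance\n",
    "votes = {c: w[y_w[idx[0]] == c].sum() for c in np.unique(y_w)}\n",
    "print(\"votes:\", votes)\n",
    "print(\"manual winner :\", max(votes, key=votes.get))\n",
    "print(\"model predicts:\", knn_w.predict(x_q)[0])"
   ]
  },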
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-13",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare weighting schemes\n",
    "weights = ['uniform', 'distance']\n",
    "\n",
    "print(\"Weighting scheme comparison:\")\n",
    "print(\"-\" * 40)\n",
    "for weight in weights:\n",
    "    knn = KNeighborsClassifier(n_neighbors=5, weights=weight)\n",
    "    knn.fit(X_train, y_train)\n",
    "    acc = knn.score(X_test, y_test)\n",
    "    print(f\"  {weight:10s}: {acc:.4f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-14",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize distance-weighted kNN\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "for ax, weight in zip(axes, weights):\n",
    "    knn = KNeighborsClassifier(n_neighbors=15, weights=weight)\n",
    "    knn.fit(X[:, :2], y)\n",
    "\n",
    "    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5\n",
    "    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5\n",
    "    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),\n",
    "                         np.linspace(y_min, y_max, 100))\n",
    "\n",
    "    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])\n",
    "    Z = Z.reshape(xx.shape)\n",
    "\n",
    "    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')\n",
    "    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', edgecolors='black')\n",
    "    ax.set_title(f'weights = {weight}')\n",
    "    ax.set_xlabel('Feature 1')\n",
    "    ax.set_ylabel('Feature 2')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-15",
   "metadata": {},
   "source": [
    "## 6. kNN Regression\n",
    "\n",
    "kNN can also be used for regression problems:\n",
    "the prediction is the average of the k neighbors' target values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-16",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load data\n",
    "diabetes = load_diabetes()\n",
    "X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(\n",
    "    diabetes.data, diabetes.target, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "# Scaling (essential, since kNN is distance-based)\n",
    "scaler = StandardScaler()\n",
    "X_train_d_scaled = scaler.fit_transform(X_train_d)\n",
    "X_test_d_scaled = scaler.transform(X_test_d)\n",
    "\n",
    "# kNN regression\n",
    "knn_reg = KNeighborsRegressor(n_neighbors=5, weights='distance')\n",
    "knn_reg.fit(X_train_d_scaled, y_train_d)\n",
    "y_pred_d = knn_reg.predict(X_test_d_scaled)\n",
    "\n",
    "print(\"kNN regression results:\")\n",
    "print(f\"  MSE:  {mean_squared_error(y_test_d, y_pred_d):.4f}\")\n",
    "print(f\"  RMSE: {np.sqrt(mean_squared_error(y_test_d, y_pred_d)):.4f}\")\n",
    "print(f\"  R²:   {r2_score(y_test_d, y_pred_d):.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-17",
   "metadata": {},
   "source": [
    "## 7. kNN Search Algorithm Comparison\n",
    "\n",
    "On large datasets, the choice of neighbor-search algorithm matters.\n",
    "\n",
    "- **brute**: exhaustive search (O(n) per query)\n",
    "- **kd_tree**: KD-Tree (efficient in low dimensions)\n",
    "- **ball_tree**: Ball-Tree (efficient in higher dimensions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-18",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Timing by search algorithm\n",
    "algorithms = ['brute', 'kd_tree', 'ball_tree']\n",
    "\n",
    "print(\"Timing by search algorithm:\")\n",
    "print(\"-\" * 60)\n",
    "for algo in algorithms:\n",
    "    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo)\n",
    "\n",
    "    # Training time\n",
    "    start = time()\n",
    "    knn.fit(X_train, y_train)\n",
    "    fit_time = time() - start\n",
    "\n",
    "    # Prediction time\n",
    "    start = time()\n",
    "    knn.predict(X_test)\n",
    "    pred_time = time() - start\n",
    "\n",
    "    print(f\"  {algo:10s}: fit={fit_time:.4f}s, predict={pred_time:.4f}s\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-19",
   "metadata": {},
   "source": [
    "## 8. Naive Bayes\n",
    "\n",
    "### Bayes' Theorem\n",
    "\n",
    "**P(y|X) = P(X|y) × P(y) / P(X)**\n",
    "\n",
    "- P(y|X): posterior (probability of the class given the features)\n",
    "- P(X|y): likelihood (probability of the features given the class)\n",
    "- P(y): prior (base probability of the class)\n",
    "- P(X): evidence (probability of the features)\n",
    "\n",
    "### The Naive Assumption\n",
    "\n",
    "Assume all features are mutually independent:\n",
    "**P(X|y) = P(x₁|y) × P(x₂|y) × ... × P(xₙ|y)**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-20",
   "metadata": {},
   "source": [
    "## 9. Gaussian Naive Bayes\n",
    "\n",
    "Assumes each continuous feature follows a Gaussian (normal) distribution within each class:\n",
    "**P(xᵢ|y) = N(xᵢ; μ_y, σ_y)**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-21",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Gaussian Naive Bayes\n",
    "gnb = GaussianNB()\n",
    "gnb.fit(X_train, y_train)\n",
    "y_pred_nb = gnb.predict(X_test)\n",
    "\n",
    "print(\"Gaussian Naive Bayes results:\")\n",
    "print(f\"  Accuracy: {accuracy_score(y_test, y_pred_nb):.4f}\")\n",
    "\n",
    "# Inspect the learned parameters\n",
    "print(f\"\\nClass priors: {gnb.class_prior_}\")\n",
    "print(\"\\nPer-class means (first 2 features):\")\n",
    "print(gnb.theta_[:, :2])\n",
    "print(\"\\nPer-class variances (first 2 features):\")\n",
    "print(gnb.var_[:, :2])"
   ]
  },
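  {
   "cell_type": "markdown",
   "id": "cell-21a",
   "metadata": {},
   "source": [
    "As a sanity check on Bayes' theorem, the posterior P(y|x) can be recomputed by hand from the parameters `GaussianNB` just learned: sum the per-feature Gaussian log-likelihoods (the naive independence assumption), add the log prior, and normalize by the evidence P(X). A sketch, assuming `gnb` and `X_test` from the cells above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-21b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.special import logsumexp\n",
    "\n",
    "x = X_test[0]\n",
    "# Naive assumption: log P(X|y) = sum over features of log N(x_i; mu, sigma^2)\n",
    "log_lik = -0.5 * np.sum(\n",
    "    np.log(2 * np.pi * gnb.var_) + (x - gnb.theta_) ** 2 / gnb.var_, axis=1\n",
    ")\n",
    "log_joint = np.log(gnb.class_prior_) + log_lik  # add the log prior\n",
    "# Normalize by the evidence P(X) in log space\n",
    "posterior = np.exp(log_joint - logsumexp(log_joint))\n",
    "\n",
    "print(\"Manual posterior:\", np.round(posterior, 4))\n",
    "print(\"predict_proba   :\", np.round(gnb.predict_proba(X_test[:1])[0], 4))"
   ]
  },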
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-22",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Probability predictions\n",
    "y_proba = gnb.predict_proba(X_test[:5])\n",
    "\n",
    "print(\"Probability predictions (first 5 samples):\")\n",
    "print(f\"Classes: {iris.target_names}\")\n",
    "print(y_proba)\n",
    "print(f\"\\nPredicted classes: {gnb.predict(X_test[:5])}\")\n",
    "print(f\"True classes:      {y_test[:5]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-23",
   "metadata": {},
   "source": [
    "## 10. Multinomial Naive Bayes - Text Classification\n",
    "\n",
    "Works on discrete count features and is mainly used for text classification (word counts).\n",
    "\n",
    "**P(xᵢ|y) = (N_yi + α) / (N_y + αn)**\n",
    "\n",
    "- α: Laplace smoothing parameter (solves the zero-frequency problem)"
   ]
  },
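  {
   "cell_type": "markdown",
   "id": "cell-23a",
   "metadata": {},
   "source": [
    "Before applying it to real text, the smoothing formula can be verified on a tiny hypothetical count matrix: with α = 1, the log of (N_yi + α) / (N_y + αn) computed directly should match `MultinomialNB.feature_log_prob_`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-23b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical toy counts: 3 documents, 3 \"words\", 2 classes\n",
    "X_toy = np.array([[2, 1, 0],\n",
    "                  [3, 0, 1],\n",
    "                  [0, 2, 4]])\n",
    "y_toy = np.array([0, 0, 1])\n",
    "alpha = 1.0\n",
    "mnb_toy = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)\n",
    "\n",
    "# P(x_i|y) = (N_yi + alpha) / (N_y + alpha * n), n = number of features\n",
    "counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in (0, 1)])\n",
    "manual = np.log((counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * 3))\n",
    "print(np.allclose(manual, mnb_toy.feature_log_prob_))  # -> True"
   ]
  },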
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-24",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the news data\n",
    "categories = ['sci.space', 'rec.sport.baseball', 'talk.politics.misc']\n",
    "newsgroups = fetch_20newsgroups(\n",
    "    subset='train',\n",
    "    categories=categories,\n",
    "    remove=('headers', 'footers', 'quotes'),\n",
    "    random_state=42\n",
    ")\n",
    "\n",
    "print(f\"News data: {len(newsgroups.data)} articles\")\n",
    "print(f\"Categories: {categories}\")\n",
    "print(f\"\\nFirst article (excerpt):\\n{newsgroups.data[0][:200]}...\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-25",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Text vectorization\n",
    "vectorizer = CountVectorizer(max_features=5000, stop_words='english')\n",
    "X_news = vectorizer.fit_transform(newsgroups.data)\n",
    "y_news = newsgroups.target\n",
    "\n",
    "print(f\"Matrix shape: {X_news.shape}\")\n",
    "print(f\"Number of features: {len(vectorizer.get_feature_names_out())}\")\n",
    "\n",
    "# Train/test split\n",
    "X_train_news, X_test_news, y_train_news, y_test_news = train_test_split(\n",
    "    X_news, y_news, test_size=0.2, random_state=42\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-26",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Multinomial Naive Bayes\n",
    "mnb = MultinomialNB(alpha=1.0)  # alpha: Laplace smoothing\n",
    "mnb.fit(X_train_news, y_train_news)\n",
    "\n",
    "y_pred_news = mnb.predict(X_test_news)\n",
    "\n",
    "print(\"Multinomial Naive Bayes (text classification) results:\")\n",
    "print(f\"  Accuracy: {mnb.score(X_test_news, y_test_news):.4f}\")\n",
    "print(\"\\nClassification report:\")\n",
    "# Note: newsgroups.target_names is sorted alphabetically, which may differ\n",
    "# from the order of the `categories` list above\n",
    "print(classification_report(y_test_news, y_pred_news, target_names=newsgroups.target_names))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-27",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Most indicative words for each class\n",
    "feature_names = vectorizer.get_feature_names_out()\n",
    "\n",
    "print(\"Top 10 words per class:\")\n",
    "print(\"=\" * 60)\n",
    "for i, category in enumerate(newsgroups.target_names):\n",
    "    top_indices = mnb.feature_log_prob_[i].argsort()[-10:][::-1]\n",
    "    top_words = [feature_names[idx] for idx in top_indices]\n",
    "    print(f\"\\n{category}:\")\n",
    "    print(f\"  {', '.join(top_words)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-28",
   "metadata": {},
   "source": [
    "## 11. Bernoulli Naive Bayes\n",
    "\n",
    "Uses binary features (0/1); it classifies text by word presence/absence rather than counts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-29",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Binary vectorization (word presence only)\n",
    "binary_vectorizer = CountVectorizer(max_features=5000, binary=True, stop_words='english')\n",
    "X_binary = binary_vectorizer.fit_transform(newsgroups.data)\n",
    "\n",
    "X_train_bin, X_test_bin, y_train_bin, y_test_bin = train_test_split(\n",
    "    X_binary, y_news, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "# Bernoulli Naive Bayes\n",
    "bnb = BernoulliNB(alpha=1.0)\n",
    "bnb.fit(X_train_bin, y_train_bin)\n",
    "\n",
    "print(\"Bernoulli Naive Bayes results:\")\n",
    "print(f\"  Accuracy: {bnb.score(X_test_bin, y_test_bin):.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-30",
   "metadata": {},
   "source": [
    "## 12. Naive Bayes Model Comparison"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-31",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Digit image data\n",
    "digits = load_digits()\n",
    "X_train_dig, X_test_dig, y_train_dig, y_test_dig = train_test_split(\n",
    "    digits.data, digits.target, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "# Compare the three Naive Bayes variants\n",
    "models = {\n",
    "    'Gaussian NB': GaussianNB(),\n",
    "    'Multinomial NB': MultinomialNB(),\n",
    "    'Bernoulli NB': BernoulliNB()\n",
    "}\n",
    "\n",
    "print(\"Naive Bayes model comparison (Digits):\")\n",
    "print(\"-\" * 50)\n",
    "for name, model in models.items():\n",
    "    model.fit(X_train_dig, y_train_dig)\n",
    "    acc = model.score(X_test_dig, y_test_dig)\n",
    "    print(f\"  {name:18s}: {acc:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-32",
   "metadata": {},
   "source": [
    "## 13. Online Learning (Incremental Learning)\n",
    "\n",
    "Naive Bayes supports online learning via `partial_fit`,\n",
    "which is useful for large-scale or streaming data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-33",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Online learning simulation\n",
    "gnb_online = GaussianNB()\n",
    "\n",
    "# Mini-batch training\n",
    "batch_size = 50\n",
    "n_batches = len(X_train) // batch_size\n",
    "\n",
    "for i in range(n_batches):\n",
    "    start = i * batch_size\n",
    "    end = start + batch_size\n",
    "    X_batch = X_train[start:end]\n",
    "    y_batch = y_train[start:end]\n",
    "\n",
    "    # Declare the full set of classes on the first batch\n",
    "    if i == 0:\n",
    "        gnb_online.partial_fit(X_batch, y_batch, classes=np.unique(y_train))\n",
    "    else:\n",
    "        gnb_online.partial_fit(X_batch, y_batch)\n",
    "\n",
    "print(\"Online learning results:\")\n",
    "print(f\"  Number of batches: {n_batches}\")\n",
    "print(f\"  Batch size: {batch_size}\")\n",
    "print(f\"  Accuracy: {gnb_online.score(X_test, y_test):.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-34",
   "metadata": {},
   "source": [
    "## 14. kNN vs Naive Bayes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-35",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Breast cancer data\n",
    "cancer = load_breast_cancer()\n",
    "X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(\n",
    "    cancer.data, cancer.target, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "# Scaling\n",
    "scaler = StandardScaler()\n",
    "X_train_c_scaled = scaler.fit_transform(X_train_c)\n",
    "X_test_c_scaled = scaler.transform(X_test_c)\n",
    "\n",
    "# Model comparison\n",
    "models = {\n",
    "    'kNN (k=5)': KNeighborsClassifier(n_neighbors=5),\n",
    "    'kNN (weighted)': KNeighborsClassifier(n_neighbors=5, weights='distance'),\n",
    "    'Gaussian NB': GaussianNB()\n",
    "}\n",
    "\n",
    "print(\"kNN vs Naive Bayes comparison (Breast Cancer):\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "for name, model in models.items():\n",
    "    if 'kNN' in name:\n",
    "        model.fit(X_train_c_scaled, y_train_c)\n",
    "        acc = model.score(X_test_c_scaled, y_test_c)\n",
    "    else:\n",
    "        model.fit(X_train_c, y_train_c)\n",
    "        acc = model.score(X_test_c, y_test_c)\n",
    "    print(f\"  {name:18s}: {acc:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-36",
   "metadata": {},
   "source": [
    "## 15. A Simple Text Classification Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-37",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Simple sentiment classification\n",
    "texts = [\n",
    "    \"I love this movie\", \"Great film\", \"Excellent acting\",\n",
    "    \"Amazing performance\", \"Wonderful story\",\n",
    "    \"Terrible movie\", \"Bad film\", \"Worst movie ever\",\n",
    "    \"Horrible acting\", \"Disappointing story\"\n",
    "]\n",
    "labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 1: positive, 0: negative\n",
    "\n",
    "# TF-IDF vectorization\n",
    "tfidf = TfidfVectorizer()\n",
    "X_sentiment = tfidf.fit_transform(texts)\n",
    "\n",
    "# Train Naive Bayes\n",
    "mnb_sentiment = MultinomialNB()\n",
    "mnb_sentiment.fit(X_sentiment, labels)\n",
    "\n",
    "# Classify new texts\n",
    "new_texts = [\n",
    "    \"This is a great movie\",\n",
    "    \"I hate this film\",\n",
    "    \"Excellent performance and story\",\n",
    "    \"Terrible and disappointing\"\n",
    "]\n",
    "X_new = tfidf.transform(new_texts)\n",
    "predictions = mnb_sentiment.predict(X_new)\n",
    "probabilities = mnb_sentiment.predict_proba(X_new)\n",
    "\n",
    "print(\"Sentiment classification results:\")\n",
    "print(\"=\" * 60)\n",
    "for text, pred, prob in zip(new_texts, predictions, probabilities):\n",
    "    sentiment = \"Positive\" if pred == 1 else \"Negative\"\n",
    "    confidence = max(prob) * 100\n",
    "    print(f\"'{text}'\")\n",
    "    print(f\"  -> {sentiment} (confidence: {confidence:.1f}%)\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-38",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "### kNN Summary\n",
    "\n",
    "| Parameter | Description | Recommendation |\n",
    "|-----------|-------------|----------------|\n",
    "| **n_neighbors** | number of neighbors (k) | select via cross-validation |\n",
    "| **weights** | weighting scheme | 'distance' recommended |\n",
    "| **metric** | distance metric | 'euclidean' by default |\n",
    "| **algorithm** | neighbor search algorithm | 'auto' |\n",
    "\n",
    "**Characteristics**:\n",
    "- Lazy learning (no training time)\n",
    "- Slow prediction (O(n·d))\n",
    "- Requires feature scaling\n",
    "- Degrades in high dimensions (curse of dimensionality)\n",
    "\n",
    "### Naive Bayes Summary\n",
    "\n",
    "| Variant | Feature type | Main use |\n",
    "|---------|--------------|----------|\n",
    "| **GaussianNB** | continuous (normal) | general classification |\n",
    "| **MultinomialNB** | counts/frequencies | text classification |\n",
    "| **BernoulliNB** | binary (0/1) | word presence/absence |\n",
    "\n",
    "**Characteristics**:\n",
    "- Very fast (train O(n·d), predict O(d))\n",
    "- Works well even with little data\n",
    "- Effective on high-dimensional data\n",
    "- Supports online learning\n",
    "- Assumes feature independence (often violated in practice)\n",
    "\n",
    "### kNN vs Naive Bayes\n",
    "\n",
    "| Property | kNN | Naive Bayes |\n",
    "|----------|-----|-------------|\n",
    "| **Training time** | O(1) | O(n·d) |\n",
    "| **Prediction time** | O(n·d) | O(d) |\n",
    "| **Memory** | high | low |\n",
    "| **Scaling** | required | not needed |\n",
    "| **High dimensions** | weak | strong |\n",
    "| **Interpretability** | intuitive | probabilistic |\n",
    "\n",
    "### Next Steps\n",
    "- Clustering (K-Means, DBSCAN)\n",
    "- Dimensionality Reduction (PCA, t-SNE)\n",
    "- Ensemble methods (Stacking, Voting)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}