{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-0",
   "metadata": {},
   "source": [
    "# 10. k-Nearest Neighbors (kNN) and Naive Bayes\n",
    "\n",
    "## Learning Objectives\n",
    "- Understand kNN's distance-based classification principle\n",
    "- Learn the main distance metrics (Euclidean, Manhattan, Minkowski)\n",
    "- Master methods for selecting the optimal k\n",
    "- Understand Naive Bayes' probability-based classification\n",
    "- Compare Gaussian, Multinomial, and Bernoulli NB\n",
    "- Apply Naive Bayes to text classification"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import libraries\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor\n",
    "from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB\n",
    "from sklearn.datasets import (\n",
    "    load_iris, load_breast_cancer, load_diabetes, load_digits,\n",
    "    make_classification, fetch_20newsgroups\n",
    ")\n",
    "from sklearn.model_selection import train_test_split, cross_val_score\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.metrics import accuracy_score, classification_report, mean_squared_error, r2_score\n",
    "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n",
    "from scipy.spatial.distance import euclidean, cityblock, minkowski, chebyshev\n",
    "from time import time\n",
    "\n",
    "# Plot settings\n",
    "plt.rcParams['font.family'] = 'DejaVu Sans'\n",
    "plt.rcParams['axes.unicode_minus'] = False\n",
    "np.random.seed(42)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-2",
   "metadata": {},
   "source": [
    "## 1. k-Nearest Neighbors (kNN) Concepts\n",
    "\n",
    "kNN is a lazy learning algorithm.\n",
    "\n",
    "**How it works**:\n",
    "1. When a new data point arrives,\n",
    "2. find the k closest neighbors in the training data, and\n",
    "3. predict by majority vote (classification) or averaging (regression) over those k neighbors.\n",
    "\n",
    "**Characteristics**:\n",
    "- No model is built at training time (all data is simply stored)\n",
    "- Non-parametric (no assumptions about the data distribution)\n",
    "- Prediction is slow"
   ]
  },
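  {
   "cell_type": "markdown",
   "id": "cell-2a",
   "metadata": {},
   "source": [
    "The three steps above can be sketched in a few lines of NumPy. This is a minimal illustrative helper (the function `knn_predict` and the toy arrays are hypothetical, not part of scikit-learn; use `KNeighborsClassifier` in practice), assuming Euclidean distance and a simple majority vote:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-2b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def knn_predict(X_tr, y_tr, x_new, k=5):\n",
    "    # 1) distances from the new point to every training point\n",
    "    dists = np.sqrt(((X_tr - x_new) ** 2).sum(axis=1))\n",
    "    # 2) indices of the k nearest neighbors\n",
    "    nearest = np.argsort(dists)[:k]\n",
    "    # 3) majority vote over the neighbors' labels\n",
    "    values, counts = np.unique(y_tr[nearest], return_counts=True)\n",
    "    return values[np.argmax(counts)]\n",
    "\n",
    "# Tiny hypothetical demo: two clusters of two points each\n",
    "X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])\n",
    "y_demo = np.array([0, 0, 1, 1])\n",
    "print(knn_predict(X_demo, y_demo, np.array([4.8, 5.1]), k=3))  # -> 1"
   ]
  },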
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize kNN on 2D data\n",
    "X, y = make_classification(\n",
    "    n_samples=100, n_features=2, n_redundant=0,\n",
    "    n_informative=2, n_clusters_per_class=1, random_state=42\n",
    ")\n",
    "\n",
    "# Compare several values of k\n",
    "fig, axes = plt.subplots(1, 3, figsize=(15, 5))\n",
    "k_values = [1, 5, 15]\n",
    "\n",
    "for ax, k in zip(axes, k_values):\n",
    "    knn = KNeighborsClassifier(n_neighbors=k)\n",
    "    knn.fit(X, y)\n",
    "\n",
    "    # Decision boundary\n",
    "    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5\n",
    "    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5\n",
    "    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),\n",
    "                         np.linspace(y_min, y_max, 100))\n",
    "\n",
    "    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])\n",
    "    Z = Z.reshape(xx.shape)\n",
    "\n",
    "    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')\n",
    "    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', edgecolors='black')\n",
    "    ax.set_title(f'k = {k}\\nAccuracy = {knn.score(X, y):.3f}')\n",
    "    ax.set_xlabel('Feature 1')\n",
    "    ax.set_ylabel('Feature 2')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-4",
   "metadata": {},
   "source": [
    "## 2. Basic kNN Usage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load data\n",
    "iris = load_iris()\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    iris.data, iris.target, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "# kNN classifier\n",
    "knn = KNeighborsClassifier(\n",
    "    n_neighbors=5,        # k\n",
    "    weights='uniform',    # weighting: 'uniform' or 'distance'\n",
    "    algorithm='auto',     # search algorithm: 'auto', 'ball_tree', 'kd_tree', 'brute'\n",
    "    metric='minkowski',   # distance metric: 'euclidean', 'manhattan', 'minkowski'\n",
    "    p=2                   # Minkowski p (2=euclidean, 1=manhattan)\n",
    ")\n",
    "\n",
    "knn.fit(X_train, y_train)\n",
    "y_pred = knn.predict(X_test)\n",
    "\n",
    "print(\"kNN classification results:\")\n",
    "print(f\"  Accuracy: {accuracy_score(y_test, y_pred):.4f}\")\n",
    "print(\"\\nClassification report:\")\n",
    "print(classification_report(y_test, y_pred, target_names=iris.target_names))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-6",
   "metadata": {},
   "source": [
    "## 3. Distance Metrics\n",
    "\n",
    "Distance computation is the core of kNN.\n",
    "\n",
    "Major distance metrics:\n",
    "- **Euclidean (L2)**: d = √(Σ(xᵢ - yᵢ)²)\n",
    "- **Manhattan (L1)**: d = Σ|xᵢ - yᵢ|\n",
    "- **Minkowski**: d = (Σ|xᵢ - yᵢ|^p)^(1/p)\n",
    "- **Chebyshev (L∞)**: d = max|xᵢ - yᵢ|"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distance computation examples\n",
    "point1 = np.array([1, 2, 3])\n",
    "point2 = np.array([4, 5, 6])\n",
    "\n",
    "print(\"Distance examples:\")\n",
    "print(f\"  Point 1: {point1}\")\n",
    "print(f\"  Point 2: {point2}\")\n",
    "print()\n",
    "print(f\"  Euclidean distance: {euclidean(point1, point2):.4f}\")\n",
    "print(f\"  Manhattan distance: {cityblock(point1, point2):.4f}\")\n",
    "print(f\"  Minkowski (p=3):    {minkowski(point1, point2, p=3):.4f}\")\n",
    "print(f\"  Chebyshev distance: {chebyshev(point1, point2):.4f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare distance metrics\n",
    "metrics = ['euclidean', 'manhattan', 'chebyshev']\n",
    "\n",
    "print(\"Accuracy by distance metric (Iris):\")\n",
    "print(\"-\" * 40)\n",
    "for metric in metrics:\n",
    "    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)\n",
    "    knn.fit(X_train, y_train)\n",
    "    acc = knn.score(X_test, y_test)\n",
    "    print(f\"  {metric:12s}: {acc:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-9",
   "metadata": {},
   "source": [
    "## 4. Selecting the Optimal k\n",
    "\n",
    "If k is too small the model overfits; if it is too large, it underfits.\n",
    "We find the optimal k with cross-validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-10",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Performance as a function of k\n",
    "k_range = range(1, 31)\n",
    "train_scores = []\n",
    "test_scores = []\n",
    "\n",
    "for k in k_range:\n",
    "    knn = KNeighborsClassifier(n_neighbors=k)\n",
    "    knn.fit(X_train, y_train)\n",
    "    train_scores.append(knn.score(X_train, y_train))\n",
    "    test_scores.append(knn.score(X_test, y_test))\n",
    "\n",
    "# Visualization\n",
    "plt.figure(figsize=(10, 6))\n",
    "plt.plot(k_range, train_scores, 'o-', label='Train')\n",
    "plt.plot(k_range, test_scores, 's-', label='Test')\n",
    "plt.xlabel('k (Number of Neighbors)')\n",
    "plt.ylabel('Accuracy')\n",
    "plt.title('kNN: k vs Accuracy')\n",
    "plt.legend()\n",
    "plt.grid(True, alpha=0.3)\n",
    "plt.xticks(k_range[::2])\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# Find the best k\n",
    "best_k = k_range[np.argmax(test_scores)]\n",
    "print(f\"Best k: {best_k}\")\n",
    "print(f\"Best test accuracy: {max(test_scores):.4f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-11",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select k with cross-validation\n",
    "k_range = range(1, 31)\n",
    "cv_scores = []\n",
    "\n",
    "for k in k_range:\n",
    "    knn = KNeighborsClassifier(n_neighbors=k)\n",
    "    scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')\n",
    "    cv_scores.append(scores.mean())\n",
    "\n",
    "# Visualization\n",
    "plt.figure(figsize=(10, 6))\n",
    "plt.plot(k_range, cv_scores, 'o-', color='green')\n",
    "plt.xlabel('k')\n",
    "plt.ylabel('Cross-Validation Accuracy')\n",
    "plt.title('kNN: k Selection with 5-Fold Cross-Validation')\n",
    "plt.grid(True, alpha=0.3)\n",
    "plt.xticks(k_range[::2])\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "best_k_cv = k_range[np.argmax(cv_scores)]\n",
    "print(f\"Best k by cross-validation: {best_k_cv}\")\n",
    "print(f\"Best CV accuracy: {max(cv_scores):.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-12",
   "metadata": {},
   "source": [
    "## 5. Weighted kNN\n",
    "\n",
    "Neighbors' influence can be adjusted by their distance.\n",
    "\n",
    "- **uniform**: every neighbor gets the same weight\n",
    "- **distance**: closer neighbors get larger weight (weight = 1/distance)"
   ]
  },
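  {
   "cell_type": "markdown",
   "id": "cell-12a",
   "metadata": {},
   "source": [
    "The inverse-distance vote can be checked on a tiny hypothetical 1-D example: take the k neighbors with `kneighbors`, weight each neighbor's label by 1/distance, and compare the winning class against the model's own prediction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-12b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Tiny hypothetical check of weights='distance'\n",
    "X_w = np.array([[0.0], [1.0], [3.0], [4.0]])\n",
    "y_w = np.array([0, 0, 1, 1])\n",
    "x_q = np.array([[2.2]])\n",
    "\n",
    "knn_w = KNeighborsClassifier(n_neighbors=3, weights='distance')\n",
    "knn_w.fit(X_w, y_w)\n",
    "\n",
    "dists, idx = knn_w.kneighbors(x_q)  # the 3 nearest neighbors\n",
    "w = 1.0 / dists[0]                  # weight = 1/distance\n",
    "votes = {c: w[y_w[idx[0]] == c].sum() for c in np.unique(y_w)}\n",
    "print(\"votes:\", votes)\n",
    "print(\"manual winner :\", max(votes, key=votes.get))\n",
    "print(\"model predicts:\", knn_w.predict(x_q)[0])"
   ]
  },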
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-13",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare weighting schemes\n",
    "weights = ['uniform', 'distance']\n",
    "\n",
    "print(\"Weighting scheme comparison:\")\n",
    "print(\"-\" * 40)\n",
    "for weight in weights:\n",
    "    knn = KNeighborsClassifier(n_neighbors=5, weights=weight)\n",
    "    knn.fit(X_train, y_train)\n",
    "    acc = knn.score(X_test, y_test)\n",
    "    print(f\"  {weight:10s}: {acc:.4f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-14",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize distance-weighted kNN\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "for ax, weight in zip(axes, weights):\n",
    "    knn = KNeighborsClassifier(n_neighbors=15, weights=weight)\n",
    "    knn.fit(X[:, :2], y)\n",
    "\n",
    "    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5\n",
    "    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5\n",
    "    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),\n",
    "                         np.linspace(y_min, y_max, 100))\n",
    "\n",
    "    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])\n",
    "    Z = Z.reshape(xx.shape)\n",
    "\n",
    "    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')\n",
    "    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', edgecolors='black')\n",
    "    ax.set_title(f'weights = {weight}')\n",
    "    ax.set_xlabel('Feature 1')\n",
    "    ax.set_ylabel('Feature 2')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-15",
   "metadata": {},
   "source": [
    "## 6. kNN Regression\n",
    "\n",
    "kNN can also be used for regression problems:\n",
    "the prediction is the average of the k neighbors' target values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-16",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load data\n",
    "diabetes = load_diabetes()\n",
    "X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(\n",
    "    diabetes.data, diabetes.target, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "# Scaling (essential, since kNN is distance-based)\n",
    "scaler = StandardScaler()\n",
    "X_train_d_scaled = scaler.fit_transform(X_train_d)\n",
    "X_test_d_scaled = scaler.transform(X_test_d)\n",
    "\n",
    "# kNN regression\n",
    "knn_reg = KNeighborsRegressor(n_neighbors=5, weights='distance')\n",
    "knn_reg.fit(X_train_d_scaled, y_train_d)\n",
    "y_pred_d = knn_reg.predict(X_test_d_scaled)\n",
    "\n",
    "print(\"kNN regression results:\")\n",
    "print(f\"  MSE:  {mean_squared_error(y_test_d, y_pred_d):.4f}\")\n",
    "print(f\"  RMSE: {np.sqrt(mean_squared_error(y_test_d, y_pred_d)):.4f}\")\n",
    "print(f\"  R²:   {r2_score(y_test_d, y_pred_d):.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-17",
   "metadata": {},
   "source": [
    "## 7. kNN Search Algorithm Comparison\n",
    "\n",
    "On large datasets, the choice of neighbor-search algorithm matters.\n",
    "\n",
    "- **brute**: exhaustive search (O(n) per query)\n",
    "- **kd_tree**: KD-Tree (efficient in low dimensions)\n",
    "- **ball_tree**: Ball-Tree (efficient in higher dimensions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-18",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Timing by search algorithm\n",
    "algorithms = ['brute', 'kd_tree', 'ball_tree']\n",
    "\n",
    "print(\"Timing by search algorithm:\")\n",
    "print(\"-\" * 60)\n",
    "for algo in algorithms:\n",
    "    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo)\n",
    "\n",
    "    # Training time\n",
    "    start = time()\n",
    "    knn.fit(X_train, y_train)\n",
    "    fit_time = time() - start\n",
    "\n",
    "    # Prediction time\n",
    "    start = time()\n",
    "    knn.predict(X_test)\n",
    "    pred_time = time() - start\n",
    "\n",
    "    print(f\"  {algo:10s}: fit={fit_time:.4f}s, predict={pred_time:.4f}s\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-19",
   "metadata": {},
   "source": [
    "## 8. Naive Bayes\n",
    "\n",
    "### Bayes' Theorem\n",
    "\n",
    "**P(y|X) = P(X|y) × P(y) / P(X)**\n",
    "\n",
    "- P(y|X): posterior (probability of the class given the features)\n",
    "- P(X|y): likelihood (probability of the features given the class)\n",
    "- P(y): prior (base probability of the class)\n",
    "- P(X): evidence (probability of the features)\n",
    "\n",
    "### The Naive Assumption\n",
    "\n",
    "Assume all features are mutually independent:\n",
    "**P(X|y) = P(x₁|y) × P(x₂|y) × ... × P(xₙ|y)**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-20",
   "metadata": {},
   "source": [
    "## 9. Gaussian Naive Bayes\n",
    "\n",
    "Assumes each continuous feature follows a Gaussian (normal) distribution within each class:\n",
    "**P(xᵢ|y) = N(xᵢ; μ_y, σ_y)**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-21",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Gaussian Naive Bayes\n",
    "gnb = GaussianNB()\n",
    "gnb.fit(X_train, y_train)\n",
    "y_pred_nb = gnb.predict(X_test)\n",
    "\n",
    "print(\"Gaussian Naive Bayes results:\")\n",
    "print(f\"  Accuracy: {accuracy_score(y_test, y_pred_nb):.4f}\")\n",
    "\n",
    "# Inspect the learned parameters\n",
    "print(f\"\\nClass priors: {gnb.class_prior_}\")\n",
    "print(\"\\nPer-class means (first 2 features):\")\n",
    "print(gnb.theta_[:, :2])\n",
    "print(\"\\nPer-class variances (first 2 features):\")\n",
    "print(gnb.var_[:, :2])"
   ]
  },
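  {
   "cell_type": "markdown",
   "id": "cell-21a",
   "metadata": {},
   "source": [
    "As a sanity check on Bayes' theorem, the posterior P(y|x) can be recomputed by hand from the parameters `GaussianNB` just learned: sum the per-feature Gaussian log-likelihoods (the naive independence assumption), add the log prior, and normalize by the evidence P(X). A sketch, assuming `gnb` and `X_test` from the cells above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-21b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.special import logsumexp\n",
    "\n",
    "x = X_test[0]\n",
    "# Naive assumption: log P(X|y) = sum over features of log N(x_i; mu, sigma^2)\n",
    "log_lik = -0.5 * np.sum(\n",
    "    np.log(2 * np.pi * gnb.var_) + (x - gnb.theta_) ** 2 / gnb.var_, axis=1\n",
    ")\n",
    "log_joint = np.log(gnb.class_prior_) + log_lik  # add the log prior\n",
    "# Normalize by the evidence P(X) in log space\n",
    "posterior = np.exp(log_joint - logsumexp(log_joint))\n",
    "\n",
    "print(\"Manual posterior:\", np.round(posterior, 4))\n",
    "print(\"predict_proba   :\", np.round(gnb.predict_proba(X_test[:1])[0], 4))"
   ]
  },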
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-22",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Probability predictions\n",
    "y_proba = gnb.predict_proba(X_test[:5])\n",
    "\n",
    "print(\"Probability predictions (first 5 samples):\")\n",
    "print(f\"Classes: {iris.target_names}\")\n",
    "print(y_proba)\n",
    "print(f\"\\nPredicted classes: {gnb.predict(X_test[:5])}\")\n",
    "print(f\"True classes:      {y_test[:5]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-23",
   "metadata": {},
   "source": [
    "## 10. Multinomial Naive Bayes - Text Classification\n",
    "\n",
    "Works on discrete count features and is mainly used for text classification (word counts).\n",
    "\n",
    "**P(xᵢ|y) = (N_yi + α) / (N_y + αn)**\n",
    "\n",
    "- α: Laplace smoothing parameter (solves the zero-frequency problem)"
   ]
  },
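  {
   "cell_type": "markdown",
   "id": "cell-23a",
   "metadata": {},
   "source": [
    "Before applying it to real text, the smoothing formula can be verified on a tiny hypothetical count matrix: with α = 1, the log of (N_yi + α) / (N_y + αn) computed directly should match `MultinomialNB.feature_log_prob_`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-23b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical toy counts: 3 documents, 3 \"words\", 2 classes\n",
    "X_toy = np.array([[2, 1, 0],\n",
    "                  [3, 0, 1],\n",
    "                  [0, 2, 4]])\n",
    "y_toy = np.array([0, 0, 1])\n",
    "alpha = 1.0\n",
    "mnb_toy = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)\n",
    "\n",
    "# P(x_i|y) = (N_yi + alpha) / (N_y + alpha * n), n = number of features\n",
    "counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in (0, 1)])\n",
    "manual = np.log((counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * 3))\n",
    "print(np.allclose(manual, mnb_toy.feature_log_prob_))  # -> True"
   ]
  },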
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-24",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the news data\n",
    "categories = ['sci.space', 'rec.sport.baseball', 'talk.politics.misc']\n",
    "newsgroups = fetch_20newsgroups(\n",
    "    subset='train',\n",
    "    categories=categories,\n",
    "    remove=('headers', 'footers', 'quotes'),\n",
    "    random_state=42\n",
    ")\n",
    "\n",
    "print(f\"News data: {len(newsgroups.data)} articles\")\n",
    "print(f\"Categories: {categories}\")\n",
    "print(f\"\\nFirst article (excerpt):\\n{newsgroups.data[0][:200]}...\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-25",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Text vectorization\n",
    "vectorizer = CountVectorizer(max_features=5000, stop_words='english')\n",
    "X_news = vectorizer.fit_transform(newsgroups.data)\n",
    "y_news = newsgroups.target\n",
    "\n",
    "print(f\"Matrix shape: {X_news.shape}\")\n",
    "print(f\"Number of features: {len(vectorizer.get_feature_names_out())}\")\n",
    "\n",
    "# Train/test split\n",
    "X_train_news, X_test_news, y_train_news, y_test_news = train_test_split(\n",
    "    X_news, y_news, test_size=0.2, random_state=42\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-26",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Multinomial Naive Bayes\n",
    "mnb = MultinomialNB(alpha=1.0)  # alpha: Laplace smoothing\n",
    "mnb.fit(X_train_news, y_train_news)\n",
    "\n",
    "y_pred_news = mnb.predict(X_test_news)\n",
    "\n",
    "print(\"Multinomial Naive Bayes (text classification) results:\")\n",
    "print(f\"  Accuracy: {mnb.score(X_test_news, y_test_news):.4f}\")\n",
    "print(\"\\nClassification report:\")\n",
    "# Note: newsgroups.target_names is sorted alphabetically, which may differ\n",
    "# from the order of the `categories` list above\n",
    "print(classification_report(y_test_news, y_pred_news, target_names=newsgroups.target_names))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-27",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Most indicative words for each class\n",
    "feature_names = vectorizer.get_feature_names_out()\n",
    "\n",
    "print(\"Top 10 words per class:\")\n",
    "print(\"=\" * 60)\n",
    "for i, category in enumerate(newsgroups.target_names):\n",
    "    top_indices = mnb.feature_log_prob_[i].argsort()[-10:][::-1]\n",
    "    top_words = [feature_names[idx] for idx in top_indices]\n",
    "    print(f\"\\n{category}:\")\n",
    "    print(f\"  {', '.join(top_words)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-28",
   "metadata": {},
   "source": [
    "## 11. Bernoulli Naive Bayes\n",
    "\n",
    "Uses binary features (0/1); it classifies text by word presence/absence rather than counts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-29",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Binary vectorization (word presence only)\n",
    "binary_vectorizer = CountVectorizer(max_features=5000, binary=True, stop_words='english')\n",
    "X_binary = binary_vectorizer.fit_transform(newsgroups.data)\n",
    "\n",
    "X_train_bin, X_test_bin, y_train_bin, y_test_bin = train_test_split(\n",
    "    X_binary, y_news, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "# Bernoulli Naive Bayes\n",
    "bnb = BernoulliNB(alpha=1.0)\n",
    "bnb.fit(X_train_bin, y_train_bin)\n",
    "\n",
    "print(\"Bernoulli Naive Bayes results:\")\n",
    "print(f\"  Accuracy: {bnb.score(X_test_bin, y_test_bin):.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-30",
   "metadata": {},
   "source": [
    "## 12. Naive Bayes Model Comparison"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-31",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Digit image data\n",
    "digits = load_digits()\n",
    "X_train_dig, X_test_dig, y_train_dig, y_test_dig = train_test_split(\n",
    "    digits.data, digits.target, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "# Compare the three Naive Bayes variants\n",
    "models = {\n",
    "    'Gaussian NB': GaussianNB(),\n",
    "    'Multinomial NB': MultinomialNB(),\n",
    "    'Bernoulli NB': BernoulliNB()\n",
    "}\n",
    "\n",
    "print(\"Naive Bayes model comparison (Digits):\")\n",
    "print(\"-\" * 50)\n",
    "for name, model in models.items():\n",
    "    model.fit(X_train_dig, y_train_dig)\n",
    "    acc = model.score(X_test_dig, y_test_dig)\n",
    "    print(f\"  {name:18s}: {acc:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-32",
   "metadata": {},
   "source": [
    "## 13. Online Learning (Incremental Learning)\n",
    "\n",
    "Naive Bayes supports online learning via `partial_fit`,\n",
    "which is useful for large-scale or streaming data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-33",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Online learning simulation\n",
    "gnb_online = GaussianNB()\n",
    "\n",
    "# Mini-batch training\n",
    "batch_size = 50\n",
    "n_batches = len(X_train) // batch_size\n",
    "\n",
    "for i in range(n_batches):\n",
    "    start = i * batch_size\n",
    "    end = start + batch_size\n",
    "    X_batch = X_train[start:end]\n",
    "    y_batch = y_train[start:end]\n",
    "\n",
    "    # Declare the full set of classes on the first batch\n",
    "    if i == 0:\n",
    "        gnb_online.partial_fit(X_batch, y_batch, classes=np.unique(y_train))\n",
    "    else:\n",
    "        gnb_online.partial_fit(X_batch, y_batch)\n",
    "\n",
    "print(\"Online learning results:\")\n",
    "print(f\"  Number of batches: {n_batches}\")\n",
    "print(f\"  Batch size: {batch_size}\")\n",
    "print(f\"  Accuracy: {gnb_online.score(X_test, y_test):.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-34",
   "metadata": {},
   "source": [
    "## 14. kNN vs Naive Bayes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-35",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Breast cancer data\n",
    "cancer = load_breast_cancer()\n",
    "X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(\n",
    "    cancer.data, cancer.target, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "# Scaling\n",
    "scaler = StandardScaler()\n",
    "X_train_c_scaled = scaler.fit_transform(X_train_c)\n",
    "X_test_c_scaled = scaler.transform(X_test_c)\n",
    "\n",
    "# Model comparison\n",
    "models = {\n",
    "    'kNN (k=5)': KNeighborsClassifier(n_neighbors=5),\n",
    "    'kNN (weighted)': KNeighborsClassifier(n_neighbors=5, weights='distance'),\n",
    "    'Gaussian NB': GaussianNB()\n",
    "}\n",
    "\n",
    "print(\"kNN vs Naive Bayes comparison (Breast Cancer):\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "for name, model in models.items():\n",
    "    if 'kNN' in name:\n",
    "        model.fit(X_train_c_scaled, y_train_c)\n",
    "        acc = model.score(X_test_c_scaled, y_test_c)\n",
    "    else:\n",
    "        model.fit(X_train_c, y_train_c)\n",
    "        acc = model.score(X_test_c, y_test_c)\n",
    "    print(f\"  {name:18s}: {acc:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-36",
   "metadata": {},
   "source": [
    "## 15. A Simple Text Classification Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-37",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Simple sentiment classification\n",
    "texts = [\n",
    "    \"I love this movie\", \"Great film\", \"Excellent acting\",\n",
    "    \"Amazing performance\", \"Wonderful story\",\n",
    "    \"Terrible movie\", \"Bad film\", \"Worst movie ever\",\n",
    "    \"Horrible acting\", \"Disappointing story\"\n",
    "]\n",
    "labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 1: positive, 0: negative\n",
    "\n",
    "# TF-IDF vectorization\n",
    "tfidf = TfidfVectorizer()\n",
    "X_sentiment = tfidf.fit_transform(texts)\n",
    "\n",
    "# Train Naive Bayes\n",
    "mnb_sentiment = MultinomialNB()\n",
    "mnb_sentiment.fit(X_sentiment, labels)\n",
    "\n",
    "# Classify new texts\n",
    "new_texts = [\n",
    "    \"This is a great movie\",\n",
    "    \"I hate this film\",\n",
    "    \"Excellent performance and story\",\n",
    "    \"Terrible and disappointing\"\n",
    "]\n",
    "X_new = tfidf.transform(new_texts)\n",
    "predictions = mnb_sentiment.predict(X_new)\n",
    "probabilities = mnb_sentiment.predict_proba(X_new)\n",
    "\n",
    "print(\"Sentiment classification results:\")\n",
    "print(\"=\" * 60)\n",
    "for text, pred, prob in zip(new_texts, predictions, probabilities):\n",
    "    sentiment = \"Positive\" if pred == 1 else \"Negative\"\n",
    "    confidence = max(prob) * 100\n",
    "    print(f\"'{text}'\")\n",
    "    print(f\"  -> {sentiment} (confidence: {confidence:.1f}%)\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-38",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "### kNN Summary\n",
    "\n",
    "| Parameter | Description | Recommendation |\n",
    "|-----------|-------------|----------------|\n",
    "| **n_neighbors** | number of neighbors (k) | select via cross-validation |\n",
    "| **weights** | weighting scheme | 'distance' recommended |\n",
    "| **metric** | distance metric | 'euclidean' by default |\n",
    "| **algorithm** | neighbor search algorithm | 'auto' |\n",
    "\n",
    "**Characteristics**:\n",
    "- Lazy learning (no training time)\n",
    "- Slow prediction (O(n·d))\n",
    "- Requires feature scaling\n",
    "- Degrades in high dimensions (curse of dimensionality)\n",
    "\n",
    "### Naive Bayes Summary\n",
    "\n",
    "| Variant | Feature type | Main use |\n",
    "|---------|--------------|----------|\n",
    "| **GaussianNB** | continuous (normal) | general classification |\n",
    "| **MultinomialNB** | counts/frequencies | text classification |\n",
    "| **BernoulliNB** | binary (0/1) | word presence/absence |\n",
    "\n",
    "**Characteristics**:\n",
    "- Very fast (train O(n·d), predict O(d))\n",
    "- Works well even with little data\n",
    "- Effective on high-dimensional data\n",
    "- Supports online learning\n",
    "- Assumes feature independence (often violated in practice)\n",
    "\n",
    "### kNN vs Naive Bayes\n",
    "\n",
    "| Property | kNN | Naive Bayes |\n",
    "|----------|-----|-------------|\n",
    "| **Training time** | O(1) | O(n·d) |\n",
    "| **Prediction time** | O(n·d) | O(d) |\n",
    "| **Memory** | high | low |\n",
    "| **Scaling** | required | not needed |\n",
    "| **High dimensions** | weak | strong |\n",
    "| **Interpretability** | intuitive | probabilistic |\n",
    "\n",
    "### Next Steps\n",
    "- Clustering (K-Means, DBSCAN)\n",
    "- Dimensionality Reduction (PCA, t-SNE)\n",
    "- Ensemble methods (Stacking, Voting)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}