14_kaggle_project.ipynb

   1{
   2 "cells": [
   3  {
   4   "cell_type": "markdown",
   5   "metadata": {},
   6   "source": [
   7    "# ์‹ค์ „ ํ”„๋กœ์ ํŠธ: ํƒ€์ดํƒ€๋‹‰ ์ƒ์กด ์˜ˆ์ธก (Kaggle ์Šคํƒ€์ผ)\n",
   8    "\n",
   9    "์‹ค์ œ ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ•˜์—ฌ ๋จธ์‹ ๋Ÿฌ๋‹ ํ”„๋กœ์ ํŠธ๋ฅผ ์ฒ˜์Œ๋ถ€ํ„ฐ ๋๊นŒ์ง€ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. Kaggle ๊ฒฝ์ง„๋Œ€ํšŒ ๋ฐฉ์‹์œผ๋กœ ์ ‘๊ทผํ•˜์—ฌ ์‹ค๋ฌด ๋…ธํ•˜์šฐ๋ฅผ ์ตํž™๋‹ˆ๋‹ค.\n",
  10    "\n",
  11    "**ํ•™์Šต ๋ชฉํ‘œ:**\n",
  12    "- ์™„์ „ํ•œ ML ์›Œํฌํ”Œ๋กœ์šฐ ๊ฒฝํ—˜\n",
  13    "- ํƒ์ƒ‰์  ๋ฐ์ดํ„ฐ ๋ถ„์„ (EDA) ์ˆ˜ํ–‰\n",
  14    "- ํŠน์„ฑ ์—”์ง€๋‹ˆ์–ด๋ง ๊ธฐ๋ฒ• ์ ์šฉ\n",
  15    "- ์—ฌ๋Ÿฌ ๋ชจ๋ธ ๋น„๊ต ๋ฐ ์„ ํƒ\n",
  16    "- ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹\n",
  17    "- Kaggle ๊ฒฝ์ง„๋Œ€ํšŒ ์ „๋žต ์ดํ•ด"
  18   ]
  19  },
  20  {
  21   "cell_type": "code",
  22   "execution_count": null,
  23   "metadata": {},
  24   "outputs": [],
  25   "source": [
  26    "import numpy as np\n",
  27    "import pandas as pd\n",
  28    "import matplotlib.pyplot as plt\n",
  29    "import seaborn as sns\n",
  30    "\n",
  31    "from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold, StratifiedKFold\n",
  32    "from sklearn.preprocessing import StandardScaler, LabelEncoder\n",
  33    "from sklearn.impute import SimpleImputer\n",
  34    "from sklearn.metrics import (\n",
  35    "    accuracy_score, classification_report, confusion_matrix,\n",
  36    "    roc_auc_score, roc_curve\n",
  37    ")\n",
  38    "\n",
  39    "from sklearn.linear_model import LogisticRegression\n",
  40    "from sklearn.tree import DecisionTreeClassifier\n",
  41    "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n",
  42    "from sklearn.svm import SVC\n",
  43    "\n",
  44    "import warnings\n",
  45    "warnings.filterwarnings('ignore')\n",
  46    "\n",
  47    "# ์‹œ๊ฐํ™” ์„ค์ •\n",
  48    "plt.style.use('seaborn-v0_8-darkgrid')\n",
  49    "sns.set_palette('husl')"
  50   ]
  51  },
  52  {
  53   "cell_type": "markdown",
  54   "metadata": {},
  55   "source": [
  56    "## 1. ๋ฌธ์ œ ์ •์˜\n",
  57    "\n",
  58    "**๋ชฉํ‘œ**: ํƒ€์ดํƒ€๋‹‰ ์Šน๊ฐ์˜ ์ƒ์กด ์—ฌ๋ถ€๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ถ„๋ฅ˜ ๋ชจ๋ธ ๊ฐœ๋ฐœ\n",
  59    "\n",
  60    "**ํ‰๊ฐ€ ์ง€ํ‘œ**: Accuracy (์ •ํ™•๋„)\n",
  61    "\n",
  62    "**๋ฐ์ดํ„ฐ**: ์Šน๊ฐ์˜ ๋‚˜์ด, ์„ฑ๋ณ„, ๊ฐ์‹ค ๋“ฑ๊ธ‰, ์š”๊ธˆ ๋“ฑ์˜ ์ •๋ณด"
  63   ]
  64  },
  65  {
  66   "cell_type": "markdown",
  67   "metadata": {},
  68   "source": [
  69    "## 2. ๋ฐ์ดํ„ฐ ๋กœ๋“œ ๋ฐ ๊ธฐ๋ณธ ํƒ์ƒ‰"
  70   ]
  71  },
  72  {
  73   "cell_type": "code",
  74   "execution_count": null,
  75   "metadata": {},
  76   "outputs": [],
  77   "source": [
  78    "# seaborn ๋‚ด์žฅ ํƒ€์ดํƒ€๋‹‰ ๋ฐ์ดํ„ฐ์…‹ ์‚ฌ์šฉ\n",
  79    "# Kaggle์—์„œ๋Š” train.csv, test.csv๋ฅผ ๋‹ค์šด๋กœ๋“œํ•˜์—ฌ ์‚ฌ์šฉ\n",
  80    "df = sns.load_dataset('titanic')\n",
  81    "\n",
  82    "print(\"=== ๋ฐ์ดํ„ฐ ๊ธฐ๋ณธ ์ •๋ณด ===\")\n",
  83    "print(f\"๋ฐ์ดํ„ฐ ํ˜•์ƒ: {df.shape}\")\n",
  84    "print(f\"\\n์ปฌ๋Ÿผ ๋ชฉ๋ก:\")\n",
  85    "print(df.columns.tolist())\n",
  86    "print(f\"\\n๋ฐ์ดํ„ฐ ํƒ€์ž…:\")\n",
  87    "print(df.dtypes)"
  88   ]
  89  },
  90  {
  91   "cell_type": "code",
  92   "execution_count": null,
  93   "metadata": {},
  94   "outputs": [],
  95   "source": [
  96    "# ์ฒ˜์Œ ๋ช‡ ํ–‰ ํ™•์ธ\n",
  97    "print(\"์ฒ˜์Œ 5ํ–‰:\")\n",
  98    "df.head()"
  99   ]
 100  },
 101  {
 102   "cell_type": "code",
 103   "execution_count": null,
 104   "metadata": {},
 105   "outputs": [],
 106   "source": [
 107    "# ๊ธฐ์ˆ  ํ†ต๊ณ„\n",
 108    "print(\"๊ธฐ์ˆ  ํ†ต๊ณ„:\")\n",
 109    "df.describe()"
 110   ]
 111  },
 112  {
 113   "cell_type": "code",
 114   "execution_count": null,
 115   "metadata": {},
 116   "outputs": [],
 117   "source": [
 118    "# ํƒ€๊ฒŸ ๋ณ€์ˆ˜ ๋ถ„ํฌ\n",
 119    "print(\"=== ์ƒ์กด ์—ฌ๋ถ€ ๋ถ„ํฌ ===\")\n",
 120    "print(df['survived'].value_counts())\n",
 121    "print(f\"\\n์ƒ์กด ๋น„์œจ:\")\n",
 122    "print(df['survived'].value_counts(normalize=True))\n",
 123    "\n",
 124    "# ์‹œ๊ฐํ™”\n",
 125    "fig, ax = plt.subplots(1, 2, figsize=(12, 4))\n",
 126    "\n",
 127    "df['survived'].value_counts().plot(kind='bar', ax=ax[0])\n",
 128    "ax[0].set_title('Survival Count')\n",
 129    "ax[0].set_xlabel('Survived (0=No, 1=Yes)')\n",
 130    "ax[0].set_ylabel('Count')\n",
 131    "\n",
 132    "df['survived'].value_counts(normalize=True).plot(kind='pie', autopct='%1.1f%%', ax=ax[1])\n",
 133    "ax[1].set_title('Survival Proportion')\n",
 134    "ax[1].set_ylabel('')\n",
 135    "\n",
 136    "plt.tight_layout()\n",
 137    "plt.show()"
 138   ]
 139  },
 140  {
 141   "cell_type": "markdown",
 142   "metadata": {},
 143   "source": [
 144    "## 3. ํƒ์ƒ‰์  ๋ฐ์ดํ„ฐ ๋ถ„์„ (EDA)\n",
 145    "\n",
 146    "### 3.1 ๊ฒฐ์ธก์น˜ ๋ถ„์„"
 147   ]
 148  },
 149  {
 150   "cell_type": "code",
 151   "execution_count": null,
 152   "metadata": {},
 153   "outputs": [],
 154   "source": [
 155    "# ๊ฒฐ์ธก์น˜ ํ™•์ธ\n",
 156    "print(\"=== ๊ฒฐ์ธก์น˜ ๋ถ„์„ ===\")\n",
 157    "missing = df.isnull().sum()\n",
 158    "missing_pct = (missing / len(df) * 100).round(2)\n",
 159    "missing_df = pd.DataFrame({\n",
 160    "    '๊ฒฐ์ธก์น˜ ์ˆ˜': missing,\n",
 161    "    '๊ฒฐ์ธก์น˜ ๋น„์œจ(%)': missing_pct\n",
 162    "})\n",
 163    "print(missing_df[missing_df['๊ฒฐ์ธก์น˜ ์ˆ˜'] > 0].sort_values(by='๊ฒฐ์ธก์น˜ ์ˆ˜', ascending=False))\n",
 164    "\n",
 165    "# ์‹œ๊ฐํ™”\n",
 166    "plt.figure(figsize=(10, 6))\n",
 167    "missing_data = missing_df[missing_df['๊ฒฐ์ธก์น˜ ์ˆ˜'] > 0].sort_values(by='๊ฒฐ์ธก์น˜ ์ˆ˜', ascending=False)\n",
 168    "plt.barh(missing_data.index, missing_data['๊ฒฐ์ธก์น˜ ๋น„์œจ(%)'])\n",
 169    "plt.xlabel('Missing Percentage (%)')\n",
 170    "plt.title('Missing Values by Feature')\n",
 171    "plt.tight_layout()\n",
 172    "plt.show()"
 173   ]
 174  },
 175  {
 176   "cell_type": "markdown",
 177   "metadata": {},
 178   "source": [
 179    "### 3.2 ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜์™€ ์ƒ์กด์˜ ๊ด€๊ณ„"
 180   ]
 181  },
 182  {
 183   "cell_type": "code",
 184   "execution_count": null,
 185   "metadata": {},
 186   "outputs": [],
 187   "source": [
 188    "# ์ฃผ์š” ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜์™€ ์ƒ์กด์˜ ๊ด€๊ณ„\n",
 189    "fig, axes = plt.subplots(2, 3, figsize=(15, 10))\n",
 190    "\n",
 191    "# ์„ฑ๋ณ„\n",
 192    "sns.countplot(data=df, x='sex', hue='survived', ax=axes[0, 0])\n",
 193    "axes[0, 0].set_title('Survival by Sex')\n",
 194    "\n",
 195    "# ๊ฐ์‹ค ๋“ฑ๊ธ‰\n",
 196    "sns.countplot(data=df, x='pclass', hue='survived', ax=axes[0, 1])\n",
 197    "axes[0, 1].set_title('Survival by Class')\n",
 198    "\n",
 199    "# ์Šน์„  ํ•ญ๊ตฌ\n",
 200    "sns.countplot(data=df, x='embarked', hue='survived', ax=axes[0, 2])\n",
 201    "axes[0, 2].set_title('Survival by Embarked')\n",
 202    "\n",
 203    "# ํ˜•์ œ/๋ฐฐ์šฐ์ž ์ˆ˜\n",
 204    "sns.countplot(data=df, x='sibsp', hue='survived', ax=axes[1, 0])\n",
 205    "axes[1, 0].set_title('Survival by SibSp')\n",
 206    "\n",
 207    "# ๋ถ€๋ชจ/์ž๋…€ ์ˆ˜\n",
 208    "sns.countplot(data=df, x='parch', hue='survived', ax=axes[1, 1])\n",
 209    "axes[1, 1].set_title('Survival by Parch')\n",
 210    "\n",
  211    "# ํ˜ผ์ž ์—ฌํ–‰ ์—ฌ๋ถ€ (seaborn ๋ฐ์ดํ„ฐ์…‹์˜ ๊ธฐ์กด alone ์ปฌ๋Ÿผ์„ 0/1 ์ •์ˆ˜๋กœ ์žฌ๊ณ„์‚ฐ)\n",
  212    "df['alone'] = ((df['sibsp'] + df['parch']) == 0).astype(int)\n",
 213    "sns.countplot(data=df, x='alone', hue='survived', ax=axes[1, 2])\n",
 214    "axes[1, 2].set_title('Survival by Alone')\n",
 215    "\n",
 216    "plt.tight_layout()\n",
 217    "plt.show()"
 218   ]
 219  },
 220  {
 221   "cell_type": "code",
 222   "execution_count": null,
 223   "metadata": {},
 224   "outputs": [],
 225   "source": [
 226    "# ์ƒ์กด์œจ ํ†ต๊ณ„\n",
 227    "print(\"=== ๋ฒ”์ฃผ๋ณ„ ์ƒ์กด์œจ ===\")\n",
 228    "print(\"\\n์„ฑ๋ณ„:\")\n",
 229    "print(df.groupby('sex')['survived'].mean())\n",
 230    "print(\"\\n๊ฐ์‹ค ๋“ฑ๊ธ‰:\")\n",
 231    "print(df.groupby('pclass')['survived'].mean())\n",
 232    "print(\"\\n์Šน์„  ํ•ญ๊ตฌ:\")\n",
 233    "print(df.groupby('embarked')['survived'].mean())"
 234   ]
 235  },
 236  {
 237   "cell_type": "markdown",
 238   "metadata": {},
 239   "source": [
 240    "### 3.3 ์ˆ˜์น˜ํ˜• ๋ณ€์ˆ˜ ๋ถ„์„"
 241   ]
 242  },
 243  {
 244   "cell_type": "code",
 245   "execution_count": null,
 246   "metadata": {},
 247   "outputs": [],
 248   "source": [
 249    "# ๋‚˜์ด์™€ ์š”๊ธˆ ๋ถ„ํฌ (์ƒ์กด ์—ฌ๋ถ€๋ณ„)\n",
 250    "fig, axes = plt.subplots(2, 2, figsize=(14, 10))\n",
 251    "\n",
 252    "# ๋‚˜์ด ๋ถ„ํฌ\n",
 253    "for survived in [0, 1]:\n",
 254    "    axes[0, 0].hist(df[df['survived'] == survived]['age'].dropna(), \n",
 255    "                    bins=30, alpha=0.5, label=f'Survived={survived}')\n",
 256    "axes[0, 0].set_xlabel('Age')\n",
 257    "axes[0, 0].set_ylabel('Count')\n",
 258    "axes[0, 0].set_title('Age Distribution by Survival')\n",
 259    "axes[0, 0].legend()\n",
 260    "\n",
 261    "# ๋‚˜์ด ๋ฐ•์Šคํ”Œ๋กฏ\n",
 262    "sns.boxplot(data=df, x='survived', y='age', ax=axes[0, 1])\n",
 263    "axes[0, 1].set_title('Age by Survival')\n",
 264    "\n",
 265    "# ์š”๊ธˆ ๋ถ„ํฌ (๋กœ๊ทธ ์Šค์ผ€์ผ)\n",
 266    "for survived in [0, 1]:\n",
 267    "    axes[1, 0].hist(np.log1p(df[df['survived'] == survived]['fare'].dropna()), \n",
 268    "                    bins=30, alpha=0.5, label=f'Survived={survived}')\n",
 269    "axes[1, 0].set_xlabel('Log(Fare + 1)')\n",
 270    "axes[1, 0].set_ylabel('Count')\n",
 271    "axes[1, 0].set_title('Fare Distribution by Survival (Log Scale)')\n",
 272    "axes[1, 0].legend()\n",
 273    "\n",
 274    "# ์š”๊ธˆ ๋ฐ•์Šคํ”Œ๋กฏ\n",
 275    "sns.boxplot(data=df, x='survived', y='fare', ax=axes[1, 1])\n",
 276    "axes[1, 1].set_title('Fare by Survival')\n",
 277    "axes[1, 1].set_ylim(0, 300)\n",
 278    "\n",
 279    "plt.tight_layout()\n",
 280    "plt.show()"
 281   ]
 282  },
 283  {
 284   "cell_type": "code",
 285   "execution_count": null,
 286   "metadata": {},
 287   "outputs": [],
 288   "source": [
 289    "# ์ƒ๊ด€๊ด€๊ณ„ ๋ถ„์„\n",
 290    "print(\"=== ์ˆ˜์น˜ํ˜• ๋ณ€์ˆ˜ ์ƒ๊ด€๊ด€๊ณ„ ===\")\n",
 291    "numeric_cols = df.select_dtypes(include=[np.number]).columns\n",
 292    "correlation = df[numeric_cols].corr()\n",
 293    "\n",
 294    "plt.figure(figsize=(10, 8))\n",
 295    "sns.heatmap(correlation, annot=True, fmt='.2f', cmap='coolwarm', center=0)\n",
 296    "plt.title('Correlation Matrix')\n",
 297    "plt.tight_layout()\n",
 298    "plt.show()\n",
 299    "\n",
 300    "print(\"\\nํƒ€๊ฒŸ(survived)๊ณผ์˜ ์ƒ๊ด€๊ด€๊ณ„:\")\n",
 301    "print(correlation['survived'].sort_values(ascending=False))"
 302   ]
 303  },
 304  {
 305   "cell_type": "markdown",
 306   "metadata": {},
 307   "source": [
 308    "## 4. ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ ๋ฐ ํŠน์„ฑ ์—”์ง€๋‹ˆ์–ด๋ง"
 309   ]
 310  },
 311  {
 312   "cell_type": "code",
 313   "execution_count": null,
 314   "metadata": {},
 315   "outputs": [],
 316   "source": [
 317    "# ์ž‘์—…์šฉ ๋ฐ์ดํ„ฐ ๋ณต์‚ฌ\n",
 318    "df_clean = df.copy()\n",
 319    "\n",
 320    "print(\"=== ์ „์ฒ˜๋ฆฌ ์‹œ์ž‘ ===\")\n",
 321    "print(f\"์ดˆ๊ธฐ ๋ฐ์ดํ„ฐ ํ˜•์ƒ: {df_clean.shape}\")"
 322   ]
 323  },
 324  {
 325   "cell_type": "markdown",
 326   "metadata": {},
 327   "source": [
 328    "### 4.1 ๋ถˆํ•„์š”ํ•œ ์ปฌ๋Ÿผ ์ œ๊ฑฐ"
 329   ]
 330  },
 331  {
 332   "cell_type": "code",
 333   "execution_count": null,
 334   "metadata": {},
 335   "outputs": [],
 336   "source": [
 337    "# ์ค‘๋ณต๋˜๊ฑฐ๋‚˜ ๋ถˆํ•„์š”ํ•œ ์ปฌ๋Ÿผ ์ œ๊ฑฐ\n",
 338    "drop_cols = ['deck', 'embark_town', 'alive', 'who', 'adult_male', 'class']\n",
 339    "df_clean = df_clean.drop(columns=drop_cols, errors='ignore')\n",
 340    "\n",
 341    "print(f\"์ปฌ๋Ÿผ ์ œ๊ฑฐ ํ›„: {df_clean.shape}\")\n",
 342    "print(f\"๋‚จ์€ ์ปฌ๋Ÿผ: {df_clean.columns.tolist()}\")"
 343   ]
 344  },
 345  {
 346   "cell_type": "markdown",
 347   "metadata": {},
 348   "source": [
 349    "### 4.2 ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ"
 350   ]
 351  },
 352  {
 353   "cell_type": "code",
 354   "execution_count": null,
 355   "metadata": {},
 356   "outputs": [],
 357   "source": [
 358    "# ๋‚˜์ด: ์ค‘๊ฐ„๊ฐ’์œผ๋กœ ๋Œ€์ฒด\n",
 359    "age_median = df_clean['age'].median()\n",
 360    "df_clean['age'] = df_clean['age'].fillna(age_median)\n",
 361    "print(f\"๋‚˜์ด ๊ฒฐ์ธก์น˜๋ฅผ ์ค‘๊ฐ„๊ฐ’({age_median})์œผ๋กœ ๋Œ€์ฒด\")\n",
 362    "\n",
 363    "# ์Šน์„  ํ•ญ๊ตฌ: ์ตœ๋นˆ๊ฐ’์œผ๋กœ ๋Œ€์ฒด\n",
 364    "embarked_mode = df_clean['embarked'].mode()[0]\n",
 365    "df_clean['embarked'] = df_clean['embarked'].fillna(embarked_mode)\n",
 366    "print(f\"์Šน์„  ํ•ญ๊ตฌ ๊ฒฐ์ธก์น˜๋ฅผ ์ตœ๋นˆ๊ฐ’({embarked_mode})์œผ๋กœ ๋Œ€์ฒด\")\n",
 367    "\n",
 368    "# ์š”๊ธˆ: ์ค‘๊ฐ„๊ฐ’์œผ๋กœ ๋Œ€์ฒด\n",
 369    "fare_median = df_clean['fare'].median()\n",
 370    "df_clean['fare'] = df_clean['fare'].fillna(fare_median)\n",
 371    "\n",
 372    "print(f\"\\n๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ ํ›„:\")\n",
 373    "print(df_clean.isnull().sum()[df_clean.isnull().sum() > 0])"
 374   ]
 375  },
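 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "์ฐธ๊ณ : ์œ„์—์„œ ์ž„ํฌํŠธ๋งŒ ํ•ด ๋‘” `SimpleImputer`๋กœ๋„ ๊ฐ™์€ ๋Œ€์ฒด๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์›๋ณธ `df`์˜ ์ผ๋ถ€ ์ปฌ๋Ÿผ์— ์ ์šฉํ•ด ๋ณด๋Š” ๊ฐ„๋‹จํ•œ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค. ์‹ค์ „(Kaggle)์—์„œ๋Š” ํ•™์Šต ๋ฐ์ดํ„ฐ์—๋งŒ `fit`ํ•˜๊ณ  ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์—๋Š” `transform`๋งŒ ์ ์šฉํ•ด์•ผ ๋ฐ์ดํ„ฐ ๋ˆ„์ˆ˜๋ฅผ ํ”ผํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# SimpleImputer๋กœ ๊ฒฐ์ธก์น˜ ๋Œ€์ฒด (์œ„ fillna์™€ ๋™์ผํ•œ ํšจ๊ณผ, ํŒŒ์ดํ”„๋ผ์ธ ๊ตฌ์„ฑ์— ์œ ๋ฆฌ)\n",
   "num_imputer = SimpleImputer(strategy='median')\n",
   "cat_imputer = SimpleImputer(strategy='most_frequent')\n",
   "\n",
   "demo = df[['age', 'fare', 'embarked']].copy()\n",
   "demo[['age', 'fare']] = num_imputer.fit_transform(demo[['age', 'fare']])\n",
   "demo[['embarked']] = cat_imputer.fit_transform(demo[['embarked']])\n",
   "\n",
   "print(demo.isnull().sum())"
  ]
 },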
 376  {
 377   "cell_type": "markdown",
 378   "metadata": {},
 379   "source": [
 380    "### 4.3 ํŠน์„ฑ ์—”์ง€๋‹ˆ์–ด๋ง\n",
 381    "\n",
 382    "๋„๋ฉ”์ธ ์ง€์‹์„ ํ™œ์šฉํ•˜์—ฌ ์ƒˆ๋กœ์šด ํŠน์„ฑ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค."
 383   ]
 384  },
 385  {
 386   "cell_type": "code",
 387   "execution_count": null,
 388   "metadata": {},
 389   "outputs": [],
 390   "source": [
 391    "# 1. ๊ฐ€์กฑ ํฌ๊ธฐ\n",
 392    "df_clean['family_size'] = df_clean['sibsp'] + df_clean['parch'] + 1\n",
 393    "print(\"๊ฐ€์กฑ ํฌ๊ธฐ ํŠน์„ฑ ์ƒ์„ฑ: sibsp + parch + 1\")\n",
 394    "\n",
 395    "# 2. ํ˜ผ์ž ์—ฌํ–‰ ์—ฌ๋ถ€\n",
 396    "df_clean['is_alone'] = (df_clean['family_size'] == 1).astype(int)\n",
 397    "print(\"ํ˜ผ์ž ์—ฌํ–‰ ์—ฌ๋ถ€ ํŠน์„ฑ ์ƒ์„ฑ\")\n",
 398    "\n",
 399    "# 3. ๋‚˜์ด ๊ทธ๋ฃน\n",
 400    "df_clean['age_group'] = pd.cut(df_clean['age'],\n",
 401    "                                bins=[0, 12, 18, 35, 60, 100],\n",
 402    "                                labels=['Child', 'Teen', 'Young', 'Middle', 'Senior'])\n",
 403    "print(\"๋‚˜์ด ๊ทธ๋ฃน ํŠน์„ฑ ์ƒ์„ฑ\")\n",
 404    "\n",
 405    "# 4. ์š”๊ธˆ ๊ตฌ๊ฐ„\n",
 406    "df_clean['fare_bin'] = pd.qcut(df_clean['fare'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])\n",
 407    "print(\"์š”๊ธˆ ๊ตฌ๊ฐ„ ํŠน์„ฑ ์ƒ์„ฑ\")\n",
 408    "\n",
  409    "# 5. ํ˜ธ์นญ ์ถ”์ถœ (์„ ํƒ์ : name ์ปฌ๋Ÿผ์€ Kaggle train.csv์—๋งŒ ์žˆ๊ณ  seaborn ๋ฐ์ดํ„ฐ์…‹์—๋Š” ์—†์Œ)\n",
  410    "# df_clean['title'] = df_clean['name'].str.extract(r' ([A-Za-z]+)\\.', expand=False)\n",
 411    "\n",
 412    "print(f\"\\nํŠน์„ฑ ์—”์ง€๋‹ˆ์–ด๋ง ํ›„ ํ˜•์ƒ: {df_clean.shape}\")"
 413   ]
 414  },
 415  {
 416   "cell_type": "code",
 417   "execution_count": null,
 418   "metadata": {},
 419   "outputs": [],
 420   "source": [
 421    "# ์ƒˆ๋กœ์šด ํŠน์„ฑ๊ณผ ์ƒ์กด์˜ ๊ด€๊ณ„ ํ™•์ธ\n",
 422    "fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n",
 423    "\n",
 424    "sns.countplot(data=df_clean, x='family_size', hue='survived', ax=axes[0])\n",
 425    "axes[0].set_title('Survival by Family Size')\n",
 426    "\n",
 427    "sns.countplot(data=df_clean, x='age_group', hue='survived', ax=axes[1])\n",
 428    "axes[1].set_title('Survival by Age Group')\n",
 429    "axes[1].tick_params(axis='x', rotation=45)\n",
 430    "\n",
 431    "sns.countplot(data=df_clean, x='fare_bin', hue='survived', ax=axes[2])\n",
 432    "axes[2].set_title('Survival by Fare Bin')\n",
 433    "axes[2].tick_params(axis='x', rotation=45)\n",
 434    "\n",
 435    "plt.tight_layout()\n",
 436    "plt.show()"
 437   ]
 438  },
 439  {
 440   "cell_type": "markdown",
 441   "metadata": {},
 442   "source": [
 443    "### 4.4 ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜ ์ธ์ฝ”๋”ฉ"
 444   ]
 445  },
 446  {
 447   "cell_type": "code",
 448   "execution_count": null,
 449   "metadata": {},
 450   "outputs": [],
 451   "source": [
  452    "# LabelEncoder ์‚ฌ์šฉ (ํŠน์„ฑ ์ธ์ฝ”๋”ฉ์—๋Š” OrdinalEncoder๋‚˜ ์›-ํ•ซ ์ธ์ฝ”๋”ฉ๋„ ๊ณ ๋ ค)\n",
 453    "le = LabelEncoder()\n",
 454    "\n",
 455    "df_clean['sex'] = le.fit_transform(df_clean['sex'])\n",
 456    "df_clean['embarked'] = le.fit_transform(df_clean['embarked'])\n",
 457    "df_clean['age_group'] = le.fit_transform(df_clean['age_group'])\n",
 458    "df_clean['fare_bin'] = le.fit_transform(df_clean['fare_bin'])\n",
 459    "\n",
 460    "print(\"๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜ ์ธ์ฝ”๋”ฉ ์™„๋ฃŒ\")\n",
 461    "print(f\"\\n์ธ์ฝ”๋”ฉ ํ›„ ๋ฐ์ดํ„ฐ ํƒ€์ž…:\")\n",
 462    "print(df_clean.dtypes)"
 463   ]
 464  },
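 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "์ฐธ๊ณ : `LabelEncoder`๋Š” ๋ฒ”์ฃผ์— ์ˆœ์„œ๊ฐ€ ์žˆ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ์ •์ˆ˜๋ฅผ ๋ถ€์—ฌํ•˜๋ฏ€๋กœ, ์„ ํ˜• ๋ชจ๋ธ์—์„œ๋Š” ์›-ํ•ซ ์ธ์ฝ”๋”ฉ์ด ๋” ์•ˆ์ „ํ•œ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์›๋ณธ `df`์˜ ๋ฒ”์ฃผํ˜• ์ปฌ๋Ÿผ์— `pd.get_dummies`๋ฅผ ์ ์šฉํ•ด ๋ณด๋Š” ์˜ˆ์‹œ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค (๋ณธ๋ฌธ์˜ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋ง์—๋Š” ์œ„์˜ ์ •์ˆ˜ ์ธ์ฝ”๋”ฉ์„ ๊ทธ๋Œ€๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค)."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# ๋Œ€์•ˆ: pd.get_dummies๋กœ ์›-ํ•ซ ์ธ์ฝ”๋”ฉ (์›๋ณธ df ๊ธฐ์ค€ ์˜ˆ์‹œ)\n",
   "onehot_demo = pd.get_dummies(df[['sex', 'embarked']], columns=['sex', 'embarked'])\n",
   "print(onehot_demo.head())"
  ]
 },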
 465  {
 466   "cell_type": "markdown",
 467   "metadata": {},
 468   "source": [
 469    "### 4.5 ์ตœ์ข… ํŠน์„ฑ ์„ ํƒ"
 470   ]
 471  },
 472  {
 473   "cell_type": "code",
 474   "execution_count": null,
 475   "metadata": {},
 476   "outputs": [],
 477   "source": [
 478    "# ๋ชจ๋ธ๋ง์— ์‚ฌ์šฉํ•  ํŠน์„ฑ ์„ ํƒ\n",
 479    "features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',\n",
 480    "            'embarked', 'family_size', 'is_alone', 'age_group', 'fare_bin']\n",
 481    "\n",
 482    "X = df_clean[features]\n",
 483    "y = df_clean['survived']\n",
 484    "\n",
 485    "print(f\"์ตœ์ข… ํŠน์„ฑ: {features}\")\n",
 486    "print(f\"X ํ˜•์ƒ: {X.shape}\")\n",
 487    "print(f\"y ๋ถ„ํฌ: {y.value_counts().to_dict()}\")"
 488   ]
 489  },
 490  {
 491   "cell_type": "markdown",
 492   "metadata": {},
 493   "source": [
 494    "## 5. ๋ชจ๋ธ๋ง\n",
 495    "\n",
 496    "### 5.1 ๋ฐ์ดํ„ฐ ๋ถ„ํ• "
 497   ]
 498  },
 499  {
 500   "cell_type": "code",
 501   "execution_count": null,
 502   "metadata": {},
 503   "outputs": [],
 504   "source": [
 505    "# Train/Test ๋ถ„ํ•  (Stratified)\n",
 506    "X_train, X_test, y_train, y_test = train_test_split(\n",
 507    "    X, y, test_size=0.2, random_state=42, stratify=y\n",
 508    ")\n",
 509    "\n",
 510    "print(f\"ํ•™์Šต ๋ฐ์ดํ„ฐ: {X_train.shape}\")\n",
 511    "print(f\"ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ: {X_test.shape}\")\n",
 512    "print(f\"\\nํ•™์Šต ๋ฐ์ดํ„ฐ ํƒ€๊ฒŸ ๋ถ„ํฌ: {y_train.value_counts().to_dict()}\")\n",
 513    "print(f\"ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ํƒ€๊ฒŸ ๋ถ„ํฌ: {y_test.value_counts().to_dict()}\")"
 514   ]
 515  },
 516  {
 517   "cell_type": "code",
 518   "execution_count": null,
 519   "metadata": {},
 520   "outputs": [],
 521   "source": [
 522    "# ์Šค์ผ€์ผ๋ง (์„ ํ˜• ๋ชจ๋ธ์šฉ)\n",
 523    "scaler = StandardScaler()\n",
 524    "X_train_scaled = scaler.fit_transform(X_train)\n",
 525    "X_test_scaled = scaler.transform(X_test)\n",
 526    "\n",
 527    "print(\"์Šค์ผ€์ผ๋ง ์™„๋ฃŒ\")"
 528   ]
 529  },
 530  {
 531   "cell_type": "markdown",
 532   "metadata": {},
 533   "source": [
 534    "### 5.2 Baseline ๋ชจ๋ธ\n",
 535    "\n",
 536    "๊ฐ„๋‹จํ•œ ๋ชจ๋ธ๋กœ ๊ธฐ์ค€์„ ์„ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค."
 537   ]
 538  },
 539  {
 540   "cell_type": "code",
 541   "execution_count": null,
 542   "metadata": {},
 543   "outputs": [],
 544   "source": [
 545    "# ๊ธฐ์ค€์„ : ํ•ญ์ƒ ๋‹ค์ˆ˜ ํด๋ž˜์Šค ์˜ˆ์ธก\n",
 546    "baseline_pred = np.zeros(len(y_test))  # ๋ชจ๋‘ 0 (์‚ฌ๋ง) ์˜ˆ์ธก\n",
 547    "baseline_acc = accuracy_score(y_test, baseline_pred)\n",
 548    "\n",
 549    "print(f\"Baseline ์ •ํ™•๋„ (ํ•ญ์ƒ ์‚ฌ๋ง ์˜ˆ์ธก): {baseline_acc:.4f}\")\n",
 550    "print(\"\\n์ด ๊ฐ’๋ณด๋‹ค ๋†’์€ ์„ฑ๋Šฅ์„ ๋ชฉํ‘œ๋กœ ํ•ฉ๋‹ˆ๋‹ค.\")"
 551   ]
 552  },
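 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "๊ฐ™์€ ๊ธฐ์ค€์„ ์„ scikit-learn์˜ `DummyClassifier`๋กœ๋„ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. `strategy='most_frequent'`๋Š” ํ•ญ์ƒ ๋‹ค์ˆ˜ ํด๋ž˜์Šค๋ฅผ ์˜ˆ์ธกํ•˜๋ฏ€๋กœ ์œ„์˜ ์ˆ˜๋™ ๊ธฐ์ค€์„ ๊ณผ ๊ฐ™์€ ๊ฐ’์ด ๋‚˜์™€์•ผ ํ•ฉ๋‹ˆ๋‹ค."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "from sklearn.dummy import DummyClassifier\n",
   "\n",
   "# ๋‹ค์ˆ˜ ํด๋ž˜์Šค๋ฅผ ํ•ญ์ƒ ์˜ˆ์ธกํ•˜๋Š” ๊ธฐ์ค€์„  ๋ชจ๋ธ\n",
   "dummy = DummyClassifier(strategy='most_frequent')\n",
   "dummy.fit(X_train, y_train)\n",
   "print(f\"DummyClassifier ์ •ํ™•๋„: {dummy.score(X_test, y_test):.4f}\")"
  ]
 },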
 553  {
 554   "cell_type": "markdown",
 555   "metadata": {},
 556   "source": [
 557    "### 5.3 ์—ฌ๋Ÿฌ ๋ชจ๋ธ ๋น„๊ต"
 558   ]
 559  },
 560  {
 561   "cell_type": "code",
 562   "execution_count": null,
 563   "metadata": {},
 564   "outputs": [],
 565   "source": [
 566    "# ๋‹ค์–‘ํ•œ ๋ชจ๋ธ ์ •์˜\n",
 567    "models = {\n",
 568    "    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),\n",
 569    "    'Decision Tree': DecisionTreeClassifier(random_state=42),\n",
 570    "    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),\n",
 571    "    'Gradient Boosting': GradientBoostingClassifier(random_state=42),\n",
 572    "    'SVM': SVC(random_state=42)\n",
 573    "}\n",
 574    "\n",
 575    "# ๋ชจ๋ธ ๋น„๊ต\n",
 576    "print(\"=== ๋ชจ๋ธ ๋น„๊ต (5-Fold Cross Validation) ===\")\n",
 577    "results = []\n",
 578    "\n",
 579    "for name, model in models.items():\n",
 580    "    # ์„ ํ˜• ๋ชจ๋ธ์€ ์Šค์ผ€์ผ๋ง๋œ ๋ฐ์ดํ„ฐ ์‚ฌ์šฉ\n",
 581    "    if name in ['Logistic Regression', 'SVM']:\n",
 582    "        X_tr, X_te = X_train_scaled, X_test_scaled\n",
 583    "    else:\n",
 584    "        X_tr, X_te = X_train, X_test\n",
 585    "    \n",
 586    "    # ๊ต์ฐจ ๊ฒ€์ฆ\n",
 587    "    cv_scores = cross_val_score(model, X_tr, y_train, cv=5, scoring='accuracy')\n",
 588    "    \n",
 589    "    # ํ•™์Šต ๋ฐ ํ…Œ์ŠคํŠธ\n",
 590    "    model.fit(X_tr, y_train)\n",
 591    "    test_score = model.score(X_te, y_test)\n",
 592    "    \n",
 593    "    results.append({\n",
 594    "        'Model': name,\n",
 595    "        'CV Mean': cv_scores.mean(),\n",
 596    "        'CV Std': cv_scores.std(),\n",
 597    "        'Test Score': test_score\n",
 598    "    })\n",
 599    "    \n",
 600    "    print(f\"{name}:\")\n",
 601    "    print(f\"  CV = {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})\")\n",
 602    "    print(f\"  Test = {test_score:.4f}\")\n",
 603    "    print()\n",
 604    "\n",
 605    "results_df = pd.DataFrame(results)\n",
 606    "results_df = results_df.sort_values(by='CV Mean', ascending=False)\n",
 607    "print(\"\\n๋ชจ๋ธ ์ˆœ์œ„:\")\n",
 608    "print(results_df)"
 609   ]
 610  },
 611  {
 612   "cell_type": "code",
 613   "execution_count": null,
 614   "metadata": {},
 615   "outputs": [],
 616   "source": [
 617    "# ๊ฒฐ๊ณผ ์‹œ๊ฐํ™”\n",
 618    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
 619    "\n",
 620    "# CV ์ ์ˆ˜\n",
 621    "axes[0].barh(results_df['Model'], results_df['CV Mean'])\n",
 622    "axes[0].set_xlabel('CV Accuracy')\n",
 623    "axes[0].set_title('Cross-Validation Scores')\n",
 624    "axes[0].set_xlim(0.7, 0.9)\n",
 625    "\n",
 626    "# Test ์ ์ˆ˜\n",
 627    "axes[1].barh(results_df['Model'], results_df['Test Score'])\n",
 628    "axes[1].set_xlabel('Test Accuracy')\n",
 629    "axes[1].set_title('Test Scores')\n",
 630    "axes[1].set_xlim(0.7, 0.9)\n",
 631    "\n",
 632    "plt.tight_layout()\n",
 633    "plt.show()"
 634   ]
 635  },
 636  {
 637   "cell_type": "markdown",
 638   "metadata": {},
 639   "source": [
 640    "### 5.4 ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹\n",
 641    "\n",
 642    "์ตœ๊ณ  ์„ฑ๋Šฅ ๋ชจ๋ธ์— ๋Œ€ํ•ด ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํŠœ๋‹ํ•ฉ๋‹ˆ๋‹ค."
 643   ]
 644  },
 645  {
 646   "cell_type": "code",
 647   "execution_count": null,
 648   "metadata": {},
 649   "outputs": [],
 650   "source": [
 651    "# Random Forest ํŠœ๋‹\n",
 652    "rf_param_grid = {\n",
 653    "    'n_estimators': [100, 200, 300],\n",
 654    "    'max_depth': [5, 10, 15, None],\n",
 655    "    'min_samples_split': [2, 5, 10],\n",
 656    "    'min_samples_leaf': [1, 2, 4],\n",
 657    "    'max_features': ['sqrt', 'log2']\n",
 658    "}\n",
 659    "\n",
 660    "rf = RandomForestClassifier(random_state=42)\n",
 661    "grid_search = GridSearchCV(\n",
 662    "    rf, rf_param_grid, \n",
 663    "    cv=5, \n",
 664    "    scoring='accuracy', \n",
 665    "    n_jobs=-1, \n",
 666    "    verbose=1\n",
 667    ")\n",
 668    "\n",
 669    "print(\"Grid Search ์‹œ์ž‘...\")\n",
 670    "grid_search.fit(X_train, y_train)\n",
 671    "\n",
 672    "print(\"\\n=== ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹ ๊ฒฐ๊ณผ ===\")\n",
 673    "print(f\"์ตœ์  ํŒŒ๋ผ๋ฏธํ„ฐ: {grid_search.best_params_}\")\n",
 674    "print(f\"์ตœ์  CV ์ ์ˆ˜: {grid_search.best_score_:.4f}\")\n",
 675    "print(f\"ํ…Œ์ŠคํŠธ ์ ์ˆ˜: {grid_search.score(X_test, y_test):.4f}\")\n",
 676    "\n",
 677    "best_model = grid_search.best_estimator_"
 678   ]
 679  },
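 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "ํƒ์ƒ‰ ๊ณต๊ฐ„์ด ํด ๋•Œ๋Š” `RandomizedSearchCV`๋กœ ์ผ๋ถ€ ์กฐํ•ฉ๋งŒ ์ƒ˜ํ”Œ๋งํ•˜์—ฌ ๊ณ„์‚ฐ ๋น„์šฉ์„ ์ค„์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ๊ฐ™์€ ๊ทธ๋ฆฌ๋“œ๋ฅผ ์žฌ์‚ฌ์šฉํ•˜๋Š” ์Šค์ผ€์น˜์ด๋ฉฐ, `n_iter=20`์€ ์˜ˆ์‹œ๋กœ ๊ณ ๋ฅธ ๊ฐ’์ž…๋‹ˆ๋‹ค."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "from sklearn.model_selection import RandomizedSearchCV\n",
   "\n",
   "# ์ „์ฒด ๊ทธ๋ฆฌ๋“œ ๋Œ€์‹  20๊ฐœ ์กฐํ•ฉ๋งŒ ๋ฌด์ž‘์œ„ ์ƒ˜ํ”Œ๋ง\n",
   "random_search = RandomizedSearchCV(\n",
   "    RandomForestClassifier(random_state=42),\n",
   "    param_distributions=rf_param_grid,\n",
   "    n_iter=20, cv=5, scoring='accuracy', n_jobs=-1, random_state=42\n",
   ")\n",
   "random_search.fit(X_train, y_train)\n",
   "\n",
   "print(f\"์ตœ์  ํŒŒ๋ผ๋ฏธํ„ฐ: {random_search.best_params_}\")\n",
   "print(f\"์ตœ์  CV ์ ์ˆ˜: {random_search.best_score_:.4f}\")"
  ]
 },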
 680  {
 681   "cell_type": "markdown",
 682   "metadata": {},
 683   "source": [
 684    "## 6. ๋ชจ๋ธ ํ‰๊ฐ€\n",
 685    "\n",
 686    "### 6.1 ๋ถ„๋ฅ˜ ์„ฑ๋Šฅ ์ง€ํ‘œ"
 687   ]
 688  },
 689  {
 690   "cell_type": "code",
 691   "execution_count": null,
 692   "metadata": {},
 693   "outputs": [],
 694   "source": [
 695    "# ์˜ˆ์ธก\n",
 696    "y_pred = best_model.predict(X_test)\n",
 697    "y_pred_proba = best_model.predict_proba(X_test)[:, 1]\n",
 698    "\n",
 699    "# ๋ถ„๋ฅ˜ ๋ฆฌํฌํŠธ\n",
 700    "print(\"=== ๋ถ„๋ฅ˜ ๋ฆฌํฌํŠธ ===\")\n",
 701    "print(classification_report(y_test, y_pred, target_names=['Not Survived', 'Survived']))\n",
 702    "\n",
 703    "# ROC AUC\n",
 704    "roc_auc = roc_auc_score(y_test, y_pred_proba)\n",
 705    "print(f\"\\nROC AUC Score: {roc_auc:.4f}\")"
 706   ]
 707  },
 708  {
 709   "cell_type": "markdown",
 710   "metadata": {},
 711   "source": [
 712    "### 6.2 ํ˜ผ๋™ ํ–‰๋ ฌ"
 713   ]
 714  },
 715  {
 716   "cell_type": "code",
 717   "execution_count": null,
 718   "metadata": {},
 719   "outputs": [],
 720   "source": [
 721    "# ํ˜ผ๋™ ํ–‰๋ ฌ ์‹œ๊ฐํ™”\n",
 722    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
 723    "\n",
 724    "# ํ˜ผ๋™ ํ–‰๋ ฌ\n",
 725    "cm = confusion_matrix(y_test, y_pred)\n",
 726    "sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',\n",
 727    "            xticklabels=['Not Survived', 'Survived'],\n",
 728    "            yticklabels=['Not Survived', 'Survived'],\n",
 729    "            ax=axes[0])\n",
 730    "axes[0].set_xlabel('Predicted')\n",
 731    "axes[0].set_ylabel('Actual')\n",
 732    "axes[0].set_title('Confusion Matrix')\n",
 733    "\n",
 734    "# ROC Curve\n",
 735    "fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)\n",
 736    "axes[1].plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc:.4f})')\n",
 737    "axes[1].plot([0, 1], [0, 1], 'k--', label='Random')\n",
 738    "axes[1].set_xlabel('False Positive Rate')\n",
 739    "axes[1].set_ylabel('True Positive Rate')\n",
 740    "axes[1].set_title('ROC Curve')\n",
 741    "axes[1].legend()\n",
 742    "axes[1].grid(True)\n",
 743    "\n",
 744    "plt.tight_layout()\n",
 745    "plt.show()"
 746   ]
 747  },
 748  {
 749   "cell_type": "markdown",
 750   "metadata": {},
 751   "source": [
 752    "### 6.3 ํŠน์„ฑ ์ค‘์š”๋„"
 753   ]
 754  },
 755  {
 756   "cell_type": "code",
 757   "execution_count": null,
 758   "metadata": {},
 759   "outputs": [],
 760   "source": [
 761    "# ํŠน์„ฑ ์ค‘์š”๋„\n",
 762    "importances = best_model.feature_importances_\n",
 763    "indices = np.argsort(importances)[::-1]\n",
 764    "\n",
 765    "plt.figure(figsize=(12, 6))\n",
 766    "plt.bar(range(len(importances)), importances[indices])\n",
 767    "plt.xticks(range(len(importances)), [features[i] for i in indices], rotation=45)\n",
 768    "plt.xlabel('Feature')\n",
 769    "plt.ylabel('Importance')\n",
 770    "plt.title('Feature Importance')\n",
 771    "plt.tight_layout()\n",
 772    "plt.show()\n",
 773    "\n",
 774    "print(\"\\nํŠน์„ฑ ์ค‘์š”๋„ ์ˆœ์œ„:\")\n",
 775    "for i in indices:\n",
 776    "    print(f\"  {features[i]:15s}: {importances[i]:.4f}\")"
 777   ]
 778  },
 779  {
 780   "cell_type": "markdown",
 781   "metadata": {},
 782   "source": [
 783    "### 6.4 ์˜ค๋ฅ˜ ๋ถ„์„"
 784   ]
 785  },
 786  {
 787   "cell_type": "code",
 788   "execution_count": null,
 789   "metadata": {},
 790   "outputs": [],
 791   "source": [
 792    "# ์ž˜๋ชป ์˜ˆ์ธก๋œ ์ผ€์ด์Šค ๋ถ„์„\n",
 793    "X_test_df = X_test.copy()\n",
 794    "X_test_df['actual'] = y_test.values\n",
 795    "X_test_df['predicted'] = y_pred\n",
 796    "X_test_df['correct'] = X_test_df['actual'] == X_test_df['predicted']\n",
 797    "\n",
 798    "print(\"=== ์˜ˆ์ธก ๊ฒฐ๊ณผ ===\")\n",
 799    "print(f\"์ •ํ™•ํžˆ ์˜ˆ์ธก: {X_test_df['correct'].sum()} / {len(X_test_df)}\")\n",
 800    "print(f\"์ž˜๋ชป ์˜ˆ์ธก: {(~X_test_df['correct']).sum()} / {len(X_test_df)}\")\n",
 801    "\n",
 802    "# False Positive์™€ False Negative\n",
 803    "fp = X_test_df[(X_test_df['actual'] == 0) & (X_test_df['predicted'] == 1)]\n",
 804    "fn = X_test_df[(X_test_df['actual'] == 1) & (X_test_df['predicted'] == 0)]\n",
 805    "\n",
 806    "print(f\"\\nFalse Positive (์‹ค์ œ ์‚ฌ๋ง, ์˜ˆ์ธก ์ƒ์กด): {len(fp)}\")\n",
 807    "print(f\"False Negative (์‹ค์ œ ์ƒ์กด, ์˜ˆ์ธก ์‚ฌ๋ง): {len(fn)}\")\n",
 808    "\n",
 809    "print(\"\\nFalse Negative ์ƒ˜ํ”Œ (์ฒ˜์Œ 5๊ฐœ):\")\n",
 810    "print(fn.head())"
 811   ]
 812  },
 813  {
 814   "cell_type": "markdown",
 815   "metadata": {},
 816   "source": [
 817    "## 7. Kaggle ๊ฒฝ์ง„๋Œ€ํšŒ ์ „๋žต\n",
 818    "\n",
 819    "### 7.1 ์•™์ƒ๋ธ” ๊ธฐ๋ฒ•"
 820   ]
 821  },
 822  {
 823   "cell_type": "code",
 824   "execution_count": null,
 825   "metadata": {},
 826   "outputs": [],
 827   "source": [
 828    "# ์—ฌ๋Ÿฌ ๋ชจ๋ธ์˜ ์˜ˆ์ธก์„ ๊ฒฐํ•ฉ\n",
 829    "def simple_blend(models, X_train, y_train, X_test, weights=None):\n",
 830    "    \"\"\"๊ฐ„๋‹จํ•œ ๋ธ”๋ Œ๋”ฉ ์•™์ƒ๋ธ”\"\"\"\n",
 831    "    if weights is None:\n",
 832    "        weights = [1/len(models)] * len(models)\n",
 833    "    \n",
 834    "    predictions = np.zeros(len(X_test))\n",
 835    "    \n",
 836    "    for model, weight in zip(models, weights):\n",
 837    "        model.fit(X_train, y_train)\n",
 838    "        pred_proba = model.predict_proba(X_test)[:, 1]\n",
 839    "        predictions += weight * pred_proba\n",
 840    "    \n",
 841    "    return (predictions > 0.5).astype(int)\n",
 842    "\n",
 843    "\n",
 844    "# ์•™์ƒ๋ธ” ๋ชจ๋ธ\n",
 845    "ensemble_models = [\n",
 846    "    RandomForestClassifier(n_estimators=200, random_state=42),\n",
 847    "    GradientBoostingClassifier(n_estimators=100, random_state=42),\n",
 848    "    LogisticRegression(max_iter=1000, random_state=42)\n",
 849    "]\n",
 850    "\n",
  851    "# LogisticRegression์€ ์Šค์ผ€์ผ๋ง๋œ ๋ฐ์ดํ„ฐ๊ฐ€ ํ•„์š”ํ•˜๋ฏ€๋กœ ์—ฌ๊ธฐ์„œ๋Š” ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋‘ ๋ชจ๋ธ๋งŒ ๋ธ”๋ Œ๋”ฉ\n",
 852    "y_pred_ensemble = simple_blend(\n",
 853    "    [ensemble_models[0], ensemble_models[1]], \n",
 854    "    X_train, y_train, X_test\n",
 855    ")\n",
 856    "\n",
 857    "# ํ‰๊ฐ€\n",
 858    "ensemble_acc = accuracy_score(y_test, y_pred_ensemble)\n",
 859    "print(f\"์•™์ƒ๋ธ” ์ •ํ™•๋„: {ensemble_acc:.4f}\")\n",
 860    "print(f\"์ตœ๊ณ  ๋‹จ์ผ ๋ชจ๋ธ ์ •ํ™•๋„: {best_model.score(X_test, y_test):.4f}\")\n",
 861    "print(f\"ํ–ฅ์ƒ: {(ensemble_acc - best_model.score(X_test, y_test)):.4f}\")"
 862   ]
 863  },
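 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "์ˆ˜์ž‘์—… ๋ธ”๋ Œ๋”ฉ ๋Œ€์‹  scikit-learn์˜ `VotingClassifier`๋ฅผ ์“ฐ๋ฉด ๊ฐ™์€ ์•„์ด๋””์–ด๋ฅผ ํ•œ ๊ฐ์ฒด๋กœ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์œ„์™€ ๊ฐ™์€ ๋‘ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์„ soft voting(์˜ˆ์ธก ํ™•๋ฅ  ํ‰๊ท )์œผ๋กœ ๊ฒฐํ•ฉํ•˜๋Š” ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "from sklearn.ensemble import VotingClassifier\n",
   "\n",
   "# soft voting: ๊ฐ ๋ชจ๋ธ์˜ ์˜ˆ์ธก ํ™•๋ฅ ์„ ํ‰๊ท ํ•˜์—ฌ ์ตœ์ข… ํด๋ž˜์Šค ๊ฒฐ์ •\n",
   "voting = VotingClassifier(\n",
   "    estimators=[\n",
   "        ('rf', RandomForestClassifier(n_estimators=200, random_state=42)),\n",
   "        ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42))\n",
   "    ],\n",
   "    voting='soft'\n",
   ")\n",
   "voting.fit(X_train, y_train)\n",
   "print(f\"VotingClassifier ์ •ํ™•๋„: {voting.score(X_test, y_test):.4f}\")"
  ]
 },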
 864  {
 865   "cell_type": "markdown",
 866   "metadata": {},
 867   "source": [
 868    "### 7.2 Kaggle ์ œ์ถœ ํŒŒ์ผ ํ˜•์‹"
 869   ]
 870  },
 871  {
 872   "cell_type": "code",
 873   "execution_count": null,
 874   "metadata": {},
 875   "outputs": [],
 876   "source": [
 877    "# Kaggle ์ œ์ถœ์šฉ ์˜ˆ์ธก ์ƒ์„ฑ (์‹ค์ œ Kaggle์—์„œ๋Š” test.csv ์‚ฌ์šฉ)\n",
 878    "# ์—ฌ๊ธฐ์„œ๋Š” ์˜ˆ์‹œ๋กœ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์‚ฌ์šฉ\n",
 879    "\n",
 880    "submission = pd.DataFrame({\n",
 881    "    'PassengerId': range(1, len(y_pred) + 1),  # ์‹ค์ œ๋กœ๋Š” test.csv์˜ PassengerId ์‚ฌ์šฉ\n",
 882    "    'Survived': y_pred\n",
 883    "})\n",
 884    "\n",
 885    "print(\"์ œ์ถœ ํŒŒ์ผ ํ˜•์‹:\")\n",
 886    "print(submission.head(10))\n",
 887    "\n",
 888    "# CSV๋กœ ์ €์žฅ\n",
 889    "# submission.to_csv('titanic_submission.csv', index=False)\n",
 890    "# print(\"\\nsubmission.csv ์ €์žฅ ์™„๋ฃŒ\")"
 891   ]
 892  },
 893  {
 894   "cell_type": "markdown",
 895   "metadata": {},
 896   "source": [
 897    "## 8. Kaggle ํ•„์ˆ˜ ํŒ\n",
 898    "\n",
 899    "### 8.1 ๊ฒฝ์ง„๋Œ€ํšŒ ์ฒดํฌ๋ฆฌ์ŠคํŠธ\n",
 900    "\n",
 901    "**1. ๋น ๋ฅธ ์‹œ์ž‘**\n",
 902    "- Baseline ์ฝ”๋“œ ์‹คํ–‰ํ•˜์—ฌ ์ฒซ ์ œ์ถœ\n",
 903    "- ๋ฆฌ๋”๋ณด๋“œ ์œ„์น˜ ํ™•์ธ\n",
 904    "\n",
 905    "**2. EDA ์ง‘์ค‘**\n",
 906    "- ๋ฐ์ดํ„ฐ ์ดํ•ด๊ฐ€ ํ•ต์‹ฌ\n",
 907    "- ๊ฒฐ์ธก์น˜, ์ด์ƒ์น˜, ๋ถ„ํฌ ํŒŒ์•…\n",
 908    "- ํƒ€๊ฒŸ๊ณผ์˜ ๊ด€๊ณ„ ๋ถ„์„\n",
 909    "\n",
 910    "**3. ํŠน์„ฑ ์—”์ง€๋‹ˆ์–ด๋ง**\n",
 911    "- ๋„๋ฉ”์ธ ์ง€์‹ ํ™œ์šฉ\n",
 912    "- ๊ต์ฐจ ํŠน์„ฑ ์ƒ์„ฑ (์˜ˆ: family_size)\n",
 913    "- ๊ทธ๋ฃน๋ณ„ ํ†ต๊ณ„๋Ÿ‰ (์˜ˆ: ๊ทธ๋ฃน๋ณ„ ํ‰๊ท )\n",
 914    "\n",
 915    "**4. ๋‹ค์–‘ํ•œ ๋ชจ๋ธ ์‹œ๋„**\n",
 916    "- ์„ ํ˜• ๋ชจ๋ธ โ†’ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ โ†’ ์•™์ƒ๋ธ”\n",
 917    "- ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹\n",
 918    "\n",
 919    "**5. ์•™์ƒ๋ธ”**\n",
 920    "- ๋‹ค๋ฅธ ๋ชจ๋ธ ์˜ˆ์ธก ๊ฒฐํ•ฉ\n",
 921    "- ๋ธ”๋ Œ๋”ฉ, ์Šคํƒœํ‚น\n",
 922    "\n",
 923    "**6. ๊ฒ€์ฆ ์ „๋žต**\n",
 924    "- ๋กœ์ปฌ CV์™€ ๋ฆฌ๋”๋ณด๋“œ ์ ์ˆ˜ ์ผ์น˜ ํ™•์ธ\n",
 925    "- ๊ณผ์ ํ•ฉ ์ฃผ์˜ (Public LB์— ๋งž์ถ”์ง€ ๋ง ๊ฒƒ)\n",
 926    "\n",
 927    "### 8.2 ๊ต์ฐจ ๊ฒ€์ฆ ์ „๋žต"
 928   ]
 929  },
 930  {
 931   "cell_type": "code",
 932   "execution_count": null,
 933   "metadata": {},
 934   "outputs": [],
 935   "source": [
 936    "def cross_validate_model(model, X, y, n_splits=5, stratified=True):\n",
 937    "    \"\"\"\n",
 938    "    ๊ต์ฐจ ๊ฒ€์ฆ ์ˆ˜ํ–‰\n",
 939    "    \n",
 940    "    Parameters:\n",
 941    "    -----------\n",
 942    "    model : sklearn estimator\n",
 943    "    X : features\n",
 944    "    y : target\n",
 945    "    n_splits : ํด๋“œ ์ˆ˜\n",
 946    "    stratified : ๊ณ„์ธตํ™” ์—ฌ๋ถ€\n",
 947    "    \"\"\"\n",
 948    "    if stratified:\n",
 949    "        kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)\n",
 950    "    else:\n",
 951    "        kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)\n",
 952    "    \n",
 953    "    scores = []\n",
 954    "    \n",
 955    "    for fold, (train_idx, val_idx) in enumerate(kf.split(X, y)):\n",
 956    "        X_train_fold = X.iloc[train_idx]\n",
 957    "        X_val_fold = X.iloc[val_idx]\n",
 958    "        y_train_fold = y.iloc[train_idx]\n",
 959    "        y_val_fold = y.iloc[val_idx]\n",
 960    "        \n",
 961    "        model.fit(X_train_fold, y_train_fold)\n",
 962    "        score = model.score(X_val_fold, y_val_fold)\n",
 963    "        scores.append(score)\n",
 964    "        \n",
 965    "        print(f\"Fold {fold+1}: {score:.4f}\")\n",
 966    "    \n",
 967    "    print(f\"\\nMean: {np.mean(scores):.4f} (+/- {np.std(scores):.4f})\")\n",
 968    "    return np.mean(scores)\n",
 969    "\n",
 970    "\n",
 971    "# ์‚ฌ์šฉ ์˜ˆ์‹œ\n",
 972    "print(\"=== Random Forest ๊ต์ฐจ ๊ฒ€์ฆ ===\")\n",
 973    "cv_score = cross_validate_model(\n",
 974    "    RandomForestClassifier(n_estimators=100, random_state=42),\n",
 975    "    X, y, n_splits=5\n",
 976    ")"
 977   ]
 978  },
 979  {
 980   "cell_type": "markdown",
 981   "metadata": {},
 982   "source": [
 983    "## ์š”์•ฝ\n",
 984    "\n",
 985    "### ํ”„๋กœ์ ํŠธ ์›Œํฌํ”Œ๋กœ์šฐ\n",
 986    "\n",
 987    "1. **๋ฌธ์ œ ์ •์˜**: ๋ชฉํ‘œ์™€ ํ‰๊ฐ€ ์ง€ํ‘œ ์„ค์ •\n",
 988    "2. **๋ฐ์ดํ„ฐ ํƒ์ƒ‰**: EDA๋กœ ๋ฐ์ดํ„ฐ ์ดํ•ด\n",
 989    "3. **์ „์ฒ˜๋ฆฌ**: ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ, ์ธ์ฝ”๋”ฉ\n",
 990    "4. **ํŠน์„ฑ ์—”์ง€๋‹ˆ์–ด๋ง**: ๋„๋ฉ”์ธ ์ง€์‹ ํ™œ์šฉ\n",
 991    "5. **๋ชจ๋ธ๋ง**: ์—ฌ๋Ÿฌ ๋ชจ๋ธ ๋น„๊ต\n",
 992    "6. **ํŠœ๋‹**: ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์ตœ์ ํ™”\n",
 993    "7. **ํ‰๊ฐ€**: ๋‹ค์–‘ํ•œ ์ง€ํ‘œ๋กœ ์„ฑ๋Šฅ ํ‰๊ฐ€\n",
 994    "8. **์•™์ƒ๋ธ”**: ์—ฌ๋Ÿฌ ๋ชจ๋ธ ๊ฒฐํ•ฉ\n",
 995    "\n",
 996    "### ํ•ต์‹ฌ ํฌ์ธํŠธ\n",
 997    "\n",
 998    "- **EDA๊ฐ€ ๊ฐ€์žฅ ์ค‘์š”**: ๋ฐ์ดํ„ฐ ์ดํ•ด ์—†์ด๋Š” ์ข‹์€ ๋ชจ๋ธ์„ ๋งŒ๋“ค ์ˆ˜ ์—†์Œ\n",
 999    "- **ํŠน์„ฑ ์—”์ง€๋‹ˆ์–ด๋ง**: ๋ชจ๋ธ ์„ฑ๋Šฅ ํ–ฅ์ƒ์˜ ํ•ต์‹ฌ\n",
1000    "- **๊ต์ฐจ ๊ฒ€์ฆ**: ๊ณผ์ ํ•ฉ ๋ฐฉ์ง€์™€ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ ํ™•์ธ\n",
1001    "- **์•™์ƒ๋ธ”**: ๋‹ค์–‘ํ•œ ๋ชจ๋ธ ๊ฒฐํ•ฉ์œผ๋กœ ์„ฑ๋Šฅ ํ–ฅ์ƒ\n",
1002    "- **๋ฐ˜๋ณต ๊ฐœ์„ **: ํ•œ ๋ฒˆ์— ์™„๋ฒฝํ•œ ๋ชจ๋ธ์€ ์—†์Œ, ์ง€์†์  ๊ฐœ์„ ์ด ํ•„์š”"
1003   ]
1004  }
1005 ],
1006 "metadata": {
1007  "kernelspec": {
1008   "display_name": "Python 3",
1009   "language": "python",
1010   "name": "python3"
1011  },
1012  "language_info": {
1013   "codemirror_mode": {
1014    "name": "ipython",
1015    "version": 3
1016   },
1017   "file_extension": ".py",
1018   "mimetype": "text/x-python",
1019   "name": "python",
1020   "nbconvert_exporter": "python",
1021   "pygments_lexer": "ipython3",
1022   "version": "3.8.0"
1023  }
1024 },
1025 "nbformat": 4,
1026 "nbformat_minor": 4
1027}