14_kaggle_project.ipynb

   1{
   2 "cells": [
   3  {
   4   "cell_type": "markdown",
   5   "metadata": {},
   6   "source": [
   7    "# ์‹ค์ „ ํ”„๋กœ์ ํŠธ: ํƒ€์ดํƒ€๋‹‰ ์ƒ์กด ์˜ˆ์ธก (Kaggle ์Šคํƒ€์ผ)\n",
   8    "\n",
   9    "์‹ค์ œ ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ•˜์—ฌ ๋จธ์‹ ๋Ÿฌ๋‹ ํ”„๋กœ์ ํŠธ๋ฅผ ์ฒ˜์Œ๋ถ€ํ„ฐ ๋๊นŒ์ง€ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. Kaggle ๊ฒฝ์ง„๋Œ€ํšŒ ๋ฐฉ์‹์œผ๋กœ ์ ‘๊ทผํ•˜์—ฌ ์‹ค๋ฌด ๋…ธํ•˜์šฐ๋ฅผ ์ตํž™๋‹ˆ๋‹ค.\n",
  10    "\n",
  11    "**ํ•™์Šต ๋ชฉํ‘œ:**\n",
  12    "- ์™„์ „ํ•œ ML ์›Œํฌํ”Œ๋กœ์šฐ ๊ฒฝํ—˜\n",
  13    "- ํƒ์ƒ‰์  ๋ฐ์ดํ„ฐ ๋ถ„์„ (EDA) ์ˆ˜ํ–‰\n",
  14    "- ํŠน์„ฑ ์—”์ง€๋‹ˆ์–ด๋ง ๊ธฐ๋ฒ• ์ ์šฉ\n",
  15    "- ์—ฌ๋Ÿฌ ๋ชจ๋ธ ๋น„๊ต ๋ฐ ์„ ํƒ\n",
  16    "- ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹\n",
  17    "- Kaggle ๊ฒฝ์ง„๋Œ€ํšŒ ์ „๋žต ์ดํ•ด"
  18   ]
  19  },
  20  {
  21   "cell_type": "code",
  22   "execution_count": null,
  23   "metadata": {},
  24   "outputs": [],
  25   "source": [
  26    "import numpy as np\n",
  27    "import pandas as pd\n",
  28    "import matplotlib.pyplot as plt\n",
  29    "import seaborn as sns\n",
  30    "\n",
  31    "from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold, StratifiedKFold\n",
  32    "from sklearn.preprocessing import StandardScaler, LabelEncoder\n",
  33    "from sklearn.impute import SimpleImputer\n",
  34    "from sklearn.metrics import (\n",
  35    "    accuracy_score, classification_report, confusion_matrix,\n",
  36    "    roc_auc_score, roc_curve\n",
  37    ")\n",
  38    "\n",
  39    "from sklearn.linear_model import LogisticRegression\n",
  40    "from sklearn.tree import DecisionTreeClassifier\n",
  41    "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n",
  42    "from sklearn.svm import SVC\n",
  43    "\n",
  44    "import warnings\n",
  45    "warnings.filterwarnings('ignore')\n",
  46    "\n",
  47    "# ์‹œ๊ฐํ™” ์„ค์ •\n",
  48    "plt.style.use('seaborn-v0_8-darkgrid')\n",
  49    "sns.set_palette('husl')"
  50   ]
  51  },
  52  {
  53   "cell_type": "markdown",
  54   "metadata": {},
  55   "source": [
  56    "## 1. ๋ฌธ์ œ ์ •์˜\n",
  57    "\n",
  58    "**๋ชฉํ‘œ**: ํƒ€์ดํƒ€๋‹‰ ์Šน๊ฐ์˜ ์ƒ์กด ์—ฌ๋ถ€๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ถ„๋ฅ˜ ๋ชจ๋ธ ๊ฐœ๋ฐœ\n",
  59    "\n",
  60    "**ํ‰๊ฐ€ ์ง€ํ‘œ**: Accuracy (์ •ํ™•๋„)\n",
  61    "\n",
  62    "**๋ฐ์ดํ„ฐ**: ์Šน๊ฐ์˜ ๋‚˜์ด, ์„ฑ๋ณ„, ๊ฐ์‹ค ๋“ฑ๊ธ‰, ์š”๊ธˆ ๋“ฑ์˜ ์ •๋ณด"
  63   ]
  64  },
  65  {
  66   "cell_type": "markdown",
  67   "metadata": {},
  68   "source": [
  69    "## 2. ๋ฐ์ดํ„ฐ ๋กœ๋“œ ๋ฐ ๊ธฐ๋ณธ ํƒ์ƒ‰"
  70   ]
  71  },
  72  {
  73   "cell_type": "code",
  74   "execution_count": null,
  75   "metadata": {},
  76   "outputs": [],
  77   "source": [
  78    "# seaborn ๋‚ด์žฅ ํƒ€์ดํƒ€๋‹‰ ๋ฐ์ดํ„ฐ์…‹ ์‚ฌ์šฉ\n",
  79    "# Kaggle์—์„œ๋Š” train.csv, test.csv๋ฅผ ๋‹ค์šด๋กœ๋“œํ•˜์—ฌ ์‚ฌ์šฉ\n",
  80    "df = sns.load_dataset('titanic')\n",
  81    "\n",
  82    "print(\"=== ๋ฐ์ดํ„ฐ ๊ธฐ๋ณธ ์ •๋ณด ===\")\n",
  83    "print(f\"๋ฐ์ดํ„ฐ ํ˜•์ƒ: {df.shape}\")\n",
  84    "print(f\"\\n์ปฌ๋Ÿผ ๋ชฉ๋ก:\")\n",
  85    "print(df.columns.tolist())\n",
  86    "print(f\"\\n๋ฐ์ดํ„ฐ ํƒ€์ž…:\")\n",
  87    "print(df.dtypes)"
  88   ]
  89  },
  90  {
  91   "cell_type": "code",
  92   "execution_count": null,
  93   "metadata": {},
  94   "outputs": [],
  95   "source": [
  96    "# ์ฒ˜์Œ ๋ช‡ ํ–‰ ํ™•์ธ\n",
  97    "print(\"์ฒ˜์Œ 5ํ–‰:\")\n",
  98    "df.head()"
  99   ]
 100  },
 101  {
 102   "cell_type": "code",
 103   "execution_count": null,
 104   "metadata": {},
 105   "outputs": [],
 106   "source": [
 107    "# ๊ธฐ์ˆ  ํ†ต๊ณ„\n",
 108    "print(\"๊ธฐ์ˆ  ํ†ต๊ณ„:\")\n",
 109    "df.describe()"
 110   ]
 111  },
 112  {
 113   "cell_type": "code",
 114   "execution_count": null,
 115   "metadata": {},
 116   "outputs": [],
 117   "source": [
 118    "# ํƒ€๊ฒŸ ๋ณ€์ˆ˜ ๋ถ„ํฌ\n",
 119    "print(\"=== ์ƒ์กด ์—ฌ๋ถ€ ๋ถ„ํฌ ===\")\n",
 120    "print(df['survived'].value_counts())\n",
 121    "print(f\"\\n์ƒ์กด ๋น„์œจ:\")\n",
 122    "print(df['survived'].value_counts(normalize=True))\n",
 123    "\n",
 124    "# ์‹œ๊ฐํ™”\n",
 125    "fig, ax = plt.subplots(1, 2, figsize=(12, 4))\n",
 126    "\n",
 127    "df['survived'].value_counts().plot(kind='bar', ax=ax[0])\n",
 128    "ax[0].set_title('Survival Count')\n",
 129    "ax[0].set_xlabel('Survived (0=No, 1=Yes)')\n",
 130    "ax[0].set_ylabel('Count')\n",
 131    "\n",
 132    "df['survived'].value_counts(normalize=True).plot(kind='pie', autopct='%1.1f%%', ax=ax[1])\n",
 133    "ax[1].set_title('Survival Proportion')\n",
 134    "ax[1].set_ylabel('')\n",
 135    "\n",
 136    "plt.tight_layout()\n",
 137    "plt.show()"
 138   ]
 139  },
 140  {
 141   "cell_type": "markdown",
 142   "metadata": {},
 143   "source": [
 144    "## 3. ํƒ์ƒ‰์  ๋ฐ์ดํ„ฐ ๋ถ„์„ (EDA)\n",
 145    "\n",
 146    "### 3.1 ๊ฒฐ์ธก์น˜ ๋ถ„์„"
 147   ]
 148  },
 149  {
 150   "cell_type": "code",
 151   "execution_count": null,
 152   "metadata": {},
 153   "outputs": [],
 154   "source": [
 155    "# ๊ฒฐ์ธก์น˜ ํ™•์ธ\n",
 156    "print(\"=== ๊ฒฐ์ธก์น˜ ๋ถ„์„ ===\")\n",
 157    "missing = df.isnull().sum()\n",
 158    "missing_pct = (missing / len(df) * 100).round(2)\n",
 159    "missing_df = pd.DataFrame({\n",
 160    "    '๊ฒฐ์ธก์น˜ ์ˆ˜': missing,\n",
 161    "    '๊ฒฐ์ธก์น˜ ๋น„์œจ(%)': missing_pct\n",
 162    "})\n",
 163    "print(missing_df[missing_df['๊ฒฐ์ธก์น˜ ์ˆ˜'] > 0].sort_values(by='๊ฒฐ์ธก์น˜ ์ˆ˜', ascending=False))\n",
 164    "\n",
 165    "# ์‹œ๊ฐํ™”\n",
 166    "plt.figure(figsize=(10, 6))\n",
 167    "missing_data = missing_df[missing_df['๊ฒฐ์ธก์น˜ ์ˆ˜'] > 0].sort_values(by='๊ฒฐ์ธก์น˜ ์ˆ˜', ascending=False)\n",
 168    "plt.barh(missing_data.index, missing_data['๊ฒฐ์ธก์น˜ ๋น„์œจ(%)'])\n",
 169    "plt.xlabel('Missing Percentage (%)')\n",
 170    "plt.title('Missing Values by Feature')\n",
 171    "plt.tight_layout()\n",
 172    "plt.show()"
 173   ]
 174  },
 175  {
 176   "cell_type": "markdown",
 177   "metadata": {},
 178   "source": [
 179    "### 3.2 ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜์™€ ์ƒ์กด์˜ ๊ด€๊ณ„"
 180   ]
 181  },
 182  {
 183   "cell_type": "code",
 184   "execution_count": null,
 185   "metadata": {},
 186   "outputs": [],
 187   "source": [
 188    "# ์ฃผ์š” ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜์™€ ์ƒ์กด์˜ ๊ด€๊ณ„\n",
 189    "fig, axes = plt.subplots(2, 3, figsize=(15, 10))\n",
 190    "\n",
 191    "# ์„ฑ๋ณ„\n",
 192    "sns.countplot(data=df, x='sex', hue='survived', ax=axes[0, 0])\n",
 193    "axes[0, 0].set_title('Survival by Sex')\n",
 194    "\n",
 195    "# ๊ฐ์‹ค ๋“ฑ๊ธ‰\n",
 196    "sns.countplot(data=df, x='pclass', hue='survived', ax=axes[0, 1])\n",
 197    "axes[0, 1].set_title('Survival by Class')\n",
 198    "\n",
 199    "# ์Šน์„  ํ•ญ๊ตฌ\n",
 200    "sns.countplot(data=df, x='embarked', hue='survived', ax=axes[0, 2])\n",
 201    "axes[0, 2].set_title('Survival by Embarked')\n",
 202    "\n",
 203    "# ํ˜•์ œ/๋ฐฐ์šฐ์ž ์ˆ˜\n",
 204    "sns.countplot(data=df, x='sibsp', hue='survived', ax=axes[1, 0])\n",
 205    "axes[1, 0].set_title('Survival by SibSp')\n",
 206    "\n",
 207    "# ๋ถ€๋ชจ/์ž๋…€ ์ˆ˜\n",
 208    "sns.countplot(data=df, x='parch', hue='survived', ax=axes[1, 1])\n",
 209    "axes[1, 1].set_title('Survival by Parch')\n",
 210    "\n",
  211    "# ํ˜ผ์ž ์—ฌํ–‰ ์—ฌ๋ถ€ (seaborn ๋ฐ์ดํ„ฐ์…‹์˜ ๊ธฐ์กด alone ์ปฌ๋Ÿผ์„ 0/1 ์ •์ˆ˜๋กœ ์žฌ๊ณ„์‚ฐ)\n",
  212    "df['alone'] = ((df['sibsp'] + df['parch']) == 0).astype(int)\n",
 213    "sns.countplot(data=df, x='alone', hue='survived', ax=axes[1, 2])\n",
 214    "axes[1, 2].set_title('Survival by Alone')\n",
 215    "\n",
 216    "plt.tight_layout()\n",
 217    "plt.show()"
 218   ]
 219  },
 220  {
 221   "cell_type": "code",
 222   "execution_count": null,
 223   "metadata": {},
 224   "outputs": [],
 225   "source": [
 226    "# ์ƒ์กด์œจ ํ†ต๊ณ„\n",
 227    "print(\"=== ๋ฒ”์ฃผ๋ณ„ ์ƒ์กด์œจ ===\")\n",
 228    "print(\"\\n์„ฑ๋ณ„:\")\n",
 229    "print(df.groupby('sex')['survived'].mean())\n",
 230    "print(\"\\n๊ฐ์‹ค ๋“ฑ๊ธ‰:\")\n",
 231    "print(df.groupby('pclass')['survived'].mean())\n",
 232    "print(\"\\n์Šน์„  ํ•ญ๊ตฌ:\")\n",
 233    "print(df.groupby('embarked')['survived'].mean())"
 234   ]
 235  },
 236  {
 237   "cell_type": "markdown",
 238   "metadata": {},
 239   "source": [
 240    "### 3.3 ์ˆ˜์น˜ํ˜• ๋ณ€์ˆ˜ ๋ถ„์„"
 241   ]
 242  },
 243  {
 244   "cell_type": "code",
 245   "execution_count": null,
 246   "metadata": {},
 247   "outputs": [],
 248   "source": [
 249    "# ๋‚˜์ด์™€ ์š”๊ธˆ ๋ถ„ํฌ (์ƒ์กด ์—ฌ๋ถ€๋ณ„)\n",
 250    "fig, axes = plt.subplots(2, 2, figsize=(14, 10))\n",
 251    "\n",
 252    "# ๋‚˜์ด ๋ถ„ํฌ\n",
 253    "for survived in [0, 1]:\n",
 254    "    axes[0, 0].hist(df[df['survived'] == survived]['age'].dropna(), \n",
 255    "                    bins=30, alpha=0.5, label=f'Survived={survived}')\n",
 256    "axes[0, 0].set_xlabel('Age')\n",
 257    "axes[0, 0].set_ylabel('Count')\n",
 258    "axes[0, 0].set_title('Age Distribution by Survival')\n",
 259    "axes[0, 0].legend()\n",
 260    "\n",
 261    "# ๋‚˜์ด ๋ฐ•์Šคํ”Œ๋กฏ\n",
 262    "sns.boxplot(data=df, x='survived', y='age', ax=axes[0, 1])\n",
 263    "axes[0, 1].set_title('Age by Survival')\n",
 264    "\n",
 265    "# ์š”๊ธˆ ๋ถ„ํฌ (๋กœ๊ทธ ์Šค์ผ€์ผ)\n",
 266    "for survived in [0, 1]:\n",
 267    "    axes[1, 0].hist(np.log1p(df[df['survived'] == survived]['fare'].dropna()), \n",
 268    "                    bins=30, alpha=0.5, label=f'Survived={survived}')\n",
 269    "axes[1, 0].set_xlabel('Log(Fare + 1)')\n",
 270    "axes[1, 0].set_ylabel('Count')\n",
 271    "axes[1, 0].set_title('Fare Distribution by Survival (Log Scale)')\n",
 272    "axes[1, 0].legend()\n",
 273    "\n",
 274    "# ์š”๊ธˆ ๋ฐ•์Šคํ”Œ๋กฏ\n",
 275    "sns.boxplot(data=df, x='survived', y='fare', ax=axes[1, 1])\n",
 276    "axes[1, 1].set_title('Fare by Survival')\n",
 277    "axes[1, 1].set_ylim(0, 300)\n",
 278    "\n",
 279    "plt.tight_layout()\n",
 280    "plt.show()"
 281   ]
 282  },
 283  {
 284   "cell_type": "code",
 285   "execution_count": null,
 286   "metadata": {},
 287   "outputs": [],
 288   "source": [
 289    "# ์ƒ๊ด€๊ด€๊ณ„ ๋ถ„์„\n",
 290    "print(\"=== ์ˆ˜์น˜ํ˜• ๋ณ€์ˆ˜ ์ƒ๊ด€๊ด€๊ณ„ ===\")\n",
 291    "numeric_cols = df.select_dtypes(include=[np.number]).columns\n",
 292    "correlation = df[numeric_cols].corr()\n",
 293    "\n",
 294    "plt.figure(figsize=(10, 8))\n",
 295    "sns.heatmap(correlation, annot=True, fmt='.2f', cmap='coolwarm', center=0)\n",
 296    "plt.title('Correlation Matrix')\n",
 297    "plt.tight_layout()\n",
 298    "plt.show()\n",
 299    "\n",
 300    "print(\"\\nํƒ€๊ฒŸ(survived)๊ณผ์˜ ์ƒ๊ด€๊ด€๊ณ„:\")\n",
 301    "print(correlation['survived'].sort_values(ascending=False))"
 302   ]
 303  },
 304  {
 305   "cell_type": "markdown",
 306   "metadata": {},
 307   "source": [
 308    "## 4. ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ ๋ฐ ํŠน์„ฑ ์—”์ง€๋‹ˆ์–ด๋ง"
 309   ]
 310  },
 311  {
 312   "cell_type": "code",
 313   "execution_count": null,
 314   "metadata": {},
 315   "outputs": [],
 316   "source": [
 317    "# ์ž‘์—…์šฉ ๋ฐ์ดํ„ฐ ๋ณต์‚ฌ\n",
 318    "df_clean = df.copy()\n",
 319    "\n",
 320    "print(\"=== ์ „์ฒ˜๋ฆฌ ์‹œ์ž‘ ===\")\n",
 321    "print(f\"์ดˆ๊ธฐ ๋ฐ์ดํ„ฐ ํ˜•์ƒ: {df_clean.shape}\")"
 322   ]
 323  },
 324  {
 325   "cell_type": "markdown",
 326   "metadata": {},
 327   "source": [
 328    "### 4.1 ๋ถˆํ•„์š”ํ•œ ์ปฌ๋Ÿผ ์ œ๊ฑฐ"
 329   ]
 330  },
 331  {
 332   "cell_type": "code",
 333   "execution_count": null,
 334   "metadata": {},
 335   "outputs": [],
 336   "source": [
 337    "# ์ค‘๋ณต๋˜๊ฑฐ๋‚˜ ๋ถˆํ•„์š”ํ•œ ์ปฌ๋Ÿผ ์ œ๊ฑฐ\n",
 338    "drop_cols = ['deck', 'embark_town', 'alive', 'who', 'adult_male', 'class']\n",
 339    "df_clean = df_clean.drop(columns=drop_cols, errors='ignore')\n",
 340    "\n",
 341    "print(f\"์ปฌ๋Ÿผ ์ œ๊ฑฐ ํ›„: {df_clean.shape}\")\n",
 342    "print(f\"๋‚จ์€ ์ปฌ๋Ÿผ: {df_clean.columns.tolist()}\")"
 343   ]
 344  },
 345  {
 346   "cell_type": "markdown",
 347   "metadata": {},
 348   "source": [
 349    "### 4.2 ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ"
 350   ]
 351  },
 352  {
 353   "cell_type": "code",
 354   "execution_count": null,
 355   "metadata": {},
 356   "outputs": [],
 357   "source": [
 358    "# ๋‚˜์ด: ์ค‘๊ฐ„๊ฐ’์œผ๋กœ ๋Œ€์ฒด\n",
 359    "age_median = df_clean['age'].median()\n",
 360    "df_clean['age'] = df_clean['age'].fillna(age_median)\n",
 361    "print(f\"๋‚˜์ด ๊ฒฐ์ธก์น˜๋ฅผ ์ค‘๊ฐ„๊ฐ’({age_median})์œผ๋กœ ๋Œ€์ฒด\")\n",
 362    "\n",
 363    "# ์Šน์„  ํ•ญ๊ตฌ: ์ตœ๋นˆ๊ฐ’์œผ๋กœ ๋Œ€์ฒด\n",
 364    "embarked_mode = df_clean['embarked'].mode()[0]\n",
 365    "df_clean['embarked'] = df_clean['embarked'].fillna(embarked_mode)\n",
 366    "print(f\"์Šน์„  ํ•ญ๊ตฌ ๊ฒฐ์ธก์น˜๋ฅผ ์ตœ๋นˆ๊ฐ’({embarked_mode})์œผ๋กœ ๋Œ€์ฒด\")\n",
 367    "\n",
 368    "# ์š”๊ธˆ: ์ค‘๊ฐ„๊ฐ’์œผ๋กœ ๋Œ€์ฒด\n",
 369    "fare_median = df_clean['fare'].median()\n",
 370    "df_clean['fare'] = df_clean['fare'].fillna(fare_median)\n",
 371    "\n",
 372    "print(f\"\\n๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ ํ›„:\")\n",
 373    "print(df_clean.isnull().sum()[df_clean.isnull().sum() > 0])"
 374   ]
 375  },
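 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "์ฐธ๊ณ : ์œ„์—์„œ ์ž„ํฌํŠธ๋งŒ ํ•ด ๋‘” `SimpleImputer`๋กœ๋„ ๊ฐ™์€ ๋Œ€์ฒด๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์›๋ณธ `df`์˜ ์ผ๋ถ€ ์ปฌ๋Ÿผ์— ์ ์šฉํ•ด ๋ณด๋Š” ๊ฐ„๋‹จํ•œ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค. ์‹ค์ „(Kaggle)์—์„œ๋Š” ํ•™์Šต ๋ฐ์ดํ„ฐ์—๋งŒ `fit`ํ•˜๊ณ  ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์—๋Š” `transform`๋งŒ ์ ์šฉํ•ด์•ผ ๋ฐ์ดํ„ฐ ๋ˆ„์ˆ˜๋ฅผ ํ”ผํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# SimpleImputer๋กœ ๊ฒฐ์ธก์น˜ ๋Œ€์ฒด (์œ„ fillna์™€ ๋™์ผํ•œ ํšจ๊ณผ, ํŒŒ์ดํ”„๋ผ์ธ ๊ตฌ์„ฑ์— ์œ ๋ฆฌ)\n",
   "num_imputer = SimpleImputer(strategy='median')\n",
   "cat_imputer = SimpleImputer(strategy='most_frequent')\n",
   "\n",
   "demo = df[['age', 'fare', 'embarked']].copy()\n",
   "demo[['age', 'fare']] = num_imputer.fit_transform(demo[['age', 'fare']])\n",
   "demo[['embarked']] = cat_imputer.fit_transform(demo[['embarked']])\n",
   "\n",
   "print(demo.isnull().sum())"
  ]
 },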
 376  {
 377   "cell_type": "markdown",
 378   "metadata": {},
 379   "source": [
 380    "### 4.3 ํŠน์„ฑ ์—”์ง€๋‹ˆ์–ด๋ง\n",
 381    "\n",
 382    "๋„๋ฉ”์ธ ์ง€์‹์„ ํ™œ์šฉํ•˜์—ฌ ์ƒˆ๋กœ์šด ํŠน์„ฑ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค."
 383   ]
 384  },
 385  {
 386   "cell_type": "code",
 387   "execution_count": null,
 388   "metadata": {},
 389   "outputs": [],
 390   "source": [
 391    "# 1. ๊ฐ€์กฑ ํฌ๊ธฐ\n",
 392    "df_clean['family_size'] = df_clean['sibsp'] + df_clean['parch'] + 1\n",
 393    "print(\"๊ฐ€์กฑ ํฌ๊ธฐ ํŠน์„ฑ ์ƒ์„ฑ: sibsp + parch + 1\")\n",
 394    "\n",
 395    "# 2. ํ˜ผ์ž ์—ฌํ–‰ ์—ฌ๋ถ€\n",
 396    "df_clean['is_alone'] = (df_clean['family_size'] == 1).astype(int)\n",
 397    "print(\"ํ˜ผ์ž ์—ฌํ–‰ ์—ฌ๋ถ€ ํŠน์„ฑ ์ƒ์„ฑ\")\n",
 398    "\n",
 399    "# 3. ๋‚˜์ด ๊ทธ๋ฃน\n",
 400    "df_clean['age_group'] = pd.cut(df_clean['age'],\n",
 401    "                                bins=[0, 12, 18, 35, 60, 100],\n",
 402    "                                labels=['Child', 'Teen', 'Young', 'Middle', 'Senior'])\n",
 403    "print(\"๋‚˜์ด ๊ทธ๋ฃน ํŠน์„ฑ ์ƒ์„ฑ\")\n",
 404    "\n",
 405    "# 4. ์š”๊ธˆ ๊ตฌ๊ฐ„\n",
 406    "df_clean['fare_bin'] = pd.qcut(df_clean['fare'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])\n",
 407    "print(\"์š”๊ธˆ ๊ตฌ๊ฐ„ ํŠน์„ฑ ์ƒ์„ฑ\")\n",
 408    "\n",
  409    "# 5. ํ˜ธ์นญ ์ถ”์ถœ (์„ ํƒ์ : name ์ปฌ๋Ÿผ์€ Kaggle train.csv์—๋งŒ ์žˆ๊ณ  seaborn ๋ฐ์ดํ„ฐ์…‹์—๋Š” ์—†์Œ)\n",
  410    "# df_clean['title'] = df_clean['name'].str.extract(r' ([A-Za-z]+)\\.', expand=False)\n",
 411    "\n",
 412    "print(f\"\\nํŠน์„ฑ ์—”์ง€๋‹ˆ์–ด๋ง ํ›„ ํ˜•์ƒ: {df_clean.shape}\")"
 413   ]
 414  },
 415  {
 416   "cell_type": "code",
 417   "execution_count": null,
 418   "metadata": {},
 419   "outputs": [],
 420   "source": [
 421    "# ์ƒˆ๋กœ์šด ํŠน์„ฑ๊ณผ ์ƒ์กด์˜ ๊ด€๊ณ„ ํ™•์ธ\n",
 422    "fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n",
 423    "\n",
 424    "sns.countplot(data=df_clean, x='family_size', hue='survived', ax=axes[0])\n",
 425    "axes[0].set_title('Survival by Family Size')\n",
 426    "\n",
 427    "sns.countplot(data=df_clean, x='age_group', hue='survived', ax=axes[1])\n",
 428    "axes[1].set_title('Survival by Age Group')\n",
 429    "axes[1].tick_params(axis='x', rotation=45)\n",
 430    "\n",
 431    "sns.countplot(data=df_clean, x='fare_bin', hue='survived', ax=axes[2])\n",
 432    "axes[2].set_title('Survival by Fare Bin')\n",
 433    "axes[2].tick_params(axis='x', rotation=45)\n",
 434    "\n",
 435    "plt.tight_layout()\n",
 436    "plt.show()"
 437   ]
 438  },
 439  {
 440   "cell_type": "markdown",
 441   "metadata": {},
 442   "source": [
 443    "### 4.4 ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜ ์ธ์ฝ”๋”ฉ"
 444   ]
 445  },
 446  {
 447   "cell_type": "code",
 448   "execution_count": null,
 449   "metadata": {},
 450   "outputs": [],
 451   "source": [
  452    "# LabelEncoder ์‚ฌ์šฉ (ํŠน์„ฑ ์ธ์ฝ”๋”ฉ์—๋Š” OrdinalEncoder๋‚˜ ์›-ํ•ซ ์ธ์ฝ”๋”ฉ๋„ ๊ณ ๋ ค)\n",
 453    "le = LabelEncoder()\n",
 454    "\n",
 455    "df_clean['sex'] = le.fit_transform(df_clean['sex'])\n",
 456    "df_clean['embarked'] = le.fit_transform(df_clean['embarked'])\n",
 457    "df_clean['age_group'] = le.fit_transform(df_clean['age_group'])\n",
 458    "df_clean['fare_bin'] = le.fit_transform(df_clean['fare_bin'])\n",
 459    "\n",
 460    "print(\"๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜ ์ธ์ฝ”๋”ฉ ์™„๋ฃŒ\")\n",
 461    "print(f\"\\n์ธ์ฝ”๋”ฉ ํ›„ ๋ฐ์ดํ„ฐ ํƒ€์ž…:\")\n",
 462    "print(df_clean.dtypes)"
 463   ]
 464  },
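 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "์ฐธ๊ณ : `LabelEncoder`๋Š” ๋ฒ”์ฃผ์— ์ˆœ์„œ๊ฐ€ ์žˆ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ์ •์ˆ˜๋ฅผ ๋ถ€์—ฌํ•˜๋ฏ€๋กœ, ์„ ํ˜• ๋ชจ๋ธ์—์„œ๋Š” ์›-ํ•ซ ์ธ์ฝ”๋”ฉ์ด ๋” ์•ˆ์ „ํ•œ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์›๋ณธ `df`์˜ ๋ฒ”์ฃผํ˜• ์ปฌ๋Ÿผ์— `pd.get_dummies`๋ฅผ ์ ์šฉํ•ด ๋ณด๋Š” ์˜ˆ์‹œ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค (๋ณธ๋ฌธ์˜ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋ง์—๋Š” ์œ„์˜ ์ •์ˆ˜ ์ธ์ฝ”๋”ฉ์„ ๊ทธ๋Œ€๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค)."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# ๋Œ€์•ˆ: pd.get_dummies๋กœ ์›-ํ•ซ ์ธ์ฝ”๋”ฉ (์›๋ณธ df ๊ธฐ์ค€ ์˜ˆ์‹œ)\n",
   "onehot_demo = pd.get_dummies(df[['sex', 'embarked']], columns=['sex', 'embarked'])\n",
   "print(onehot_demo.head())"
  ]
 },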
 465  {
 466   "cell_type": "markdown",
 467   "metadata": {},
 468   "source": [
 469    "### 4.5 ์ตœ์ข… ํŠน์„ฑ ์„ ํƒ"
 470   ]
 471  },
 472  {
 473   "cell_type": "code",
 474   "execution_count": null,
 475   "metadata": {},
 476   "outputs": [],
 477   "source": [
 478    "# ๋ชจ๋ธ๋ง์— ์‚ฌ์šฉํ•  ํŠน์„ฑ ์„ ํƒ\n",
 479    "features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',\n",
 480    "            'embarked', 'family_size', 'is_alone', 'age_group', 'fare_bin']\n",
 481    "\n",
 482    "X = df_clean[features]\n",
 483    "y = df_clean['survived']\n",
 484    "\n",
 485    "print(f\"์ตœ์ข… ํŠน์„ฑ: {features}\")\n",
 486    "print(f\"X ํ˜•์ƒ: {X.shape}\")\n",
 487    "print(f\"y ๋ถ„ํฌ: {y.value_counts().to_dict()}\")"
 488   ]
 489  },
 490  {
 491   "cell_type": "markdown",
 492   "metadata": {},
 493   "source": [
 494    "## 5. ๋ชจ๋ธ๋ง\n",
 495    "\n",
 496    "### 5.1 ๋ฐ์ดํ„ฐ ๋ถ„ํ• "
 497   ]
 498  },
 499  {
 500   "cell_type": "code",
 501   "execution_count": null,
 502   "metadata": {},
 503   "outputs": [],
 504   "source": [
 505    "# Train/Test ๋ถ„ํ•  (Stratified)\n",
 506    "X_train, X_test, y_train, y_test = train_test_split(\n",
 507    "    X, y, test_size=0.2, random_state=42, stratify=y\n",
 508    ")\n",
 509    "\n",
 510    "print(f\"ํ•™์Šต ๋ฐ์ดํ„ฐ: {X_train.shape}\")\n",
 511    "print(f\"ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ: {X_test.shape}\")\n",
 512    "print(f\"\\nํ•™์Šต ๋ฐ์ดํ„ฐ ํƒ€๊ฒŸ ๋ถ„ํฌ: {y_train.value_counts().to_dict()}\")\n",
 513    "print(f\"ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ํƒ€๊ฒŸ ๋ถ„ํฌ: {y_test.value_counts().to_dict()}\")"
 514   ]
 515  },
 516  {
 517   "cell_type": "code",
 518   "execution_count": null,
 519   "metadata": {},
 520   "outputs": [],
 521   "source": [
 522    "# ์Šค์ผ€์ผ๋ง (์„ ํ˜• ๋ชจ๋ธ์šฉ)\n",
 523    "scaler = StandardScaler()\n",
 524    "X_train_scaled = scaler.fit_transform(X_train)\n",
 525    "X_test_scaled = scaler.transform(X_test)\n",
 526    "\n",
 527    "print(\"์Šค์ผ€์ผ๋ง ์™„๋ฃŒ\")"
 528   ]
 529  },
 530  {
 531   "cell_type": "markdown",
 532   "metadata": {},
 533   "source": [
 534    "### 5.2 Baseline ๋ชจ๋ธ\n",
 535    "\n",
 536    "๊ฐ„๋‹จํ•œ ๋ชจ๋ธ๋กœ ๊ธฐ์ค€์„ ์„ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค."
 537   ]
 538  },
 539  {
 540   "cell_type": "code",
 541   "execution_count": null,
 542   "metadata": {},
 543   "outputs": [],
 544   "source": [
 545    "# ๊ธฐ์ค€์„ : ํ•ญ์ƒ ๋‹ค์ˆ˜ ํด๋ž˜์Šค ์˜ˆ์ธก\n",
 546    "baseline_pred = np.zeros(len(y_test))  # ๋ชจ๋‘ 0 (์‚ฌ๋ง) ์˜ˆ์ธก\n",
 547    "baseline_acc = accuracy_score(y_test, baseline_pred)\n",
 548    "\n",
 549    "print(f\"Baseline ์ •ํ™•๋„ (ํ•ญ์ƒ ์‚ฌ๋ง ์˜ˆ์ธก): {baseline_acc:.4f}\")\n",
 550    "print(\"\\n์ด ๊ฐ’๋ณด๋‹ค ๋†’์€ ์„ฑ๋Šฅ์„ ๋ชฉํ‘œ๋กœ ํ•ฉ๋‹ˆ๋‹ค.\")"
 551   ]
 552  },
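 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "๊ฐ™์€ ๊ธฐ์ค€์„ ์„ scikit-learn์˜ `DummyClassifier`๋กœ๋„ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. `strategy='most_frequent'`๋Š” ํ•ญ์ƒ ๋‹ค์ˆ˜ ํด๋ž˜์Šค๋ฅผ ์˜ˆ์ธกํ•˜๋ฏ€๋กœ ์œ„์˜ ์ˆ˜๋™ ๊ธฐ์ค€์„ ๊ณผ ๊ฐ™์€ ๊ฐ’์ด ๋‚˜์™€์•ผ ํ•ฉ๋‹ˆ๋‹ค."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "from sklearn.dummy import DummyClassifier\n",
   "\n",
   "# ๋‹ค์ˆ˜ ํด๋ž˜์Šค๋ฅผ ํ•ญ์ƒ ์˜ˆ์ธกํ•˜๋Š” ๊ธฐ์ค€์„  ๋ชจ๋ธ\n",
   "dummy = DummyClassifier(strategy='most_frequent')\n",
   "dummy.fit(X_train, y_train)\n",
   "print(f\"DummyClassifier ์ •ํ™•๋„: {dummy.score(X_test, y_test):.4f}\")"
  ]
 },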
 553  {
 554   "cell_type": "markdown",
 555   "metadata": {},
 556   "source": [
 557    "### 5.3 ์—ฌ๋Ÿฌ ๋ชจ๋ธ ๋น„๊ต"
 558   ]
 559  },
 560  {
 561   "cell_type": "code",
 562   "execution_count": null,
 563   "metadata": {},
 564   "outputs": [],
 565   "source": [
 566    "# ๋‹ค์–‘ํ•œ ๋ชจ๋ธ ์ •์˜\n",
 567    "models = {\n",
 568    "    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),\n",
 569    "    'Decision Tree': DecisionTreeClassifier(random_state=42),\n",
 570    "    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),\n",
 571    "    'Gradient Boosting': GradientBoostingClassifier(random_state=42),\n",
 572    "    'SVM': SVC(random_state=42)\n",
 573    "}\n",
 574    "\n",
 575    "# ๋ชจ๋ธ ๋น„๊ต\n",
 576    "print(\"=== ๋ชจ๋ธ ๋น„๊ต (5-Fold Cross Validation) ===\")\n",
 577    "results = []\n",
 578    "\n",
 579    "for name, model in models.items():\n",
 580    "    # ์„ ํ˜• ๋ชจ๋ธ์€ ์Šค์ผ€์ผ๋ง๋œ ๋ฐ์ดํ„ฐ ์‚ฌ์šฉ\n",
 581    "    if name in ['Logistic Regression', 'SVM']:\n",
 582    "        X_tr, X_te = X_train_scaled, X_test_scaled\n",
 583    "    else:\n",
 584    "        X_tr, X_te = X_train, X_test\n",
 585    "    \n",
 586    "    # ๊ต์ฐจ ๊ฒ€์ฆ\n",
 587    "    cv_scores = cross_val_score(model, X_tr, y_train, cv=5, scoring='accuracy')\n",
 588    "    \n",
 589    "    # ํ•™์Šต ๋ฐ ํ…Œ์ŠคํŠธ\n",
 590    "    model.fit(X_tr, y_train)\n",
 591    "    test_score = model.score(X_te, y_test)\n",
 592    "    \n",
 593    "    results.append({\n",
 594    "        'Model': name,\n",
 595    "        'CV Mean': cv_scores.mean(),\n",
 596    "        'CV Std': cv_scores.std(),\n",
 597    "        'Test Score': test_score\n",
 598    "    })\n",
 599    "    \n",
 600    "    print(f\"{name}:\")\n",
 601    "    print(f\"  CV = {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})\")\n",
 602    "    print(f\"  Test = {test_score:.4f}\")\n",
 603    "    print()\n",
 604    "\n",
 605    "results_df = pd.DataFrame(results)\n",
 606    "results_df = results_df.sort_values(by='CV Mean', ascending=False)\n",
 607    "print(\"\\n๋ชจ๋ธ ์ˆœ์œ„:\")\n",
 608    "print(results_df)"
 609   ]
 610  },
 611  {
 612   "cell_type": "code",
 613   "execution_count": null,
 614   "metadata": {},
 615   "outputs": [],
 616   "source": [
 617    "# ๊ฒฐ๊ณผ ์‹œ๊ฐํ™”\n",
 618    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
 619    "\n",
 620    "# CV ์ ์ˆ˜\n",
 621    "axes[0].barh(results_df['Model'], results_df['CV Mean'])\n",
 622    "axes[0].set_xlabel('CV Accuracy')\n",
 623    "axes[0].set_title('Cross-Validation Scores')\n",
 624    "axes[0].set_xlim(0.7, 0.9)\n",
 625    "\n",
 626    "# Test ์ ์ˆ˜\n",
 627    "axes[1].barh(results_df['Model'], results_df['Test Score'])\n",
 628    "axes[1].set_xlabel('Test Accuracy')\n",
 629    "axes[1].set_title('Test Scores')\n",
 630    "axes[1].set_xlim(0.7, 0.9)\n",
 631    "\n",
 632    "plt.tight_layout()\n",
 633    "plt.show()"
 634   ]
 635  },
 636  {
 637   "cell_type": "markdown",
 638   "metadata": {},
 639   "source": [
 640    "### 5.4 ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹\n",
 641    "\n",
 642    "์ตœ๊ณ  ์„ฑ๋Šฅ ๋ชจ๋ธ์— ๋Œ€ํ•ด ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํŠœ๋‹ํ•ฉ๋‹ˆ๋‹ค."
 643   ]
 644  },
 645  {
 646   "cell_type": "code",
 647   "execution_count": null,
 648   "metadata": {},
 649   "outputs": [],
 650   "source": [
 651    "# Random Forest ํŠœ๋‹\n",
 652    "rf_param_grid = {\n",
 653    "    'n_estimators': [100, 200, 300],\n",
 654    "    'max_depth': [5, 10, 15, None],\n",
 655    "    'min_samples_split': [2, 5, 10],\n",
 656    "    'min_samples_leaf': [1, 2, 4],\n",
 657    "    'max_features': ['sqrt', 'log2']\n",
 658    "}\n",
 659    "\n",
 660    "rf = RandomForestClassifier(random_state=42)\n",
 661    "grid_search = GridSearchCV(\n",
 662    "    rf, rf_param_grid, \n",
 663    "    cv=5, \n",
 664    "    scoring='accuracy', \n",
 665    "    n_jobs=-1, \n",
 666    "    verbose=1\n",
 667    ")\n",
 668    "\n",
 669    "print(\"Grid Search ์‹œ์ž‘...\")\n",
 670    "grid_search.fit(X_train, y_train)\n",
 671    "\n",
 672    "print(\"\\n=== ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹ ๊ฒฐ๊ณผ ===\")\n",
 673    "print(f\"์ตœ์  ํŒŒ๋ผ๋ฏธํ„ฐ: {grid_search.best_params_}\")\n",
 674    "print(f\"์ตœ์  CV ์ ์ˆ˜: {grid_search.best_score_:.4f}\")\n",
 675    "print(f\"ํ…Œ์ŠคํŠธ ์ ์ˆ˜: {grid_search.score(X_test, y_test):.4f}\")\n",
 676    "\n",
 677    "best_model = grid_search.best_estimator_"
 678   ]
 679  },
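 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "ํƒ์ƒ‰ ๊ณต๊ฐ„์ด ํด ๋•Œ๋Š” `RandomizedSearchCV`๋กœ ์ผ๋ถ€ ์กฐํ•ฉ๋งŒ ์ƒ˜ํ”Œ๋งํ•˜์—ฌ ๊ณ„์‚ฐ ๋น„์šฉ์„ ์ค„์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ๊ฐ™์€ ๊ทธ๋ฆฌ๋“œ๋ฅผ ์žฌ์‚ฌ์šฉํ•˜๋Š” ์Šค์ผ€์น˜์ด๋ฉฐ, `n_iter=20`์€ ์˜ˆ์‹œ๋กœ ๊ณ ๋ฅธ ๊ฐ’์ž…๋‹ˆ๋‹ค."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "from sklearn.model_selection import RandomizedSearchCV\n",
   "\n",
   "# ์ „์ฒด ๊ทธ๋ฆฌ๋“œ ๋Œ€์‹  20๊ฐœ ์กฐํ•ฉ๋งŒ ๋ฌด์ž‘์œ„ ์ƒ˜ํ”Œ๋ง\n",
   "random_search = RandomizedSearchCV(\n",
   "    RandomForestClassifier(random_state=42),\n",
   "    param_distributions=rf_param_grid,\n",
   "    n_iter=20, cv=5, scoring='accuracy', n_jobs=-1, random_state=42\n",
   ")\n",
   "random_search.fit(X_train, y_train)\n",
   "\n",
   "print(f\"์ตœ์  ํŒŒ๋ผ๋ฏธํ„ฐ: {random_search.best_params_}\")\n",
   "print(f\"์ตœ์  CV ์ ์ˆ˜: {random_search.best_score_:.4f}\")"
  ]
 },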
 680  {
 681   "cell_type": "markdown",
 682   "metadata": {},
 683   "source": [
 684    "## 6. ๋ชจ๋ธ ํ‰๊ฐ€\n",
 685    "\n",
 686    "### 6.1 ๋ถ„๋ฅ˜ ์„ฑ๋Šฅ ์ง€ํ‘œ"
 687   ]
 688  },
 689  {
 690   "cell_type": "code",
 691   "execution_count": null,
 692   "metadata": {},
 693   "outputs": [],
 694   "source": [
 695    "# ์˜ˆ์ธก\n",
 696    "y_pred = best_model.predict(X_test)\n",
 697    "y_pred_proba = best_model.predict_proba(X_test)[:, 1]\n",
 698    "\n",
 699    "# ๋ถ„๋ฅ˜ ๋ฆฌํฌํŠธ\n",
 700    "print(\"=== ๋ถ„๋ฅ˜ ๋ฆฌํฌํŠธ ===\")\n",
 701    "print(classification_report(y_test, y_pred, target_names=['Not Survived', 'Survived']))\n",
 702    "\n",
 703    "# ROC AUC\n",
 704    "roc_auc = roc_auc_score(y_test, y_pred_proba)\n",
 705    "print(f\"\\nROC AUC Score: {roc_auc:.4f}\")"
 706   ]
 707  },
 708  {
 709   "cell_type": "markdown",
 710   "metadata": {},
 711   "source": [
 712    "### 6.2 ํ˜ผ๋™ ํ–‰๋ ฌ"
 713   ]
 714  },
 715  {
 716   "cell_type": "code",
 717   "execution_count": null,
 718   "metadata": {},
 719   "outputs": [],
 720   "source": [
 721    "# ํ˜ผ๋™ ํ–‰๋ ฌ ์‹œ๊ฐํ™”\n",
 722    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
 723    "\n",
 724    "# ํ˜ผ๋™ ํ–‰๋ ฌ\n",
 725    "cm = confusion_matrix(y_test, y_pred)\n",
 726    "sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',\n",
 727    "            xticklabels=['Not Survived', 'Survived'],\n",
 728    "            yticklabels=['Not Survived', 'Survived'],\n",
 729    "            ax=axes[0])\n",
 730    "axes[0].set_xlabel('Predicted')\n",
 731    "axes[0].set_ylabel('Actual')\n",
 732    "axes[0].set_title('Confusion Matrix')\n",
 733    "\n",
 734    "# ROC Curve\n",
 735    "fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)\n",
 736    "axes[1].plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc:.4f})')\n",
 737    "axes[1].plot([0, 1], [0, 1], 'k--', label='Random')\n",
 738    "axes[1].set_xlabel('False Positive Rate')\n",
 739    "axes[1].set_ylabel('True Positive Rate')\n",
 740    "axes[1].set_title('ROC Curve')\n",
 741    "axes[1].legend()\n",
 742    "axes[1].grid(True)\n",
 743    "\n",
 744    "plt.tight_layout()\n",
 745    "plt.show()"
 746   ]
 747  },
 748  {
 749   "cell_type": "markdown",
 750   "metadata": {},
 751   "source": [
 752    "### 6.3 ํŠน์„ฑ ์ค‘์š”๋„"
 753   ]
 754  },
 755  {
 756   "cell_type": "code",
 757   "execution_count": null,
 758   "metadata": {},
 759   "outputs": [],
 760   "source": [
 761    "# ํŠน์„ฑ ์ค‘์š”๋„\n",
 762    "importances = best_model.feature_importances_\n",
 763    "indices = np.argsort(importances)[::-1]\n",
 764    "\n",
 765    "plt.figure(figsize=(12, 6))\n",
 766    "plt.bar(range(len(importances)), importances[indices])\n",
 767    "plt.xticks(range(len(importances)), [features[i] for i in indices], rotation=45)\n",
 768    "plt.xlabel('Feature')\n",
 769    "plt.ylabel('Importance')\n",
 770    "plt.title('Feature Importance')\n",
 771    "plt.tight_layout()\n",
 772    "plt.show()\n",
 773    "\n",
 774    "print(\"\\nํŠน์„ฑ ์ค‘์š”๋„ ์ˆœ์œ„:\")\n",
 775    "for i in indices:\n",
 776    "    print(f\"  {features[i]:15s}: {importances[i]:.4f}\")"
 777   ]
 778  },
 779  {
 780   "cell_type": "markdown",
 781   "metadata": {},
 782   "source": [
 783    "### 6.4 ์˜ค๋ฅ˜ ๋ถ„์„"
 784   ]
 785  },
 786  {
 787   "cell_type": "code",
 788   "execution_count": null,
 789   "metadata": {},
 790   "outputs": [],
 791   "source": [
 792    "# ์ž˜๋ชป ์˜ˆ์ธก๋œ ์ผ€์ด์Šค ๋ถ„์„\n",
 793    "X_test_df = X_test.copy()\n",
 794    "X_test_df['actual'] = y_test.values\n",
 795    "X_test_df['predicted'] = y_pred\n",
 796    "X_test_df['correct'] = X_test_df['actual'] == X_test_df['predicted']\n",
 797    "\n",
 798    "print(\"=== ์˜ˆ์ธก ๊ฒฐ๊ณผ ===\")\n",
 799    "print(f\"์ •ํ™•ํžˆ ์˜ˆ์ธก: {X_test_df['correct'].sum()} / {len(X_test_df)}\")\n",
 800    "print(f\"์ž˜๋ชป ์˜ˆ์ธก: {(~X_test_df['correct']).sum()} / {len(X_test_df)}\")\n",
 801    "\n",
 802    "# False Positive์™€ False Negative\n",
 803    "fp = X_test_df[(X_test_df['actual'] == 0) & (X_test_df['predicted'] == 1)]\n",
 804    "fn = X_test_df[(X_test_df['actual'] == 1) & (X_test_df['predicted'] == 0)]\n",
 805    "\n",
 806    "print(f\"\\nFalse Positive (์‹ค์ œ ์‚ฌ๋ง, ์˜ˆ์ธก ์ƒ์กด): {len(fp)}\")\n",
 807    "print(f\"False Negative (์‹ค์ œ ์ƒ์กด, ์˜ˆ์ธก ์‚ฌ๋ง): {len(fn)}\")\n",
 808    "\n",
 809    "print(\"\\nFalse Negative ์ƒ˜ํ”Œ (์ฒ˜์Œ 5๊ฐœ):\")\n",
 810    "print(fn.head())"
 811   ]
 812  },
 813  {
 814   "cell_type": "markdown",
 815   "metadata": {},
 816   "source": [
 817    "## 7. Kaggle ๊ฒฝ์ง„๋Œ€ํšŒ ์ „๋žต\n",
 818    "\n",
 819    "### 7.1 ์•™์ƒ๋ธ” ๊ธฐ๋ฒ•"
 820   ]
 821  },
 822  {
 823   "cell_type": "code",
 824   "execution_count": null,
 825   "metadata": {},
 826   "outputs": [],
 827   "source": [
 828    "# ์—ฌ๋Ÿฌ ๋ชจ๋ธ์˜ ์˜ˆ์ธก์„ ๊ฒฐํ•ฉ\n",
 829    "def simple_blend(models, X_train, y_train, X_test, weights=None):\n",
 830    "    \"\"\"๊ฐ„๋‹จํ•œ ๋ธ”๋ Œ๋”ฉ ์•™์ƒ๋ธ”\"\"\"\n",
 831    "    if weights is None:\n",
 832    "        weights = [1/len(models)] * len(models)\n",
 833    "    \n",
 834    "    predictions = np.zeros(len(X_test))\n",
 835    "    \n",
 836    "    for model, weight in zip(models, weights):\n",
 837    "        model.fit(X_train, y_train)\n",
 838    "        pred_proba = model.predict_proba(X_test)[:, 1]\n",
 839    "        predictions += weight * pred_proba\n",
 840    "    \n",
 841    "    return (predictions > 0.5).astype(int)\n",
 842    "\n",
 843    "\n",
 844    "# ์•™์ƒ๋ธ” ๋ชจ๋ธ\n",
 845    "ensemble_models = [\n",
 846    "    RandomForestClassifier(n_estimators=200, random_state=42),\n",
 847    "    GradientBoostingClassifier(n_estimators=100, random_state=42),\n",
 848    "    LogisticRegression(max_iter=1000, random_state=42)\n",
 849    "]\n",
 850    "\n",
  851    "# LogisticRegression์€ ์Šค์ผ€์ผ๋ง๋œ ๋ฐ์ดํ„ฐ๊ฐ€ ํ•„์š”ํ•˜๋ฏ€๋กœ ์—ฌ๊ธฐ์„œ๋Š” ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋‘ ๋ชจ๋ธ๋งŒ ๋ธ”๋ Œ๋”ฉ\n",
 852    "y_pred_ensemble = simple_blend(\n",
 853    "    [ensemble_models[0], ensemble_models[1]], \n",
 854    "    X_train, y_train, X_test\n",
 855    ")\n",
 856    "\n",
 857    "# ํ‰๊ฐ€\n",
 858    "ensemble_acc = accuracy_score(y_test, y_pred_ensemble)\n",
 859    "print(f\"์•™์ƒ๋ธ” ์ •ํ™•๋„: {ensemble_acc:.4f}\")\n",
 860    "print(f\"์ตœ๊ณ  ๋‹จ์ผ ๋ชจ๋ธ ์ •ํ™•๋„: {best_model.score(X_test, y_test):.4f}\")\n",
 861    "print(f\"ํ–ฅ์ƒ: {(ensemble_acc - best_model.score(X_test, y_test)):.4f}\")"
 862   ]
 863  },
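 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "์ˆ˜์ž‘์—… ๋ธ”๋ Œ๋”ฉ ๋Œ€์‹  scikit-learn์˜ `VotingClassifier`๋ฅผ ์“ฐ๋ฉด ๊ฐ™์€ ์•„์ด๋””์–ด๋ฅผ ํ•œ ๊ฐ์ฒด๋กœ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์œ„์™€ ๊ฐ™์€ ๋‘ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์„ soft voting(์˜ˆ์ธก ํ™•๋ฅ  ํ‰๊ท )์œผ๋กœ ๊ฒฐํ•ฉํ•˜๋Š” ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "from sklearn.ensemble import VotingClassifier\n",
   "\n",
   "# soft voting: ๊ฐ ๋ชจ๋ธ์˜ ์˜ˆ์ธก ํ™•๋ฅ ์„ ํ‰๊ท ํ•˜์—ฌ ์ตœ์ข… ํด๋ž˜์Šค ๊ฒฐ์ •\n",
   "voting = VotingClassifier(\n",
   "    estimators=[\n",
   "        ('rf', RandomForestClassifier(n_estimators=200, random_state=42)),\n",
   "        ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42))\n",
   "    ],\n",
   "    voting='soft'\n",
   ")\n",
   "voting.fit(X_train, y_train)\n",
   "print(f\"VotingClassifier ์ •ํ™•๋„: {voting.score(X_test, y_test):.4f}\")"
  ]
 },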
 864  {
 865   "cell_type": "markdown",
 866   "metadata": {},
 867   "source": [
 868    "### 7.2 Kaggle ์ œ์ถœ ํŒŒ์ผ ํ˜•์‹"
 869   ]
 870  },
 871  {
 872   "cell_type": "code",
 873   "execution_count": null,
 874   "metadata": {},
 875   "outputs": [],
 876   "source": [
 877    "# Kaggle ์ œ์ถœ์šฉ ์˜ˆ์ธก ์ƒ์„ฑ (์‹ค์ œ Kaggle์—์„œ๋Š” test.csv ์‚ฌ์šฉ)\n",
 878    "# ์—ฌ๊ธฐ์„œ๋Š” ์˜ˆ์‹œ๋กœ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์‚ฌ์šฉ\n",
 879    "\n",
 880    "submission = pd.DataFrame({\n",
 881    "    'PassengerId': range(1, len(y_pred) + 1),  # ์‹ค์ œ๋กœ๋Š” test.csv์˜ PassengerId ์‚ฌ์šฉ\n",
 882    "    'Survived': y_pred\n",
 883    "})\n",
 884    "\n",
 885    "print(\"์ œ์ถœ ํŒŒ์ผ ํ˜•์‹:\")\n",
 886    "print(submission.head(10))\n",
 887    "\n",
 888    "# CSV๋กœ ์ €์žฅ\n",
 889    "# submission.to_csv('titanic_submission.csv', index=False)\n",
 890    "# print(\"\\nsubmission.csv ์ €์žฅ ์™„๋ฃŒ\")"
 891   ]
 892  },
 893  {
 894   "cell_type": "markdown",
 895   "metadata": {},
 896   "source": [
 897    "## 8. Kaggle ํ•„์ˆ˜ ํŒ\n",
 898    "\n",
 899    "### 8.1 ๊ฒฝ์ง„๋Œ€ํšŒ ์ฒดํฌ๋ฆฌ์ŠคํŠธ\n",
 900    "\n",
 901    "**1. ๋น ๋ฅธ ์‹œ์ž‘**\n",
 902    "- Baseline ์ฝ”๋“œ ์‹คํ–‰ํ•˜์—ฌ ์ฒซ ์ œ์ถœ\n",
 903    "- ๋ฆฌ๋”๋ณด๋“œ ์œ„์น˜ ํ™•์ธ\n",
 904    "\n",
 905    "**2. EDA ์ง‘์ค‘**\n",
 906    "- ๋ฐ์ดํ„ฐ ์ดํ•ด๊ฐ€ ํ•ต์‹ฌ\n",
 907    "- ๊ฒฐ์ธก์น˜, ์ด์ƒ์น˜, ๋ถ„ํฌ ํŒŒ์•…\n",
 908    "- ํƒ€๊ฒŸ๊ณผ์˜ ๊ด€๊ณ„ ๋ถ„์„\n",
 909    "\n",
 910    "**3. ํŠน์„ฑ ์—”์ง€๋‹ˆ์–ด๋ง**\n",
 911    "- ๋„๋ฉ”์ธ ์ง€์‹ ํ™œ์šฉ\n",
 912    "- ๊ต์ฐจ ํŠน์„ฑ ์ƒ์„ฑ (์˜ˆ: family_size)\n",
 913    "- ๊ทธ๋ฃน๋ณ„ ํ†ต๊ณ„๋Ÿ‰ (์˜ˆ: ๊ทธ๋ฃน๋ณ„ ํ‰๊ท )\n",
 914    "\n",
 915    "**4. ๋‹ค์–‘ํ•œ ๋ชจ๋ธ ์‹œ๋„**\n",
 916    "- ์„ ํ˜• ๋ชจ๋ธ โ†’ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ โ†’ ์•™์ƒ๋ธ”\n",
 917    "- ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹\n",
 918    "\n",
 919    "**5. ์•™์ƒ๋ธ”**\n",
 920    "- ๋‹ค๋ฅธ ๋ชจ๋ธ ์˜ˆ์ธก ๊ฒฐํ•ฉ\n",
 921    "- ๋ธ”๋ Œ๋”ฉ, ์Šคํƒœํ‚น\n",
 922    "\n",
 923    "**6. ๊ฒ€์ฆ ์ „๋žต**\n",
 924    "- ๋กœ์ปฌ CV์™€ ๋ฆฌ๋”๋ณด๋“œ ์ ์ˆ˜ ์ผ์น˜ ํ™•์ธ\n",
 925    "- ๊ณผ์ ํ•ฉ ์ฃผ์˜ (Public LB์— ๋งž์ถ”์ง€ ๋ง ๊ฒƒ)\n",
 926    "\n",
 927    "### 8.2 ๊ต์ฐจ ๊ฒ€์ฆ ์ „๋žต"
 928   ]
 929  },
 930  {
 931   "cell_type": "code",
 932   "execution_count": null,
 933   "metadata": {},
 934   "outputs": [],
 935   "source": [
 936    "def cross_validate_model(model, X, y, n_splits=5, stratified=True):\n",
 937    "    \"\"\"\n",
 938    "    ๊ต์ฐจ ๊ฒ€์ฆ ์ˆ˜ํ–‰\n",
 939    "    \n",
 940    "    Parameters:\n",
 941    "    -----------\n",
 942    "    model : sklearn estimator\n",
 943    "    X : features\n",
 944    "    y : target\n",
 945    "    n_splits : ํด๋“œ ์ˆ˜\n",
 946    "    stratified : ๊ณ„์ธตํ™” ์—ฌ๋ถ€\n",
 947    "    \"\"\"\n",
 948    "    if stratified:\n",
 949    "        kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)\n",
 950    "    else:\n",
 951    "        kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)\n",
 952    "    \n",
 953    "    scores = []\n",
 954    "    \n",
 955    "    for fold, (train_idx, val_idx) in enumerate(kf.split(X, y)):\n",
 956    "        X_train_fold = X.iloc[train_idx]\n",
 957    "        X_val_fold = X.iloc[val_idx]\n",
 958    "        y_train_fold = y.iloc[train_idx]\n",
 959    "        y_val_fold = y.iloc[val_idx]\n",
 960    "        \n",
 961    "        model.fit(X_train_fold, y_train_fold)\n",
 962    "        score = model.score(X_val_fold, y_val_fold)\n",
 963    "        scores.append(score)\n",
 964    "        \n",
 965    "        print(f\"Fold {fold+1}: {score:.4f}\")\n",
 966    "    \n",
 967    "    print(f\"\\nMean: {np.mean(scores):.4f} (+/- {np.std(scores):.4f})\")\n",
 968    "    return np.mean(scores)\n",
 969    "\n",
 970    "\n",
 971    "# ์‚ฌ์šฉ ์˜ˆ์‹œ\n",
 972    "print(\"=== Random Forest ๊ต์ฐจ ๊ฒ€์ฆ ===\")\n",
 973    "cv_score = cross_validate_model(\n",
 974    "    RandomForestClassifier(n_estimators=100, random_state=42),\n",
 975    "    X, y, n_splits=5\n",
 976    ")"
 977   ]
 978  },
 979  {
 980   "cell_type": "markdown",
 981   "metadata": {},
 982   "source": [
 983    "## ์š”์•ฝ\n",
 984    "\n",
 985    "### ํ”„๋กœ์ ํŠธ ์›Œํฌํ”Œ๋กœ์šฐ\n",
 986    "\n",
 987    "1. **๋ฌธ์ œ ์ •์˜**: ๋ชฉํ‘œ์™€ ํ‰๊ฐ€ ์ง€ํ‘œ ์„ค์ •\n",
 988    "2. **๋ฐ์ดํ„ฐ ํƒ์ƒ‰**: EDA๋กœ ๋ฐ์ดํ„ฐ ์ดํ•ด\n",
 989    "3. **์ „์ฒ˜๋ฆฌ**: ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ, ์ธ์ฝ”๋”ฉ\n",
 990    "4. **ํŠน์„ฑ ์—”์ง€๋‹ˆ์–ด๋ง**: ๋„๋ฉ”์ธ ์ง€์‹ ํ™œ์šฉ\n",
 991    "5. **๋ชจ๋ธ๋ง**: ์—ฌ๋Ÿฌ ๋ชจ๋ธ ๋น„๊ต\n",
 992    "6. **ํŠœ๋‹**: ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์ตœ์ ํ™”\n",
 993    "7. **ํ‰๊ฐ€**: ๋‹ค์–‘ํ•œ ์ง€ํ‘œ๋กœ ์„ฑ๋Šฅ ํ‰๊ฐ€\n",
 994    "8. **์•™์ƒ๋ธ”**: ์—ฌ๋Ÿฌ ๋ชจ๋ธ ๊ฒฐํ•ฉ\n",
 995    "\n",
 996    "### ํ•ต์‹ฌ ํฌ์ธํŠธ\n",
 997    "\n",
 998    "- **EDA๊ฐ€ ๊ฐ€์žฅ ์ค‘์š”**: ๋ฐ์ดํ„ฐ ์ดํ•ด ์—†์ด๋Š” ์ข‹์€ ๋ชจ๋ธ์„ ๋งŒ๋“ค ์ˆ˜ ์—†์Œ\n",
 999    "- **ํŠน์„ฑ ์—”์ง€๋‹ˆ์–ด๋ง**: ๋ชจ๋ธ ์„ฑ๋Šฅ ํ–ฅ์ƒ์˜ ํ•ต์‹ฌ\n",
1000    "- **๊ต์ฐจ ๊ฒ€์ฆ**: ๊ณผ์ ํ•ฉ ๋ฐฉ์ง€์™€ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ ํ™•์ธ\n",
1001    "- **์•™์ƒ๋ธ”**: ๋‹ค์–‘ํ•œ ๋ชจ๋ธ ๊ฒฐํ•ฉ์œผ๋กœ ์„ฑ๋Šฅ ํ–ฅ์ƒ\n",
1002    "- **๋ฐ˜๋ณต ๊ฐœ์„ **: ํ•œ ๋ฒˆ์— ์™„๋ฒฝํ•œ ๋ชจ๋ธ์€ ์—†์Œ, ์ง€์†์  ๊ฐœ์„ ์ด ํ•„์š”"
1003   ]
1004  }
1005 ],
1006 "metadata": {
1007  "kernelspec": {
1008   "display_name": "Python 3",
1009   "language": "python",
1010   "name": "python3"
1011  },
1012  "language_info": {
1013   "codemirror_mode": {
1014    "name": "ipython",
1015    "version": 3
1016   },
1017   "file_extension": ".py",
1018   "mimetype": "text/x-python",
1019   "name": "python",
1020   "nbconvert_exporter": "python",
1021   "pygments_lexer": "ipython3",
1022   "version": "3.8.0"
1023  }
1024 },
1025 "nbformat": 4,
1026 "nbformat_minor": 4
1027}