13_pipeline.ipynb

  1{
  2 "cells": [
  3  {
  4   "cell_type": "markdown",
  5   "metadata": {},
  6   "source": [
  7    "# Pipeline & Practice\n",
  8    "\n",
  9    "Learn how to combine preprocessing and modeling into a single workflow using sklearn's Pipeline and ColumnTransformer.\n",
 10    "\n",
 11    "**Learning goals:**\n",
 12    "- Understand why Pipelines are needed and what they offer\n",
 13    "- Preprocess features of mixed types with ColumnTransformer\n",
 14    "- Write custom transformers\n",
 15    "- Combine Pipeline with GridSearchCV\n",
 16    "- Save and deploy models"
 17   ]
 18  },
 19  {
 20   "cell_type": "code",
 21   "execution_count": null,
 22   "metadata": {},
 23   "outputs": [],
 24   "source": [
 25    "import numpy as np\n",
 26    "import pandas as pd\n",
 27    "import matplotlib.pyplot as plt\n",
 28    "import seaborn as sns\n",
 29    "\n",
 30    "from sklearn.pipeline import Pipeline, make_pipeline\n",
 31    "from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder\n",
 32    "from sklearn.decomposition import PCA\n",
 33    "from sklearn.linear_model import LogisticRegression\n",
 34    "from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV\n",
 35    "from sklearn.datasets import load_iris, load_breast_cancer\n",
 36    "\n",
 37    "import warnings\n",
 38    "warnings.filterwarnings('ignore')"
 39   ]
 40  },
 41  {
 42   "cell_type": "markdown",
 43   "metadata": {},
 44   "source": [
  45    "## 1. Pipeline basics\n",
  46    "\n",
  47    "### Problems when you don't use a Pipeline\n",
  48    "\n",
  49    "1. **Data leakage**: information from the test data can leak into training\n",
  50    "2. **Code complexity**: every step has to be managed by hand\n",
  51    "3. **Reproducibility**: steps can be run out of order or with mismatched parameters\n",
  52    "\n",
  53    "### Advantages of a Pipeline\n",
  54    "\n",
  55    "1. Simpler code\n",
  56    "2. Prevents data leakage\n",
  57    "3. Seamless integration with cross-validation\n",
  58    "4. Easy hyperparameter tuning\n",
  59    "5. Convenient model saving/deployment"
 60   ]
 61  },
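The data-leakage point above can be made concrete. A minimal sketch (using only sklearn's built-in iris data, not part of the original notebook) contrasting a scaler fit on the full dataset with a Pipeline that re-fits the scaler inside each CV fold:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Leaky: the scaler sees ALL rows (including future validation folds)
# before cross-validation ever splits the data.
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

# Safe: inside cross_val_score, the Pipeline re-fits the scaler on the
# training portion of each fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
safe_scores = cross_val_score(pipe, X, y, cv=5)

print(f"leaky CV mean:    {leaky_scores.mean():.4f}")
print(f"pipeline CV mean: {safe_scores.mean():.4f}")
```

On this small, clean dataset the two means are close; the point is that only the Pipeline version gives an honest estimate, because no fold's validation rows influence the scaler it is evaluated with.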
 62  {
 63   "cell_type": "code",
 64   "execution_count": null,
 65   "metadata": {},
 66   "outputs": [],
 67   "source": [
  68    "# Load data\n",
 69    "iris = load_iris()\n",
 70    "X_train, X_test, y_train, y_test = train_test_split(\n",
 71    "    iris.data, iris.target, test_size=0.2, random_state=42\n",
 72    ")\n",
 73    "\n",
  74    "# Create a Pipeline (explicit step names)\n",
 75    "pipeline = Pipeline([\n",
 76    "    ('scaler', StandardScaler()),\n",
 77    "    ('pca', PCA(n_components=2)),\n",
 78    "    ('classifier', LogisticRegression())\n",
 79    "])\n",
 80    "\n",
  81    "# Fit and predict\n",
 82    "pipeline.fit(X_train, y_train)\n",
 83    "y_pred = pipeline.predict(X_test)\n",
 84    "score = pipeline.score(X_test, y_test)\n",
 85    "\n",
  86    "print(f\"Pipeline accuracy: {score:.4f}\")\n",
 87    "\n",
  88    "# make_pipeline (step names generated automatically)\n",
 89    "pipeline_auto = make_pipeline(\n",
 90    "    StandardScaler(),\n",
 91    "    PCA(n_components=2),\n",
 92    "    LogisticRegression()\n",
 93    ")\n",
 94    "\n",
 95    "pipeline_auto.fit(X_train, y_train)\n",
  96    "print(f\"make_pipeline accuracy: {pipeline_auto.score(X_test, y_test):.4f}\")"
 97   ]
 98  },
 99  {
100   "cell_type": "markdown",
101   "metadata": {},
102   "source": [
 103    "### Accessing Pipeline steps"
104   ]
105  },
106  {
107   "cell_type": "code",
108   "execution_count": null,
109   "metadata": {},
110   "outputs": [],
111   "source": [
 112    "# Inspect step names\n",
 113    "print(\"Pipeline steps:\")\n",
 114    "for name, step in pipeline.named_steps.items():\n",
 115    "    print(f\"  {name}: {type(step).__name__}\")\n",
 116    "\n",
 117    "# Access a specific step\n",
 118    "print(f\"\\nPCA explained variance ratio: {pipeline.named_steps['pca'].explained_variance_ratio_}\")\n",
 119    "print(f\"Logistic regression coefficient shape: {pipeline.named_steps['classifier'].coef_.shape}\")\n",
 120    "\n",
 121    "# Get intermediate-step outputs\n",
 122    "X_scaled = pipeline.named_steps['scaler'].transform(X_test)\n",
 123    "X_pca = pipeline.named_steps['pca'].transform(X_scaled)\n",
 124    "print(f\"\\nShape after scaling: {X_scaled.shape}\")\n",
 125    "print(f\"Shape after PCA: {X_pca.shape}\")"
126   ]
127  },
128  {
129   "cell_type": "markdown",
130   "metadata": {},
131   "source": [
 132    "## 2. ColumnTransformer - handling features of mixed types\n",
 133    "\n",
 134    "Real-world data mixes numeric and categorical features. ColumnTransformer lets you apply the appropriate preprocessing to each type."
135   ]
136  },
137  {
138   "cell_type": "code",
139   "execution_count": null,
140   "metadata": {},
141   "outputs": [],
142   "source": [
143    "from sklearn.compose import ColumnTransformer\n",
144    "from sklearn.preprocessing import OrdinalEncoder\n",
145    "from sklearn.ensemble import RandomForestClassifier\n",
146    "\n",
 147    "# Create sample data\n",
148    "data = {\n",
149    "    'age': [25, 32, 47, 51, 62, 28, 35, 42, 55, 60],\n",
150    "    'income': [50000, 60000, 80000, 120000, 95000, 55000, 70000, 85000, 110000, 100000],\n",
151    "    'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],\n",
152    "    'education': ['Bachelor', 'Master', 'PhD', 'Bachelor', 'Master', \n",
153    "                  'Bachelor', 'PhD', 'Master', 'PhD', 'Bachelor'],\n",
154    "    'purchased': [0, 1, 1, 1, 0, 0, 1, 1, 1, 0]\n",
155    "}\n",
156    "df = pd.DataFrame(data)\n",
157    "\n",
158    "X = df.drop('purchased', axis=1)\n",
159    "y = df['purchased']\n",
160    "\n",
 161    "print(\"Data types:\")\n",
 162    "print(X.dtypes)\n",
 163    "print(\"\\nData sample:\")\n",
 164    "print(X.head())"
165   ]
166  },
167  {
168   "cell_type": "code",
169   "execution_count": null,
170   "metadata": {},
171   "outputs": [],
172   "source": [
 173    "# Split features by type\n",
174    "numeric_features = ['age', 'income']\n",
175    "categorical_features = ['gender', 'education']\n",
176    "\n",
 177    "# Define the ColumnTransformer\n",
178    "preprocessor = ColumnTransformer(\n",
179    "    transformers=[\n",
180    "        ('num', StandardScaler(), numeric_features),\n",
181    "        ('cat', OneHotEncoder(drop='first', sparse_output=False), categorical_features)\n",
182    "    ],\n",
 183    "    remainder='passthrough'  # what to do with remaining columns: 'drop' or 'passthrough'\n",
184    ")\n",
185    "\n",
 186    "# Transform\n",
187    "X_transformed = preprocessor.fit_transform(X)\n",
188    "\n",
 189    "print(f\"Original shape: {X.shape}\")\n",
 190    "print(f\"Shape after transform: {X_transformed.shape}\")\n",
191    "\n",
 192    "# Transformed feature names\n",
193    "feature_names = (\n",
194    "    numeric_features +\n",
195    "    list(preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features))\n",
196    ")\n",
 197    "print(f\"\\nFeature names: {feature_names}\")"
198   ]
199  },
200  {
201   "cell_type": "markdown",
202   "metadata": {},
203   "source": [
 204    "### Combining Pipeline and ColumnTransformer"
205   ]
206  },
207  {
208   "cell_type": "code",
209   "execution_count": null,
210   "metadata": {},
211   "outputs": [],
212   "source": [
 213    "# Full pipeline\n",
214    "full_pipeline = Pipeline([\n",
215    "    ('preprocessor', preprocessor),\n",
216    "    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))\n",
217    "])\n",
218    "\n",
 219    "# Fit\n",
220    "full_pipeline.fit(X, y)\n",
221    "\n",
 222    "# Predict on new data\n",
223    "new_data = pd.DataFrame({\n",
224    "    'age': [30, 45],\n",
225    "    'income': [70000, 90000],\n",
226    "    'gender': ['F', 'M'],\n",
227    "    'education': ['Master', 'PhD']\n",
228    "})\n",
229    "\n",
230    "predictions = full_pipeline.predict(new_data)\n",
 231    "print(f\"Predictions: {predictions}\")"
232   ]
233  },
234  {
235   "cell_type": "markdown",
236   "metadata": {},
237   "source": [
 238    "## 3. A more complex pipeline with missing-value handling"
239   ]
240  },
241  {
242   "cell_type": "code",
243   "execution_count": null,
244   "metadata": {},
245   "outputs": [],
246   "source": [
247    "from sklearn.impute import SimpleImputer\n",
248    "\n",
 249    "# Create data with missing values\n",
250    "data_missing = {\n",
251    "    'age': [25, np.nan, 47, 51, 62, 28, np.nan, 42, 55, 60],\n",
252    "    'income': [50000, 60000, np.nan, 120000, 95000, np.nan, 70000, 85000, 110000, 100000],\n",
253    "    'gender': ['M', 'F', 'M', None, 'M', 'F', 'M', None, 'M', 'F'],\n",
254    "    'education': ['Bachelor', 'Master', 'PhD', 'Bachelor', None, \n",
255    "                  'Bachelor', 'PhD', 'Master', None, 'Bachelor'],\n",
256    "    'purchased': [0, 1, 1, 1, 0, 0, 1, 1, 1, 0]\n",
257    "}\n",
258    "df_missing = pd.DataFrame(data_missing)\n",
259    "X_missing = df_missing.drop('purchased', axis=1)\n",
260    "y_missing = df_missing['purchased']\n",
261    "\n",
 262    "print(\"Missing-value counts:\")\n",
 263    "print(X_missing.isnull().sum())"
264   ]
265  },
266  {
267   "cell_type": "code",
268   "execution_count": null,
269   "metadata": {},
270   "outputs": [],
271   "source": [
 272    "# Numeric pipeline (with imputation)\n",
273    "numeric_transformer = Pipeline([\n",
274    "    ('imputer', SimpleImputer(strategy='median')),\n",
275    "    ('scaler', StandardScaler())\n",
276    "])\n",
277    "\n",
 278    "# Categorical pipeline (with imputation)\n",
279    "categorical_transformer = Pipeline([\n",
280    "    ('imputer', SimpleImputer(strategy='most_frequent')),\n",
281    "    ('encoder', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))\n",
282    "])\n",
283    "\n",
284    "# ColumnTransformer\n",
285    "preprocessor_full = ColumnTransformer(\n",
286    "    transformers=[\n",
287    "        ('num', numeric_transformer, numeric_features),\n",
288    "        ('cat', categorical_transformer, categorical_features)\n",
289    "    ]\n",
290    ")\n",
291    "\n",
 292    "# Full pipeline\n",
293    "complete_pipeline = Pipeline([\n",
294    "    ('preprocessor', preprocessor_full),\n",
295    "    ('classifier', RandomForestClassifier(random_state=42))\n",
296    "])\n",
297    "\n",
298    "complete_pipeline.fit(X_missing, y_missing)\n",
 299    "print(\"Pipeline with imputation fitted\")\n",
 300    "print(f\"Training accuracy: {complete_pipeline.score(X_missing, y_missing):.4f}\")"
301   ]
302  },
303  {
304   "cell_type": "markdown",
305   "metadata": {},
306   "source": [
 307    "## 4. Pipelines with cross-validation and hyperparameter tuning"
308   ]
309  },
310  {
311   "cell_type": "code",
312   "execution_count": null,
313   "metadata": {},
314   "outputs": [],
315   "source": [
 316    "# Practice on a real dataset\n",
317    "cancer = load_breast_cancer()\n",
318    "X, y = cancer.data, cancer.target\n",
319    "\n",
 320    "# Define the pipeline\n",
321    "pipeline_cv = Pipeline([\n",
322    "    ('scaler', StandardScaler()),\n",
323    "    ('classifier', LogisticRegression(max_iter=1000))\n",
324    "])\n",
325    "\n",
 326    "# Cross-validation (the right way: in each fold the scaler is fit on that fold's training data only)\n",
327    "scores = cross_val_score(pipeline_cv, X, y, cv=5, scoring='accuracy')\n",
328    "\n",
 329    "print(\"Cross-validation results:\")\n",
 330    "print(f\"  Per fold: {scores}\")\n",
 331    "print(f\"  Mean: {scores.mean():.4f} (+/- {scores.std():.4f})\")"
332   ]
333  },
334  {
335   "cell_type": "markdown",
336   "metadata": {},
337   "source": [
 338    "### Hyperparameter tuning with GridSearchCV\n",
 339    "\n",
 340    "In a Pipeline, hyperparameter names use the `step__parameter` format."
341   ]
342  },
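A quick way to discover the valid `step__parameter` names is `get_params()`, which every Pipeline exposes; a short standalone sketch (the variable names here are illustrative, not from the notebook):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# get_params() lists every tunable name, including the nested
# step__parameter keys that a param_grid must use.
names = sorted(pipe.get_params().keys())
print([n for n in names if n.startswith('classifier__')][:5])

# The same naming scheme works for direct assignment:
pipe.set_params(classifier__C=10)
print(pipe.named_steps['classifier'].C)
```

Whatever `get_params()` prints is exactly what GridSearchCV will accept as a `param_grid` key, so this is a handy check before launching a long search.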
343  {
344   "cell_type": "code",
345   "execution_count": null,
346   "metadata": {},
347   "outputs": [],
348   "source": [
 349    "# Parameter grid (step__parameter format)\n",
350    "param_grid = {\n",
351    "    'scaler': [StandardScaler(), MinMaxScaler()],\n",
352    "    'classifier__C': [0.1, 1, 10],\n",
353    "    'classifier__penalty': ['l1', 'l2'],\n",
354    "    'classifier__solver': ['liblinear']\n",
355    "}\n",
356    "\n",
357    "# Grid Search\n",
358    "grid_search = GridSearchCV(\n",
359    "    pipeline_cv,\n",
360    "    param_grid,\n",
361    "    cv=5,\n",
362    "    scoring='accuracy',\n",
363    "    n_jobs=-1,\n",
364    "    verbose=1\n",
365    ")\n",
366    "\n",
367    "grid_search.fit(X, y)\n",
368    "\n",
 369    "print(\"\\nGrid Search results:\")\n",
 370    "print(f\"  Best parameters: {grid_search.best_params_}\")\n",
 371    "print(f\"  Best score: {grid_search.best_score_:.4f}\")"
372   ]
373  },
374  {
375   "cell_type": "markdown",
376   "metadata": {},
377   "source": [
 378    "### Comparing multiple models"
379   ]
380  },
381  {
382   "cell_type": "code",
383   "execution_count": null,
384   "metadata": {},
385   "outputs": [],
386   "source": [
387    "from sklearn.ensemble import RandomForestClassifier\n",
388    "from sklearn.svm import SVC\n",
389    "\n",
 390    "# One pipeline for several candidate models\n",
391    "pipeline_multi = Pipeline([\n",
392    "    ('scaler', StandardScaler()),\n",
393    "    ('classifier', LogisticRegression())  # placeholder\n",
394    "])\n",
395    "\n",
 396    "# Different parameter grids per model\n",
397    "param_grid_multi = [\n",
398    "    {\n",
399    "        'classifier': [LogisticRegression(max_iter=1000)],\n",
400    "        'classifier__C': [0.1, 1, 10]\n",
401    "    },\n",
402    "    {\n",
403    "        'classifier': [RandomForestClassifier(random_state=42)],\n",
404    "        'classifier__n_estimators': [50, 100],\n",
405    "        'classifier__max_depth': [None, 5, 10]\n",
406    "    },\n",
407    "    {\n",
408    "        'classifier': [SVC()],\n",
409    "        'classifier__C': [0.1, 1],\n",
410    "        'classifier__kernel': ['rbf', 'linear']\n",
411    "    }\n",
412    "]\n",
413    "\n",
414    "grid_search_multi = GridSearchCV(\n",
415    "    pipeline_multi,\n",
416    "    param_grid_multi,\n",
417    "    cv=5,\n",
418    "    scoring='accuracy',\n",
419    "    n_jobs=-1,\n",
420    "    verbose=1\n",
421    ")\n",
422    "\n",
423    "grid_search_multi.fit(X, y)\n",
424    "\n",
 425    "print(\"\\nModel comparison results:\")\n",
 426    "print(f\"  Best model: {type(grid_search_multi.best_params_['classifier']).__name__}\")\n",
 427    "print(f\"  Best parameters: {grid_search_multi.best_params_}\")\n",
 428    "print(f\"  Best score: {grid_search_multi.best_score_:.4f}\")"
429   ]
430  },
431  {
432   "cell_type": "markdown",
433   "metadata": {},
434   "source": [
 435    "## 5. Saving and loading models\n",
 436    "\n",
 437    "A trained pipeline can be saved and loaded again later."
438   ]
439  },
440  {
441   "cell_type": "code",
442   "execution_count": null,
443   "metadata": {},
444   "outputs": [],
445   "source": [
446    "import joblib\n",
447    "import pickle\n",
448    "import sklearn\n",
449    "from datetime import datetime\n",
450    "\n",
 451    "# Best model\n",
452    "best_pipeline = grid_search.best_estimator_\n",
453    "\n",
 454    "# 1. Save with joblib (recommended)\n",
 455    "joblib.dump(best_pipeline, 'best_model.joblib')\n",
 456    "print(\"Model saved: best_model.joblib\")\n",
457    "\n",
 458    "# Load the model\n",
459    "loaded_model = joblib.load('best_model.joblib')\n",
460    "\n",
 461    "# Test\n",
462    "X_test_sample = X[:5]\n",
463    "predictions = loaded_model.predict(X_test_sample)\n",
 464    "print(f\"Loaded model predictions: {predictions}\")"
465   ]
466  },
467  {
468   "cell_type": "code",
469   "execution_count": null,
470   "metadata": {},
471   "outputs": [],
472   "source": [
 473    "# 2. Save with pickle\n",
474    "with open('model.pkl', 'wb') as f:\n",
475    "    pickle.dump(best_pipeline, f)\n",
476    "\n",
 477    "# Load with pickle\n",
478    "with open('model.pkl', 'rb') as f:\n",
479    "    loaded_model_pkl = pickle.load(f)\n",
480    "\n",
 481    "print(\"pickle model predictions:\", loaded_model_pkl.predict(X[:3]))"
482   ]
483  },
484  {
485   "cell_type": "markdown",
486   "metadata": {},
487   "source": [
 488    "### Saving with metadata (recommended)"
489   ]
490  },
491  {
492   "cell_type": "code",
493   "execution_count": null,
494   "metadata": {},
495   "outputs": [],
496   "source": [
 497    "# Save together with metadata\n",
498    "model_metadata = {\n",
499    "    'model': best_pipeline,\n",
500    "    'sklearn_version': sklearn.__version__,\n",
501    "    'training_date': datetime.now().isoformat(),\n",
502    "    'feature_names': list(cancer.feature_names),\n",
503    "    'target_names': list(cancer.target_names),\n",
504    "    'cv_score': grid_search.best_score_,\n",
505    "    'best_params': grid_search.best_params_\n",
506    "}\n",
507    "\n",
508    "joblib.dump(model_metadata, 'model_with_metadata.joblib')\n",
509    "\n",
 510    "# Load and verify\n",
 511    "loaded_metadata = joblib.load('model_with_metadata.joblib')\n",
 512    "print(\"Model metadata:\")\n",
 513    "print(f\"  Training date: {loaded_metadata['training_date']}\")\n",
 514    "print(f\"  sklearn version: {loaded_metadata['sklearn_version']}\")\n",
 515    "print(f\"  CV score: {loaded_metadata['cv_score']:.4f}\")\n",
 516    "print(f\"  Best parameters: {loaded_metadata['best_params']}\")"
517   ]
518  },
519  {
520   "cell_type": "markdown",
521   "metadata": {},
522   "source": [
 523    "## 6. Writing custom transformers\n",
 524    "\n",
 525    "You can build your own transformer by inheriting from sklearn's BaseEstimator and TransformerMixin."
526   ]
527  },
528  {
529   "cell_type": "code",
530   "execution_count": null,
531   "metadata": {},
532   "outputs": [],
533   "source": [
534    "from sklearn.base import BaseEstimator, TransformerMixin\n",
535    "\n",
536    "class OutlierRemover(BaseEstimator, TransformerMixin):\n",
 537    "    \"\"\"Transformer that clips outliers to boundary values (despite the name, no rows are removed)\"\"\"\n",
538    "    \n",
539    "    def __init__(self, threshold=3):\n",
540    "        self.threshold = threshold\n",
541    "        self.mean_ = None\n",
542    "        self.std_ = None\n",
543    "    \n",
544    "    def fit(self, X, y=None):\n",
545    "        self.mean_ = np.mean(X, axis=0)\n",
546    "        self.std_ = np.std(X, axis=0)\n",
547    "        return self\n",
548    "    \n",
549    "    def transform(self, X):\n",
550    "        X = np.array(X)\n",
551    "        z_scores = np.abs((X - self.mean_) / (self.std_ + 1e-10))\n",
 552    "        # Replace outliers with the boundary value\n",
553    "        X_clipped = np.where(z_scores > self.threshold,\n",
554    "                             self.mean_ + self.threshold * self.std_ * np.sign(X - self.mean_),\n",
555    "                             X)\n",
556    "        return X_clipped\n",
557    "\n",
558    "\n",
559    "class FeatureSelector(BaseEstimator, TransformerMixin):\n",
 560    "    \"\"\"Feature-selection transformer\"\"\"\n",
561    "    \n",
562    "    def __init__(self, feature_indices=None):\n",
563    "        self.feature_indices = feature_indices\n",
564    "    \n",
565    "    def fit(self, X, y=None):\n",
566    "        return self\n",
567    "    \n",
568    "    def transform(self, X):\n",
569    "        X = np.array(X)\n",
570    "        if self.feature_indices is not None:\n",
571    "            return X[:, self.feature_indices]\n",
572    "        return X\n",
573    "\n",
574    "\n",
 575    "# Use the custom transformers\n",
576    "custom_pipeline = Pipeline([\n",
577    "    ('outlier', OutlierRemover(threshold=3)),\n",
578    "    ('scaler', StandardScaler()),\n",
579    "    ('classifier', LogisticRegression(max_iter=1000))\n",
580    "])\n",
581    "\n",
582    "scores = cross_val_score(custom_pipeline, X, y, cv=5)\n",
 583    "print(f\"Custom-transformer CV score: {scores.mean():.4f} (+/- {scores.std():.4f})\")"
584   ]
585  },
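GridSearchCV relies on `clone()` re-creating estimators from their constructor parameters, which is why `__init__` in a custom transformer must only store its arguments unchanged. A minimal standalone sketch with a toy `ClipTransformer` (a made-up class for illustration, not part of the notebook):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone

class ClipTransformer(BaseEstimator, TransformerMixin):
    """Toy transformer: clips values to [low, high].

    __init__ only stores its arguments verbatim, so the inherited
    get_params()/set_params() (and therefore clone and GridSearchCV)
    can rebuild an identical, unfitted copy.
    """
    def __init__(self, low=-1.0, high=1.0):
        self.low = low
        self.high = high

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return np.clip(X, self.low, self.high)

t = ClipTransformer(high=2.0)
print(t.get_params())   # inherited from BaseEstimator
t2 = clone(t)           # unfitted copy with identical parameters
print(t2.high)
print(t2.transform(np.array([[-5.0, 0.5, 9.0]])))  # clipped to [low, high]
```

If `__init__` instead computed or renamed its arguments, `clone()` would silently build a different estimator, which is the most common source of "works alone, breaks inside GridSearchCV" bugs.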
586  {
587   "cell_type": "markdown",
588   "metadata": {},
589   "source": [
 590    "## 7. A practical template - a factory function for classification pipelines"
591   ]
592  },
593  {
594   "cell_type": "code",
595   "execution_count": null,
596   "metadata": {},
597   "outputs": [],
598   "source": [
599    "from sklearn.compose import make_column_selector\n",
600    "\n",
601    "def create_classification_pipeline(model, numeric_features=None, categorical_features=None):\n",
602    "    \"\"\"\n",
 603    "    Create a pipeline for a classification problem.\n",
 604    "    \n",
 605    "    Parameters:\n",
 606    "    -----------\n",
 607    "    model : sklearn estimator\n",
 608    "        Classification model\n",
 609    "    numeric_features : list, optional\n",
 610    "        Names of the numeric features\n",
 611    "    categorical_features : list, optional\n",
 612    "        Names of the categorical features\n",
 613    "    \n",
 614    "    Returns:\n",
 615    "    --------\n",
 616    "    pipeline : Pipeline\n",
 617    "        Preprocessing + model pipeline\n",
618    "    \"\"\"\n",
619    "    \n",
 620    "    # Numeric feature pipeline\n",
621    "    numeric_transformer = Pipeline([\n",
622    "        ('imputer', SimpleImputer(strategy='median')),\n",
623    "        ('scaler', StandardScaler())\n",
624    "    ])\n",
625    "    \n",
 626    "    # Categorical feature pipeline\n",
627    "    categorical_transformer = Pipeline([\n",
628    "        ('imputer', SimpleImputer(strategy='most_frequent')),\n",
629    "        ('encoder', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))\n",
630    "    ])\n",
631    "    \n",
632    "    # ColumnTransformer\n",
633    "    if numeric_features is None and categorical_features is None:\n",
 634    "        # Auto-detect feature types by dtype\n",
635    "        preprocessor = ColumnTransformer(\n",
636    "            transformers=[\n",
637    "                ('num', numeric_transformer, make_column_selector(dtype_include=np.number)),\n",
638    "                ('cat', categorical_transformer, make_column_selector(dtype_include=object))\n",
639    "            ]\n",
640    "        )\n",
641    "    else:\n",
642    "        preprocessor = ColumnTransformer(\n",
643    "            transformers=[\n",
644    "                ('num', numeric_transformer, numeric_features or []),\n",
645    "                ('cat', categorical_transformer, categorical_features or [])\n",
646    "            ]\n",
647    "        )\n",
648    "    \n",
 649    "    # Full pipeline\n",
650    "    pipeline = Pipeline([\n",
651    "        ('preprocessor', preprocessor),\n",
652    "        ('classifier', model)\n",
653    "    ])\n",
654    "    \n",
655    "    return pipeline\n",
656    "\n",
657    "\n",
 658    "# Usage example\n",
659    "from sklearn.ensemble import GradientBoostingClassifier\n",
660    "\n",
661    "pipeline_template = create_classification_pipeline(\n",
662    "    GradientBoostingClassifier(random_state=42),\n",
663    "    numeric_features=['age', 'income'],\n",
664    "    categorical_features=['gender', 'education']\n",
665    ")\n",
666    "\n",
 667    "print(\"Classification pipeline template created\")\n",
668    "print(pipeline_template)"
669   ]
670  },
671  {
672   "cell_type": "markdown",
673   "metadata": {},
674   "source": [
 675    "## 8. A model wrapper class for deployment"
676   ]
677  },
678  {
679   "cell_type": "code",
680   "execution_count": null,
681   "metadata": {},
682   "outputs": [],
683   "source": [
684    "class ModelWrapper:\n",
 685    "    \"\"\"Model wrapper for deployment\"\"\"\n",
686    "    \n",
687    "    def __init__(self, model_path):\n",
688    "        self.model = joblib.load(model_path)\n",
689    "        self.feature_names = None\n",
690    "    \n",
 691    "    def set_feature_names(self, names):\n",
 692    "        \"\"\"Set the expected feature names\"\"\"\n",
 693    "        self.feature_names = list(names)  # store as a plain list (names may arrive as a NumPy array)\n",
694    "    \n",
 695    "    def predict(self, input_data):\n",
 696    "        \"\"\"Accept dict or DataFrame input\"\"\"\n",
 697    "        if isinstance(input_data, dict):\n",
 698    "            input_data = pd.DataFrame([input_data])\n",
 699    "        \n",
 700    "        if self.feature_names is not None:  # avoid the ambiguous truth value of an array\n",
 701    "            input_data = input_data[self.feature_names]\n",
 702    "        \n",
 703    "        return self.model.predict(input_data)\n",
704    "    \n",
 705    "    def predict_proba(self, input_data):\n",
 706    "        \"\"\"Predict class probabilities\"\"\"\n",
 707    "        if isinstance(input_data, dict):\n",
 708    "            input_data = pd.DataFrame([input_data])\n",
 709    "        \n",
 710    "        if self.feature_names is not None:  # avoid the ambiguous truth value of an array\n",
 711    "            input_data = input_data[self.feature_names]\n",
 712    "        \n",
 713    "        return self.model.predict_proba(input_data)\n",
714    "\n",
715    "\n",
 716    "# Usage example\n",
717    "# wrapper = ModelWrapper('best_model.joblib')\n",
718    "# wrapper.set_feature_names(cancer.feature_names)\n",
719    "# prediction = wrapper.predict(X[0:1])\n",
 720    "# print(f\"Prediction: {prediction}\")"
721   ]
722  },
723  {
724   "cell_type": "markdown",
725   "metadata": {},
726   "source": [
 727    "## Summary & Best Practices\n",
 728    "\n",
 729    "### Advantages of using a Pipeline\n",
 730    "\n",
 731    "1. **Prevents data leakage**: during cross-validation, preprocessing in each fold is fit on that fold's training data only\n",
 732    "2. **Simpler code**: several steps are managed as one object\n",
 733    "3. **Reproducibility**: every preprocessing step is stored, guaranteeing identical processing\n",
 734    "4. **Easy deployment**: the whole workflow can be saved as a single file\n",
 735    "\n",
 736    "### Hyperparameter naming convention\n",
 737    "\n",
 738    "```python\n",
 739    "# Format: step_name__parameter_name\n",
 740    "'classifier__C'  # the C parameter of the classifier step\n",
 741    "'preprocessor__num__scaler__with_mean'  # a nested parameter\n",
 742    "```\n",
 743    "\n",
 744    "### Comparing ways to save a model\n",
 745    "\n",
 746    "| Method | Pros | Cons |\n",
 747    "|------|------|------|\n",
 748    "| joblib | handles large NumPy arrays efficiently | Python-only, mostly used with sklearn |\n",
 749    "| pickle | Python standard library | slow on large arrays |\n",
 750    "| ONNX | framework-independent, usable from many languages | requires a conversion step |\n",
 751    "\n",
 752    "### Checklist for practice\n",
 753    "\n",
 754    "- [ ] Always use a Pipeline to prevent data leakage\n",
 755    "- [ ] Separate numeric/categorical preprocessing with a ColumnTransformer\n",
 756    "- [ ] Include metadata when saving a model (version, date, performance, etc.)\n",
 757    "- [ ] Write an input-validation function\n",
 758    "- [ ] Make custom transformers inherit BaseEstimator and TransformerMixin\n",
 759    "- [ ] Tune the entire pipeline with GridSearchCV"
760   ]
761  }
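The "write an input-validation function" checklist item can be sketched as follows; `validate_input` and `EXPECTED_COLUMNS` are hypothetical names, modeled on the tutorial's purchase dataset, not an sklearn API:

```python
import pandas as pd

# Hypothetical schema for the purchase dataset used earlier in the notebook.
EXPECTED_COLUMNS = {'age': 'number', 'income': 'number',
                    'gender': 'category', 'education': 'category'}

def validate_input(df: pd.DataFrame) -> pd.DataFrame:
    """Pre-predict sanity check: required columns exist and numeric
    columns really are numeric, so pipeline.predict fails early and loudly."""
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for col, kind in EXPECTED_COLUMNS.items():
        if kind == 'number' and not pd.api.types.is_numeric_dtype(df[col]):
            raise TypeError(f"column {col!r} must be numeric")
    return df

ok = pd.DataFrame({'age': [30], 'income': [70000],
                   'gender': ['F'], 'education': ['Master']})
validate_input(ok)  # passes silently
print("validation passed")
```

Calling this before `pipeline.predict` turns a cryptic downstream transformer error into a clear message about which input field is wrong.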
762 ],
763 "metadata": {
764  "kernelspec": {
765   "display_name": "Python 3",
766   "language": "python",
767   "name": "python3"
768  },
769  "language_info": {
770   "codemirror_mode": {
771    "name": "ipython",
772    "version": 3
773   },
774   "file_extension": ".py",
775   "mimetype": "text/x-python",
776   "name": "python",
777   "nbconvert_exporter": "python",
778   "pygments_lexer": "ipython3",
779   "version": "3.8.0"
780  }
781 },
782 "nbformat": 4,
783 "nbformat_minor": 4
784}