13_pipeline.ipynb

  1{
  2 "cells": [
  3  {
  4   "cell_type": "markdown",
  5   "metadata": {},
  6   "source": [
  7    "# Pipeline & Practice\n",
  8    "\n",
  9    "Learn how to combine preprocessing and modeling into a single workflow using sklearn's Pipeline and ColumnTransformer.\n",
 10    "\n",
 11    "**Learning goals:**\n",
 12    "- Understand why Pipelines are needed and what they offer\n",
 13    "- Preprocess features of mixed types with ColumnTransformer\n",
 14    "- Write custom transformers\n",
 15    "- Combine Pipeline with GridSearchCV\n",
 16    "- Save and deploy models"
 17   ]
 18  },
 19  {
 20   "cell_type": "code",
 21   "execution_count": null,
 22   "metadata": {},
 23   "outputs": [],
 24   "source": [
 25    "import numpy as np\n",
 26    "import pandas as pd\n",
 27    "import matplotlib.pyplot as plt\n",
 28    "import seaborn as sns\n",
 29    "\n",
 30    "from sklearn.pipeline import Pipeline, make_pipeline\n",
 31    "from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder\n",
 32    "from sklearn.decomposition import PCA\n",
 33    "from sklearn.linear_model import LogisticRegression\n",
 34    "from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV\n",
 35    "from sklearn.datasets import load_iris, load_breast_cancer\n",
 36    "\n",
 37    "import warnings\n",
 38    "warnings.filterwarnings('ignore')"
 39   ]
 40  },
 41  {
 42   "cell_type": "markdown",
 43   "metadata": {},
 44   "source": [
  45    "## 1. Pipeline basics\n",
  46    "\n",
  47    "### Problems when you don't use a Pipeline\n",
  48    "\n",
  49    "1. **Data leakage**: information from the test data can leak into training\n",
  50    "2. **Code complexity**: every step has to be managed by hand\n",
  51    "3. **Reproducibility**: steps can be run out of order or with mismatched parameters\n",
  52    "\n",
  53    "### Advantages of a Pipeline\n",
  54    "\n",
  55    "1. Simpler code\n",
  56    "2. Prevents data leakage\n",
  57    "3. Seamless integration with cross-validation\n",
  58    "4. Easy hyperparameter tuning\n",
  59    "5. Convenient model saving/deployment"
 60   ]
 61  },
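The data-leakage point above can be made concrete. A minimal sketch (using only sklearn's built-in iris data, not part of the original notebook) contrasting a scaler fit on the full dataset with a Pipeline that re-fits the scaler inside each CV fold:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Leaky: the scaler sees ALL rows (including future validation folds)
# before cross-validation ever splits the data.
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

# Safe: inside cross_val_score, the Pipeline re-fits the scaler on the
# training portion of each fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
safe_scores = cross_val_score(pipe, X, y, cv=5)

print(f"leaky CV mean:    {leaky_scores.mean():.4f}")
print(f"pipeline CV mean: {safe_scores.mean():.4f}")
```

On this small, clean dataset the two means are close; the point is that only the Pipeline version gives an honest estimate, because no fold's validation rows influence the scaler it is evaluated with.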
 62  {
 63   "cell_type": "code",
 64   "execution_count": null,
 65   "metadata": {},
 66   "outputs": [],
 67   "source": [
  68    "# Load data\n",
 69    "iris = load_iris()\n",
 70    "X_train, X_test, y_train, y_test = train_test_split(\n",
 71    "    iris.data, iris.target, test_size=0.2, random_state=42\n",
 72    ")\n",
 73    "\n",
  74    "# Create a Pipeline (explicit step names)\n",
 75    "pipeline = Pipeline([\n",
 76    "    ('scaler', StandardScaler()),\n",
 77    "    ('pca', PCA(n_components=2)),\n",
 78    "    ('classifier', LogisticRegression())\n",
 79    "])\n",
 80    "\n",
  81    "# Fit and predict\n",
 82    "pipeline.fit(X_train, y_train)\n",
 83    "y_pred = pipeline.predict(X_test)\n",
 84    "score = pipeline.score(X_test, y_test)\n",
 85    "\n",
  86    "print(f\"Pipeline accuracy: {score:.4f}\")\n",
 87    "\n",
  88    "# make_pipeline (step names generated automatically)\n",
 89    "pipeline_auto = make_pipeline(\n",
 90    "    StandardScaler(),\n",
 91    "    PCA(n_components=2),\n",
 92    "    LogisticRegression()\n",
 93    ")\n",
 94    "\n",
 95    "pipeline_auto.fit(X_train, y_train)\n",
  96    "print(f\"make_pipeline accuracy: {pipeline_auto.score(X_test, y_test):.4f}\")"
 97   ]
 98  },
 99  {
100   "cell_type": "markdown",
101   "metadata": {},
102   "source": [
 103    "### Accessing Pipeline steps"
104   ]
105  },
106  {
107   "cell_type": "code",
108   "execution_count": null,
109   "metadata": {},
110   "outputs": [],
111   "source": [
 112    "# Inspect step names\n",
 113    "print(\"Pipeline steps:\")\n",
 114    "for name, step in pipeline.named_steps.items():\n",
 115    "    print(f\"  {name}: {type(step).__name__}\")\n",
 116    "\n",
 117    "# Access a specific step\n",
 118    "print(f\"\\nPCA explained variance ratio: {pipeline.named_steps['pca'].explained_variance_ratio_}\")\n",
 119    "print(f\"Logistic regression coefficient shape: {pipeline.named_steps['classifier'].coef_.shape}\")\n",
 120    "\n",
 121    "# Get intermediate-step outputs\n",
 122    "X_scaled = pipeline.named_steps['scaler'].transform(X_test)\n",
 123    "X_pca = pipeline.named_steps['pca'].transform(X_scaled)\n",
 124    "print(f\"\\nShape after scaling: {X_scaled.shape}\")\n",
 125    "print(f\"Shape after PCA: {X_pca.shape}\")"
126   ]
127  },
128  {
129   "cell_type": "markdown",
130   "metadata": {},
131   "source": [
 132    "## 2. ColumnTransformer - handling features of mixed types\n",
 133    "\n",
 134    "Real-world data mixes numeric and categorical features. ColumnTransformer lets you apply the appropriate preprocessing to each type."
135   ]
136  },
137  {
138   "cell_type": "code",
139   "execution_count": null,
140   "metadata": {},
141   "outputs": [],
142   "source": [
143    "from sklearn.compose import ColumnTransformer\n",
144    "from sklearn.preprocessing import OrdinalEncoder\n",
145    "from sklearn.ensemble import RandomForestClassifier\n",
146    "\n",
 147    "# Create sample data\n",
148    "data = {\n",
149    "    'age': [25, 32, 47, 51, 62, 28, 35, 42, 55, 60],\n",
150    "    'income': [50000, 60000, 80000, 120000, 95000, 55000, 70000, 85000, 110000, 100000],\n",
151    "    'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],\n",
152    "    'education': ['Bachelor', 'Master', 'PhD', 'Bachelor', 'Master', \n",
153    "                  'Bachelor', 'PhD', 'Master', 'PhD', 'Bachelor'],\n",
154    "    'purchased': [0, 1, 1, 1, 0, 0, 1, 1, 1, 0]\n",
155    "}\n",
156    "df = pd.DataFrame(data)\n",
157    "\n",
158    "X = df.drop('purchased', axis=1)\n",
159    "y = df['purchased']\n",
160    "\n",
 161    "print(\"Data types:\")\n",
 162    "print(X.dtypes)\n",
 163    "print(\"\\nData sample:\")\n",
 164    "print(X.head())"
165   ]
166  },
167  {
168   "cell_type": "code",
169   "execution_count": null,
170   "metadata": {},
171   "outputs": [],
172   "source": [
 173    "# Split features by type\n",
174    "numeric_features = ['age', 'income']\n",
175    "categorical_features = ['gender', 'education']\n",
176    "\n",
 177    "# Define the ColumnTransformer\n",
178    "preprocessor = ColumnTransformer(\n",
179    "    transformers=[\n",
180    "        ('num', StandardScaler(), numeric_features),\n",
181    "        ('cat', OneHotEncoder(drop='first', sparse_output=False), categorical_features)\n",
182    "    ],\n",
 183    "    remainder='passthrough'  # what to do with remaining columns: 'drop' or 'passthrough'\n",
184    ")\n",
185    "\n",
 186    "# Transform\n",
187    "X_transformed = preprocessor.fit_transform(X)\n",
188    "\n",
 189    "print(f\"Original shape: {X.shape}\")\n",
 190    "print(f\"Shape after transform: {X_transformed.shape}\")\n",
191    "\n",
 192    "# Transformed feature names\n",
193    "feature_names = (\n",
194    "    numeric_features +\n",
195    "    list(preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features))\n",
196    ")\n",
 197    "print(f\"\\nFeature names: {feature_names}\")"
198   ]
199  },
200  {
201   "cell_type": "markdown",
202   "metadata": {},
203   "source": [
 204    "### Combining Pipeline and ColumnTransformer"
205   ]
206  },
207  {
208   "cell_type": "code",
209   "execution_count": null,
210   "metadata": {},
211   "outputs": [],
212   "source": [
 213    "# Full pipeline\n",
214    "full_pipeline = Pipeline([\n",
215    "    ('preprocessor', preprocessor),\n",
216    "    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))\n",
217    "])\n",
218    "\n",
 219    "# Fit\n",
220    "full_pipeline.fit(X, y)\n",
221    "\n",
 222    "# Predict on new data\n",
223    "new_data = pd.DataFrame({\n",
224    "    'age': [30, 45],\n",
225    "    'income': [70000, 90000],\n",
226    "    'gender': ['F', 'M'],\n",
227    "    'education': ['Master', 'PhD']\n",
228    "})\n",
229    "\n",
230    "predictions = full_pipeline.predict(new_data)\n",
 231    "print(f\"Predictions: {predictions}\")"
232   ]
233  },
234  {
235   "cell_type": "markdown",
236   "metadata": {},
237   "source": [
 238    "## 3. A more complex pipeline with missing-value handling"
239   ]
240  },
241  {
242   "cell_type": "code",
243   "execution_count": null,
244   "metadata": {},
245   "outputs": [],
246   "source": [
247    "from sklearn.impute import SimpleImputer\n",
248    "\n",
 249    "# Create data with missing values\n",
250    "data_missing = {\n",
251    "    'age': [25, np.nan, 47, 51, 62, 28, np.nan, 42, 55, 60],\n",
252    "    'income': [50000, 60000, np.nan, 120000, 95000, np.nan, 70000, 85000, 110000, 100000],\n",
253    "    'gender': ['M', 'F', 'M', None, 'M', 'F', 'M', None, 'M', 'F'],\n",
254    "    'education': ['Bachelor', 'Master', 'PhD', 'Bachelor', None, \n",
255    "                  'Bachelor', 'PhD', 'Master', None, 'Bachelor'],\n",
256    "    'purchased': [0, 1, 1, 1, 0, 0, 1, 1, 1, 0]\n",
257    "}\n",
258    "df_missing = pd.DataFrame(data_missing)\n",
259    "X_missing = df_missing.drop('purchased', axis=1)\n",
260    "y_missing = df_missing['purchased']\n",
261    "\n",
 262    "print(\"Missing-value counts:\")\n",
 263    "print(X_missing.isnull().sum())"
264   ]
265  },
266  {
267   "cell_type": "code",
268   "execution_count": null,
269   "metadata": {},
270   "outputs": [],
271   "source": [
 272    "# Numeric pipeline (with imputation)\n",
273    "numeric_transformer = Pipeline([\n",
274    "    ('imputer', SimpleImputer(strategy='median')),\n",
275    "    ('scaler', StandardScaler())\n",
276    "])\n",
277    "\n",
 278    "# Categorical pipeline (with imputation)\n",
279    "categorical_transformer = Pipeline([\n",
280    "    ('imputer', SimpleImputer(strategy='most_frequent')),\n",
281    "    ('encoder', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))\n",
282    "])\n",
283    "\n",
284    "# ColumnTransformer\n",
285    "preprocessor_full = ColumnTransformer(\n",
286    "    transformers=[\n",
287    "        ('num', numeric_transformer, numeric_features),\n",
288    "        ('cat', categorical_transformer, categorical_features)\n",
289    "    ]\n",
290    ")\n",
291    "\n",
 292    "# Full pipeline\n",
293    "complete_pipeline = Pipeline([\n",
294    "    ('preprocessor', preprocessor_full),\n",
295    "    ('classifier', RandomForestClassifier(random_state=42))\n",
296    "])\n",
297    "\n",
298    "complete_pipeline.fit(X_missing, y_missing)\n",
 299    "print(\"Pipeline with imputation fitted\")\n",
 300    "print(f\"Training accuracy: {complete_pipeline.score(X_missing, y_missing):.4f}\")"
301   ]
302  },
303  {
304   "cell_type": "markdown",
305   "metadata": {},
306   "source": [
 307    "## 4. Pipelines with cross-validation and hyperparameter tuning"
308   ]
309  },
310  {
311   "cell_type": "code",
312   "execution_count": null,
313   "metadata": {},
314   "outputs": [],
315   "source": [
 316    "# Practice on a real dataset\n",
317    "cancer = load_breast_cancer()\n",
318    "X, y = cancer.data, cancer.target\n",
319    "\n",
 320    "# Define the pipeline\n",
321    "pipeline_cv = Pipeline([\n",
322    "    ('scaler', StandardScaler()),\n",
323    "    ('classifier', LogisticRegression(max_iter=1000))\n",
324    "])\n",
325    "\n",
 326    "# Cross-validation (the right way: in each fold the scaler is fit on that fold's training data only)\n",
327    "scores = cross_val_score(pipeline_cv, X, y, cv=5, scoring='accuracy')\n",
328    "\n",
 329    "print(\"Cross-validation results:\")\n",
 330    "print(f\"  Per fold: {scores}\")\n",
 331    "print(f\"  Mean: {scores.mean():.4f} (+/- {scores.std():.4f})\")"
332   ]
333  },
334  {
335   "cell_type": "markdown",
336   "metadata": {},
337   "source": [
 338    "### Hyperparameter tuning with GridSearchCV\n",
 339    "\n",
 340    "In a Pipeline, hyperparameter names use the `step__parameter` format."
341   ]
342  },
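A quick way to discover the valid `step__parameter` names is `get_params()`, which every Pipeline exposes; a short standalone sketch (the variable names here are illustrative, not from the notebook):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# get_params() lists every tunable name, including the nested
# step__parameter keys that a param_grid must use.
names = sorted(pipe.get_params().keys())
print([n for n in names if n.startswith('classifier__')][:5])

# The same naming scheme works for direct assignment:
pipe.set_params(classifier__C=10)
print(pipe.named_steps['classifier'].C)
```

Whatever `get_params()` prints is exactly what GridSearchCV will accept as a `param_grid` key, so this is a handy check before launching a long search.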
343  {
344   "cell_type": "code",
345   "execution_count": null,
346   "metadata": {},
347   "outputs": [],
348   "source": [
 349    "# Parameter grid (step__parameter format)\n",
350    "param_grid = {\n",
351    "    'scaler': [StandardScaler(), MinMaxScaler()],\n",
352    "    'classifier__C': [0.1, 1, 10],\n",
353    "    'classifier__penalty': ['l1', 'l2'],\n",
354    "    'classifier__solver': ['liblinear']\n",
355    "}\n",
356    "\n",
357    "# Grid Search\n",
358    "grid_search = GridSearchCV(\n",
359    "    pipeline_cv,\n",
360    "    param_grid,\n",
361    "    cv=5,\n",
362    "    scoring='accuracy',\n",
363    "    n_jobs=-1,\n",
364    "    verbose=1\n",
365    ")\n",
366    "\n",
367    "grid_search.fit(X, y)\n",
368    "\n",
 369    "print(\"\\nGrid Search results:\")\n",
 370    "print(f\"  Best parameters: {grid_search.best_params_}\")\n",
 371    "print(f\"  Best score: {grid_search.best_score_:.4f}\")"
372   ]
373  },
374  {
375   "cell_type": "markdown",
376   "metadata": {},
377   "source": [
 378    "### Comparing multiple models"
379   ]
380  },
381  {
382   "cell_type": "code",
383   "execution_count": null,
384   "metadata": {},
385   "outputs": [],
386   "source": [
387    "from sklearn.ensemble import RandomForestClassifier\n",
388    "from sklearn.svm import SVC\n",
389    "\n",
 390    "# One pipeline for several candidate models\n",
391    "pipeline_multi = Pipeline([\n",
392    "    ('scaler', StandardScaler()),\n",
393    "    ('classifier', LogisticRegression())  # placeholder\n",
394    "])\n",
395    "\n",
 396    "# Different parameter grids per model\n",
397    "param_grid_multi = [\n",
398    "    {\n",
399    "        'classifier': [LogisticRegression(max_iter=1000)],\n",
400    "        'classifier__C': [0.1, 1, 10]\n",
401    "    },\n",
402    "    {\n",
403    "        'classifier': [RandomForestClassifier(random_state=42)],\n",
404    "        'classifier__n_estimators': [50, 100],\n",
405    "        'classifier__max_depth': [None, 5, 10]\n",
406    "    },\n",
407    "    {\n",
408    "        'classifier': [SVC()],\n",
409    "        'classifier__C': [0.1, 1],\n",
410    "        'classifier__kernel': ['rbf', 'linear']\n",
411    "    }\n",
412    "]\n",
413    "\n",
414    "grid_search_multi = GridSearchCV(\n",
415    "    pipeline_multi,\n",
416    "    param_grid_multi,\n",
417    "    cv=5,\n",
418    "    scoring='accuracy',\n",
419    "    n_jobs=-1,\n",
420    "    verbose=1\n",
421    ")\n",
422    "\n",
423    "grid_search_multi.fit(X, y)\n",
424    "\n",
 425    "print(\"\\nModel comparison results:\")\n",
 426    "print(f\"  Best model: {type(grid_search_multi.best_params_['classifier']).__name__}\")\n",
 427    "print(f\"  Best parameters: {grid_search_multi.best_params_}\")\n",
 428    "print(f\"  Best score: {grid_search_multi.best_score_:.4f}\")"
429   ]
430  },
431  {
432   "cell_type": "markdown",
433   "metadata": {},
434   "source": [
 435    "## 5. Saving and loading models\n",
 436    "\n",
 437    "A trained pipeline can be saved and loaded again later."
438   ]
439  },
440  {
441   "cell_type": "code",
442   "execution_count": null,
443   "metadata": {},
444   "outputs": [],
445   "source": [
446    "import joblib\n",
447    "import pickle\n",
448    "import sklearn\n",
449    "from datetime import datetime\n",
450    "\n",
 451    "# Best model\n",
452    "best_pipeline = grid_search.best_estimator_\n",
453    "\n",
 454    "# 1. Save with joblib (recommended)\n",
 455    "joblib.dump(best_pipeline, 'best_model.joblib')\n",
 456    "print(\"Model saved: best_model.joblib\")\n",
457    "\n",
 458    "# Load the model\n",
459    "loaded_model = joblib.load('best_model.joblib')\n",
460    "\n",
 461    "# Test\n",
462    "X_test_sample = X[:5]\n",
463    "predictions = loaded_model.predict(X_test_sample)\n",
 464    "print(f\"Loaded model predictions: {predictions}\")"
465   ]
466  },
467  {
468   "cell_type": "code",
469   "execution_count": null,
470   "metadata": {},
471   "outputs": [],
472   "source": [
 473    "# 2. Save with pickle\n",
474    "with open('model.pkl', 'wb') as f:\n",
475    "    pickle.dump(best_pipeline, f)\n",
476    "\n",
 477    "# Load with pickle\n",
478    "with open('model.pkl', 'rb') as f:\n",
479    "    loaded_model_pkl = pickle.load(f)\n",
480    "\n",
 481    "print(\"pickle model predictions:\", loaded_model_pkl.predict(X[:3]))"
482   ]
483  },
484  {
485   "cell_type": "markdown",
486   "metadata": {},
487   "source": [
 488    "### Saving with metadata (recommended)"
489   ]
490  },
491  {
492   "cell_type": "code",
493   "execution_count": null,
494   "metadata": {},
495   "outputs": [],
496   "source": [
 497    "# Save together with metadata\n",
498    "model_metadata = {\n",
499    "    'model': best_pipeline,\n",
500    "    'sklearn_version': sklearn.__version__,\n",
501    "    'training_date': datetime.now().isoformat(),\n",
502    "    'feature_names': list(cancer.feature_names),\n",
503    "    'target_names': list(cancer.target_names),\n",
504    "    'cv_score': grid_search.best_score_,\n",
505    "    'best_params': grid_search.best_params_\n",
506    "}\n",
507    "\n",
508    "joblib.dump(model_metadata, 'model_with_metadata.joblib')\n",
509    "\n",
 510    "# Load and verify\n",
 511    "loaded_metadata = joblib.load('model_with_metadata.joblib')\n",
 512    "print(\"Model metadata:\")\n",
 513    "print(f\"  Training date: {loaded_metadata['training_date']}\")\n",
 514    "print(f\"  sklearn version: {loaded_metadata['sklearn_version']}\")\n",
 515    "print(f\"  CV score: {loaded_metadata['cv_score']:.4f}\")\n",
 516    "print(f\"  Best parameters: {loaded_metadata['best_params']}\")"
517   ]
518  },
519  {
520   "cell_type": "markdown",
521   "metadata": {},
522   "source": [
 523    "## 6. Writing custom transformers\n",
 524    "\n",
 525    "You can build your own transformer by inheriting from sklearn's BaseEstimator and TransformerMixin."
526   ]
527  },
528  {
529   "cell_type": "code",
530   "execution_count": null,
531   "metadata": {},
532   "outputs": [],
533   "source": [
534    "from sklearn.base import BaseEstimator, TransformerMixin\n",
535    "\n",
536    "class OutlierRemover(BaseEstimator, TransformerMixin):\n",
 537    "    \"\"\"Transformer that clips outliers to boundary values (despite the name, no rows are removed)\"\"\"\n",
538    "    \n",
539    "    def __init__(self, threshold=3):\n",
540    "        self.threshold = threshold\n",
541    "        self.mean_ = None\n",
542    "        self.std_ = None\n",
543    "    \n",
544    "    def fit(self, X, y=None):\n",
545    "        self.mean_ = np.mean(X, axis=0)\n",
546    "        self.std_ = np.std(X, axis=0)\n",
547    "        return self\n",
548    "    \n",
549    "    def transform(self, X):\n",
550    "        X = np.array(X)\n",
551    "        z_scores = np.abs((X - self.mean_) / (self.std_ + 1e-10))\n",
 552    "        # Replace outliers with the boundary value\n",
553    "        X_clipped = np.where(z_scores > self.threshold,\n",
554    "                             self.mean_ + self.threshold * self.std_ * np.sign(X - self.mean_),\n",
555    "                             X)\n",
556    "        return X_clipped\n",
557    "\n",
558    "\n",
559    "class FeatureSelector(BaseEstimator, TransformerMixin):\n",
 560    "    \"\"\"Feature-selection transformer\"\"\"\n",
561    "    \n",
562    "    def __init__(self, feature_indices=None):\n",
563    "        self.feature_indices = feature_indices\n",
564    "    \n",
565    "    def fit(self, X, y=None):\n",
566    "        return self\n",
567    "    \n",
568    "    def transform(self, X):\n",
569    "        X = np.array(X)\n",
570    "        if self.feature_indices is not None:\n",
571    "            return X[:, self.feature_indices]\n",
572    "        return X\n",
573    "\n",
574    "\n",
 575    "# Use the custom transformers\n",
576    "custom_pipeline = Pipeline([\n",
577    "    ('outlier', OutlierRemover(threshold=3)),\n",
578    "    ('scaler', StandardScaler()),\n",
579    "    ('classifier', LogisticRegression(max_iter=1000))\n",
580    "])\n",
581    "\n",
582    "scores = cross_val_score(custom_pipeline, X, y, cv=5)\n",
 583    "print(f\"Custom-transformer CV score: {scores.mean():.4f} (+/- {scores.std():.4f})\")"
584   ]
585  },
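GridSearchCV relies on `clone()` re-creating estimators from their constructor parameters, which is why `__init__` in a custom transformer must only store its arguments unchanged. A minimal standalone sketch with a toy `ClipTransformer` (a made-up class for illustration, not part of the notebook):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone

class ClipTransformer(BaseEstimator, TransformerMixin):
    """Toy transformer: clips values to [low, high].

    __init__ only stores its arguments verbatim, so the inherited
    get_params()/set_params() (and therefore clone and GridSearchCV)
    can rebuild an identical, unfitted copy.
    """
    def __init__(self, low=-1.0, high=1.0):
        self.low = low
        self.high = high

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return np.clip(X, self.low, self.high)

t = ClipTransformer(high=2.0)
print(t.get_params())   # inherited from BaseEstimator
t2 = clone(t)           # unfitted copy with identical parameters
print(t2.high)
print(t2.transform(np.array([[-5.0, 0.5, 9.0]])))  # clipped to [low, high]
```

If `__init__` instead computed or renamed its arguments, `clone()` would silently build a different estimator, which is the most common source of "works alone, breaks inside GridSearchCV" bugs.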
586  {
587   "cell_type": "markdown",
588   "metadata": {},
589   "source": [
 590    "## 7. A practical template - a factory function for classification pipelines"
591   ]
592  },
593  {
594   "cell_type": "code",
595   "execution_count": null,
596   "metadata": {},
597   "outputs": [],
598   "source": [
599    "from sklearn.compose import make_column_selector\n",
600    "\n",
601    "def create_classification_pipeline(model, numeric_features=None, categorical_features=None):\n",
602    "    \"\"\"\n",
 603    "    Create a pipeline for a classification problem.\n",
 604    "    \n",
 605    "    Parameters:\n",
 606    "    -----------\n",
 607    "    model : sklearn estimator\n",
 608    "        Classification model\n",
 609    "    numeric_features : list, optional\n",
 610    "        Names of the numeric features\n",
 611    "    categorical_features : list, optional\n",
 612    "        Names of the categorical features\n",
 613    "    \n",
 614    "    Returns:\n",
 615    "    --------\n",
 616    "    pipeline : Pipeline\n",
 617    "        Preprocessing + model pipeline\n",
618    "    \"\"\"\n",
619    "    \n",
 620    "    # Numeric feature pipeline\n",
621    "    numeric_transformer = Pipeline([\n",
622    "        ('imputer', SimpleImputer(strategy='median')),\n",
623    "        ('scaler', StandardScaler())\n",
624    "    ])\n",
625    "    \n",
 626    "    # Categorical feature pipeline\n",
627    "    categorical_transformer = Pipeline([\n",
628    "        ('imputer', SimpleImputer(strategy='most_frequent')),\n",
629    "        ('encoder', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))\n",
630    "    ])\n",
631    "    \n",
632    "    # ColumnTransformer\n",
633    "    if numeric_features is None and categorical_features is None:\n",
 634    "        # Auto-detect feature types by dtype\n",
635    "        preprocessor = ColumnTransformer(\n",
636    "            transformers=[\n",
637    "                ('num', numeric_transformer, make_column_selector(dtype_include=np.number)),\n",
638    "                ('cat', categorical_transformer, make_column_selector(dtype_include=object))\n",
639    "            ]\n",
640    "        )\n",
641    "    else:\n",
642    "        preprocessor = ColumnTransformer(\n",
643    "            transformers=[\n",
644    "                ('num', numeric_transformer, numeric_features or []),\n",
645    "                ('cat', categorical_transformer, categorical_features or [])\n",
646    "            ]\n",
647    "        )\n",
648    "    \n",
 649    "    # Full pipeline\n",
650    "    pipeline = Pipeline([\n",
651    "        ('preprocessor', preprocessor),\n",
652    "        ('classifier', model)\n",
653    "    ])\n",
654    "    \n",
655    "    return pipeline\n",
656    "\n",
657    "\n",
 658    "# Usage example\n",
659    "from sklearn.ensemble import GradientBoostingClassifier\n",
660    "\n",
661    "pipeline_template = create_classification_pipeline(\n",
662    "    GradientBoostingClassifier(random_state=42),\n",
663    "    numeric_features=['age', 'income'],\n",
664    "    categorical_features=['gender', 'education']\n",
665    ")\n",
666    "\n",
 667    "print(\"Classification pipeline template created\")\n",
668    "print(pipeline_template)"
669   ]
670  },
671  {
672   "cell_type": "markdown",
673   "metadata": {},
674   "source": [
 675    "## 8. A model wrapper class for deployment"
676   ]
677  },
678  {
679   "cell_type": "code",
680   "execution_count": null,
681   "metadata": {},
682   "outputs": [],
683   "source": [
684    "class ModelWrapper:\n",
 685    "    \"\"\"Model wrapper for deployment\"\"\"\n",
686    "    \n",
687    "    def __init__(self, model_path):\n",
688    "        self.model = joblib.load(model_path)\n",
689    "        self.feature_names = None\n",
690    "    \n",
 691    "    def set_feature_names(self, names):\n",
 692    "        \"\"\"Set the expected feature names\"\"\"\n",
 693    "        self.feature_names = list(names)  # store as a plain list (names may arrive as a NumPy array)\n",
694    "    \n",
 695    "    def predict(self, input_data):\n",
 696    "        \"\"\"Accept dict or DataFrame input\"\"\"\n",
 697    "        if isinstance(input_data, dict):\n",
 698    "            input_data = pd.DataFrame([input_data])\n",
 699    "        \n",
 700    "        if self.feature_names is not None:  # avoid the ambiguous truth value of an array\n",
 701    "            input_data = input_data[self.feature_names]\n",
 702    "        \n",
 703    "        return self.model.predict(input_data)\n",
704    "    \n",
 705    "    def predict_proba(self, input_data):\n",
 706    "        \"\"\"Predict class probabilities\"\"\"\n",
 707    "        if isinstance(input_data, dict):\n",
 708    "            input_data = pd.DataFrame([input_data])\n",
 709    "        \n",
 710    "        if self.feature_names is not None:  # avoid the ambiguous truth value of an array\n",
 711    "            input_data = input_data[self.feature_names]\n",
 712    "        \n",
 713    "        return self.model.predict_proba(input_data)\n",
714    "\n",
715    "\n",
 716    "# Usage example\n",
717    "# wrapper = ModelWrapper('best_model.joblib')\n",
718    "# wrapper.set_feature_names(cancer.feature_names)\n",
719    "# prediction = wrapper.predict(X[0:1])\n",
 720    "# print(f\"Prediction: {prediction}\")"
721   ]
722  },
723  {
724   "cell_type": "markdown",
725   "metadata": {},
726   "source": [
 727    "## Summary & Best Practices\n",
 728    "\n",
 729    "### Advantages of using a Pipeline\n",
 730    "\n",
 731    "1. **Prevents data leakage**: during cross-validation, preprocessing in each fold is fit on that fold's training data only\n",
 732    "2. **Simpler code**: several steps are managed as one object\n",
 733    "3. **Reproducibility**: every preprocessing step is stored, guaranteeing identical processing\n",
 734    "4. **Easy deployment**: the whole workflow can be saved as a single file\n",
 735    "\n",
 736    "### Hyperparameter naming convention\n",
 737    "\n",
 738    "```python\n",
 739    "# Format: step_name__parameter_name\n",
 740    "'classifier__C'  # the C parameter of the classifier step\n",
 741    "'preprocessor__num__scaler__with_mean'  # a nested parameter\n",
 742    "```\n",
 743    "\n",
 744    "### Comparing ways to save a model\n",
 745    "\n",
 746    "| Method | Pros | Cons |\n",
 747    "|------|------|------|\n",
 748    "| joblib | handles large NumPy arrays efficiently | Python-only, mostly used with sklearn |\n",
 749    "| pickle | Python standard library | slow on large arrays |\n",
 750    "| ONNX | framework-independent, usable from many languages | requires a conversion step |\n",
 751    "\n",
 752    "### Checklist for practice\n",
 753    "\n",
 754    "- [ ] Always use a Pipeline to prevent data leakage\n",
 755    "- [ ] Separate numeric/categorical preprocessing with a ColumnTransformer\n",
 756    "- [ ] Include metadata when saving a model (version, date, performance, etc.)\n",
 757    "- [ ] Write an input-validation function\n",
 758    "- [ ] Make custom transformers inherit BaseEstimator and TransformerMixin\n",
 759    "- [ ] Tune the entire pipeline with GridSearchCV"
760   ]
761  }
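The "write an input-validation function" checklist item can be sketched as follows; `validate_input` and `EXPECTED_COLUMNS` are hypothetical names, modeled on the tutorial's purchase dataset, not an sklearn API:

```python
import pandas as pd

# Hypothetical schema for the purchase dataset used earlier in the notebook.
EXPECTED_COLUMNS = {'age': 'number', 'income': 'number',
                    'gender': 'category', 'education': 'category'}

def validate_input(df: pd.DataFrame) -> pd.DataFrame:
    """Pre-predict sanity check: required columns exist and numeric
    columns really are numeric, so pipeline.predict fails early and loudly."""
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for col, kind in EXPECTED_COLUMNS.items():
        if kind == 'number' and not pd.api.types.is_numeric_dtype(df[col]):
            raise TypeError(f"column {col!r} must be numeric")
    return df

ok = pd.DataFrame({'age': [30], 'income': [70000],
                   'gender': ['F'], 'education': ['Master']})
validate_input(ok)  # passes silently
print("validation passed")
```

Calling this before `pipeline.predict` turns a cryptic downstream transformer error into a clear message about which input field is wrong.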
762 ],
763 "metadata": {
764  "kernelspec": {
765   "display_name": "Python 3",
766   "language": "python",
767   "name": "python3"
768  },
769  "language_info": {
770   "codemirror_mode": {
771    "name": "ipython",
772    "version": 3
773   },
774   "file_extension": ".py",
775   "mimetype": "text/x-python",
776   "name": "python",
777   "nbconvert_exporter": "python",
778   "pygments_lexer": "ipython3",
779   "version": "3.8.0"
780  }
781 },
782 "nbformat": 4,
783 "nbformat_minor": 4
784}