{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pipeline & Practice\n",
    "\n",
    "Learn how to combine preprocessing and modeling into a single workflow using sklearn's Pipeline and ColumnTransformer.\n",
    "\n",
    "**Learning goals:**\n",
    "- Understand why Pipelines are needed and what they offer\n",
    "- Handle features of different types with ColumnTransformer\n",
    "- Write custom Transformers\n",
    "- Combine Pipeline with GridSearchCV\n",
    "- Save and deploy models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "from sklearn.pipeline import Pipeline, make_pipeline\n",
    "from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder\n",
    "from sklearn.decomposition import PCA\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV\n",
    "from sklearn.datasets import load_iris, load_breast_cancer\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Pipeline Basics\n",
    "\n",
    "### Problems when you don't use a Pipeline\n",
    "\n",
    "1. **Data leakage**: information from the test data can leak into training\n",
    "2. **Code complexity**: every step has to be managed by hand\n",
    "3. **Reproducibility issues**: steps can be run in the wrong order or with mismatched parameters\n",
    "\n",
    "### Advantages of a Pipeline\n",
    "\n",
    "1. Simpler code\n",
    "2. Prevents data leakage\n",
    "3. Integrates cleanly with cross-validation\n",
    "4. Easy hyperparameter tuning\n",
    "5. Convenient model saving and deployment"
   ]
  },
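  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data-leakage problem above can be shown directly. The next cell is a minimal sketch comparing two approaches: scaling the full dataset before cross-validation (so test folds leak into the scaler) versus keeping the scaler inside a Pipeline. Exact scores vary by dataset; on easy data the gap may be small, but only the second form is safe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: why preprocessing must stay inside the CV loop\n",
    "iris = load_iris()\n",
    "\n",
    "# Leaky: the scaler sees the whole dataset, including future test folds\n",
    "X_leaky = StandardScaler().fit_transform(iris.data)\n",
    "leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, iris.target, cv=5)\n",
    "\n",
    "# Safe: the Pipeline refits the scaler on each fold's training data only\n",
    "safe_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))\n",
    "safe_scores = cross_val_score(safe_pipe, iris.data, iris.target, cv=5)\n",
    "\n",
    "print(f\"Scaled before CV (leaky): {leaky_scores.mean():.4f}\")\n",
    "print(f\"Scaler inside Pipeline:   {safe_scores.mean():.4f}\")"
   ]
  },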
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load data\n",
    "iris = load_iris()\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    iris.data, iris.target, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "# Create a Pipeline (explicit step names)\n",
    "pipeline = Pipeline([\n",
    "    ('scaler', StandardScaler()),\n",
    "    ('pca', PCA(n_components=2)),\n",
    "    ('classifier', LogisticRegression())\n",
    "])\n",
    "\n",
    "# Fit and predict\n",
    "pipeline.fit(X_train, y_train)\n",
    "y_pred = pipeline.predict(X_test)\n",
    "score = pipeline.score(X_test, y_test)\n",
    "\n",
    "print(f\"Pipeline accuracy: {score:.4f}\")\n",
    "\n",
    "# make_pipeline (auto-generated step names)\n",
    "pipeline_auto = make_pipeline(\n",
    "    StandardScaler(),\n",
    "    PCA(n_components=2),\n",
    "    LogisticRegression()\n",
    ")\n",
    "\n",
    "pipeline_auto.fit(X_train, y_train)\n",
    "print(f\"make_pipeline accuracy: {pipeline_auto.score(X_test, y_test):.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Accessing Pipeline steps"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check the step names\n",
    "print(\"Pipeline steps:\")\n",
    "for name, step in pipeline.named_steps.items():\n",
    "    print(f\"  {name}: {type(step).__name__}\")\n",
    "\n",
    "# Access a specific step\n",
    "print(f\"\\nPCA explained variance ratio: {pipeline.named_steps['pca'].explained_variance_ratio_}\")\n",
    "print(f\"Logistic regression coefficient shape: {pipeline.named_steps['classifier'].coef_.shape}\")\n",
    "\n",
    "# Get intermediate transformation results\n",
    "X_scaled = pipeline.named_steps['scaler'].transform(X_test)\n",
    "X_pca = pipeline.named_steps['pca'].transform(X_scaled)\n",
    "print(f\"\\nShape after scaling: {X_scaled.shape}\")\n",
    "print(f\"Shape after PCA: {X_pca.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. ColumnTransformer - Handling Features of Different Types\n",
    "\n",
    "Real-world data mixes numeric and categorical features. ColumnTransformer lets you apply preprocessing appropriate to each type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.preprocessing import OrdinalEncoder\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "# Create sample data\n",
    "data = {\n",
    "    'age': [25, 32, 47, 51, 62, 28, 35, 42, 55, 60],\n",
    "    'income': [50000, 60000, 80000, 120000, 95000, 55000, 70000, 85000, 110000, 100000],\n",
    "    'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],\n",
    "    'education': ['Bachelor', 'Master', 'PhD', 'Bachelor', 'Master', \n",
    "                  'Bachelor', 'PhD', 'Master', 'PhD', 'Bachelor'],\n",
    "    'purchased': [0, 1, 1, 1, 0, 0, 1, 1, 1, 0]\n",
    "}\n",
    "df = pd.DataFrame(data)\n",
    "\n",
    "X = df.drop('purchased', axis=1)\n",
    "y = df['purchased']\n",
    "\n",
    "print(\"Data types:\")\n",
    "print(X.dtypes)\n",
    "print(\"\\nData sample:\")\n",
    "print(X.head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Group features by type\n",
    "numeric_features = ['age', 'income']\n",
    "categorical_features = ['gender', 'education']\n",
    "\n",
    "# Define the ColumnTransformer\n",
    "preprocessor = ColumnTransformer(\n",
    "    transformers=[\n",
    "        ('num', StandardScaler(), numeric_features),\n",
    "        ('cat', OneHotEncoder(drop='first', sparse_output=False), categorical_features)\n",
    "    ],\n",
    "    remainder='passthrough'  # what to do with remaining columns: 'drop' or 'passthrough'\n",
    ")\n",
    "\n",
    "# Transform\n",
    "X_transformed = preprocessor.fit_transform(X)\n",
    "\n",
    "print(f\"Original shape: {X.shape}\")\n",
    "print(f\"Shape after transform: {X_transformed.shape}\")\n",
    "\n",
    "# Transformed feature names\n",
    "feature_names = (\n",
    "    numeric_features +\n",
    "    list(preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features))\n",
    ")\n",
    "print(f\"\\nFeature names: {feature_names}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Combining Pipeline + ColumnTransformer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Full pipeline\n",
    "full_pipeline = Pipeline([\n",
    "    ('preprocessor', preprocessor),\n",
    "    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))\n",
    "])\n",
    "\n",
    "# Fit\n",
    "full_pipeline.fit(X, y)\n",
    "\n",
    "# Predict on new data\n",
    "new_data = pd.DataFrame({\n",
    "    'age': [30, 45],\n",
    "    'income': [70000, 90000],\n",
    "    'gender': ['F', 'M'],\n",
    "    'education': ['Master', 'PhD']\n",
    "})\n",
    "\n",
    "predictions = full_pipeline.predict(new_data)\n",
    "print(f\"Predictions: {predictions}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. A More Complex Pipeline with Missing-Value Handling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.impute import SimpleImputer\n",
    "\n",
    "# Create data with missing values\n",
    "data_missing = {\n",
    "    'age': [25, np.nan, 47, 51, 62, 28, np.nan, 42, 55, 60],\n",
    "    'income': [50000, 60000, np.nan, 120000, 95000, np.nan, 70000, 85000, 110000, 100000],\n",
    "    'gender': ['M', 'F', 'M', None, 'M', 'F', 'M', None, 'M', 'F'],\n",
    "    'education': ['Bachelor', 'Master', 'PhD', 'Bachelor', None, \n",
    "                  'Bachelor', 'PhD', 'Master', None, 'Bachelor'],\n",
    "    'purchased': [0, 1, 1, 1, 0, 0, 1, 1, 1, 0]\n",
    "}\n",
    "df_missing = pd.DataFrame(data_missing)\n",
    "X_missing = df_missing.drop('purchased', axis=1)\n",
    "y_missing = df_missing['purchased']\n",
    "\n",
    "print(\"Missing-value counts:\")\n",
    "print(X_missing.isnull().sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Numeric pipeline (with imputation)\n",
    "numeric_transformer = Pipeline([\n",
    "    ('imputer', SimpleImputer(strategy='median')),\n",
    "    ('scaler', StandardScaler())\n",
    "])\n",
    "\n",
    "# Categorical pipeline (with imputation)\n",
    "categorical_transformer = Pipeline([\n",
    "    ('imputer', SimpleImputer(strategy='most_frequent')),\n",
    "    ('encoder', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))\n",
    "])\n",
    "\n",
    "# ColumnTransformer\n",
    "preprocessor_full = ColumnTransformer(\n",
    "    transformers=[\n",
    "        ('num', numeric_transformer, numeric_features),\n",
    "        ('cat', categorical_transformer, categorical_features)\n",
    "    ]\n",
    ")\n",
    "\n",
    "# Full pipeline\n",
    "complete_pipeline = Pipeline([\n",
    "    ('preprocessor', preprocessor_full),\n",
    "    ('classifier', RandomForestClassifier(random_state=42))\n",
    "])\n",
    "\n",
    "complete_pipeline.fit(X_missing, y_missing)\n",
    "print(\"Pipeline with missing-value handling fitted\")\n",
    "print(f\"Training accuracy: {complete_pipeline.score(X_missing, y_missing):.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Pipelines with Cross-Validation and Hyperparameter Tuning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Practice on a real dataset\n",
    "cancer = load_breast_cancer()\n",
    "X, y = cancer.data, cancer.target\n",
    "\n",
    "# Define the pipeline\n",
    "pipeline_cv = Pipeline([\n",
    "    ('scaler', StandardScaler()),\n",
    "    ('classifier', LogisticRegression(max_iter=1000))\n",
    "])\n",
    "\n",
    "# Cross-validation (the right way: in each fold the scaler is fit on training data only)\n",
    "scores = cross_val_score(pipeline_cv, X, y, cv=5, scoring='accuracy')\n",
    "\n",
    "print(\"Cross-validation results:\")\n",
    "print(f\"  Fold scores: {scores}\")\n",
    "print(f\"  Mean: {scores.mean():.4f} (+/- {scores.std():.4f})\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Hyperparameter tuning with GridSearchCV\n",
    "\n",
    "Inside a Pipeline, hyperparameter names use the `step__parameter` format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parameter grid (step__parameter format)\n",
    "param_grid = {\n",
    "    'scaler': [StandardScaler(), MinMaxScaler()],\n",
    "    'classifier__C': [0.1, 1, 10],\n",
    "    'classifier__penalty': ['l1', 'l2'],\n",
    "    'classifier__solver': ['liblinear']\n",
    "}\n",
    "\n",
    "# Grid Search\n",
    "grid_search = GridSearchCV(\n",
    "    pipeline_cv,\n",
    "    param_grid,\n",
    "    cv=5,\n",
    "    scoring='accuracy',\n",
    "    n_jobs=-1,\n",
    "    verbose=1\n",
    ")\n",
    "\n",
    "grid_search.fit(X, y)\n",
    "\n",
    "print(\"\\nGrid Search results:\")\n",
    "print(f\"  Best parameters: {grid_search.best_params_}\")\n",
    "print(f\"  Best score: {grid_search.best_score_:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Comparing multiple models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "# One pipeline reused across several candidate models\n",
    "pipeline_multi = Pipeline([\n",
    "    ('scaler', StandardScaler()),\n",
    "    ('classifier', LogisticRegression())  # placeholder\n",
    "])\n",
    "\n",
    "# A separate parameter grid per model\n",
    "param_grid_multi = [\n",
    "    {\n",
    "        'classifier': [LogisticRegression(max_iter=1000)],\n",
    "        'classifier__C': [0.1, 1, 10]\n",
    "    },\n",
    "    {\n",
    "        'classifier': [RandomForestClassifier(random_state=42)],\n",
    "        'classifier__n_estimators': [50, 100],\n",
    "        'classifier__max_depth': [None, 5, 10]\n",
    "    },\n",
    "    {\n",
    "        'classifier': [SVC()],\n",
    "        'classifier__C': [0.1, 1],\n",
    "        'classifier__kernel': ['rbf', 'linear']\n",
    "    }\n",
    "]\n",
    "\n",
    "grid_search_multi = GridSearchCV(\n",
    "    pipeline_multi,\n",
    "    param_grid_multi,\n",
    "    cv=5,\n",
    "    scoring='accuracy',\n",
    "    n_jobs=-1,\n",
    "    verbose=1\n",
    ")\n",
    "\n",
    "grid_search_multi.fit(X, y)\n",
    "\n",
    "print(\"\\nModel comparison results:\")\n",
    "print(f\"  Best model: {type(grid_search_multi.best_params_['classifier']).__name__}\")\n",
    "print(f\"  Best parameters: {grid_search_multi.best_params_}\")\n",
    "print(f\"  Best score: {grid_search_multi.best_score_:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Saving and Loading Models\n",
    "\n",
    "A fitted pipeline can be saved and reused later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import joblib\n",
    "import pickle\n",
    "import sklearn\n",
    "from datetime import datetime\n",
    "\n",
    "# Best model from the grid search\n",
    "best_pipeline = grid_search.best_estimator_\n",
    "\n",
    "# 1. Save with joblib (recommended)\n",
    "joblib.dump(best_pipeline, 'best_model.joblib')\n",
    "print(\"Model saved: best_model.joblib\")\n",
    "\n",
    "# Load the model\n",
    "loaded_model = joblib.load('best_model.joblib')\n",
    "\n",
    "# Quick test\n",
    "X_test_sample = X[:5]\n",
    "predictions = loaded_model.predict(X_test_sample)\n",
    "print(f\"Loaded model predictions: {predictions}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2. Save with pickle\n",
    "with open('model.pkl', 'wb') as f:\n",
    "    pickle.dump(best_pipeline, f)\n",
    "\n",
    "# Load with pickle\n",
    "with open('model.pkl', 'rb') as f:\n",
    "    loaded_model_pkl = pickle.load(f)\n",
    "\n",
    "print(\"pickle model predictions:\", loaded_model_pkl.predict(X[:3]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Saving with metadata (recommended)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the model together with metadata\n",
    "model_metadata = {\n",
    "    'model': best_pipeline,\n",
    "    'sklearn_version': sklearn.__version__,\n",
    "    'training_date': datetime.now().isoformat(),\n",
    "    'feature_names': list(cancer.feature_names),\n",
    "    'target_names': list(cancer.target_names),\n",
    "    'cv_score': grid_search.best_score_,\n",
    "    'best_params': grid_search.best_params_\n",
    "}\n",
    "\n",
    "joblib.dump(model_metadata, 'model_with_metadata.joblib')\n",
    "\n",
    "# Load and verify\n",
    "loaded_metadata = joblib.load('model_with_metadata.joblib')\n",
    "print(\"Model metadata:\")\n",
    "print(f\"  Training date: {loaded_metadata['training_date']}\")\n",
    "print(f\"  sklearn version: {loaded_metadata['sklearn_version']}\")\n",
    "print(f\"  CV score: {loaded_metadata['cv_score']:.4f}\")\n",
    "print(f\"  Best parameters: {loaded_metadata['best_params']}\")"
   ]
  },
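  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One practical use of the stored metadata is a load-time sanity check. A minimal sketch, reusing the `model_with_metadata.joblib` file written above: compare the saved sklearn version against the running one before trusting predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: verify environment compatibility when loading a saved bundle\n",
    "bundle = joblib.load('model_with_metadata.joblib')\n",
    "\n",
    "if bundle['sklearn_version'] != sklearn.__version__:\n",
    "    print(f\"Warning: model trained with sklearn {bundle['sklearn_version']}, \"\n",
    "          f\"running {sklearn.__version__} - predictions may differ\")\n",
    "else:\n",
    "    print(\"sklearn version matches the training environment\")"
   ]
  },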
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Writing Custom Transformers\n",
    "\n",
    "Inherit from sklearn's BaseEstimator and TransformerMixin to build your own Transformer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.base import BaseEstimator, TransformerMixin\n",
    "\n",
    "class OutlierRemover(BaseEstimator, TransformerMixin):\n",
    "    \"\"\"Transformer that clips outliers to z-score boundary values\"\"\"\n",
    "    \n",
    "    def __init__(self, threshold=3):\n",
    "        self.threshold = threshold\n",
    "        self.mean_ = None\n",
    "        self.std_ = None\n",
    "    \n",
    "    def fit(self, X, y=None):\n",
    "        self.mean_ = np.mean(X, axis=0)\n",
    "        self.std_ = np.std(X, axis=0)\n",
    "        return self\n",
    "    \n",
    "    def transform(self, X):\n",
    "        X = np.array(X)\n",
    "        z_scores = np.abs((X - self.mean_) / (self.std_ + 1e-10))\n",
    "        # Replace outliers with the boundary value\n",
    "        X_clipped = np.where(z_scores > self.threshold,\n",
    "                             self.mean_ + self.threshold * self.std_ * np.sign(X - self.mean_),\n",
    "                             X)\n",
    "        return X_clipped\n",
    "\n",
    "\n",
    "class FeatureSelector(BaseEstimator, TransformerMixin):\n",
    "    \"\"\"Feature-selection transformer\"\"\"\n",
    "    \n",
    "    def __init__(self, feature_indices=None):\n",
    "        self.feature_indices = feature_indices\n",
    "    \n",
    "    def fit(self, X, y=None):\n",
    "        return self\n",
    "    \n",
    "    def transform(self, X):\n",
    "        X = np.array(X)\n",
    "        if self.feature_indices is not None:\n",
    "            return X[:, self.feature_indices]\n",
    "        return X\n",
    "\n",
    "\n",
    "# Use the custom transformer in a pipeline\n",
    "custom_pipeline = Pipeline([\n",
    "    ('outlier', OutlierRemover(threshold=3)),\n",
    "    ('scaler', StandardScaler()),\n",
    "    ('classifier', LogisticRegression(max_iter=1000))\n",
    "])\n",
    "\n",
    "scores = cross_val_score(custom_pipeline, X, y, cv=5)\n",
    "print(f\"Custom transformer CV score: {scores.mean():.4f} (+/- {scores.std():.4f})\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Practical Template - A Pipeline Factory for Classification Problems"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.compose import make_column_selector\n",
    "\n",
    "def create_classification_pipeline(model, numeric_features=None, categorical_features=None):\n",
    "    \"\"\"\n",
    "    Build a preprocessing + model pipeline for classification problems.\n",
    "    \n",
    "    Parameters:\n",
    "    -----------\n",
    "    model : sklearn estimator\n",
    "        The classifier to place at the end of the pipeline.\n",
    "    numeric_features : list, optional\n",
    "        Names of numeric feature columns.\n",
    "    categorical_features : list, optional\n",
    "        Names of categorical feature columns.\n",
    "    \n",
    "    Returns:\n",
    "    --------\n",
    "    pipeline : Pipeline\n",
    "        Preprocessing + model pipeline.\n",
    "    \"\"\"\n",
    "    \n",
    "    # Numeric feature pipeline\n",
    "    numeric_transformer = Pipeline([\n",
    "        ('imputer', SimpleImputer(strategy='median')),\n",
    "        ('scaler', StandardScaler())\n",
    "    ])\n",
    "    \n",
    "    # Categorical feature pipeline\n",
    "    categorical_transformer = Pipeline([\n",
    "        ('imputer', SimpleImputer(strategy='most_frequent')),\n",
    "        ('encoder', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))\n",
    "    ])\n",
    "    \n",
    "    # ColumnTransformer\n",
    "    if numeric_features is None and categorical_features is None:\n",
    "        # Auto-detect column types\n",
    "        preprocessor = ColumnTransformer(\n",
    "            transformers=[\n",
    "                ('num', numeric_transformer, make_column_selector(dtype_include=np.number)),\n",
    "                ('cat', categorical_transformer, make_column_selector(dtype_include=object))\n",
    "            ]\n",
    "        )\n",
    "    else:\n",
    "        preprocessor = ColumnTransformer(\n",
    "            transformers=[\n",
    "                ('num', numeric_transformer, numeric_features or []),\n",
    "                ('cat', categorical_transformer, categorical_features or [])\n",
    "            ]\n",
    "        )\n",
    "    \n",
    "    # Full pipeline\n",
    "    pipeline = Pipeline([\n",
    "        ('preprocessor', preprocessor),\n",
    "        ('classifier', model)\n",
    "    ])\n",
    "    \n",
    "    return pipeline\n",
    "\n",
    "\n",
    "# Usage example\n",
    "from sklearn.ensemble import GradientBoostingClassifier\n",
    "\n",
    "pipeline_template = create_classification_pipeline(\n",
    "    GradientBoostingClassifier(random_state=42),\n",
    "    numeric_features=['age', 'income'],\n",
    "    categorical_features=['gender', 'education']\n",
    ")\n",
    "\n",
    "print(\"Classification pipeline template created\")\n",
    "print(pipeline_template)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. A Model Wrapper Class for Deployment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class ModelWrapper:\n",
    "    \"\"\"Model wrapper for deployment\"\"\"\n",
    "    \n",
    "    def __init__(self, model_path):\n",
    "        self.model = joblib.load(model_path)\n",
    "        self.feature_names = None\n",
    "    \n",
    "    def set_feature_names(self, names):\n",
    "        \"\"\"Set the expected feature names\"\"\"\n",
    "        self.feature_names = names\n",
    "    \n",
    "    def predict(self, input_data):\n",
    "        \"\"\"Accept dict or DataFrame input\"\"\"\n",
    "        if isinstance(input_data, dict):\n",
    "            input_data = pd.DataFrame([input_data])\n",
    "        \n",
    "        if self.feature_names:\n",
    "            input_data = input_data[self.feature_names]\n",
    "        \n",
    "        return self.model.predict(input_data)\n",
    "    \n",
    "    def predict_proba(self, input_data):\n",
    "        \"\"\"Probability predictions\"\"\"\n",
    "        if isinstance(input_data, dict):\n",
    "            input_data = pd.DataFrame([input_data])\n",
    "        \n",
    "        if self.feature_names:\n",
    "            input_data = input_data[self.feature_names]\n",
    "        \n",
    "        return self.model.predict_proba(input_data)\n",
    "\n",
    "\n",
    "# Usage example\n",
    "# wrapper = ModelWrapper('best_model.joblib')\n",
    "# wrapper.set_feature_names(cancer.feature_names)\n",
    "# prediction = wrapper.predict(X[0:1])\n",
    "# print(f\"Prediction: {prediction}\")"
   ]
  },
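  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A wrapper like this is usually paired with input validation before `predict` is called. A minimal sketch of such a validator (the function and the column list passed in are illustrative, not part of any library API):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: validate raw input before handing it to the model wrapper\n",
    "def validate_input(input_data, required_columns):\n",
    "    \"\"\"Raise ValueError if required columns are missing or contain nulls.\"\"\"\n",
    "    if isinstance(input_data, dict):\n",
    "        input_data = pd.DataFrame([input_data])\n",
    "    missing = [c for c in required_columns if c not in input_data.columns]\n",
    "    if missing:\n",
    "        raise ValueError(f\"Missing required columns: {missing}\")\n",
    "    if input_data[required_columns].isnull().any().any():\n",
    "        raise ValueError(\"Input contains null values\")\n",
    "    return input_data\n",
    "\n",
    "# Example with hypothetical columns:\n",
    "validated = validate_input({'age': 30, 'income': 70000}, ['age', 'income'])\n",
    "print(validated)"
   ]
  },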
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary & Best Practices\n",
    "\n",
    "### Key advantages of using a Pipeline\n",
    "\n",
    "1. **Prevents data leakage**: during cross-validation, preprocessing is fit on each fold's training data only\n",
    "2. **Simpler code**: multiple steps are managed as a single object\n",
    "3. **Reproducibility**: every preprocessing step is stored, so identical processing is guaranteed\n",
    "4. **Easy deployment**: the whole workflow can be saved as a single file\n",
    "\n",
    "### Hyperparameter naming convention\n",
    "\n",
    "```python\n",
    "# Format: step_name__parameter_name\n",
    "'classifier__C'  # the C parameter of the classifier step\n",
    "'preprocessor__num__scaler__with_mean'  # a nested parameter\n",
    "```\n",
    "\n",
    "### Model-saving options compared\n",
    "\n",
    "| Method | Pros | Cons |\n",
    "|--------|------|------|\n",
    "| joblib | handles large NumPy arrays efficiently | Python-only |\n",
    "| pickle | Python standard library | slow on large objects |\n",
    "| ONNX | framework-independent, runtimes in many languages | requires a conversion step |\n",
    "\n",
    "### Practical checklist\n",
    "\n",
    "- [ ] Always use a Pipeline to prevent data leakage\n",
    "- [ ] Separate numeric/categorical preprocessing with ColumnTransformer\n",
    "- [ ] Save metadata alongside the model (version, date, performance, ...)\n",
    "- [ ] Write an input-validation function\n",
    "- [ ] Inherit custom Transformers from BaseEstimator and TransformerMixin\n",
    "- [ ] Tune the whole pipeline with GridSearchCV"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}