7. Descriptive Stats & EDA
7. Descriptive Stats & EDA¶
Previous: Data Preprocessing | Next: Data Visualization Basics
Overview¶
Descriptive statistics summarize data characteristics, and EDA (Exploratory Data Analysis) is the process of visualizing and exploring data to discover patterns and insights.
1. ๊ธฐ์ ํต๊ณ๋¶
1.1 ์ค์ฌ ๊ฒฝํฅ ์ธก๋¶
import pandas as pd
import numpy as np
from scipy import stats
data = [10, 15, 20, 25, 30, 35, 40, 100] # ์ด์์น ํฌํจ
s = pd.Series(data)
# ํ๊ท (Mean)
print(f"ํ๊ท : {s.mean():.2f}") # 34.38
# ์ค์๊ฐ (Median)
print(f"์ค์๊ฐ: {s.median():.2f}") # 27.50
# ์ต๋น๊ฐ (Mode)
data_with_mode = [1, 2, 2, 3, 3, 3, 4, 4]
print(f"์ต๋น๊ฐ: {pd.Series(data_with_mode).mode().values}") # [3]
# ์ ์ฌํ๊ท (Trimmed Mean) - ์ด์์น ์ํฅ ๊ฐ์
print(f"์ ์ฌํ๊ท (10%): {stats.trim_mean(data, 0.1):.2f}")
# ๊ฐ์คํ๊ท
values = [10, 20, 30]
weights = [0.2, 0.3, 0.5]
weighted_mean = np.average(values, weights=weights)
print(f"๊ฐ์คํ๊ท : {weighted_mean}")
1.2 ์ฐํฌ ์ธก๋¶
s = pd.Series([10, 15, 20, 25, 30, 35, 40])
# ๋ฒ์ (Range)
print(f"๋ฒ์: {s.max() - s.min()}")
# ๋ถ์ฐ (Variance)
print(f"๋ถ์ฐ(ํ๋ณธ): {s.var():.2f}")
print(f"๋ถ์ฐ(๋ชจ์ง๋จ): {s.var(ddof=0):.2f}")
# ํ์คํธ์ฐจ (Standard Deviation)
print(f"ํ์คํธ์ฐจ(ํ๋ณธ): {s.std():.2f}")
print(f"ํ์คํธ์ฐจ(๋ชจ์ง๋จ): {s.std(ddof=0):.2f}")
# ์ฌ๋ถ์๋ฒ์ (IQR)
Q1 = s.quantile(0.25)
Q3 = s.quantile(0.75)
IQR = Q3 - Q1
print(f"IQR: {IQR}")
# ๋ณ๋๊ณ์ (CV) - ์๋์ ์ฐํฌ
cv = s.std() / s.mean()
print(f"๋ณ๋๊ณ์: {cv:.4f}")
# ํ๊ท ์ ๋ํธ์ฐจ (MAD)
mad = (s - s.mean()).abs().mean()
print(f"MAD: {mad:.2f}")
1.3 ๋ถํฌ ํํ ์ธก๋¶
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 10])
# ์๋ (Skewness) - ๋น๋์นญ ์ ๋
print(f"์๋: {s.skew():.4f}")
# > 0: ์ค๋ฅธ์ชฝ ๊ผฌ๋ฆฌ (์์ ์๋)
# < 0: ์ผ์ชฝ ๊ผฌ๋ฆฌ (์์ ์๋)
# = 0: ๋์นญ
# ์ฒจ๋ (Kurtosis) - ๊ผฌ๋ฆฌ ๋๊ป
print(f"์ฒจ๋: {s.kurtosis():.4f}")
# > 0: ์ ๊ท๋ถํฌ๋ณด๋ค ๋พฐ์กฑ (๋๊บผ์ด ๊ผฌ๋ฆฌ)
# < 0: ์ ๊ท๋ถํฌ๋ณด๋ค ํํ (์์ ๊ผฌ๋ฆฌ)
# = 0: ์ ๊ท๋ถํฌ์ ์ ์ฌ
# scipy ์ฌ์ฉ (Fisher ์ ์)
print(f"์๋ (scipy): {stats.skew(s):.4f}")
print(f"์ฒจ๋ (scipy): {stats.kurtosis(s):.4f}")
1.4 ๋ฐฑ๋ถ์์์ ๋ถ์์¶
s = pd.Series(range(1, 101))
# ๋ฐฑ๋ถ์์
print(f"25๋ฒ์งธ ๋ฐฑ๋ถ์์: {s.quantile(0.25)}")
print(f"50๋ฒ์งธ ๋ฐฑ๋ถ์์: {s.quantile(0.50)}")
print(f"75๋ฒ์งธ ๋ฐฑ๋ถ์์: {s.quantile(0.75)}")
print(f"90๋ฒ์งธ ๋ฐฑ๋ถ์์: {s.quantile(0.90)}")
# ์ฌ๋ฌ ๋ถ์์ ํ๋ฒ์
print(s.quantile([0.1, 0.25, 0.5, 0.75, 0.9]))
# 5์ ์์ฝ (Five-number summary)
print("5์ ์์ฝ:")
print(f"์ต์: {s.min()}")
print(f"Q1: {s.quantile(0.25)}")
print(f"์ค์๊ฐ: {s.median()}")
print(f"Q3: {s.quantile(0.75)}")
print(f"์ต๋: {s.max()}")
2. DataFrame ๊ธฐ์ ํต๊ณ¶
2.1 describe ๋ฉ์๋¶
df = pd.DataFrame({
'age': [25, 30, 35, 40, 45, 50, 55],
'salary': [40000, 45000, 50000, 60000, 70000, 80000, 100000],
'department': ['IT', 'HR', 'IT', 'Sales', 'IT', 'HR', 'Sales']
})
# ์์นํ ์ด๋ง
print(df.describe())
# ๋ชจ๋ ์ด (๋ฒ์ฃผํ ํฌํจ)
print(df.describe(include='all'))
# ํน์ ๋ฐฑ๋ถ์์ ์ง์
print(df.describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9]))
2.2 ์๊ด๋ถ์¶
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [2, 4, 5, 4, 5],
'C': [5, 4, 3, 2, 1]
})
# ํผ์ด์จ ์๊ด๊ณ์ (์ ํ ๊ด๊ณ)
print("ํผ์ด์จ ์๊ด๊ณ์:")
print(df.corr())
# ์คํผ์ด๋ง ์๊ด๊ณ์ (์์ ๊ธฐ๋ฐ, ๋น์ ํ ๊ด๊ณ)
print("\n์คํผ์ด๋ง ์๊ด๊ณ์:")
print(df.corr(method='spearman'))
# ์ผ๋ฌ ์๊ด๊ณ์ (์์ ๊ธฐ๋ฐ)
print("\n์ผ๋ฌ ์๊ด๊ณ์:")
print(df.corr(method='kendall'))
# ํน์ ์ด ๊ฐ ์๊ด๊ณ์
print(f"\nA์ B์ ์๊ด๊ณ์: {df['A'].corr(df['B']):.4f}")
# p-value์ ํจ๊ป
from scipy.stats import pearsonr, spearmanr
corr, p_value = pearsonr(df['A'], df['B'])
print(f"์๊ด๊ณ์: {corr:.4f}, p-value: {p_value:.4f}")
2.3 ๊ณต๋ถ์ฐ¶
df = pd.DataFrame({
'X': [1, 2, 3, 4, 5],
'Y': [2, 4, 5, 4, 5]
})
# ๊ณต๋ถ์ฐ ํ๋ ฌ
print(df.cov())
# ๋ ๋ณ์ ๊ฐ ๊ณต๋ถ์ฐ
print(f"X์ Y์ ๊ณต๋ถ์ฐ: {df['X'].cov(df['Y']):.4f}")
3. ํ์์ ๋ฐ์ดํฐ ๋ถ์ (EDA)¶
3.1 ๋ฐ์ดํฐ ๊ฐ์ ํ์ ¶
import pandas as pd
import numpy as np
# ์ํ ๋ฐ์ดํฐ ์์ฑ
np.random.seed(42)
df = pd.DataFrame({
'id': range(1, 1001),
'age': np.random.randint(18, 70, 1000),
'income': np.random.lognormal(10, 1, 1000),
'gender': np.random.choice(['M', 'F'], 1000),
'category': np.random.choice(['A', 'B', 'C', 'D'], 1000),
'score': np.random.normal(75, 15, 1000)
})
df.loc[np.random.choice(1000, 50), 'income'] = np.nan # ๊ฒฐ์ธก์น ์ถ๊ฐ
# 1. ๊ธฐ๋ณธ ์ ๋ณด
print("="*50)
print("1. ๋ฐ์ดํฐ ๊ธฐ๋ณธ ์ ๋ณด")
print("="*50)
print(f"ํ ์: {len(df)}")
print(f"์ด ์: {len(df.columns)}")
print(f"\n์ปฌ๋ผ ๋ชฉ๋ก: {df.columns.tolist()}")
print(f"\n๋ฐ์ดํฐ ํ์
:\n{df.dtypes}")
# 2. ๊ฒฐ์ธก์น ํํฉ
print("\n" + "="*50)
print("2. ๊ฒฐ์ธก์น ํํฉ")
print("="*50)
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
'๊ฒฐ์ธก์น ์': missing,
'๊ฒฐ์ธก์น ๋น์จ(%)': missing_pct
})
print(missing_df[missing_df['๊ฒฐ์ธก์น ์'] > 0])
# 3. ๊ธฐ์ ํต๊ณ
print("\n" + "="*50)
print("3. ์์นํ ๋ณ์ ๊ธฐ์ ํต๊ณ")
print("="*50)
print(df.describe())
# 4. ๋ฒ์ฃผํ ๋ณ์ ๋น๋
print("\n" + "="*50)
print("4. ๋ฒ์ฃผํ ๋ณ์ ๋น๋")
print("="*50)
for col in df.select_dtypes(include='object').columns:
print(f"\n[{col}]")
print(df[col].value_counts())
3.2 ๋จ๋ณ๋ ๋ถ์¶
import matplotlib.pyplot as plt
# ์์นํ ๋ณ์
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
numeric_cols.remove('id')
fig, axes = plt.subplots(2, len(numeric_cols), figsize=(15, 8))
for i, col in enumerate(numeric_cols):
# ํ์คํ ๊ทธ๋จ
axes[0, i].hist(df[col].dropna(), bins=30, edgecolor='black', alpha=0.7)
axes[0, i].set_title(f'{col} - Histogram')
axes[0, i].set_xlabel(col)
axes[0, i].set_ylabel('Frequency')
# ๋ฐ์คํ๋กฏ
axes[1, i].boxplot(df[col].dropna())
axes[1, i].set_title(f'{col} - Boxplot')
plt.tight_layout()
plt.show()
# ๋ฒ์ฃผํ ๋ณ์
categorical_cols = df.select_dtypes(include='object').columns.tolist()
fig, axes = plt.subplots(1, len(categorical_cols), figsize=(12, 4))
for i, col in enumerate(categorical_cols):
df[col].value_counts().plot(kind='bar', ax=axes[i], edgecolor='black')
axes[i].set_title(f'{col} - Bar Chart')
axes[i].set_xlabel(col)
axes[i].set_ylabel('Count')
axes[i].tick_params(axis='x', rotation=0)
plt.tight_layout()
plt.show()
3.3 ์ด๋ณ๋ ๋ถ์¶
import seaborn as sns
# ์์นํ vs ์์นํ: ์ฐ์ ๋
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(df['age'], df['income'], alpha=0.5)
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Income')
axes[0].set_title('Age vs Income')
axes[1].scatter(df['age'], df['score'], alpha=0.5)
axes[1].set_xlabel('Age')
axes[1].set_ylabel('Score')
axes[1].set_title('Age vs Score')
plt.tight_layout()
plt.show()
# ๋ฒ์ฃผํ vs ์์นํ: ๋ฐ์คํ๋กฏ
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
df.boxplot(column='income', by='gender', ax=axes[0])
axes[0].set_title('Income by Gender')
df.boxplot(column='score', by='category', ax=axes[1])
axes[1].set_title('Score by Category')
plt.suptitle('') # ์๋ ์์ฑ๋ ์ ๋ชฉ ์ ๊ฑฐ
plt.tight_layout()
plt.show()
# ๋ฒ์ฃผํ vs ๋ฒ์ฃผํ: ๊ต์ฐจํ
print("Gender์ Category ๊ต์ฐจํ:")
print(pd.crosstab(df['gender'], df['category']))
print(pd.crosstab(df['gender'], df['category'], normalize='index')) # ํ ๊ธฐ์ค ๋น์จ
3.4 ๋ค๋ณ๋ ๋ถ์¶
# ์๊ดํ๋ ฌ ํํธ๋งต
numeric_df = df[['age', 'income', 'score']].dropna()
correlation_matrix = numeric_df.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm',
center=0, vmin=-1, vmax=1, fmt='.2f')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()
# ์ฐ์ ๋ ํ๋ ฌ (Pair Plot)
sns.pairplot(df[['age', 'income', 'score', 'gender']].dropna(),
hue='gender', diag_kind='hist')
plt.suptitle('Pair Plot', y=1.02)
plt.tight_layout()
plt.show()
4. ๋ถํฌ ๋ถ์¶
4.1 ๋ถํฌ ํ์ธ¶
from scipy import stats
data = df['income'].dropna()
# ์ ๊ท์ฑ ๊ฒ์ - Shapiro-Wilk (n < 5000)
if len(data) < 5000:
stat, p_value = stats.shapiro(data[:5000])
print(f"Shapiro-Wilk ๊ฒ์ : ํต๊ณ๋={stat:.4f}, p-value={p_value:.4f}")
# ์ ๊ท์ฑ ๊ฒ์ - Kolmogorov-Smirnov
stat, p_value = stats.kstest(data, 'norm',
args=(data.mean(), data.std()))
print(f"K-S ๊ฒ์ : ํต๊ณ๋={stat:.4f}, p-value={p_value:.4f}")
# ์ ๊ท์ฑ ๊ฒ์ - Anderson-Darling
result = stats.anderson(data, dist='norm')
print(f"Anderson-Darling ๊ฒ์ : ํต๊ณ๋={result.statistic:.4f}")
# ์๊ฐ์ ํ์ธ: Q-Q plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# ํ์คํ ๊ทธ๋จ๊ณผ ์ ๊ท๋ถํฌ ๊ณก์
axes[0].hist(data, bins=50, density=True, alpha=0.7, edgecolor='black')
x = np.linspace(data.min(), data.max(), 100)
axes[0].plot(x, stats.norm.pdf(x, data.mean(), data.std()),
'r-', linewidth=2, label='Normal')
axes[0].set_title('Histogram with Normal Curve')
axes[0].legend()
# Q-Q plot
stats.probplot(data, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot')
plt.tight_layout()
plt.show()
4.2 ๋ถํฌ ๋ณํ¶
# ๋ก๊ทธ ์ ๊ท๋ถํฌ ๋ฐ์ดํฐ
data = df['income'].dropna()
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# ์๋ณธ
axes[0, 0].hist(data, bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].set_title(f'Original (Skew: {stats.skew(data):.2f})')
# ๋ก๊ทธ ๋ณํ
log_data = np.log1p(data)
axes[0, 1].hist(log_data, bins=50, edgecolor='black', alpha=0.7)
axes[0, 1].set_title(f'Log Transform (Skew: {stats.skew(log_data):.2f})')
# ์ ๊ณฑ๊ทผ ๋ณํ
sqrt_data = np.sqrt(data)
axes[1, 0].hist(sqrt_data, bins=50, edgecolor='black', alpha=0.7)
axes[1, 0].set_title(f'Square Root Transform (Skew: {stats.skew(sqrt_data):.2f})')
# Box-Cox ๋ณํ
boxcox_data, lambda_param = stats.boxcox(data)
axes[1, 1].hist(boxcox_data, bins=50, edgecolor='black', alpha=0.7)
axes[1, 1].set_title(f'Box-Cox Transform (ฮป={lambda_param:.2f}, Skew: {stats.skew(boxcox_data):.2f})')
plt.tight_layout()
plt.show()
5. ๊ทธ๋ฃน๋ณ ๋ถ์¶
5.1 ๊ทธ๋ฃน๋ณ ๊ธฐ์ ํต๊ณ¶
# ๊ทธ๋ฃน๋ณ ์์ฝ
print("์ฑ๋ณ ๊ทธ๋ฃน๋ณ ํต๊ณ:")
print(df.groupby('gender')[['age', 'income', 'score']].agg(['mean', 'std', 'count']))
print("\n์นดํ
๊ณ ๋ฆฌ๋ณ ํต๊ณ:")
print(df.groupby('category')[['income', 'score']].describe())
# ๋ค์ค ๊ทธ๋ฃน
print("\n์ฑ๋ณ & ์นดํ
๊ณ ๋ฆฌ๋ณ ํ๊ท ์๋:")
print(df.groupby(['gender', 'category'])['income'].mean().unstack())
5.2 ๊ทธ๋ฃน ๊ฐ ๋น๊ต¶
# t-๊ฒ์ (๋ ๊ทธ๋ฃน)
male_income = df[df['gender'] == 'M']['income'].dropna()
female_income = df[df['gender'] == 'F']['income'].dropna()
stat, p_value = stats.ttest_ind(male_income, female_income)
print(f"๋
๋ฆฝํ๋ณธ t-๊ฒ์ : t={stat:.4f}, p-value={p_value:.4f}")
# ANOVA (์ธ ๊ทธ๋ฃน ์ด์)
groups = [df[df['category'] == cat]['score'].dropna() for cat in df['category'].unique()]
stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F={stat:.4f}, p-value={p_value:.4f}")
# ์นด์ด์ ๊ณฑ ๊ฒ์ (๋ฒ์ฃผํ ๋ณ์)
contingency_table = pd.crosstab(df['gender'], df['category'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"์นด์ด์ ๊ณฑ ๊ฒ์ : ฯยฒ={chi2:.4f}, p-value={p_value:.4f}")
6. EDA ์๋ํ ๋ผ์ด๋ธ๋ฌ๋ฆฌ¶
6.1 pandas-profiling (ydata-profiling)¶
# pip install ydata-profiling
# from ydata_profiling import ProfileReport
# report = ProfileReport(df, title="EDA Report", explorative=True)
# report.to_file("eda_report.html")
# ๊ฐ์ํ ๋ฒ์ (๋์ฉ๋ ๋ฐ์ดํฐ)
# report = ProfileReport(df, minimal=True)
6.2 sweetviz¶
# pip install sweetviz
# import sweetviz as sv
# report = sv.analyze(df)
# report.show_html("sweetviz_report.html")
# ๋ ๋ฐ์ดํฐ์
๋น๊ต
# report = sv.compare(df1, df2)
7. EDA ์ฒดํฌ๋ฆฌ์คํธ¶
## EDA ์ฒดํฌ๋ฆฌ์คํธ
### 1. ๋ฐ์ดํฐ ๊ฐ์
- [ ] ํ/์ด ์ ํ์ธ
- [ ] ๋ฐ์ดํฐ ํ์
ํ์ธ
- [ ] ๋ฉ๋ชจ๋ฆฌ ์ฌ์ฉ๋ ํ์ธ
### 2. ๊ฒฐ์ธก์น
- [ ] ๊ฒฐ์ธก์น ์กด์ฌ ์ฌ๋ถ
- [ ] ๊ฒฐ์ธก์น ๋น์จ
- [ ] ๊ฒฐ์ธก์น ํจํด (MCAR, MAR, MNAR)
### 3. ์์นํ ๋ณ์
- [ ] ๊ธฐ์ ํต๊ณ (ํ๊ท , ์ค์๊ฐ, ํ์คํธ์ฐจ ๋ฑ)
- [ ] ๋ถํฌ ํํ (์๋, ์ฒจ๋)
- [ ] ์ด์์น ์กด์ฌ ์ฌ๋ถ
- [ ] ํ์คํ ๊ทธ๋จ/๋ฐ์คํ๋กฏ
### 4. ๋ฒ์ฃผํ ๋ณ์
- [ ] ์นดํ
๊ณ ๋ฆฌ ์
- [ ] ๋น๋ ๋ถํฌ
- [ ] ํฌ์ ์นดํ
๊ณ ๋ฆฌ ์กด์ฌ ์ฌ๋ถ
### 5. ๋ณ์ ๊ฐ ๊ด๊ณ
- [ ] ์๊ด๋ถ์ (์์นํ)
- [ ] ๊ต์ฐจํ (๋ฒ์ฃผํ)
- [ ] ๊ทธ๋ฃน๋ณ ๋น๊ต
### 6. ํ๊ฒ ๋ณ์
- [ ] ํ๊ฒ ๋ถํฌ (๋ถ๊ท ํ ์ฌ๋ถ)
- [ ] ํ๊ฒ๊ณผ ํน์ฑ ๊ฐ ๊ด๊ณ
Practice Problems¶
Problem 1: ๊ธฐ์ ํต๊ณ¶
๋ค์ ๋ฐ์ดํฐ์ 5์ ์์ฝ์ ๊ตฌํ์ธ์.
data = [12, 15, 18, 22, 25, 28, 30, 35, 40, 100]
s = pd.Series(data)
# Solution
print(f"์ต์: {s.min()}")
print(f"Q1: {s.quantile(0.25)}")
print(f"์ค์๊ฐ: {s.median()}")
print(f"Q3: {s.quantile(0.75)}")
print(f"์ต๋: {s.max()}")
Problem 2: ์๊ด๋ถ์¶
๋ ๋ณ์ ๊ฐ์ ์๊ด๊ณ์๋ฅผ ๊ตฌํ๊ณ ํด์ํ์ธ์.
df = pd.DataFrame({
'study_hours': [1, 2, 3, 4, 5, 6, 7, 8],
'score': [50, 55, 60, 65, 70, 80, 85, 95]
})
# Solution
corr = df['study_hours'].corr(df['score'])
print(f"์๊ด๊ณ์: {corr:.4f}")
# ๊ฐํ ์์ ์๊ด๊ด๊ณ (๊ณต๋ถ ์๊ฐ์ด ๋์๋ก ์ ์ ์ฆ๊ฐ)
Problem 3: ๊ทธ๋ฃน ๋น๊ต¶
๊ทธ๋ฃน๋ณ ํ๊ท ๊ณผ ํ์คํธ์ฐจ๋ฅผ ๊ตฌํ์ธ์.
df = pd.DataFrame({
'group': ['A', 'A', 'B', 'B', 'A', 'B'],
'value': [10, 12, 20, 22, 11, 21]
})
# Solution
print(df.groupby('group')['value'].agg(['mean', 'std']))
Summary¶
| Measure Type | Measure | Function |
|---|---|---|
| ์ค์ฌ ๊ฒฝํฅ | ํ๊ท , ์ค์๊ฐ, ์ต๋น๊ฐ | mean(), median(), mode() |
| ์ฐํฌ | ๋ถ์ฐ, ํ์คํธ์ฐจ, IQR | var(), std(), quantile() |
| ๋ถํฌ ํํ | ์๋, ์ฒจ๋ | skew(), kurtosis() |
| ๊ด๊ณ | ์๊ด๊ณ์, ๊ณต๋ถ์ฐ | corr(), cov() |
| ์์ฝ | ๊ธฐ์ ํต๊ณ | describe() |