{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exploratory Data Analysis with Pandas\n",
    "\n",
    "This notebook demonstrates essential exploratory data analysis (EDA) techniques using Pandas.\n",
    "\n",
    "## Topics Covered:\n",
    "- Dataset overview and inspection\n",
    "- Missing value analysis\n",
    "- Statistical summaries\n",
    "- Groupby operations and aggregations\n",
    "- Correlation analysis\n",
    "- Filtering and sorting\n",
    "- Date/time operations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "# Set random seed for reproducibility\n",
    "np.random.seed(42)\n",
    "\n",
    "# Set display options\n",
    "pd.set_option('display.max_columns', None)\n",
    "pd.set_option('display.max_rows', 100)\n",
    "pd.set_option('display.precision', 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Creating a Sample Dataset\n",
    "\n",
    "We'll create a synthetic sales dataset with multiple columns including dates, categories, and numerical values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate synthetic sales data\n",
    "n_records = 1000\n",
    "\n",
    "# Date range\n",
    "dates = pd.date_range(start='2023-01-01', end='2023-12-31', periods=n_records)\n",
    "\n",
    "# Product categories\n",
    "categories = ['Electronics', 'Clothing', 'Food', 'Books', 'Home & Garden']\n",
    "category_list = np.random.choice(categories, n_records)\n",
    "\n",
    "# Regions\n",
    "regions = ['North', 'South', 'East', 'West']\n",
    "region_list = np.random.choice(regions, n_records)\n",
    "\n",
    "# Sales amounts (with some variation by category)\n",
    "base_sales = np.random.uniform(10, 500, n_records)\n",
    "category_multipliers = {'Electronics': 2.0, 'Clothing': 1.2, 'Food': 0.8, 'Books': 1.0, 'Home & Garden': 1.5}\n",
    "sales = [base_sales[i] * category_multipliers[category_list[i]] for i in range(n_records)]\n",
    "\n",
    "# Quantities\n",
    "quantities = np.random.randint(1, 20, n_records)\n",
    "\n",
    "# Customer IDs\n",
    "customer_ids = np.random.randint(1000, 2000, n_records)\n",
    "\n",
    "# Create DataFrame\n",
    "df = pd.DataFrame({\n",
    "    'date': dates,\n",
    "    'customer_id': customer_ids,\n",
    "    'category': category_list,\n",
    "    'region': region_list,\n",
    "    'sales_amount': sales,\n",
    "    'quantity': quantities\n",
    "})\n",
    "\n",
    "# Calculate unit price\n",
    "df['unit_price'] = df['sales_amount'] / df['quantity']\n",
    "\n",
    "# Cast quantity to float so it can hold NaN without triggering\n",
    "# pandas' incompatible-dtype warning on assignment\n",
    "df['quantity'] = df['quantity'].astype('float64')\n",
    "\n",
    "# Introduce some missing values (realistic scenario)\n",
    "missing_indices = np.random.choice(df.index, size=50, replace=False)\n",
    "df.loc[missing_indices, 'sales_amount'] = np.nan\n",
    "\n",
    "missing_indices_2 = np.random.choice(df.index, size=30, replace=False)\n",
    "df.loc[missing_indices_2, 'quantity'] = np.nan\n",
    "\n",
    "print(\"Sample dataset created successfully!\")\n",
    "df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Dataset Overview\n",
    "\n",
    "First, let's understand the basic structure and characteristics of our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Shape of the dataset\n",
    "print(f\"Dataset shape: {df.shape}\")\n",
    "print(f\"Number of rows: {df.shape[0]}\")\n",
    "print(f\"Number of columns: {df.shape[1]}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Column names and data types\n",
    "print(\"\\nColumn names:\")\n",
    "print(df.columns.tolist())\n",
    "\n",
    "print(\"\\nData types:\")\n",
    "print(df.dtypes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Comprehensive information about the DataFrame\n",
    "df.info()"
   ]
  },
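  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an extra overview check, we can count duplicate rows and the number of unique values per column. This is a small sketch using only standard pandas calls on the `df` created above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: duplicate rows and per-column cardinality\n",
    "print(f\"Duplicate rows: {df.duplicated().sum()}\")\n",
    "\n",
    "print(\"\\nUnique values per column:\")\n",
    "print(df.nunique())"
   ]
  },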
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Missing Value Analysis\n",
    "\n",
    "Identifying and handling missing values is a critical step in EDA."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Count missing values per column\n",
    "missing_counts = df.isnull().sum()\n",
    "print(\"Missing values per column:\")\n",
    "print(missing_counts)\n",
    "\n",
    "# Percentage of missing values\n",
    "missing_percentages = missing_counts / len(df) * 100\n",
    "print(\"\\nPercentage of missing values:\")\n",
    "print(missing_percentages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Summarize missing values in a small table\n",
    "missing_summary = pd.DataFrame({\n",
    "    'Column': df.columns,\n",
    "    'Missing_Count': missing_counts.values,\n",
    "    'Missing_Percentage': missing_percentages.values\n",
    "})\n",
    "missing_summary = missing_summary[missing_summary['Missing_Count'] > 0]\n",
    "print(\"Summary of columns with missing values:\")\n",
    "print(missing_summary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Handle missing values - create a copy for demonstration\n",
    "df_filled = df.copy()\n",
    "\n",
    "# Fill missing sales_amount with the median; assign the result back\n",
    "# rather than using fillna(..., inplace=True) on a column, which is\n",
    "# deprecated chained assignment in recent pandas\n",
    "df_filled['sales_amount'] = df_filled['sales_amount'].fillna(df_filled['sales_amount'].median())\n",
    "\n",
    "# Fill missing quantity with the mode (most common value)\n",
    "df_filled['quantity'] = df_filled['quantity'].fillna(df_filled['quantity'].mode()[0])\n",
    "\n",
    "# Recalculate unit_price\n",
    "df_filled['unit_price'] = df_filled['sales_amount'] / df_filled['quantity']\n",
    "\n",
    "print(\"Missing values after filling:\")\n",
    "print(df_filled.isnull().sum())"
   ]
  },
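  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since sales levels differ by category, a per-category median is often a better fill value than the global median. Below is a sketch of a group-wise fill using `groupby().transform()`; it works on a fresh copy of `df` so it does not alter the `df_filled` used in the rest of the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: fill missing sales_amount with the per-category median\n",
    "df_group_filled = df.copy()\n",
    "category_medians = df_group_filled.groupby('category')['sales_amount'].transform('median')\n",
    "df_group_filled['sales_amount'] = df_group_filled['sales_amount'].fillna(category_medians)\n",
    "\n",
    "print(\"Missing sales_amount values after group-wise fill:\")\n",
    "print(df_group_filled['sales_amount'].isnull().sum())"
   ]
  },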
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Statistical Summaries\n",
    "\n",
    "Understanding the distribution and central tendencies of numerical columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Descriptive statistics for numerical columns\n",
    "df_filled.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Detailed statistics for a specific column\n",
    "print(\"Statistics for sales_amount:\")\n",
    "print(f\"Mean: {df_filled['sales_amount'].mean():.2f}\")\n",
    "print(f\"Median: {df_filled['sales_amount'].median():.2f}\")\n",
    "print(f\"Standard Deviation: {df_filled['sales_amount'].std():.2f}\")\n",
    "print(f\"Min: {df_filled['sales_amount'].min():.2f}\")\n",
    "print(f\"Max: {df_filled['sales_amount'].max():.2f}\")\n",
    "print(f\"25th Percentile: {df_filled['sales_amount'].quantile(0.25):.2f}\")\n",
    "print(f\"75th Percentile: {df_filled['sales_amount'].quantile(0.75):.2f}\")\n",
    "print(f\"IQR: {df_filled['sales_amount'].quantile(0.75) - df_filled['sales_amount'].quantile(0.25):.2f}\")"
   ]
  },
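  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The IQR computed above is commonly used to flag outliers: values more than 1.5 × IQR below the first quartile or above the third. A sketch of that rule applied to `sales_amount`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: flag outliers with the 1.5 * IQR rule\n",
    "q1 = df_filled['sales_amount'].quantile(0.25)\n",
    "q3 = df_filled['sales_amount'].quantile(0.75)\n",
    "iqr = q3 - q1\n",
    "lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr\n",
    "\n",
    "outliers = df_filled[(df_filled['sales_amount'] < lower) | (df_filled['sales_amount'] > upper)]\n",
    "print(f\"Outlier bounds: [{lower:.2f}, {upper:.2f}]\")\n",
    "print(f\"Flagged outliers: {len(outliers)}\")"
   ]
  },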
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Summary statistics for categorical columns\n",
    "print(\"Category distribution:\")\n",
    "print(df_filled['category'].value_counts())\n",
    "\n",
    "print(\"\\nRegion distribution:\")\n",
    "print(df_filled['region'].value_counts())"
   ]
  },
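  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`describe()` can also summarize the categorical columns directly, reporting the count, number of unique values, and the most frequent value; a quick sketch:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Describe the object-dtype (categorical) columns: count, unique, top, freq\n",
    "df_filled.describe(include='object')"
   ]
  },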
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Groupby Operations and Aggregations\n",
    "\n",
    "Analyzing data by groups to identify patterns and trends."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Group by category and calculate aggregates\n",
    "category_summary = df_filled.groupby('category').agg({\n",
    "    'sales_amount': ['sum', 'mean', 'count'],\n",
    "    'quantity': ['sum', 'mean'],\n",
    "    'unit_price': 'mean'\n",
    "}).round(2)\n",
    "\n",
    "print(\"Summary by Category:\")\n",
    "print(category_summary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Group by region\n",
    "region_summary = df_filled.groupby('region').agg({\n",
    "    'sales_amount': ['sum', 'mean'],\n",
    "    'customer_id': 'nunique'  # Count unique customers\n",
    "}).round(2)\n",
    "\n",
    "region_summary.columns = ['Total Sales', 'Avg Sales', 'Unique Customers']\n",
    "print(\"\\nSummary by Region:\")\n",
    "print(region_summary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Multiple groupby - category and region\n",
    "multi_group = df_filled.groupby(['category', 'region'])['sales_amount'].agg(['sum', 'mean', 'count']).round(2)\n",
    "print(\"\\nSummary by Category and Region:\")\n",
    "print(multi_group.head(10))"
   ]
  },
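  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same category-by-region breakdown is often easier to scan as a matrix. A sketch using `pivot_table`, which is equivalent to the groupby above but spreads the regions across columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pivot table: total sales with categories as rows and regions as columns\n",
    "pivot = df_filled.pivot_table(values='sales_amount', index='category',\n",
    "                              columns='region', aggfunc='sum').round(2)\n",
    "print(\"Total sales by category and region:\")\n",
    "print(pivot)"
   ]
  },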
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Correlation Analysis\n",
    "\n",
    "Examining relationships between numerical variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select only numerical columns for correlation\n",
    "numerical_cols = df_filled.select_dtypes(include=[np.number]).columns\n",
    "correlation_matrix = df_filled[numerical_cols].corr()\n",
    "\n",
    "print(\"Correlation Matrix:\")\n",
    "print(correlation_matrix)"
   ]
  },
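  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A heatmap makes the correlation matrix easier to scan. This sketch uses the seaborn and matplotlib imports from the setup cell:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize the correlation matrix as a heatmap\n",
    "plt.figure(figsize=(6, 4))\n",
    "sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)\n",
    "plt.title('Correlation Matrix')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },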
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Correlation with a specific column\n",
    "print(\"\\nCorrelation with sales_amount:\")\n",
    "print(correlation_matrix['sales_amount'].sort_values(ascending=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Filtering and Sorting\n",
    "\n",
    "Essential techniques for data exploration and analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Filter: High-value sales (> 500)\n",
    "high_value_sales = df_filled[df_filled['sales_amount'] > 500]\n",
    "print(f\"Number of high-value sales: {len(high_value_sales)}\")\n",
    "print(high_value_sales.head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Multiple conditions: Electronics in North region\n",
    "electronics_north = df_filled[(df_filled['category'] == 'Electronics') & (df_filled['region'] == 'North')]\n",
    "print(f\"\\nElectronics sales in North region: {len(electronics_north)}\")\n",
    "print(electronics_north.head())"
   ]
  },
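  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`DataFrame.query()` expresses the same multi-condition filter as a string, which some readers find easier to follow; a sketch:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Equivalent filter using query() syntax\n",
    "electronics_north_q = df_filled.query(\"category == 'Electronics' and region == 'North'\")\n",
    "print(f\"Rows returned by query(): {len(electronics_north_q)}\")"
   ]
  },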
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sort by sales_amount (descending)\n",
    "top_sales = df_filled.sort_values('sales_amount', ascending=False).head(10)\n",
    "print(\"\\nTop 10 sales:\")\n",
    "print(top_sales[['date', 'category', 'region', 'sales_amount', 'quantity']])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Value counts for categorical analysis\n",
    "print(\"\\nTop categories by number of transactions:\")\n",
    "print(df_filled['category'].value_counts())\n",
    "\n",
    "print(\"\\nCategory proportions:\")\n",
    "print(df_filled['category'].value_counts(normalize=True))"
   ]
  },
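  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`pd.crosstab` produces a two-way frequency table across the two categorical columns in one call; a sketch:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Two-way frequency table: transaction counts by category and region\n",
    "print(pd.crosstab(df_filled['category'], df_filled['region']))"
   ]
  },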
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Date/Time Operations\n",
    "\n",
    "Working with temporal data to identify trends and patterns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract date components\n",
    "df_filled['year'] = df_filled['date'].dt.year\n",
    "df_filled['month'] = df_filled['date'].dt.month\n",
    "df_filled['day'] = df_filled['date'].dt.day\n",
    "df_filled['day_of_week'] = df_filled['date'].dt.day_name()\n",
    "df_filled['quarter'] = df_filled['date'].dt.quarter\n",
    "\n",
    "print(\"Date components added:\")\n",
    "print(df_filled[['date', 'year', 'month', 'day', 'day_of_week', 'quarter']].head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Monthly sales trend\n",
    "monthly_sales = df_filled.groupby('month')['sales_amount'].agg(['sum', 'mean', 'count']).round(2)\n",
    "monthly_sales.columns = ['Total Sales', 'Avg Sales', 'Transactions']\n",
    "print(\"\\nMonthly Sales Summary:\")\n",
    "print(monthly_sales)"
   ]
  },
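  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An alternative to grouping on an extracted month column is `resample`, which works directly on a datetime index and plots nicely. A sketch (note that pandas >= 2.2 spells the month-end alias 'ME' instead of 'M'):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Monthly totals via resample on a datetime index\n",
    "# Note: pandas >= 2.2 prefers the 'ME' alias over 'M'\n",
    "monthly_totals = df_filled.set_index('date')['sales_amount'].resample('M').sum()\n",
    "\n",
    "monthly_totals.plot(kind='line', title='Monthly Sales (resampled)')\n",
    "plt.ylabel('Total sales')\n",
    "plt.show()"
   ]
  },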
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sales by day of week, reindexed into calendar order\n",
    "# (a plain groupby would sort the day names alphabetically)\n",
    "day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']\n",
    "day_of_week_sales = df_filled.groupby('day_of_week')['sales_amount'].agg(['sum', 'mean', 'count']).round(2).reindex(day_order)\n",
    "print(\"\\nSales by Day of Week:\")\n",
    "print(day_of_week_sales)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Quarterly performance\n",
    "quarterly_sales = df_filled.groupby('quarter')['sales_amount'].agg(['sum', 'mean', 'count']).round(2)\n",
    "quarterly_sales.columns = ['Total Sales', 'Avg Sales', 'Transactions']\n",
    "print(\"\\nQuarterly Sales Summary:\")\n",
    "print(quarterly_sales)"
   ]
  },
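  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a smoother view than calendar buckets, a time-based rolling window can be applied on the datetime index. A sketch of a 7-day rolling mean (the index produced by `date_range` is already sorted, which `rolling` requires):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 7-day rolling average of sales (requires a sorted datetime index)\n",
    "daily_sales = df_filled.set_index('date').sort_index()['sales_amount']\n",
    "rolling_7d = daily_sales.rolling('7D').mean()\n",
    "print(rolling_7d.head(10))"
   ]
  },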
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. Advanced Filtering and Insights\n",
    "\n",
    "Combining techniques to derive actionable insights."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find top-performing category-region combinations\n",
    "top_combinations = df_filled.groupby(['category', 'region'])['sales_amount'].sum().sort_values(ascending=False).head(10)\n",
    "print(\"Top 10 Category-Region Combinations by Total Sales:\")\n",
    "print(top_combinations)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Identify customers with highest total purchases\n",
    "top_customers = df_filled.groupby('customer_id').agg({\n",
    "    'sales_amount': 'sum',\n",
    "    'date': 'count'\n",
    "}).sort_values('sales_amount', ascending=False).head(10)\n",
    "top_customers.columns = ['Total Purchases', 'Number of Transactions']\n",
    "print(\"\\nTop 10 Customers:\")\n",
    "print(top_customers)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate category-wise market share\n",
    "category_sales = df_filled.groupby('category')['sales_amount'].sum()\n",
    "category_market_share = (category_sales / category_sales.sum() * 100).round(2)\n",
    "print(\"\\nMarket Share by Category (%):\")\n",
    "print(category_market_share.sort_values(ascending=False))"
   ]
  },
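  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick bar chart of the market-share figures, sketched with the matplotlib import from the setup cell:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Bar chart of market share by category\n",
    "category_market_share.sort_values(ascending=False).plot(kind='bar', title='Market Share by Category (%)')\n",
    "plt.ylabel('Share (%)')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },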
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "In this notebook, we covered essential EDA techniques:\n",
    "\n",
    "1. **Dataset Overview**: Understanding shape, columns, and data types\n",
    "2. **Missing Value Analysis**: Identifying and handling missing data\n",
    "3. **Statistical Summaries**: Mean, median, standard deviation, quantiles\n",
    "4. **Groupby Operations**: Aggregating data by categories\n",
    "5. **Correlation Analysis**: Examining relationships between variables\n",
    "6. **Filtering and Sorting**: Extracting specific subsets of data\n",
    "7. **Date/Time Operations**: Temporal analysis and trends\n",
    "8. **Advanced Insights**: Combining techniques for business intelligence\n",
    "\n",
    "These techniques form the foundation of data analysis and are essential for understanding your data before modeling."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}