06_pandas_eda.ipynb

  1{
  2 "cells": [
  3  {
  4   "cell_type": "markdown",
  5   "metadata": {},
  6   "source": [
  7    "# Exploratory Data Analysis with Pandas\n",
  8    "\n",
  9    "This notebook demonstrates essential exploratory data analysis (EDA) techniques using Pandas.\n",
 10    "\n",
 11    "## Topics Covered:\n",
 12    "- Dataset overview and inspection\n",
 13    "- Missing value analysis\n",
 14    "- Statistical summaries\n",
 15    "- Groupby operations and aggregations\n",
 16    "- Correlation analysis\n",
 17    "- Filtering and sorting\n",
 18    "- Date/time operations"
 19   ]
 20  },
 21  {
 22   "cell_type": "code",
 23   "execution_count": null,
 24   "metadata": {},
 25   "outputs": [],
 26   "source": [
 27    "import numpy as np\n",
 28    "import pandas as pd\n",
 29    "import matplotlib.pyplot as plt\n",
 30    "import seaborn as sns\n",
 31    "\n",
 32    "# Set random seed for reproducibility\n",
 33    "np.random.seed(42)\n",
 34    "\n",
 35    "# Set display options\n",
 36    "pd.set_option('display.max_columns', None)\n",
 37    "pd.set_option('display.max_rows', 100)\n",
 38    "pd.set_option('display.precision', 2)"
 39   ]
 40  },
 41  {
 42   "cell_type": "markdown",
 43   "metadata": {},
 44   "source": [
 45    "## 1. Creating a Sample Dataset\n",
 46    "\n",
 47    "We'll create a synthetic sales dataset with multiple columns including dates, categories, and numerical values."
 48   ]
 49  },
 50  {
 51   "cell_type": "code",
 52   "execution_count": null,
 53   "metadata": {},
 54   "outputs": [],
 55   "source": [
 56    "# Generate synthetic sales data\n",
 57    "n_records = 1000\n",
 58    "\n",
 59    "# Date range\n",
 60    "dates = pd.date_range(start='2023-01-01', end='2023-12-31', periods=n_records)\n",
 61    "\n",
 62    "# Product categories\n",
 63    "categories = ['Electronics', 'Clothing', 'Food', 'Books', 'Home & Garden']\n",
 64    "category_list = np.random.choice(categories, n_records)\n",
 65    "\n",
 66    "# Regions\n",
 67    "regions = ['North', 'South', 'East', 'West']\n",
 68    "region_list = np.random.choice(regions, n_records)\n",
 69    "\n",
 70    "# Sales amounts (with some variation by category)\n",
 71    "base_sales = np.random.uniform(10, 500, n_records)\n",
 72    "category_multipliers = {'Electronics': 2.0, 'Clothing': 1.2, 'Food': 0.8, 'Books': 1.0, 'Home & Garden': 1.5}\n",
 73    "sales = [base_sales[i] * category_multipliers[category_list[i]] for i in range(n_records)]\n",
 74    "\n",
 75    "# Quantities\n",
 76    "quantities = np.random.randint(1, 20, n_records)\n",
 77    "\n",
 78    "# Customer IDs\n",
 79    "customer_ids = np.random.randint(1000, 2000, n_records)\n",
 80    "\n",
 81    "# Create DataFrame\n",
 82    "df = pd.DataFrame({\n",
 83    "    'date': dates,\n",
 84    "    'customer_id': customer_ids,\n",
 85    "    'category': category_list,\n",
 86    "    'region': region_list,\n",
 87    "    'sales_amount': sales,\n",
 88    "    'quantity': quantities\n",
 89    "})\n",
 90    "\n",
 91    "# Calculate unit price\n",
 92    "df['unit_price'] = df['sales_amount'] / df['quantity']\n",
 93    "\n",
 94    "# Introduce some missing values (realistic scenario)\n",
 95    "missing_indices = np.random.choice(df.index, size=50, replace=False)\n",
 96    "df.loc[missing_indices, 'sales_amount'] = np.nan\n",
 97    "\n",
 98    "missing_indices_2 = np.random.choice(df.index, size=30, replace=False)\n",
 99    "df.loc[missing_indices_2, 'quantity'] = np.nan\n",
100    "\n",
101    "print(\"Sample dataset created successfully!\")\n",
102    "df.head(10)"
103   ]
104  },
105  {
106   "cell_type": "markdown",
107   "metadata": {},
108   "source": [
109    "## 2. Dataset Overview\n",
110    "\n",
111    "First, let's understand the basic structure and characteristics of our dataset."
112   ]
113  },
114  {
115   "cell_type": "code",
116   "execution_count": null,
117   "metadata": {},
118   "outputs": [],
119   "source": [
120    "# Shape of the dataset\n",
121    "print(f\"Dataset shape: {df.shape}\")\n",
122    "print(f\"Number of rows: {df.shape[0]}\")\n",
123    "print(f\"Number of columns: {df.shape[1]}\")"
124   ]
125  },
126  {
127   "cell_type": "code",
128   "execution_count": null,
129   "metadata": {},
130   "outputs": [],
131   "source": [
132    "# Column names and data types\n",
133    "print(\"\\nColumn names:\")\n",
134    "print(df.columns.tolist())\n",
135    "\n",
136    "print(\"\\nData types:\")\n",
137    "print(df.dtypes)"
138   ]
139  },
140  {
141   "cell_type": "code",
142   "execution_count": null,
143   "metadata": {},
144   "outputs": [],
145   "source": [
146    "# Comprehensive information about the DataFrame\n",
147    "df.info()"
148   ]
149  },
150  {
151   "cell_type": "markdown",
152   "metadata": {},
153   "source": [
154    "## 3. Missing Value Analysis\n",
155    "\n",
156    "Identifying and handling missing values is a critical step in EDA."
157   ]
158  },
159  {
160   "cell_type": "code",
161   "execution_count": null,
162   "metadata": {},
163   "outputs": [],
164   "source": [
165    "# Count missing values per column\n",
166    "missing_counts = df.isnull().sum()\n",
167    "print(\"Missing values per column:\")\n",
168    "print(missing_counts)\n",
169    "\n",
170    "# Percentage of missing values\n",
171    "missing_percentages = (df.isnull().sum() / len(df)) * 100\n",
172    "print(\"\\nPercentage of missing values:\")\n",
173    "print(missing_percentages)"
174   ]
175  },
176  {
177   "cell_type": "code",
178   "execution_count": null,
179   "metadata": {},
180   "outputs": [],
181   "source": [
182    "# Visualize missing values\n",
183    "missing_summary = pd.DataFrame({\n",
184    "    'Column': df.columns,\n",
185    "    'Missing_Count': missing_counts.values,\n",
186    "    'Missing_Percentage': missing_percentages.values\n",
187    "})\n",
188    "missing_summary = missing_summary[missing_summary['Missing_Count'] > 0]\n",
189    "print(\"Summary of columns with missing values:\")\n",
190    "print(missing_summary)"
191   ]
192  },
193  {
194   "cell_type": "code",
195   "execution_count": null,
196   "metadata": {},
197   "outputs": [],
198   "source": [
199    "# Handle missing values - create a copy for demonstration\n",
200    "df_filled = df.copy()\n",
201    "\n",
202    "# Fill missing sales_amount with the median (use assignment; chained inplace=True is deprecated in pandas 2.x)\n",
203    "df_filled['sales_amount'] = df_filled['sales_amount'].fillna(df_filled['sales_amount'].median())\n",
204    "\n",
205    "# Fill missing quantity with the mode (most common value)\n",
206    "df_filled['quantity'] = df_filled['quantity'].fillna(df_filled['quantity'].mode()[0])\n",
207    "\n",
208    "# Recalculate unit_price\n",
209    "df_filled['unit_price'] = df_filled['sales_amount'] / df_filled['quantity']\n",
210    "\n",
211    "print(\"Missing values after filling:\")\n",
212    "print(df_filled.isnull().sum())"
213   ]
214  },
215  {
216   "cell_type": "markdown",
217   "metadata": {},
218   "source": [
219    "## 4. Statistical Summaries\n",
220    "\n",
221    "Understanding the distribution and central tendencies of numerical columns."
222   ]
223  },
224  {
225   "cell_type": "code",
226   "execution_count": null,
227   "metadata": {},
228   "outputs": [],
229   "source": [
230    "# Descriptive statistics for numerical columns\n",
231    "df_filled.describe()"
232   ]
233  },
234  {
235   "cell_type": "code",
236   "execution_count": null,
237   "metadata": {},
238   "outputs": [],
239   "source": [
240    "# Detailed statistics for a specific column\n",
241    "print(\"Statistics for sales_amount:\")\n",
242    "print(f\"Mean: {df_filled['sales_amount'].mean():.2f}\")\n",
243    "print(f\"Median: {df_filled['sales_amount'].median():.2f}\")\n",
244    "print(f\"Standard Deviation: {df_filled['sales_amount'].std():.2f}\")\n",
245    "print(f\"Min: {df_filled['sales_amount'].min():.2f}\")\n",
246    "print(f\"Max: {df_filled['sales_amount'].max():.2f}\")\n",
247    "print(f\"25th Percentile: {df_filled['sales_amount'].quantile(0.25):.2f}\")\n",
248    "print(f\"75th Percentile: {df_filled['sales_amount'].quantile(0.75):.2f}\")\n",
249    "print(f\"IQR: {df_filled['sales_amount'].quantile(0.75) - df_filled['sales_amount'].quantile(0.25):.2f}\")"
250   ]
251  },
252  {
253   "cell_type": "code",
254   "execution_count": null,
255   "metadata": {},
256   "outputs": [],
257   "source": [
258    "# Summary statistics for categorical columns\n",
259    "print(\"Category distribution:\")\n",
260    "print(df_filled['category'].value_counts())\n",
261    "\n",
262    "print(\"\\nRegion distribution:\")\n",
263    "print(df_filled['region'].value_counts())"
264   ]
265  },
266  {
267   "cell_type": "markdown",
268   "metadata": {},
269   "source": [
270    "## 5. Groupby Operations and Aggregations\n",
271    "\n",
272    "Analyzing data by groups to identify patterns and trends."
273   ]
274  },
275  {
276   "cell_type": "code",
277   "execution_count": null,
278   "metadata": {},
279   "outputs": [],
280   "source": [
281    "# Group by category and calculate aggregates\n",
282    "category_summary = df_filled.groupby('category').agg({\n",
283    "    'sales_amount': ['sum', 'mean', 'count'],\n",
284    "    'quantity': ['sum', 'mean'],\n",
285    "    'unit_price': 'mean'\n",
286    "}).round(2)\n",
287    "\n",
288    "print(\"Summary by Category:\")\n",
289    "print(category_summary)"
290   ]
291  },
292  {
293   "cell_type": "code",
294   "execution_count": null,
295   "metadata": {},
296   "outputs": [],
297   "source": [
298    "# Group by region\n",
299    "region_summary = df_filled.groupby('region').agg({\n",
300    "    'sales_amount': ['sum', 'mean'],\n",
301    "    'customer_id': 'nunique'  # Count unique customers\n",
302    "}).round(2)\n",
303    "\n",
304    "region_summary.columns = ['Total Sales', 'Avg Sales', 'Unique Customers']\n",
305    "print(\"\\nSummary by Region:\")\n",
306    "print(region_summary)"
307   ]
308  },
309  {
310   "cell_type": "code",
311   "execution_count": null,
312   "metadata": {},
313   "outputs": [],
314   "source": [
315    "# Multiple groupby - category and region\n",
316    "multi_group = df_filled.groupby(['category', 'region'])['sales_amount'].agg(['sum', 'mean', 'count']).round(2)\n",
317    "print(\"\\nSummary by Category and Region:\")\n",
318    "print(multi_group.head(10))"
319   ]
320  },
321  {
322   "cell_type": "markdown",
323   "metadata": {},
324   "source": [
325    "## 6. Correlation Analysis\n",
326    "\n",
327    "Examining relationships between numerical variables."
328   ]
329  },
330  {
331   "cell_type": "code",
332   "execution_count": null,
333   "metadata": {},
334   "outputs": [],
335   "source": [
336    "# Select numerical columns for correlation (drop customer_id, an identifier whose correlations are not meaningful)\n",
337    "numerical_cols = df_filled.select_dtypes(include=[np.number]).columns.drop('customer_id')\n",
338    "correlation_matrix = df_filled[numerical_cols].corr()\n",
339    "\n",
340    "print(\"Correlation Matrix:\")\n",
341    "print(correlation_matrix)"
342   ]
343  },
344  {
345   "cell_type": "code",
346   "execution_count": null,
347   "metadata": {},
348   "outputs": [],
349   "source": [
350    "# Correlation with a specific column\n",
351    "print(\"\\nCorrelation with sales_amount:\")\n",
352    "print(correlation_matrix['sales_amount'].sort_values(ascending=False))"
353   ]
354  },
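 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "A heatmap makes the correlation structure easier to scan than the printed matrix. This is a minimal sketch using the seaborn import from the setup cell; the color map is an arbitrary choice."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# Visualize the correlation matrix as an annotated heatmap\n",
   "plt.figure(figsize=(8, 6))\n",
   "sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)\n",
   "plt.title('Correlation Matrix')\n",
   "plt.tight_layout()\n",
   "plt.show()"
  ]
 },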
355  {
356   "cell_type": "markdown",
357   "metadata": {},
358   "source": [
359    "## 7. Filtering and Sorting\n",
360    "\n",
361    "Essential techniques for data exploration and analysis."
362   ]
363  },
364  {
365   "cell_type": "code",
366   "execution_count": null,
367   "metadata": {},
368   "outputs": [],
369   "source": [
370    "# Filter: High-value sales (> 500)\n",
371    "high_value_sales = df_filled[df_filled['sales_amount'] > 500]\n",
372    "print(f\"Number of high-value sales: {len(high_value_sales)}\")\n",
373    "print(high_value_sales.head())"
374   ]
375  },
376  {
377   "cell_type": "code",
378   "execution_count": null,
379   "metadata": {},
380   "outputs": [],
381   "source": [
382    "# Multiple conditions: Electronics in North region\n",
383    "electronics_north = df_filled[(df_filled['category'] == 'Electronics') & (df_filled['region'] == 'North')]\n",
384    "print(f\"\\nElectronics sales in North region: {len(electronics_north)}\")\n",
385    "print(electronics_north.head())"
386   ]
387  },
388  {
389   "cell_type": "code",
390   "execution_count": null,
391   "metadata": {},
392   "outputs": [],
393   "source": [
394    "# Sort by sales_amount (descending)\n",
395    "top_sales = df_filled.sort_values('sales_amount', ascending=False).head(10)\n",
396    "print(\"\\nTop 10 sales:\")\n",
397    "print(top_sales[['date', 'category', 'region', 'sales_amount', 'quantity']])"
398   ]
399  },
400  {
401   "cell_type": "code",
402   "execution_count": null,
403   "metadata": {},
404   "outputs": [],
405   "source": [
406    "# Value counts for categorical analysis\n",
407    "print(\"\\nTop categories by number of transactions:\")\n",
408    "print(df_filled['category'].value_counts())\n",
409    "\n",
410    "print(\"\\nCategory proportions:\")\n",
411    "print(df_filled['category'].value_counts(normalize=True))"
412   ]
413  },
414  {
415   "cell_type": "markdown",
416   "metadata": {},
417   "source": [
418    "## 8. Date/Time Operations\n",
419    "\n",
420    "Working with temporal data to identify trends and patterns."
421   ]
422  },
423  {
424   "cell_type": "code",
425   "execution_count": null,
426   "metadata": {},
427   "outputs": [],
428   "source": [
429    "# Extract date components\n",
430    "df_filled['year'] = df_filled['date'].dt.year\n",
431    "df_filled['month'] = df_filled['date'].dt.month\n",
432    "df_filled['day'] = df_filled['date'].dt.day\n",
433    "df_filled['day_of_week'] = df_filled['date'].dt.day_name()\n",
434    "df_filled['quarter'] = df_filled['date'].dt.quarter\n",
435    "\n",
436    "print(\"Date components added:\")\n",
437    "print(df_filled[['date', 'year', 'month', 'day', 'day_of_week', 'quarter']].head())"
438   ]
439  },
440  {
441   "cell_type": "code",
442   "execution_count": null,
443   "metadata": {},
444   "outputs": [],
445   "source": [
446    "# Monthly sales trend\n",
447    "monthly_sales = df_filled.groupby('month')['sales_amount'].agg(['sum', 'mean', 'count']).round(2)\n",
448    "monthly_sales.columns = ['Total Sales', 'Avg Sales', 'Transactions']\n",
449    "print(\"\\nMonthly Sales Summary:\")\n",
450    "print(monthly_sales)"
451   ]
452  },
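 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "Plotting the monthly totals makes the trend easier to see than the printed table. A minimal sketch using the matplotlib import from the setup cell:"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# Line plot of total sales per month (uses the renamed 'Total Sales' column)\n",
   "plt.figure(figsize=(8, 4))\n",
   "plt.plot(monthly_sales.index, monthly_sales['Total Sales'], marker='o')\n",
   "plt.xlabel('Month')\n",
   "plt.ylabel('Total Sales')\n",
   "plt.title('Monthly Sales Trend')\n",
   "plt.tight_layout()\n",
   "plt.show()"
  ]
 },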
453  {
454   "cell_type": "code",
455   "execution_count": null,
456   "metadata": {},
457   "outputs": [],
458   "source": [
459    "# Sales by day of week, reindexed into calendar order (groupby sorts alphabetically by default)\n",
460    "day_of_week_sales = df_filled.groupby('day_of_week')['sales_amount'].agg(['sum', 'mean', 'count']).round(2).reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])\n",
461    "print(\"\\nSales by Day of Week:\")\n",
462    "print(day_of_week_sales)"
463   ]
464  },
465  {
466   "cell_type": "code",
467   "execution_count": null,
468   "metadata": {},
469   "outputs": [],
470   "source": [
471    "# Quarterly performance\n",
472    "quarterly_sales = df_filled.groupby('quarter')['sales_amount'].agg(['sum', 'mean', 'count']).round(2)\n",
473    "quarterly_sales.columns = ['Total Sales', 'Avg Sales', 'Transactions']\n",
474    "print(\"\\nQuarterly Sales Summary:\")\n",
475    "print(quarterly_sales)"
476   ]
477  },
478  {
479   "cell_type": "markdown",
480   "metadata": {},
481   "source": [
482    "## 9. Advanced Filtering and Insights\n",
483    "\n",
484    "Combining techniques to derive actionable insights."
485   ]
486  },
487  {
488   "cell_type": "code",
489   "execution_count": null,
490   "metadata": {},
491   "outputs": [],
492   "source": [
493    "# Find top-performing category-region combinations\n",
494    "top_combinations = df_filled.groupby(['category', 'region'])['sales_amount'].sum().sort_values(ascending=False).head(10)\n",
495    "print(\"Top 10 Category-Region Combinations by Total Sales:\")\n",
496    "print(top_combinations)"
497   ]
498  },
499  {
500   "cell_type": "code",
501   "execution_count": null,
502   "metadata": {},
503   "outputs": [],
504   "source": [
505    "# Identify customers with highest total purchases\n",
506    "top_customers = df_filled.groupby('customer_id').agg({\n",
507    "    'sales_amount': 'sum',\n",
508    "    'date': 'count'\n",
509    "}).sort_values('sales_amount', ascending=False).head(10)\n",
510    "top_customers.columns = ['Total Purchases', 'Number of Transactions']\n",
511    "print(\"\\nTop 10 Customers:\")\n",
512    "print(top_customers)"
513   ]
514  },
515  {
516   "cell_type": "code",
517   "execution_count": null,
518   "metadata": {},
519   "outputs": [],
520   "source": [
521    "# Calculate category-wise market share\n",
522    "category_sales = df_filled.groupby('category')['sales_amount'].sum()\n",
523    "category_market_share = (category_sales / category_sales.sum() * 100).round(2)\n",
524    "print(\"\\nMarket Share by Category (%):\")\n",
525    "print(category_market_share.sort_values(ascending=False))"
526   ]
527  },
528  {
529   "cell_type": "markdown",
530   "metadata": {},
531   "source": [
532    "## Summary\n",
533    "\n",
534    "In this notebook, we covered essential EDA techniques:\n",
535    "\n",
536    "1. **Dataset Overview**: Understanding shape, columns, and data types\n",
537    "2. **Missing Value Analysis**: Identifying and handling missing data\n",
538    "3. **Statistical Summaries**: Mean, median, standard deviation, quantiles\n",
539    "4. **Groupby Operations**: Aggregating data by categories\n",
540    "5. **Correlation Analysis**: Examining relationships between variables\n",
541    "6. **Filtering and Sorting**: Extracting specific subsets of data\n",
542    "7. **Date/Time Operations**: Temporal analysis and trends\n",
543    "8. **Advanced Insights**: Combining techniques for business intelligence\n",
544    "\n",
545    "These techniques form the foundation of data analysis and are essential for understanding your data before modeling."
546   ]
547  }
548 ],
549 "metadata": {
550  "kernelspec": {
551   "display_name": "Python 3",
552   "language": "python",
553   "name": "python3"
554  },
555  "language_info": {
556   "codemirror_mode": {
557    "name": "ipython",
558    "version": 3
559   },
560   "file_extension": ".py",
561   "mimetype": "text/x-python",
562   "name": "python",
563   "nbconvert_exporter": "python",
564   "pygments_lexer": "ipython3",
565   "version": "3.8.0"
566  }
567 },
568 "nbformat": 4,
569 "nbformat_minor": 4
570}