## 1. Definitions
* **Iteration:** The process of accessing each element (row or column) of a DataFrame one by one.
* **Descriptive Statistics:** Statistical functions used to summarize the central tendency, dispersion, and shape of a dataset's distribution (e.g., mean, median, max).
* **Missing Data (NaN):** `NaN` stands for "Not a Number." It marks missing or undefined values in the dataset.
* **Binary Operation:** Mathematical operations performed between two DataFrames or a Series and a DataFrame (e.g., addition, subtraction).
* **Groupby:** A technique to split data into groups based on some criteria, apply a function to each group, and combine the results.
---
## 2.1 Introduction & Setup
Before we start, let's create a "Master DataFrame" that we will use for most examples. Imagine this is a result sheet for a class.
### Code Example: Creating the Dataset
```python
import pandas as pd
import numpy as np
# Creating a Dictionary
data = {
    'Name': ['Arjun', 'Bina', 'Chirag', 'Divya'],
    'Maths': [90, 85, 78, 92],
    'IP': [88, 95, np.nan, 90],  # Chirag was absent for IP (NaN)
    'Section': ['A', 'A', 'B', 'B']
}
# Creating the DataFrame
df = pd.DataFrame(data)
print(df)
```
### Output
```text
     Name  Maths    IP Section
0   Arjun     90  88.0       A
1    Bina     85  95.0       A
2  Chirag     78   NaN       B
3   Divya     92  90.0       B
```
---
## 2.2 Iterating Over a DataFrame
Iteration means visiting the data items one by one.
### 2.2.1 Horizontal Iteration (`iterrows()`)
Traverses the DataFrame **row by row**. Each step yields the row's index label and the row itself as a Series.
### Code Example
```python
# Printing Name and Maths marks for each student
for index, row in df.iterrows():
    print(f"Index {index}: {row['Name']} scored {row['Maths']}")
```
### Output
```text
Index 0: Arjun scored 90
Index 1: Bina scored 85
Index 2: Chirag scored 78
Index 3: Divya scored 92
```
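As a small extension of the loop above, the sketch below collects the names of students whose Maths mark is above a cutoff. The cutoff of 85 and the names `cutoff` and `toppers` are assumptions made for illustration; they are not part of the original result sheet.

```python
import numpy as np
import pandas as pd

# Same Master DataFrame as in section 2.1
df = pd.DataFrame({
    'Name': ['Arjun', 'Bina', 'Chirag', 'Divya'],
    'Maths': [90, 85, 78, 92],
    'IP': [88, 95, np.nan, 90],
    'Section': ['A', 'A', 'B', 'B'],
})

cutoff = 85   # hypothetical cutoff, chosen only for this example
toppers = []  # names of students scoring above the cutoff

for index, row in df.iterrows():
    # `row` is a Series, so values are looked up by column name
    if row['Maths'] > cutoff:
        toppers.append(row['Name'])

print(toppers)
```

For large tables a vectorized filter such as `df[df['Maths'] > cutoff]` is usually preferable; the loop is shown only because it makes the row-by-row traversal visible.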
### 2.2.2 Vertical Iteration (`items()`)
Traverses the DataFrame **column by column**. Older notes and textbooks call this method `iteritems()`; that name was deprecated in pandas 1.5 and removed in pandas 2.0, so current code should use `items()`.
### Code Example
```python
# Printing each column name and its content
for col_name, col_data in df.items():
    print(f"--- Column: {col_name} ---")
    print(col_data.values)  # Using .values just to keep output short
```
### Output
```text
--- Column: Name ---
['Arjun' 'Bina' 'Chirag' 'Divya']
--- Column: Maths ---
[90 85 78 92]
--- Column: IP ---
[88. 95. nan 90.]
--- Column: Section ---
['A' 'A' 'B' 'B']
```
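Column-wise iteration is handy when the same check has to run on every column. The sketch below counts missing values per column; the dictionary name `missing_counts` is an assumption introduced for this example.

```python
import numpy as np
import pandas as pd

# Same Master DataFrame as in section 2.1
df = pd.DataFrame({
    'Name': ['Arjun', 'Bina', 'Chirag', 'Divya'],
    'Maths': [90, 85, 78, 92],
    'IP': [88, 95, np.nan, 90],
    'Section': ['A', 'A', 'B', 'B'],
})

missing_counts = {}  # column name -> number of NaN values

for col_name, col_data in df.items():
    # `col_data` is a Series holding one whole column
    missing_counts[col_name] = int(col_data.isnull().sum())

print(missing_counts)
```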
---
## 2.3 Binary Operations
Binary operations are arithmetic operations (add, subtract, multiply, divide) between two DataFrames, or between a Series and a DataFrame.
**Key Rule:** Pandas aligns data on **row index** and **column label** before operating. Any label present in only one operand produces `NaN` in the result.
### Code Example
```python
# Let's create two small DataFrames
df1 = pd.DataFrame({'A': [10, 20], 'B': [30, 40]}, index=[0, 1])
df2 = pd.DataFrame({'A': [5, 5], 'B': [5, 5]}, index=[1, 2])
print("DF1:\n", df1)
print("DF2:\n", df2)
# Adding them
print("\n--- Result of DF1 + DF2 ---")
print(df1 + df2)
```
### Output
```text
DF1:
    A   B
0  10  30
1  20  40
DF2:
   A  B
1  5  5
2  5  5

--- Result of DF1 + DF2 ---
      A     B
0   NaN   NaN   <-- Index 0 exists in DF1 but not DF2
1  25.0  45.0   <-- Index 1 exists in both (Matched!)
2   NaN   NaN   <-- Index 2 exists in DF2 but not DF1
```
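If the `NaN` rows are not wanted, the method form of the operator accepts a `fill_value`. The sketch below assumes that a missing row should be treated as 0, which is a choice made for illustration and may not suit every dataset.

```python
import pandas as pd

# Same two DataFrames as in the example above
df1 = pd.DataFrame({'A': [10, 20], 'B': [30, 40]}, index=[0, 1])
df2 = pd.DataFrame({'A': [5, 5], 'B': [5, 5]}, index=[1, 2])

# add() behaves like +, but a label missing from one side
# is replaced by fill_value before the addition
total = df1.add(df2, fill_value=0)
print(total)
```

Rows 0 and 2 now carry the values from whichever DataFrame had them, and row 1 is still the sum of both.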
---
## 2.4 Descriptive Statistics
These functions summarize numeric columns. *Note: aggregations such as `max()`, `sum()`, and `mean()` skip `NaN` values by default, and `count()` counts only non-missing values.*
### 2.4.1 Min, Max, Sum, Count
### Code Example
```python
print("Highest Marks (Max):\n", df[['Maths', 'IP']].max())
print("\nTotal Marks (Sum):\n", df[['Maths', 'IP']].sum())
print("\nCount of Values:\n", df[['Maths', 'IP']].count())
```
### Output
```text
Highest Marks (Max):
Maths    92.0
IP       95.0
dtype: float64

Total Marks (Sum):
Maths    345.0
IP       273.0
dtype: float64

Count of Values:
Maths    4
IP       3   <-- Note: Chirag's NaN was not counted
dtype: int64
```
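The heading also lists `min()`, and the note about skipping `NaN` can be checked directly. The sketch below shows the lowest marks and then turns the skipping off with `skipna=False`; the variable names are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Same Master DataFrame as in section 2.1
df = pd.DataFrame({
    'Name': ['Arjun', 'Bina', 'Chirag', 'Divya'],
    'Maths': [90, 85, 78, 92],
    'IP': [88, 95, np.nan, 90],
    'Section': ['A', 'A', 'B', 'B'],
})

marks = df[['Maths', 'IP']]

# Lowest mark in each subject (NaN is skipped by default)
lowest = marks.min()
print("Lowest Marks (Min):\n", lowest)

# With skipna=False, a single NaN makes the whole column total NaN
strict_total = marks.sum(skipna=False)
print("\nTotal without skipping NaN:\n", strict_total)
```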
### 2.4.2 Mean, Median, Mode
### Code Example
```python
print("Average (Mean):\n", df[['Maths', 'IP']].mean())
print("\nMedian (Middle Value):\n", df[['Maths', 'IP']].median())
```
### Output
```text
Average (Mean):
Maths    86.25
IP       91.00
dtype: float64

Median (Middle Value):
Maths    87.5
IP       90.0
dtype: float64
```
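The mode (most frequent value) is named in the heading but not shown above. Every mark in the Master DataFrame is unique, so the sketch below also uses a small hypothetical Series with one repeated mark to make the result easier to read.

```python
import pandas as pd

# Hypothetical marks with a repeated value, used only for this example
scores = pd.Series([90, 85, 90, 78])
modes = scores.mode()  # mode() returns a Series, since ties are possible
print(modes)

# Maths marks from the Master DataFrame: all values are unique,
# so every value is returned as a mode (in sorted order)
maths = pd.Series([90, 85, 78, 92])
maths_modes = maths.mode()
print(maths_modes)
```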
### 2.4.3 The `describe()` Function
The "Swiss Army Knife" of statistics. In one call it reports the count, mean, standard deviation, minimum, quartiles, and maximum of every numeric column.
### Code Example
```python
print(df.describe())
```
### Output
```text
           Maths         IP
count   4.000000   3.000000   <-- Count
mean   86.250000  91.000000   <-- Average
std     6.238322   3.605551   <-- Standard Deviation
min    78.000000  88.000000   <-- Minimum
25%    83.250000  89.000000   <-- 25th Percentile
50%    87.500000  90.000000   <-- Median (50%)
75%    90.500000  92.500000   <-- 75th Percentile
max    92.000000  95.000000   <-- Maximum
```
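By default `describe()` reports only the numeric columns. As a hedged sketch, passing `include='all'` also summarizes the text columns (`Name`, `Section`) with rows such as `unique` and `freq`; the variable name `summary` is an assumption for this example.

```python
import numpy as np
import pandas as pd

# Same Master DataFrame as in section 2.1
df = pd.DataFrame({
    'Name': ['Arjun', 'Bina', 'Chirag', 'Divya'],
    'Maths': [90, 85, 78, 92],
    'IP': [88, 95, np.nan, 90],
    'Section': ['A', 'A', 'B', 'B'],
})

# include='all' adds the non-numeric columns to the summary;
# statistics that do not apply to a column are shown as NaN
summary = df.describe(include='all')
print(summary)
```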
---
## 2.5 Essential Functions
### 2.5.1 Inspection (`info`, `head`, `tail`)
### Code Example
```python
# Shows structure of the DataFrame
df.info()
print("\n--- Top 2 Rows (head) ---")
print(df.head(2))
```
### Output
```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Name     4 non-null      object
 1   Maths    4 non-null      int64
 2   IP       3 non-null      float64
 3   Section  4 non-null      object
dtypes: float64(1), int64(1), object(2)
memory usage: 256.0+ bytes

--- Top 2 Rows (head) ---
    Name  Maths    IP Section
0  Arjun     90  88.0       A
1   Bina     85  95.0       A
```
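`tail()` is listed in the heading and works like `head()` from the other end. A minimal sketch on the same data (the variable name `last_two` is an assumption):

```python
import numpy as np
import pandas as pd

# Same Master DataFrame as in section 2.1
df = pd.DataFrame({
    'Name': ['Arjun', 'Bina', 'Chirag', 'Divya'],
    'Maths': [90, 85, 78, 92],
    'IP': [88, 95, np.nan, 90],
    'Section': ['A', 'A', 'B', 'B'],
})

# Bottom 2 rows; with no argument, head() and tail() return 5 rows
last_two = df.tail(2)
print(last_two)
```

The original index labels (2 and 3) are kept, so the rows can still be traced back to the full DataFrame.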
---
## 2.6 Handling Missing Data
In our data, **Chirag** has `NaN` in IP. Let's fix it.
### 2.6.1 Detecting Missing Data
### Code Example
```python
print(df.isnull()) # True if data is missing
```
### Output
```text
    Name  Maths     IP  Section
0  False  False  False    False
1  False  False  False    False
2  False  False   True    False   <-- Look at index 2 (IP)
3  False  False  False    False
```
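On a larger table the True/False grid is hard to scan, so a common follow-up is to count the missing values per column. The sketch below chains `sum()` onto `isnull()`; the variable names are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Same Master DataFrame as in section 2.1
df = pd.DataFrame({
    'Name': ['Arjun', 'Bina', 'Chirag', 'Divya'],
    'Maths': [90, 85, 78, 92],
    'IP': [88, 95, np.nan, 90],
    'Section': ['A', 'A', 'B', 'B'],
})

# True counts as 1 and False as 0, so sum() gives the number of NaNs per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Total number of missing cells in the whole DataFrame
total_missing = int(missing_per_column.sum())
print("Total missing:", total_missing)
```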
### 2.6.2 Dropping & Filling
### Code Example
```python
# Strategy 1: Delete rows with missing data
print("--- After Dropna ---")
print(df.dropna())
# Strategy 2: Fill missing data with 0
print("\n--- After Fillna(0) ---")
print(df.fillna(0))
```
### Output
```text
--- After Dropna ---
     Name  Maths    IP Section
0   Arjun     90  88.0       A
1    Bina     85  95.0       A
3   Divya     92  90.0       B   <-- Chirag is gone!

--- After Fillna(0) ---
     Name  Maths    IP Section
0   Arjun     90  88.0       A
1    Bina     85  95.0       A
2  Chirag     78   0.0       B   <-- Chirag is back with 0 marks
3   Divya     92  90.0       B
```
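Filling with 0 treats an absent student as having scored zero, which pulls the class average down. One alternative, sketched here as an assumption rather than a recommended rule, is to fill with the column mean. Note that `dropna()` and `fillna()` return new objects and leave `df` itself unchanged.

```python
import numpy as np
import pandas as pd

# Same Master DataFrame as in section 2.1
df = pd.DataFrame({
    'Name': ['Arjun', 'Bina', 'Chirag', 'Divya'],
    'Maths': [90, 85, 78, 92],
    'IP': [88, 95, np.nan, 90],
    'Section': ['A', 'A', 'B', 'B'],
})

# Mean of the IP marks that are present: (88 + 95 + 90) / 3
ip_mean = df['IP'].mean()

# Build a new DataFrame whose IP column has NaN replaced by the mean
filled = df.assign(IP=df['IP'].fillna(ip_mean))
print(filled)

# The original DataFrame still contains the NaN
print(df['IP'].isnull().sum())
```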
---
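## 2.7 Function `groupby()`
This is used to split data into groups. Let's group our students by **Section**.
### Code Example
```python
# Group by 'Section' and find the average marks for each section.
# Select the numeric columns first: 'Name' is text and cannot be averaged
# (pandas 2.0 and later raise an error instead of dropping it silently).
grouped_df = df.groupby('Section')[['Maths', 'IP']].mean()
print(grouped_df)
```
### Output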
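```text
         Maths    IP
Section
A         87.5  91.5
B         85.0  90.0
```
* **Explanation (Maths column):**
  * **Section A:** (Arjun 90 + Bina 85)/2 = **87.5**
  * **Section B:** (Chirag 78 + Divya 92)/2 = **85.0**
* In the IP column, Section B shows 90.0 because Chirag's `NaN` is skipped and only Divya's mark is averaged.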
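The "apply" step is not limited to one function. As a sketch, `agg()` can compute several statistics per group at once; the choice of `min`, `max`, and `mean` and the name `summary` are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Same Master DataFrame as in section 2.1
df = pd.DataFrame({
    'Name': ['Arjun', 'Bina', 'Chirag', 'Divya'],
    'Maths': [90, 85, 78, 92],
    'IP': [88, 95, np.nan, 90],
    'Section': ['A', 'A', 'B', 'B'],
})

# Split by Section, apply three functions to the Maths marks, combine into one table
summary = df.groupby('Section')['Maths'].agg(['min', 'max', 'mean'])
print(summary)
```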
---
## Pro Tips for the Exam
1. **`iterrows()` vs `items()`:** Remember `rows` = horizontal (row by row, slow on large data) and `items` = vertical (column by column, faster). Older material calls the second one `iteritems()`, which was removed in pandas 2.0.
2. **Describe:** If asked "Which function gives the summary of the dataset?", the answer is `describe()`.
3. **Axis:**
   * `axis=0` means calculate **down** the column (default).
   * `axis=1` means calculate **across** the row.
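The axis rule is easiest to remember by running the same function both ways. A minimal sketch on the marks columns (variable names are assumptions for this example):

```python
import numpy as np
import pandas as pd

# Same Master DataFrame as in section 2.1
df = pd.DataFrame({
    'Name': ['Arjun', 'Bina', 'Chirag', 'Divya'],
    'Maths': [90, 85, 78, 92],
    'IP': [88, 95, np.nan, 90],
    'Section': ['A', 'A', 'B', 'B'],
})

marks = df[['Maths', 'IP']]

# axis=0 (default): one total per column, i.e. per subject
column_totals = marks.sum(axis=0)
print(column_totals)

# axis=1: one total per row, i.e. per student (Chirag's NaN is skipped)
row_totals = marks.sum(axis=1)
print(row_totals)
```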
---