Data Analysis with pandas for Finance and Accounting
Welcome to post #12 in our Python journey! In the previous post, we explored NumPy and its powerful numerical capabilities. Now we’re taking a step up to pandas, which builds on NumPy’s foundation to provide specialized tools for working with tabular and time series data—exactly the kind of data we deal with daily in finance and accounting.
As a finance professional, I’ve found pandas to be one of the most valuable tools in my Python toolkit. It’s transformed how I handle everything from financial statements to transaction records. Let’s dive in!
Why pandas for Finance?
pandas is specifically designed for data analysis and manipulation, with particular strengths in:
- Working with tabular data (like spreadsheets, CSV files, and SQL tables)
- Handling time series (perfect for financial data)
- Cleaning messy data (a constant challenge in real-world accounting)
- Joining and merging datasets (combining data from multiple sources)
- Advanced grouping and aggregation (similar to pivot tables in Excel)
If you’ve spent hours manipulating data in Excel, pandas will feel familiar yet far more powerful: think Excel on steroids, with the automation capabilities of Python.
Installing pandas
Let’s start by installing pandas:
pip install pandas
If you’re using Anaconda, pandas is already included in the distribution.
Let’s verify the installation:
import pandas as pd
print(pd.__version__)
You should see the version number displayed (for example, 2.0.3). By convention, pandas is imported with the alias pd to make your code more readable.
The Core pandas Data Structures
pandas provides two primary data structures:
- Series: A one-dimensional labeled array (like a column in a spreadsheet)
- DataFrame: A two-dimensional labeled data structure with columns (like a spreadsheet or SQL table)
Let’s explore both with financial examples.
Series: One-Dimensional Data
A Series is essentially a column of data with labeled indices:
import pandas as pd
import numpy as np
# Create a Series of monthly expenses
monthly_expenses = pd.Series([1200, 1500, 1100, 1800],
index=['Rent', 'Payroll', 'Utilities', 'Inventory'])
print(monthly_expenses)
# Output:
# Rent 1200
# Payroll 1500
# Utilities 1100
# Inventory 1800
# dtype: int64
# Access by label
print(f"Monthly rent: ${monthly_expenses['Rent']}") # Output: Monthly rent: $1200
# Access by position
print(f"Second highest expense: ${monthly_expenses.iloc[1]}") # Output: Second highest expense: $1500
# Perform calculations
print(f"Total monthly expenses: ${monthly_expenses.sum()}") # Output: Total monthly expenses: $5600
print(f"Average expense: ${monthly_expenses.mean()}") # Output: Average expense: $1400
# Filter expenses greater than 1200
high_expenses = monthly_expenses[monthly_expenses > 1200]
print(high_expenses)
# Output:
# Payroll 1500
# Inventory 1800
# dtype: int64
Series are useful for representing:
- A single financial metric across multiple companies
- Monthly or yearly values for a specific account
- Daily stock prices
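For the last case, a Series with a DatetimeIndex is a natural fit. Here's a quick sketch with made-up closing prices for an illustrative ticker (the dates and values are invented purely for demonstration):
import pandas as pd
# Hypothetical daily closing prices, indexed by trading date
closing_prices = pd.Series(
    [101.25, 102.10, 100.80, 103.45, 104.00],
    index=pd.to_datetime(['2023-01-02', '2023-01-03', '2023-01-04',
                          '2023-01-05', '2023-01-06']),
    name='Close')
# Daily percentage change, a common first step in price analysis
print(closing_prices.pct_change())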
DataFrame: Two-Dimensional Data
DataFrames are where pandas really shines. Think of them as an Excel spreadsheet within Python:
# Create a DataFrame representing a simplified income statement
income_statement = pd.DataFrame({
'Q1': [100000, 45000, 15000, 40000],
'Q2': [110000, 47000, 15000, 48000],
'Q3': [95000, 42000, 16000, 37000],
'Q4': [125000, 52000, 17000, 56000]
}, index=['Revenue', 'COGS', 'Operating Expenses', 'Net Income'])
print(income_statement)
# Output:
# Q1 Q2 Q3 Q4
# Revenue 100000 110000 95000 125000
# COGS 45000 47000 42000 52000
# Operating Expenses 15000 15000 16000 17000
# Net Income 40000 48000 37000 56000
# Or create the same DataFrame with a different orientation
# (More commonly used format with rows as observations and columns as variables)
income_statement_alt = pd.DataFrame({
'Quarter': ['Q1', 'Q2', 'Q3', 'Q4'],
'Revenue': [100000, 110000, 95000, 125000],
'COGS': [45000, 47000, 42000, 52000],
'Operating_Expenses': [15000, 15000, 16000, 17000],
'Net_Income': [40000, 48000, 37000, 56000]
})
print(income_statement_alt)
# Output:
# Quarter Revenue COGS Operating_Expenses Net_Income
# 0 Q1 100000 45000 15000 40000
# 1 Q2 110000 47000 15000 48000
# 2 Q3 95000 42000 16000 37000
# 3 Q4 125000 52000 17000 56000
The orientation you choose often depends on your data and analysis goals. For financial data:
- Use accounts as the index and periods as the columns when you want output that reads like a financial statement (the first example above)
- Use rows for time periods (observations) and columns for accounts (variables) when you plan to do further analysis, since most pandas operations expect this layout (the second example above)
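If you need to switch between these two orientations, a transpose or a quick set_index is usually all it takes. Here's a small sketch using the two DataFrames defined above:
# Turn the statement-style DataFrame into row-per-quarter form
statement_by_quarter = income_statement.T
print(statement_by_quarter)
# Turn the row-per-quarter DataFrame back into statement form
statement_form = income_statement_alt.set_index('Quarter').T
print(statement_form)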
Reading Data from External Sources
In real-world finance, data often comes from external files. pandas excels at importing from various sources:
Reading from CSV
CSVs are common for exporting financial data:
# Read a transaction register CSV file
# transactions.csv contains: Date,Description,Category,Amount
transactions = pd.read_csv('transactions.csv')
# Preview the first 5 rows
print(transactions.head())
# Output:
# Date Description Category Amount
# 0 2023-01-05 Office Supplies Expense -89.99
# 1 2023-01-07 Client Payment - ABC Inc Income 1250.00
# 2 2023-01-10 Payroll Expense -4500.00
# 3 2023-01-15 Software License Expense -199.99
# 4 2023-01-18 Bank Interest Income 2.13
Reading from Excel
Excel files are ubiquitous in finance and accounting:
# Install openpyxl first if you haven't:
# pip install openpyxl
# Read an Excel file, specifying the sheet
financial_data = pd.read_excel('financial_model.xlsx', sheet_name='Income Statement')
# Preview the data
print(financial_data.head())
Connecting to Databases
For more enterprise-level finance applications:
# pip install sqlalchemy
# pip install pymysql (or other database driver)
from sqlalchemy import create_engine
# Connect to database (example with MySQL)
engine = create_engine('mysql+pymysql://username:password@localhost/finance_db')
# Read data directly from SQL query
sql_query = "SELECT * FROM general_ledger WHERE account_type = 'Asset' AND transaction_date > '2023-01-01'"
gl_data = pd.read_sql(sql_query, engine)
print(gl_data.head())
Inspecting and Exploring Your Data
When you receive a new financial dataset, your first step is usually to understand its structure:
# Assuming we've loaded our transactions data
print(f"Data shape (rows, columns): {transactions.shape}")
print("\nColumn names:")
print(transactions.columns.tolist())
print("\nData types:")
print(transactions.dtypes)
print("\nSummary statistics:")
print(transactions.describe())
print("\nMissing values:")
print(transactions.isnull().sum())
# Check for duplicates
print(f"\nDuplicate rows: {transactions.duplicated().sum()}")
Data Selection and Filtering
pandas offers powerful ways to select and filter data:
# Select specific columns
amounts_and_categories = transactions[['Category', 'Amount']]
print(amounts_and_categories.head())
# Filter rows based on conditions
expenses = transactions[transactions['Amount'] < 0]
income = transactions[transactions['Amount'] > 0]
print(f"Total expenses: ${expenses['Amount'].sum():.2f}")
print(f"Total income: ${income['Amount'].sum():.2f}")
# Filter with multiple conditions
# Find large office expenses
large_office_expenses = transactions[
(transactions['Category'] == 'Expense') &
(transactions['Description'].str.contains('Office')) &
(transactions['Amount'] < -100)
]
print(large_office_expenses)
# Select data from a specific date range
# First, ensure Date is a datetime type
transactions['Date'] = pd.to_datetime(transactions['Date'])
# Filter for Q1 data
q1_data = transactions[(transactions['Date'] >= '2023-01-01') &
(transactions['Date'] <= '2023-03-31')]
print(f"Q1 transaction count: {len(q1_data)}")
Handling Missing Data
Missing data is a common challenge in financial datasets:
# Check for missing values
print(transactions.isnull().sum())
# Fill missing values in Amount column with 0
transactions['Amount'] = transactions['Amount'].fillna(0)
# Fill missing categories with 'Uncategorized'
transactions['Category'] = transactions['Category'].fillna('Uncategorized')
# Drop rows with any remaining missing values
transactions_clean = transactions.dropna()
# Alternative: Only drop rows where specific columns are missing
transactions_essential = transactions.dropna(subset=['Date', 'Amount'])
Data Transformation and Feature Engineering
Often, you’ll need to create new calculated fields for financial analysis:
# Add a Month column for easier grouping
transactions['Month'] = transactions['Date'].dt.strftime('%Y-%m')
# Add an Absolute Amount column for analysis
transactions['Abs_Amount'] = transactions['Amount'].abs()
# Categorize transactions by size
def categorize_amount(amount):
if abs(amount) < 100:
return 'Small'
elif abs(amount) < 1000:
return 'Medium'
else:
return 'Large'
transactions['Size_Category'] = transactions['Amount'].apply(categorize_amount)
# Create a new column indicating if it's end of quarter
transactions['Is_Quarter_End'] = transactions['Date'].dt.is_quarter_end
# Add a calculated running balance
transactions = transactions.sort_values('Date')
transactions['Running_Balance'] = transactions['Amount'].cumsum()
print(transactions.head())
Grouping and Aggregation: The Heart of Financial Analysis
Grouping operations are similar to Excel’s pivot tables and are perfect for financial reporting:
# Group by Category and calculate sum, count, and average
category_summary = transactions.groupby('Category').agg({
'Amount': ['sum', 'count', 'mean'],
'Description': 'count' # count of transactions
})
print(category_summary)
# Group by Month and Category to see spending trends
monthly_by_category = transactions.groupby(['Month', 'Category'])['Amount'].sum().unstack()
print(monthly_by_category)
# Calculate monthly totals
monthly_totals = transactions.groupby('Month')['Amount'].sum()
print(monthly_totals)
# Find the month with the highest expenses
monthly_expenses = transactions[transactions['Amount'] < 0].groupby('Month')['Amount'].sum()
highest_expense_month = monthly_expenses.idxmin() # min because expenses are negative
print(f"Month with highest expenses: {highest_expense_month}, Amount: ${monthly_expenses.min():.2f}")
Advanced Financial Calculations
Let’s implement some practical financial calculations:
# Monthly Profit and Loss Statement
monthly_pl = transactions.pivot_table(
index='Month',
columns='Category',
values='Amount',
aggfunc='sum'
).fillna(0)
# Add a Profit column
monthly_pl['Profit'] = monthly_pl['Income'] + monthly_pl['Expense'] # Adding because expenses are negative
print(monthly_pl)
# Calculate running cash balance by date
daily_net = transactions.groupby('Date')['Amount'].sum().reset_index()
daily_net['Running_Balance'] = daily_net['Amount'].cumsum()
print(daily_net.head())
# Calculate 30-day moving average of daily transaction amounts
transactions_by_date = transactions.groupby('Date')['Amount'].sum()
moving_avg_30d = transactions_by_date.rolling(window='30D').mean()  # '30D' gives a true 30-calendar-day window even if some days have no transactions
print(moving_avg_30d.tail())
Merging and Joining Datasets
In finance, we often need to combine data from multiple sources:
# Create a sample budget DataFrame
budget = pd.DataFrame({
'Category': ['Rent', 'Utilities', 'Payroll', 'Marketing', 'Software'],
'Budgeted_Amount': [1200, 300, 5000, 1000, 500]
})
# Group actual expenses by category
actual_expenses = transactions[transactions['Amount'] < 0].groupby('Category')['Amount'].sum().abs().reset_index()
actual_expenses.columns = ['Category', 'Actual_Amount']
# Merge the budget with actual expenses
budget_vs_actual = pd.merge(budget, actual_expenses, on='Category', how='outer')
# Calculate variance (negative means over budget)
budget_vs_actual['Variance'] = budget_vs_actual['Budgeted_Amount'] - budget_vs_actual['Actual_Amount']
budget_vs_actual['Variance_Percent'] = (budget_vs_actual['Variance'] / budget_vs_actual['Budgeted_Amount']) * 100
print(budget_vs_actual)
# Find categories that are over budget
over_budget = budget_vs_actual[budget_vs_actual['Variance'] < 0]
print("\nCategories over budget:")
print(over_budget)
Time Series Analysis: Perfect for Financial Data
pandas has exceptional support for time series data:
# Convert Date to datetime if not already
transactions['Date'] = pd.to_datetime(transactions['Date'])
# Set Date as index for time series analysis
ts_data = transactions.set_index('Date')
# Resample to get monthly totals
monthly_data = ts_data['Amount'].resample('M').sum()
print(monthly_data)
# Compute rolling average (e.g., 3-month rolling revenue)
income_only = ts_data[ts_data['Amount'] > 0]
rolling_3m_revenue = income_only['Amount'].resample('M').sum().rolling(window=3).mean()
print(rolling_3m_revenue)
# Calculate percent change period-over-period
monthly_growth = monthly_data.pct_change() * 100
print(f"Monthly growth rates (%):\n{monthly_growth}")
# Seasonal decomposition (requires statsmodels: pip install statsmodels)
from statsmodels.tsa.seasonal import seasonal_decompose
# Need 2+ years of data for proper decomposition, but here's the concept:
# Assuming we have enough data:
# decomposition = seasonal_decompose(monthly_data, model='additive', period=12)
# fig = decomposition.plot()
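If you'd like to see the concept end to end, here's a self-contained sketch on synthetic monthly revenue. The numbers are invented purely to illustrate the API; in practice you would pass your own monthly series with at least two full years of history:
# pip install statsmodels
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
# Synthetic monthly revenue: an upward trend plus a repeating 12-month pattern
months = pd.date_range('2021-01-31', periods=36, freq='M')
trend = np.linspace(100000, 130000, 36)
seasonality = 10000 * np.sin(np.arange(36) * 2 * np.pi / 12)
synthetic_revenue = pd.Series(trend + seasonality, index=months)
# Split the series into trend, seasonal, and residual components
decomposition = seasonal_decompose(synthetic_revenue, model='additive', period=12)
print(decomposition.trend.dropna().head())
print(decomposition.seasonal.head(12))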
Handling Excel-Like Functionality with pandas
If you’re transitioning from Excel, pandas offers similar functionality with more power:
# VLOOKUP equivalent
product_info = pd.DataFrame({
'Product_ID': ['A001', 'B002', 'C003', 'D004'],
'Product_Name': ['Laptop', 'Monitor', 'Keyboard', 'Mouse'],
'Unit_Cost': [1200, 300, 80, 25]
})
sales = pd.DataFrame({
'Date': ['2023-01-15', '2023-01-20', '2023-01-22', '2023-01-25'],
'Product_ID': ['A001', 'C003', 'A001', 'B002'],
'Quantity': [2, 5, 1, 3]
})
# Merge sales with product info (like VLOOKUP)
sales_with_info = pd.merge(sales, product_info, on='Product_ID')
# Calculate total sales amount
sales_with_info['Total_Cost'] = sales_with_info['Quantity'] * sales_with_info['Unit_Cost']
print(sales_with_info)
# Pivot tables
pivot_sales = sales_with_info.pivot_table(
index='Product_Name',
values=['Quantity', 'Total_Cost'],
aggfunc={'Quantity': 'sum', 'Total_Cost': 'sum'}
)
print(pivot_sales)
# Sort by total cost (descending)
print(pivot_sales.sort_values('Total_Cost', ascending=False))
Exporting and Saving Your Analysis
Once your analysis is complete, you’ll often need to export the results:
# Export as CSV
budget_vs_actual.to_csv('budget_analysis.csv', index=False)
# Export to Excel
with pd.ExcelWriter('financial_analysis.xlsx') as writer:
budget_vs_actual.to_excel(writer, sheet_name='Budget Analysis', index=False)
pivot_sales.to_excel(writer, sheet_name='Sales by Product')
monthly_pl.to_excel(writer, sheet_name='Monthly P&L')
# Export specific results to JSON (useful for web applications)
over_budget.to_json('over_budget_alert.json', orient='records')
Case Study: Comprehensive Financial Analysis
Let’s pull everything together in a finance-focused case study:
# 1. Load and prepare data
transactions = pd.read_csv('transactions.csv')
transactions['Date'] = pd.to_datetime(transactions['Date'])
transactions['Month'] = transactions['Date'].dt.strftime('%Y-%m')
transactions['Quarter'] = transactions['Date'].dt.to_period('Q').astype(str)
# 2. Clean data
transactions['Category'] = transactions['Category'].fillna('Uncategorized')
transactions['Amount'] = transactions['Amount'].fillna(0)
# 3. Create income and expense dataframes
# Use .copy() so we can safely add columns without a SettingWithCopyWarning
income = transactions[transactions['Amount'] > 0].copy()
expenses = transactions[transactions['Amount'] < 0].copy()
expenses['Abs_Amount'] = expenses['Amount'].abs() # Make expenses positive for easier calculations
# 4. Quarterly Financial Reports
quarterly_summary = pd.DataFrame({
'Total_Income': income.groupby('Quarter')['Amount'].sum(),
'Total_Expenses': expenses.groupby('Quarter')['Abs_Amount'].sum(),
})
quarterly_summary['Net_Profit'] = quarterly_summary['Total_Income'] - quarterly_summary['Total_Expenses']
quarterly_summary['Profit_Margin'] = quarterly_summary['Net_Profit'] / quarterly_summary['Total_Income'] * 100
print("Quarterly Financial Summary:")
print(quarterly_summary)
# 5. Top 5 expense categories
top_expenses = expenses.groupby('Category')['Abs_Amount'].sum().sort_values(ascending=False).head(5)
print("\nTop 5 Expense Categories:")
print(top_expenses)
# 6. Monthly trend analysis
monthly_net = pd.DataFrame({
'Income': income.groupby('Month')['Amount'].sum(),
'Expenses': expenses.groupby('Month')['Abs_Amount'].sum() * -1 # Make expenses negative again
})
monthly_net['Net'] = monthly_net['Income'] + monthly_net['Expenses'] # Net will be positive or negative
monthly_net['3M_Rolling_Avg'] = monthly_net['Net'].rolling(window=3).mean()
print("\nMonthly Trend Analysis:")
print(monthly_net)
# 7. Cash flow analysis
transactions = transactions.sort_values('Date')
transactions['Running_Balance'] = transactions['Amount'].cumsum()
# Find any days where cash balance went negative
cash_issues = transactions[transactions['Running_Balance'] < 0]
if not cash_issues.empty:
print("\nCash Flow Alert - Negative Balance Detected:")
print(cash_issues[['Date', 'Running_Balance']].head())
# 8. Year-over-year comparison (assuming we have multiple years of data)
income['Year'] = income['Date'].dt.year          # income was created before a Year column existed, so add it here
income['Quarter_Num'] = income['Date'].dt.quarter  # 1-4, so quarters line up across years
yoy_income = income.groupby(['Year', 'Quarter_Num'])['Amount'].sum().unstack()
print("\nYear-over-Year Quarterly Income:")
print(yoy_income)
# 9. Export key insights to Excel
with pd.ExcelWriter('financial_insights.xlsx') as writer:
quarterly_summary.to_excel(writer, sheet_name='Quarterly Summary')
top_expenses.to_excel(writer, sheet_name='Top Expenses')
monthly_net.to_excel(writer, sheet_name='Monthly Trends')
print("\nFinancial analysis complete! Results exported to 'financial_insights.xlsx'")
Key Takeaways from pandas for Finance Professionals
For finance and accounting professionals, pandas offers several key advantages:
- Automation: Automate repetitive data tasks that would take hours in Excel
- Scale: Handle much larger datasets than Excel can manage
- Reproducibility: Document your analysis as code for future reference or audit
- Flexibility: Customize your analysis beyond what Excel templates allow
- Integration: Connect directly to databases, APIs, and other data sources
- Analysis Power: Perform complex statistical operations with minimal code
Next Steps
Now that you understand the basics of pandas for financial data analysis, you’re ready to move on to data visualization in our next post. We’ll explore how to transform these financial insights into compelling visualizations using Matplotlib and Seaborn!
Exercise: Financial Data Analysis
Before you go, try this exercise to practice your pandas skills:
You have a CSV file of transactions with columns for Date, Description, Category, and Amount. Using pandas:
- Calculate the monthly income, expenses, and net profit
- Find the top 3 expense categories by total amount
- Create a month-over-month percentage change analysis of your net profit
- Identify any expense categories that show an increasing trend over time
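If you'd like a starting point, here's a minimal scaffold that only loads and prepares the data (it assumes your file follows the same Date, Description, Category, Amount layout used throughout this post); the analysis itself is left to you:
import pandas as pd
# Load the register and parse dates up front
transactions = pd.read_csv('transactions.csv', parse_dates=['Date'])
transactions['Month'] = transactions['Date'].dt.to_period('M').astype(str)
# Split income from expenses; the rest of the exercise builds on these
income = transactions[transactions['Amount'] > 0].copy()
expenses = transactions[transactions['Amount'] < 0].copy()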
Good luck, and feel free to share your solutions in the comments!
This post is part of my journey learning Python. I’m a chartered accountant exploring programming to enhance my financial analysis toolkit. Follow along as we discover together how Python can transform our work in finance and accounting!