Credit Card Fraud Detection
A comprehensive machine learning approach to identifying fraudulent transactions in real-time payment systems
Project Agenda
01
Problem & Impact
Understanding fraud challenges and business impact
02
Goals & KPIs
Defining success metrics and evaluation criteria
03
Dataset & Schema
Exploring data structure and key features
04
EDA Analysis
Class distribution, amounts, categories, and correlations
05
Preprocessing & Features
Data preparation and feature engineering
06
Handling Imbalance
Addressing class distribution challenges
07
Models & Training
Machine learning model development
08
Evaluation Metrics
Performance assessment and validation
09
Results & Visuals
Model performance and key findings
10
Conclusion & Next Steps
Summary and future recommendations
Problem & Impact
The Challenge
Credit card fraud represents a significant threat to financial institutions and consumers worldwide. Fraudulent transactions cause direct monetary losses, costly chargebacks, and damage to customer trust.
The primary technical challenge lies in the extreme class imbalance: legitimate transactions vastly outnumber fraudulent ones, making detection particularly difficult.
Key Statistics
0.52%
Fraud Rate
Only 0.52% of all transactions are fraudulent
This severe imbalance means traditional machine learning approaches often fail to detect fraud effectively, as models tend to optimize for overall accuracy rather than fraud detection.
Goals & KPIs
Business Goals
  • Detect fraud transactions early in the payment process
  • Minimize financial losses from fraudulent activities
  • Reduce chargeback costs and operational overhead
  • Maintain customer trust and satisfaction
Academic Objectives
  • Develop robust machine learning models for imbalanced datasets
  • Apply advanced feature engineering techniques
  • Compare multiple algorithms and evaluation strategies
  • Demonstrate practical application of data science methods
Key Performance Indicators
Recall
Percentage of actual fraud cases correctly identified. Critical for minimizing false negatives.
F1 Score
Harmonic mean of precision and recall, providing balanced performance assessment.
ROC-AUC
Area under the receiver operating characteristic curve, measuring model discrimination ability.

Important Note: The optimal threshold will be chosen based on business cost considerations, balancing false negatives (missed fraud) versus false positives (legitimate transactions flagged).
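The cost-based threshold choice above can be sketched in a few lines. This is a minimal illustration with made-up per-error costs and toy scores, not figures from the project:

```python
# Sketch: pick the decision threshold that minimizes expected business cost.
# The two costs below are illustrative assumptions, not project numbers.
COST_FN = 100.0  # missed fraud (average loss per fraudulent transaction)
COST_FP = 5.0    # false alarm (investigation effort, customer friction)

def expected_cost(scores, labels, threshold):
    """Total cost when scores >= threshold are flagged as fraud."""
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return COST_FN * fn + COST_FP * fp

def best_threshold(scores, labels, candidates):
    return min(candidates, key=lambda t: expected_cost(scores, labels, t))

# Toy scored transactions: model score and true label per transaction
scores = [0.05, 0.10, 0.40, 0.65, 0.90, 0.95]
labels = [0,    0,    0,    1,    1,    1]
t = best_threshold(scores, labels, [i / 100 for i in range(1, 100)])
```

Raising `COST_FN` relative to `COST_FP` pushes the chosen threshold down, flagging more transactions; the ratio encodes the business trade-off directly.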
Dataset Overview
Dataset Specifications
Our analysis utilizes a comprehensive fraud detection dataset specifically designed for machine learning research and practical applications.
Data Source
Fraud-detection dataset
Raw Dataset Size
1,852,394 rows × 23 columns
Processed Features
Approximately 31 engineered features for modeling
Fraud Cases
9,651 fraudulent transactions
0.52%
Fraud Rate
Severe class imbalance challenge
Data Schema & Key Features
Understanding the structure and composition of our fraud detection dataset is crucial for effective model development.
Transaction Amount
amt — The monetary value of each transaction, a critical feature for fraud detection as fraudulent transactions often involve unusual amounts.
Fraud Label
is_fraud — Binary target variable (0/1) indicating whether a transaction is fraudulent or legitimate.
Temporal Features
hour, day_of_week — Time-based features capturing transaction timing patterns that may indicate fraudulent behavior.
Geographic Data
merch_lat, merch_long, lat, long — Merchant and customer location coordinates for geographic analysis.
Categorical Features
category, gender — Transaction category and customer demographic information for pattern recognition.
Class Distribution Analysis
Severe Class Imbalance Challenge
The dataset exhibits a significant class imbalance that presents the core challenge for our fraud detection system. This imbalance is typical in real-world fraud detection scenarios.
0.52%
Fraudulent
≈ 9,651 fraud cases
99.48%
Legitimate
≈ 1,842,743 normal transactions
Impact on Model Training
  • Standard algorithms may achieve 99%+ accuracy by simply predicting "not fraud"
  • Requires specialized techniques for imbalanced learning
  • Evaluation metrics must focus on minority class performance
  • Sampling strategies become critical for effective training
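The first bullet, the so-called accuracy paradox, is easy to verify with the dataset's own counts:

```python
# Why accuracy misleads at a 0.52% fraud rate: a "classifier" that always
# predicts "not fraud" looks excellent on accuracy yet catches nothing.
n_total = 1_852_394
n_fraud = 9_651
n_legit = n_total - n_fraud

accuracy = n_legit / n_total  # ~0.9948, despite zero fraud detected
recall = 0 / n_fraud          # 0.0 on the class that actually matters
```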
Transaction Amount Distribution
Fraud Transactions Show Distinct Amount Patterns
Analysis of transaction amounts reveals significant differences between fraudulent and legitimate transactions, providing valuable insights for our detection model.
Legitimate Transactions
Mean Amount: ≈ $67.65
  • Lower average transaction values
  • More consistent spending patterns
  • Fewer extreme outliers
  • Normal distribution characteristics
Fraudulent Transactions
Mean Amount: ≈ $530.66
  • Significantly higher average values
  • More extreme transaction amounts
  • Greater variance in spending
  • Presence of high-value outliers
Fraudulent transactions demonstrate a 7.8x higher mean amount compared to legitimate transactions, indicating that fraudsters often target high-value transactions to maximize their illicit gains.
Category & Demographics Analysis
Fraud Risk Varies Significantly Across Transaction Categories
Different merchant categories exhibit varying levels of fraud risk, providing important features for our detection model.
1.59%
shopping_net
Highest fraud rate category
1.30%
misc_net
Second highest risk category
1.26%
grocery_pos
Third highest fraud rate
Key Insights
  • Online shopping categories show elevated fraud rates
  • Point-of-sale transactions generally have lower risk
  • Miscellaneous online categories require special attention
  • Category-based risk scoring improves detection accuracy
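Category-based risk scoring amounts to target-encoding each category with its historical fraud rate. A minimal sketch on toy records (not project data):

```python
from collections import defaultdict

def category_risk(records):
    """records: iterable of (category, is_fraud) pairs -> {category: fraud_rate}."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for cat, is_fraud in records:
        totals[cat] += 1
        frauds[cat] += is_fraud
    return {cat: frauds[cat] / totals[cat] for cat in totals}

# Toy transactions, illustrative only
toy = [("shopping_net", 1), ("shopping_net", 0), ("grocery_pos", 0),
       ("grocery_pos", 0), ("grocery_pos", 1), ("misc_net", 0)]
risk = category_risk(toy)  # e.g. risk["shopping_net"] == 0.5
```

In practice the rates should be computed on the training fold only and merged into the test fold, so the encoding does not leak label information.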
Demographic Patterns
Additional analysis reveals patterns in gender and age demographics that correlate with fraudulent activity, though these relationships are more subtle than category-based patterns.
Correlation Analysis of Numerical Features
Below is a visual representation of the correlation matrix for our numerical features, providing a quick overview of feature relationships. Following that, a detailed table lists the exact correlation values with our target variable, 'is_fraud'.
Data Cleaning Summary
01
Missing Values Handled
Utilized median and mode imputation strategies for robust data completion.
02
Type Conversions
Converted 'amount' field to numeric for accurate quantitative analysis.
03
Outlier Capping
Applied IQR and clipping methods to manage and reduce the impact of extreme outliers.
04
Duplicates Removed
Identified and eliminated redundant records to ensure data integrity and prevent bias.
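The outlier-capping step (item 03) can be sketched as clipping values outside the standard IQR fences, [Q1 − 1.5·IQR, Q3 + 1.5·IQR]; the amounts below are toy values:

```python
def percentile(sorted_vals, p):
    """Linear-interpolation percentile on a pre-sorted list (p in [0, 1])."""
    k = p * (len(sorted_vals) - 1)
    lo, hi = int(k), min(int(k) + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)

def iqr_cap(values, factor=1.5):
    """Clip values to [Q1 - factor*IQR, Q3 + factor*IQR]."""
    s = sorted(values)
    q1, q3 = percentile(s, 0.25), percentile(s, 0.75)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [min(max(v, low), high) for v in values]

amounts = [12.0, 35.5, 48.0, 52.3, 60.0, 75.0, 5000.0]  # one extreme outlier
capped = iqr_cap(amounts)  # the 5000.0 is pulled down to the upper fence
```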
Feature Engineering Overview
Our feature engineering process focused on creating robust and informative variables across several key domains to enhance predictive power.
Amount Features
Created features related to transactional values and financial magnitudes.
Time Features
Derived variables from timestamps, capturing temporal patterns and seasonality.
Geographic Features
Incorporated location-based data to identify regional trends and spatial relationships.
Velocity Features
Developed metrics to track rates of activity and changes over time periods.
Category Risk Features
Assessed risk profiles associated with different categorical data points.
Following thorough selection and refinement, a total of ~31 features were ultimately chosen for integration into our modeling efforts, ensuring a balanced and impactful set of variables.
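Two of the feature families above, time and velocity, can be sketched as follows. The timestamp format, night window, and card identifier are illustrative assumptions:

```python
from collections import defaultdict
from datetime import datetime

def time_features(ts):
    """Derive hour, day-of-week, and a night flag from an ISO timestamp."""
    dt = datetime.fromisoformat(ts)
    return {
        "hour": dt.hour,
        "day_of_week": dt.weekday(),                     # 0 = Monday
        "is_night": int(dt.hour < 6 or dt.hour >= 22),   # assumed night window
    }

def txns_per_hour(records):
    """Velocity: count transactions per (card, date, hour) bucket."""
    counts = defaultdict(int)
    for card, ts in records:
        dt = datetime.fromisoformat(ts)
        counts[(card, dt.date(), dt.hour)] += 1
    return counts

f = time_features("2020-06-21T23:15:00")  # {'hour': 23, 'day_of_week': 6, 'is_night': 1}
```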
Feature Correlation Analysis
Top Correlates with Fraud Target Variable
Understanding which features have the strongest correlation with fraudulent transactions helps guide our feature selection and model development process.
1
Transaction Amount
Correlation: ≈ 0.1296
Strongest predictor - higher amounts indicate increased fraud risk
2
Night Transactions
Correlation: ≈ 0.0887
Transactions during nighttime hours show elevated fraud patterns
3
Log Amount
Correlation: ≈ 0.0700
Logarithmic transformation of amount provides additional predictive power
4
Transactions Per Hour
Correlation: ≈ 0.0676
High transaction frequency within short time windows indicates risk
5
Category Risk Score
Correlation: ≈ 0.0671
Engineered feature based on historical fraud rates by category

Feature Engineering Impact: While individual correlations appear modest, combining these features through advanced machine learning techniques creates powerful fraud detection capabilities.
Handling Class Imbalance
Algorithm-Specific Weights
Utilized class_weight parameter in Logistic Regression and Random Forest models to assign higher penalties to misclassified minority class samples, balancing the impact of imbalance during training.
Custom Weighting for Random Forest
Implemented a custom weighting scheme for Random Forest, specifying class_weight={0: 1, 1: 10} to heavily penalize misclassifications of the fraudulent (minority) class, improving its detection.
XGBoost Scale Positive Weight
Employed scale_pos_weight in XGBoost, setting it to the ratio of negative to positive samples. This parameter directly scales the gradient of positive samples, making the model more sensitive to the minority class.
Stratified Sampling & SMOTE
Ensured training and testing datasets maintained representative proportions of both classes using stratified sampling during the split. SMOTE was considered for synthetic minority oversampling but used carefully to avoid overfitting.
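The weights described above all derive from the label counts. A minimal sketch using the dataset's approximate totals (model constructor calls shown in comments are the usual sklearn/XGBoost parameters):

```python
# Derive imbalance-correcting weights from the label counts.
n_pos, n_neg = 9_651, 1_842_743   # fraud / legitimate counts from the dataset

# XGBoost convention: scale_pos_weight = negatives / positives
scale_pos_weight = n_neg / n_pos  # ~190.9

# sklearn "balanced" convention: n_samples / (n_classes * n_class_i)
n = n_pos + n_neg
class_weight = {0: n / (2 * n_neg), 1: n / (2 * n_pos)}

# These would be passed as, e.g.:
#   LogisticRegression(class_weight="balanced")
#   RandomForestClassifier(class_weight={0: 1, 1: 10})
#   XGBClassifier(scale_pos_weight=scale_pos_weight)
```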
Models Tried & Rationale
Logistic Regression
  • Baseline model for comparison.
  • Fast to train and make predictions.
  • Highly interpretable.
Random Forest
  • Robust against overfitting.
  • Provides important feature importances.
XGBoost
  • Strong performance with imbalanced datasets.
  • Efficient and scalable.
Training Setup & Tuning
Data Splitting
Utilized an 80/20 stratified train/test split to maintain class distribution in both subsets, ensuring robust model evaluation and preventing bias.
Evaluation Metric
Hyperparameter tuning was primarily optimized for F1-score and Recall, crucial for minimizing false negatives in our imbalanced dataset and prioritizing detection.
Key Hyperparameters Tuned
Specific parameters adjusted included C for Logistic Regression, n_estimators and max_depth for Random Forest, and scale_pos_weight for XGBoost.
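The stratified 80/20 split is equivalent to shuffling and splitting each class separately, which is what keeps the fraud rate identical in both subsets. A hand-rolled sketch (in the project this would be `train_test_split(..., stratify=y)`):

```python
import random

def stratified_split(X, y, test_frac=0.2, seed=42):
    """Return (train_indices, test_indices) preserving class proportions."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in set(y):
        idx = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(idx)
        cut = int(len(idx) * (1 - test_frac))
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return train_idx, test_idx

# Toy data: 90 legitimate, 10 fraud -> both splits keep a 10% fraud rate
y = [0] * 90 + [1] * 10
X = list(range(100))
tr, te = stratified_split(X, y)
train_fraud_rate = sum(y[i] for i in tr) / len(tr)  # 0.1
```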
Evaluation Metrics
Precision
Precision = TP / (TP+FP)
Measures the accuracy of positive predictions. In fraud detection, high precision means fewer legitimate transactions are flagged as fraudulent (minimizing false positives), which reduces customer inconvenience and operational costs associated with investigating false alarms.
Recall
Recall = TP / (TP+FN)
Measures the ability to find all positive samples. In fraud detection, high recall means more actual fraudulent transactions are correctly identified (minimizing false negatives), which is crucial for preventing financial losses and maintaining security.
F1-Score
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The harmonic mean of Precision and Recall. It provides a balanced measure of a model's performance, especially useful when there is an uneven class distribution and both false positives and false negatives are important to consider.
ROC-AUC
ROC-AUC (Receiver Operating Characteristic - Area Under the Curve) measures the model's ability to distinguish between classes. A higher AUC indicates better overall discrimination between fraudulent and legitimate transactions, regardless of the classification threshold.
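The four metrics above computed from scratch on toy predictions (ROC-AUC via its rank interpretation: the probability that a random fraud case outscores a random legitimate one):

```python
def confusion(y_true, y_pred):
    """Return (TP, FP, FN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn

def prf1(y_true, y_pred):
    tp, fp, fn = confusion(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def roc_auc(scores, labels):
    """AUC = P(random positive outscores random negative), ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

p, r, f1 = prf1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])  # each ~0.667
```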
Results Table
The results indicate that Random Forest achieves the best balance between precision and recall, as reflected in its F1 score, suggesting strong performance on both fraudulent and legitimate transactions. XGBoost shows high recall but lower precision, while Logistic Regression underperforms in this context, especially on recall and F1.
Model Selection
Recommended Models by Use Case
Balanced Performance (Production Ready)
Model: Random Forest (Original)
F1-Score: 0.7954
When to use: Need good balance of precision and recall
Conservative Approach (Minimize False Alarms)
Model: Random Forest (Original)
Precision: 0.9709
When to use: Customer experience is critical
Aggressive Detection (Catch All Fraud)
Model: XGBoost (Original)
Recall: 0.9508
When to use: Fraud prevention is top priority
Best Discriminator (Highest AUC)
Model: XGBoost (Original)
ROC-AUC: 0.9978
When to use: Model confidence is important
Conclusion
Summary
Thoughtful feature engineering combined with tree-based models delivered strong predictive performance.
Recommendation
For balanced performance, use Random Forest; for aggressive detection, use XGBoost. Consider testing in shadow mode before full deployment.
Takeaway
The approach is effective, flexible, and ready for production experimentation.