
Fraud Detection with Machine Learning


1. Background and Objective

This project focuses on developing a machine learning–based fraud detection system for real-world transaction data.

Fraudulent transactions are typically rare and highly heterogeneous, and their patterns evolve over time, so rule-based systems or a single global model are insufficient.

The objective of this project is to build a robust, explainable, and scalable modeling framework that improves fraud detection performance by combining domain-informed feature engineering, user-segmentation modeling, and ensemble learning.


2. Data Characteristics and Challenges

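The transaction data exhibits several challenges typical of production fraud data, including the rarity, heterogeneity, and temporal drift of fraudulent transactions described above.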

Given these properties, the project prioritizes modeling strategy and feature design over excessive model complexity.


3. Feature Engineering

3.1 User-Level Features

Encrypted address information enables unique identification of purchasers, allowing user-centric historical analysis.

Key idea: segment users by whether they have purchase history.

For users with purchase history, the following features were constructed:

These features capture behavioral consistency, which is critical for fraud detection.
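As an illustration of user-centric historical features, here is a minimal sketch in plain Python. The specific features shown (prior transaction count, prior mean amount, and the ratio of the current amount to that mean) are hypothetical examples and are not necessarily the ones used in this project. The field names `user_key`, `timestamp`, and `amount` are also assumptions. The one point the sketch is meant to convey is that each feature is computed only from a user's strictly earlier transactions, so no future information leaks into a row.

```python
from collections import defaultdict


def add_user_history_features(transactions):
    """Attach per-user history features computed from strictly earlier transactions.

    `transactions` is a list of dicts with hypothetical keys:
    'user_key' (e.g. derived from the encrypted address), 'timestamp', 'amount'.
    """
    history = defaultdict(list)  # user_key -> amounts of earlier transactions
    enriched = []
    for tx in sorted(transactions, key=lambda t: t["timestamp"]):
        past = history[tx["user_key"]]
        prior_count = len(past)
        prior_mean = sum(past) / prior_count if prior_count else None
        enriched.append({
            **tx,
            "has_history": prior_count > 0,
            "prior_tx_count": prior_count,
            "prior_mean_amount": prior_mean,
            # How unusual is this amount relative to the user's own past?
            "amount_to_prior_mean": tx["amount"] / prior_mean if prior_mean else None,
        })
        # Update the history only after the features for this row are computed.
        past.append(tx["amount"])
    return enriched
```

The `has_history` flag produced here is the kind of signal that a segmentation step can later use to separate new users from users with purchase history.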


3.2 Merchant-Level Features

To complement transaction-level signals, merchant-specific risk characteristics were incorporated:

This enables the model to learn merchant-level risk profiles.
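One common way to express a merchant-level risk profile is a smoothed historical fraud rate per merchant. The sketch below is an assumption about what such a feature could look like, not a description of the features actually used here. The names `merchant_id`, `is_fraud`, and `prior_weight` are hypothetical. Smoothing pulls merchants with few transactions toward the global fraud rate, so a merchant with a single fraudulent transaction is not assigned a 100% risk.

```python
from collections import Counter


def merchant_fraud_rates(train_rows, prior_weight=20.0):
    """Return (per-merchant smoothed fraud rate, global fraud rate).

    `train_rows` is a list of dicts with hypothetical keys
    'merchant_id' and 'is_fraud' (0/1), taken from labeled training data only.
    """
    total = Counter()
    fraud = Counter()
    for row in train_rows:
        total[row["merchant_id"]] += 1
        fraud[row["merchant_id"]] += int(row["is_fraud"])

    global_rate = sum(fraud.values()) / max(sum(total.values()), 1)
    # Shrink each merchant's raw rate toward the global rate;
    # prior_weight acts like that many pseudo-transactions at the global rate.
    rates = {
        m: (fraud[m] + prior_weight * global_rate) / (total[m] + prior_weight)
        for m in total
    }
    return rates, global_rate
```

At prediction time, a merchant that never appeared in training can fall back to `global_rate`. In practice the rates should be computed on data that precedes the rows they are attached to, otherwise the label leaks into the feature.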


4. User Segmentation Modeling Strategy

New users and long-time users differ fundamentally in both available information and behavioral stability.

Therefore, this project adopts an explicit user segmentation modeling strategy.

4.1 Modeling Policy

- New users

- Users with purchase history
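The segmentation policy can be sketched as a thin router that sends each transaction to a segment-specific model. This is a minimal illustration under stated assumptions: the class name `SegmentedFraudModel`, the feature key `prior_tx_count`, and the use of plain callables as models are all hypothetical, and the real models would be trained classifiers rather than the stand-in functions shown in the usage lines.

```python
from typing import Callable, Dict, Mapping

# A scorer maps a feature dict to a fraud score in [0, 1].
Scorer = Callable[[Mapping[str, float]], float]


class SegmentedFraudModel:
    """Route each transaction to the model trained for its user segment."""

    def __init__(self, new_user_model: Scorer, returning_user_model: Scorer):
        self.models: Dict[str, Scorer] = {
            "new": new_user_model,
            "returning": returning_user_model,
        }

    @staticmethod
    def segment_of(features: Mapping[str, float]) -> str:
        # Users with at least one earlier purchase count as "returning".
        return "returning" if features.get("prior_tx_count", 0) > 0 else "new"

    def score(self, features: Mapping[str, float]) -> float:
        return self.models[self.segment_of(features)](features)


# Usage with stand-in scorers (placeholders for trained models).
router = SegmentedFraudModel(
    new_user_model=lambda f: 0.5,
    returning_user_model=lambda f: 0.1,
)
new_score = router.score({"amount": 120.0})
returning_score = router.score({"amount": 120.0, "prior_tx_count": 3})
```

Keeping the routing rule in one place means the same rule is applied at training and inference time, which avoids a model being scored on a segment it never saw.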


5. Model Selection

Multiple models were evaluated during experimentation:

CatBoost was selected as the core model for its robustness to missing values, its handling of categorical features, and its training stability.


6. Hyperparameter Tuning and Analysis

Systematic hyperparameter tuning was conducted for CatBoost, along with an analysis of hyperparameter importance.

Key findings include:

These insights guided efficient and interpretable tuning strategies.

Figure: Hyperparameter Importance Analysis

7. Ensemble Learning

To further enhance robustness, AutoGluon was applied for ensembling:


8. Experimental Results (F1-score)

8.1 Overall Performance

8.2 Segmented Performance

User group                        F1 before    F1 after
Users with purchase history       0.44         0.60
Users without purchase history    0.18         0.30
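For reference, the F1-score is the harmonic mean of precision and recall, which can be written directly in terms of confusion-matrix counts as F1 = 2·TP / (2·TP + FP + FN). The sketch below computes it and also derives the absolute and relative gains implied by the segmented scores above. The function names are illustrative, and the only data used are the four F1 values from the table.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def relative_gain(before: float, after: float) -> float:
    """Improvement as a fraction of the starting value."""
    return (after - before) / before


# Segmented F1-scores reported in the table (before, after).
segments = {
    "Users with purchase history": (0.44, 0.60),
    "Users without purchase history": (0.18, 0.30),
}

for name, (before, after) in segments.items():
    print(f"{name}: +{after - before:.2f} absolute, "
          f"{relative_gain(before, after):.1%} relative")
# Users with purchase history: +0.16 absolute, 36.4% relative
# Users without purchase history: +0.12 absolute, 66.7% relative
```

The absolute gain is larger for users with purchase history, while the relative gain is larger for users without it, because that segment starts from a much lower baseline.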

These results indicate:


9. Summary

This project demonstrates that effective fraud detection relies on:

The framework is production-oriented and can be naturally extended to merchant-level, regional, or time-window–based modeling.