My first data science project

Saulo Ferreira Cunha
8 min read · Jan 7, 2021

A couple of months ago I started a new adventure in my professional life: studying a new field after 11 years working in software support and implementation. I had worked on BI projects and used ETL in my job, but Data Science was something I never considered right for me, because I am not a statistics kind of person. Still, I was always fascinated by the idea of "predicting the future based on past behavior".

So I started to search for training courses in the area. I attended a bootcamp and some free online courses, but nothing that I truly enjoyed, until I heard of Meigarom Lopes and his website "Seja um Data Scientist". I was suspicious at first, but then I was convinced to try "Data Science em Produção". This course opened my mind with Meigarom's "framework", the way he brings his experience as a data scientist to us beginners, and the diverse applications of Data Science.

After finishing his course, and hearing several times "You need to practice and build your project portfolio", I started practicing in "Comunidade DS", a community of data science students and professionals dedicated to building projects and sharing our experiences with each other.

And here we are: I took a project the community had worked on in the past and tried to do it by myself, applying all the knowledge gained so far.

The context

The Blocker Fraud Company is a company specialized in detecting fraud in financial transactions made through mobile devices. The company has a service called “Blocker Fraud” in which it guarantees the blocking of fraudulent transactions.

The company's business model is service-based, with monetization tied to the performance of the service provided: the customer pays a fee based on the success in detecting fraud in its transactions.

However, the Blocker Fraud Company is expanding in Brazil, and to acquire customers more quickly it has adopted a very aggressive strategy, which works as follows:

  • The company receives 25% of the value of each transaction correctly detected as fraud.
  • The company receives 5% of the value of each transaction detected as fraud that is actually legitimate.
  • The company returns 100% of the value to the customer for each transaction classified as legitimate that is actually a fraud.
  • With this aggressive strategy, the company assumes the risk of failing to detect fraud and is compensated for assertive fraud detection.

For the client, hiring the Blocker Fraud Company is an excellent deal. Although the fee charged is very high (25% on success), the client reduces its costs with correctly detected fraudulent transactions, and the damage caused by an error in the anti-fraud service is covered by the Blocker Fraud Company itself.

For the company, besides attracting many customers with this risky strategy of guaranteeing reimbursement when it fails to detect fraud, everything depends on the precision and accuracy of the models built by its Data Scientists: the more accurate the "Blocker Fraud" model, the greater the company's revenue. However, if the model has low accuracy, the company could suffer a huge loss.

The challenge

You have been hired as a Data Science Consultant to create a model of high precision and accuracy for detecting fraud in transactions made through mobile devices.

At the end of your consulting engagement, you need to deliver to the CEO of Blocker Fraud Company a model in production, accessed via API: customers will send their transactions through the API so that your model classifies them as fraudulent or legitimate.

In addition, you will need to submit a report on your model's performance and results regarding the profit and loss the company will have when using it. Your report should answer the following questions:

  • What is the model’s Precision and Accuracy?
  • How Reliable is the model in classifying transactions as legitimate or fraudulent?
  • What is the Expected Billing by the Company if we classify 100% of transactions with the model?
  • What is the Loss Expected by the Company in case of model failure?
  • What is the Profit Expected by the Blocker Fraud Company when using the model?

The Dataset

The dataset is available on the Kaggle platform, but all the business context was extracted from the website "Seja um Data Scientist".

The Solution

Data Description

In this step I tried to understand the dataset: how big it is, how many columns and rows it has, whether there are null values, and some descriptive statistics of the data.

Two points caught my attention in this step: the size of the dataset (about 6 million rows) and the imbalance between frauds (0.13%) and legitimate transactions (99.87%).
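
A minimal sketch of this first inspection, assuming the Kaggle CSV is saved locally as fraud_data.csv (the file name and the target column isFraud follow the original Kaggle dataset, but treat them as assumptions here):

```python
import pandas as pd

# Assumed local file name for the Kaggle dataset
df = pd.read_csv('fraud_data.csv')

print(df.shape)         # number of rows and columns (~6 million rows)
print(df.isna().sum())  # null values per column
print(df.describe())    # basic descriptive statistics

# Class balance: 'isFraud' is the target column in the Kaggle dataset
print(df['isFraud'].value_counts(normalize=True) * 100)
```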

Feature Engineering

After this overview, I began creating hypotheses about the data. I did this step before analyzing all the data to avoid being influenced by what I saw. I built a mind map with all the features of the dataset and tried to extract insights from them.

Hypotheses Mind Map

I derived several features, but the ones that significantly increased the models' performance were the following (a sketch of how they can be computed comes after the list):

diffOrig — delta between the old balance and the new balance of the origin account, given the transaction amount. In a regular case it must be equal to zero.

diffDest — delta between the old balance and the new balance of the destination account, given the transaction amount, for transfers to regular customers (other transaction types don't update the destination balance). In a regular case it must be equal to zero.

qtdTransferOrigName — number of transactions from the origin account name.

qtdTransferDestName — number of transactions to the destination account name.
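
A minimal sketch of these derivations, assuming the original Kaggle column names (amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest, nameOrig, nameDest); the exact rules used in the project may differ slightly:

```python
# Difference between expected and actual balance on the origin account:
# in a regular transaction, old balance minus amount equals the new balance
df['diffOrig'] = df['oldbalanceOrg'] - df['amount'] - df['newbalanceOrig']

# Same idea for the destination account (its balance should grow by the amount)
df['diffDest'] = df['oldbalanceDest'] + df['amount'] - df['newbalanceDest']

# Number of transactions sent from / received by each account name
df['qtdTransferOrigName'] = df.groupby('nameOrig')['nameOrig'].transform('count')
df['qtdTransferDestName'] = df.groupby('nameDest')['nameDest'].transform('count')
```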

Exploratory Data Analysis

This was the most challenging step for me, and I'm still working hard to get better at it, because I realize its importance, even though it broke my mind a few times.

In this step I look for insights, values and information about the business and the phenomenon. I applied a logarithmic transformation to be able to see the behavior of frauds, given the unbalanced proportion between fraudulent and legitimate transactions.

The points that caught my attention were the balance differences between old balance and new balance relative to the transaction amount.

Balance differences on Orig and Dest. Orig ends with a negative difference, because the transfer is bigger than the balance; Dest is left with almost nothing after receiving a transfer.
Most frauds have a single transaction on the origin account, but several transactions on the same destination account.
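
A minimal sketch of the kind of plot used here, applying a log transform to the transaction amount so that the rare frauds become visible next to the legitimate transactions (column names again assume the Kaggle dataset):

```python
import numpy as np
import matplotlib.pyplot as plt

# log1p keeps zero amounts valid and compresses the long tail of large values
df['log_amount'] = np.log1p(df['amount'])

fig, ax = plt.subplots()
ax.hist(df.loc[df['isFraud'] == 0, 'log_amount'], bins=50, alpha=0.5, label='legitimate')
ax.hist(df.loc[df['isFraud'] == 1, 'log_amount'], bins=50, alpha=0.5, label='fraud')
ax.set_xlabel('log(amount)')
ax.legend()
plt.show()
```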

Data Preparation

In this step I prepare the data to be processed by a Machine Learning model. Some transformations are applied to the data according to their characteristics (a sketch of these transformations comes right after the list below).

Rescaling — step, qtdTransferOrigName, qtdTransferDestName, oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest, week, day, nameOrigNumber, nameDestNumber, diffOrig, diffDest. Rescaling was applied to these features to change the range of the data without changing the phenomenon.

Encoding — type, origType, destType. Most Machine Learning models don't work with categorical features, so I needed to transform them into numerical ones. A dummy (one-hot) transformation was applied, which creates a new column for each possible value.

Nature Transformation — hour. Sine and cosine transformations were applied to encode the event cycle; this helps the model understand the cyclical spacing of the data.
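
A minimal sketch of these preparations with pandas and scikit-learn; the choice of MinMaxScaler, the exact column list, and the derivation of the hour column from step are assumptions for illustration, not necessarily what was used in the project:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Assumed derivation of the hour of day (in the Kaggle dataset 1 step = 1 hour)
df['hour'] = df['step'] % 24

# Nature transformation: encode the hour as a cycle with sine and cosine
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

# Encoding: one new column per possible value of the categorical feature
df = pd.get_dummies(df, columns=['type'])

# Rescaling: change the range of the numeric features without changing the phenomenon
num_cols = ['step', 'qtdTransferOrigName', 'qtdTransferDestName',
            'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest',
            'newbalanceDest', 'diffOrig', 'diffDest']
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
```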

Feature Selection

In this step I try to reduce the features to the minimum necessary to explain the phenomenon to the Machine Learning model. This is based on the principle of Occam's Razor, and this reduction significantly influences the model's performance and how well it captures the behavior of frauds.

The columns were chosen using two techniques (a sketch follows the list):

  • Feature Importance — using a Random Forest Regressor that calculates the influence of each feature on the phenomenon
  • Boruta — using a forest-based wrapper to find the best set of features that explain the phenomenon
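
A minimal sketch of both techniques, using scikit-learn feature importances and the BorutaPy package; a Random Forest classifier is used here for illustration, and the parameters are assumptions rather than the project's exact configuration:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

X = df.drop(columns=['isFraud']).values
y = df['isFraud'].values
feature_names = df.drop(columns=['isFraud']).columns

# 1) Feature importance from a Random Forest
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X, y)
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))

# 2) Boruta: iteratively compares real features against shuffled "shadow" features
boruta = BorutaPy(rf, n_estimators='auto', random_state=42)
boruta.fit(X, y)
print(feature_names[boruta.support_].tolist())
```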

Machine Learning Models

In this step I applied some machine learning models to try to predict the frauds. I used the following models (a minimal training sketch follows the list):

  • Random — Baseline
  • Ridge Classifier
  • XGBoost Classifier
  • Random Forest Classifier
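
A minimal sketch of how these candidates can be trained and compared; the train/validation split, hyperparameters and metric choices here are assumptions for illustration:

```python
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score
from xgboost import XGBClassifier

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    'Random (baseline)': DummyClassifier(strategy='uniform', random_state=42),
    'Ridge Classifier': RidgeClassifier(),
    'XGBoost Classifier': XGBClassifier(n_estimators=200, eval_metric='logloss'),
    'Random Forest Classifier': RandomForestClassifier(n_estimators=100, n_jobs=-1),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    print(f'{name}: precision={precision_score(y_val, pred):.4f} '
          f'recall={recall_score(y_val, pred):.4f} f1={f1_score(y_val, pred):.4f}')
```
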
Model Performance

With these performance results, I'll continue with the XGBoost and Random Forest models.

Models Performance

In this step I try to validate the performance obtained in the previous step. I applied cross-validation, a technique that splits the training dataset into smaller pieces and tries to predict each held-out piece. This step is necessary to check that the model is not overfitting, which happens when the model memorizes the dataset but cannot predict data it has never seen.

XGBoost with default parameters
XGBoost with parameters for unbalanced data
Random Forest with default parameters
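
A minimal sketch of the cross-validation step with stratified folds; the number of folds, the F1 scoring choice, and the scale_pos_weight adjustment for the unbalanced variant are assumptions:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Default parameters
xgb_default = XGBClassifier(eval_metric='logloss')

# Variant weighted for the unbalanced classes
xgb_balanced = XGBClassifier(
    eval_metric='logloss',
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum())

for name, model in [('default', xgb_default), ('balanced', xgb_balanced)]:
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='f1', n_jobs=-1)
    print(f'XGBoost ({name}): f1 = {scores.mean():.4f} +/- {scores.std():.4f}')
```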

Model Test Predictions

After all that, it is time to test the real performance on new data. Until now, every execution was made with training and validation data; with this result we can see the real performance and efficiency of the model.

XGBoost Final Performance | Random Forest Final Performance

Final Model

The models' performances were similar: the Random Forest model reached 99.51% precision/recall and the XGBoost model 99.57% precision/recall. However, Random Forest shows the best execution performance, running about 33% faster than XGBoost. The pickle export of the Random Forest model is also significantly smaller, about 78% smaller than XGBoost's. So we'll continue with the Random Forest model.
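
A minimal sketch of exporting the chosen model with pickle and checking the file size, as mentioned above; the variable and file names are illustrative:

```python
import os
import pickle

# 'rf_model' is the trained Random Forest from the earlier training sketch
rf_model = models['Random Forest Classifier']

# Persist the model for the production API
with open('model_random_forest.pkl', 'wb') as f:
    pickle.dump(rf_model, f)

size_mb = os.path.getsize('model_random_forest.pkl') / (1024 * 1024)
print(f'Exported model size: {size_mb:.1f} MB')
```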

The Business Result

The test dataset has 1,272,524 transactions, of which 1,643 are frauds, totaling $2,507,288,036.22 in fraud.

Our model is able to detect 99.51% of all frauds, avoiding $2,504,426,229.95 in fraud.

The frauds not detected by our model amount to 0.49% of all frauds, or $2,861,806.27 not avoided. All this amount will be refunded to the customer.

With our business strategy, we are able to generate savings of $1,881,181,478.73 for the customer, while generating revenue of $623,244,751.22 for the Blocker Fraud Company.
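
A minimal sketch of how these business numbers follow from the pricing rules; note that the 5% fee on false positives is left out of the revenue here (treated as negligible), which is an assumption of this sketch:

```python
total_fraud      = 2_507_288_036.22   # total fraud value in the test set
detected_fraud   = 2_504_426_229.95   # fraud value correctly detected (99.51%)
undetected_fraud = 2_861_806.27       # fraud value missed by the model (0.49%)

# Company revenue: 25% fee on detected fraud, minus 100% refund of missed fraud
revenue = 0.25 * detected_fraud - undetected_fraud
print(f'Company revenue:  $ {revenue:,.2f}')   # ~ 623,244,751.22

# Customer savings: total fraud value minus the fee paid to the company
savings = total_fraud - 0.25 * detected_fraud
print(f'Customer savings: $ {savings:,.2f}')   # ~ 1,881,181,478.73
```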

Who I am

My name is Saulo Ferreira Cunha, System Analyst and Data Scientist.

Email: saulofcunha@outlook.com

Linkedin: https://www.linkedin.com/in/saulo-ferreira-cunha-6a6ba232/

Git: https://github.com/s4ul0bk/projectsPortfolio
