My first data science project
A couple of months ago I started a new adventure in my professional life: studying a new field after 11 years in software support and implementation. I had worked on BI projects and used ETL in my work, but I never considered Data Science right for me because I am not a statistics kind of person. Still, I was always fascinated by “predicting the future based on past behavior”.
So I started to search for training courses in this area. I took part in a bootcamp and some free online courses, but found nothing that I truly enjoyed, until I heard of Meigarom Lopes and his website “Seja um Data Scientist”. I was suspicious at first, but then I was convinced to try “Data Science em Produção”. This course opened my mind with Meigarom's “framework”, with the way he shares his experience as a data scientist with us beginners, and with the diverse applications of Data Science.
After finishing his course, and hearing several times “You need to practice and build your project portfolio”, I started practicing at “Comunidade DS”, a community of data science students and professionals dedicated to building projects and sharing our experiences with each other.
And here we are. I took a project that the community had worked on in the past and tried to do it by myself, applying all the knowledge gained so far.
The Blocker Fraud Company is a company specialized in detecting fraud in financial transactions made through mobile devices. The company has a service called “Blocker Fraud” in which it guarantees the blocking of fraudulent transactions.
The company's business model is service-based, with monetization tied to the performance of the service provided: the user pays a fixed fee on each successful detection of fraud in the customer’s transactions.
However, the Blocker Fraud Company is expanding in Brazil and, to acquire customers more quickly, it has adopted a very aggressive strategy. The strategy works as follows:
- The company will receive 25% of the value of each transaction correctly detected as fraud.
- The company will receive 5% of the value of each transaction detected as fraud when the transaction is actually legitimate.
- The company will return 100% of the value to the customer for each transaction detected as legitimate when the transaction is actually a fraud.
- With this aggressive strategy, the company assumes the risk of failing to detect fraud and is compensated for accurate fraud detection.
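The rules above can be sketched as a small payoff function for a single transaction. This is only an illustration: the function name and the constants are hypothetical, and it assumes that a legitimate transaction classified as legitimate generates no fee, which the strategy does not state.

```python
from decimal import Decimal

# Rates taken from the strategy described above.
FEE_TRUE_FRAUD = Decimal("0.25")   # detected as fraud, actually fraud
FEE_FALSE_ALARM = Decimal("0.05")  # detected as fraud, actually legitimate
REFUND_MISSED = Decimal("1.00")    # detected as legitimate, actually fraud


def company_payoff(amount, predicted_fraud, is_fraud):
    """Return what the company earns (positive) or refunds (negative) for one transaction."""
    amount = Decimal(str(amount))
    if predicted_fraud and is_fraud:
        return amount * FEE_TRUE_FRAUD
    if predicted_fraud and not is_fraud:
        return amount * FEE_FALSE_ALARM
    if not predicted_fraud and is_fraud:
        return -amount * REFUND_MISSED
    # Assumption: a legitimate transaction classified as legitimate earns nothing.
    return Decimal("0")


caught = company_payoff(1000, predicted_fraud=True, is_fraud=True)
false_alarm = company_payoff(1000, predicted_fraud=True, is_fraud=False)
missed = company_payoff(1000, predicted_fraud=False, is_fraud=True)
```

The asymmetry is what makes the strategy risky: with these rates, one missed fraud costs as much as four correctly detected frauds of the same value earn.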
For the client, hiring the Blocker Fraud Company is an excellent deal. Although the fee charged is very high, 25% upon success, the client reduces its costs with fraudulent transactions that are correctly detected, and the damage caused by an error in the anti-fraud service is covered by the Blocker Fraud Company itself.
For the company, besides attracting many customers with this risky strategy of guaranteeing reimbursement when it fails to detect a fraud, the result depends only on the precision and accuracy of the models built by its Data Scientists: the more accurate the “Blocker Fraud” model, the greater the company’s revenue. However, if the model has low accuracy, the company could have a huge loss.
You have been hired as a Data Science Consultant to create a model of high precision and accuracy for detecting fraud in transactions made through mobile devices.
At the end of your consultation, you need to deliver to the CEO of Blocker Fraud Company a model in production that is accessed via API, that is, customers will send their transactions via API so that your model classifies them as fraudulent or legitimate.
In addition, you will need to submit a report of your model’s performance and results in relation to the profit and loss that the company will have when using the model you produced. Your report should contain the answers to the following questions:
- What is the model’s Precision and Accuracy?
- How Reliable is the model in classifying transactions as legitimate or fraudulent?
- What is the Expected Billing by the Company if we classify 100% of transactions with the model?
- What is the Loss Expected by the Company in case of model failure?
- What is the Profit Expected by the Blocker Fraud Company when using the model?
In this step I tried to understand the dataset. I looked at it to find out how big it is, how many columns and rows it has, whether null values exist, and some descriptive statistics of the data.
Two points caught my attention in this step: the size of the dataset (6 million rows) and the imbalance between frauds (0.13%) and legitimate transactions (99.87%).
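A first look like this can be sketched with pandas. The helper name, the toy rows and the target column name `isFraud` are assumptions for illustration only; they are not taken from the project's notebook.

```python
import pandas as pd


def overview(df, target="isFraud"):
    """Summarize size, missing values and class balance of a dataset."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "null_cells": int(df.isna().sum().sum()),
        # Share of each class, e.g. {0: 0.75, 1: 0.25}
        "class_share": df[target].value_counts(normalize=True).to_dict(),
    }


# Toy data only; the real dataset has millions of rows.
toy = pd.DataFrame({
    "amount": [10.0, 25.5, None, 300.0],
    "isFraud": [0, 0, 0, 1],
})
summary = overview(toy)
```

With a fraud share as low as 0.13%, a model that always answers "legitimate" would already be about 99.87% accurate, so accuracy alone says little here.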
After the overview, I began creating hypotheses about the data. I did this step before analysing all the data to avoid being influenced by what I saw. I started by creating a mind map with all the features of the dataset and tried to get insights from these features.
I derived several features, but the ones that significantly increased the models' performance were:
diffOrig: delta between the old balance and the new balance of the origin account after accounting for the transaction amount. In a regular transaction it must be equal to zero.
diffDest: delta between the old balance and the new balance of the destination account after accounting for the transaction amount, in transfer transactions to a regular customer. Other transaction types don't update the destination balance. In a regular transaction it must be equal to zero.
qtdTransferOrigName: number of transactions sent from the account name.
qtdTransferDestName: number of transactions sent to the account name.
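One plausible way to compute these four features with pandas is sketched below. The exact formulas are my reading of the descriptions above, and the column names `amount`, `nameOrig` and `nameDest` are assumptions. For simplicity the sketch treats every `TRANSFER` row as a transfer to a regular customer.

```python
import pandas as pd


def add_derived_features(df):
    """Add balance deltas and per-account transaction counts (illustrative formulas)."""
    out = df.copy()
    # Origin side: old balance minus amount should equal the new balance.
    out["diffOrig"] = out["oldbalanceOrg"] - out["amount"] - out["newbalanceOrig"]
    # Destination side: only evaluated for transfers, zero for other types.
    is_transfer = out["type"] == "TRANSFER"
    dest_delta = out["oldbalanceDest"] + out["amount"] - out["newbalanceDest"]
    out["diffDest"] = dest_delta.where(is_transfer, 0.0)
    # How many transactions each account name sends and receives.
    out["qtdTransferOrigName"] = out.groupby("nameOrig")["nameOrig"].transform("count")
    out["qtdTransferDestName"] = out.groupby("nameDest")["nameDest"].transform("count")
    return out


# Toy rows: the second transfer leaves both balances inconsistent with the amount.
toy = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "PAYMENT"],
    "amount": [100.0, 200.0, 50.0],
    "nameOrig": ["C1", "C1", "C2"],
    "oldbalanceOrg": [500.0, 400.0, 80.0],
    "newbalanceOrig": [400.0, 0.0, 30.0],
    "nameDest": ["C9", "C9", "M1"],
    "oldbalanceDest": [0.0, 100.0, 0.0],
    "newbalanceDest": [100.0, 100.0, 0.0],
})
features = add_derived_features(toy)
```

In the toy rows, only the second transaction gets non-zero deltas, which is the kind of inconsistency these features are meant to expose.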
Exploratory Data Analysis
This is the most challenging step for me. I'm still working hard to get better at it, because I realize its importance, but it breaks my mind sometimes.
In this step I look for insights, values and information about the business and the phenomenon. I applied a logarithm to make the behavior of frauds visible, because of the unbalanced proportion of fraudulent and legitimate transactions.
What caught my attention was the difference between the old balance and the new balance relative to the transaction amount.
In this step I prepare the data to be processed by a Machine Learning model. I apply transformations to the data according to its characteristics.
Rescaling: step, qtdTransferOrigName, qtdTransferDestName, oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest, week, day, nameOrigNumber, nameDestNumber, diffOrig, diffDest. Rescaling was applied to these features to change the range of the data without changing the phenomenon.
Encoding: type, origType, destType. Most Machine Learning models don't work with categorical features, so I need to transform these features into numerical types. A dummy transformation was applied, which creates a new column for each possible value.
Nature transformation: hour. Sine and cosine transformations were applied to represent the cyclical nature of this feature. This transformation helps the model understand the spacing of the data.
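The three transformations can be sketched together as follows. This is a minimal illustration, assuming min-max rescaling (the text does not say which scaler was used) and a 24-hour cycle for `hour`; the helper name and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def prepare(df, numeric_cols, categorical_cols, hour_col="hour"):
    """Rescale numeric columns, dummy-encode categoricals, encode hour as sin/cos."""
    out = df.copy()
    # Min-max rescaling (assumed): maps each column to the range [0, 1].
    for col in numeric_cols:
        lo, hi = out[col].min(), out[col].max()
        span = hi - lo
        out[col] = 0.0 if span == 0 else (out[col] - lo) / span
    # Dummy encoding: one new 0/1 column per category value.
    out = pd.get_dummies(out, columns=categorical_cols, dtype=int)
    # Cyclical encoding: hour 23 ends up close to hour 0 on the circle.
    angle = 2 * np.pi * out[hour_col] / 24
    out[hour_col + "_sin"] = np.sin(angle)
    out[hour_col + "_cos"] = np.cos(angle)
    return out.drop(columns=[hour_col])


toy = pd.DataFrame({
    "diffOrig": [0.0, 50.0, 100.0],
    "type": ["TRANSFER", "PAYMENT", "TRANSFER"],
    "hour": [0, 6, 12],
})
prepared = prepare(toy, numeric_cols=["diffOrig"], categorical_cols=["type"])
```

In a real pipeline the minimum and maximum should be learned from the training data only and then reused on validation and test data, so that no information leaks from the data the model is evaluated on.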
In this step I try to reduce the features to the minimum necessary to explain the phenomenon to the Machine Learning model. This follows the principle of Occam’s Razor, and the reduction significantly influences the model's performance and how the behavior of frauds is modeled.
We chose the columns using 2 techniques:
- Feature Importance: using a Random Forest Regressor that calculates the influence of each feature on the phenomenon
- Boruta: using a forest to find the best features that explain the phenomenon
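The first technique can be sketched with scikit-learn. Note the swap: this sketch uses a `RandomForestClassifier` on synthetic data because the target is binary, whereas the text mentions a Random Forest Regressor; the `feature_importances_` attribute is read the same way for both. Boruta is not shown. The feature names `f0` to `f5` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the first 2 columns are the informative ones.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
cols = [f"f{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1; higher means the feature contributed more to the splits.
ranking = pd.Series(forest.feature_importances_, index=cols).sort_values(ascending=False)
print(ranking)
```

A common follow-up is to keep only the features above some importance threshold and compare that list with the one Boruta selects.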
Machine Learning Models
In this step I applied some machine learning models to try to predict the frauds. I used the following models:
- Random model (baseline)
- Ridge Classifier
- XGBoost Classifier
- Random Forest Classifier
With these performance results I'll continue with the XGBoost and Random Forest models.
In this step I try to validate the performance results from the previous step. I applied cross-validation, a technique that splits the training dataset into smaller pieces and tries to predict the next piece. This step is necessary to understand whether the model is overfitting, which is when the model memorizes the dataset but can't predict data it has never seen.
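A minimal cross-validation sketch with scikit-learn is shown below. It assumes stratified 5-fold splitting, a reasonable choice for unbalanced classes but not necessarily the setup used in the project, and it runs on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic unbalanced data standing in for the training set.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0,
)

# Stratified folds keep the fraud share roughly equal in every piece.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y,
    cv=folds,
    scoring=["precision", "recall"],
    return_train_score=True,
)

# A large gap between train and test scores is a sign of overfitting.
print("train recall:", scores["train_recall"].mean())
print("test recall: ", scores["test_recall"].mean())
```

Each fold is predicted by a model that never saw it during training, which is what makes the averaged test score a fairer estimate than a single split.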
Models Test Predict
Finally, we are going to test the real performance with new data. Until now, everything was executed with training and validation data. With this result we can see the real performance and efficiency of the model.
The models' performance was similar, with the “Random Forest Model” at 99.51% Precision/Recall and the “XGBoost Model” at 99.57% Precision/Recall. However, Random Forest has better execution performance, running about 33% faster than “XGBoost”. The pickle export of “Random Forest” is also significantly smaller, 78% smaller than that of “XGBoost”. So we'll continue with the “Random Forest Model”.
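Prediction time and pickle size can be measured in a few lines. The helper below is a hypothetical sketch, not the measurement code used for the numbers above, and it is shown only with a Random Forest on synthetic data; any fitted model with a `predict` method could be passed in the same way.

```python
import pickle
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def footprint(model, X):
    """Measure how long one predict call takes and how large the pickled model is."""
    start = time.perf_counter()
    model.predict(X)
    elapsed = time.perf_counter() - start
    return {"predict_seconds": elapsed, "pickle_bytes": len(pickle.dumps(model))}


X, y = make_classification(n_samples=300, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
result = footprint(forest, X)
print(result)
```

A single timing is noisy, so repeating the call several times and averaging gives a more stable comparison between models.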
The Business Result
The test dataset has 1,272,524 transactions, of which 1,643 are fraudulent, totaling 2,507,288,036.22 in frauds.
Our model is able to detect 99.51 percent of all frauds, avoiding $ 2,504,426,229.95 in frauds.
Frauds not detected by our model total 0.49 percent of all frauds, which means 2,861,806.27 in frauds was not avoided. This entire amount will be refunded to the customer.
Under our business strategy, we are able to generate savings of 1,881,181,478.73 for the customer and revenue of 623,244,751.22 for the Blocker Fraud Company.
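The reported figures can be reproduced with a few lines of arithmetic. This is a reconstruction, assuming that revenue is the 25% fee on detected frauds minus the refund of missed frauds, and that no 5% fee on false alarms is included; the variable names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

detected = Decimal("2504426229.95")  # value of frauds caught by the model
missed = Decimal("2861806.27")       # value of frauds the model did not catch
fee_rate = Decimal("0.25")           # fee on each correctly detected fraud

total_fraud = detected + missed
fee = detected * fee_rate

# Company: earns the fee, refunds 100% of the missed frauds.
revenue = (fee - missed).quantize(CENT, rounding=ROUND_HALF_UP)

# Customer: avoids the detected frauds, pays the fee, gets the missed frauds refunded.
customer_savings = (detected - fee + missed).quantize(CENT, rounding=ROUND_HALF_UP)

print(total_fraud, revenue, customer_savings)
```

Under these assumptions the results match the totals above to the cent.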
Who I am
My name is Saulo Ferreira Cunha, System Analyst and Data Scientist.