Detecting Electronic Banking Fraud on Highly Imbalanced Data using Hidden Markov Models

Recent research has revealed the capability of Machine Learning (ML) techniques to effectively detect fraud in electronic banking transactions, since they have the potential to detect new and unknown intrusions. A major challenge in the application of ML to fraud detection is the presence of highly imbalanced datasets. In many available datasets, the majority of transactions are genuine, with an extremely small percentage of fraudulent ones. Designing an accurate and efficient fraud detection system that is low on false positives but detects fraudulent activity effectively is a significant challenge for researchers. In this paper, a framework based on Hidden Markov Models (HMM), modified Density Based Spatial Clustering of Applications with Noise (DBSCAN) and the Synthetic Minority Oversampling Technique (SMOTE) is proposed to effectively detect fraud in a highly imbalanced electronic banking dataset. The various transaction types, transaction amounts and the frequency of transactions are taken into consideration by the proposed model to enable effective detection. With different numbers of hidden states for the proposed HMMs, simulations are performed for four (4) different approaches and their performances compared using precision, recall rate and F1-Score as the evaluation metrics. The study revealed that our proposed


Introduction
E-banking is a form of banking where funds are transferred through the exchange of electronic signals rather than cash, checks, or other types of paper documents [1]. Over the last few decades, E-banking has redefined the way banking is conducted across the globe, and the use of electronic payment platforms has continued to experience significant growth. It allows customers 24-hour access to their accounts, with the ability to transfer funds, perform online payments and apply for loans and other financial products virtually [2].
Fraud can be defined as any premeditated act of criminal deceit, trickery or falsification by a person or group of persons with the intention of altering facts in order to obtain undue personal monetary advantage [3]. Unfortunately, fraud cases relating to cyber-crime perpetrated through E-banking resulted in an actual loss of GH¢14.31 million and therefore present a unique challenge to individuals and to the financial institutions that offer those services [4]. To address this problem, financial institutions employ various fraud prevention tools such as real-time transaction authorization, transaction verification codes, transaction alerts and rule-based detection, among others. Fraudsters, however, are adaptive and, given time, devise several ways to circumvent such protection mechanisms [5]. There is therefore a need to implement enhanced technologies and systems that can effectively detect fraud in real time in order to maintain the viability of these electronic payment systems, where fraudsters constitute a very inventive and fast-moving fraternity. As preventive technology changes, so does the technology of criminals and the way they go about their fraudulent activities [6]. While it is necessary to detect and possibly prevent fraudulent transactions, it is also critical to ensure that genuine transactions are executed successfully.
One of the most important machine learning techniques for intrusion/anomaly detection is the Hidden Markov Model (HMM), which consists of hidden states and observable outputs and models probability distributions over sequences of observations. The hidden state layer is a stable Markov chain whose state probabilities are determined by the initial state probability vector π and the state transition probabilities. The observable output layer is determined by the observed-symbol probability matrix, which is derived from the observed symbols of each hidden state [7].
The application of HMMs ranges from speech and image recognition and intrusion/fraud detection to motion/action analysis in videos, among others. An HMM is generally characterized by the following [8]:
1. The number of hidden states in the model, denoted as $N$. The state at a specific time $t$ is denoted by $q_t$.
2. The number of unique observation symbols, denoted as $M$.
3. A transition probability between states, denoted by a matrix $A = [a_{ij}]$, where $a_{ij} = P(q_{t+1} = j \mid q_t = i)$ for $1 \le i, j \le N$. Also, $\sum_{j=1}^{N} a_{ij} = 1$ for every $i$.
4. An emission probability matrix, $B = [b_j(k)]$, where $b_j(k) = P(O_t = k \mid q_t = j)$ for $1 \le j \le N$, $1 \le k \le M$, and $O_t$ is the symbol observed at time $t$.
5. An initial probability for each state, denoted by the vector $\pi = [\pi_i]$, where $\pi_i = P(q_1 = i)$ and $\sum_{i=1}^{N} \pi_i = 1$.

In recent decades, many research communities have been working toward HMM-based intrusion detection, mainly because of its ability to detect new and unknown intrusions and its usability in real-time applications that process data streams on the fly. HMMs also allow for the use of heterogeneous data sources as input, and for a visual representation of acquired knowledge, relative to other machine learning techniques.
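The five elements listed above (N, M, A, B and π) can be held in a small container that also checks the row-sum constraints. The following is only a minimal sketch: the class name `HMMParams`, its method names and the toy numbers are illustrative assumptions, not part of the original work.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HMMParams:
    """Hypothetical container for the elements (A, B, pi) of a discrete HMM."""
    A: List[List[float]]   # N x N, A[i][j] = P(state j at t+1 | state i at t)
    B: List[List[float]]   # N x M, B[j][k] = P(symbol k | state j)
    pi: List[float]        # length N, initial state distribution

    @property
    def n_states(self) -> int:      # N
        return len(self.pi)

    @property
    def n_symbols(self) -> int:     # M
        return len(self.B[0])

    def is_valid(self, tol: float = 1e-9) -> bool:
        """Check shapes and that every probability row sums to one."""
        n = self.n_states
        if len(self.A) != n or len(self.B) != n:
            return False
        if any(len(row) != n for row in self.A):
            return False
        if any(len(row) != self.n_symbols for row in self.B):
            return False
        rows = self.A + self.B + [self.pi]
        return all(
            all(p >= 0.0 for p in row) and abs(sum(row) - 1.0) <= tol
            for row in rows
        )


# Toy model with N = 2 hidden states and M = 3 symbols (made-up numbers).
toy = HMMParams(
    A=[[0.7, 0.3], [0.4, 0.6]],
    B=[[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
    pi=[0.6, 0.4],
)
print(toy.n_states, toy.n_symbols, toy.is_valid())
```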
Over the past few years, the use of electronic banking platforms has continued to experience significant growth and has redefined the way banking and E-commerce are conducted across the world [9]. On the other hand, fraudulent electronic banking and E-commerce activities are becoming more sophisticated and challenging, leading to massive financial losses. Effective and efficient detection of electronic banking fraud is therefore regarded as one of the major challenges facing all financial institutions, and is an increasing cause for concern [2].
According to the Bank of Ghana 2019 banking industry fraud report, fraud cases relating to cyber-crime perpetrated through electronic banking and mobile banking platforms account for the highest value of attempted fraud, amounting to GH¢50.54 million with an actual loss of GH¢14.31 million [4]. From the available literature, the majority of works on HMM-based fraud detection in electronic banking focus only on payments to merchants for goods and services. Transaction amounts are mostly taken as the observation symbols, and the types of items purchased are considered the hidden states of the proposed Hidden Markov Models. In related studies conducted by [10], [11], [12], [13], [14], and [15], techniques such as Neural Networks, Bayesian Networks, Dempster-Shafer theory and Support Vector Machines are employed, incorporating other forms of electronic banking such as remote funds transfers and deposits. However, all these proposed techniques perform classification based on a single transaction and rely on domain-expert features, without considering a sequence of transactions when making a decision, and hence produce high levels of false positives.
A large number of false positives may translate into a bad customer experience and may lead customers to take their business elsewhere. A major challenge in applying ML to fraud detection is the presence of highly imbalanced datasets. In many available datasets, the majority of transactions are genuine, with an extremely small percentage of fraudulent ones. Designing an accurate and efficient fraud detection system that is low on false positives but detects fraudulent activity effectively is a significant challenge for researchers.
This research seeks to develop and implement an improved fraud/intrusion detection system for both debit and credit transactions in electronic banking using Hidden Markov Models, by incorporating the various electronic banking platforms employed by customers, the transaction amounts and the frequency at which these transactions occur. To determine the transaction profile of customers, Density-based Spatial Clustering of Applications with Noise (DBSCAN), which is capable of discovering clusters of different shapes and sizes from a large amount of data containing noise and outliers, is employed. The Synthetic Minority Oversampling Technique (SMOTE) is also employed to handle the class imbalance problem typical of electronic banking datasets.
The rest of the paper is organized as follows: In Section 2, we present a review of related works. The methodology adopted for the study is outlined in Section 3. Detailed experimental results and a discussion establishing the efficiency of the proposed approach are presented in Section 4. Finally, we conclude the paper with some discussion in Section 5.

Literature Review
Fraud detection in electronic banking is understudied in the literature, perhaps due to security and data privacy concerns. We begin by considering related works on electronic banking in general and then consider those specifically related to the use of credit cards, which has been given considerable attention by researchers.
[10] presents a fraud detection system for online banking in which differential analysis is used to obtain local evidence of fraud, where a significant deviation from normal behavior indicates a potential fraud. Dempster's rule of combination is applied to this evidence to obtain a final suspicion score of fraud. Their main contribution is a fraud detection method based on effective identification of the devices used to access accounts, assessing the likelihood of fraud by tracking the number of different accounts accessed by each device. However, their system performs poorly for a higher number of hidden states and also when users' transaction patterns change frequently.
[16] considered transaction amounts and purchase types as the emission symbols and hidden states, respectively, of the proposed HMM for an online banking fraud detection system (FDS). The model is trained with the normal behavior of an account holder using the Baum-Welch algorithm, and a one-time password is sent to the customer's contact number for authorization if an incoming transaction violates the behavior sequence. Although the accuracy of their system was close to 72 percent over a wide variation in the input data, the number of false positives was still high, especially when the transaction data is highly skewed. A fraudulent transaction could still go through if a fraudster has access to a customer's phone.
[11] incorporates several advanced data mining techniques for online banking fraud detection by building a contrast vector for each transaction based on its customer's historical behavior sequence. A novel algorithm, Contrast Miner, was introduced to efficiently mine contrast patterns and distinguish fraudulent from genuine behavior, followed by an effective pattern selection and risk scoring step that combines predictions from different models. Results from experiments on large-scale real online banking data demonstrated that the proposed system achieves substantially higher accuracy with fewer false positives by incorporating domain knowledge and traditional fraud detection methods.
[12] instead modeled the sequence of operations in online banking transaction processing using HMMs and described how it could be used for the detection of fraud. The observation sequence length is fixed at two (2) whilst the sequence length used for training is varied, i.e., the dataset length is changed from 10 to 80 in steps of ten. Simulation results revealed that, although the complexity of the system increases with increased observation sequence length, the accuracy of the proposed system is close to 60% with a reduced false positive rate.
The work done by [14] employed HMMs and the k-means algorithm for detecting fraud in online banking transactions. In their proposed model, a variable is used to keep the number of transactions within a period of time before and after each transaction, and this, together with the quantified amounts, forms the observation symbols. If an incoming transaction is not accepted by the trained HMM with sufficiently high probability, it is considered fraudulent. The feasibility of their proposed model is demonstrated through simulation experiments using real-world bank transaction data. Given enough historical transactions, their model performs well for user groups with low and medium transaction frequencies and amounts. Efficient prior determination of the number of clusters is considered a major challenge in their proposed approach.
Specifically on fraud detection relating to the use of credit cards, [17] considered purchase types and transaction amounts as hidden states and observation symbols, respectively, in their proposed HMM. In order to estimate the model parameters, the K-means clustering algorithm is employed to determine the spending profile of cardholders. An incoming transaction is considered fraudulent if it is not accepted by the HMM with a sufficiently high probability. Experimental results revealed that their proposed model recorded an accuracy close to eighty (80) percent over a wide variation of the data. Efficient prior determination of the number of clusters and a significant number of false positives were considered the major challenges in their proposed approach.
[18] performed a comparative analysis of intrusion detection models on highly skewed credit card data based on Decision Trees, Random Forest, Support Vector Machines (SVM) and logistic regression. The original sample was randomly partitioned into $k$ equal-sized subsamples, where a single subsample is retained as the validation data for testing the model and the remaining $k - 1$ subsamples are used as training data. Four basic metrics were employed, namely the True Positive (TPR), True Negative (TNR), False Positive (FPR) and False Negative (FNR) rates. Simulation results using the dataset provided by ULB Machine Learning revealed that logistic regression and random forest show the highest precision and accuracy in credit card fraud detection, but require a very large dataset for training and also suffer from the imbalanced dataset problem even after preprocessing.
In order to reduce the number of false positives, [19] proposed a model based on automated feature engineering to automatically derive behavioral features from the historical data of the credit card associated with a transaction. A total of 237 features were generated for each transaction, and a random forest was then employed to learn a classifier. One important feature of their proposed model is that it also utilizes the distance between the two locations at which transactions on an account have occurred, and establishes whether they occurred in person or remotely. The proposed model was tested on data from a large multinational bank and compared to existing solutions; on unseen data of 1.852 million transactions, false positives were reduced by about 54%. However, since their models perform classification based on a single transaction, there was a performance degradation when the transaction pattern of users changes frequently.

Methodology
There is generally a very limited number of public datasets on electronic banking for research purposes, mainly due to personal and security concerns. In this research, a Kaggle-provided dataset of simulated mobile-based transactions is adopted. As detailed in Table 1, the dataset is highly imbalanced, since only 8,312 out of the almost 6 million transactions are labeled as fraud. 'CASH IN' and 'CASH OUT' represent an increase in the account balance of a customer as a result of cash inflow and a decrease in account balance as a result of cash outflow, respectively. 'TRANSFER' refers to the movement of money between users, whilst 'PAYMENT' represents settlements made to merchants for goods and services. 'DEBIT' as used in this context signifies the sending of money from a mobile service (electronic wallet) to a bank account.

Data pre-processing
To effectively evaluate the performance of our proposed models on the highly class-imbalanced dataset, the Synthetic Minority Oversampling Technique (SMOTE) is employed to generate virtual training records for the fraudulent transactions. For each fraudulent transaction, one or more of its k-nearest neighbors are randomly selected and new records are created by linear interpolation between the transaction and the selected neighbors. After the oversampling process, the data is reconstructed and the proposed Hidden Markov Models are then applied to the processed data. Specifically, the sampling rate is set to 73000%.
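The interpolation step described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions and not Algorithm 1 itself: the function name `smote`, the two-feature records (amount, monthly frequency) and the toy values are hypothetical, and the rate is kept small so the output is easy to inspect. A rate of 73000% would correspond to 730 synthetic records per fraudulent record.

```python
import math
import random


def smote(minority, rate_percent=200, k=5, seed=0):
    """Create rate_percent/100 synthetic records per minority record.

    Each synthetic record lies on the line segment between a minority
    record and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    per_sample = rate_percent // 100
    synthetic = []
    for i, base in enumerate(minority):
        others = minority[:i] + minority[i + 1:]
        # k nearest neighbours of `base` within the minority class
        neighbours = sorted(others, key=lambda p: math.dist(p, base))[:k]
        for _ in range(per_sample):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # interpolation factor in [0, 1)
            synthetic.append(
                tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
            )
    return synthetic


# Hypothetical fraud records: (amount, transactions per month)
fraud = [(120.0, 3.0), (950.0, 12.0), (870.0, 11.0), (640.0, 9.0)]
synthetic = smote(fraud, rate_percent=300, k=2)
print(len(synthetic))  # 4 records x 3 synthetic records each = 12
```

In practice a library implementation would normally be preferred; the sketch only makes the interpolation rule explicit.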
The proposed SMOTE technique as adopted in this study is presented in Algorithm 1.

Identifying transaction profile of customers
For optimal training of our proposed Hidden Markov Models, a modified Density Based Spatial Clustering of Applications with Noise (DBSCAN) and the K-means clustering algorithms are executed on each customer's previous transactions by considering the amount and frequency of transactions. K-means is an unsupervised learning algorithm for grouping a given set of data based on similarity, where the number of clusters is fixed a priori. The grouping is performed by minimizing the sum of squares of distances between each data point and the centroid of the cluster to which it belongs [20]. The DBSCAN clustering technique, however, filters out outliers and discovers clusters of arbitrary shapes [21]. We modified the DBSCAN algorithm by adding a step that computes the centroid of each cluster, which is later used to dynamically convert an incoming transaction into an observation symbol in the fraud detection process.
The proposed DBSCAN technique as adopted in this study is presented as in Algorithm 2. where 8 is the number of points in cluster F .
Spending profiles of accountholders are determined at the end of the clustering step. Let G be the percentage of total number of transactions of an accountholder, then, the spending profile H of an account holder, I is determined as in (8): The cluster number to which most of the transactions of the account holder belongs to represents the spending profile of the account holder. The computed centroids are used to generate the observation symbol for a new transaction Ø (denoted by Ø P ) is defined as in (9).
The i th transaction on account ! denoted as ,W X Y is suspected to be an outlier if it does not belong to any cluster in the set Z′ where y refers to the frequency of transaction.
If the average distance of the amount p of an outlier transaction ,W X Y from the set of existing clusters in Z′ is [ S , then its level of deviation \ ] is given as in (10):

The key idea of the modified DBSCAN algorithm is that, for each point $p$ in a cluster, there are at least a minimum number of points (MinPts) in the ε-neighborhood of that point, i.e., the density in the ε-neighborhood has to exceed some threshold. The proposed K-means algorithm as adopted in this study is presented in Algorithm 3. Specifically, for this research, the set Z = {low-frequency low-amount, low-frequency medium-amount, low-frequency high-amount, medium-frequency low-amount, medium-frequency medium-amount, medium-frequency high-amount, high-frequency low-amount, high-frequency medium-amount, high-frequency high-amount} denotes the clusters.
The set of attributes used to generate these clusters is {transaction_amount, frequency_of_transaction}.
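The density rule and the added centroid step can be illustrated with a compact, pure-Python DBSCAN. This is a sketch under stated assumptions rather than Algorithm 2: the function names, the `eps` and `min_pts` values and the toy (amount, frequency) points are hypothetical, and the features are used unscaled for brevity.

```python
import math


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (itself included)."""
    return [j for j, q in enumerate(points)
            if math.dist(points[idx], q) <= eps]


def dbscan_with_centroids(points, eps, min_pts):
    """Plain DBSCAN plus a final centroid step; label -1 marks outliers."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE               # not dense enough to seed a cluster
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id      # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    # Added step: centroid of each cluster (mean of its members per attribute)
    centroids = {}
    for c in range(cluster_id + 1):
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


def nearest_cluster(point, centroids):
    """Cluster whose centroid is closest to a new transaction."""
    return min(centroids, key=lambda c: math.dist(point, centroids[c]))


# Hypothetical transactions: (amount, transactions per month)
points = [(50, 2), (55, 2), (60, 3), (400, 7), (410, 8), (405, 7), (5000, 1)]
labels, centroids = dbscan_with_centroids(points, eps=15, min_pts=2)
print(labels)                                # the last point is an outlier
print(nearest_cluster((390, 6), centroids))
```

Assigning a new transaction to the nearest centroid is one plausible reading of how the centroids are used to produce an observation symbol; the exact rule is the one given in (9).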
To compute the probability of an observed sequence $O = O_1, O_2, O_3, \ldots, O_T$ with respect to our Hidden Markov Model $\lambda$, where $Q = q_1, q_2, q_3, \ldots, q_T$ represents the various hidden states, the emission probability of the sequence is defined as in (11):

$$P(O \mid Q, \lambda) = b_{q_1}(O_1)\, b_{q_2}(O_2) \cdots b_{q_T}(O_T). \tag{11}$$

The initial probability vector π and the state transition matrix A are used to define the probability of the state sequence and the joint probability as in (12) and (13):

$$P(Q \mid \lambda) = \pi_{q_1}\, a_{q_1 q_2} \cdots a_{q_{T-1} q_T} \tag{12}$$

and

$$P(O, Q \mid \lambda) = P(O \mid Q, \lambda)\, P(Q \mid \lambda). \tag{13}$$
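As a worked illustration of (11) to (13), the sketch below computes the joint probability of one state path and then the probability of the observation sequence alone, using the standard forward recursion. The forward recursion is not spelled out in the text above and is used here as a well-known equivalent of summing (13) over all state paths. The model numbers and function names are hypothetical.

```python
from itertools import product


def joint_probability(obs, states, A, B, pi):
    """Probability of observations and one state path together, per (11)-(13)."""
    p_states = pi[states[0]]                    # (12): initial prob. times transitions
    for prev, cur in zip(states, states[1:]):
        p_states *= A[prev][cur]
    p_obs = 1.0                                 # (11): product of emission probs.
    for q, o in zip(states, obs):
        p_obs *= B[q][o]
    return p_obs * p_states                     # (13)


def sequence_probability(obs, A, B, pi):
    """Probability of the observations alone, by the forward recursion."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
            for j in range(n)
        ]
    return sum(alpha)


# Toy model: 2 hidden states, 3 observation symbols (made-up numbers)
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]
obs = [0, 2, 1]

# Summing (13) over every possible state path gives the same value
brute_force = sum(
    joint_probability(obs, path, A, B, pi)
    for path in product(range(len(pi)), repeat=len(obs))
)
forward = sequence_probability(obs, A, B, pi)
print(abs(brute_force - forward) < 1e-12)
```

The brute-force sum grows exponentially with the sequence length, which is why the forward recursion is the usual choice in practice.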

Training the proposed HMMs
The transaction amounts are categorized into Low, $l = (0, 100]$; Medium, $m = (100, 500]$; and High, $h = (500, \text{Transaction Limit}]$ values. The frequency at which these transactions occur on a particular account is also categorized into Low (less than 5 times a month), Intermediate (between 5 and 10 times a month), and High (at least 10 times a month). For example, if an accountholder performs about seven (7) transactions within the month with an average value of, say, 300, then the corresponding observation symbol is medium-frequency medium-amount (mm).
The various transaction types are considered the internal (hidden) states, whilst the transaction amounts combined with the frequency at which they occur, denoted as {ll, lm, lh, ml, mm, mh, hl, hm, hh}, represent the observation symbols of our proposed Hidden Markov Model.
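The mapping from a transaction to one of the nine observation symbols can be sketched as below. The function names are hypothetical, and two boundary choices are assumptions: exactly 10 transactions a month is treated as high frequency, and the transaction limit is left unbounded unless supplied.

```python
def amount_level(amount, transaction_limit=float("inf")):
    """Low (0, 100], medium (100, 500], high (500, transaction limit]."""
    if not 0 < amount <= transaction_limit:
        raise ValueError("amount outside the allowed range")
    if amount <= 100:
        return "l"
    if amount <= 500:
        return "m"
    return "h"


def frequency_level(per_month):
    """Low: fewer than 5; intermediate: 5 to under 10; high: 10 or more."""
    if per_month < 5:
        return "l"
    if per_month < 10:
        return "m"
    return "h"


def observation_symbol(per_month, amount):
    """Frequency letter followed by amount letter, e.g. 'mm'."""
    return frequency_level(per_month) + amount_level(amount)


print(observation_symbol(7, 300))   # mm, the example from the text
```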
After formulating the hidden states and observation symbols, a hybrid optimization algorithm, presented in Algorithm 4 and comprising the Baum-Welch, Particle Swarm and Genetic Algorithms, is used to effectively train the proposed models.
10. Go to 6.
11. After 100 iterations, each solution becomes a chromosome for the genetic procedure and the fitness function $P(O \mid \lambda)$ is applied.
12. A multiple-point crossover and mutation is performed to select the best 50 solutions for the next generation, which are then positioned as particles in a search space using the PSO technique.
16. Compare the best solution of each particle with the best position of the entire group and make appropriate adjustments.
17. Termination criteria reached? If yes, go to 18; otherwise go to 13.
18. Output A, B and π.

Fraud detection
To effectively classify an incoming transaction as fraudulent or otherwise, a sequence of observation symbols, say $O_1, O_2, \ldots, O_T$, is extracted from the training data of an account holder, and its probability of acceptance, $\alpha_1$, is computed by the model as in (30) by employing (20):

$$\alpha_1 = P(O_1, O_2, O_3, \ldots, O_T \mid \lambda). \tag{30}$$

An incoming transaction is converted to an observation symbol using (9). This symbol is used to replace the first observation symbol, $O_1$, and the probability of acceptance of the resulting sequence by the model, denoted $\alpha_2$, is also computed as in (31).
The incoming transaction is classified as fraudulent if the difference between $\alpha_1$ and $\alpha_2$, denoted $\Delta\alpha$, is above a predefined threshold as in (32).
A genuine transaction is added to the sequence permanently to contribute to determining the validity or otherwise of the next transaction since transaction behavior of an accountholder could be dynamic. Otherwise, the transaction is declined, and the symbol is discarded.
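The accept-or-decline step can be sketched as follows. Several details are assumptions made for illustration: the acceptance probability is computed with the forward recursion, "replacing the first symbol" is read as dropping the oldest symbol and appending the new one, the raw difference of the two probabilities is compared with the threshold, and the model numbers, threshold and function names are hypothetical.

```python
def sequence_probability(obs, A, B, pi):
    """Probability that the model accepts the sequence (forward recursion)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
            for j in range(n)
        ]
    return sum(alpha)


def screen_transaction(window, new_symbol, A, B, pi, threshold):
    """Return (is_fraud, window to use for the next transaction)."""
    alpha_1 = sequence_probability(window, A, B, pi)
    candidate = window[1:] + [new_symbol]   # assumed reading of the replacement
    alpha_2 = sequence_probability(candidate, A, B, pi)
    if alpha_1 - alpha_2 > threshold:
        return True, window                 # declined: the symbol is discarded
    return False, candidate                 # genuine: the symbol is kept


# Toy model in which symbol 0 is the account holder's usual behaviour
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.8, 0.15, 0.05], [0.7, 0.2, 0.1]]
pi = [0.5, 0.5]
window = [0, 0, 0, 0]

fraud_flag, kept = screen_transaction(window, 2, A, B, pi, threshold=0.1)
ok_flag, updated = screen_transaction(window, 0, A, B, pi, threshold=0.1)
print(fraud_flag, ok_flag)
```

Keeping the window unchanged after a declined transaction means a single anomalous symbol does not shift the account holder's profile, while accepted symbols let the profile follow gradual changes in behavior.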
Precision is the fraction of transactions predicted as fraudulent that are actually fraudulent, whilst Recall is the fraction of actual fraudulent transactions that are correctly predicted as fraudulent. The F-Measure provides a way to combine both precision and recall into a single measure.
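For concreteness, the three metrics can be computed from true and predicted labels as in the sketch below, where 1 marks a fraudulent transaction. The function name and the toy labels are hypothetical, and the F-Measure is taken to be the F1-score, the harmonic mean of precision and recall.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1-score for the positive (fraud = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels: 3 frauds among 8 transactions, 2 of them caught, 1 false alarm
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Accuracy is omitted on purpose: on data this imbalanced, a model that labels every transaction as genuine would still score close to 100% accuracy.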

Simulation Results and Discussion
For different numbers of hidden states, four (4) sets of simulations were performed in two (2) stages using Python, and their performance compared. For all four sets of experiments, the proposed Hidden Markov Models were executed in the second stage. In the first stages of the first and second sets of experiments, the K-means and the modified DBSCAN algorithms were executed, respectively. In the first stage of the third set of experiments, both the SMOTE and K-means techniques were employed, whereas the SMOTE and modified DBSCAN clustering techniques were executed in the first stage of the last set of experiments. The dataset was loaded and divided into two: 80% of it is used for training and evaluation, whilst the rest is held back for validation.

Precision comparison
The precision of the four approaches for different numbers of hidden states is presented in Figure 1. It is clear that our proposed approach (SMOTE+DBSCAN+HMM) performed better for the various hidden states. Applying only the modified DBSCAN clustering technique with Hidden Markov Models performed relatively better than employing K-means. It is also worth noting that relatively higher precision scores were recorded when the SMOTE technique is adopted.

Recall rates comparison
A comparison of the recall rates of the four (4) different approaches for different numbers of hidden states is presented in Figure 2. It is evident that approaches that employed the SMOTE technique appear to perform relatively better. Similarly, the modified DBSCAN clustering technique performed better than K-means.
It can also be observed that, for higher values of N, all the approaches performed well except for the one that employed only K-means and Hidden Markov Models without handling the class imbalance.

F-measure comparison
In Figure 3, the F1-scores of the four approaches are presented for various numbers of hidden states. It is observed that higher F1-scores are obtained when the modified DBSCAN clustering technique is used as compared to K-means. Also, approaches that incorporated the SMOTE technique performed better. Employing both SMOTE and the modified DBSCAN clustering algorithm appears to perform relatively better than the other approaches. All approaches performed relatively better when the number of hidden states is 3.

Conclusion
In this research, an improved electronic banking fraud detection framework based on Hidden Markov Models (HMM) and modified Density Based Spatial Clustering of Applications with Noise (DBSCAN) is proposed and implemented. The Synthetic Minority Oversampling Technique (SMOTE) is also employed due to the highly class-imbalanced nature of the dataset adopted. With different numbers of hidden states, simulations were performed in two stages for four (4) different approaches in Python and their performance compared. For all four sets of experiments, the proposed Hidden Markov Models were executed in the second stage. In the first stages of the first and second sets of experiments, the K-means and the modified DBSCAN algorithms were executed, respectively. In the first stage of the third set of experiments, both the SMOTE and K-means techniques were employed, whereas the SMOTE and modified DBSCAN clustering techniques were executed in the first stage of the last set of experiments.
Generally, our proposed approach (SMOTE+DBSCAN+HMM) performed relatively better for all the various hidden states in terms of precision, recall and F1-score. Employing the modified DBSCAN clustering technique to determine the spending profile of customers performed relatively better than using the K-means algorithm, since it filters out most of the easily recognizable fraudulent transactions before the proposed HMMs are applied. It is also evident from the simulation analysis that the SMOTE technique effectively handles the class imbalance, which is necessary to achieve improved performance.