Healthcare provider fraud detection

(An end-to-end machine learning case study)


If you have ever paid insurance premiums, you might have wondered why their cost keeps increasing. There are several reasons behind it, but one major reason is that when dishonest people take money from insurance companies that they don't deserve, premiums rise for all policyholders. Health insurance is no exception. Healthcare fraud is an organized crime in which providers, physicians, and beneficiaries act together to make fraudulent claims.

Rigorous studies have unveiled some of the tactics healthcare providers follow to increase the size of their bank accounts. These include:

  • Billing for services not rendered

But the nature of fraud is always evolving. For this reason, a simple rule-based method for detecting fraud would not be sufficient. This case study is about how we can leverage the available data to detect potentially fraudulent healthcare providers. It takes an end-to-end approach: we will use the available data to build a machine learning model and then deploy that model on a cloud platform.

Here are the topics we will be covering as a part of this case study:

  1. Understanding the business problem
  2. Applicability of ML and ML problem formulation
  3. Data source
  4. Literature survey
  5. Exploratory data analysis
  6. Feature engineering
  7. Feature selection and baseline modelling
  8. Ensemble modelling
  9. Building a Streamlit app
  10. Future work
  11. References
  12. About the author

Without further ado, let's get started.

  1. Understanding the business problem

One of the major reasons health insurance premiums keep increasing is fraud in the healthcare domain. For this case study we are given the past transaction details of healthcare providers, and we need to identify the fraudulent ones. There are a few business constraints:

  • Both precision and recall must be high, i.e. the providers we tag as fraudulent should indeed be fraudulent, and we must identify all such providers without missing any.

2. Applicability of ML and ML problem formulation

Fraud is a problem that is continuously evolving, so simple rule-based methods to detect it won't work. But the inherent patterns of fraud remain more or less the same, and to detect such patterns we can use machine learning.

ML Problem formulation

The data available to us is labelled, hence we will use a supervised learning approach, and the problem will be a binary classification task, i.e. classifying each provider as Fraud or Not Fraud. Now let's have a look at the available data to formulate the problem better.

3. Data source

For this case study we have collected the dataset from Kaggle.

We are given four comma-separated value (CSV) files:

  • Provider : This file contains the provider ID and whether that provider is fraudulent or not (the class label).
  • Inpatient : Details of claims filed for patients admitted to hospital.
  • Outpatient : Details of claims filed for patients who were not admitted.
  • Beneficiary : Details of the beneficiaries (patients), such as date of birth, gender, race, and state.

Both the inpatient data and the outpatient data have a feature called Provider ID, which we can use to merge them with the class label.

We have details of 5,400 healthcare providers, of which only 506 are fraudulent. From an initial analysis of the data we find the following:

  • The data doesn't contain enough positive labels, i.e. there is a class imbalance.

Now that we have formulated this as an ML problem, we need to define how we will evaluate our model. Based on the business constraints, the nature of the problem, and the type of data, we will use the following performance metrics:

  • F1 score : as we want both precision and recall to be high.

Note: Although we want both precision and recall to be high, a task like fraud detection generally demands that recall be very high.
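To make the metric choice concrete, here is a minimal sketch of how precision, recall, and F1 are computed with scikit-learn; the labels below are purely hypothetical and only illustrate the calculation:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical ground truth and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # of providers flagged, how many are truly fraudulent
recall = recall_score(y_true, y_pred)        # of truly fraudulent providers, how many we caught
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
```

F1 penalizes a model that trades one of the two metrics away, which is why it fits the business constraint here.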

4. Literature survey : Before diving into the problem, we decided to do thorough research on existing approaches to similar problems. This gave us useful insights into the problem and also uncovered common pitfalls for such a task. Some of them include:

  • Most machine learning algorithms assume that the data is evenly distributed within classes.

Now let's begin with the case study itself.

5. Exploratory data analysis

After loading the dependencies and the required data..

Inpatient data

Here, missing values might mean a physician was not needed or a procedure was not done, so we can treat a missing value as a category in itself.

Outpatient data

Similar to inpatient data..

Beneficiary data

A missing DOD (date of death) means the person is alive..

Merging all the dataframes together..
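A minimal sketch of the merging step, using toy frames whose column names (`BeneID`, `Provider`, `PotentialFraud`, etc.) mirror the Kaggle data but are assumptions here: inpatient and outpatient claims are stacked, then beneficiary details and the class label are attached.

```python
import pandas as pd

# Toy frames mimicking the schema of the four CSV files
inpatient = pd.DataFrame({"BeneID": ["B1", "B2"], "Provider": ["P1", "P2"],
                          "InscClaimAmtReimbursed": [5000, 700]})
outpatient = pd.DataFrame({"BeneID": ["B3"], "Provider": ["P1"],
                           "InscClaimAmtReimbursed": [120]})
beneficiary = pd.DataFrame({"BeneID": ["B1", "B2", "B3"], "Gender": [1, 2, 1]})
provider = pd.DataFrame({"Provider": ["P1", "P2"], "PotentialFraud": ["Yes", "No"]})

# Stack inpatient and outpatient claims, then attach beneficiary details
# and finally the provider-level class label
claims = pd.concat([inpatient, outpatient], ignore_index=True)
merged = (claims.merge(beneficiary, on="BeneID", how="left")
                .merge(provider, on="Provider", how="left"))
```

Left joins keep every claim row even if a lookup key were missing, which is the safe default at this stage.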

Statewise potential fraud

We see that in states 5, 10, 33, and 45 the transaction volume is high, and fraudulent transactions are also high.

Race vs Potential fraud

  • There are no persons of race 4 in the dataset.

Potential fraud vs Insurance claim amount deducted

The annual deductible amount is similar for both fraudulent and non-fraudulent cases.

Potential fraud vs Insurance claim amount reimbursed

The insurance claim amount reimbursed for fraudulent transactions seems slightly higher than for non-fraudulent ones.

Gender vs Potential fraud

Both genders seem equally prone to fraud.

Now we have enough insights. Let's handcraft some interesting features.

6. Feature engineering

For this case study we engineered several features, such as:

  • Settlement days: days taken to settle an insurance claim
  • DaysAdmit: number of days a patient was admitted to hospital
  • Age: age of the patient, derived from date of birth and date of death
  • Num_physician_rq: number of physicians required for a given patient
  • Tot disease: total number of diseases a patient is suffering from
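The date-based features above can be sketched as follows; the column names and the single toy claim are assumptions mirroring the Kaggle schema:

```python
import pandas as pd

# Toy single-claim frame; column names mirror the Kaggle data
df = pd.DataFrame({
    "ClaimStartDt": ["2009-01-01"], "ClaimEndDt": ["2009-01-06"],
    "AdmissionDt": ["2009-01-01"], "DischargeDt": ["2009-01-04"],
    "DOB": ["1940-05-10"], "DOD": [pd.NaT],
})
for col in df.columns:
    df[col] = pd.to_datetime(df[col])

# Settlement days: time taken to settle the claim
df["SettlementDays"] = (df["ClaimEndDt"] - df["ClaimStartDt"]).dt.days
# DaysAdmit: length of the hospital stay
df["DaysAdmit"] = (df["DischargeDt"] - df["AdmissionDt"]).dt.days
# Age: measured at claim start, or frozen at date of death if the patient has died
ref = df["DOD"].fillna(df["ClaimStartDt"])
df["Age"] = (ref - df["DOB"]).dt.days // 365
```

Falling back to the claim start date when DOD is missing is one reasonable convention; the study's exact reference date may differ.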

Handling missing values

For numerical features we imputed missing values with 0, and for categorical features we added a category called 'Not available'.
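A minimal sketch of that imputation scheme, with assumed column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"DeductibleAmtPaid": [1068.0, np.nan],
                   "AttendingPhysician": ["PHY1", np.nan]})

# Numerical: impute with 0; categorical: add an explicit 'Not available' category
num_cols = ["DeductibleAmtPaid"]
cat_cols = ["AttendingPhysician"]
df[num_cols] = df[num_cols].fillna(0)
df[cat_cols] = df[cat_cols].fillna("Not available")
```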

Outlier analysis

We see that the points in red are well separated from the rest, exhibiting outlier-like behaviour.

Other features also had outliers, but since those values were adding useful information to the data, we decided not to alter them.

Encoding categorical variables

We used simple label encoding to encode categorical variables.
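A sketch of label encoding with scikit-learn; the physician IDs are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"AttendingPhysician": ["PHY1", "PHY2", "PHY1", "Not available"]})

# Each distinct category gets an integer code (assigned in sorted order)
le = LabelEncoder()
df["AttendingPhysician_enc"] = le.fit_transform(df["AttendingPhysician"])
```

In practice the encoder should be fit on the training split only, so unseen test categories are handled deliberately rather than silently.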

All the data we have so far is at the patient (claim) level. In order to convert it to the provider level, we need to aggregate all the features by provider.

Dummy coding the race feature using pandas get_dummies().

Generating aggregated features..

Similarly, we generated aggregated features per provider for medical cases, unique physicians, unique diagnosis codes, unique procedure codes, and unique states and counties.

Now we concatenate all the features to generate a dataset aggregated to provider level.
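The aggregation step can be sketched with pandas groupby; the claim rows and the particular statistics chosen here are illustrative, not the study's full feature set:

```python
import pandas as pd

claims = pd.DataFrame({
    "Provider": ["P1", "P1", "P2"],
    "InscClaimAmtReimbursed": [5000, 120, 700],
    "AttendingPhysician": ["PHY1", "PHY2", "PHY1"],
})

# Collapse claim-level rows into one row per provider with summary statistics
provider_level = claims.groupby("Provider").agg(
    TotalReimbursed=("InscClaimAmtReimbursed", "sum"),
    MeanReimbursed=("InscClaimAmtReimbursed", "mean"),
    UniquePhysicians=("AttendingPhysician", "nunique"),
    NumClaims=("InscClaimAmtReimbursed", "size"),
).reset_index()
```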

7. Feature selection and baseline modelling

Before we proceed, we need to split our dataset in order to avoid data leakage.
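A minimal sketch of a stratified split on toy imbalanced data; the numbers are illustrative stand-ins for the real 506-of-5,400 ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 16 negatives, 4 positives
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

# Stratify so train and test keep the same fraud ratio; split BEFORE any fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
```

Stratification matters here: with so few positives, a plain random split could leave the test set with almost no fraud cases.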

Now let's find the correlations of the features amongst themselves, using pandas' df.corr().

Several features have a very high degree of correlation with one another. Let's eliminate features with correlation greater than 0.99.
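One common way to do this elimination (a sketch on synthetic columns, where `a_copy` is a perfect linear copy of `a`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({"a": a, "a_copy": a * 2.0, "b": rng.normal(size=100)})

# Keep only the upper triangle so each feature pair is inspected once,
# then drop one feature from every pair correlated above 0.99
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.99).any()]
df_reduced = df.drop(columns=to_drop)
```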

Now let's visualize the data by projecting it into lower dimensions using a t-SNE plot.

We see that many positive class labels do form clusters, although some are scattered apart. This plot suggests that, based on the available features, we should be able to build a decent classifier.

Let's understand which features contribute to the prediction and which do not. To do this, we built a random forest classifier and plotted the feature importances.

We can clearly see that procedure codes 5 and 6 are not contributing to the prediction, so we decided to drop those features.
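A sketch of the feature-importance check on synthetic data, where feature 0 fully determines the label and feature 1 is pure noise:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
signal = rng.normal(size=200)
noise = rng.normal(size=200)
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)

# Impurity-based importances: they sum to 1, and larger means more useful for splits
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = rf.feature_importances_
```

Features whose importance stays near zero, like the noise column here, are candidates for dropping.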

Before we start building models we need to address the class imbalance.

Handling Class Imbalance

We will try the following methods to attenuate the class imbalance:

  • Re-sampling (oversampling)
  • Synthetic Minority Oversampling (SMOTE)
  • Class weights balancing
  • Cluster-based resampling
  • Cluster-based sampling and aggregation

Re-sampling (Oversampling)

For oversampling we used imblearn's RandomOverSampler, which randomly resamples data points from the minority class.
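To show what that does without pulling in imblearn, here is a dependency-free numpy sketch that mirrors RandomOverSampler's behaviour on toy data:

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)  # 8 majority vs 2 minority samples

# Resample minority rows with replacement until the classes are balanced,
# mirroring what imblearn's RandomOverSampler does internally
minority_idx = np.where(y == 1)[0]
n_needed = (y == 0).sum() - (y == 1).sum()
extra = rng.choice(minority_idx, size=n_needed, replace=True)
X_res = np.vstack([X, X[extra]])
y_res = np.concatenate([y, y[extra]])
```

Crucially, this duplication must happen only on the training folds, never on test data.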

Synthetic Minority Oversampling(SMOTE)

Here, in order to oversample the minority class, we generate synthetic samples by interpolating between existing minority points.

Class weights balancing

Most models expose a parameter to alter the class weights; we will use that.

Cluster based resampling


  • Take only the majority class and apply a clustering algorithm to group the data into k clusters, where k is the number of data points in the minority class.
  • For each cluster, find its centroid.
  • Use the k cluster centroids together with the minority-class data to build the model.
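The steps above can be sketched with KMeans on synthetic two-class data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_major = rng.normal(size=(50, 2))           # majority (non-fraud) points
X_minor = rng.normal(loc=3.0, size=(5, 2))   # minority (fraud) points

# Cluster the majority class into k clusters, k = number of minority points,
# then replace the majority class by the k cluster centroids
k = len(X_minor)
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_major)
X_balanced = np.vstack([km.cluster_centers_, X_minor])
y_balanced = np.array([0] * k + [1] * len(X_minor))
```

This is under-sampling via prototypes: the 50 majority points are summarized by 5 representative centroids.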

Cluster based sampling and aggregation


  • Break the majority class into L clusters.
  • Combine each cluster with the minority class and train one model on each combination.
  • Evaluate each model on the test data.
  • After getting the predictions, take a majority vote and output the majority label.

After experimenting with several methods, we found that simple random oversampling and class-weight balancing outperform the more complex methods. One possible reason for this could be the small amount of data.

Now we are ready to build baseline models..

While building models we will use the imblearn library's Pipeline, as it supports RandomOverSampler as a pipeline step. Along with it, we will also use sklearn's RepeatedStratifiedKFold. Using a pipeline lets us cross-validate without any leakage, because every step is re-fit on each training fold only.
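A leakage-free cross-validation sketch; to keep it dependency-free it uses sklearn's Pipeline with class weights on synthetic imbalanced data, but with imblearn's Pipeline a RandomOverSampler step would slot in the same way:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced stand-in for the provider-level data
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# The scaler (and any resampler) is re-fit inside each fold, so nothing
# learned from a validation fold leaks into training
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(class_weight="balanced", max_iter=1000))])
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
scores = cross_val_score(pipe, X, y, scoring="f1", cv=cv)  # 5 folds x 2 repeats
```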

1. SVM (RBF Kernel)

Below are the tabulated results of all the baseline models we tried as part of the case study.

Now let's move on to ensembles.

8. Ensemble modelling

  1. CatBoost

2. Custom stacking ensembles

Parameter tuning for the best number and combination of base learners..

Let's tune the hyperparameters of the individual base learners..
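The custom stacking idea can be sketched with sklearn's StackingClassifier on synthetic data; the particular base learners here are illustrative, not the tuned combination from the study:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# Out-of-fold predictions from the base learners become the meta-learner's inputs,
# so the meta-learner never sees predictions made on training data
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("dt", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5)
stack.fit(X, y)
train_acc = stack.score(X, y)
```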

Results of ensembles


9. Building a Streamlit app

Let's create a simple healthcare provider fraud detector app using Streamlit that takes in the details of a healthcare provider and predicts whether it is fraudulent or not.

First save the best model in pickle format.
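A sketch of the save-and-reload round trip; the logistic regression below is just a stand-in for the tuned best model:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in for the tuned best model from the previous section
X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize the trained model; the app unpickles it once at startup
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)
```

One caveat worth noting: a pickled model should be unpickled with the same scikit-learn version it was saved with.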

Then create the app file. We have created ours as follows:

This file takes input from the user and, using the model we stored, predicts whether the given provider is fraudulent or not. The front end of the app is handled by Streamlit.

Containerization: In order to ensure our program runs reliably in different environments, we need to containerize it. Docker is software used for building containers.

Kubernetes: An open-source system, originally developed by Google, for managing containerized applications. Using Kubernetes we can deploy scalably: during peak hours, when traffic increases, we can easily scale up or down by adding or removing containers.

After deploying our app on Google Kubernetes Engine, it looks something like this:

Generating prediction from the app..

Nice! Our app seems to be working as expected..

10. Future work

  • Try out variational autoencoders or an anomaly detection approach.


12. About the author




Congrats, you have made it to the end of the blog! If you are interested, you can visit the GitHub repository for the project, where you will also find the files required for deployment. Feel free to drop any queries or suggestions in the comments section.

Have a good day!!

Mechanical Engineer--AI Enthusiast