If you have ever paid insurance premiums, you might have wondered why their cost keeps increasing. There are several reasons, but a major one is that when dishonest people take money from insurance companies that they don't deserve, premiums rise for all policyholders. Health insurance is no exception. Healthcare fraud is an organized crime in which providers, physicians and beneficiaries act together to make fraudulent claims.
Rigorous studies have unveiled some of the tactics healthcare providers use to pad their bank accounts. They include:
- Billing for services not rendered
- Misrepresenting the date, location and duration of service
- Incorrect representation of diagnoses and procedures
- Overutilization of resources
But the nature of fraud is always evolving, so a simple rule-based method for detecting it would not be sufficient. This case study shows how we can leverage the available data to detect potentially fraudulent healthcare providers. It is an end-to-end walkthrough: we will use the data to build a machine learning model and then deploy that model on a cloud platform.
Here are the topics we will be covering as a part of this case study:
1. Understanding the business problem
2. Applicability of ML and ML problem formulation
3. Data source
4. Literature survey of existing approaches
5. Exploratory data analysis
6. Feature engineering
7. Feature selection and baseline modelling
8. Ensemble modelling
9. Building a Streamlit app
10. Containerization and deployment on GCP
11. Future work
12. References
13. About the author
Without further ado, let's get started.
1. Understanding the business problem
One of the major reasons health insurance premiums keep increasing is fraud in the healthcare domain. For this case study we are given past transaction details of healthcare providers, and we need to identify the fraudulent ones. There are a few business constraints:
- Both precision and recall must be high, i.e. the providers we flag as fraudulent should indeed be fraudulent, and we should identify all such providers without missing any.
- Interpretability should be high i.e. our data driven judgments should make sense in general.
2. Applicability of ML and ML problem formulation
Fraud is a problem that is continuously evolving, so simple rule-based methods to detect it won't work. But the inherent patterns of fraud remain more or less the same, and to detect such patterns we can use machine learning.
ML Problem formulation
The data available to us is labelled, so we will use a supervised learning approach, and the problem will be a binary classification task, i.e. classifying a provider as Fraud or Not Fraud. Now let's have a look at the available data to formulate the problem better.
3. Data source
For this case study we have collected the dataset from Kaggle.
Healthcare Provider Fraud Detection Analysis (www.kaggle.com)
We are given four comma-separated value (CSV) files:
- Provider : This file contains provider ID and whether that provider is fraudulent or not.
- Inpatient data: This file contains details of patients admitted to the hospital and ID of their healthcare provider.
- Outpatient data: This file contains details of patients diagnosed in the outpatient unit of the hospital and the ID of their healthcare provider.
- Beneficiary details: This file contains details of the recipient of the claim amount of health insurance.
Both the inpatient and outpatient data have a ProviderID feature, which we can use to merge them with the class labels in the provider file.
We have details of 5400 healthcare providers, out of which only 506 are fraudulent. From an initial analysis of the data we find the following:
- The data doesn't contain enough positive labels, i.e. there is class imbalance.
- The training data is small: only 5400 healthcare providers, so there is little data to train on.
Now that we have formulated this as an ML problem, we need to define how we will evaluate our model. Based on the business constraints, the nature of the problem and the type of data, we will use the following performance metrics:
- F1 score: as we want both precision and recall to be high.
- Confusion matrix: to make the score more interpretable, we will use a confusion matrix along with the F1 score.
Note: although we want both precision and recall to be high, a task like fraud detection generally demands that recall be very high.
4. Literature survey of existing approaches
Before diving into the problem, we did thorough research on existing approaches to similar problems. This gave us useful insights and uncovered common pitfalls for such a task. Some of them include:
- Most machine learning algorithms assume that the data is evenly distributed within classes.
- Several methods exist to address class imbalance, ranging from simple class resampling to complex clustering-based approaches.
- Oversampling and undersampling should be applied to the training data only; validation and test data should reflect the distribution we will see in the real world.
Now let's begin with the case study itself.
5. Exploratory data analysis
After loading the dependencies and the required data..
Here, missing values might indicate that a physician was not needed or a procedure was not performed, so we can treat missing values as a class of their own.
Similar to inpatient data..
A missing DOD (date of death) means the person is alive.
Merging all the dataframes together..
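As a sketch of the merge step, here are hypothetical miniature versions of the files; the column names (`Provider`, `PotentialFraud`, `InscClaimAmtReimbursed`) are assumptions based on the Kaggle dataset description:

```python
import pandas as pd

# Hypothetical mini versions of the real files (column names assumed).
provider = pd.DataFrame({"Provider": ["PRV1", "PRV2"],
                         "PotentialFraud": ["Yes", "No"]})
inpatient = pd.DataFrame({"Provider": ["PRV1", "PRV1"],
                          "InscClaimAmtReimbursed": [2000, 3000]})
outpatient = pd.DataFrame({"Provider": ["PRV2"],
                           "InscClaimAmtReimbursed": [150]})

# Stack inpatient and outpatient claims, then attach the class label
# from the provider file via the shared Provider column.
claims = pd.concat([inpatient, outpatient], ignore_index=True)
merged = claims.merge(provider, on="Provider", how="left")
```

A left merge keeps every claim row even if a provider were missing from the label file, which makes gaps easy to spot.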
Statewise potential fraud
We see that in states 5, 10, 33 and 45, both total transactions and fraudulent transactions are high.
Race vs Potential fraud
- There are no persons of race 4 in the dataset.
- Patients of races 3 and 5 appear more susceptible to fraud.
Potential fraud vs Insurance claim amount deducted
The annual deductible amount is similar for both fraudulent and non-fraudulent cases.
Potential fraud vs Insurance claim amount reimbursed
The insurance claim amount reimbursed for fraudulent transactions seems slightly higher than for non-fraudulent ones.
Gender vs Potential fraud
Both genders seem equally prone to fraud.
6. Feature engineering
Now we have enough insights. Let's handcraft some interesting features. For this case study we have engineered several features, such as:
- Settlement days: days taken to settle an insurance claim
- DaysAdmit: number of days a patient was admitted to hospital
- Age: age of the patient, derived from date of birth and date of death
- Num_physician_rq: number of physicians required for a given patient
- Tot disease: total number of diseases a patient is suffering from
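The date-based features above can be sketched as follows; the column names (`ClaimStartDt`, `ClaimEndDt`, `AdmissionDt`, `DischargeDt`, `DOB`) are assumptions based on the dataset description, and for a living patient (missing DOD) the claim start date serves as the reference for age:

```python
import pandas as pd

# Toy claims frame; real column names are assumptions.
df = pd.DataFrame({
    "ClaimStartDt": ["2009-01-01"], "ClaimEndDt": ["2009-01-06"],
    "AdmissionDt": ["2009-01-01"], "DischargeDt": ["2009-01-04"],
    "DOB": ["1940-01-01"],
})
for col in df.columns:
    df[col] = pd.to_datetime(df[col])

# Days to settle the claim, days in hospital, and approximate age in years.
df["Settlement_days"] = (df["ClaimEndDt"] - df["ClaimStartDt"]).dt.days
df["DaysAdmit"] = (df["DischargeDt"] - df["AdmissionDt"]).dt.days
df["Age"] = (df["ClaimStartDt"] - df["DOB"]).dt.days // 365
```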
Handling missing values
For numerical features we imputed missing values with 0, and for categorical features we added a category called 'Not available'.
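A minimal sketch of that imputation scheme, using hypothetical column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"DeductibleAmtPaid": [100.0, np.nan],
                   "AttendingPhysician": ["PHY1", np.nan]})

# Numeric features: impute with 0; categoricals: a dedicated
# 'Not available' category, so "missing" remains a visible signal.
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns
df[num_cols] = df[num_cols].fillna(0)
df[cat_cols] = df[cat_cols].fillna("Not available")
```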
We see that the points in red are well separated from the rest, exhibiting outlier-like behaviour.
Other features also had outliers, but since they added useful information to the data, we decided to leave them untouched.
Encoding categorical variables
We used simple label encoding to encode categorical variables.
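One lightweight way to do simple label encoding is `pd.factorize`, which assigns one integer per unique category; a sketch:

```python
import pandas as pd

s = pd.Series(["PHY3", "PHY1", "PHY3", "Not available"])
# pd.factorize assigns an integer per unique category
# (in order of first appearance) -- simple label encoding.
codes, uniques = pd.factorize(s)
```

sklearn's `LabelEncoder` achieves the same effect; the choice is mostly a matter of convenience.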
All the data we have so far is at patient level. To convert it to provider level, we need to aggregate all the features by provider.
Dummy coding the race feature using pandas get_dummies().
Generating aggregated features..
Similarly we have generated aggregated features for Medical cases,unique physicians,unique diagnosis codes,unique procedure codes,unique states and counties per provider.
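The aggregation step can be sketched with a pandas `groupby`; the claim and physician column names here are assumptions:

```python
import pandas as pd

claims = pd.DataFrame({
    "Provider": ["PRV1", "PRV1", "PRV2"],
    "InscClaimAmtReimbursed": [2000, 3000, 150],
    "AttendingPhysician": ["PHY1", "PHY1", "PHY2"],
})

# Roll patient-level claims up to provider level: total claim amount
# and number of unique physicians per provider.
per_provider = claims.groupby("Provider").agg(
    TotalReimbursed=("InscClaimAmtReimbursed", "sum"),
    UniquePhysicians=("AttendingPhysician", "nunique"),
).reset_index()
```

The same pattern extends to unique diagnosis codes, procedure codes, states and counties, simply by adding more `nunique` aggregations.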
Now we concatenate all the features to generate a dataset aggregated to provider level.
7. Feature selection and baseline modelling
Before we proceed, we need to split our dataset in order to avoid data leakage.
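A stratified split keeps the fraud ratio identical in train and test, and splitting before any resampling is what prevents leakage; a sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)   # imbalanced toy labels

# stratify=y preserves the class ratio in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
```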
Now let's find the correlation of the features amongst themselves, using pandas' df.corr(). Several features have a very high degree of correlation; let's eliminate those with a correlation greater than 0.99.
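One common way to implement this is to scan the upper triangle of the correlation matrix and drop any column that correlates above the threshold with an earlier one; a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [2, 4, 6, 8],     # perfectly correlated with 'a'
                   "c": [4, 1, 3, 2]})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.99).any()]
reduced = df.drop(columns=to_drop)
```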
Now let's visualize the data by projecting it to lower dimensions using a t-SNE plot.
We see that many positive class labels do form clusters, but some are found apart. This plot suggests that, based on the available features, we should be able to build a decent classifier.
Let's understand which features contribute to the prediction and which do not. To do this, we built a random forest classifier and plotted the feature importances.
We can clearly see that procedure codes 5 and 6 contribute nothing to the prediction, so we decided to drop those features.
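A sketch of the importance-based filter on synthetic data, where only the first feature carries signal (the 0.01 cutoff here is an illustrative choice, not the case study's exact threshold):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] > 0.5).astype(int)   # only feature 0 is informative

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = rf.feature_importances_
# Keep only features with non-negligible importance.
keep = importances > 0.01
```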
Before we start building models we need to address the class imbalance.
Handling Class Imbalance
We will try out the methods below to attenuate the class imbalance:
- Random oversampling
- Synthetic minority oversampling (SMOTE)
- Class weights balancing
- Cluster-based resampling
- Cluster-based sampling and aggregation
For random oversampling we used imblearn's RandomOverSampler, which randomly duplicates data points from the minority class.
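To keep the sketch dependency-free, here is a minimal numpy re-implementation of what RandomOverSampler does with default settings (duplicate minority rows until the classes are balanced):

```python
import numpy as np

def random_oversample(X, y, random_state=0):
    """Duplicate minority-class rows until every class matches the
    largest class -- a sketch of imblearn's RandomOverSampler."""
    rng = np.random.RandomState(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        class_idx = np.where(y == c)[0]
        extra = rng.choice(class_idx, size=n_max - n, replace=True)
        idx.extend(class_idx)
        idx.extend(extra)
    idx = np.array(idx)
    return X[idx], y[idx]

X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 0, 1, 1])
X_res, y_res = random_oversample(X, y)
```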
Synthetic Minority Oversampling(SMOTE)
Here, in order to oversample the minority class, we generate synthetic samples.
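In practice one would call imblearn's SMOTE; as a dependency-light sketch of the idea, each synthetic point is placed on the segment between a minority sample and one of its k nearest minority neighbours:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=2, random_state=0):
    """Minimal SMOTE sketch: interpolate between a minority sample
    and a randomly chosen one of its k nearest minority neighbours."""
    rng = np.random.RandomState(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh = nn.kneighbors(X_min)          # column 0 is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.randint(len(X_min))
        j = neigh[i, rng.randint(1, k + 1)]  # pick a random neighbour
        gap = rng.rand()
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
new_pts = smote_sketch(X_min, n_new=5)
```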
Class weights balancing
Most models expose a parameter to alter the class weights; we will use that.
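In sklearn this knob is `class_weight`; with `'balanced'`, mistakes on the minority class are weighted by the inverse class frequency. A sketch with a logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = np.array([0] * 90 + [1] * 10)   # 9:1 imbalance

# 'balanced' re-weights the loss by inverse class frequency, so
# minority-class errors cost roughly nine times as much here.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```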
Cluster based resampling
- Take only the majority class and apply a clustering algorithm to group the data into k clusters, where k is the number of data points in the minority class.
- For each cluster, find its centroid.
- Use the k cluster centroids together with the minority-class data to build the model.
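The steps above can be sketched with KMeans on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X_maj = rng.rand(50, 2)    # majority class
X_min = rng.rand(5, 2)     # minority class

# Cluster the majority class into k = |minority| clusters and replace
# it with the k centroids, yielding a balanced training set.
k = len(X_min)
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_maj)
X_bal = np.vstack([km.cluster_centers_, X_min])
y_bal = np.array([0] * k + [1] * k)
```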
Cluster based sampling and aggregation
- Break the majority class into L clusters.
- Combine each cluster with the minority-class data and train a model on each combination.
- Evaluate each model on the test data.
- After getting the predictions, take a majority vote and predict the winning label.
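A sketch of this scheme, with logistic regressions standing in for whatever base model one prefers:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_maj, X_min = rng.rand(60, 2), rng.rand(10, 2) + 1.0
X_test = np.vstack([rng.rand(5, 2), rng.rand(5, 2) + 1.0])

# Split the majority class into L clusters; pair each cluster with the
# full minority class, train one model per pair, then majority-vote.
L = 3
labels = KMeans(n_clusters=L, n_init=10, random_state=0).fit_predict(X_maj)
models = []
for c in range(L):
    X_c = np.vstack([X_maj[labels == c], X_min])
    y_c = np.array([0] * (labels == c).sum() + [1] * len(X_min))
    models.append(LogisticRegression().fit(X_c, y_c))

votes = np.stack([m.predict(X_test) for m in models])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
```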
After experimenting with several methods, we found that simple random oversampling and class-weight balancing outperform the more complex methods. One possible reason could be the small amount of data.
Now we are ready to build baseline models..
While building models we will use the imblearn library's Pipeline, as it supports random oversampling as a pipeline step. Along with it we will also use sklearn's RepeatedStratifiedKFold. Using a pipeline lets us cross-validate without any leakage.
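imblearn's Pipeline shares sklearn's Pipeline interface, adding only the ability to hold a resampler step; to keep this sketch dependency-free, `class_weight='balanced'` stands in for the oversampler here:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = np.array([0] * 80 + [1] * 20)

# Scaling happens inside the pipeline, so each CV fold is fit on its
# own training split only -- no leakage into the validation fold.
pipe = Pipeline([("scale", StandardScaler()),
                 ("svc", SVC(kernel="rbf", class_weight="balanced"))])
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
scores = cross_val_score(pipe, X, y, scoring="f1", cv=cv)
```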
1. SVM (RBF Kernel)
Below are the tabulated results of all the baseline models we tried as part of the case study.
Now let's move on to ensembles.
8. Ensemble modelling
Custom stacking ensembles
Parameter tuning for the best number and combination of base learners..
Let's tune the hyperparameters of the individual base learners..
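The case study builds a custom stacking ensemble; sklearn's StackingClassifier follows the same idea (base learners plus a meta-learner fit on their out-of-fold predictions) and is shown here as an analogous sketch, with illustrative base learners:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(120, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Two base learners; a logistic-regression meta-learner is trained on
# their cross-validated predictions (cv=3).
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
                ("dt", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(),
    cv=3)
stack.fit(X, y)
```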
Results of ensembles
9. Building a Streamlit app
Let's create a simple healthcare provider fraud detector app using Streamlit that takes in the details of a healthcare provider and predicts whether it is fraudulent or not.
First save the best model in pickle format.
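The pickle round trip can be sketched as follows; a small logistic regression stands in for the best model:

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# Serialize the trained model so the app can load it at startup.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)
```

Only unpickle files you created yourself; pickle can execute arbitrary code when loading untrusted data.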
Then create the app.py file. Our app.py takes input from the user and, using the stored model, predicts whether the given provider is fraudulent or not. The front end of the app is handled by Streamlit.
10. Containerization and deployment on GCP
Kubernetes is an open-source system, originally developed by Google, for managing containerized applications. With Kubernetes we can deploy scalably: during peak hours, when traffic increases, we can easily scale up or down by adding or removing containers.
After deploying our app on Google Kubernetes Engine, it looks something like this:
Generating prediction from the app..
Nice! Our app seems to be working as expected..
11. Future work
- Try out variational autoencoders or an anomaly detection approach.
- Try to integrate online model retraining aspect to the deployed app.
- Add more features to the app to make it more functional as well as visually more attractive.
12. References
- Deployment tutorial: https://towardsdatascience.com/deploy-machine-learning-model-on-google-kubernetes-engine-94daac85108b
- (Book) Building Machine Learning Powered Applications by Emmanuel Ameisen
- (Course) Applied AI Course
- Code snippets for this blog were created using Carbon
13. About the author
Congrats, you have made it to the end of the blog! If you are interested, you can visit the GitHub repository for the project, where you will also find the files required for deployment. Feel free to drop any queries or suggestions in the comments section.
Have a good day!!