Leveraging real-time scoring through Big Data to detect insurance fraud


Fraud risk exposure from claims is a major concern for the insurance industry, and it’s growing. But timely detection of attempted fraud is challenging.

Effective fraud detection must sift through huge volumes of data coming from many different sources, both inside and outside the company. In many cases, even the internal information is stored in multiple silos, preventing existing information assets from being leveraged to identify hidden connections. Additionally, companies must be able to respond rapidly to claims submissions to prevent payouts to fraudulent claimants.

Can Big Data help?

On the one hand, we have this complex problem. And on the other, we have Big Data’s promise of complexity management. So how can we bring the two together to resolve the problem?
That’s the question that led us to develop a prototype, based on Big Data technologies, for real-time scoring that can be used as a supplementary tool for fraud detection. The idea behind the prototype is simple: to provide a flexible tool that discovers connections between the people involved in a claim and provides prompt feedback on those connections, immediately triggering further investigation of potential fraud.

Initial results suggest that this approach can provide a powerful and efficient tool.


So how does real-time scoring work?


Through the combined use of Apache Hadoop, the Elasticsearch search engine, and the OrientDB graph-oriented database, we can acquire, link, and index all of a company’s customer information. We use all available data sources – internal (claims, CRM, and portfolio systems) and external (e.g., a blacklist) – giving the company the option both to execute antifraud rules that identify suspicious links and to visualize those connections graphically. In the prototype, more than 60 million documents were indexed and about 15 million relationships appeared in the final network.
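To make the acquire-link-index step concrete, here is a minimal in-memory sketch of the idea: documents from several silos are indexed by the people they mention, so every record involving a given person can be retrieved in one lookup. In the prototype this role is played by Elasticsearch over far larger volumes; the source names and fields below are illustrative assumptions, not the prototype’s actual schema.

```python
from collections import defaultdict

# person -> list of (source, document) pairs mentioning that person
index = defaultdict(list)

def index_document(source, doc):
    """Index a document under every person it mentions."""
    for person in doc.get("people", []):
        index[person].append((source, doc))

# Records arriving from internal and external sources (hypothetical data).
index_document("claims", {"claim_id": "C-1", "people": ["alice", "bob"]})
index_document("crm", {"customer": "bob", "people": ["bob"]})
index_document("blacklist", {"people": ["bob"], "reason": "prior fraud"})

# One lookup returns everything known about "bob", regardless of silo.
hits = index["bob"]
```

The point of the sketch is the shape of the access pattern: once the silos are merged into a single person-keyed index, the blacklist hit and the claim history surface together instead of living in separate systems.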

As soon as a user enters the information about the people involved in a claim, the prototype queries its datasets for documents and relationships involving those people and calculates a score by applying a set of predefined rules to the results.
The score represents the risk profile of the claim: the higher the score, the higher the probability of the claim being fraudulent.
The whole process takes just a few seconds to complete.
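The scoring step can be sketched as a weighted checklist: each predefined rule inspects the findings for the claim and, if it fires, contributes its weight to the total. The rules and weights below are invented examples for illustration, not the prototype’s actual rule set.

```python
def score_claim(findings):
    """Sum the weights of all predefined rules that match the findings."""
    rules = [
        # (condition over the findings, weight it contributes)
        (lambda f: f["on_blacklist"], 50),
        (lambda f: f["shared_address"], 20),
        (lambda f: f["prior_claims"] > 3, 15),
    ]
    return sum(weight for condition, weight in rules if condition(findings))

risky = score_claim({"on_blacklist": True, "shared_address": True,
                     "prior_claims": 1})   # 50 + 20 = 70
clean = score_claim({"on_blacklist": False, "shared_address": False,
                     "prior_claims": 0})   # 0
```

Because the rules are data rather than hard-wired logic, a company can tune weights or swap rules without touching the query layer, which matters for the per-company customization discussed below.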

How is this different from what’s available now?

First, by using search-engine technology to store the datasets, the prototype is able to analyse and process, in real time, all information coming in from every source. This makes it possible to detect potential fraud as soon as a claim is opened and to react quickly to avoid financial loss.

Additionally, the solution uses a graph-oriented model designed to discover possible relationships between the involved parties. This model can also be exploited to run more sophisticated real-time queries, further increasing the speed and accuracy of fraud risk detection.

The use of NoSQL schema-less databases also makes it easier to integrate new information sources into the tool and to develop new rules that best fit the needs of individual companies, without any change to the data model.
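The schema-less advantage is easy to demonstrate in miniature: if documents are plain key/value records, a new source can introduce fields the original design never anticipated, and a new rule can reference them immediately, with no migration. The field names below are assumptions chosen for the example.

```python
# Existing records from an internal source.
documents = [
    {"person": "alice", "source": "claims", "claim_id": "C-1"},
]

# A new external source arrives with a field no earlier record had;
# in a schema-less store it is simply appended, no ALTER TABLE needed.
documents.append(
    {"person": "alice", "source": "social", "linked_accounts": 12}
)

# A new company-specific rule over the new field, added without
# touching the records that lack it.
def many_linked_accounts(docs):
    return any(d.get("linked_accounts", 0) > 10 for d in docs)
```

Older documents simply report the field as absent, so existing rules keep working unchanged while new ones come online.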

Last but not least, our prototype is built entirely on an open-source software stack, making it easy to integrate into existing enterprise systems and to adapt to individual companies’ antifraud rules.

As can be seen from this example, by bringing together all information that already exists in the company, Big Data makes it possible to discover new facts and evidence, adding value to the company’s information assets.

How can this solution become an even more complete antifraud system?

We’re working on two primary features to enhance the prototype.
First, we plan to support the addition of new information sources – not only new data streams from internal and external systems, but also the results of previous investigations. This could lead, for example, to the creation of dynamic blacklists, which would help to avoid false positives. It would also enable investigators to keep abreast of a claim’s evolution, because in the real world a claim changes constantly; identifying suspicious changes in the "history" of a claim further improves fraud detection capabilities.
Secondly, we are working on extracting statistical information about the structure of the relationship network in order to implement statistical rules for fraud prevention.
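One plausible statistical rule of this kind, sketched under our own assumptions rather than taken from the prototype, flags parties whose number of connections (their degree in the network) is unusually high relative to the network average, using a simple z-score cutoff:

```python
from statistics import mean, pstdev

def flag_outliers(degrees, z_threshold=1.5):
    """Flag parties whose degree is far above the network mean.

    degrees: mapping of party -> number of relationships.
    z_threshold: how many standard deviations above the mean counts
    as suspicious (an assumed tuning parameter).
    """
    mu = mean(degrees.values())
    sigma = pstdev(degrees.values())
    if sigma == 0:
        return []  # every party has the same degree; nothing stands out
    return [party for party, d in degrees.items()
            if (d - mu) / sigma > z_threshold]

# A party connected to 20 others in a network where most have 2 or 3
# connections stands out statistically.
suspects = flag_outliers({"a": 2, "b": 3, "c": 2, "d": 3, "hub": 20})
```

A hub like this might be an innocent broker, so such a rule would contribute to the score rather than decide the outcome on its own.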

Conclusions

Big Data technologies can be the enabling factor for the development of new and more sophisticated antifraud systems, empowering insurance companies to exploit the wealth of information available to them and so minimize fraud risk exposure and the consequent financial losses.