The Fraud Examiner

Leveraging Machine Learning in Fraud Incident Response

Jeremy Clopton, CFE, CPA, ACDA, CIDA
Lanny Morrow, EnCE  

As the office CFE, you recently learned of fraud allegations within your organization. Apparently, someone has been embezzling funds and using the accounts payable system to do so. You now have to begin the investigation to figure out who may or may not be involved. What do you do? You could go the usual route and start with interviews, document review or a flip through the paper bank statements. Or you could use machine learning to leverage your organization’s data, identify the individuals who exhibit the elements of the fraud triangle and take a focused approach. Which would you choose?

Machine learning can improve the effectiveness and efficiency of your fraud incident response efforts. This article explains the concept of machine learning, discusses applications in structured and unstructured data, touches on how the fraud triangle comes into play and concludes with ideas for expanding the detection horizon.

Machine Learning vs. Rule-Based Systems

Traditional analytics rely on rule-based methods to detect anomalies. These decision rules follow simple Boolean logic: if a vendor address matches an employee address and the vendor’s wire transfer account matches the employee’s bank account, then the vendor is likely fictitious. While effective, rule-based systems are inherently subjective, geared toward known frauds and limited to a few attributes and exact matching of criteria. Machine learning offers a supplement to these traditional detection methods.
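The fictitious-vendor rule can be sketched in a few lines of Python. This is a minimal illustration only: the `Party` record and its field names are hypothetical, and a real accounts payable system will store this information differently.

```python
from dataclasses import dataclass


@dataclass
class Party:
    # Hypothetical record layout, used for both vendors and employees.
    name: str
    address: str
    bank_account: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not block a match."""
    return " ".join(text.lower().split())


def is_likely_fictitious(vendor: Party, employee: Party) -> bool:
    """Boolean rule: the vendor shares BOTH an address and a bank account with an employee."""
    same_address = normalize(vendor.address) == normalize(employee.address)
    same_account = vendor.bank_account.strip() == employee.bank_account.strip()
    return same_address and same_account


# Invented sample records.
vendor = Party("Acme Supplies", "12 Oak St,  Springfield", "000123456")
employee = Party("J. Doe", "12 oak st, springfield", "000123456")
other = Party("R. Roe", "99 Elm Ave, Springfield", "000999999")

print(is_likely_fictitious(vendor, employee))
print(is_likely_fictitious(vendor, other))
```

The sketch also shows the limitation: if the employee record reads "12 Oak Street" instead of "12 Oak St", the exact comparison fails and the rule stays silent.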

Machine learning (ML) is a type of artificial intelligence (AI) that can learn without pre-defined decision rules. In its supervised form, machine learning constructs its own decision tree from meta-tagged data, e.g., “red flag” or “not,” to determine how “red flag” transactions are related. ML then applies the learned logic to new data. Because it learns from a complex array of data rather than just a few variables, it can achieve greater accuracy than a small set of hand-written rules.
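To make the idea of learning a decision from tagged data concrete, the following minimal sketch fits a one-level decision tree (often called a "decision stump") to a handful of transactions tagged "red flag" or not. The data, the two features and the function names are all assumptions made for illustration. Production tools build much deeper trees over many more attributes.

```python
# Each row: (amount in dollars, days from invoice to payment).
# Label 1 = tagged "red flag", 0 = tagged normal. All values are invented.
rows = [
    (4900.0, 1), (4950.0, 2), (4990.0, 1),
    (1200.0, 30), (830.0, 28), (4960.0, 31), (2500.0, 25),
]
labels = [1, 1, 1, 0, 0, 0, 0]


def fit_stump(rows, labels):
    """Try every feature/threshold split and keep the one with the fewest training errors."""
    best = None  # (errors, feature_index, threshold, flag_if_below)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds are midpoints between adjacent observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            for flag_if_below in (True, False):
                preds = [int((r[f] < t) == flag_if_below) for r in rows]
                errors = sum(p != y for p, y in zip(preds, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, t, flag_if_below)
    return best


def predict(stump, row):
    """Apply the learned split to a new, untagged transaction."""
    _, f, t, flag_if_below = stump
    return int((row[f] < t) == flag_if_below)


stump = fit_stump(rows, labels)
print(stump)
print(predict(stump, (4975.0, 1)), predict(stump, (4975.0, 29)))
```

Nobody told the sketch which attribute matters. On this sample it settles on payment speed rather than amount, because one large but normal payment keeps any split on amount from separating the tagged rows cleanly.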

Another type of ML, called “unsupervised learning,” constructs decision trees without meta-tagged data; it identifies patterns of interest and anomalies using its own decision-making criteria. This allows fraud examiners to identify new forms of fraud not previously detected or codified into rules-based methods. Both supervised and unsupervised ML systems are self-refining, in that they become more accurate as more data is encountered.
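As a small illustration of finding anomalies without tagged data, the sketch below scores each payment by its distance from the median, measured in units of the median absolute deviation (MAD). This is a simple statistical stand-in, not the tree-building approach described above. The sample payments and the cutoff of 5.0 are assumptions chosen for the example.

```python
from statistics import median


def robust_scores(values):
    """Distance of each value from the median, in units of the median absolute deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Degenerate case: at least half the values are identical, so no scale is available.
        return [0.0 for _ in values]
    return [abs(v - med) / mad for v in values]


def flag_outliers(values, cutoff=5.0):
    """Return the values whose robust score exceeds the cutoff (an arbitrary choice here)."""
    return [v for v, s in zip(values, robust_scores(values)) if s > cutoff]


# Invented payments to a single vendor; no transaction is tagged in advance.
payments = [210.0, 198.0, 205.0, 220.0, 201.0, 215.0, 1980.0]
print(flag_outliers(payments))
```

No one defined what a suspicious amount looks like. The flag comes only from how the data itself is distributed.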


Applications in Structured Data

Machine learning is frequently used to spot “red flag” patterns in structured data (data with a predictable structure, like spreadsheets, databases and financial data formats). Examples include identifying suspicious insurance claims, unusual banking transactions and credit card activity. It is also useful in network relationship analysis, which is the exploration of connections between individuals and entities. Often complex, relationship networks can be quickly quantified with an unsupervised learning approach called “clustering,” allowing the examiner to efficiently identify key relationships and the web of communications and influence. The source of such data is often corporate email, but may also include phone records and social media.

Machine learning also enhances basic attribute matching. Rather than creating a complex set of rules for matching names, addresses and other identifying attributes, ML-based systems learn what a match looks like and apply this logic to the data, resulting in a higher degree of accuracy.
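The sketch below shows the general idea of scoring how alike two attribute values are instead of demanding an exact match. It uses the similarity ratio from Python's standard `difflib` module, which is a fixed string-comparison measure and not a learned model. An ML-based matcher would instead learn its notion of a match from example pairs. The 0.85 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


THRESHOLD = 0.85  # arbitrary cut-off chosen for this illustration


def is_probable_match(a: str, b: str, threshold: float = THRESHOLD) -> bool:
    """Treat two values as a probable match when their similarity reaches the threshold."""
    return similarity(a, b) >= threshold


print(is_probable_match("12 Oak St., Springfield", "12 oak st springfield"))
print(is_probable_match("12 Oak Street, Springfield", "12 Oak St, Springfield"))
print(is_probable_match("12 Oak Street, Springfield", "99 Elm Ave, Shelbyville"))
```

Unlike an exact comparison, this approach treats "Street" and "St" as close rather than different, at the cost of having to choose and justify a threshold.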

In a more specific example, due to the unique and creative ways employees “game” the system with purchasing cards and expense reimbursements, machine learning is also useful in identifying previously unknown schemes in these areas. Case experience has shown ML to be effective in identifying purchasing behavior such as frequent vendors, unusually consistent amounts, certain transactions that occur in tandem with each other on a frequent basis and items occurring with unusual regularity. Similarly, an ML-based system may flag expense reimbursements as unusual even when they do not trigger typical rules-based tests such as rounded amounts, “just under” threshold amounts and the traditional Benford’s Law analysis.
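For comparison, the traditional rules-based tests named above are simple to express. Benford's Law predicts that the leading digit d of naturally occurring amounts appears with frequency log10(1 + 1/d), so about 30.1% of amounts should start with 1 and only about 4.6% with 9. The following minimal sketch computes that expectation alongside rounded-amount and "just under" checks. The sample amounts, the 100-dollar rounding multiple and the 5% margin are assumptions for illustration.

```python
import math
from collections import Counter


def benford_expected(d: int) -> float:
    """Expected share of amounts whose leading digit is d (1-9) under Benford's Law."""
    return math.log10(1 + 1 / d)


def first_digit(amount: float) -> int:
    """Leading non-zero digit of an amount of at least one cent."""
    digits = f"{abs(amount):.2f}".lstrip("0.")
    return int(digits[0])


def observed_shares(amounts):
    """Observed share of each leading digit 1-9 in a list of amounts."""
    counts = Counter(first_digit(a) for a in amounts)
    total = len(amounts)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


def is_round(amount: float, multiple: float = 100.0) -> bool:
    """Rounded-amount test: the amount is an exact multiple of `multiple`."""
    return amount % multiple == 0


def is_just_under(amount: float, limit: float, margin: float = 0.05) -> bool:
    """'Just under' test: the amount falls within `margin` below an approval limit."""
    return limit * (1 - margin) <= amount < limit


# Invented expense amounts, compared against an assumed 5,000 approval limit.
amounts = [120.0, 130.0, 250.0, 990.0, 4950.0, 500.0]
shares = observed_shares(amounts)
for d in range(1, 10):
    print(d, round(benford_expected(d), 3), round(shares[d], 3))
print([a for a in amounts if is_round(a)])
print([a for a in amounts if is_just_under(a, 5000.0)])
```

A gap between observed and expected digit frequencies only means something across a large number of transactions, and even then it is a prompt for review rather than evidence of fraud.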

Applications in Unstructured Data

The most exciting use of ML in investigations relates to unstructured textual data. For example, Stanford University’s study of earnings conference call transcripts from publicly traded companies revealed that the choice of general vs. specific adjectives, and the use of third-person vs. first-person language, had predictive power in assessing whether an organization was hiding something or engaging in questionable activities. The now-famous Enron email data set revealed similar patterns.

The most powerful feature of ML in analyzing textual data is its ability to identify emotional expressions. Common emotions of interest in examinations include anger/frustration, tension/anxiety, vagueness and evasive/conspiratorial tones. A properly trained ML model can quickly find these types of communications before the examiner reads a single email, allowing them to focus on more relevant content early in the investigation. For incident response, identifying relevant content early in the case increases the effectiveness of the investigation approach.

ML also allows the examiner to perform “find more like this” searches across vast document and email collections. By identifying just a few relevant communications or documents early on in an investigation, the examiner can pass those to machine learning to learn what the emails are about at a conceptual level, then rapidly find similar items, regardless of keywords. This has proven vital in identifying interview targets and topics early in an investigation.
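The ranking step behind a “find more like this” search can be sketched as follows. This is a deliberately simplified stand-in: it compares plain word-count vectors with cosine similarity, so it still depends on shared words, whereas the tools described above learn conceptual representations that do not. The sample messages and function names are invented for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Word-count vector for a document."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def more_like_this(seeds, collection, top_n=3):
    """Rank documents by similarity to the combined seed documents."""
    profile = Counter()
    for s in seeds:
        profile.update(vectorize(s))
    scored = [(cosine(profile, vectorize(doc)), doc) for doc in collection]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_n]


# One communication the examiner has already judged relevant (invented).
seeds = ["please keep the invoice off the books until the audit is done"]
collection = [
    "lunch on friday?",
    "hold the invoice until the audit team leaves",
    "quarterly newsletter attached",
]
ranked = more_like_this(seeds, collection, top_n=1)
print(ranked[0][1])
```

The examiner supplies examples rather than search terms, and the collection is ordered by resemblance to those examples.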

The Fraud Triangle, Revisited

Previous studies have demonstrated that communications possess keywords that tie directly to aspects of the fraud triangle. Training an ML tool to identify these communications untethers the analysis from rules-based keywords, focusing instead on concepts and emotional tones.

ML is not limited to email communications. Public-facing social media, freeform text in public records and internet history can also be included in the analysis. Leveraging these sources may reveal stressors such as gambling habits, drug or alcohol abuse, or evidence of living beyond one’s means. Various forms of expression in social media posts have long since been correlated with personality types and behavioral dispositions, and the automated extraction of name, place and date information from these sources helps build a “profile” of various subjects in an investigation, or spot anomalies and red flags within a larger department or entire entity.

Expanding the Detection Horizon

Using ML in the investigation and detection strategy expands the detection horizon (the conceptual border between what traditional detection approaches can identify and what they cannot) outward to identify obscure or more complex red flags. Further, through unsupervised learning and reverse engineering of supervised learning “thought processes,” we can discover new anomalies and threats not previously published or encountered. The most valuable application of this concept is pushing the detection horizon back to the point where a fraud can be detected before it occurs.


So which will it be: document review and interviews, or leveraging your data through machine learning? The value of incorporating machine learning into fraud incident response is hard to overstate. Improving the effectiveness and timeliness of an investigation is invaluable to an organization. Incorporating unstructured data sources and machine learning into the control structure could prove even more critical, identifying threats earlier than previously thought possible and detecting fraud schemes before they are known to exist.
