Fraud Edge

Embracing complexity via text mining

Date: September 1, 2014

Fraud EDge: A forum for fraud-fighting faculty in higher ed

In the last column, we began a discussion of leading-edge technologies that can provide significant amounts of useful information to fraud examinations. We also introduced the collaboration between data mining and digital forensics, which is driven by the increasing volume of structured and unstructured data; unstructured data alone can account for as much as 80 percent of the total data in an organization. Now we address text mining (sometimes called text analytics) and discuss its characteristics and possible areas of application in fraud cases. We look at the components of text mining and how practitioners might use these methods to analyze large data sets and produce information that achieves the fraud examiner's goals.

SOURCES OF TEXT USED IN FRAUD EXAMINATIONS

Some of the more commonly used sources of unstructured data in an examination include:

  • Email communications (corporate and web-based).
  • Documents.
  • Social media.
  • Chat, texting and instant messaging.
  • Websites and blogs.
  • Contents of computer hard drives, mobile devices and cloud storage. 

While there are many other potential sources, experience has shown these to be the most common in corporate examinations.

Email, chief among unstructured data sources in fraud examinations, not only contains word-for-word communications but also carries a date/time element, metadata and even emotional tones expressed through idioms, phrases and adjectives. Fraud examiners can use these components to analyze both the character of the communications and the personalities of the communicators.

The contents of computer hard drives include not just email but also documents, audio and video, caches of Internet activity, discarded instant messaging and chat sessions, deleted content and overlooked backup and temporary copies of items. Digital forensics technologies can preserve, identify and produce these obscure items.

TEXT MINING TOOLS AND PROCESSES

Handling the sheer volume and complexity of unstructured data requires special tools and processes. Often, the majority of useful, relevant material is human communications, so analysis shouldn't be limited to mere keyword searches. Extracting meanings and topics, detecting the emotional tones of conversations, and creating relationship networks that visualize how key players and topics interact, influence one another and evolve over time can provide fraud examiners with critical information not otherwise apparent.

Figure 1 provides a conceptual overview of the family of processes that make up the core of text mining. These components encompass the science of "natural language processing" and related concepts of "latent semantic analysis" and "concept searching," among others. Experience has shown us that these components, when working together, can be an effective toolset in the identification of relevant evidence in a fraud examination.

Figure 1: Text-mining overview

TEXT MINING COMPONENTS

  1. Predictive coding incorporates artificial intelligence (AI) to help analysts find related and similar documents in a massive collection of text. Because AI can determine the underlying concepts in documents or emails, analysts can perform predictive coding independently of traditional methods that rely on keyword searches. Predictive coding can find highly relevant content rapidly (the proverbial "smoking gun"), but more importantly it can reduce the volume of material that must be manually reviewed in an investigation by as much as 95 percent. This is possible because the AI and the human analyst leverage each other's strengths, an approach known as "augmented intelligence." In addition, the technology can dissect communications into their grammatical subcomponents, creating topic maps like those shown in Figure 2. This example illustrates "a tale of two finance departments"; it doesn't take much imagination to tell which department may have some issues.
  2. Part-of-speech tagging is the process by which a computer program breaks text into its grammatical parts. Leveraging this function makes possible one of the more useful and exciting types of analysis: tone detection.
  3. Tone detection uses the adjectives, idioms and phrases found in a communication to assess its emotional tone. This ability has powerful implications: a fraud examiner can quickly home in on red flags without even having an initial theory or starting point in an investigation. Common tones to search for include tense, vague, nervous, low-esteem and conspiratorial, among others.
  4. Named entity extraction (NEE) gives the fraud examiner a particularly powerful form of analysis. Text-mining tools can identify grammatical components, so they're very adept at identifying proper names, places and events. Because names and events can be pulled from email communications, NEE is useful in relationship mapping, the process of graphically representing relationships among the various subjects of an investigation. For example, without NEE, a map may only show relationships between the sender and recipient of an email communication. Adding extracted topics, names and places from the message body gives the relationship map a completely new dimension. Some maps become rather complex, as Figure 3 illustrates.

Figure 2: Topic maps
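At its simplest, the tone detection described above can be sketched as matching a message against small lexicons of tone-laden words and phrases. The mini-lexicons below are hypothetical placeholders for illustration only; a real text-mining tool would use far richer dictionaries and part-of-speech analysis.

```python
import re
from collections import Counter

# Hypothetical mini-lexicons of tone-laden phrases (illustrative only).
TONE_LEXICON = {
    "tense": {"deadline", "pressure", "urgent", "asap"},
    "vague": {"somehow", "stuff", "whatever"},
    "conspiratorial": {"between us", "delete this", "keep this quiet"},
}

def detect_tones(message: str) -> Counter:
    """Count lexicon hits per tone in a message (case-insensitive)."""
    text = message.lower()
    scores = Counter()
    for tone, phrases in TONE_LEXICON.items():
        for phrase in phrases:
            # Word-boundary match so "stuff" doesn't match "stuffing".
            scores[tone] += len(
                re.findall(r"\b" + re.escape(phrase) + r"\b", text)
            )
    return scores

msg = "Keep this quiet and delete this email before the urgent deadline."
print(detect_tones(msg).most_common())
```

Even this toy version shows the idea: a message scoring high on "conspiratorial" and "tense" phrases surfaces as a red flag without any prior keyword theory.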

TEXT MINING AND DATA ANALYTICS

The data and information gathered, analyzed and produced using text mining provide even more value to a fraud examiner when used in combination with data analytics procedures applied to structured data. Named entities, email recipients/senders and relationships provide further insights into how employees, customers and vendors interact. The relationship map in Figure 3 becomes even more robust when you add the identified structured-data relationships (employee-vendor, employee-customer, vendor-customer, etc.) based on common attributes such as name, address, phone number or tax identification number. In some instances, analyzing structured data and unstructured data independently provides two interesting, but possibly incomplete, results. Combining the results of both into a single analysis provides a more complete picture.
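The common-attribute matching described above can be sketched in a few lines. The employee and vendor records here are entirely hypothetical toy data, assumed to come from HR and accounts-payable systems; real matching would also normalize formats and use fuzzy comparison.

```python
# Hypothetical employee and vendor master records (toy data).
employees = [
    {"id": "E100", "name": "Pat Jones", "phone": "555-0101", "address": "12 Oak St"},
    {"id": "E200", "name": "Sam Lee", "phone": "555-0102", "address": "9 Elm Ave"},
]
vendors = [
    {"id": "V900", "name": "Oakline Supply", "phone": "555-0101", "address": "44 Pine Rd"},
    {"id": "V901", "name": "Elm Consulting", "phone": "555-0999", "address": "9 Elm Ave"},
]

def employee_vendor_links(employees, vendors, keys=("phone", "address")):
    """Return (employee_id, vendor_id, shared_attribute) tuples for records
    that share a common attribute -- a classic employee-vendor red flag."""
    links = []
    for emp in employees:
        for ven in vendors:
            for key in keys:
                if emp.get(key) and emp.get(key) == ven.get(key):
                    links.append((emp["id"], ven["id"], key))
    return links

print(employee_vendor_links(employees, vendors))
```

Each link found this way becomes an additional edge on the relationship map, joining the edges extracted from email traffic.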

Incorporating dates/times of email, document creation, social media postings and computer-based activities (downloads, uploads, deletions, etc.) provides chronological events that can be useful in further analyzing transactions to identify possible correlations and/or causations. For example, analyzing communications from a purchasing director involving a request for proposal (RFP) process can be used to find indications of potential bid rigging. Red flags may include email or phone communications to the winning vendor minutes before the submission deadline, or a specific vendor winning each time a particular individual is on the RFP evaluation committee.
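The deadline-proximity red flag in the RFP example can be expressed as a simple timestamp filter. The communication log and deadline below are hypothetical; in practice these would come from email metadata gathered during the examination.

```python
from datetime import datetime, timedelta

# Hypothetical RFP deadline and (sender, recipient, timestamp) log entries.
deadline = datetime(2014, 6, 2, 17, 0)
comms = [
    ("purchasing.director", "winning.vendor", datetime(2014, 6, 2, 16, 52)),
    ("purchasing.director", "losing.vendor", datetime(2014, 5, 28, 9, 15)),
]

def flag_late_contacts(comms, deadline, window_minutes=30):
    """Flag communications sent within `window_minutes` before the deadline."""
    window = timedelta(minutes=window_minutes)
    return [c for c in comms if timedelta(0) <= deadline - c[2] <= window]

for sender, recipient, ts in flag_late_contacts(comms, deadline):
    print(f"RED FLAG: {sender} -> {recipient} at {ts}")
```

The same windowing idea extends to the committee-membership pattern: group awards by evaluator and winning vendor, then look for pairings that recur more often than chance would suggest.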

Figure 3: Complex relationship mapping

EMBRACING COMPLEXITY

The growth of unstructured data is a key driver in the need for the collaboration of data analytics and digital forensics. Text mining is the overarching name for the family of functions used to analyze unstructured data and isolate the useful data elements for inclusion in data analytics processes. Initially, incorporating unstructured data into investigations is a daunting task. The volume and complexity of the data to be analyzed are a challenge best conquered by the collaboration of data analytics and digital forensics. The result of this collaboration is a comprehensive analysis that fully integrates both structured and unstructured data. It's a process well worth the effort. Eric Berlow, ecologist, network scientist and Technology, Entertainment and Design (TED) Senior Fellow, summarized the need to embrace complexity in his July 2010 TED Talk: "We're discovering in nature that simplicity often lies on the other side of complexity. So for any problem, the more you can zoom out and embrace complexity, the better chance you have of zooming in on the simple details that matter most."

LOOKING TO THE FUTURE

In the next issue of Fraud Magazine, we'll focus on two methods that address the need to leverage technology in an investigation. Augmented intelligence addresses leveraging machine learning to analyze unstructured data more effectively and efficiently. Data visualization relates to the use of visual analytics for analysis and communication of results to end users.

Les Heitger, Ph.D., Educator Associate, is BKD Distinguished Professor of Forensic Accounting in the School of Accountancy at Missouri State University in Springfield. He's chair of the ACFE Higher Education Advisory Committee.

Jeremy Clopton, CFE, CPA, ACDA, is senior managing consultant in the Forensics Practice of BKD, LLP.

Lanny Morrow, EnCE, is a managing consultant in the Forensics Practice of BKD, LLP.

