
Fraud examiners have used Benford's Law for years to detect anomalies and possible fraud in organizations' ledgers and journals. Now the author is researching a method to analyze letters of the alphabet in business documents — "Letter Analytics" — to find fraud patterns. His new approach could expedite fraud examinations.
As fraud examiners, we analyze three types of data: numbers, dates and text. We've been trained to "follow the money" (those number fields), which is sage advice because it's the simplest path to identifying positive returns for the organizations that employ us. However, when we look at the world of information, be it written, spoken or on the Internet, we need to move past our habitual focus on the 0s and 1s and realize that the majority of data is text. Otherwise, our examinations can miss 50 percent to 85 percent of the available data.
Imagine for a moment an accounts payable audit in which we obtain a vendor masterfile, a purchase order file and an invoice payment table. As calculated in Figure 1 below, the textual data averages nearly 70 percent of the data received for examination ((85 percent + 70 percent + 50 percent) ÷ 3), with the "money"/numeric fields comprising nearly 10 percent of the data.
I can understand why organizations don't analyze most text fields. Numbers can be handled easily via calculators and plug-and-play, 24/7 analysis programs, but organizations need more complex tools and approaches to slice and dice textual data and refine the knowledge inherent in it.
Let's look at some of the approaches we utilize in audit software by data type in Figure 1 below.
(Figure 1: Comparison of data types.)
Many of you deftly use software for analyzing numbers and dates, but textual analytic tools probably aren't among your standard tools. You have to purchase those applications and hire a trained technician to use them.
The 70 percent average in Figure 1 is for structured accounting system data. However, the scope of textual data jumps to more than 90 percent if it includes unstructured data (emails, Word and Adobe documents, Excel files, network log files, etc.). Therefore, we're seriously limiting our fraud examinations if we don't review the full scope of data presented to us.
Many fraud examiners view the term big data as just another management buzz phrase. However, with the increasing capability of computer hardware and software, clients now invariably ask how you'll analyze big data, not just externally in customer and product trends but internally as well. You can impress them if you show them you can tap the 70 percent to 90 percent of internal data that is textual.
If Shakespeare only had numbers and dates to work with, he couldn't have conveyed emotion or meaning in his plays and sonnets. Unlike numeric data that can only trend up or down, we can use textual data to tell stories and provide context to the historical record — changes in a business process or the market or a discussion between two employees about an ongoing fraud.
Here are examples of when telling a story with words can be relevant to the examiner:
- Reviewing new or increasing word usage over time to understand what's trending upward and downward in discussions.
- Assessing changes in financial description fields over a period of time that might highlight new risks within the business process.
- Searching for the existence of keywords and phrases in a data population and trending word usage over time.
- Looking for falsified information.
Because words are key to telling a story, the trick to making them useful is to "numericize" them through summarization, trending and linkage analysis. Within textual analysis, therefore, we categorize, count and determine word placement with a variety of techniques. However, given the brute force needed to analyze words, finding word deviations can be time-consuming, and important subtleties in the analyzed text can still be missed.
We can develop "wordles" (also called word clouds and weighted lists) that display words in sizes proportional to how many times we find them in written evidence. However, it's difficult to see trends in wordles: we can't always gauge the relative sizes of the words, and some words might not appear in the illustration at all because they didn't occur enough times to be measured.
Because of the inherent difficulties of working with words, I arrived at a new analytic technique of analyzing letters within words — most notably the first and last letters — by applying Benford's Law principles.
Benford's Law describes the natural pattern that the first digits of many sets of numbers follow, a law rediscovered by Frank Benford in the 1930s. (American astronomer Simon Newcomb originally identified the law in 1881.)
The law then found relevance in the audit and fraud detection world through the research of Mark Nigrini, Ph.D., Educator Associate (and frequent speaker at ACFE events), who expanded the work into additional principles, including last-two-digits testing (to look for round-number and odd last-digit patterns) and the "My Law" concept (relating digits in one period to those in another; see below). (See Forensic Analytics, by Mark Nigrini, 2011, Wiley & Sons.)
Benford's Law states that in many naturally occurring sets of numbers, the first digits will follow an established distribution: the expected proportion of values beginning with digit d is log10(1 + 1/d), so roughly 30.1 percent begin with 1 while only about 4.6 percent begin with 9. We can apply this distribution to data sets to identify non-random occurrences of numbers by using the table of first-digit frequencies as a benchmark.
(Source: Forensic Analytics, p. 88, by Mark Nigrini, 2011, Wiley & Sons.)
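As a concrete reference point, here's a minimal Python sketch of a first-digit Benford test. It computes the expected proportions from log10(1 + 1/d) and compares them with the observed first digits of a small, made-up list of amounts; in practice you'd run it against a full transaction population.

```python
import math
from collections import Counter

# Illustrative amounts; a real test would use a full transaction population.
amounts = [4821.50, 1240.00, 1875.25, 932.10, 27540.00, 1103.75, 6410.00, 1999.99]

# Benford's expected proportion for a leading digit d is log10(1 + 1/d).
expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

first_digits = [int(str(abs(a)).lstrip("0.")[0]) for a in amounts]
observed = Counter(first_digits)
n = len(first_digits)

for d in range(1, 10):
    actual = observed.get(d, 0) / n
    print(f"digit {d}: expected {expected[d]:.3f}, observed {actual:.3f}")
```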
I've found some relevant traits in using Benford's Law for years:
So, just as a fraud examiner analyzes the first and last digits of a collection of numbers to quickly understand the population, I applied this same approach to words with great consistency, ease and excitement. I realized that I now could analyze 70 percent of data just as quickly as the 20 percent of numeric data.
One of my clients recently tasked me with understanding the types of journal entries it was processing. I resorted to the usual methods of analyzing account activity using horizontal and vertical financial statement analysis. The relative change in accounts over time in value and volume usage proved interesting, yet it didn't wholly tell the story of what was happening during the periods under review.
So, within the concept of "Accruals" I categorized the text in the journal and journal-line descriptions into further concepts such as "Payroll accrual" or "Accrue month end expense." However, I was immediately overwhelmed trying to categorize roughly 40,000 unique words and 11 million word occurrences in the million or so journal entry descriptions. To gauge the magnitude of the word volumes, I first needed to write a small program to split the descriptions into their individual words and then summarize them to get occurrence counts.
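A program along those lines takes only a few lines of Python. The sketch below splits descriptions into words and tallies occurrence counts; the sample descriptions and variable names are illustrative assumptions, not the client's data.

```python
import re
from collections import Counter

descriptions = [
    "Accrue month end payroll expense",
    "Payroll accrual - June",
    "Reverse prior month accrual",
]

word_counts = Counter()
for text in descriptions:
    # Split on anything that isn't a letter and normalize to lowercase.
    words = [w.lower() for w in re.split(r"[^A-Za-z]+", text) if w]
    word_counts.update(words)

# The most common words give a quick numeric summary of the text population.
for word, count in word_counts.most_common(5):
    print(word, count)
```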
As I was summarizing word occurrences, it dawned on me that I could bring the population down to 26 first letters. I then realized that the first letter of each word, at each position in the description, followed a pattern from one year to the next. Therefore, when I looked at the first 10 words of the descriptions for two years, I was able to analyze 520 data points (first 10 word positions x 26 letters x two years). That's still a lot of data points, but it's far fewer than 40,000 words and 11 million occurrences. To illustrate the simplicity of the 520 points, I drafted an Excel chart similar to Figure 2 below.
(Figure 2: Analysis of 520 data points.)
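Here's a minimal Python sketch of the kind of summary behind a chart like Figure 2: counting the first letter of each word at each of the first 10 word positions, by year. The sample entries are illustrative assumptions, not the client's journal data.

```python
import re
from collections import defaultdict

# Illustrative journal entry descriptions keyed by year (not the client's data).
entries = {
    2014: ["Accrue month end expense", "Payroll accrual for June"],
    2015: ["Payroll accrual for July", "Prepaid insurance true up"],
}

# counts[year][position][letter] = how often that first letter appears at that word position
counts = {year: defaultdict(lambda: defaultdict(int)) for year in entries}

for year, descriptions in entries.items():
    for text in descriptions:
        words = [w.lower() for w in re.split(r"[^A-Za-z]+", text) if w]
        for position, word in enumerate(words[:10], start=1):  # first 10 words only
            counts[year][position][word[0]] += 1

# 10 word positions x 26 letters x 2 years = the 520 data points charted in Figure 2.
for year in sorted(counts):
    for position in sorted(counts[year]):
        for letter, count in sorted(counts[year][position].items()):
            print(year, position, letter, count)
```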
By visually analyzing a chart of all letters (and spinning the chart in Excel using its 3D Rotation feature), I quickly found myself researching about 15 deviations (I specifically note three in Figure 2) out of the possible 520 or roughly 3 percent of the data points.
This led me to investigate the words associated with the changing letters and word order. For example, the letter "p" in the 2015 first-word occurrences in the journal descriptions (data point No. 1 in Figure 2) was more than double that of 2014 due to an increasing level of "Payroll" accruals. Letter analytics enabled me to quickly detect deviations and showed that this company's journal descriptions displayed a relative letter fingerprint 97 percent of the time between 2014 and 2015.
This isn't the first time letter frequency has been useful. Julius Caesar used his own system to encode messages: he would shift each letter of a message three places down the alphabet ("a" would become "d," for example) and give the shift key to the recipient. (See the Caesar Cipher video from Khan Academy.)
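For the curious, a three-letter Caesar shift takes only a few lines of Python; the function name and message below are just for illustration.

```python
def caesar_shift(message, shift=3):
    """Shift each letter of the message down the alphabet, wrapping z back to a."""
    result = []
    for ch in message.lower():
        if ch.isalpha():
            result.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            result.append(ch)  # leave spaces and punctuation alone
    return "".join(result)

print(caesar_shift("attack at dawn"))  # prints "dwwdfn dw gdzq"
```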
Linguists also use letter frequency analysis as an elementary technique for identifying a given language based on its characteristic letter distribution.
I used the journal entry work for the client to develop a quick "My Law" of letters based on the first letter of each word, which is useful when I'm comparing an organization's data from one period to the next. Now, I just needed to find a worthy database of the English language to act as a benchmark over time, similar to the classic lists of numbers used to calculate Benford's Law.
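Here's a minimal Python sketch of such a period-to-period "My Law" of letters: it computes each first letter's share of occurrences in two periods and flags letters whose share more than doubled. The threshold and sample descriptions are illustrative assumptions, not part of the author's stated method.

```python
import re
from collections import Counter

def first_letter_freq(descriptions):
    """Return each first letter's share of all first-letter occurrences."""
    letters = Counter()
    for text in descriptions:
        for word in re.split(r"[^A-Za-z]+", text):
            if word:
                letters[word[0].lower()] += 1
    total = sum(letters.values())
    return {letter: count / total for letter, count in letters.items()}

# Illustrative descriptions for two periods (not actual client data).
prior = first_letter_freq(["Accrue month end expense", "Reverse prior accrual"])
current = first_letter_freq(["Payroll accrual", "Payroll tax accrual", "Payroll expense accrual"])

for letter in sorted(set(prior) | set(current)):
    before = prior.get(letter, 0.0)
    after = current.get(letter, 0.0)
    # Flag letters whose share of first letters more than doubled (threshold is an assumption).
    if before and after / before > 2:
        print(f"'{letter}' jumped from {before:.1%} to {after:.1%} of first letters")
    elif not before and after:
        print(f"'{letter}' is new this period at {after:.1%} of first letters")
```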
After some research, I obtained the benchmark of letters from the Corpus of Contemporary American English (COCA), the largest available balanced corpus of American English. Mark Davies, Ph.D., professor of linguistics at Brigham Young University, created the corpus, which contains more than 450 million words of text, equally divided among spoken, fiction, popular magazine, newspaper and academic texts. Because of its design, it's perhaps the only corpus of English that's suitable for looking at current, ongoing changes in the language.
(For more information on the data validity of the corpus, see The Corpus of Contemporary American English as the first reliable monitor corpus of English, by Mark Davies, Volume 25, Issue 4, 2010, Literary and Linguistic Computing.)
I analyzed the COCA data for the years 1990 through 2011 and then examined the relative change in first- and last-letter ranks over time. As shown in the simplified rank example in Figure 3 below, the first letters have been consistent over this 22-year period, with many letters displaying no change from 1990 to 2011. (Please note the far-left ranking is the average rank over all 22 years; then, moving to the right, the rank for each year is displayed for comparative review.)
(Figure 3: Rank by year of first letter in the Corpus of Contemporary American English.)
When I developed a rank and frequency percentage of COCA's 450 million words from the past 22 years, a relative pattern of first, last, first-two and last-two letters emerged. While this might not seem significant for understanding the volume of potential words, keep in mind that the average word length in COCA was 10 letters, which masks the number of occurrences of words shorter than that. As the calculations in Figure 4 below show, 90 percent of COCA's word occurrences were eight or fewer letters long; when you look at the first two and last two letters of these words, you arrive at 80 percent of English-language word occurrences.
(Figure 4: COCA's word occurrences.)
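Here's a minimal Python sketch of the four letter measures discussed above (first letter, last letter, first two letters and last two letters of each word). The percentages it produces would be compared against a benchmark such as COCA; the sample sentence is an illustrative assumption.

```python
import re
from collections import Counter

def letter_measures(text):
    """Frequency of first, last, first-two and last-two letters across the words in a text."""
    words = [w.lower() for w in re.split(r"[^A-Za-z]+", text) if w]
    total = len(words)
    measures = {
        "first letter": Counter(w[0] for w in words),
        "last letter": Counter(w[-1] for w in words),
        "first two letters": Counter(w[:2] for w in words),
        "last two letters": Counter(w[-2:] for w in words),
    }
    return {name: {key: count / total for key, count in counter.items()}
            for name, counter in measures.items()}

# Illustrative sample; real benchmarks would come from a corpus such as COCA.
sample = "Accrue month end payroll expense and reverse prior month accrual"
for name, freqs in letter_measures(sample).items():
    top_three = sorted(freqs.items(), key=lambda item: -item[1])[:3]
    print(name, top_three)
```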
Here are some relevant takeaways from my analysis:
Once I completed the analysis and worked through many additional case studies (detailed further in my research paper for the International Institute for Analytics), I coined the term "Letter Analytics" and further defined the Lanza Approach to Letter Analytics (LALA) as follows:
LALA: Focuses on swiftly identifying word deviations by applying the letter frequency rates of (a) the English language and (b) prior-period letter occurrences as expected benchmarks when analyzing the data set at hand. LALA primarily relies on four measures, given their consistent nature over time: the first letter, last letter, first two letters and last two letters of each word.
In the second and final article in this series in the May/June issue of Fraud Magazine, I'll apply these letter analytic techniques to a variety of examples including keyword searches and predictive analysis using past text benchmark data.
Rich Lanza, CFE, CPA, CGMA, has nearly 25 years of audit and fraud detection experience with specialization in data analytics and cost recovery efforts. The ACFE awarded Lanza the Outstanding Achievement in Commerce honor for his research on proactive fraud reporting. He has written eight books, co-authored 11 others and has more than 50 articles to his credit. Lanza wrote the "Fear Not the Software" column in Fraud Magazine for six years. He consults with companies to integrate continuous monitoring of their data for improved process understanding, fraud detection, risk reduction and increased cash savings. His website is: richlanza.com. His email address is: rich@richlanza.com.
Some of the information here is contained in different forms in the research paper, The Lanza Approach to Letter Analytics to Quickly Identify Text Analytic Deviations, by Rich Lanza, Sept. 24, 2015, and the research brief, Pinpointing Textual Change Swiftly Using Letter Analytics, by Rich Lanza, September 2015, both from the International Institute for Analytics website, used with permission. — ed.