
In part 1 of this article, in the March/April issue, we moved from our number-focused tendencies to a new perspective: investigating possible fraud in text, which accounts for the majority of data. Via letter analytics, a term I coined in my Lanza Approach to Letter Analytics (LALA)™, I provided a swift means to analyze texts in current and previous periods, compare them to an English-language benchmark and understand anomalies. The trick for speeding a review is to focus on the first and last few characters of words, which facilitates trending and isolates change more easily.
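As a concrete illustration of that trick (my own minimal Python sketch, not the tooling described in part 1), the few lines below tally the first two and last two characters of every word in a passage so the counts can be trended between periods or compared to a benchmark.

```python
import re
from collections import Counter

def letter_profile(text):
    """Tally the first two and last two characters of each word in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w[:2] for w in words), Counter(w[-2:] for w in words)

first_two, last_two = letter_profile("She loves you, yeah, yeah, yeah")
print(first_two.most_common(3))
print(last_two.most_common(3))
```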
Here we'll focus on specific research cases that apply letter analytics: (1) over time, to find deviations, (2) to surface patterns displayed by specific authors and (3) to improve the scope and speed of keyword search reviews.
I aim to show how fraud examiners can use text as easily as numbers in their analyses by turning the unstructured nature of text into structured sets of charts for visual identification of change. Structuring the text analytics allows fraud examiners to discover new truths within data more quickly, as we'll see in the first two case studies, or to explore data with a specific intent in mind, as demonstrated in the third, keyword-search case study.
I'm still in the process of analyzing text files from public companies that might indicate fraud. However, we can find deviations that indicate change, and thereby test my hypothesis, in almost any text. I chose: (1) an exhaustive list of British pop songs, (2) William Shakespeare's plays and (3) an exhaustive keyword dictionary.
In part 1, I discussed my initial research with LALA on the Corpus of American English Language to show how language hasn't changed much in the last 22 years, especially when we focus on the first few characters of words. After I analyzed American English, I turned my sights across the pond to determine if the Queen's English held similar patterns over time. (It does.)
I then applied LALA to a random list of British pop songs from 1960 to 1999 in an online database (Britburn Project) with an aim to determine if the song titles themselves could provide a historical perspective of changes over time. This list would be akin to a vendor masterfile, purchase order file or an invoice payment file.
After I downloaded and summarized the song title data from 1960 to 1999 (with permission from the Britburn Project), I identified nearly 13,000 unique words and more than 155,000 word occurrences. I categorized the words into two time frames: 1960 to 1979 and 1980 to 1999. Instead of focusing only on the first letter of each word, as I did in part 1 of this article, I developed a dashboard approach to review the first two letters of each word.
My objective in charting first-two-letter frequencies on the dashboard was to identify deviations over time between song titles from the 1960s and 1970s (blue in Figure 1 below) and those from the 1980s and 1990s (orange in Figure 1). What was surprising was how few changes there were between the time frames (red arrows) and how quickly I could review the entire data set of 155,000 word occurrences in one view for any deviations over time.
To improve the fit of the analysis, I analyzed each letter by frequency percentage so that a frequently occurring letter such as "T" wouldn't overwhelm a rarely occurring letter such as "X." By expanding the analysis to the first two letters within each leading letter, as I did in Figure 1 and Figure 2 (below), I could quickly identify deviations in the relative use of each two-letter combination.
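As an illustration of that normalization step (a sketch of the idea, not the author's dashboard code), the following pandas snippet tags each word occurrence with its first two letters and converts raw counts to within-period percentages. The column names and the tiny sample data are hypothetical.

```python
import pandas as pd

# Hypothetical layout: one row per song title with its release period.
songs = pd.DataFrame({
    "title": ["Love Me Do", "The Twist", "Thriller Remix", "Blue Monday Remix"],
    "period": ["1960-1979", "1960-1979", "1980-1999", "1980-1999"],
})

# Explode titles into one row per word occurrence, tagged with its first two letters.
words = (songs.assign(word=songs["title"].str.lower().str.findall(r"[a-z']+"))
              .explode("word"))
words["first_two"] = words["word"].str[:2]

# Frequency percentage of each two-letter combination within each period,
# so common letters such as "T" don't overwhelm rare ones such as "X".
freq = (words.groupby(["period", "first_two"]).size()
             .groupby(level="period")
             .transform(lambda s: 100 * s / s.sum())
             .unstack("period")
             .fillna(0))
print(freq.round(1))
```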
One such deviation was in the letters "RE" in the letter R chart in Figure 1. (See the red arrow in Figure 1 and the expanded view in Figure 2.) I drilled down to the words driving the trend and found that the word "Remix" jumped from three occurrences in 1960-1979 to 345 occurrences in 1980-1999. After more review, I found that song remixes began to show up more in the 1980s and 1990s with the advancement of pop and hip-hop music, which popularized "cutting" and "scratching" tracks to create collages of music from many songs. Furthermore, the technology to quickly make remixes simply didn't exist in the 1960s and 1970s, when much of recorded music was pressed on vinyl records that couldn't be re-recorded without expensive production equipment.
Here are some relevant bits from this analysis:
For more information on this research case, see the more detailed analysis of the Britburn song titles that I completed as part of a research paper for the International Institute for Analytics.
In this case, I focused on the plays of William Shakespeare. Authorship theories of his works range widely, from the possibility that he never wrote any of the texts to the hypothesis that many authors wrote under Shakespeare's name. I didn't intend to prove or disprove any of those theories; my analysis focused on comparative differences among the plays' writing that could indicate authorship by showing patterns. However, instead of only identifying deviations over time, I combined time with the categorization of the play in a three-dimensional review of the text: (1) the first two letters, (2) the century of authorship (1500s or 1600s) and (3) the type of play (comedy, history or tragedy).
In a business setting, this exercise is analogous to assessing a department's emails or process documents from this year compared with last, once you've established a baseline for authorship. If you then categorize the emails or process documents by a specific topic, you can see how they change over time, much like Shakespeare's plays.
I obtained the texts from "The Complete Works of William Shakespeare," created by Jeremy Hylton and operated by The Tech, a Massachusetts Institute of Technology (MIT) newspaper.
With some help from James Patounas, an experienced data scientist at Source One Management Services LLC, and his skills with the Python programming language, I was able to quickly "web scrape" the text data from the MIT pages and organize it for analysis. I directed my analysis at the first letter of each word, categorized by the type of play and the century in which it was written. (See Figure 3 below.)
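The scraping itself takes only a handful of lines. Here's a rough Python sketch of the kind of step involved, not Patounas' actual script: it assumes the standard requests and Beautiful Soup libraries, and the link patterns it looks for are assumptions about the site's layout that you'd adjust after inspecting the real HTML.

```python
import re
from collections import Counter

import requests
from bs4 import BeautifulSoup

BASE = "http://shakespeare.mit.edu/"  # The Complete Works of William Shakespeare (MIT)

# Gather per-play links from the home page (assumed "index.html" pattern).
home = BeautifulSoup(requests.get(BASE).text, "html.parser")
play_pages = {a.get_text(strip=True): a["href"]
              for a in home.find_all("a", href=True)
              if a["href"].endswith("index.html")}

# Pull the plain text of one play (assumed "full.html" full-text path)
# and count first-letter frequencies.
title, href = next(iter(play_pages.items()))
full_url = BASE + href.replace("index.html", "full.html")
text = BeautifulSoup(requests.get(full_url).text, "html.parser").get_text(" ")
words = re.findall(r"[a-z']+", text.lower())

first_letters = Counter(w[0] for w in words)
print(title, first_letters.most_common(5))
```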
Given the difficulty of understanding variances by first letter alone, even when categorized by play type and time frame, I needed to assess the first two letters in relation to the specific type of play and century so we could see deviations more quickly. To understand why this is the case, let's turn to a couple of stratifications and some general statistics.
On the face of it, each letter would seem to have a 3.85 percent chance (100 percent/26) of being used, but that's not correct because the words in Shakespeare's plays aren't evenly distributed among the 26 letters of the English alphabet. As we can see in Figure 4 (below), the letter T (highlighted in orange) began 14 percent of Shakespeare's word occurrences, while 11 letters (E, G, J, K, P, Q, R, U, V, X and Z) each began less than 3 percent; even combined, those 11 letters (highlighted in blue in Figure 4) began fewer word occurrences than T alone.
Furthermore, nearly 80 percent of word occurrences in Shakespeare's plays (fifth column in Figure 5 below, highlighted in orange) come from a little more than 3 percent of the unique words (931 words in strata 03, 04 and 05 out of all 27,682 unique words, as seen in Figure 5). So, if we focused on top words only, we would be looking at just 3 percent of the unique words in Shakespeare's plays and missing the small trends or deviations that exist in the remaining 97 percent.
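A stratification like Figure 5's is just a bucketing of unique words by how often they occur. The pandas sketch below is a hypothetical reconstruction (the article doesn't define the strata boundaries), so both the occurrence cutoffs and the toy word counts are illustrative only.

```python
import pandas as pd

# Toy occurrence counts per unique word; in practice, build this from the scraped plays.
word_counts = pd.Series({"the": 950, "and": 800, "i": 700, "love": 120,
                         "crown": 45, "remuneration": 3,
                         "honorificabilitudinitatibus": 1})

# Bucket unique words by occurrence volume; these cutoffs are illustrative.
strata = pd.cut(word_counts, bins=[0, 2, 10, 50, 500, float("inf")],
                labels=["01", "02", "03", "04", "05"])

summary = word_counts.groupby(strata).agg(unique_words="count", occurrences="sum")
summary["pct_of_occurrences"] = 100 * summary["occurrences"] / summary["occurrences"].sum()
print(summary)
```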
I knew that summarizing the text data by whole words alone would skew the analysis and mask the true set of deviations in the data set under review. Summarizing by words usually leads to focusing on many "stop words," words so common in any sentence that they explain nothing useful about the underlying meaning of the text. As seen in Figure 6 (below), the top 10 words, which make up the entirety of strata 05 in Figure 5 (above), are all common stop words.
The LALA approach, now segmented by time frame (century) and play type (comedy, history or tragedy), allows for visual identification of 30 deviations (see red arrows in Figure 7 below) out of more than 1,700 letter/century/play-type combinations, about 1.8 percent of the total. That lets us ignore the other 98.2 percent of combinations, which look similar to the previous time frame, and focus instead on the deviations.
When using the LALA approach, we don't need to remove stop words from the review. They can be part of the analysis because they might identify useful themes. As we see in Figure 8 below, the first letters "TH," which begin the word "the" (the most frequently occurring word), are isolated in the analysis. Note a small deviation between the 1500s and 1600s in the history plays; Shakespeare wrote nine historical plays related to kings in the 1500s and only one in the 1600s: "Henry VIII."
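The three-dimensional view itself is a grouped frequency table with a flag for large shifts between centuries. The sketch below is a hypothetical reconstruction rather than the article's actual dashboard: it assumes a word-occurrence table with "first_two," "century" and "play_type" columns and flags combinations whose within-group share moves by more than an arbitrary threshold.

```python
import pandas as pd

# Hypothetical word-occurrence table built from the scraped plays.
occ = pd.DataFrame({
    "first_two": ["th", "th", "ki", "ki", "lo", "lo"],
    "century":   ["1500s", "1600s", "1500s", "1600s", "1500s", "1600s"],
    "play_type": ["history", "history", "history", "history", "comedy", "comedy"],
    "count":     [900, 120, 300, 20, 250, 260],
})

# Share of each first-two-letter combination within its century/play-type group.
occ["share"] = occ.groupby(["century", "play_type"])["count"].transform(
    lambda s: 100 * s / s.sum())

# Compare centuries side by side and flag large shifts (the threshold is arbitrary).
wide = occ.pivot_table(index=["play_type", "first_two"], columns="century",
                       values="share")
wide["deviation"] = (wide["1600s"] - wide["1500s"]).abs() > 5
print(wide.round(1))
```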
After I further reviewed the 30 deviations (red arrows in Figure 7), I developed a few themes from the text:
Here are some summary points for research case No. 2:
Fraud examiners commonly use textual analytic techniques to identify frequency rates of selected keywords in sets of text data. They look for Foreign Corrupt Practices Act trigger words such as "gift" or "facilitation payment" and financial statement trigger words such as "correct," "plug" or "cookie jar."
Fraud examiners can go further and align text to a dictionary of the entire English language to identify specific positive ("pleased," "agreed" etc.) or negative ("anguish," "fear" etc.) words to help calculate subjects' emotive moods.
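A minimal version of that mood scoring is a dictionary lookup. In the sketch below, the two word sets are tiny illustrative stand-ins for a full positive/negative lexicon, and the scoring rule is my own simplification.

```python
import re

# Tiny stand-ins for a full positive/negative word lexicon.
POSITIVE = {"pleased", "agreed", "confident"}
NEGATIVE = {"anguish", "fear", "concern"}

def mood_score(text):
    """Return counts of positive and negative words and a simple net score."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return {"positive": pos, "negative": neg, "net": pos - neg}

print(mood_score("We are pleased with the results, despite some fear about Q4."))
```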
To do this for yourself, first compile a list of focused words and phrases for the case. To accelerate this process, see the 4,000-plus words in the Key Words Survey released by AuditNet® in 2015. Then search the data with specialized tools, such as the Microsoft Excel macro add-in (with an accompanying video) for finding keywords and phrases at AuditSoftwareVideos.com.
In Excel, select a list of keywords in one column and a description field (e.g., travel and entertainment business purpose) in another, and the macro will extract all rows with matching keywords and create a summary of the number of times it detects each keyword in the data file.
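If you'd rather script that step, here's a rough pandas equivalent of what such a macro does (my own sketch, not the AuditSoftwareVideos.com add-in): it flags rows whose description contains any keyword and tallies hits per keyword. The column names and sample rows are hypothetical.

```python
import pandas as pd

keywords = ["gift", "facilitation payment", "plug", "cookie jar"]

# Hypothetical transaction file with a free-text business-purpose column.
data = pd.DataFrame({"description": [
    "Gift baskets for client holiday party",
    "Plug to reconcile accrual difference",
    "Routine office supplies",
]})

desc = data["description"].str.lower()

# Rows containing at least one keyword, plus a per-keyword hit count.
hits = {kw: desc.str.contains(kw, regex=False) for kw in keywords}
matched_rows = data[pd.DataFrame(hits).any(axis=1)]
summary = pd.Series({kw: int(mask.sum()) for kw, mask in hits.items()}, name="hits")

print(matched_rows)
print(summary)
```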
Most organizations limit their searches to 50 to 100 keywords because the process consumes so much time. However, LALA allows you to summarize the matched results of thousands of keywords. The trick to reducing analysis time is to summarize the keyword results by the first and last letters to quickly gain analytic coverage of the population. Deviations should be few for a company over time because, based on the research in this article, its usage of keywords should stay consistent.
To calculate first letters on the results from the AuditSoftwareVideos.com Excel macro, use the LEFT( ) function as shown in Figure 10 below. The LEFT( ) function returns the leftmost characters of the matching word; in the figure, it returns the first two letters. (The red arrow shows how we obtain the first two letters by including the numeral "2" after the comma in the function.) Once you've calculated these, you can summarize the keyword results with a Pivot Table in Excel and chart them with the associated Pivot Chart feature.
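The same LEFT-plus-Pivot-Table step looks like this in pandas. It's a sketch under assumed column names ("keyword" and "period") for the matched results, not the Excel workflow itself.

```python
import pandas as pd

# Hypothetical matched-keyword results from the search step.
results = pd.DataFrame({
    "keyword": ["gift", "gratuity", "plug", "plugging", "gift"],
    "period":  ["2014", "2015", "2014", "2015", "2015"],
})

# Equivalent of Excel's LEFT(keyword, 2): take the first two letters.
results["first_two"] = results["keyword"].str[:2]

# Equivalent of the Pivot Table: hit counts by first two letters and period.
pivot = results.pivot_table(index="first_two", columns="period",
                            values="keyword", aggfunc="count", fill_value=0)
print(pivot)
```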
Fraud examiners can use new textual analytics applications to find anomalies in data such as vendor masterfiles, purchase orders and invoice payments, but the tools are often too complex for non-statisticians. However, my approach, using letter analytics via LALA, simplifies and accelerates the process of identifying trends that could indicate fraud by charting and calculating statistics around first-letter and first-two-letter changes in text data.
We can use letter analytics via LALA to develop baselines in current periods under review and use them as guides to help discover new trends in data, especially when we categorize data using a third dimension.
We can use letter baselines to predict future outcomes and research any deviations. We generally find fraud in the small changes, and isolating specific letter patterns surfaces them more readily than trying to track more unwieldy word-level changes.
My research is preliminary, but it shows promising possibilities for detecting fraud in the words.
Rich Lanza, CFE, CPA, CGMA, has nearly 25 years of audit and fraud detection experience with specialization in data analytics and cost recovery efforts. The ACFE awarded Lanza the Outstanding Achievement in Commerce honor for his research on proactive fraud reporting. He has written eight books, co-authored 11 others and has more than 50 articles to his credit. Lanza wrote the "Fear Not the Software" column in Fraud Magazine for six years. He consults with companies to integrate continuous monitoring of their data for improved process understanding, fraud detection, risk reduction and increased cash savings. His website is: richlanza.com. His email address is: rich@richlanza.com.
Some of the information here is contained in different forms in the research paper, The Lanza Approach to Letter Analytics to Quickly Identify Text Analytic Deviations, by Rich Lanza, Sept. 24, 2015, and the research brief, Pinpointing Textual Change Swiftly Using Letter Analytics, by Rich Lanza, September 2015, both from the International Institute for Analytics website, used with permission. — ed.