
Blazing a trail for the Benford's Law of words, part 2

Date: May 1, 2016

In part 2, the author applies his unique letter analytic techniques to a variety of examples, including keyword searches and predictive analysis using benchmark data. The aim? Finding fraud in the words.

In part 1 of this article in the March/April issue, we transitioned from our number-focused tendencies to the new perspective of investigating for possible fraud in text, which accounts for the majority of data. Via letter analytics — a term I coined in my Lanza Approach to Letter Analytics (LALA)™ — I provided a swift means to analyze texts in current and previous periods, compare them to an English-language benchmark and understand anomalies. The trick for speeding a review is to focus on the first and last few characters in words, which facilitates trending and isolates change more easily.

Here we'll focus on specific research case studies that apply letter analytics: (1) over time to find deviations, (2) to patterns displayed by specific authors and (3) to improving the scope and speed of keyword search reviews.

I aim to show how fraud examiners can utilize text as easily as numbers in their analyses by turning the unstructured nature of text into structured sets of charts for visual identification of change. Structuring the text analytics allows fraud examiners to discover new truths within data more quickly — as we'll see in the first two case studies — or to explore data with a specific intent in mind, as demonstrated in the third keyword search case study.

I'm still in the process of analyzing text files from public companies that might indicate fraud. However, we can still find deviations in other texts that indicate change and test my hypothesis. I chose: (1) an exhaustive list of British pop songs, (2) William Shakespeare's plays and (3) an exhaustive keyword dictionary.

Research case No. 1: Identifying slight yet relevant deviations

In part 1, I discussed my initial research with LALA on the Corpus of American English Language to show how the language hasn't changed much in the last 22 years, especially when we're focusing on the first few characters of words. After I analyzed American English, our sights turned across the pond to determine if the Queen's English held similar patterns over time. (It does.)

I then applied LALA to a list of British pop songs from 1960 to 1999 in an online database (the Britburn Project) to determine whether the song titles themselves could provide a historical perspective on changes over time. This list would be akin to a vendor masterfile, purchase order file or an invoice payment file.

After I downloaded and summarized the song title data from 1960 to 1999 (with permission from the Britburn Project), I identified nearly 13,000 unique words and more than 155,000 word occurrences. I categorized the words into two time frames: 1960 to 1979 and 1980 to 1999. Instead of only focusing on the first letter of each word, as I did in part 1 of this article, I developed a unique dashboard approach to review the first two letters of each word.

My objective with the first-two-letter frequencies on the dashboard was to identify deviations over time between the 1960s through 1970s (blue in Figure 1 below) and the 1980s through 1990s (orange in Figure 1) song titles. What was surprising was how few changes there were between the time frames (red arrows) and how quickly I could review the entire data set of 155,000 word occurrences in one quick view for any deviations over time.

Figure 1: Deviations in British pop song titles

To improve the fit of the analysis, I analyzed each letter by frequency percent so a letter such as "T" (the most frequently occurring) wouldn't overwhelm the analysis of a less frequently occurring letter such as "X." By expanding the analysis to the first two letters within each particular letter, as I did in Figure 1 and Figure 2 (below), I was able to quickly identify deviations in the relative use of each two-letter combination.
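
To make that calculation concrete, here's a minimal Python sketch of the first-two-letter summarization and the within-letter normalization. The file name, column names and the flagging threshold are illustrative assumptions — they aren't part of the Britburn data or the dashboard itself.

```python
# A minimal sketch of the first-two-letter summarization described above.
# Assumes a CSV of song titles with "title" and "year" columns (hypothetical names).
import csv
import re
from collections import Counter

def first_two_letter_counts(titles):
    """Count first-two-letter prefixes across every word in the titles."""
    counts = Counter()
    for title in titles:
        for word in re.findall(r"[A-Za-z]+", title):
            counts[word[:2].upper()] += 1
    return counts

def relative_within_first_letter(counts):
    """Express each two-letter prefix as a percent of its first letter's total,
    so a frequent letter like T doesn't overwhelm a rare letter like X."""
    letter_totals = Counter()
    for prefix, n in counts.items():
        letter_totals[prefix[0]] += n
    return {p: 100.0 * n / letter_totals[p[0]] for p, n in counts.items()}

with open("britburn_titles.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

early = [r["title"] for r in rows if 1960 <= int(r["year"]) <= 1979]
late = [r["title"] for r in rows if 1980 <= int(r["year"]) <= 1999]

early_pct = relative_within_first_letter(first_two_letter_counts(early))
late_pct = relative_within_first_letter(first_two_letter_counts(late))

# Flag prefixes whose relative use shifted between the two time frames.
for prefix in sorted(set(early_pct) | set(late_pct)):
    diff = late_pct.get(prefix, 0.0) - early_pct.get(prefix, 0.0)
    if abs(diff) > 2.0:  # arbitrary threshold for illustration
        print(f"{prefix}: {early_pct.get(prefix, 0):.1f}% -> {late_pct.get(prefix, 0):.1f}%")
```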

Figure 2: Deviation in the letters 'RE'

One such deviation was in the letters "RE" in the letter R chart in Figure 1. (See the red arrow in Figure 1 and the expanded view in Figure 2.) I drilled down to the words driving the letter trend and found that the word "Remix" jumped from three occurrences in the 1960-1979 time frame to 345 occurrences in 1980-1999. After more review, I found that song remixes began to show up more in the 1980s and 1990s with the advancement of pop and hip-hop music, which popularized "cutting" and "scratching" tracks to create collages of music from many songs. Furthermore, the technology simply didn't exist in the 1960s and 1970s to quickly make remixes, when much of recorded music was pressed on vinyl records with no way to re-record them without expensive production equipment.

Here are some relevant bits from this analysis:

  • Summarizing two decades against another two decades of British pop-song titles led to few deviations when organizing the song titles by the first two letters.
  • The isolation of each letter in the analysis allowed 13,000 unique words to be trended in a manner that identified specific small trends over time more readily (348 total occurrences out of roughly 155,000 total words or a .2 percent deviation). The letters "RE" led to a visual trend identification of the historical theme that remixes began to occur in the 1980s.
  • Fraud in business information exists in the same small threads of deviations like the new occurrence of the word "Remix." This can be related to past frauds such as the MCI/WorldCom event when "Line Costs" were more frequently used in journal entry descriptions. (See Securities and Exchange Commission v. WorldCom Inc.)

For more information on this research case, see the more detailed analysis of the Britburn song titles I completed as part of a research paper for the International Institute for Analytics.

Research case No. 2: Authorship and writing deviations in Shakespeare's plays

In this case, I focused on the plays of William Shakespeare. Authorship theories of his works range widely, from the possibility that he never wrote any of the texts to the hypothesis that many authors wrote under Shakespeare's name. I didn't intend to prove or disprove any of those theories; my analysis focused on the comparative differences among the writing of the plays that could indicate authorship by showing patterns. However, instead of only identifying deviations over time, I focused on both time and the categorization of the play, with a three-dimensional review of the text by (1) first two letters, (2) the century of authorship (the 1500s or 1600s) and (3) the type of play — comedy, history or tragedy.

In a business setting, this exercise would be synonymous — after you find a baseline for authorship — with assessing a department's emails or process documents from this year compared with last year's. Then, if you categorize the emails or process documents by a specific topic, you can see how they change over time, similar to Shakespeare's plays.

I obtained the texts from "The Complete Works of William Shakespeare," created by Jeremy Hylton and operated by The Tech, a Massachusetts Institute of Technology (MIT) newspaper.

With some help from James Patounas, an experienced data scientist at Source One Management Services LLC — and his skills with the Python programming language — I was able to quickly "web scrape" the text data on the MIT pages and organize it for analysis. I directed my analysis at the first letter of each word, categorized by the type of play and the century when it was written. (See Figure 3 below.)
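
A simplified sketch of that scraping step follows. The URL pattern and HTML structure are assumptions about the MIT "Complete Works" site, and the actual script was more involved — this only shows the general idea.

```python
# A simplified sketch of the kind of web scraping used to pull the play texts.
# The base URL and the "full.html" link pattern are assumptions, not the
# production script.
import re
import urllib.request

BASE = "http://shakespeare.mit.edu/"

def fetch(url):
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

# Collect links from the index page, then pull each play's full-text page.
index_html = fetch(BASE)
play_links = re.findall(r'href="([^"]+/full\.html)"', index_html)

corpus = {}
for link in play_links:
    html = fetch(BASE + link)
    text = re.sub(r"<[^>]+>", " ", html)      # strip HTML tags
    words = re.findall(r"[A-Za-z']+", text)   # keep words for letter analysis
    corpus[link] = words

print(f"Scraped {len(corpus)} plays, {sum(len(w) for w in corpus.values()):,} word occurrences")
```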

Figure 3: Analysis of first letter

Given the difficulty of understanding variances by first letter alone — even when I categorized by Shakespeare's play types and the time frames when he wrote them — I needed to assess the first two letters in relation to the specific type of play and century so we could more quickly see deviations. To understand why this is the case, I'll turn to a couple of stratifications and general statistics.

On the face of it, it would seem that each letter has a chance of being used 3.85 percent (100 percent/26) of the time, but that's not correct because the words in Shakespeare's plays aren't evenly distributed among the 26 available letters in English. As we can see in Figure 4 (below), the letter T (highlighted in orange) occurred in 14 percent of Shakespeare's words, while 11 letters (E, G, J, K, P, Q, R, U, V, X and Z) each accounted for under 3 percent of word occurrence volume and, even combined, still accounted for fewer word occurrences than the letter T alone (highlighted in blue in Figure 4).

Figure 4

Furthermore, nearly 80 percent of word occurrences in Shakespeare's plays (fifth column in Figure 5 below, highlighted in orange) come from a little more than 3 percent of the unique words (931 words in strata 03, 04 and 05 out of all 27,682 potential words, as seen in Figure 5). So, if we focused on top words only, we would be looking at only 3 percent of the actual words in Shakespeare's plays and thereby missing the small trends or deviations that exist in the remaining 97 percent of words.

Figure 5
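
For readers who want to reproduce that kind of stratification, here's a small sketch of the calculation: rank the unique words by occurrence count and find how many of the top-ranked words are needed to cover a given share of all occurrences. The sample word list is purely illustrative; in practice the input would be the full scraped corpus.

```python
# Rank unique words by count and measure how concentrated the occurrences are.
from collections import Counter

def occurrence_concentration(words, share=0.80):
    """Return how many top-ranked unique words cover `share` of all occurrences,
    along with the total number of unique words."""
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values())
    running = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running / total >= share:
            return rank, len(counts)
    return len(counts), len(counts)

# Tiny illustrative word list; the real input would be all play words.
sample = "the king and the queen and the king of the land".split()
top_n, unique_n = occurrence_concentration(sample)
print(f"Top {top_n} of {unique_n} unique words cover 80% of occurrences")
```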

I knew that summarizing the text data by words alone would skew the analysis and mask the true set of deviations existing in the data set under review. Summarizing by words usually leads to focusing on many "stop words" — words that are common in any sentence and don't explain anything useful about the underlying meanings within the text. As seen in Figure 6 (below), the top 10 words, which make up the entire strata 05 in Figure 5 (above), are all common stop words.

Figure 6

The LALA approach — now isolated by time frame (century) and play type (comedy, history or tragedy) — allows for visual identification of 30 deviations (see red arrows in Figure 7 below) out of more than 1,700 letter/century/play-type combinations, representing about 1.8 percent of the total possible first-two-letter combinations. This allows us to ignore the 98.2 percent of combinations that are similar to the previous time frame and focus instead on the topic deviations.

Figure 7: Identification of 30 deviations
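
Here's a minimal sketch of that three-dimensional comparison — first-two-letter frequencies by century within each play type, with large shifts flagged. The data shapes, the simple overall-percentage normalization and the threshold are illustrative assumptions rather than the original analysis code (the charts in this article additionally normalize within each first letter).

```python
# A minimal sketch of the century-by-play-type comparison behind Figure 7.
from collections import Counter, defaultdict

def prefix_percentages(words):
    """Percent of word occurrences starting with each two-letter prefix."""
    counts = Counter(w[:2].upper() for w in words if w.isalpha())
    total = sum(counts.values())
    if not total:
        return {}
    return {p: 100.0 * n / total for p, n in counts.items()}

def flag_deviations(plays, threshold=2.0):
    """plays: list of (play_type, century, words). Flags prefixes whose share
    shifted by more than `threshold` percentage points between centuries."""
    by_group = defaultdict(list)
    for play_type, century, words in plays:
        by_group[(play_type, century)].extend(words)

    flagged = []
    for play_type in {pt for pt, _ in by_group}:
        early = prefix_percentages(by_group.get((play_type, 1500), []))
        late = prefix_percentages(by_group.get((play_type, 1600), []))
        for prefix in set(early) | set(late):
            diff = late.get(prefix, 0.0) - early.get(prefix, 0.0)
            if abs(diff) >= threshold:
                flagged.append((play_type, prefix, round(diff, 1)))
    return sorted(flagged)

# Tiny illustrative call; real input would be the scraped play texts.
demo = [
    ("history", 1500, "o for a muse of fire that would ascend".split()),
    ("history", 1600, "i come no more to make you laugh".split()),
]
print(flag_deviations(demo, threshold=5.0))
```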

When using the LALA approach, we don't need to remove stop words in the review. They can actually be part of the analysis because they might identify useful themes. As we see in Figure 8 below, the first letters "TH" that go into the making of the word "The" (the No. 1 occurring word) are isolated in the analysis. Note a small deviation between the 1500s and 1600s in the history plays; he wrote nine historical plays related to kings in the 1500s and only one in the 1600s: "Henry VIII."

Figure 8: The letters 'TH' isolated

After I further reviewed the 30 deviations (red arrows in Figure 7), I developed a few themes from the text:

  • Shakespeare appears to have used the same basic language throughout his works, so the only deviations over time and play types were for character names such as "JU" for Juliet, "RO" for Romeo, "BR" for Brutus or "RI" for Richard. The character names — though visually identifiable in the letter charts — were in the low hundreds as compared to the 825,000 total words.
  • The word "king" appeared in many forms (king, kings, kingdom, etc.): 1,961 times in the nine historical plays of the 1500s versus 225 times in the one history play, "Henry VIII," in the 1600s. This represented .3 percent of the total number of word occurrences. Yet, as we see in Figure 9 below, it's a noticeable 11 percent deviation for the letter K, which represented only about 1 percent of the total words. Therefore, by refining the analysis by century, type of play and the singular letter K, the deviation in "KI" readily points to the use of the word "king" as a deviation in the historical plays of the 1500s.

Figure 9

Here are some summary points for research case No. 2:

  • While it's nearly impossible to prove authorship from the analysis, it's uncanny to see the similar trends in letters between time frames and types of play (Figures 3 and 7). The figures suggest that one author probably wrote the plays, or, if multiple authors existed, they were at a minimum using the same corpus of language for the time frame 1591 to 1613, when Shakespeare is believed to have written his plays.
  • The main change in words was simply the use of character names and terms associated with them, leading to only 1.8 percent of the two-letter combinations showing noticeable visual change.
  • The LALA approach makes removing stop words unnecessary and, more importantly, we can use them as part of an analysis to identify possible themes.
  • My isolation of each letter before further analyzing the two-letter combinations led to an ability to detect deviations in lower-occurring letters (e.g., the letter X), rather than having high-occurrence letters (e.g., the letter T) dilute the analysis.
  • Unlike the British song titles analysis, I categorized the Shakespearean texts by century and play type, which led to a more refined 3D analysis versus the 2D song analysis. Given there were three play types, this segmented the two-letter combinations three ways versus the British song analysis, which focused only on time-frame change.

Research case No. 3: Categorizing text data using an extensive keyword dictionary

Fraud examiners commonly use textual analytic techniques to identify frequency rates of selected keywords in sets of text data. They look for Foreign Corrupt Practices Act trigger words such as "gift" or "facilitation payment" and financial statement trigger words such as "correct," "plug" or "cookie jar."

Fraud examiners can go further and align text to a dictionary of the entire English language to identify specific positive ("pleased," "agreed" etc.) or negative ("anguish," "fear" etc.) words to help calculate subjects' emotive moods.

To do this for yourself, first compile a list of focused words and phrases for the case. To accelerate this process, see the 4,000-plus words in the Key Words Survey released by AuditNet® in 2015. Then, search the data with specialized tools such as the Microsoft Excel macro add-in, with an accompanying video, for finding keywords and phrases at AuditSoftwareVideos.com.

In Excel, select a list of keywords in one column and a description (e.g., a travel and entertainment business purpose) in another column, and the macro will extract all rows with matching keywords and create a summary of the number of times it detects each keyword in the data file.
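
If you'd rather script that matching step, the following Python sketch shows the general idea. It isn't the AuditSoftwareVideos.com macro; the keyword list, file name and column name are placeholders for illustration.

```python
# A rough Python equivalent of the keyword-matching step — not the Excel macro.
import csv
import re
from collections import Counter

keywords = ["gift", "facilitation payment", "correct", "plug", "cookie jar"]
patterns = {k: re.compile(r"\b" + re.escape(k) + r"\b", re.IGNORECASE) for k in keywords}

matched_rows, hit_counts = [], Counter()
with open("transactions.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        description = row.get("description", "")
        hits = [k for k, pat in patterns.items() if pat.search(description)]
        if hits:
            matched_rows.append(row)   # rows with at least one matching keyword
            hit_counts.update(hits)    # tally of detections per keyword

# Summary of how many times each keyword was detected in the data file.
for keyword, n in hit_counts.most_common():
    print(f"{keyword}: {n}")
```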

Most organizations limit their searches to 50 to 100 keywords because the process consumes so much time. However, LALA allows you to summarize the matched results of thousands of keywords. The trick to reducing analysis time is to summarize the keyword results by the first and last letters to quickly gain analytic coverage of the population. Based on the research in this article, deviations should be few for a company over time because it should display the same usage of keywords from period to period.

To calculate first letters on the macro's output, use Excel's LEFT( ) function as shown in Figure 10 below. The LEFT( ) function returns the leftmost characters of the matching word — in the figure, the first two letters. (The red arrow shows how we obtain the first two letters by including the numeral "2" after the comma in the function.) Once you've calculated these, you can summarize the keyword results using a PivotTable in Excel and chart them using the associated PivotChart feature.

Figure 10
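
The same roll-up can be done outside Excel. Here's a hedged Python sketch of what the LEFT( ) column plus a PivotTable accomplish — grouping keyword match counts by their first two letters. The example counts are made up for illustration.

```python
# Roll keyword match counts up to their first two letters, as the LEFT(...,2)
# column and a PivotTable would. Example counts are invented for illustration.
from collections import Counter

def summarize_by_prefix(hit_counts, width=2):
    """Group keyword match counts by their first `width` letters."""
    prefix_totals = Counter()
    for keyword, n in hit_counts.items():
        prefix_totals[keyword[:width].upper()] += n
    return prefix_totals

example_counts = {"gift": 42, "gifts": 7, "plug": 3, "correct": 11, "cookie jar": 2}
for prefix, n in sorted(summarize_by_prefix(example_counts).items()):
    print(prefix, n)
```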

Detecting fraud in the words

Fraud examiners can use new textual analytics applications to find anomalies in data such as vendor masterfiles, purchase orders and invoice payments, but the tools are often too complex for non-statisticians. However, my new approach — using letter analytics via LALA — simplifies and accelerates the process of identifying trends that could indicate fraud by charting and calculating statistics around first-letter and first-two-letter changes in text data.

We can use letter analytics via LALA to develop baselines in current periods under review and use them as guides to help discover new trends in data, especially when we categorize data using a third dimension.

We can use letter baselines to predict future outcomes and research any deviations. We generally find fraud in the small changes, and isolating specific letter patterns is easier than trying to understand more unwieldy word changes.

My research is preliminary, but it shows promising possibilities for detecting fraud in the words.

Rich Lanza, CFE, CPA, CGMA, has nearly 25 years of audit and fraud detection experience with specialization in data analytics and cost recovery efforts. The ACFE awarded Lanza the Outstanding Achievement in Commerce honor for his research on proactive fraud reporting. He has written eight books, co-authored 11 others and has more than 50 articles to his credit. Lanza wrote the "Fear Not the Software" column in Fraud Magazine for six years. He consults with companies to integrate continuous monitoring of their data for improved process understanding, fraud detection, risk reduction and increased cash savings. His website is: richlanza.com. His email address is: rich@richlanza.com.

Some of the information here is contained in different forms in the research paper, The Lanza Approach to Letter Analytics to Quickly Identify Text Analytic Deviations, by Rich Lanza, Sept. 24, 2015, and the research brief, Pinpointing Textual Change Swiftly Using Letter Analytics, by Rich Lanza, September 2015, both from the International Institute for Analytics website, used with permission. — ed.
