Finding the proverbial needle in a haystack is a major concern when analyzing electronic evidence obtained in a fraud examination. Organizations maintain huge amounts of information – from financial records to e-mails – all in electronic format.
As an investigation begins, the examiner should first be concerned with preserving digital evidence before anyone can alter or destroy it. Preserving digital evidence is achieved by acquiring a forensically sound copy of electronic media where evidence could reside. Using specialized forensic tools, the examiner can then search the forensic images for information that might be relevant to the case. Relevant information can then be extracted for further review by the examiners.
THROUGH A FILTER FINELY
With the large volumes of data collected in many investigations today, it’s essential we filter the data to a volume that we can review. It would make no sense for a fraud examiner to read through all of upper management’s e-mails in a kickback investigation or review all the documents stored on an organization’s network to identify links to a suspected fraudulent billing scheme.
To make the most of limited review time, one of the most efficient techniques is to construct a list of terms that can be used to search through digital evidence to identify the most relevant data. We can then review the documents that match the search terms to determine if they’re relevant.
SEARCH TERM SELECTION
Keyword lists that contain too many generic terms will yield many irrelevant documents. But if our keywords are too strict we might miss critical evidence.
Choose keywords carefully so that you don’t include terms that are too generic; they will likely produce a large volume of irrelevant documents that you’ll then need to review. We can’t treat all cases the same, so we need to work closely with all parties to establish a keyword list that meets the objectives of the investigation. (Regardless, be prepared to do some manual exploring in the culled evidence.)
To highlight the complexities and pitfalls of keyword selection, let’s examine a case involving theft of intellectual property by an employee. The investigators would like to identify all the documents found on the employee’s workstation that contain the term “confidential” to determine sensitive information the employee might have accessed. However, this organization includes an automatically generated disclaimer at the bottom of all its e-mails that contains the sentence, “The content of this message is CONFIDENTIAL.” Although the “confidential” keyword would meet the objective of identifying sensitive documents that contain this term, all e-mails bearing the disclaimer would also be flagged.
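One way to keep the disclaimer from flooding the results is to remove the known boilerplate before counting hits. The following is a minimal sketch in Python, not a feature of any particular forensic tool; the function name and sample messages are hypothetical, and it assumes the disclaimer appears exactly as quoted above.

```python
import re

# Assumed wording, taken from the example above. In practice the exact
# disclaimer text must be obtained from the organization.
DISCLAIMER = "The content of this message is CONFIDENTIAL."

def hits_outside_disclaimer(text, keyword="confidential", disclaimer=DISCLAIMER):
    """Count whole-word keyword hits after removing the known disclaimer."""
    # Remove every copy of the disclaimer sentence, ignoring case.
    body = re.sub(re.escape(disclaimer), " ", text, flags=re.IGNORECASE)
    # Count case-insensitive, whole-word matches in what remains.
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, body, flags=re.IGNORECASE))

# Hypothetical messages: one routine, one that really mentions the keyword.
routine = "Lunch at noon?\n\nThe content of this message is CONFIDENTIAL."
sensitive = ("Attached is the confidential pricing model.\n\n"
             "The content of this message is CONFIDENTIAL.")

print(hits_outside_disclaimer(routine))    # 0
print(hits_outside_disclaimer(sensitive))  # 1
```

A disclaimer that has been re-wrapped, translated, or edited by a mail gateway would slip past this exact-match approach, so the remaining hits still deserve a manual spot check.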
Short words, such as abbreviations, might produce thousands of search hits because they can appear by chance in random text patterns, such as those found in remnants of deleted documents or in binary system files on the computer system.
Also, certain search terms might correspond to technical terms found on computer systems. For example, a fraud examiner conducting an investigation into prescription fraud might wish to search for the word “script” when examining a computer seized as part of an ongoing investigation. However, the word “script” turns up often in the code that’s used to generate a webpage. CNN’s home page contains that term more than 50 times, and every time a user visits that page, a copy of it is downloaded to the browser cache stored on the computer’s hard drive. So, when searching the computer for the term script, every webpage – or remnant of such page – that contains the term script would be identified by the search tool.
Furthermore, short words might be “embedded” in other terms. For example, if we’re searching for the word “car” as part of a scheme involving the use of company or rental cars, the term would flag documents containing “North Carolina,” “South Carolina,” “carriage,” “carries,” “Carnegie Hall,” and thousands more.
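The difference between a plain substring search and a whole-word search can be sketched with Python's standard `re` module. The helper names and the sample sentence below are illustrative only, and the `\b` word-boundary anchor is one common way to express "whole word" in a regular expression.

```python
import re

def substring_hits(term, text):
    """Every occurrence of the term, including inside longer words."""
    return re.findall(re.escape(term), text, flags=re.IGNORECASE)

def whole_word_hits(term, text):
    """Only occurrences where the term stands alone as a word."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.findall(pattern, text, flags=re.IGNORECASE)

sample = "He carries the rental car keys from North Carolina to Carnegie Hall."

print(len(substring_hits("car", sample)))  # 4
print(whole_word_hits("car", sample))      # ['car']
```

The trade-off is that a strict whole-word match would also skip "cars," so plural and other wanted variants need to be listed explicitly or handled with stemming.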
NUMBERS AND PATTERNS
Search terms might also include numbers and patterns of characters when searching for documents containing bank account information, invoices, phone numbers, IP addresses, and shipment tracking numbers, among others. A single pattern of characters can stand in for many individual search terms. In an investigation of identity theft or anomalous transactions, the pattern “####-####-####-####,” in which “#” represents any digit (0-9), can represent a 16-digit credit card number separated by hyphens.
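In tools that accept regular expressions, the “####-####-####-####” pattern might be written as shown in this sketch. The lookarounds are an added assumption: they keep the pattern from matching inside a longer run of digits and hyphens. The sample text uses a widely published test card number rather than real data.

```python
import re

# Four groups of four digits separated by hyphens.
# (?<![\d-]) and (?![\d-]) reject matches embedded in longer digit strings.
CARD_PATTERN = re.compile(r"(?<![\d-])\d{4}-\d{4}-\d{4}-\d{4}(?![\d-])")

def find_card_like_numbers(text):
    """Return every substring shaped like a hyphenated 16-digit number."""
    return CARD_PATTERN.findall(text)

sample = ("Charge 4111-1111-1111-1111 today; "
          "ref 12345-6789-0123-4567-8 is an order id.")

print(find_card_like_numbers(sample))  # ['4111-1111-1111-1111']
```

A shape match like this only says the text looks like a card number; numbers written with spaces or with no separators would need additional patterns.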
We can also use “stemming” – searching for a stem (or root word) to find multiple variations. For example, the stem “fish” would also find “fishing,” “fishery,” “fisheries,” “fisherman,” etc. Another technique is proximity searching: looking for documents in which two or more matching terms occur within a specified distance of each other. This method increases the relevance of search hits and allows the examiner to use terms that might have been too generic on their own but are still highly relevant and generate few false positives when placed together. Getting back to our prescription fraud investigation, we could specify that we are looking for the terms “OxyContin,” “Valium,” etc. within five words of “script” to get relevant results.
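A proximity search can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not how any specific forensic product works: text is split into lowercase alphanumeric words, distance is counted in words, and the anchor is matched by prefix as a crude stand-in for stemming. The function names and sample documents are hypothetical.

```python
import re

def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def within_distance(text, anchor, terms, max_words=5):
    """Return (term, anchor_word) pairs found within max_words of each other.

    The anchor is matched by prefix, so "script" also matches "scripts".
    """
    words = tokenize(text)
    wanted = {t.lower() for t in terms}
    anchor_pos = [i for i, w in enumerate(words) if w.startswith(anchor.lower())]
    term_pos = [i for i, w in enumerate(words) if w in wanted]
    return [(words[t], words[a])
            for t in term_pos for a in anchor_pos
            if abs(t - a) <= max_words]

drugs = ["OxyContin", "Valium"]
doc_a = "Can you get me another script for OxyContin before Friday?"
doc_b = ("The page failed because a script tag was missing. Much later, after "
         "several unrelated paragraphs, someone mentioned Valium.")

print(within_distance(doc_a, "script", drugs))  # [('oxycontin', 'script')]
print(within_distance(doc_b, "script", drugs))  # []
```

The second document contains both terms but is not flagged, because they sit twelve words apart. Note that the prefix match does not catch “prescription,” which would need its own entry in the keyword list.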
CORPORATE CULTURE
We might also need to search for corporate lingo – words derived from internal acronyms or inside language that only employees might use to describe elements specific to the organization.
Finally, be aware that employees might give an innocuous term a new meaning to disguise the real subject of a conversation. I was recently involved in an investigation in which individuals delivered what they called “boxes of chocolates” – briefcases containing payoff cash.
‘DEDUPLICATION’ OF SEARCH RESULTS
In large-scale investigations, there’s a high likelihood that we’ll identify duplicate documents and e-mails. Software tools allow us to automatically “deduplicate” the results by removing the redundant copies.
Deduplication might speed up the review process, but we should remember that knowing who held which piece of information, and when, might also be critical to the investigation. The examiner might therefore elect to “dedupe” results within each involved individual’s data rather than across the whole collection.
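The choice between removing duplicates across the whole collection and removing them separately for each individual can be illustrated with content hashes. This sketch assumes duplicates are byte-for-byte identical and uses SHA-256 from Python's standard library; the custodian names, file names, and the `dedupe` function are hypothetical.

```python
import hashlib

def dedupe(documents, per_custodian=True):
    """Keep the first copy of each document.

    documents: iterable of (custodian, name, content_bytes) tuples.
    per_custodian=True keeps one copy per individual, preserving who held what;
    per_custodian=False keeps one copy for the whole collection.
    """
    seen = set()
    kept = []
    for custodian, name, content in documents:
        digest = hashlib.sha256(content).hexdigest()
        key = (custodian, digest) if per_custodian else digest
        if key not in seen:
            seen.add(key)
            kept.append((custodian, name))
    return kept

docs = [
    ("alice", "invoice.pdf", b"Invoice 1001"),
    ("alice", "invoice_copy.pdf", b"Invoice 1001"),
    ("bob", "fwd_invoice.pdf", b"Invoice 1001"),
]

print(dedupe(docs, per_custodian=True))   # [('alice', 'invoice.pdf'), ('bob', 'fwd_invoice.pdf')]
print(dedupe(docs, per_custodian=False))  # [('alice', 'invoice.pdf')]
```

With the collection-wide setting, the fact that the second individual also held the invoice disappears from the review set, which is exactly the information the paragraph above warns might matter.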
A TEAM EFFORT
The keyword selection process should be a joint effort by those involved in the investigation. This will ensure that adequate terms are selected and that they meet the objectives of the investigation. These lists will evolve as examiners discover new information via document analysis and posit new search terms.
NEXT ISSUE
In the next column, we’ll examine how banking Trojans work.
Jean-François Legault is a senior manager with Deloitte’s Forensic & Dispute Services practice in Montreal, Canada.