Featured Article

Digging for golden evidence: Data mining and fraud examination

Date: January 1, 2005
read time: 17 mins

Fraud examiners need only basic computer skills and the help of systems administrators to mine for gold and find vital evidence for fraud examinations.  

A routine accounts payable audit uncovered a split purchase of expensive office furniture made with a travel and entertainment credit card. Further review indicated that Sam, a veteran employee, had been making many expensive purchases from a local supplier through several methods even though the company had a preferred national furniture vendor. Monica, a Certified Fraud Examiner, had strong suspicions that Sam might be involved in procurement fraud but lacked conclusive evidence.

To the uninitiated, a simple split purchase made with an improper card, or an invoice discrepancy might not have much meaning, but to a CFE skilled in data mining, every nook and cranny might be a treasure trove of valuable evidence. Even though she had only basic computing skills, Monica was able, with the help of her systems administrator, to search and analyze her firm's databases and discover the incriminating material she needed. Like Monica, you can put on your mining helmet and dig for gold. 1  

Business or nonprofit databases can be minimal in size or grow to several terabytes. Any of these data sources may hold hidden yet vital information you may be seeking during a fraud examination. No matter your background, you can become skilled at data mining. In fact, it's no longer an optional skill set.

Data mining has also been called database reporting, data analysis, and other similar names but generally can be defined as "the automated extraction of hidden predictive information from databases." The tools discussed here will assist with automated extraction.

With the explosion of databases that track customers, sales, and other information, data mining is now being used to increase revenues - through improved targeted marketing - and to reduce costs - through detecting and preventing waste and fraud.

As a fraud examiner beginning any investigation, you always have some pre-work: deciding what data you want to mine, for what period of time, and from what system. Before you use the actual mining tools, you need to identify the problem to be solved; identify, collect, and prepare the right data; interpret the data; and monitor the results. The real key to success, however, is a thorough understanding of your data.

Database basics
Some people think that extracting data from their system is as easy as pulling it up on their screens via the software. However, there are some important differences between retrieving what is called "transactional" data (such as customer account information) and data from robust data warehouses that contain historical data.

Typically, business applications store recent data in tables for quick and easy retrieval, but more complex or older data may be hard to fully extract and analyze because it's kept in separate tables that may not appear to be related. A database is an organized collection of information that can be easily accessed, managed, and updated. A database management system (DBMS) such as Access, FileMaker Pro, SQL Server, or Oracle provides the software tools you need to organize data in a flexible manner. It includes facilities to add, modify, or delete data from the database; ask questions (or queries) about the data stored in the database; and may have internal tools to produce reports summarizing selected contents.

A table is where the data is actually stored in the database. If you were to look at a table behind the scenes, it's similar to a spreadsheet because it's organized through the use of rows and columns. Individual records are represented horizontally and fields are shown vertically.

Relational databases organize data in multiple related tables. Relational databases are NOT "super" spreadsheets but have some similar functionalities. A relational database may be made up of hundreds of tables that organize different types of information. For instance, in a typical retail operation there are tables that keep track of customers and others that keep track of orders, but there's not a single table that contains all customer information. One table has basic customer information such as names, addresses, and phone numbers, another tracks customers' credit histories, and yet another tracks individual contact information such as the name of the contacts, the types of discussion, and customers' special requests. These tables are tied together using a key field such as a customer number.
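The customer-number key described above can be sketched with a toy relational database. This sketch uses Python's built-in sqlite3 module; the table and field names are invented for illustration.

```python
import sqlite3

# In-memory database with two related tables (names are hypothetical).
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE customers (cust_no INTEGER PRIMARY KEY, name TEXT, phone TEXT)")
cur.execute("CREATE TABLE credit (cust_no INTEGER, credit_limit REAL)")
cur.execute("INSERT INTO customers VALUES (1001, 'ABC Company', '555-0100')")
cur.execute("INSERT INTO credit VALUES (1001, 5000.00)")

# The key field (cust_no) ties the two tables together.
row = cur.execute(
    "SELECT c.name, cr.credit_limit "
    "FROM customers c JOIN credit cr ON c.cust_no = cr.cust_no"
).fetchone()
print(row)  # ('ABC Company', 5000.0)
```

Note that the customer's credit limit lives in exactly one table, consistent with the normalization rule discussed below.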

Designing databases is a detailed science that's outside the scope of this article. But suffice it to say that a database administrator has techniques that make it easy to pull up information quickly and keep "normalized" data in tables in specific locations. One rule of normalization dictates that data should be stored in only one place within a database. Imagine the confusion if a customer's credit limit was stored in two separate tables or could be updated by multiple processes. It would be hard to determine that customer's actual credit limit if the data didn't match exactly.

Transactional databases, which have relatively simple record structures, allow many users to read and make changes. It's easy to quickly pull up on a computer screen standard information on a transactional database such as a customer's order history or other business application. But it's difficult to reconstruct historical or trend information. Most entities' systems record virtually all transactions, but the data isn't typically located in one single, comprehensive location.

For mining purposes, data can be pulled from these transactional tables just as if they were stored in a single table by linking them together through fields that are common to both tables. Fields to be linked should be of the same data type, and the string/text fields should be the same length. These links are created through the use of keys.

By using data mining tools, users connect remotely to their database through a method of Open DataBase Connectivity (ODBC). After the data has been connected through one of the reporting tools and linked correctly, you'll need to narrow the data down to the information that will be most helpful. A query is a set of criteria that you specify to identify and retrieve data records that meet those criteria. Structured Query Language (SQL) is a standard interactive and programming language for obtaining information from databases and updating them. SQL allows users to access data in relational database management systems, such as Oracle, Sybase, Informix, Microsoft SQL Server, Access, and others by allowing users to describe the data that the user wants to see. Many database reporting tools assume you don't know SQL and write the SQL query for you.
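As a sketch of what a reporting tool generates behind the scenes, here is a minimal select query, again run through Python's sqlite3 module with invented table and field names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE invoices (vendor TEXT, inv_no TEXT, amount REAL)")
cur.executemany("INSERT INTO invoices VALUES (?, ?, ?)", [
    ("Acme Office", "A-101", 1200.00),
    ("Acme Office", "A-102", 450.00),
    ("Globex", "G-77", 9800.00),
])

# A query is simply a set of criteria: here, invoices of $1,000 or more.
big = cur.execute(
    "SELECT vendor, inv_no, amount FROM invoices "
    "WHERE amount >= 1000 ORDER BY amount DESC"
).fetchall()
print(big)  # [('Globex', 'G-77', 9800.0), ('Acme Office', 'A-101', 1200.0)]
```

When you build the same request in a wizard-driven tool, this is roughly the SQL it writes for you.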

Preparation for mining
Before starting your hunt for the gold, there are several things to be aware of that may affect both the length of your journey and its success. If you're trying to mine data from a large database, one of the most important things you should do is to attempt to locate the database schema. [This may also be called a data dictionary, entity relationship diagram (ERD), or database diagram.] You may want to work directly with your data administrator to save valuable time and better understand the data structure. The database schema may also be represented as a large chart that visually represents the relationships between fields and corresponding links between tables. It's important that the database schema includes field name, table name, data type, and any links. The schema will help you link tables correctly so that you'll end up with the right data in your report.

Other considerations include:  

  •  data availability and lead time needed for data access;
  •  integrity of the data source;
  •  tools and hardware that will be used to analyze the data;
  •  time and effort needed to prepare and scrub the data;
  •  understanding the data and its limitations, data labels and definitions, and the process the data relates to; and
  •  confirming data input and output accuracy and integrity.

    Also look beyond the easily accessible databases that appear to have everything you need but may contain only about six months of data. Your company may have "warehoused" multiple years of data in another location that could be quite useful.

    Typical data sources used in fraud investigations
Depending upon the type of fraud you're investigating, you may need to retrieve data related to all aspects of your subject's activities. You may begin with the main financial accounting system tables, but if you're investigating overtime fraud, vendor fraud, kickbacks, or sales fraud, you may need to access other systems to tell the whole story. Data sources that may be overlooked include physical access (card key) logs, phone logs, network login logs, time reporting records, Remote Access Server/Virtual Private Network (RAS/VPN) logs, vendor setup documents, visitor logs, pst files (e-mail storage files), and Web proxy logs. Typically, you'll need to request access to the databases, request data extracts from their owners, or run queries from those systems to place the appropriate information into a separate file. It's important to prepare the data to be linked with the other data sources by putting each input data source into a separate table. The data can be inserted into the respective file or table by one of the following methods:

  •  manual data entry;
  •  cut and paste from another application;
  •  import from MS Excel;
  •  import of fixed-width or delimited text files (comma, tab, etc.);
  •  import from other MS Access databases; and
  •  link via ODBC.

For instance, in a typical overtime fraud examination involving employees who work in an office environment, you'll need card key logs, RAS/VPN logs, network logs, operating system event logs, application event logs, phone records, hourly reporting records, and pst files (from which you can determine the first and last sent items each day). Once you retrieve these raw data files, you can start the mining process.

Data commonly takes many formats, and your first step is to understand what you've received. Typically data comes in the form of text files or Excel files that consist of simple columns and rows. Once a data set is extracted into its own file, name it descriptively (such as "ntlogsmjones.xls") so you can later find your original data. Although such files are easy to open, they're not easy to use for comparison, calculation, etc. Also, when you're working directly with the raw data source, you risk inadvertently altering the data, so it's important to choose an analysis tool that's read-only, or to view your data extract in read-only mode if you're unsure of your techniques. Additionally, make backup copies of your original raw data extracts to save the time and effort required to pull your data again if something should corrupt your data.
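A minimal sketch of those two precautions, the backup copy and the read-only read, using Python's standard library (file and field names are hypothetical):

```python
import csv
import shutil
import tempfile
from pathlib import Path

# Work in a temp directory with a hypothetical raw extract.
work = Path(tempfile.mkdtemp())
raw = work / "ntlogsmjones.csv"
raw.write_text("user,login_time\nmjones,2004-06-01 07:58\n")

# 1. Back up the original extract before touching it.
backup = work / "ntlogsmjones.bak.csv"
shutil.copy2(raw, backup)

# 2. Open the raw data in read-only mode ("r") so it can't be altered.
with raw.open("r", newline="") as f:
    rows = list(csv.DictReader(f))
print(rows[0]["user"])  # mjones
```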

    Tools that help you mine for gold
    Plenty of third-party reporting tools can help you analyze large amounts of data. But here we'll discuss just two, Crystal Reports and Access 2003, because they're easy to use, relatively inexpensive, and perform well.

    Crystal Reports
    Crystal Reports has been the gold standard for database reporting. It has the flexibility to connect to almost any data source with a user-friendly interface that makes the digging much easier. The end report is appealing because Crystal Reports can graph the data visually at the same time the query returns. It's a tenth-generation reporting tool originally developed by Seagate Software and now owned by Business Objects. Crystal Reports can connect to virtually any data source through ODBC or PC data connections. Most importantly, the connection is read-only when connecting through ODBC. This means that no matter what you do in your Crystal report, there's no possible way you can corrupt or change your data.

    Among other highlights, Crystal Reports:  

  •  costs less than $600 for version 10 Professional or Developer editions;
  •  can connect to virtually any data source in a read-only state;
  •  can map, chart, and graph at query time, presenting a visual representation on the fly;
  •  has a user-friendly graphical interface, similar to Office products;
  •  can be used stand-alone or as part of an enterprise-wide reporting solution that allows secure sharing of reports over the Web, scheduling, and alerting;
  •  has wizards to guide you through report building;
  •  can connect to multiple (non-similar) data sources at the same time and combine the data into one report; and
  •  produces a report that can be exported to a variety of formats.

    Crystal Reports allows you to create reports quickly by dragging your fields onto a blank report, group and sort based on database fields or calculated fields, and format a presentation quality report in a matter of minutes. The report can be quickly refreshed as data is added.

    Microsoft Access
Microsoft Access provides users with one of the simplest and most flexible database solutions on the market today. Regular users of Microsoft products will enjoy the familiar Windows "look and feel" as well as the tight integration with other Microsoft Office family products. An abundance of wizards lessens the complexity of administrative tasks. The three major Access components most database miners will require are tables, queries, and reports. Unlike Crystal Reports, MS Access allows you to update the source data and change the data through queries and manual data entry into tables. Be aware that when you're within MS Access, you can intentionally or accidentally change the data. Therefore, while Crystal Reports is purely a reporting and analysis tool, MS Access is actually a database application that can contain other features to control data rather than a simple connection to it. Because of this, it's important to understand that you may compromise the data's integrity by adding and updating data.

    Among other highlights, MS Access: 

  •  is inexpensive and included within the suite of MS Office products;
  •  allows analysis of data beyond the 65K-record limitation of MS Excel;
  •  has a user-friendly graphical interface, similar to the other Office tools;
  •  allows users to become familiar with relational database concepts without having to learn Structured Query Language (SQL); and
  •  lets users easily add additional data and/or queries without leaving the application.

    General hints on getting started

Regardless of the tool you select for data mining, you'll follow the same general process. In Crystal Reports you use the tool's interface to make choices regarding fields to include, grouping, sorting, record selection, formulas, summary fields (averages and totals), and charts. You then can add items such as text labels, page numbers, trend lines, highlighting, and drill-downs to make the report presentation quality for your readers. Behind the scenes, the interface builds and executes a select query for you. You don't need to know SQL, but if you do and would like to write your own queries, there are locations where your SQL can be inserted.

    In MS Access you have similar functionality but it's broken into the Query area and the Reports area. You have more options for working with data in the form of queries (Select and Make Table are the most popular) so you can first write your queries and then format the output into a report.
    Select queries allow you to select data from a table or combination of tables with records that meet your specified criteria (such as sales between 1/1/98 and 12/31/98 for the ABC Company).

    Make Table queries allow you to create a new table to store the output of a query according to the criteria you specify. When using this type of query you're duplicating data within the database, which both increases the database size and creates the possibility for confusion among the data sets.

    Regardless of the tool you're using, here are some first steps to mining your data:  

  •  Choose the fields that are most important to your investigation. Typically these are key dates and times, approvers, amounts, user IDs, and description fields.
  •  Sort your data on the most important date and currency fields to get an overall lay of the land.
  •  Group your data for a selected and meaningful period of time, such as month, quarter, year, week, or day.
  •  Summarize (count, total, average) numerical fields such as amount to get a general size of the transactions for the time period.
  •  Focus attention on transactions at either end of a range (very large, very small, late in the quarter, etc.).
  •  Group by entity such as customer, employee, cost center, vendor, etc.

Grouping and/or sorting can also highlight possible symptoms of fraud:

  •  transaction dates that are very old, or future dates;
  •  transaction values outside the normal range;
  •  payees or vendor names starting with blanks or unusual characters; and
  •  records with blank field values.
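The grouping and summarizing steps above can be sketched in a single query. This sketch uses Python's sqlite3 module with invented field names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE txns (vendor TEXT, txn_date TEXT, amount REAL)")
cur.executemany("INSERT INTO txns VALUES (?, ?, ?)", [
    ("Acme", "2004-01-05", 100.0),
    ("Acme", "2004-01-20", 300.0),
    ("Globex", "2004-02-11", 50000.0),  # outlier worth a closer look
])

# Group by vendor and month, then count, total, and average the amounts;
# sorting by total descending surfaces the extremes first.
summary = cur.execute(
    "SELECT vendor, substr(txn_date, 1, 7) AS month, "
    "COUNT(*), SUM(amount), AVG(amount) "
    "FROM txns GROUP BY vendor, month ORDER BY SUM(amount) DESC"
).fetchall()
print(summary[0])  # ('Globex', '2004-02', 1, 50000.0, 50000.0)
```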

    Working around common data analysis challenges
    There are several common challenges that can turn into problems for even the most experienced data miners when preparing their data for reporting. It's a best practice to review the following list with each new data set that you get because it will save you time when joining tables, summarizing data, and conducting other data mining activities. With a little data "massaging" and awareness of some of these items (converting field types, extracting fields that are completely blank, etc.) you can actually "prepare" the data for the report:

  •  improperly imported data field formats;
  •  querying and calculating numbers that have been imported as text or another format;
  •  querying and calculating dates and date ranges that have been imported as text or another format;
  •  assuming that because a query executes without error it has run correctly;
  •  not using control totals;
  •  trying to accomplish too much in a single query (in MS Access);
  •  dealing with large data sets;
  •  ensuring CSV files import correctly by removing commas from data fields;
  •  desired data that is only a portion of an existing field or is broken into multiple fields (such as MMDDYY);
  •  queries that are too complicated or long to perform singly;
  •  sharing database analysis results with other people;
  •  queries too numerous to easily keep track of (MS Access);
  •  compacting databases to save disk space;
  •  making sure you have sufficient hardware for analyzing large data sets, including a speedy processor, sufficient RAM to enhance processing speed and stability, and enough hard-drive storage capacity;
  •  not spot-checking query output against expected output from the originating table;
  •  pulling only the data you need by segmenting your request (such as by cost center or specific organization identifier) and avoiding the desire to "have it all"; and
  •  splitting raw data files using a text editor (MS Word) or another database product (SQL Server or ACL).
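One of the commonest items above, a date imported as text in MMDDYY form, can be converted with Python's standard datetime module (the field value shown is hypothetical):

```python
from datetime import datetime

def parse_mmddyy(raw: str) -> datetime:
    """Convert a text field like '013104' into a real date value."""
    return datetime.strptime(raw, "%m%d%y")

# Once converted, the value sorts and subtracts as a date, not as text.
d = parse_mmddyy("013104")
print(d.date())  # 2004-01-31
```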

    The following are simple ideas for queries to get you started. The query may need to be modified slightly depending upon your data and tool selected.

    Procurement fraud queries using either tool
Using the "Find Duplicates" functionality within the MS Access Query Wizard, follow the steps to identify cases where a vendor may have submitted the same invoice twice. Within Crystal Reports, users can find duplicates by sorting and grouping, then changing the fields used for sorting and grouping. (Group by vendor, then sort first by date, then by amount, etc.)

  •  Examine the data for same vendor, same amount, and same date (to catch a vendor submitting the same invoice but changing the invoice number).
  •  Examine the data for same vendor, same amount, and same invoice number (to catch a vendor submitting the same invoice but changing the date).
  •  Examine the data for same vendor and same amount, which will result in the largest data set.

After-the-fact (ATF) and split purchases

  •  Examine the data for split purchases (just under a cut-off dollar amount): multiple transactions on the same day to the same vendor within a few dollars of each other.
  •  For after-the-fact purchases, compare the purchase order date and the invoice date. Create an expression that calculates the number of days between the two dates.
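The duplicate-invoice tests above all reduce to a GROUP BY ... HAVING COUNT(*) > 1 query; here is a sketch of the broadest version (same vendor, same amount) with sqlite3 and hypothetical data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE ap (vendor TEXT, inv_no TEXT, inv_date TEXT, amount REAL)")
cur.executemany("INSERT INTO ap VALUES (?, ?, ?, ?)", [
    ("Acme", "A-101", "2004-03-01", 500.0),
    ("Acme", "A-999", "2004-03-15", 500.0),  # same vendor/amount, new invoice no.
    ("Globex", "G-1", "2004-03-02", 750.0),
])

# Same vendor and same amount appearing more than once: the broadest test.
dupes = cur.execute(
    "SELECT vendor, amount, COUNT(*) FROM ap "
    "GROUP BY vendor, amount HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # [('Acme', 500.0, 2)]
```

Adding inv_date or inv_no to the GROUP BY clause gives the narrower tests in the list above.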

    Expense report fraud queries using either tool  

  •  Import the list of expenditures that are within $1 of a key authorization threshold. (For example, if an original receipt is required for all employee expenses over $30, pull all transactions from $29.00 to $29.99.)
  •  Group by employee to see if certain employees routinely expense items just under the threshold, using the summary functionality (group by, count, and sum) to calculate the frequency and value of transactions just below the authorization threshold.
  •  Investigate employees with a significantly higher number of transactions just below a key authorization threshold, especially if the item descriptions are too vague to confirm, such as multiple entries for "Taxi."
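A sketch of the just-under-threshold test (sqlite3, invented names, and the $30 receipt threshold from the example above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE expenses (employee TEXT, descr TEXT, amount REAL)")
cur.executemany("INSERT INTO expenses VALUES (?, ?, ?)", [
    ("smith", "Taxi", 29.50),
    ("smith", "Taxi", 29.75),
    ("jones", "Lunch", 12.00),
])

# Count and total per employee for amounts within $1 below the $30 threshold.
under = cur.execute(
    "SELECT employee, COUNT(*), SUM(amount) FROM expenses "
    "WHERE amount >= 29.00 AND amount < 30.00 "
    "GROUP BY employee ORDER BY COUNT(*) DESC"
).fetchall()
print(under)  # [('smith', 2, 59.25)]
```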

    Payroll fraud queries using either tool  

  •  Import the list of all employees receiving salary for the last pay period.
  •  Import the list of all authorized, current employees. Then, using the "Find Unmatched" functionality in the Query Wizard, follow the steps to identify employees who are receiving salary but aren't on the list of authorized employees.
  •  Annualize the salary payments of the last payroll cycle for all employees, excluding expense report reimbursements, bonus payments, etc.
  •  Join to the personnel table on employee number.
  •  Compare the annualized salary calculation with the authorized salary amount.
  •  Investigate any discrepancies.
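The "Find Unmatched" step above is an anti-join: paid employees with no row in the authorized list. A sketch with sqlite3 and hypothetical tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE payroll (emp_no INTEGER, amount REAL)")
cur.execute("CREATE TABLE authorized (emp_no INTEGER, name TEXT)")
cur.executemany("INSERT INTO payroll VALUES (?, ?)", [(1, 2500.0), (2, 2600.0)])
cur.execute("INSERT INTO authorized VALUES (1, 'A. Smith')")  # emp 2 isn't authorized

# Left join, then keep only payroll rows with no match: possible ghost employees.
ghosts = cur.execute(
    "SELECT p.emp_no FROM payroll p "
    "LEFT JOIN authorized a ON p.emp_no = a.emp_no "
    "WHERE a.emp_no IS NULL"
).fetchall()
print(ghosts)  # [(2,)]
```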

    Queries for gaps in check register using either tool  

  •  Sort the transactions by the check number field, showing the vendor paid, date of payment, and amount paid.
  •  Look for missing check numbers in the sequence, or check numbers that don't comply with the standard numbering scheme.
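The gap test above can be sketched as a small Python function over a sorted list of check numbers (sample numbers are hypothetical):

```python
# List every check number missing between consecutive checks in the register.
def find_gaps(check_numbers):
    missing = []
    nums = sorted(check_numbers)
    for prev, nxt in zip(nums, nums[1:]):
        missing.extend(range(prev + 1, nxt))
    return missing

print(find_gaps([1001, 1002, 1005, 1006]))  # [1003, 1004]
```

Each missing number is a check to account for: voided, destroyed, or possibly diverted.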

    Queries for dormant account activity using either tool  

  •  Review activity in accounts that may not be actively used.
  •  Determine the date of the last transaction for each account by applying the Maximum function to the data entry date.
  •  Determine whether transactions occur only during a short time period and research the account's purpose.
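The Maximum-function step above corresponds to a MAX() aggregate per account; a sketch with sqlite3 and invented account names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE ledger (account TEXT, entry_date TEXT, amount REAL)")
cur.executemany("INSERT INTO ledger VALUES (?, ?, ?)", [
    ("1200-suspense", "2001-04-01", 90.0),
    ("1200-suspense", "2004-12-30", 4000.0),  # sudden activity after years dormant
    ("1000-cash", "2004-12-15", 120.0),
])

# The maximum entry date per account shows when each account was last used.
last_used = cur.execute(
    "SELECT account, MAX(entry_date) FROM ledger GROUP BY account ORDER BY account"
).fetchall()
print(last_used)  # [('1000-cash', '2004-12-15'), ('1200-suspense', '2004-12-30')]
```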

     Dig for gold but withhold quick conclusions

While taking the first step to learn a new tool may be a bit daunting, if you start with some simple scenarios you may quickly see the gold appear before your eyes. The data doesn't lie, although misunderstandings can occur. Be careful not to jump to any conclusions before you've verified them through additional investigation. The initial effort to find and prepare your data for mining will be well worth it when you discover the gleaming nugget that makes your case.

    Tammi Johnson, CFE, is a senior forensic data analyst for the Financial Integrity Unit at Microsoft Corporation in Redmond, Wash.

    Judd Hesselroth, CIA, CISA, is an audit manager within the Internal Audit organization at Microsoft Corporation in Redmond, Wash.

    1 This fictitious case is used for illustrative purposes. 

    Crystal Reports
    Reference Materials
     

    Basic user

    "Jump Start: Crystal Reports Version 9, Level 1: The Basics" by Hammerman Associates

    Advanced user
    "The Complete Reference: Crystal Reports 9" by George Peck

     
    MS Access Reference Materials 


    Basic user

    "Access Plain & Simple - Access 2002" by Curtis Frye
    "Access Step by Step - Access 2002" by Online Training Solutions

    Advanced user
    "Intermediate/Advanced Users: Access Inside Out - 2002" by Helen Feddema
    "Advanced Users: Programming Microsoft Access 2002" by Rick Dobson

    The Association of Certified Fraud Examiners assumes sole copyright of any article published on www.Fraud-Magazine.com or ACFE.com. Permission of the publisher is required before an article can be copied or reproduced.  
