Fraud EDge: A forum for fraud-fighting faculty in higher ed
In this column, we discuss "big data," a common term that refers both to the sheer volume of data and to the technology necessary to analyze and utilize that data effectively. We address the nature and character of big data, its challenges, approaches to working in a big-data environment, many of the options available to professionals, and the role of academics in educating young fraud examiners in these new technologies.
The Gartner Group defines "big data" as "information of extreme size, diversity, and complexity." Big data is relative to the analyst. It can be intimidating for fraud examiners who have never encountered data beyond the capacity of Microsoft Excel. For others, traditional local machine-based tools no longer do the trick, and they might need to enter the perplexing world of production servers, cloud-based analytical platforms and oddities with such strange names as Hadoop, Spark, MapReduce and NoSQL.
Whatever your skill level or technological knowledge, it's a certainty that the size, complexity and nature of the data you'll encounter in fraud examinations will evolve so that your methods eventually will become obsolete. As we've explored in previous columns, at least two factors will drive this phenomenon: 1) the growth of unstructured data, which already accounts for at least 80 percent of an organization's overall data volume, and 2) the advent of the "Internet of Things," which will involve live data streaming from embedded sensors in millions of objects.
We'll discuss the state of tools now available to fraud examiners and new tools that can handle larger volumes of data and greater complexity.
Traditional approaches: spreadsheets and CAATs
Microsoft Excel still dominates the market for standard data analysis — particularly in fraud examinations. Standard Open Database Connectivity (ODBC) functionality connects to data sources in a large variety of databases, and the Analysis ToolPak add-in extends Excel's basic functionality to include statistical regression and ANOVA (a collection of statistical models), among other analyses. (ODBC is "a database programming interface from Microsoft that provides a common language for Windows applications to access databases on a network," according to PCMag.com.)
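ODBC works the same way from a script as it does from Excel's built-in query tools. Here is a minimal sketch, ours rather than the column's, that uses Python's pyodbc package to pull rows from a hypothetical ODBC data source named "ERP"; the table and column names are invented for illustration.

```python
# Minimal ODBC sketch (illustrative only): connect to a hypothetical
# data source named "ERP" and pull large invoices for review.
import pyodbc

conn = pyodbc.connect("DSN=ERP;UID=analyst;PWD=secret")
cursor = conn.cursor()

# Parameterized query against a hypothetical invoices table.
cursor.execute(
    "SELECT vendor_id, invoice_no, amount FROM invoices WHERE amount > ?",
    10000,
)
for vendor_id, invoice_no, amount in cursor.fetchall():
    print(vendor_id, invoice_no, amount)

conn.close()
```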
Third-party tools such as ASAP Utilities (www.asap-utilities.com) and Fraud ToolBox (fraudtoolbox.com) provide enhanced "ribbons" of menu options to automate otherwise routine tasks. For all its flexibility and ease of use, Excel limits data to little more than 1 million rows per sheet. That sounds impressive, but applying even a basic formula to a large volume of rows significantly reduces performance.
When the size and format of the data become too much for Excel, computer-assisted auditing tools (CAATs) — such as ACL, IDEA and Arbutus — and desktop databases — such as Microsoft Access — take over. These tools often provide more robust data analysis functionality and accommodate a wider variety of formats. For example, Microsoft Access can handle up to two gigabytes of data, while the computer-assisted auditing tools can ingest and analyze datasets in excess of 100 gigabytes. Despite their ability to analyze such large datasets, performance at that scale suffers dramatically. Some of these tools still operate in the older 32-bit environment, which means they can leverage only a fraction of the memory available in the system.
Throwing hardware at the problem
As data volumes soar, we might be tempted to "throw more hardware at it" by incorporating faster or more processors and increasing memory levels. Traditional PCs are limited, so we move to the next tier of tools: production servers or clusters of servers.
Database platforms such as Microsoft SQL Server, MySQL, PostgreSQL and Oracle are designed specifically to harness that server-class hardware, including terabytes of memory (1 TB = 1,000 GB). Third-party tools such as IBM's Cognos run on top of these production databases and provide a robust set of analytics capabilities, including data visualization, statistical analysis and machine learning.
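To see why that server-side horsepower matters, here is a minimal sketch, again ours rather than any vendor's, that pushes the heavy summarization to the database instead of pulling millions of rows down to a laptop. The connection string, payments table and column names are hypothetical; the query itself is a simple duplicate-payment test.

```python
# Illustrative sketch: let the production database do the aggregation.
# The connection string, table and column names are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=prod-db;DATABASE=AP;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Find vendor/amount pairs billed more than once (a classic duplicate-payment
# test) without moving the full payments table off the server.
cursor.execute("""
    SELECT vendor_id, amount, COUNT(*) AS times_paid
    FROM payments
    GROUP BY vendor_id, amount
    HAVING COUNT(*) > 1
""")
for vendor_id, amount, times_paid in cursor.fetchall():
    print(vendor_id, amount, times_paid)

conn.close()
```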
Cloud-based analytics as a service
While these production servers can certainly handle volumes of data that many fraud examiners don't usually see, establishing and maintaining such environments is often costly and requires a high level of expertise. And, as data volume and complexity increase, the only answer is to add more resources. At some point, however, even this becomes infeasible.
Fortunately, technology giants like Microsoft, Google and Amazon are creating farms of thousands to millions of servers and charging us to host our data in the "cloud." Their software analyzes our data, and their vast memory and processing resources can crunch it.
Fees for these cloud-based services are relatively low compared to their benefits. The three major providers (Microsoft Azure, Google Cloud Platform and Amazon Web Services) also offer free trial periods for using the full functionality of their services.
Common concerns about cloud-based services include: 1) perceived risk of breaches of privacy and security, 2) lack of control over hardware and services, and 3) a steeper learning curve for analysts.
However, these areas are improving. For example, atlantic.net, a data-hosting provider, is one of the first to tout HIPAA compliance for health care providers. Other vendors, no doubt, will soon follow.
Additionally, Microsoft, Amazon and Google already provide adequate assurances of data security in their hosting and analysis offerings. However, it's a best practice to exercise due diligence when making decisions about data hosting and security rather than blindly trusting those assurances.
As for the learning curve, cloud-based tools and services are becoming mainstream, and easier, more intuitive graphical user interfaces are following quickly. Also, relatively inexpensive online training is available for all the major platforms to help jump-start the learning process.
Future technologies
Many new technologies probably are beyond the needs and capabilities of most fraud examiners, but we provide this information as background.
Cutting-edge data technologies often are based on a programming model called MapReduce. Hadoop — a distributed storage and processing platform — is built around this model. The Hadoop environment enables processing of petabytes of data (1 petabyte = 1,000 terabytes = 1 million gigabytes), enormously high processing speeds and superior handling of a vast variety of data formats, including unstructured data. An even more recent evolution of this concept is Spark, which some claim is as much as 100 times faster than Hadoop's MapReduce engine.
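To give a flavor of the MapReduce model itself, here is a minimal, self-contained sketch (ours, not taken from any Hadoop distribution) that counts payments per vendor in two phases: a map step that emits key-value pairs and a reduce step that aggregates them. Real Hadoop or Spark jobs run these same phases in parallel across many machines; the vendor data below is invented for illustration.

```python
# Toy MapReduce sketch: count payments per vendor on a single machine.
# In Hadoop or Spark, the map and reduce phases run across a cluster.
from collections import defaultdict

# Hypothetical input records: (vendor, payment amount).
payments = [
    ("Acme Supply", 1200.00),
    ("Globex", 870.50),
    ("Acme Supply", 1200.00),
    ("Initech", 45.99),
]

# Map phase: emit a (key, value) pair for every record.
mapped = [(vendor, 1) for vendor, _amount in payments]

# Shuffle phase: group the emitted values by key.
grouped = defaultdict(list)
for vendor, count in mapped:
    grouped[vendor].append(count)

# Reduce phase: aggregate each key's values.
counts = {vendor: sum(values) for vendor, values in grouped.items()}

print(counts)  # {'Acme Supply': 2, 'Globex': 1, 'Initech': 1}
```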
These systems are designed not only to handle "static" historical data but also to excel at handling streaming data — live data that pours in from Internet-based sources and physical sensors. As we explored in our March/April 2015 column, streaming sensor data (the Internet of Things) will provide fraud examiners with robust, actionable data in their investigations — if they can learn to deal with such volume.
Major providers of this computing environment are Cloudera, MapR, Hortonworks and Databricks (databricks.com). All offer free trial environments (called "sandboxes") to test functionality. Everyday users currently can't easily grasp MapReduce technologies because they require a working knowledge of IT and some programming skill. (Analytical database tools such as Google BigQuery or SQL Server Analysis Services are easier for ad hoc analysis.) However, these classes of technology will be the foundation from which the next generation of data analytics tools arises.
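As a point of comparison with the MapReduce sketch above, here is a minimal sketch of the kind of ad hoc query those analytical database tools support, using Google's google-cloud-bigquery Python client. The project, dataset and table names are hypothetical, and the example assumes you've already created a trial account and set up credentials.

```python
# Illustrative sketch: the same "payments per vendor" question asked as an
# ad hoc SQL query in Google BigQuery. Dataset and table names are
# hypothetical; credentials come from the environment (a trial account works).
from google.cloud import bigquery

client = bigquery.Client()  # reads GOOGLE_APPLICATION_CREDENTIALS

query = """
    SELECT vendor, COUNT(*) AS payment_count
    FROM `my-project.accounts_payable.payments`
    GROUP BY vendor
    ORDER BY payment_count DESC
"""

for row in client.query(query).result():
    print(row.vendor, row.payment_count)
```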
Practical advice and the role of academics
Fraud examiners interested in exploring big-data tools should access the free trial accounts of the cloud-service providers we include above to reduce the mystery of how they work. We recommend further exploration of the Hadoop sandboxes. Even if you don't need data storage and analysis at massive scale right now, you should become familiar with these environments well before an actual need arises because they're decidedly different from traditional PC-based tools.
Data-science curricula are springing up in many colleges and universities. Students can gain exposure to the wide range of analytics tools, from spreadsheets to Hadoop, MapReduce and the newer Spark. The next generation of fraud examiners entering universities today needs to be adept in these technologies and comfortable handling and analyzing massive (not just big) data in unusual formats.
Fraudsters often rely on large amounts of data to hide their schemes — the big-data phenomenon only exacerbates this problem. A new breed of tools — and of people who can use them — is necessary to manage the coming data storm. Universities will be their training ground in the fight against fraud.
Les Heitger, Ph.D., Educator Associate, is BKD Distinguished Professor of Forensic Accounting in the School of Accountancy at Missouri State University in Springfield. He's chair of the ACFE Higher Education Advisory Committee.
Jeremy Clopton, CFE, CPA, ACDA, is senior managing consultant in the Forensics Practice of BKD, LLP.
Lanny Morrow, EnCE, is a managing consultant in the Forensics Practice of BKD, LLP.