The coming storm
A practical guide to preparing for 'big data'
Fraud EDge: A forum for fraud-fighting faculty in higher ed
In this column, we discuss "big data," a common term that refers to both the volume of data and the technology that's necessary to effectively analyze and utilize that data. We address the nature and character of big data, its challenges, approaches to working in a big-data environment, many of the options available to professionals and the role of academics in educating young fraud examiners in the new technologies.
The Gartner Group defines "big data" as "information of extreme size, diversity, and complexity." Big data is relative to the analyst. It can be intimidating for fraud examiners who have never encountered data beyond the capacity of Microsoft Excel. For others, traditional local machine-based tools no longer do the trick, and they might need to enter the perplexing world of production servers, cloud-based analytical platforms and oddities with such strange names as Hadoop, Spark, MapReduce and NoSQL.
Whatever your skill level or technological knowledge, it's a certainty that the size, complexity and nature of the data you encounter in fraud examinations will evolve until your current methods are obsolete. As we've explored in previous columns, at least two factors will drive this phenomenon: 1) the growth of unstructured data, which already accounts for at least 80 percent of overall organizational data volume, and 2) the advent of the "Internet of Things," which will involve live data streaming from sensors embedded in millions of objects.
We'll discuss the state of tools now available to fraud examiners and new tools that can handle larger volumes of data and greater complexity.
TRADITIONAL APPROACHES: SPREADSHEETS AND CAATs
Microsoft Excel still dominates the market for standard data analysis — particularly in fraud examinations. Standard Open Database Connectivity (ODBC) functionality connects to data sources in a wide variety of databases, and the Analysis ToolPak add-in extends Excel's basic functionality to include statistical regression and ANOVA (analysis of variance), among others. (ODBC is a database programming interface from Microsoft that provides a common language for Windows applications to access databases on a network.)
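For examiners moving beyond Excel's built-in ODBC dialogs, scripting the same connection is straightforward. The sketch below builds a standard ODBC connection string in Python; the driver name, server, database name and table are all hypothetical, and the actual query (commented out) assumes the third-party pyodbc package.

```python
# A minimal sketch of reaching a database through ODBC from Python.
# The DSN details below ("FraudDB", the payments table) are hypothetical.
def build_odbc_conn_str(driver, server, database, trusted=True):
    """Assemble a standard key=value ODBC connection string."""
    parts = [f"DRIVER={{{driver}}}", f"SERVER={server}", f"DATABASE={database}"]
    if trusted:
        parts.append("Trusted_Connection=yes")  # use Windows authentication
    return ";".join(parts)

conn_str = build_odbc_conn_str("ODBC Driver 17 for SQL Server", "localhost", "FraudDB")

# With pyodbc installed and a live server, a query would run like:
#   import pyodbc
#   with pyodbc.connect(conn_str) as conn:
#       rows = conn.execute("SELECT vendor_id, amount FROM payments").fetchall()
```

The same connection string works in Excel's own ODBC data-source dialog, so a scripted extract and a spreadsheet can pull from one definition.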
Third-party tools such as ASAP Utilities (www.asap-utilities.com) and Fraud ToolBox (fraudtoolbox.com) provide enhanced "ribbons" of menu options to automate otherwise routine tasks. For all its flexibility and ease of use, Excel limits data to 1,048,576 rows per sheet. That sounds impressive, but applying even a basic formula across a large volume of rows significantly reduces performance.
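When a file exceeds that row limit, one workaround is to stream it rather than load it. The sketch below, using only Python's standard library, reads a CSV one row at a time and aggregates as it goes, so memory use stays flat no matter how many rows the file holds; the column names (vendor_id, amount) are hypothetical.

```python
# A sketch of streaming aggregation over a CSV too large for Excel's
# 1,048,576-row sheet limit. Column names here are hypothetical.
import csv
import io
from collections import defaultdict

def totals_by_vendor(csv_file):
    """Sum amounts per vendor one row at a time, without
    loading the whole file into memory."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals[row["vendor_id"]] += float(row["amount"])
    return dict(totals)

# Small in-memory stand-in for a multi-gigabyte file on disk:
sample = io.StringIO("vendor_id,amount\nV1,100.50\nV2,20.00\nV1,4.50\n")
print(totals_by_vendor(sample))  # {'V1': 105.0, 'V2': 20.0}
```

Replacing the in-memory sample with `open("payments.csv", newline="")` applies the same logic to a real file of any size.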
When the size and format of data become too much for Excel, computer-assisted auditing tools (CAATs) — such as ACL, IDEA and Arbutus — and desktop databases — such as Microsoft Access — take over. These tools often provide more robust data analysis functionality and accommodate a wider variety of formats. For example, Microsoft Access can handle up to two gigabytes of data, while the computer-assisted auditing tools can ingest and analyze datasets in excess of 100 gigabytes. These tools can analyze large datasets, but performance at that scale suffers dramatically. Some of them still operate in the older 32-bit environment, which means they can address only a fraction of the memory available on the system.
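When a dataset outgrows a desktop database, another option is to push the analysis into a database engine and let SQL do the heavy lifting. The sketch below uses Python's built-in sqlite3 module to run a classic duplicate-payment test; the table, column names and sample rows are hypothetical.

```python
# A sketch of a duplicate-payment test run in SQL rather than a
# desktop tool. Table and column names here are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path would persist the data
conn.execute("CREATE TABLE payments (vendor_id TEXT, invoice_no TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [("V1", "INV-1", 250.0), ("V1", "INV-1", 250.0), ("V2", "INV-9", 75.0)],
)

# Flag any vendor/invoice/amount combination that appears more than once.
dupes = conn.execute(
    """SELECT vendor_id, invoice_no, amount, COUNT(*)
       FROM payments
       GROUP BY vendor_id, invoice_no, amount
       HAVING COUNT(*) > 1"""
).fetchall()
print(dupes)  # [('V1', 'INV-1', 250.0, 2)]
```

Because the filtering and grouping happen inside the engine, only the exceptions come back to the analyst — the same division of labor the larger CAATs and server-based platforms rely on.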