Using LLMs for electronic document review
Featured Article

Upgrade your electronic document reviews with LLMs

By Chetan Lunkar, CFE
Date: March 2, 2026
Read Time: 17 mins

When reviewing extensive amounts of electronic data, large language models can help fraud examiners work faster to reveal key details that may otherwise be overlooked. Here’s what you need to know to use this technology effectively to improve your electronic document reviews. 

During an engagement involving a procurement fraud case, my team and I faced a dilemma all too common for fraud examiners reviewing electronic evidence. We tried to limit the number of documents we’d need to review, but the key words we used to refine our search returned an overabundance of data. 

In this case, an anonymous whistleblower alleged their company’s procurement team was accepting kickbacks to onboard new vendors. The whistleblower didn’t specify how or which vendors were involved, but since they’d made substantiated allegations in the past, my investigation team launched a comprehensive probe and gathered gigabytes of data, including emails, internal reports, chat messages and vendor records. To isolate the documents we’d need to review, we searched for terms like “onboard,” “approval,” “selection,” “quote analysis” and “award” to capture instances where someone might be committing fraud. 

The resulting document set was too large to review thoroughly within our limited timeframe and budget. Narrowing the search to include explicit terms like “kickback” or “favor” might’ve reduced the volume but could’ve led to a deficient investigation. Corrupt conduct is rarely expressed explicitly and manifests in acts such as manipulated vendor ratings, approvals granted without justification and misaligned evaluation scores. Searching for what the misconduct was called, rather than how it could manifest, could’ve excluded key evidence suggesting fraud.  

We then deployed a large language model (LLM), a type of artificial intelligence that can understand and generate human text. We fed the LLM documents and the company’s procurement manual and instructed it to identify deviations from the company’s process and any indications of misconduct. While the initial outputs were imprecise, subsequent rounds of feedback and instruction led to more precise, relevant outputs. Within a week, our LLM not only flagged procedural violations in the vendor selection process, but it also identified a pattern that keyword searches missed. It returned multiple instances where the winning vendors were last to bid, invariably at the lowest price, and each winning quote was preceded by a call from the procurement manager. This strongly hinted at bid leakage and collusion. We were then able to interview vendors and employees who confirmed the allegations.  

From this experience, my team and I saw firsthand how LLMs can accomplish more than keyword searches and function as an informed reviewer at scale. While LLMs can bring speed and analytical depth to investigations, deploying them to effectively gather evidence in fraud examinations requires knowing their benefits and limitations and methods to streamline workflow.  

Electronic document review in fraud examinations

In fraud examinations, incriminating evidence is often scattered across various electronic documents, including emails and chat messages, and can involve gigabytes of digital data. The Electronic Discovery Reference Model (EDRM) provides a widely accepted framework for managing electronic evidence. Electronic document review (EDR) begins with identifying relevant data sources and concludes with presenting evidence. According to the EDRM, there are three core tasks for managing electronic data:

  • Processing requires converting various datasets, such as emails, attachments, chat logs, scanned documents and proprietary file formats, into reviewable and trackable formats and pruning data to exclude extraneous files. At this stage, raw data is transformed into items suitable for review while preserving metadata such as dates, authors and file relationships.
  • Review involves inspecting individual documents after applying search terms (or keywords) to assess their relevance to the investigation. Documents are manually examined to assess their importance and tagged based on their relevance to the issues under investigation.
  • Analysis entails identifying patterns, building timelines and constructing a narrative from the reviewed documents. Fraud examiners must connect dots across documents, recognize relationships between parties, and assemble evidence of intent and culpability.
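The review-and-tag step above can be pictured as a simple data structure that carries each document’s text alongside its preserved metadata. The sketch below is a minimal Python illustration; the field names and values are hypothetical, not a standard EDRM schema:

```python
from dataclasses import dataclass, field

@dataclass
class ReviewItem:
    """A processed document ready for review, with its metadata preserved."""
    item_id: str
    source_path: str
    text: str
    author: str
    created: str                                  # ISO date from file metadata
    parents: list = field(default_factory=list)   # e.g., the email an attachment came from
    tags: list = field(default_factory=list)      # filled in during review

# A reviewer inspects the item and tags it based on relevance.
item = ReviewItem("D-0001", "mail/inbox.pst", "Please approve vendor X.",
                  "j.doe", "2024-03-02")
item.tags.append("relevant")
```

Keeping metadata and tags on the same record is what makes later filtering by date, custodian or relevance possible.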

EDR can be expensive and time-consuming. Besides processing and hosting expenses, the time to complete reviews also drives costs, which are greatly increased when there are more documents to review. A reviewer’s accuracy can suffer if they have a large volume of documents and the pressure is on to finish quickly.  

The limitations of keyword-based EDR

Keyword-based EDR (K-EDR) uses search terms refined through Boolean operators, proximity searches and wildcards. It can also filter by date ranges, custodians or file types, and it’s commonly used to identify relevant case information from large datasets. However, it has significant limitations. 

First, keywords are often too broad and don’t cover all the nuances of word contexts. For example, a search for the keyword “fraud” will retrieve every document containing that word, regardless of context, including documents where the term appears in a compliance policy, a training manual or a routine disclaimer. A keyword search might also fail to return documents that describe the same conduct in different language. For instance, a document in which an employee writes, “I tweaked the numbers for the auditors” is describing potential fraud. A keyword search for “fraud,” “misrepresentation” or “falsification” would miss this. A study published by researchers participating in the Text Retrieval Conference (TREC), a program run by the National Institute of Standards and Technology (NIST) that evaluates information retrieval systems, found that on average, K-EDR missed about 78% of relevant documents.
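The gap is easy to demonstrate in code. In this minimal sketch (the documents and search terms are invented for illustration), the routine policy document matches while the genuinely suspicious one slips through:

```python
# Hypothetical documents: one uses fraud vocabulary innocently,
# the other describes misconduct without any of the search terms.
documents = {
    "doc1": "Per the compliance policy, report suspected fraud to the hotline.",
    "doc2": "I tweaked the numbers for the auditors before quarter close.",
}

keywords = ["fraud", "misrepresentation", "falsification"]

def keyword_hits(text, terms):
    """Return the search terms found in the text (case-insensitive)."""
    lower = text.lower()
    return [t for t in terms if t in lower]

hits = {doc_id: keyword_hits(text, keywords) for doc_id, text in documents.items()}
# doc1 (a routine policy reference) matches; doc2 (the real red flag) does not.
```

A human reader, or an LLM, recognizes instantly that the second document is the one worth reviewing.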
 
Second, keyword searches can’t make connections and identify patterns between documents. Often, a document seems irrelevant on its own. Substantive evidence of intent and culpability rarely exists in a single document; it must be assembled from scattered clues. For example, a keyword search that retrieves a “rejected” vendor’s invoice won’t reveal records of a late-night WhatsApp call to a vendor and a subsequent approval of the previously rejected invoice. These additional documents suggest a pattern of collusion, but an extensive human review is still necessary to reveal these connections. 

Third, when multiple reviewers work through a large document set, there’ll be inconsistencies and inaccuracies. Different reviewers may apply different relevance standards to similar documents, and the mental fatigue of reviewing thousands of documents compounds the risk for error. 

Using LLMs for electronic document review

Technology-assisted review (TAR), which employs machine learning to classify documents, addresses some of these limitations and is recognized in several legal jurisdictions. But TAR functions without a programmed understanding of what’s relevant to a case and learns from examples defined by human reviewers. 

Significant input from human reviewers is required to train TAR models. And when training data is selected through keyword searches, the TAR model produces the same drawbacks as K-EDR — it learns relevance based on keyword-matching language and may overlook documents that discuss the same issue using different terms. 

Mastering the mechanics of LLMs

LLMs can help fraud examiners overcome the limitations common to K-EDR and TAR because of their ability to comprehend meaning. 

LLMs are trained on extensive datasets from various digital sources, including books, articles, regulatory filings and business documents. This training enables LLMs to build a vast network of associations from the text they’re trained on, allowing them to “understand” different business processes, workplace standards and regulatory frameworks. 

Training produces numerical values called weights that capture how words relate to one another in different contexts. Rather than recording that certain words tend to appear together, weights encode the nature of the relationship between words. For instance, the word “fraud” is probabilistically associated with concepts related to an intent to deceive and misrepresentation, such as financial fraud and securities violations, but not with text or terms that suggest harassment. An LLM infers that for text in a document to indicate fraud, it should contain certain elements such as falsification or deception. An LLM can recognize both fraud and harassment simultaneously as types of misconduct. 

Although these capabilities might suggest that LLMs have human-like expertise or reasoning skills, they’re fundamentally probabilistic prediction systems. Think of an LLM as a system with an extraordinary ability to recognize patterns. Just as a fraud examiner who’s examined thousands of cases might notice suspicious patterns in financial transactions without being explicitly told what to look for, LLMs can identify potential issues based on statistical patterns learned from their training data. They predict likely text continuations based on observed patterns. 

This is what makes LLMs fundamentally different. A keyword search can only match the exact terms you specify, but an LLM can recognize that phrases like “pushed the invoice through without sign-off” and “unapproved invoice” describe the same concept. Unlike keywords that match exact terms or TAR, LLMs understand meaning, context and relationships between concepts with comparatively minimal guidance and human oversight. 

Working faster with more data

LLMs help fraud examiners work faster and review more data. According to joint research conducted in 2024 by artificial intelligence platform Relativity and the law firm Redgrave, an LLM’s effectiveness is comparable to that of a human reviewer, making it a reliable tool for EDR. 

LLMs excel at identifying relevant patterns or datapoints across different types of documents and communication formats. Because they’re trained on terabytes of data from diverse public sources, they hold a vast repository of information about various business processes, workplace standards and regulatory frameworks. This knowledge enables LLMs to spot anomalies without specific instructions. For example, when giving search instructions to an LLM, you don’t have to explain that an expense claim should have proper approval accompanying it or that a cryptic conversation about competitors could indicate bid rigging. The LLM recognizes these as potential red flags based on its contextual knowledge, allowing fraud examiners to focus on broader objectives in their searches rather than on detailed rules, norms, processes and practices. 

An LLM’s vast encoded knowledge enables fraud examiners to analyze documents from multiple perspectives rather than rely solely on a rigid keyword-based method. By understanding context and sentiment, LLMs decipher an investigative narrative as it moves from a formal email to an informal chat and back again, linking related events across documents and communication formats for a cohesive picture of a scheme.

Key LLM limitations

LLMs are ordinarily associated with the risk of hallucinations, where an LLM generates outputs that are confidently stated but factually incorrect or fabricated. This risk alone makes human review of LLM outputs an essential step. However, from an EDR perspective, there are additional limitations that are less commonly understood but equally consequential.   

Statelessness and limited context window: LLMs can only process a limited amount of information at once. The context window, which encompasses the instructions provided, the documents submitted and the LLM’s own output, represents the total information an LLM can process at one time. Most current models have, on average, a context window of approximately 250,000 tokens, which translates into roughly 300 documents (assuming an average of 500 words per document and after allowing for instructions and outputs).
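The arithmetic behind that estimate can be sketched as follows. The window size, tokens-per-word ratio and reserve are illustrative assumptions, not properties of any particular model:

```python
# Back-of-the-envelope batch sizing for a hypothetical model.
CONTEXT_WINDOW_TOKENS = 250_000   # assumed model limit
TOKENS_PER_WORD = 1.3             # common rough conversion for English text
WORDS_PER_DOCUMENT = 500          # assumed average document length
RESERVED_TOKENS = 50_000          # held back for instructions and model output

tokens_per_doc = WORDS_PER_DOCUMENT * TOKENS_PER_WORD     # ~650 tokens per document
available = CONTEXT_WINDOW_TOKENS - RESERVED_TOKENS       # 200,000 tokens for documents
docs_per_batch = int(available // tokens_per_doc)
print(docs_per_batch)  # 307 — roughly 300 documents per batch
```

Changing any assumption (longer documents, a larger output reserve) shrinks the batch accordingly, which is why batch planning belongs in the workflow design rather than being improvised mid-review.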

Since fraud examinations can involve thousands or even millions of documents, they must be processed in batches. This is where statelessness compounds the problem: LLMs don’t retain memory of past interactions, so insights or patterns identified in one batch won’t carry over to the next. Each batch is reviewed in isolation, without any accumulated contextual understanding of prior documents. An application layer is therefore necessary to log interactions, store outputs and resubmit relevant context, ensuring continuity across batches and producing the audit trail that a defensible review process requires.
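One way to picture that application layer is a thin wrapper that logs each batch, prompt and output to an append-only file. This is a minimal sketch; `call_llm` is a hypothetical stand-in for whatever model API the review platform actually uses:

```python
import datetime
import hashlib
import json

def review_batch(batch_id, prompt, documents, call_llm, log_path="audit_log.jsonl"):
    """Submit one batch to the LLM and record an audit-trail entry."""
    output = call_llm(prompt, documents)   # hypothetical model call
    entry = {
        "batch_id": batch_id,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "document_ids": [d["id"] for d in documents],
        "output": output,
    }
    with open(log_path, "a") as f:   # append-only log forms the audit trail
        f.write(json.dumps(entry) + "\n")
    return output
```

Hashing the prompt rather than storing it inline is a design choice; a production system would typically store both the full prompt and the retrieved context so the review can be reproduced exactly.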

Explainability: It’s difficult to understand and explain the reasoning behind an LLM’s predictions. Generating a word requires billions of calculations, and it’s impractical to analyze computations with the aim to understand the logic behind its output. Although you may instruct an LLM to explain its logical steps or reasoning (a technique called chain of thought), these explanations are probabilistic and don’t reflect the model’s true decision-making process. An output might be correct, but the underlying logic could be flawed. Explainability techniques based on computational statistics are still in the early stages of development. 

Prompt sensitivity: The quality of an LLM’s output depends on the quality of the instructions or prompts it gets. Prompts that are too broad may cause the LLM to retrieve irrelevant information, and prompts that are too narrow may omit relevant documents. Small changes in prompts may produce different results within the same LLM or across multiple LLMs. 

LLMs also struggle with overly long or complex instructions, especially those addressing multiple issues. Unlike human reviewers, who might notice evidence of other misconduct unrelated to the specific issue they’re searching for, LLMs operate within the boundaries of their prompts. They operate with tunnel vision and won’t flag misconduct they haven’t been instructed to look for, no matter how clearly it appears in the documents.    

Training data limitations: An LLM’s understanding of an issue depends on its training data. If the data is inconsistent or doesn’t accurately reflect relevant standards or norms, the output may be flawed. For example, an LLM might incorrectly conclude that uncommon or outlier situations, such as one employee winking at another, don’t qualify as harassment, even if an organization’s culture considers it harassment.  

This limitation also affects an LLM’s ability to perform document review. If an LLM’s general training doesn’t factor in an organization’s specific policies, procedures or standards, it won’t be able to recognize relevant criteria against which potential misconduct is assessed. Fine-tuning (i.e., incrementally training) the LLM on organization-specific materials can help, but it can involve significant cost and complexity. An alternative is to append relevant policies to each prompt, but this too consumes context window space and could end up being a waste of resources.

Cost: Continuously using LLMs is expensive. While costs vary depending on the LLM model, the total expense of an LLM-based review — including platform licensing, data preparation and human validation — may still be higher than traditional methods, particularly for smaller document sets.


Designing LLM-based EDR

With their limitations, LLMs can’t be used in isolation. When conducting EDR, fraud examiners must integrate LLMs into a broader review platform. In LLM-powered EDR (LDR), each document is treated separately and tracked so the analysis and its provenance can be recorded, reviewed and, if necessary, presented as evidence during litigation. The LLM acts as a reasoning engine within the platform, while the surrounding applications address the LLM’s limitations and provide the audit trails, access controls, data management and user interfaces required for a well-founded review process. 

LDR operates on a three-step cycle consistent with the EDRM framework:

  1. Input preparation
  2. LLM processing
  3. Output validation by human reviewers 

Structured prompts incorporating the case instructions and the documents are fed into the LLM, which uses them to examine and categorize the documents. The output is then integrated into the LDR workflow for validation and review.

A poorly designed LDR workflow may result in inaccurate outputs. An LLM’s processing can’t be paused or segmented, so a fraud examiner must correctly understand the core components of LDR to design the most effective workflow. While you could build an LDR workflow using open-source tools and applications, there are products available, such as Relativity, that provide these solutions.

The following framework, including how to successfully navigate prompt engineering, will help you create an efficient LDR workflow. 

Selecting an LLM appropriate for the examination

Not all LLMs are alike. They may share the same foundational framework, but they differ significantly in their training data, design choices and in how their outputs are calibrated for caution, precision or creativity. One model may be more conservative and heavily qualified in its outputs; another may be more decisive. One might excel at nuanced reasoning across ambiguous facts, and another at structured extraction from dense documents. Before you select an LLM for document review, you’ll need to evaluate its characteristics and base your selection on the demands of your examination. 

At their core, LLMs are textual prediction engines, and while they appear to perform mathematical calculations, execute code, search the web or query databases, these tasks are in fact carried out by external applications that the LLM calls upon and then incorporates into its response. Think of an LLM as a senior partner who directs various experts and synthesizes their work into a coherent output, without personally performing the underlying computation. An LLM’s versatility depends on the tools it’s connected to; evaluate its outputs accordingly. [See “Agentic artificial intelligence and LDR” at the end of this article.]

Refined prompt engineering

Prompts are the primary means of guiding an LLM’s analysis, and providing quality instructions to an LLM is paramount. However, creating effective prompts is a balancing act, like finding the right blend of ingredients for a recipe through trial and error. Just as different palates require subtle adjustments to the same recipe, each fraud examination needs customized prompts.

Going back to the example of harassment, a prompt instructing the LLM to identify “unusual or noncompliant behavior” might return documents describing unrelated behaviors, while a prompt asking for documents that “demonstrate harassment” could leave out suggestive evidence. Asking for documents that “suggest harassment” may be a better option.
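In practice, such phrasing usually lives inside a structured prompt template that’s assembled for each batch. The wording below is illustrative only, not a recommended standard, and the document format is a hypothetical convention:

```python
# Illustrative prompt template; the wording would be refined iteratively
# against a labeled test set before being used in a live review.
PROMPT_TEMPLATE = """You are assisting a fraud examination.
Task: Flag any document whose content suggests harassment, including
conduct that falls short of explicit threats or slurs.
For each flagged document, return the document ID, a one-sentence
reason, and the exact passage relied on.
If nothing in a document suggests harassment, return "not relevant".

Documents:
{documents}"""

def build_prompt(batch):
    """Assemble one batch of documents into the template."""
    body = "\n\n".join(f"[{d['id']}]\n{d['text']}" for d in batch)
    return PROMPT_TEMPLATE.format(documents=body)
```

Requiring a reason and an exact passage for every flag makes the output easier for a human reviewer to validate, which matters later for defensibility.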

Another difficulty in ensuring reliable and consistent outcomes is that, as noted earlier, small changes in prompts may produce different results within the same LLM or across different LLMs, and LLMs struggle with overly long or complex instructions, particularly those that cover multiple issues at once. 

You can manage these deficiencies by repeatedly adjusting prompts based on an LLM’s outputs until they consistently produce accurate results. This iterative refinement can be tested on a representative set of responsive and unresponsive documents to measure metrics like precision (the proportion of retrieved documents that are actually relevant), recall (the proportion of relevant documents that are actually retrieved) and F1 score (a standard metric that balances precision and recall into a single measure of accuracy), to assess a prompt’s effectiveness.
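Those three metrics can be computed directly from a hand-labeled test set. A minimal sketch, assuming binary relevance labels and the prompt’s retrieval decisions for the same documents:

```python
def review_metrics(labels, predictions):
    """Precision, recall and F1 for a prompt tested on labeled documents.

    labels and predictions map document IDs to True (relevant/retrieved)
    or False.
    """
    tp = sum(1 for d in labels if labels[d] and predictions[d])        # true positives
    fp = sum(1 for d in labels if not labels[d] and predictions[d])    # false positives
    fn = sum(1 for d in labels if labels[d] and not predictions[d])    # missed documents
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: 4 relevant documents; the prompt retrieved 3 of them plus 1 false positive.
labels = {"d1": True, "d2": True, "d3": True, "d4": True, "d5": False, "d6": False}
preds  = {"d1": True, "d2": True, "d3": True, "d4": False, "d5": True, "d6": False}
p, r, f = review_metrics(labels, preds)   # p = 0.75, r = 0.75, f1 = 0.75
```

In a review context, low recall is usually the more dangerous failure, since it means relevant evidence never reaches a human reviewer.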

Optimizing the LDR workflow

Processing an entire document set through an LLM is rarely feasible. The first step is therefore to reduce the number of documents using broad filters such as date ranges, custodians or broad search terms, and techniques such as email deduplication, filtering out system files and logs, and stripping irrelevant or repetitive text, such as the boilerplate typically appended to emails. In the procurement fraud case my team examined, we applied broad terms such as “purchase” and “vendor” and captured all communications involving external parties, rather than aiming for precision. Aim for the forest, not the trees. 
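The culling step can be sketched as a short pipeline: a date-range filter, a broad-term filter and hash-based deduplication. The field names and filter values here are hypothetical:

```python
import hashlib
from datetime import date

def cull(documents, start, end, broad_terms):
    """Reduce a document set with date filters, broad terms and deduplication."""
    seen, kept = set(), []
    for doc in documents:
        if not (start <= doc["date"] <= end):                # date-range filter
            continue
        text = doc["text"].lower()
        if not any(term in text for term in broad_terms):    # broad, not precise
            continue
        digest = hashlib.sha256(text.encode()).hexdigest()   # exact-duplicate check
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

docs = [
    {"id": 1, "date": date(2024, 3, 1), "text": "Vendor quote attached."},
    {"id": 2, "date": date(2024, 3, 2), "text": "Vendor quote attached."},  # duplicate
    {"id": 3, "date": date(2023, 1, 5), "text": "Vendor onboarding form."}, # out of range
    {"id": 4, "date": date(2024, 4, 1), "text": "Lunch on Friday?"},        # no broad term
]
kept = cull(docs, date(2024, 1, 1), date(2024, 12, 31), ["vendor", "purchase"])
# keeps only document 1
```

Real review platforms use more sophisticated near-duplicate detection and email threading, but the principle is the same: shrink the set before the LLM ever sees it.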

Documents must also be converted into an LLM-readable format before processing. This requires integrating corresponding applications into the LDR workflow. Depending on the nature of the dataset, this may involve extracting nested files and email attachments, applying optical character recognition (OCR) to scanned documents, transcribing audio and video content, decrypting password-protected files, and parsing data from nonstandard sources, such as chat applications or proprietary systems, into consistent text formats.

Retrieval-augmented generation 

A technique called retrieval-augmented generation (RAG) offers a practical solution to LLMs’ context window limitation and to the challenge of grounding an LLM’s analysis in organization-specific policies and procedures without the cost and complexity of fine-tuning. Rather than appending entire policy manuals to each prompt, RAG selectively retrieves the content relevant to each batch of documents and appends only that content, optimizing the context window. 

RAG-based functionality operates in three stages:

  1. The organization’s policies, procedures and relevant material are divided into chunks (i.e., segments of text) and converted into vector embeddings that capture each chunk’s semantic meaning. These vectors are stored in a database. A chunk drawn from a procurement policy governing vendor selection, for instance, would generate a vector, linking it to concepts such as approvals, due diligence, competitive bidding and conflict-of-interest declarations.
  2. As each document batch is prepared for processing, the RAG system converts the user’s prompt into a vector and searches the vector database for semantically relevant matches. Rather than appending an entire procurement manual, it only retrieves the provisions germane to the issues or relevant to the prompt.
  3. The retrieved policy chunks are appended to the prompt alongside the document batch before submission to the LLM. The LLM receives not just the documents but the specific standards against which potential deviations should be assessed without prior fine-tuning and without exhausting the context window on provisions irrelevant to a particular batch.
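Stage 2 reduces to a nearest-neighbor search over the stored vectors. The toy sketch below uses cosine similarity over hand-made three-dimensional embeddings; a real system would generate high-dimensional vectors with an embedding model:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy vector store: policy chunks with hand-made embeddings (illustrative only).
policy_chunks = {
    "vendor-selection": ([0.9, 0.1, 0.0], "Three competitive quotes are required..."),
    "expense-approval": ([0.1, 0.9, 0.0], "Claims above the limit need sign-off..."),
    "data-retention":   ([0.0, 0.1, 0.9], "Records are retained for seven years..."),
}

def retrieve(prompt_vector, store, top_k=1):
    """Return the top_k chunk IDs most similar to the prompt's embedding."""
    ranked = sorted(store.items(),
                    key=lambda item: cosine(prompt_vector, item[1][0]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:top_k]]

# A prompt about bid manipulation embeds near the vendor-selection chunk,
# so only that provision is appended to the batch.
print(retrieve([0.8, 0.2, 0.1], policy_chunks))  # ['vendor-selection']
```

The retrieved chunk text (the second element of each stored pair) is what gets appended to the prompt in stage 3.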

Building the investigative narrative

Because documents are processed in batches, the LLM’s analysis must be aggregated into a cohesive investigative narrative. For each batch, the LLM generates outputs that can include relevancy, a summary, key dates, parties involved and observations for each document or the batch, depending on how the documents are fed into the LLM. The precision of the prompt governs the precision of the summary. An instruction focused solely on employee conduct may cause the LLM to omit references to involved third parties. The LLM’s output, including summaries, is stored in a database and can be vectorized, enabling semantic searches across the entire summarized dataset. As explained above, outputs should be validated through human review by cross-checking a representative sample.  

These outputs can then be further synthesized and analyzed. You may query by date, custodian or participant, instruct the LLM to analyze clusters of related documents and begin assembling a cohesive investigative narrative from fragmented and dispersed records. Patterns invisible to keyword searches and to reviewers working in isolation begin to surface. In our procurement fraud case, it was precisely this step that revealed the pattern of selected vendors being the last to submit bids at the lowest price, each time preceded by a call from the procurement manager. No single document or batch suggested that pattern, and it only became visible when the LLM’s summaries were analyzed collectively.
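Once batch outputs are stored in structured form, cross-batch patterns like the one in our case can be surfaced with straightforward aggregation. The records and field names below are hypothetical, standing in for structured observations an LLM might emit per bid event:

```python
# Hypothetical structured outputs accumulated across batches, one row per bid event.
bid_events = [
    {"vendor": "A", "awarded": True,  "bid_order": "last",   "lowest_price": True,  "call_before_bid": True},
    {"vendor": "B", "awarded": False, "bid_order": "first",  "lowest_price": False, "call_before_bid": False},
    {"vendor": "A", "awarded": True,  "bid_order": "last",   "lowest_price": True,  "call_before_bid": True},
    {"vendor": "C", "awarded": False, "bid_order": "middle", "lowest_price": True,  "call_before_bid": False},
]

def leakage_suspects(events):
    """Vendors whose every award fits the bid-leakage pattern described above."""
    awards = {}
    for e in events:
        if e["awarded"]:
            fits = (e["bid_order"] == "last" and e["lowest_price"]
                    and e["call_before_bid"])
            awards.setdefault(e["vendor"], []).append(fits)
    return sorted(v for v, flags in awards.items() if all(flags))

print(leakage_suspects(bid_events))  # ['A']
```

No single row here is incriminating; the pattern emerges only when the per-batch observations are pooled, which is exactly why the aggregation layer matters.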


Legal considerations

Unlike TAR, which has received considerable scrutiny and acceptance by courts in multiple jurisdictions, LDR remains largely untested in adversarial proceedings. Legal systems require transparent, repeatable and defensible processes, a standard that conflicts with an LLM’s explainability limitations. An LLM may correctly flag a document as relevant, but for reasons different from those a fraud examiner or lawyer would consider relevant. 

Courts evaluating LLM-assisted review are likely to examine whether the process was sufficiently documented, whether outputs were validated by human reviewers and whether the methodology can be reproduced. A poorly documented LDR workflow that fails to demonstrate the instructions the LLM was given, the documents it processed and how its outputs were validated may not meet court standards. Comprehensive audit trails aren’t just good practice; they’re necessary.

The application of attorney-client privilege and work-product protection to LLM-assisted reviews is another crucial consideration. Work-product protection typically shields materials prepared by or under the direct supervision of a lawyer in anticipation of litigation. When an LLM drives the analytical conclusions by identifying patterns, assessing relevance and drawing inferences, the resulting work product may not be attributable to the supervising lawyer, and a court may decline to extend protection to the analysis.

While an LLM’s statelessness provides a measure of data protection because the model doesn’t retain data between interactions, there are risks. If the application layer surrounding the LLM isn’t properly secured, data leakage can occur. Moreover, some LLM providers use submitted prompts as training data, and any confidential information embedded in a prompt could become a permanent feature of the model’s weights and surface in responses to other users’ future queries. 

Transforming document review

As document volumes grow and investigations become more complex, the question for fraud examiners is not whether to adopt LLMs but how to deploy them responsibly. That requires understanding their mechanics, designing workflows that address their limitations, and maintaining the human oversight that both the technology and the law demand. When used in a transparent and responsible framework, LLMs do more than just increase efficiency; they elevate the breadth and quality of evidence fraud examiners can find. 

Chetan Lunkar, CFE, is a senior director at SAMCO. Contact him at clunkar@gmail.com.


Agentic artificial intelligence and LDR

A significant development in LLM capability is the emergence of agentic artificial intelligence (AI). These are systems in which an LLM doesn’t merely respond to a single prompt but autonomously plans and executes a sequence of tasks to achieve a defined objective. Rather than waiting for instructions at each step, an agentic system breaks down a complex goal into subtasks, determines the order in which to execute them, calls on available tools and data sources, evaluates its interim outputs, and adjusts its approach accordingly with minimal human intervention.

In the context of EDR, agentic AI can substantially expand what an LDR workflow can accomplish. An agentic system could, for instance, receive a high-level instruction such as “identify evidence of procurement fraud across this document set” and then autonomously retrieve relevant documents, summarize findings, flag anomalies, cross-reference patterns across batches and produce a structured report, escalating to human review only where uncertainty warrants it. The fraud examiner shifts from directing each step of the process to defining the objective and validating the output.

Unlike an LDR workflow, which operates within explicitly defined parameters and requires a discrete instruction at each step, agentic AI can interact with external tools, databases and file systems without step-by-step human authorization. The consequences of an error or a misunderstood instruction can cascade before they’re detected. Two risks warrant particular attention in an investigative context. First, data confidentiality: An agentic AI may inadvertently expose information to unauthorized systems or third-party platforms when executing its tasks. Second, data integrity: An agent with write or delete permissions could modify or permanently destroy documents that form part of the evidentiary record. These risks make human oversight checkpoints, tightly scoped permissions and comprehensive activity logging essential features of any agentic AI deployment.
