Mining Unstructured Data in Forensic Accounting Investigations
Mining structured data has become an established component of fraud and forensic accounting investigations. As the name implies, structured data is any kind of data with a consistent, reliable structure; this includes sources such as spreadsheets, databases and most data in accounting information systems. Robust tools such as ACL and IDEA (among others) can handle this data—not just on a sample basis but on the entire population of available data.
Contrast this with unstructured data, which includes everything else—texts, email, documents, social media, audio, video and many forms of Web-based content. According to an Ernst & Young study, unstructured data accounts for about 80 percent of all available data. So, by failing to account for unstructured data in an investigation, we only address 20 percent of the total available population of data. Furthermore, that ignored 80 percent is very rich in human-generated, contextual and even emotion-laden data.
Sources of Text Used in Forensic Investigations
The graphic above identifies some of the more commonly generated unstructured data available in an investigation. Many other sources exist, but these are the most common in corporate investigations.
Chief among textual sources in an investigation is email. This source of evidence not only contains word-for-word communications, but also possesses a date/time element, metadata and even emotional tone as expressed through various idioms, phrases and adjectives. Another often overlooked source of rich unstructured data is the computer hard drive, which not only contains email, documents, audio and video, but also caches of Internet activity, discarded IM and chat sessions, deleted content and often overlooked backup and temporary copies of items. Computer forensics technologies can preserve, identify and produce these more obscure items.
Text Mining Tools & Processes
Handling the sheer volume and complexity of unstructured data requires special tools and processes. Since the majority of useful, relevant material is human communications, analysis should not be limited to mere keyword searches; it also should include extraction of meaning and topics, emotional tone of the conversation and the creation of relationship networks to visualize how key players and topics interact, influence and evolve over time.
The visual I like to use accurately portrays the analysis of text as a central concept, tactically accomplished via a family of tools and processes that work together to tell a story and supplement the more traditional structured data component of an investigation.
The broad categories encompass the science of “natural language processing” and related concepts of latent semantic analysis and concept searching, among others. Experience has shown these components—working together—to be an effective toolset in the identification of relevant evidence in a forensic investigation.
I can’t emphasize enough the importance of considering unstructured data in investigations. The oft-quoted stat that 80 percent of an organization’s data is unstructured isn’t marketing hype; it means that an investigation plan that ignores this data covers only about 20 percent of what is available. The power of being able to capture not just the topics and content of communications, but also the emotional state of the participants, is staggering, and even more so when combined with social network theory and artificial intelligence-assisted tools.
Text Mining Components
This is a conceptual overview of the processes comprising the core of text mining. Together, these components encompass the science of natural language processing as well as the related concepts of latent semantic analysis and concept searching, among others. Experience shows these components, when working together, are an effective tool set for identifying relevant evidence in a forensic investigation.
Predictive coding uses artificial intelligence (AI) to help find related and similar documents in a massive collection of text. The AI is capable of determining the underlying concepts in a document or email, so predictive coding can be performed independent of traditional methods that rely on keyword searches. Perhaps more important than the ability to rapidly find highly relevant content is that predictive coding can reduce the volume of material reviewed by the investigator by as much as 95 percent. This is possible because the AI and the human analyst leverage each other’s strengths, an approach often described as augmented intelligence.
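To make the idea of finding "related and similar documents" concrete, here is a minimal sketch in Python. It is not how any particular predictive coding product works; commercial tools typically train classifiers on reviewer decisions. As a simplified stand-in, it ranks documents by TF-IDF cosine similarity to a single seed document that a reviewer has already marked relevant. The sample emails and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF, so terms found in every document keep a small weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(seed_index, docs):
    """Rank every other document by similarity to the reviewer-tagged seed."""
    vectors = tfidf_vectors(docs)
    seed = vectors[seed_index]
    scores = [(i, cosine(seed, v)) for i, v in enumerate(vectors) if i != seed_index]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample emails; index 0 is the seed the reviewer marked relevant.
EMAILS = [
    "Please backdate the invoice so the revenue lands in the third quarter",
    "Can you backdate that invoice before the auditors review the quarter",
    "Lunch on Friday? The new place downtown has great reviews",
    "Reminder: submit your timesheet by the end of the day",
]
RANKED = rank_by_similarity(0, EMAILS)
```

In this toy collection, the second email ranks first because it shares the distinctive terms "backdate", "invoice" and "quarter" with the seed. The volume reduction described above comes from reviewing only the top of such a ranking rather than every document.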
Part-of-speech (POS) tagging is the process by which a computer program labels each word in a text with its grammatical role, such as noun, verb or adjective.
By leveraging this function to dissect communications into their grammatical subcomponents, two of the more useful and exciting types of analysis—topic maps and word clouds—are possible. The following example illustrates “a tale of two finance departments”; it doesn’t take much imagination to tell which department may have some issues.
Graphics can be drawn from overall concepts as expressed through nouns or adjectives. Higher-quality systems also incorporate colors to distinguish between positive and negative emotions or events, and the date/time email element allows the investigator to explore the evolution of topics and emotions. Additional analytical leverage is gained by pairing noun topics with their descriptive adjectives to assess the emotional context of a topic.
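The following sketch illustrates, under heavy simplification, how tagged text feeds a word cloud and how nouns can be paired with their adjectives. A real system would use a trained POS tagger; here a tiny hand-built lexicon stands in for one, and the lexicon, the sample messages and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy lexicon standing in for a trained POS tagger.
POS_LEXICON = {
    "budget": "NOUN", "forecast": "NOUN", "audit": "NOUN",
    "invoice": "NOUN", "team": "NOUN",
    "great": "ADJ", "solid": "ADJ", "missing": "ADJ",
    "stressful": "ADJ", "late": "ADJ",
}


def tag(text):
    """Return (word, tag) pairs; words outside the lexicon are tagged OTHER."""
    words = re.findall(r"[a-z]+", text.lower())
    return [(w, POS_LEXICON.get(w, "OTHER")) for w in words]


def noun_counts(messages):
    """Count nouns across messages: the raw weights behind a word cloud."""
    counts = Counter()
    for message in messages:
        counts.update(word for word, pos in tag(message) if pos == "NOUN")
    return counts


def adjective_noun_pairs(messages):
    """Count adjectives that directly precede a noun."""
    pairs = Counter()
    for message in messages:
        tagged = tag(message)
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
            if t1 == "ADJ" and t2 == "NOUN":
                pairs[(w1, w2)] += 1
    return pairs


# Hypothetical messages from two finance departments.
DEPT_A = ["Great forecast this quarter", "The team delivered a solid budget"]
DEPT_B = [
    "Another missing invoice before the audit",
    "Stressful audit and a late invoice again",
]

CLOUD_A = noun_counts(DEPT_A)
CLOUD_B = noun_counts(DEPT_B)
PAIRS_B = adjective_noun_pairs(DEPT_B)
```

The noun counts would drive the size of each word in a cloud, while the adjective-noun pairs (for example "late invoice" or "stressful audit") supply the emotional context of a topic. Adding each message's date to the counts would allow the same tallies to be tracked over time.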
Tone detection uses adjectives, idioms and phrases to assess the emotional tone of the communication. This ability has powerful implications: an investigator can quickly home in on red flags without having an initial theory or starting point. Common tones that can be analyzed include tense, vague, nervous, low esteem and conspiratorial, among others.
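One simple way to approximate tone detection is to count phrases from tone-specific word lists. The sketch below does exactly that and nothing more; commercial tools rely on much larger curated or learned lexicons and on context. The phrase lists, the threshold and the sample messages are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical phrase lists; real lexicons are far larger and better tuned.
TONE_LEXICON = {
    "conspiratorial": [
        "off the books", "keep this between us",
        "delete this", "nobody needs to know",
    ],
    "nervous": ["worried", "anxious", "what if they find", "in trouble"],
    "vague": ["somehow", "sort of", "whatever it takes", "you know what"],
}


def tone_scores(message):
    """Count whole-phrase matches for each tone in a single message."""
    text = re.sub(r"\s+", " ", message.lower())
    scores = Counter()
    for tone, phrases in TONE_LEXICON.items():
        for phrase in phrases:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            scores[tone] += len(re.findall(pattern, text))
    return scores


def flag_messages(messages, threshold=2):
    """Return (index, scores) for messages whose total hits reach the threshold."""
    flagged = []
    for index, message in enumerate(messages):
        scores = tone_scores(message)
        if sum(scores.values()) >= threshold:
            flagged.append((index, scores))
    return flagged


# Hypothetical sample messages.
MESSAGES = [
    "Keep this between us. I am worried about what happens if the auditors ask.",
    "The quarterly report is attached for your review.",
    "Just do whatever it takes and delete this email afterwards.",
]
FLAGGED = flag_messages(MESSAGES)
```

Here the first and third messages are flagged and the routine one is not, which shows how tone scoring can surface red flags without a starting theory. A phrase count like this is only a triage aid; flagged messages still need to be read in context.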
Because text mining tools can identify grammatical components, they are adept at identifying proper names, places and events. This process is called named entity extraction (NEE), and it gives the investigator a powerful analytical tool. Because names and events can be pulled from email communications, NEE is useful in relationship mapping, which graphically represents relationships among the various subjects of an investigation. For example, without NEE, a relationship map may only show relations between the sender and recipient of an email communication. By adding extracted topics, names and places from the message, the relationship map takes on a new dimension. Some maps become extremely complex, as illustrated here.
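The difference between a sender-recipient map and one enriched with extracted entities can be sketched in a few lines of Python. The entity extraction here is a deliberately naive assumption: a regular expression that treats any run of two or more capitalized words as a name. It will miss single-word entities and can misfire on capitalized phrases; production tools use trained models. The emails and function names are hypothetical.

```python
import re
from collections import defaultdict

# Naive stand-in for named entity extraction: two or more capitalized words in a row.
ENTITY_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def extract_entities(text):
    """Return the set of multi-word capitalized phrases found in the text."""
    return set(ENTITY_PATTERN.findall(text))


def build_edges(emails):
    """Build (source, target, kind) edges for a relationship map."""
    edges = set()
    for email in emails:
        # The basic edge: who wrote to whom.
        edges.add((email["from"], email["to"], "emailed"))
        # The added dimension: who or what the sender wrote about.
        for entity in extract_entities(email["body"]):
            edges.add((email["from"], entity, "mentioned"))
    return edges


def entities_linking_people(edges):
    """Map each entity mentioned by more than one sender to those senders."""
    mentions = defaultdict(set)
    for source, target, kind in edges:
        if kind == "mentioned":
            mentions[target].add(source)
    return {ent: people for ent, people in mentions.items() if len(people) > 1}


# Hypothetical sample emails.
EMAILS = [
    {"from": "alice", "to": "bob",
     "body": "Please wire the payment to Acme Holdings before Victor Chen arrives in Lagos."},
    {"from": "bob", "to": "carol",
     "body": "Victor Chen confirmed the transfer."},
]
EDGES = build_edges(EMAILS)
LINKS = entities_linking_people(EDGES)
```

Without the "mentioned" edges, the map shows only that alice wrote to bob and bob wrote to carol. With them, a third party who never sent an email appears as a common link between two custodians, which is the kind of added dimension described above.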
In Part 3 of this series, we’ll delve deeper into the relationship mapping aspect of text mining. We’ll discuss how it goes beyond simple graphs and incorporates unique mathematical principles to help shed light on relationship networks and define their characteristics. We’ll also examine how this is useful in Foreign Corrupt Practices Act and anti-bribery and corruption investigations.