Spring 2022

Digitizing the Haystack: Using AI-powered Text Analytics

By Shuo Zhao, M.S., and James W. Rice, Ph.D.

Automated text analytics provide a powerful toolkit for identifying potentially responsible parties at contaminated sites.

Identifying potentially responsible parties (PRPs) at contaminated sites is often a challenge, especially when an area has experienced a long history of complex industrial development and use. To properly identify potential point and nonpoint contaminant sources, environmental professionals typically leverage a wealth of historical information (often collected over many decades from broad geographic areas). However, gathering useful information from the historical record can be complicated when such content is commingled within a large number of documents and amongst other irrelevant information (the metaphorical needle in the haystack). In such cases, manual document review can be cumbersome, time-consuming, and expensive. Automated text analytics have therefore become increasingly appealing for examining vast numbers of documents within a limited timeframe. Fortunately, recent technological breakthroughs in both artificial intelligence (AI) and machine learning algorithms have improved natural language processing (NLP) techniques, thereby streamlining document review for PRP identification.

“Gathering useful information from the historical record can be complicated when such content is commingled within a large number of documents and amongst other irrelevant information.”

Historical site research is a substantial component of PRP identification, especially at locations where use and ownership have changed over time. In industrialized areas (many of which have been or are being redeveloped for residential and mixed use), it is quite common to find that property owners have changed through multiple mergers and acquisitions and/or that the usage of the same property has changed over the course of that ownership. This makes PRP identification much more challenging. In such cases, the “entity recognition” technique, which locates named entities (products, people, companies, or locations) within voluminous text, can be a particularly powerful tool. Supplemented by automated keyword searching on critical dates, entity recognition allows environmental professionals to rapidly clarify a complicated site history, including ownership changes, manufacturing processes, and notable historical events.
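As a simplified illustration, the core idea of entity recognition supplemented by date searching can be sketched in a few lines of Python. Production systems typically rely on trained named-entity recognition models rather than hand-written patterns, and the company names and deed text below are hypothetical:

```python
import re

# Hypothetical gazetteer of entities of interest; in practice, such a list
# might be assembled from deed records, corporate filings, and site histories.
COMPANY_PATTERNS = [r"Acme Chemical(?: Co\.)?", r"Riverside Plating Works"]
DATE_PATTERN = r"\b(?:18|19|20)\d{2}\b"  # four-digit years (critical dates)

def extract_entities(text):
    """Return the (companies, years) mentioned in a document's text."""
    companies = set()
    for pattern in COMPANY_PATTERNS:
        companies.update(m.group(0) for m in re.finditer(pattern, text))
    years = sorted({m.group(0) for m in re.finditer(DATE_PATTERN, text)})
    return companies, years

# Hypothetical deed excerpt
deed = ("In 1947 the parcel was conveyed to Acme Chemical Co., which "
        "operated a solvent works until 1963, when Riverside Plating "
        "Works acquired the site.")

companies, years = extract_entities(deed)
```

Run over thousands of documents, even a simple extractor like this can assemble a timeline of owners and operations far faster than page-by-page reading.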

Natural Language Processing (NLP) to Identify Critical Documents

Figure: Text from Documents Translated to Graphics


In addition to historical documents, documents related to environmental response actions conducted at sites, such as remedial investigation/feasibility study (RI/FS) reports, regulatory decision documents, remedial action plans, and laboratory analytical reports, can be highly valuable in PRP identification. However, such critical documentation is often buried amongst a slew of relatively uninformative documents. This is quite common in litigation-related digital document productions, as well as in regulatory file repositories, both of which can strip informative material, such as file names and other metadata, from the historical record, making it extremely difficult to locate such documents using traditional keyword-searching methods. “Text classification,” an NLP technique, offers a novel approach to tackle this very issue.

Because environmental and related data reports contain unique text features (e.g., keywords, keyword frequency, document structure, word-to-number ratio), those features can be leveraged to differentiate key document types from a universe of other materials. A text classification algorithm built on such features can be applied to a large number of documents all at once. This allows scientists not only to dismiss irrelevant documents efficiently, but also to rank documents by importance, so that they can prioritize their review and focus on the documents most likely to be critical. As an example, Gradient recently assisted on a project that required the review of nearly 200,000 files (totaling hundreds of gigabytes). To handle this amount of information, Gradient built and applied a text classification algorithm to identify the very small percentage of documents that required manual review. This process greatly reduced the total volume of material that needed to be reviewed for the site, improving project efficiency and providing significant value.
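The feature-based approach can be sketched as follows. The keyword list and scoring weights here are illustrative assumptions, not Gradient's actual algorithm; a real classifier would learn its weights from labeled example documents rather than using hand-set values:

```python
import re

# Hypothetical keywords characteristic of environmental data reports
KEYWORDS = {"remedial", "feasibility", "groundwater", "analytical", "contaminant"}

def features(text):
    """Compute simple text features: keyword frequency and number ratio."""
    tokens = re.findall(r"[A-Za-z]+|\d+(?:\.\d+)?", text.lower())
    words = [t for t in tokens if t[0].isalpha()]
    numbers = [t for t in tokens if not t[0].isalpha()]
    kw_freq = sum(w in KEYWORDS for w in words) / max(len(words), 1)
    num_ratio = len(numbers) / max(len(tokens), 1)  # data reports are number-heavy
    return {"kw_freq": kw_freq, "num_ratio": num_ratio}

def score(text):
    """Weighted relevance score; weights are illustrative, not learned."""
    f = features(text)
    return 5.0 * f["kw_freq"] + 2.0 * f["num_ratio"]

# Two hypothetical documents: an RI/FS excerpt and an uninformative cover letter
docs = {
    "ri_fs_excerpt": "Remedial investigation results: groundwater analytical "
                     "data showed contaminant levels of 12.4 and 30.1 ug/L.",
    "cover_letter": "Please find enclosed the documents you requested last week.",
}
ranked = sorted(docs, key=lambda name: score(docs[name]), reverse=True)
```

Ranking documents by such a score lets reviewers start from the top of the list and stop once the scores drop below a threshold, which is how the bulk of an uninformative document universe can be set aside.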

It is worth noting that the techniques discussed here are by no means a replacement for manual document review by qualified environmental scientists. They instead amount to a powerful toolkit to support environmental scientists when reviewing vast historical and data records. These techniques enable professionals to focus on information that is most relevant to the task at hand, that is, extracting and interpreting data to pinpoint PRPs.

Contact Info

The authors can be reached at szhao@gradientcorp.com and jrice@gradientcorp.com.