Oak Ridge National Laboratory's Computational Data Analytics Group has worked for over 12 years on creating text analytics systems to quickly
discover meaningful information from raw data. These capabilities focus on six key areas, emphasizing high performance over very large sets
of raw documents.
Collecting and extracting: Collecting millions of documents from databases, the Internet, social media, and hard drives; extracting
text from hundreds of file formats; and translating this information into multiple languages.
Storing and indexing: Storing and indexing millions of documents in search servers, distributed file systems (MapReduce),
relational databases, and file systems.
Recommending: Filtering the full content of millions of documents to recommend the most valuable and relevant information based on a user’s
own information, or user selections, or a user’s interactions with information.
Categorizing: Grouping items based on the full content of documents using supervised and semi-supervised machine learning methods and
targeted search lists.
Clustering: Creating a hierarchical group of documents based on similarity using unsupervised learning methods on the full content
of each document.
Visualizing: Showing hierarchies, groups, and relationships among documents that help the user quickly understand their value and
see new connections.
This work has resulted in four issued patents (7,072,883; 7,315,858; 7,693,903; 7,805,446) and four pending patents, several commercial licenses
(including Pro2Serve and TextOre), a spin-off company (Global Security Information Analysts LLC, GSIA), an R&D 100 Award, and scores of
peer-reviewed research publications.
Case study of Piranha's Text Mining Capabilities
In large cases, millions of files must be manually processed to discover potential crimes and threats. To solve this problem, a typical
customer reviews several options:
Option 1: Use a search engine or document management technology to build a case. Drawback: Each keyword of interest returns thousands of
hits, all of which must be manually processed.
Option 2: Use visual analysis tools such as Palantir or Analyst's Notebook. Drawback: The documents must be manually processed and tagged
before the tool can be used, which significantly limits the number of documents that can be processed.
Option 3: Use Piranha to sift through and analyze the documents. Piranha works on hundreds of raw data formats and can process data
extremely fast on typical computers.
For a recent customer, millions of files were loaded overnight into a desktop version of Piranha. The next day, using the customer's 1200
keyword list, Piranha’s initial filter recommended one thousand documents. Piranha returned documents that contain sets of infrequently
occurring keywords, which often are valuable to the customer.
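The idea of favoring documents that contain sets of infrequently occurring keywords can be illustrated with a small sketch. This is not Piranha's actual scoring method, which the text above does not describe; it assumes an inverse-document-frequency weight, and the function names `rare_keyword_scores` and `recommend` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def rare_keyword_scores(docs, keywords):
    """Score each document by the summed rarity of the keywords it contains.

    Assumption: rarity is log(N / document_frequency). Keyword matching is a
    plain lowercase substring test, which is a simplification.
    """
    lowered = {name: text.lower() for name, text in docs.items()}
    keys = sorted({k.lower() for k in keywords})
    # Document frequency: the number of documents in which each keyword occurs.
    df = Counter(k for text in lowered.values() for k in keys if k in text)
    n = len(lowered)
    return {
        name: sum(math.log(n / df[k]) for k in keys if k in text)
        for name, text in lowered.items()
    }


def recommend(docs, keywords, top_n):
    """Return up to top_n document names with the highest non-zero scores."""
    scores = rare_keyword_scores(docs, keywords)
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    return [name for name in ranked if scores[name] > 0][:top_n]


# Hypothetical sample data for illustration only.
SAMPLE_DOCS = {
    "a.txt": "Invoice from Doe and Sons Manufacturing for shell company transfer",
    "b.txt": "Invoice for office supplies",
    "c.txt": "Invoice for catering",
}
SAMPLE_KEYWORDS = ["invoice", "shell company", "Doe and Sons"]
print(recommend(SAMPLE_DOCS, SAMPLE_KEYWORDS, top_n=2))
```

In this sketch a keyword that appears in every document contributes nothing to the score, so only documents containing the rarer keywords are recommended.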
Next, the 1200 keywords were grouped into 86 topics. For example, the keywords
John Doe, President of Doe and Sons Manufacturing of Springfield, Iowa; Jane Doe, Vice President of Doe and Sons Manufacturing; and John Doe, Jr.,
Chief Technology Officer of Doe and Sons Manufacturing
would be contained in the topic John Doe. Piranha’s second filter used these topics to find the closest matches to individual topics, further
reducing the number of documents to 50. These two filtering steps took about 4 hours.
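A topic-based second filter of this kind can be sketched as follows. The text does not say how Piranha measures the closeness of a document to a topic, so this example assumes a simple measure: the fraction of a topic's keywords that occur in the document. The names `topic_match` and `closest_documents` and the sample data are hypothetical.

```python
def topic_match(text, topic_keywords):
    """Return the fraction of a topic's keywords found in the text (0.0 to 1.0)."""
    if not topic_keywords:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for k in topic_keywords if k.lower() in lowered)
    return hits / len(topic_keywords)


def closest_documents(docs, topics, per_topic):
    """For each topic, list up to per_topic matching documents, best match first."""
    result = {}
    for topic, keywords in topics.items():
        scored = [(topic_match(text, keywords), name) for name, text in docs.items()]
        # Drop documents that match none of the topic's keywords.
        scored = [(score, name) for score, name in scored if score > 0]
        # Highest score first; ties are broken by document name.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[topic] = [name for _, name in scored[:per_topic]]
    return result


# Hypothetical sample data for illustration only.
SAMPLE_TOPICS = {"John Doe": ["John Doe", "Jane Doe", "Doe and Sons Manufacturing"]}
SAMPLE_TOPIC_DOCS = {
    "memo.txt": "John Doe met Jane Doe at Doe and Sons Manufacturing.",
    "note.txt": "Jane Doe called.",
    "misc.txt": "Nothing relevant here.",
}
print(closest_documents(SAMPLE_TOPIC_DOCS, SAMPLE_TOPICS, per_topic=2))
```

Grouping keywords into topics this way means a document is judged against a whole set of related terms at once, rather than triggering a separate hit for every individual keyword.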
Piranha was then used to cluster these 50 documents by converting the documents into vectors and comparing the vectors to produce a hierarchy
of similar documents. This hierarchy and document set were presented to the customer the following day.
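The clustering step described above (documents to vectors, vector comparison, hierarchy) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Piranha's patented method: it assumes raw term-frequency vectors, cosine similarity, and a simple agglomerative merge that compares the summed vectors of clusters. The names `vectorize`, `cosine`, and `build_hierarchy` and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a document into a term-frequency vector of lowercase word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_hierarchy(docs):
    """Repeatedly merge the two most similar clusters into a nested tuple tree.

    Leaves are document names. Each cluster is compared through the sum of its
    members' vectors (an assumed, centroid-style linkage).
    """
    clusters = [(name, vectorize(text)) for name, text in docs.items()]
    if not clusters:
        return None
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical sample data for illustration only.
SAMPLE_CLUSTER_DOCS = {
    "a": "wire transfer to shell company",
    "b": "shell company wire transfer records",
    "c": "quarterly picnic schedule",
}
print(build_hierarchy(SAMPLE_CLUSTER_DOCS))
```

The pairwise search makes this sketch quadratic in the number of clusters at each merge, which is acceptable for a set of about 50 documents but would need a more scalable approach for millions.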
Piranha finds Actionable Intelligence
The case agent was amazed by the results. In a day's time, Piranha was able to discover the main points of the case. The agents then used Piranha
over the next three days to discover several previously unknown pieces of actionable intelligence, including:
An active shell company
The target’s organizational details
Piranha was able to quickly and effectively find a valuable set of documents that provided a rich set of productive leads for further
investigation. Piranha is being used on additional cases for other agencies.
Try a Demo Version of Piranha
Click here to try Piranha. Note that this version only allows 200 documents. If you have trouble with the demo version, or would like to
obtain a version which allows more documents, please contact us.