Projects

Piranha - Big Data Analytics

The Computational Data Analytics Group at Oak Ridge National Laboratory has spent more than 12 years creating text analytics systems that quickly discover meaningful information in raw data. These capabilities focus on six key areas, emphasizing high performance over very large sets of raw documents.

 

Collecting and Extracting: Collecting millions of documents from databases, the Internet, social media, and hard drives; extracting text from hundreds of file formats; and translating this information into multiple languages.

Storing and Indexing: Storing and indexing millions of documents in search servers, distributed file systems (MapReduce), relational databases, and file systems.

Recommending: Filtering the full content of millions of documents to recommend the most valuable and relevant information based on a user’s own information, selections, or interactions with information.

Categorizing: Grouping items based on the full content of documents using supervised and semi-supervised machine learning methods and targeted search lists.

Clustering: Creating a hierarchical group of documents based on similarity using unsupervised learning methods on the full content of each document.

Visualizing: Showing hierarchies, groups, and relationships among documents to help the user quickly understand their value and see new connections.

This work has resulted in four issued patents (7,072,883; 7,315,858; 7,693,903; 7,805,446) and four pending patents, several commercial licenses (including Pro2Serve and TextOre), a spin-off company (Global Security Information Analysts LLC (GSIA)), an R&D 100 Award, and scores of peer-reviewed research publications.

Case study of Piranha's Text Mining Capabilities

In large cases, millions of files must be manually processed to discover potential crimes and threats. To solve this problem, a typical customer reviews several options:

Option 1: Use a search engine or document management technology to build a case. Drawback: Each keyword of interest returns thousands of hits that must be manually processed.

Option 2: Use visual analysis tools such as Palantir or Analyst's Notebook. Drawback: The documents must be manually processed and tagged before the tool can be used, which significantly limits the number of documents that can be processed.

Option 3: Use Piranha to sift through and analyze the documents. Piranha works on hundreds of raw data formats and can process data extremely fast on typical computers.

For a recent customer, millions of files were loaded overnight into a desktop version of Piranha. The next day, using the customer's 1,200-keyword list, Piranha’s initial filter recommended one thousand documents. Piranha returned documents that contain sets of infrequently occurring keywords, which are often valuable to the customer.
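
The idea of favoring documents that contain sets of infrequently occurring keywords can be illustrated with a minimal Python sketch. It is not Piranha's actual filter, which this page does not describe. The inverse-document-frequency weighting, the function name rank_by_rare_keywords, the simple substring matching, and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def rank_by_rare_keywords(docs, keywords, top_n=3):
    """Rank documents by the summed rarity of the keywords they contain.

    Hypothetical illustration only: rarity is an inverse document
    frequency (IDF) weight, which is an assumption, not Piranha's method.
    Keywords are matched as case-insensitive substrings.
    """
    keywords = [k.lower() for k in keywords]
    lowered = {name: text.lower() for name, text in docs.items()}

    # Count how many documents contain each keyword (document frequency).
    doc_freq = Counter()
    for text in lowered.values():
        for kw in keywords:
            if kw in text:
                doc_freq[kw] += 1

    n_docs = len(docs)
    scores = {}
    for name, text in lowered.items():
        # Keywords found in fewer documents contribute a larger weight.
        scores[name] = sum(
            math.log(n_docs / doc_freq[kw])
            for kw in keywords
            if kw in text
        )

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Made-up sample data for the sketch.
docs = {
    "memo.txt": "Quarterly invoice for Doe and Sons Manufacturing.",
    "email.txt": "John Doe asked about the shell company transfer.",
    "notes.txt": "Invoice attached. Invoice number pending.",
    "misc.txt": "Lunch menu for Friday.",
}
keywords = ["invoice", "shell company", "John Doe"]
top = rank_by_rare_keywords(docs, keywords)
print(top[0][0])  # the document holding the two rarest keywords
```

In this toy data the email ranks first because it contains two keywords that appear in no other document, while the common keyword "invoice" contributes less to the documents that contain it.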

Next, the 1,200 keywords were grouped into 86 topics. For example, the keywords:

John Doe, President of Doe and Sons Manufacturing of Springfield, Iowa; Jane Doe, Vice President of Doe and Sons Manufacturing; John Doe, Jr., Chief Technology Officer of Doe and Sons Manufacturing

would be contained in the topic John Doe. Piranha’s second filter used these topics to find the closest matches to individual topics, further reducing the number of documents to 50. These two filtering steps took about 4 hours.
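
A topic-based second filter of this kind might look like the following sketch. How Piranha measures closeness to a topic is not stated here, so the sketch assumes a simple coverage score: the fraction of a topic's keywords found in a document. The function names, the threshold, the "Shipping" topic, and the sample documents are hypothetical.

```python
def topic_match_scores(text, topics):
    """Return, per topic, the fraction of its keywords found in the text.

    Hypothetical scoring: simple keyword coverage is assumed because the
    source does not say how closeness to a topic is measured.
    """
    lowered = text.lower()
    return {
        topic: sum(kw.lower() in lowered for kw in kws) / len(kws)
        for topic, kws in topics.items()
    }


def filter_by_topics(docs, topics, threshold=0.5):
    """Keep documents whose best topic score meets the threshold."""
    kept = {}
    for name, text in docs.items():
        scores = topic_match_scores(text, topics)
        best_topic = max(scores, key=scores.get)
        if scores[best_topic] >= threshold:
            kept[name] = (best_topic, scores[best_topic])
    return kept


# The first topic is built from the example keywords in the text; the
# second topic and all documents are made up for illustration.
topics = {
    "John Doe": [
        "John Doe",
        "Jane Doe",
        "John Doe, Jr.",
        "Doe and Sons Manufacturing",
    ],
    "Shipping": ["freight", "customs", "bill of lading"],
}
docs = {
    "a.txt": "Jane Doe signed for Doe and Sons Manufacturing.",
    "b.txt": "Customs cleared the freight on Monday.",
    "c.txt": "Weather report for Springfield.",
}
kept = filter_by_topics(docs, topics)
print(sorted(kept))
```

Grouping keywords into topics means a document can match a topic without containing every spelling or title variant of a name, which is one plausible reason a topic-level filter narrows results more usefully than the raw keyword list.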

Piranha was then used to cluster these 50 documents by converting the documents into vectors and comparing the vectors to produce a hierarchy of similar documents. This hierarchy and document set was presented to the customer the following day.
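
The vector-and-hierarchy step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Piranha's implementation: it uses raw term-frequency vectors, cosine similarity, and a basic agglomerative merge that represents a merged cluster by the sum of its members' vectors. The function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Convert a document into a term-frequency vector (word -> count)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(docs):
    """Agglomerative clustering; returns a nested tuple hierarchy.

    Assumed method: repeatedly merge the two most similar clusters,
    representing a merged cluster by the sum of its members' vectors.
    """
    clusters = [(name, to_vector(text)) for name, text in docs.items()]
    while len(clusters) > 1:
        # Find the most similar pair of current clusters.
        i, j = max(
            (
                (i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
            ),
            key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        # Replace the pair with the merged cluster.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Made-up sample documents for the sketch.
docs = {
    "d1": "shell company wire transfer offshore",
    "d2": "offshore shell company account transfer",
    "d3": "factory safety inspection report",
    "d4": "annual factory inspection schedule",
}
hierarchy = cluster(docs)
print(hierarchy)
```

The nested tuples play the role of the hierarchy: documents merged early sit deepest and are the most similar, while the top-level split separates the least related groups.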

Piranha finds Actionable Intelligence

The case agent was amazed by the results. In a day's time, Piranha was able to discover the main points of the case. The agents then used Piranha over the next three days to discover several previously unknown pieces of actionable intelligence, including:

New suspects
An active shell company
The target’s organizational details

Piranha was able to quickly and effectively find a valuable set of documents that provided a rich set of productive leads for further investigation. Piranha is being used on additional cases for other agencies.

References

Issued Patents

System for gathering and summarizing internet information

Method for gathering and summarizing internet information (2008)

Method for gathering and summarizing internet information (2010)

Agent-based method for distributed clustering of textual information

Dynamic reduction of dimensions of a document vector in a document search and retrieval system

Method and system for determining precursors of health abnormalities from processing medical records

Patents Pending

Method And System To Discover And Recommend Interesting Documents

Method And System Of Filtering And Recommending Documents

Cloud Computing Method For Dynamically Scaling A Process Across Physical Machine Boundaries

Key Papers

R. M. Patton, B. G. Beckerman, T. E. Potok, G. Tourassi, "A Recommender System for Web-Based Discovery and Refinement of Information Radiologists Seek", Radiological Society of North America (RSNA), 2012 Annual Meeting, Nov. 2012, Chicago, IL, USA.

R. M. Patton, T. E. Potok, B. A. Worley, "Discovery & Refinement of Scientific Information via a Recommender System", The Second International Conference on Advanced Communications and Computation, Oct. 2012, Venice, Italy.

Steed, Chad A. (ORNL), Symons, Christopher T. (ORNL), DeNap, Frank (ORNL), Potok, Thomas E. (ORNL), “Guided Text Analysis Using Adaptive Visual Analytics,” Paper in Conf. Proceedings (book, CD), Visualization and Data Analysis 2012, Burlingame, California, January 23-25, 2012.

Patton, Robert M. (ORNL), McNair, Wade (ORNL), Symons, Christopher T. (ORNL), Treadwell, Jim N. (ORNL), Potok, Thomas E. (ORNL), “A Text Analysis Approach to Motivate Knowledge Sharing via Microsoft SharePoint,” Paper in Conf. Proceedings (book, CD), 45th Hawaii International Conference on System Sciences, Wailea, Hawaii, January 4, 2012.

Patton, Robert M. (ORNL), Rojas, Carlos C. (ORNL), Beckerman, Barbara G. (ORNL), Potok, Thomas E. (ORNL), “A Computational Framework for Search, Discovery, and Trending of Patient Health in Radiology Reports,” Paper in Conf. Proceedings (book, CD), 1st IEEE Conference on Healthcare Informatics, Imaging, and Systems Biology, San Jose, California, July 2011.

Patton, Robert M. (ORNL), Beckerman, Barbara G. (ORNL), Potok, Thomas E. (ORNL), Analysis and Classification of Mammography Reports Using Maximum Variation Sampling, Stephen L. Smith and Stefano Cagnoni (Eds.), Genetic and Evolutionary Computation: Medical Applications, pp. 113-131, Wiley Publishing, West Sussex, United Kingdom, January 2011.

Cui, Xiaohui (ORNL), Mueller, Frank (North Carolina State University), Zhang, Yongpeng (ORNL), Potok, Thomas E. (ORNL), “Data-Intensive Document Clustering on GPU Clusters,” Journal of Parallel and Distributed Computing, December 2010.

Patton, Robert M. (ORNL), Beckerman, Barbara G. (ORNL), Potok, Thomas E. (ORNL), Treadwell, Jim N. (ORNL), “Genetic Algorithm for Analysis of Abdominal Aortic Aneurysms in Radiology Reports,” Paper in Conf. Proceedings (book, CD), 2010 Genetic and Evolutionary Computation Conference, Portland, Oregon, July 2010.

Cui, Xiaohui (ORNL), Potok, Thomas E. (ORNL), Cavanagh, Joseph M. (ORNL), “Parallel Latent Semantic Analysis using a Graphics Processing Unit,” Paper in Conf. Proceedings (book, CD), 2009 Genetic and Evolutionary Computation Conference, July 2009.

Patton, Robert M. (ORNL), Potok, Thomas E. (ORNL), Beckerman, Barbara G. (ORNL), Treadwell, Jim N. (ORNL), “A Genetic Algorithm for Learning Significant Phrase Patterns in Radiology Reports,” Paper in Conf. Proceedings (book, CD), Genetic and Evolutionary Computation Conference 2009, Montreal, Canada, July 2009.

X. Cui, J. M. Beaver, J. St. Charles, T. E. Potok, Dimensionality Reduction for High Dimensional Particle Swarm Clustering, Proceedings of the IEEE Swarm Intelligence Symposium, September, 2008, St. Louis, USA

Patton, Robert M. (ORNL), Potok, Thomas E. (ORNL), “Identifying Event Impacts by Monitoring the News Media,” Paper in Conf. Proceedings (book, CD), 12th International Conference on Information Visualization, London, UK, July 2008.

Patton, R.M., Cui, X., Jiao, Y., and Potok, T.E. (2008). Evolutionary computing. Intelligent Data Analysis: Developing New Methodologies through Pattern Discovery and Recovery, Idea Group Inc., Hershey, PA.

X. Cui and T. E. Potok, A Particle Swarm Social Model for Multi-Agent Based Insurgency Warfare Simulation, Proceedings of the IEEE Eighth International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, August, 2007, Busan, Korea

J. W. Reed, T. E. Potok, and R. M. Patton, "A multi-agent system for distributed cluster analysis," in Proceedings of Third International Workshop on Software Engineering for Large-Scale Multi-Agent Systems (SELMAS'04), W16L Workshop - 26th International Conference on Software Engineering, Edinburgh, Scotland, UK: IEE, 2004, pp. 152-5.

J. Reed, Y. Jiao, T. E. Potok, B. Klump, M. Elmore, and A. R. Hurson, "TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams," in Proceedings of 5th International Conference on Machine Learning and Applications (ICMLA'06), Orlando, FL, 2006, pp. 258-263.

P. Yan, Y. Jiao, A. R. Hurson, and T. E. Potok, "Semantic-based information retrieval of biomedical data," in Proceedings of the 2006 ACM symposium on Applied computing Dijon, France: ACM Press, 2006.

T. E. Potok, M. T. Elmore, J. W. Reed, and N. F. Samatova, "An ontology-based HTML to XML conversion using intelligent agents," in Proceedings of the 35th Annual Hawaii International Conference on System Sciences Big Island, HI, USA: IEEE Comput. Soc, 2002, pp. 1220-9.

R. M. Patton and T. E. Potok, "Characterizing large text corpora using a maximum variation sampling genetic algorithm," in Proceedings of the 8th annual conference on Genetic and evolutionary computation Seattle, Washington, USA: ACM Press, 2006.

P. Palathingal, T. E. Potok, and R. M. Patton, "Agent based approach for searching, mining and managing enormous amounts of spatial image data," in Proceedings of the Eighteenth International Florida Artificial Intelligence Research Society Conference, FLAIRS 2005 - Recent Advances in Artificial Intelligence, Clearwater Beach, FL, United States: American Association for Artificial Intelligence, Menlo Park, CA 94025-3496, United States, 2005, pp. 351-356.

M. T. Elmore, T. E. Potok, and F. T. Sheldon, "Dynamic data fusion using an ontology-based software agent system," in Proceedings of 7th World Multiconference on Systemics, Cybernetics and Informatics (SCI 2003), vol. 9, Orlando, FL, USA: IIIS, 2003, pp. 5-