Project Title: CAREER: A Unified Architecture for Data Mining Large Biomedical Literature Databases

Sponsor: National Science Foundation (NSF), Award No. IIS 0448023

PI: Xiaohua Hu

Amount: $415,000

Duration: March 15, 2005 – Feb 28, 2010


Project Description

The large number of documents in biomedical literature databases, and the lack of formal structure in their natural-language narrative, make searching and processing them very difficult for many scientists involved in bioinformatics research. This CAREER project is investigating the efficiency and effectiveness of various data mining techniques and methods and developing a unified framework for mining large biomedical literature databases. Currently we are focusing on graph-based text mining techniques and methods and their application to biomedical literature. The project is testing these methods in real-world bioinformatics domains such as chromatin interaction networks and microarray data analysis. The software package developed from this effort, the Dragon Toolkit, is freely available for academic research use. Related publications from this project are listed below.

The broader impact of this project on society is the generation of a novel unified architecture for biomedical literature data mining. This integrated and complementary approach has the potential to create a powerful new tool for bioinformatics and for most text processing tasks. The project also has the potential to attract diverse collaborators who have an interest in accessing complex biomedical or general scientific data and information. Students are involved in this research through hands-on projects, a Co-Op program, and courses at both the graduate and undergraduate levels.

 

Researchers involved in the project

Xiaohua Hu (Faculty, PI)

Illhoi Yoo (former Ph.D. student, graduated in June 2006, tenure-track faculty at the Univ. of Missouri-Columbia)

Xiaohua Zhou (graduated in 2008, Data Analyzing Director at LYZ Capital Advisors LLC)

Xiaodan Zhang (graduated in 2009, Research Scientist at Vertex Pharmaceuticals)

Caimei Lu (Ph.D. student)

Xin Chen (Ph.D. student)

 

The Dragon Toolkit

The Dragon Toolkit is a Java-based development package for academic research use in language modeling (LM) and information retrieval (IR). Language modeling has recently emerged as an attractive new framework for text information retrieval and text mining (TM). However, most free Java-based search engines, such as Lucene, do not support LM very well. The Lemur toolkit is designed for LM and IR but is written in C and C++, which may be a hindrance to people who prefer Java. The Dragon Toolkit is therefore tailored to researchers who work on large-scale LM and IR and prefer Java programming. Moreover, unlike Lucene and Lemur, it provides built-in support for semantic-based IR and TM. The Dragon Toolkit seamlessly integrates a set of NLP tools, which enable it to index text collections with various representation schemes, including words, phrases, ontology-based concepts, and relationships. To minimize the learning time, however, we intentionally keep the package small and simple. The toolkit therefore lacks some features, such as distributed IR and cross-language IR, that are part of the Lemur toolkit.
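
For readers new to the language modeling approach to retrieval mentioned above, the following is a minimal, self-contained sketch of query-likelihood scoring with Dirichlet smoothing. It does not use or reproduce the Dragon Toolkit API: the class name QueryLikelihoodSketch, the methods termCounts and score, the toy documents, and the smoothing value mu = 10 are all assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: hypothetical names, NOT part of the Dragon Toolkit.
public class QueryLikelihoodSketch {

    /** Counts term occurrences in lower-cased, whitespace-tokenized text. */
    public static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tok : text.toLowerCase().split("\\s+")) {
            if (!tok.isEmpty()) {
                counts.merge(tok, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Sums all counts, i.e. the length of a document or collection in tokens. */
    private static long length(Map<String, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        return total;
    }

    /**
     * Returns log P(query | document) under a unigram language model with
     * Dirichlet smoothing:
     *   p(w | d) = (count(w, d) + mu * p(w | collection)) / (|d| + mu)
     * Query terms that never occur in the collection are skipped
     * (a simplifying assumption of this sketch).
     */
    public static double score(String query,
                               Map<String, Integer> doc,
                               Map<String, Integer> collection,
                               double mu) {
        long docLen = length(doc);
        long collLen = length(collection);
        double logProb = 0.0;
        for (String term : query.toLowerCase().split("\\s+")) {
            int cf = collection.getOrDefault(term, 0);
            if (term.isEmpty() || cf == 0) {
                continue;
            }
            double pColl = (double) cf / collLen;
            double p = (doc.getOrDefault(term, 0) + mu * pColl) / (docLen + mu);
            logProb += Math.log(p);
        }
        return logProb;
    }

    public static void main(String[] args) {
        // Toy collection, invented for this example.
        String[] docs = {
            "gene expression in yeast",
            "protein interaction network analysis"
        };
        Map<String, Integer> collection = termCounts(String.join(" ", docs));
        for (String d : docs) {
            double s = score("protein network", termCounts(d), collection, 10.0);
            System.out.println(s + "\t" + d);
        }
    }
}
```

Documents are ranked by descending score, so in this toy collection the second document ranks above the first for the query "protein network". A semantic variant of the kind the toolkit targets would replace or augment the word counts with counts over phrases or ontology-based concepts; that step is not shown here.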

 

How to Cite Dragon Toolkit

If you are using the Dragon Toolkit for research work, please cite it in your published papers:

Zhou, X., Zhang, X., and Hu, X., Dragon Toolkit: Incorporating Auto-learned Semantic Knowledge into Large-Scale Text Retrieval and Mining, in the Proceedings of the 2007 IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pp. 197-201, http://www.dragontoolkit.org

Download Dragon Toolkit

Get the Dragon Toolkit source code and binary libraries (including external libraries) and necessary supporting data. Click here to download.

 

 

Papers published related to this project

  1. Hu X., Park E.K., Zhang X., Microarray Gene Cluster Identification and Annotation through Cluster Ensemble and EM based Informative Textual Summarization, in IEEE Transactions on Information Technology in Biomedicine, Sept. 2009, Vol. 13, No. 5, pp. 832-840
  2. Hu X., Ng M., Wu F., Sokhansanj B., Mining, Modeling and Evaluation of Sub-Network from Large Biomolecular Networks and its Comparison Study, IEEE Transactions on Information Technology in Biomedicine, March 2009, Vol. 13, No. 2, pp. 184-194
  3. Hu X., Zhang X., Lu C., Park E.K., Exploring Wikipedia as External Knowledge for Document Clustering, in the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 389-396
  4. Wu D., Hu X., He T., Exploratory Analysis of Protein Translation Regulatory Networks Using Hierarchical Random Graphs, in the 2009 IEEE International Conference on Bioinformatics and Biomedicine
  5. Chen X., Hu X., Shen X., Spatial Weighting for Bag-of-Visual-Words Representation and Its Application in Content-Based Image Retrieval, in the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2009), pp. 867-874
  6. Zhang X., Hu X., Zhou X., A Comparative Evaluation of Different Link Types on Enhancing Document Clustering, in the 31st Annual International ACM SIGIR Conference on Research & Development on Information Retrieval (SIGIR 2008), pp. 555-562
  7. Achananuparp P., Hu X., Shen X., The Evaluation of Sentence Similarity Measures, in the 2008 International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2008), pp. 305-316
  8. Hu X., Sokhansanj B., Wu D., Tang Y., A Novel Approach for Mining and Dynamic Fuzzy Simulation of Biomolecular Network, IEEE Transactions on Fuzzy Systems, Vol. 15, No. 6, Dec. 2007, pp. 1219-1229
  9. Zhang X., Jing L., Hu X., Ng M., Xia J., Zhou X., Medical Document Clustering Using Ontology Based Term Similarity Measures, in the International Journal of Data Warehousing and Mining, 2008
  10. Zhou X., Zhang X., Hu X., Semantic Smoothing for Bayesian Text Classification with Small Training Data, SIAM SDM 08
  11. Zhou X., Hu X., Zhang X., A Segment-based Hidden Markov Model for Real-Setting Pinyin-to-Chinese Conversion, in the ACM CIKM 2007, pp. 1027-1030 (acceptance rate: 26%, 512 submissions)
  12. Hu X., Wu D., Data Mining and Predictive Modeling of Biomolecular Network from Biomedical Literature Databases, IEEE/ACM Transactions on Computational Biology and Bioinformatics (March-April 2007)
  13. Zhou X., Hu X., Zhang X., Topic Signature Language Models for Ad-hoc Retrieval, in the IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE)
  14. Hu X., Wu F.X., Ng M., Sokhansanj B., Mining and Dynamic Simulation of Sub-Networks from Large Biomolecular Networks, in 2007 International Conference on Artificial Intelligence, June 25-28, Las Vegas, USA (Best Paper Award, out of 500 submissions)
  15. Yoo I., Hu X., Song I-Y., A Coherent Document Clustering and Text Summarization Approach through a Scale-free Ontology-enriched Graphical Representation, BMC Bioinformatics
  16. Yoo I., Hu X., Song I-Y., Biomedical Ontology Improves Biomedical Literature Clustering Performance: A Comparison Study, in the International Journal of Bioinformatics Research and Application
  17. Song M., Song I-Y., Hu X., Allen B., Integration of Association Rules and Ontology for Semantic Query Expansion, in the Journal of Data and Knowledge Engineering (DKE)
  18. Zhou X., Zhang X., Hu X., Semantic Smoothing of Document Models for Agglomerative Clustering, accepted in the Twentieth International Joint Conference on Artificial Intelligence (IJCAI 07), Hyderabad, India, Jan 6-12, 2007
  19. Zhang X., Jing L., Hu X., Ng M., Zhou X., A Comprehensive Study of Ontology Based Term Similarity Measures on Document Clustering, accepted in the 12th International Conference on Database Systems for Advanced Applications (DASFAA 2007)
  20. Zhou X., Hu X., Lin X., Zhang X., Relation-based Document Retrieval for Biomedical IR, in LNCS Computational Systems Biology, Vol. 4, pp. 112-128
  21. Huang Z., Li Y., Hu X., Anti-parallel Coiled Coils Structure Prediction by Support Vector Machine Classification, in LNCS Transactions on Systems Biology, Vol. 4, pp. 1-8
  22. Hu X., Lin T.Y., Song I-Y., Lin X., Yoo I., Song M., A Semi-supervised Efficient Learning Approach to Extract Biological Relationships from Web-based Biomedical Digital Library, in International Journal of Web Intelligence and Agent System, Vol. 4, No. 3, 2006
  23. Zhou X., Hu X., Zhang X., Lin X., Song I-Y., Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR, in the Proceedings of the 29th Annual International ACM SIGIR Conference on Research & Development on Information Retrieval (SIGIR 2006), pp. 170-177 (acceptance rate: 18.5%, 74/399)
  24. Hu X., Zhang X., Yoo I., Zhang Y-Q., A Semantic Approach for Mining Hidden Links from Complementary and Non-Interactive Biomedical Literature, Proceedings of the 6th SIAM International Conference on Data Mining (SIAM SDM 06), April 20-22, 2006, Bethesda, MD, USA, pp. 200-209 (acceptance rate: 16%, 40/244)
  25. Hu X., Zhang X., Zhou X., Integration of Cluster Ensemble and EM based Text Mining for Microarray Gene Cluster Identification and Annotation, in the Proceedings of ACM 15th Conference on Information and Knowledge Management (ACM CIKM 2006), post paper (537 submissions, 15% acceptance rate for full papers, 10% acceptance rate for post papers)
  26. Zhang X., Zhou X., Hu X., Semantic Smoothing for Model-based Document Clustering, accepted in the 2006 IEEE International Conference on Data Mining (IEEE ICDM06), Dec. 18-22, 2006, Hong Kong (800 submissions, acceptance rate: 20%)
  27. Hu X., Constructing Ensembles of Classifiers for Data Mining Applications based on Rough Set Theory and Set-Oriented Database Operations, in the Proceedings of the 2006 IEEE International Conference on Granular Computing (IEEE GrC 2006), Atlanta, GA, May 15-17, 2006, pp. 67-73 (acceptance rate: 15%, 49/321)
  28. Yoo I., Hu X., Song I-Y., Integration of Semantic-based Bipartite Graph Representation and Mutual Refinement Strategy for Biomedical Literature Clustering, in the Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2006), short paper, pp. 791-796 (acceptance rate for full paper: 11%, acceptance rate for short paper: 12%, 50 full papers, 55 short papers out of 457 submissions)
  29. Yoo I., Hu X., A Comprehensive Comparison Study of Document Clustering for A Biomedical Digital Library MEDLINE, in the Proceedings of the 2006 ACM/IEEE Joint Conference on Digital Library (JCDL 2006), June 11-15, 2006, Chapel Hill, NC, USA, pp. 220-229 (acceptance rate: 15%, 28/188)
  30. Yoo I., Hu X., Clustering Ontology-enriched Graph Representation for Biomedical Documents based on Scale-Free Network Theory, accepted in the IEEE Conference on Intelligent Systems (IEEE IS'06), Sept 4-6, 2006 (acceptance rate: 16.7%, 100/600)
  31. Zhou X., Zhang X., Hu X., Using Concept-based Indexing to Improve Language Modeling Approach to Genomic, in the Proceedings of the 28th European Conference on Information Retrieval (ECIR 2006), pp. 444-455 (acceptance rate: 20%, 37/178)
  32. Zhu W., Xu X., Hu X., Song I-Y., Allen B., Using UMLS-based Re-Weighting Terms as a Query Expansion Strategy, in the Proceedings of the 2006 IEEE International Conference on Granular Computing (IEEE GrC 2006), Atlanta, GA, May 15-17, 2006, pp. 217-222 (acceptance rate: 15%, 49/321)
  33. Zhou X., Hu X., Lin X., Han H., Zhang X., Relational-based Document Retrieval for Biomedical Literature Databases, in the Proceedings of the 11th International Conference on Database Systems for Advanced Applications (DASFAA 2006), pp. 689-701 (acceptance rate: 25%, 47/188)
  34. Yoo I., Hu X., Clustering Large Collection of Biomedical Literature based on Ontology-enriched Bipartite Graph Representation and Mutual Refinement Strategy, in the Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2006, pp. 303-312 (acceptance rate: 20%, 100/500)
  35. Wu C., Hu X., Yang X., Yang J., Expanding Tolerance RST Models based on Cores of Maximal Compatible Blocks, accepted as full paper in the 5th International Conference on Rough Sets and Current Trends in Computing (RSCTC 2006) (acceptance rate: 27.4%, 91/332)
  36. Zhang X., Wu D., Zhou X., Hu X., A Language Modeling Text Mining Approach to the Annotation of Protein Community, accepted in the Proceedings of the 6th IEEE Symposium on Bioinformatics and Bioengineering (BIBE 06) (acceptance rate: 34%, 33/81)
  37. Yoo I., Hu X., Biomedical Ontology MeSH Improves Document Clustering Quality on MEDLINE Articles: A Comparison Study, accepted in the 19th IEEE International Symposium on Computer-Based Medical Systems, Salt Lake City, Utah, June 22-23, 2006
  38. Xu X., Zhu W., Hu X., Song I-Y., A Comparison of Local Analysis, Global Analysis and Ontology-based Query Expansion Strategies for Bio-medical Literature Search, accepted in 2006 IEEE International Conference on Systems, Man and Cybernetics (IEEE SMC 2006), Taiwan, ROC, Oct 18-21, 2006
  39. Wu C., Hu X., Wang X., Yang X., Pan Y., Knowledge Dependency Relationships in Incomplete Information System Based on Tolerance Relations, accepted in 2006 IEEE International Conference on Systems, Man and Cybernetics (IEEE SMC 2006), Taiwan, ROC, Oct 18-21, 2006
  40. Zhou X., Zhang X., Hu X., MaxMatcher: Biological Concept Extraction Using Approximate Dictionary Lookup, in the Proceedings of the 9th Biennial Pacific Rim International Conference on Artificial Intelligence (PRICAI 2006), short paper, pp. 1145-1149 (acceptance rate: 16.8%, 100/596)
  41. Yoo I., Hu X., Song I-Y., A Coherent Document Clustering and Text Summarization Approach through a Scale-free Ontology-enriched Graphical Representation, accepted in 8th International Conference on Data Warehouse and Knowledge Discovery (DaWaK 2006), Krakow, Poland, Sept. 4-8, 2006 (acceptance rate: 35%, 52/145)
  42. Zhong H., Hu X., Object Oriented Modeling of Protein Translation Systems, in the Proceedings of the 2006 IEEE International Conference on Granular Computing (IEEE GrC 2006), Atlanta, GA, May 15-17, 2006 (short paper), pp. 353-356 (acceptance rate: 31%, 101/321)
  43. Hu X., Zhang X., Wu D., Zhou X., Rumm P., Integration of Instance-based Learning & Text Mining for Identification of Potential Virus / Bacterium as Bio-terrorism Weapons, in the Proceedings of the 2006 IEEE Intelligence and Security Informatics Conference (short paper), pp. 548-553
  44. Wu D., Hu X., Mining and Analyzing the Topological Structure of Protein-Protein Interaction Networks, in the Proceedings of the 2006 ACM Symposium on Applied Computing (Bioinformatics Track), April 23-27, Dijon, Bourgogne, France, pp. 185-189 (acceptance rate: 32.4%, 300/927)