Project
Title: CAREER: A Unified Architecture for Data Mining Large Biomedical Literature Databases
Sponsor: National Science Foundation (NSF), Award No. IIS 0448023
PI: Xiaohua Hu
Amount: $415,000
Duration: March 15, 2005 to Feb 28, 2010

Project Description
The large number of documents in biomedical literature databases and the
lack of formal structure in the natural-language narrative in those documents
make search and processing very difficult for many scientists involved in
bioinformatics research. This CAREER project is investigating the efficiency
and effectiveness of various data mining techniques and methods and developing a
unified framework for mining large biomedical literature databases. Currently
we are focusing on graph-based text mining techniques and methods and their
application to biomedical literature. The project is testing these methods
in real-world bioinformatics domains such as chromatin interaction networks and
microarray data analysis. The software package developed from this effort, the
Dragon Toolkit, is freely available for academic research use. Related
publications from this project are listed below.
The broad impact of this project on society is the generation of a novel unified
architecture for biomedical literature data mining. This integrated and
complementary approach in a unified architecture has the potential to create a
very powerful novel tool for bioinformatics and for most text processing tasks.
This project has the potential to attract diverse collaborators who have an
interest in accessing complex biomedical or general scientific data and
information. Students are involved in this research through hands-on projects,
a Co-Op program and courses at both the graduate and undergraduate level.
Researchers involved in the project
Xiaohua Hu (Faculty, PI)
Illhoi Yoo (former Ph.D. student, graduated in June 2006; now tenure-track faculty at the Univ. of Missouri-Columbia)
Xiaohua Zhou (Ph.D. student)
Xiaodan Zhang (Ph.D. student)
Xin Chen (Ph.D. student)
The Dragon Toolkit
The Dragon Toolkit is a
Java-based development package for academic research use in language modeling
(LM) and information retrieval (IR). Language modeling has recently emerged
as an attractive new framework for text information retrieval and text mining
(TM). However, most free Java-based search engines, such as Lucene, do not
support LM very well. The Lemur toolkit is designed for LM and IR but is
written in C and C++, which may be a hindrance to people who prefer Java
programming. The Dragon Toolkit is tailored for researchers who work on
large-scale LM and IR and prefer Java programming. Moreover, unlike Lucene
and Lemur, it provides built-in support for semantic-based IR and TM. The
Dragon Toolkit seamlessly integrates and implements a set of NLP tools, which
enable the toolkit to index text collections with various representation
schemes, including words, phrases, ontology-based concepts, and relationships.
To minimize learning time, however, we intentionally keep the package small
and simple. The toolkit therefore lacks some features, such as distributed IR
and cross-language IR, that are part of the Lemur toolkit.
How to Cite Dragon Toolkit
If you are using the Dragon Toolkit for research work, please cite it in your
published papers:
Zhou, X., Zhang, X., and Hu, X., Dragon Toolkit: Incorporating Auto-learned
Semantic Knowledge into Large-Scale Text Retrieval and Mining, in the
Proceedings of the 2007 IEEE International Conference on Tools with Artificial
Intelligence, pp 197-201. http://www.dragontoolkit.org
Download Dragon Toolkit
The Dragon Toolkit source code and binary libraries (including external
libraries) and the necessary supporting data are available for download.
Publications related to this project
- Tang Y.C., Zhang Y-Q., Huang Z., Hu X., and Zhao Y., Recursive Fuzzy Granulation for Gene Subsets Extraction and Cancer Classification, accepted for publication in the IEEE Transactions on Information Technology in Biomedicine
- Wang J., Hu X., Zhu D., A Comparison and Scenario Analysis of Leading Data Mining Software, International Journal of Knowledge Management, 2008, Vol. 4, No. 2, pp 17-34
- Zhang X., Hu X., Xia J., Zhou X., Achananuparp P., Utilization of Global Ranking Information in Graph-based Biomedical Literature Clustering, one of the 5 best papers in DaWaK 07; extended version accepted in the International Journal of Data Warehousing and Mining, 2008
- Zhang X., Jing L., Hu X., Ng M., Xia J., Zhou X., Medical Document Clustering Using Ontology Based Term Similarity Measures, accepted for publication in the International Journal of Data Warehousing and Mining, 2008
- Hu X., Zhang X., Yoo I., Zhou X., Mining Hidden Connections among Biomedical Concepts from Disjoint Biomedical Literature Sets through Semantic-based Association Rule, expected to be published in the Journal of Intelligent System
- An Y., Hu X., Song I-Y., Round-Trip Engineering for Maintaining Conceptual-Relational Mappings, accepted in the 20th Conference on Advanced Information Systems Engineering, June 16-20, 2008, Montpellier, France (acceptance rate: 13%, 271 submissions)
- Zhou X., Zhang X., Hu X., Semantic Smoothing for Bayesian Text Classification with Small Training Data, SIAM SDM 08
- Achananuparp P., Zhou X., Hu X., Zhang X., Semantic Representation in Text Classification Using Topic Signature Mapping, to appear in the Proceedings of the 2008 IEEE International Joint Conference on Neural Networks, June 1-6, Hong Kong
- Wang J., Hu X., Zhou, Applications of Data Mining in the Healthcare Industry, Encyclopedia of Healthcare Information Systems, Idea Group, 2008
- Hu X., Wu D., Data Mining and Predictive Modeling of Biomolecular Network from Biomedical Literature Databases, IEEE/ACM Transactions on Computational Biology and Bioinformatics (March-April 2007)
- Hu X., Sokhansanj B., Wu D., Tang Y., A Novel Approach for Mining and Dynamic Fuzzy Simulation of Biomolecular Network, in IEEE Transactions on Fuzzy Systems (2007)
- Zhou X., Hu X., Zhang X., Topic Signature Language Models for Ad-hoc Retrieval, in the IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE)
- Hu X., Wu F.X., Ng M., Sokhansanj B., Mining and Dynamic Simulation of Sub-Networks from Large Biomolecular Networks, in the 2007 International Conference on Artificial Intelligence, June 25-28, Las Vegas, USA (Best Paper Award, out of 500 submissions)
- Yoo I., Hu X., Song I-Y., A Coherent Document Clustering and Text Summarization Approach through a Scale-free Ontology-enriched Graphical Representation, to be published in a special issue of BMC Bioinformatics
- Yoo I., Hu X., Song I-Y., Biomedical Ontology Improves Biomedical Literature Clustering Performance: A Comparison Study, to be published in the special issue of the 19th IEEE CBMS 2006 in the International Journal of Bioinformatics Research and Application
- Song M., Song I-Y., Hu X., Allen B., Integration of Association Rules and Ontology for Semantic Query Expansion, to appear in the Journal of Data and Knowledge Engineering (DKE)
- Zhou X., Zhang X., Hu X., Semantic Smoothing of Document Models for Agglomerative Clustering, accepted in the Twentieth International Joint Conference on Artificial Intelligence (IJCAI 07), Hyderabad, India, Jan 6-12, 2007 (acceptance rate: 15.7%, 212/1353)
- Zhang X., Jing L., Hu X., Ng M., Zhou X., A Comprehensive Study of Ontology Based Term Similarity Measures on Document Clustering, accepted in the 12th International Conference on Database Systems for Advanced Applications (DASFAA 2007) (acceptance rate: 18.7%, 70/373)
- Hu X., Gene-Miner: Integration of Cluster Ensemble and Text Mining for Comprehensive Gene Expression Analysis, in the International Journal of Bioinformatics Research and Application, Vol. 2, No. 3, 2006, pp 325-338
- Zhou X., Hu X., Lin X., Zhang X., Relation-based Document Retrieval for Biomedical IR, in LNCS Computational Systems Biology, Vol. 4, pp 112-128
- Huang Z., Li Y., Hu X., Anti-parallel Coiled Coils Structure Prediction by Support Vector Machine Classification, in LNCS Transactions on Systems Biology, Vol. 4, pp 1-8
- Hu X., Lin T.Y., Song I-Y., Lin X., Yoo I., Song M., A Semi-supervised Efficient Learning Approach to Extract Biological Relationships from Web-based Biomedical Digital Library, in the International Journal of Web Intelligence and Agent System, Vol. 4, No. 3, 2006
- Hu X., Yoo I., Zhang X., Nanavati P., Das D., Wavelet Transformation and Cluster Ensemble for Gene Expression Analysis, in the International Journal of Bioinformatics Research and Application, Vol. 1, No. 4, 2006, pp 447-460
- Wu D., Hu X., Topological Analysis and Sub-Network Mining of Protein-Protein Interactions, in Advances in Data Warehousing and Mining, D. Taniar (Ed), Idea Group Publisher, Dec. 2006
- Hu X., Zhang X., Wu D., Zhou X., Rumm P., Text Mining the Biomedical Literature for Identification of Potential Virus / Bacterium as Bioterrorism Weapons, in Terrorism Informatics: Knowledge Management and Data Mining for Homeland Security, H. Chen, E. Reid, J. Sinai, A. Silke, B. Ganor (Eds), Springer, 2006
- Hu X., Zhang X., Yoo I., Zhou X., Wu D., A Comprehensive Comparison Study of 7 Methods of Mining Hidden Links from Biomedical Literatures, accepted for publication in Knowledge Discovery in Bioinformatics: Techniques, Methods and Applications, X. Hu & Y. Pan (Eds), Wiley & Sons, 2006
- Zhou X., Hu X., Zhang X., Lin X., Song I-Y., Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR, in the Proceedings of the 29th Annual International ACM SIGIR Conference on Research & Development on Information Retrieval (SIGIR 2006), pp 170-177 (acceptance rate: 18.5%, 74/399)
- Hu X., Zhang X., Yoo I., Zhang Y-Q., A Semantic Approach for Mining Hidden Links from Complementary and Non-Interactive Biomedical Literature, in the Proceedings of the 6th SIAM International Conference on Data Mining (SIAM SDM 06), April 20-22, 2006, Bethesda, MD, USA, pp 200-209 (acceptance rate: 16%, 40/244)
- Hu X., Zhang X., Zhou X., Integration of Cluster Ensemble and EM based Text Mining for Microarray Gene Cluster Identification and Annotation, in the Proceedings of the ACM 15th Conference on Information and Knowledge Management (ACM CIKM 2006), poster paper (537 submissions, 15% acceptance rate for full papers, 10% acceptance rate for poster papers)
- Zhang X., Zhou X., Hu X., Semantic Smoothing for Model-based Document Clustering, accepted in the 2006 IEEE International Conference on Data Mining (IEEE ICDM 06), Dec. 18-22, 2006, Hong Kong (800 submissions, acceptance rate: 20%)
- Hu X., Constructing Ensembles of Classifiers for Data Mining Applications based on Rough Set Theory and Set-Oriented Database Operations, in the Proceedings of the 2006 IEEE International Conference on Granular Computing (IEEE GrC 2006), Atlanta, GA, May 15-17, 2006, pp 67-73 (acceptance rate: 15%, 49/321)
- Yoo I., Hu X., Song I-Y., Integration of Semantic-based Bipartite Graph Representation and Mutual Refinement Strategy for Biomedical Literature Clustering, in the Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2006), short paper, pp 791-796 (acceptance rate for full papers: 11%, acceptance rate for short papers: 12%; 50 full papers and 55 short papers out of 457 submissions)
- Yoo I., Hu X., A Comprehensive Comparison Study of Document Clustering for A Biomedical Digital Library MEDLINE, in the Proceedings of the 2006 ACM/IEEE Joint Conference on Digital Library (JCDL 2006), June 11-15, 2006, Chapel Hill, NC, USA, pp 220-229 (acceptance rate: 15%, 28/188)
- Yoo I., Hu X., Clustering Ontology-enriched Graph Representation for Biomedical Documents based on Scale-Free Network Theory, accepted in the IEEE Conference on Intelligent Systems (IEEE IS06), Sept 4-6, 2006 (acceptance rate: 16.7%, 100/600)
- Zhou X., Zhang X., Hu X., Using Concept-based Indexing to Improve Language Modeling Approach to Genomic IR, in the Proceedings of the 28th European Conference on Information Retrieval (ECIR 2006), pp 444-455 (acceptance rate: 20%, 37/178)
- Zhu W., Xu X., Hu X., Song I-Y., Allen B., Using UMLS-based Re-Weighting Terms as a Query Expansion Strategy, in the Proceedings of the 2006 IEEE International Conference on Granular Computing (IEEE GrC 2006), Atlanta, GA, May 15-17, 2006, pp 217-222 (acceptance rate: 15%, 49/321)
- Zhou X., Hu X., Lin X., Han H., Zhang X., Relational-based Document Retrieval for Biomedical Literature Databases, in the Proceedings of the 11th International Conference on Database Systems for Advanced Applications (DASFAA 2006), pp 689-701 (acceptance rate: 25%, 47/188)
- Yoo I., Hu X., Clustering Large Collection of Biomedical Literature based on Ontology-enriched Bipartite Graph Representation and Mutual Refinement Strategy, in the Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2006), pp 303-312 (acceptance rate: 20%, 100/500)
- Wu C., Hu X., Yang X., Yang J., Expanding Tolerance RST Models based on Cores of Maximal Compatible Blocks, accepted as a full paper in the 5th International Conference on Rough Sets and Current Trends in Computing (RSCTC 2006) (acceptance rate: 27.4%, 91/332)
- Zhang X., Wu D., Zhou X., Hu X., A Language Modeling Text Mining Approach to the Annotation of Protein Community, accepted in the Proceedings of the 6th IEEE Symposium on Bioinformatics and Bioengineering (BIBE 06) (acceptance rate: 34%, 33/81)
- Yoo I., Hu X., Biomedical Ontology MeSH Improves Document Clustering Quality on MEDLINE Articles: A Comparison Study, accepted in the 19th IEEE International Symposium on Computer-Based Medical Systems, Salt Lake City, Utah, June 22-23, 2006
- Xu X., Zhu W., Hu X., Song I-Y., A Comparison of Local Analysis, Global Analysis and Ontology-based Query Expansion Strategies for Bio-medical Literature Search, accepted in the 2006 IEEE International Conference on Systems, Man and Cybernetics (IEEE SMC 2006), Taiwan, ROC, Oct 18-21, 2006
- Wu C., Hu X., Wang X., Yang X., Pan Y., Knowledge Dependency Relationships in Incomplete Information System Based on Tolerance Relations, accepted in the 2006 IEEE International Conference on Systems, Man and Cybernetics (IEEE SMC 2006), Taiwan, ROC, Oct 18-21, 2006
- Zhou X., Zhang X., Hu X., MaxMatcher: Biological Concept Extraction Using Approximate Dictionary Lookup, in the Proceedings of the 9th Biennial Pacific Rim International Conference on Artificial Intelligence (PRICAI 2006), short paper, pp 1145-1149 (acceptance rate: 16.8%, 100/596)
- Yoo I., Hu X., Song I-Y., A Coherent Document Clustering and Text Summarization Approach through a Scale-free Ontology-enriched Graphical Representation, accepted in the 8th International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2006), Krakow, Poland, Sept. 4-8, 2006 (acceptance rate: 35%, 52/145)
- Zhong H., Hu X., Object Oriented Modeling of Protein Translation Systems, in the Proceedings of the 2006 IEEE International Conference on Granular Computing (IEEE GrC 2006), Atlanta, GA, May 15-17, 2006 (short paper), pp 353-356 (acceptance rate: 31%, 101/321)
- Hu X., Zhang X., Wu D., Zhou X., Rumm P., Integration of Instance-based learning & Text Mining for Identification of Potential Virus / Bacterium as Bio-terrorism Weapons, in the Proceedings of the 2006 IEEE Intelligence and Security Informatics Conference (short paper), pp 548-553
- Wu D., Hu X., Mining and Analyzing the Topological Structure of Protein-Protein Interaction Networks, in the Proceedings of the 2006 ACM Symposium on Applied Computing (Bioinformatics Track), April 23-27, Dijon, Bourgogne, France, pp 185-189 (acceptance rate: 32.4%, 300/927)