Data Cleaning and Information Quality
This web site collects information on data quality issues such as data cleaning, record
matching, and data reconciliation. The list of items below is by no means a complete
enumeration of the work accomplished in this area, but we are doing our best to enrich
the collection. Our main goal in maintaining this site is to report our experience from
various projects closely related to data management and data integrity.
Our research approach to data cleaning focuses on using machine learning and statistical
techniques to build models automatically from training data. Models derived in this way
can be applied to clean enormous amounts of data efficiently and effectively, with very
high precision and recall. We are building a data cleaning tool that produces different
data cleaning models on the fly and evaluates them using a public-domain database
generator.
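As a rough illustration of the idea of deriving a matching model from training data and
scoring it by precision and recall, here is a minimal sketch. It is not the tool described
above: the "model" is only a single threshold on an average string similarity, and the
sample records and all names (compare, learn_threshold, precision_recall, TRAIN_PAIRS,
TRAIN_LABELS) are hypothetical, chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare(rec_a, rec_b):
    """Average field-by-field similarity of two records (tuples of strings)."""
    return sum(similarity(x, y) for x, y in zip(rec_a, rec_b)) / len(rec_a)


def learn_threshold(pairs, labels):
    """Pick the score threshold that classifies the most training pairs correctly."""
    scores = [compare(a, b) for a, b in pairs]
    best_t, best_correct = 0.0, -1
    for t in sorted(set(scores)):
        correct = sum((s >= t) == y for s, y in zip(scores, labels))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


def precision_recall(predicted, actual):
    """Precision and recall of predicted match decisions against true labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labeled record pairs: (name, address); True means "same entity".
TRAIN_PAIRS = [
    (("John Smith", "12 Main St"), ("Jon Smith", "12 Main Street")),
    (("Ann Lee", "7 Pine Rd"), ("Anne Lee", "7 Pine Road")),
    (("Mary Jones", "4 Oak Ave"), ("Peter Brown", "99 Elm Rd")),
    (("Ann Lee", "7 Pine Rd"), ("John Smith", "12 Main St")),
]
TRAIN_LABELS = [True, True, False, False]

threshold = learn_threshold(TRAIN_PAIRS, TRAIN_LABELS)
predictions = [compare(a, b) >= threshold for a, b in TRAIN_PAIRS]
print(precision_recall(predictions, TRAIN_LABELS))
```

In practice the model would be evaluated on pairs held out from training, for example
records produced by a database generator with known duplicates, rather than on the
training pairs themselves as in this sketch.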
Related Links
- Data and Information Quality
- Data Cleaning and Dirty Data
- Record Linkage
Publications (A collection of papers)
- Ivan P. Fellegi, Alan B. Sunter. A Theory for Record Linkage, Journal of the American
Statistical Association, Vol. 64, No. 328. (Dec., 1969), pp. 1183-1210.
- William E. Winkler, The State of Record Linkage and Current Research Problems, U.S.
Bureau of the Census, Technical Report.
- Francesco Caruso, Munir Cochinwala, Uma Ganapathy, Gail Lalk, Paolo Missier: Telcordia's Database
Reconciliation and Data Quality Analysis Tool. VLDB 2000: 615-618
- Verykios, V.S., Elmagarmid, A.K., and Houstis, E.N. Automating
the Approximate Record Matching Process. Journal of Information Sciences, vol. 126,
issue 1-4, pp. 83-98, July 2000.
- Verykios, V.S., Elmagarmid, A.K., Elfeky, M., Cochinwala, M., and Dalal, S., On the
Completeness and Accuracy of the Record Matching Process. In Proceedings of the MIT
Conference on Information Quality, pp. 54-69, October 2000, Boston, MA.
- Missier, P., Lalk, G., Verykios, V.S., Grillo, F., Lorusso, T., and Angeletti, P.
Improving Data Quality in Practice: A Case Study in the Italian Public Administration.
Submitted to Distributed and Parallel Databases Journal.
- Mauricio A. Hernández, Salvatore J. Stolfo. Real-world Data is Dirty: Data Cleansing and
The Merge/Purge Problem, Data Mining and Knowledge Discovery Journal, pp. 9-37, vol 2,
issue 1, January 1998.
- Ralph Kimball, Dealing with Dirty Data, DBMS Online, September 1996.
People
- Munir Cochinwala, Telcordia Technologies Inc.
- Sid Dalal, Telcordia Technologies Inc.
- Mohamed G. Elfeky, Computer Sciences Department, Purdue University
- Ahmed K. Elmagarmid, Computer Sciences Department, Purdue University
- Paolo Missier, Telcordia Technologies Inc.
- Vassilios S. Verykios, College of Information Science and Technology, Drexel University