Information Geometry and Document Classification

Dr Guy Lebanon
Department of Statistics and School of Electrical and Computer Engineering,
Purdue University

Date: Friday, November 18, 2005

Host: R. Jin

Abstract: The task of classifying documents according to topic is traditionally based on extracting features and treating them as points in a Euclidean space, equipped with Euclidean geometry. We argue that this may be improved upon by examining a more appropriate geometry for text documents and adapting classification models to this geometry. By embedding documents in the multinomial simplex, we identify a canonical geometry for them: the Fisher geometry on the multinomial simplex. Adapting popular classification models such as radial basis support vector machines and logistic regression to the Fisher geometry yields impressive results in text classification. The application of information geometry to text classification results in an improvement over the state of the art in this field.

If time remains, I will discuss an extension of Cencov's theorem for spaces of conditional models and a novel geometric representation for documents that moves beyond the standard bag of words assumption.
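As a rough illustration of the ideas in the abstract, the sketch below embeds a document in the multinomial simplex and computes the Fisher geodesic distance between two documents, which has the closed form d(p, q) = 2 arccos(sum_i sqrt(p_i q_i)). The function names, the additive smoothing parameter `alpha`, and the bandwidth `sigma` are assumptions made for this example; the RBF-style kernel shown is one plausible way to plug the distance into a radial basis model and is not necessarily the construction presented in the talk.

```python
import math
from collections import Counter


def embed(tokens, vocab, alpha=1.0):
    """Map a token list to a point on the multinomial simplex.

    Uses smoothed relative term frequencies over a fixed vocabulary.
    The additive smoothing (alpha) is an assumption of this sketch; it
    keeps every coordinate strictly positive.
    """
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts[w] + alpha for w in vocab)
    return [(counts[w] + alpha) / total for w in vocab]


def fisher_distance(p, q):
    """Geodesic distance under the Fisher metric on the multinomial simplex."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    bc = min(1.0, max(0.0, bc))  # guard against floating-point drift
    return 2.0 * math.acos(bc)


def fisher_rbf(p, q, sigma=1.0):
    """RBF-style similarity using the Fisher distance in place of the
    Euclidean distance (hypothetical helper; sigma is an assumed bandwidth)."""
    d = fisher_distance(p, q)
    return math.exp(-(d * d) / (2.0 * sigma * sigma))


vocab = ["geometry", "kernel", "text", "simplex"]
doc_a = embed("text kernel text geometry".split(), vocab)
doc_b = embed("simplex geometry simplex".split(), vocab)
print(fisher_distance(doc_a, doc_b), fisher_rbf(doc_a, doc_b))
```

The distance is bounded above by pi, which is reached only when the two distributions have disjoint support; with smoothing, documents sharing no words are close to that bound but never reach it.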

Biography: Guy Lebanon is an Assistant Professor of Statistics and ECE at Purdue University. He received Bachelor's and Master's degrees from the Technion - Israel Institute of Technology, and a PhD from Carnegie Mellon University. The focus of his work at Carnegie Mellon University was the application of Riemannian geometry to machine learning and the analysis of ranked data. After earning his PhD, Guy completed a brief postdoctorate with his PhD advisor, John Lafferty. Dr. Lebanon's main research interest is machine learning as applied to large data sets. So far, his work has been applied mostly to text documents such as web pages and email documents. Dr. Lebanon is interested in pursuing other applications of his work in areas such as image analysis and bioinformatics.