Project: Large-scale Kernel-based Clustering  (N00014-11-1-0100)


PI: Anil K. Jain, Department of CSE, Michigan State University

Co-PI: Rong Jin, Department of CSE, Michigan State University



We are continuing to encounter an explosive growth in data, which poses a great challenge called information overload. The average amount of information produced by each man, woman and child on earth, is estimated to be 800 megabytes in 2002, and is expected to grow by 30% each year. Given this massive amount of information, one of the key challenges is how to organize data for efficient access. Data clustering is an exploratory data analysis tool, necessary for handling information overload. It has found applications across a large number of domains, including biology, physics, chemistry, business, geography, medicine, Internet, etc. In this project, we propose to develop efficient algorithms for large-scale kernel-based data clustering. Unlike most of the studies on large-scale data clustering that focus on a Euclidean distance, kernel-based clustering assumes that the geometric relationship between data points is measured by a nonlinear distance function. Nonlinear distance function is critical to many real-world problems. For instance, in gene data analysis, the distance between two DNA sequences is measured by a string kernel that computes the editing distance between two sequences; in natural language processing, the relationship between two sentences is measured by a tree kernel that measures the syntactic difference between sentences; in computer vision, the difference in the visual content between two images is
measured by a pyramid kernel that compares the ``interesting'' patches of the two images.



  1. Radhac Chittara
  2. Qi Qian


Project Goal:


There are two major challenges in large-scale kernel-based data clustering: (i) it is computationally expensive to measure the kernel distance between data points and cluster centers, and (ii) it is critical to identify the appropriate kernel distance function for a given data set. We propose to address the first challenge by three strategies: (1) the embedding approaches that approximate a kernel distance function by a Mahalanobis distance function, (2) divide and conquer approaches that simply computation by dividing a large data set into a number of small subsets, and (3) efficient nearest neighbor search by a random projection method. We propose to address the second challenge by multiple kernel learning and learning a Bregman distance function. The proposed algorithms will be evaluated on visual object categorization and document clustering.



  1. J. Yi, L. Zhang, R. Jin, Q. Qian, and A. Jain, Semi-supervised Clustering by Input Pattern Assisted Pairwise Similarity Matrix Completion, Proceedings of the 30th International Conference on Machine Learning (ICML), 2013

  2. J. Yi, T. Yang, R. Jin, A. K. Jain, and M. Mahdavi, Robust Ensemble Clustering by Matrix Completion,
    Proceedings of International Conference on Data Mining (ICDM), 2012

  3. R. Chitta, R. Jin and A. K. Jain, Efficient Kernel Clustering Using Random Fourier Features, Proceedings of the 12th IEEE International Conference on Data Mining (ICDM), 2012

  4. J. Yi, R. Jin, A. K. Jain, and S. Jain, Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach, Proceedings of the 4th Human Computation Workshop in junction with AAAI 2012, 2012

  5. J. Yi, R. Jin, A. Jain, and S. Jain, Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning, Advance in Neural Information Processing Systems (NIPS), 2012

  6. R. Chitta, R. Jin, T. C. Havens, A. K. Jain, Approximate kernel k-means: solution to large scale kernel clustering. Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2011), 2011