too close for comfort.

[Figure: screen capture of intrinsic dimensionality experiment 1]

One research project that I worked on at the VLM lab involved determining the number of clusters in a data set from the information provided in nearest-neighbor distance matrices. A possible application of this work would be to determine how many unique people appear in a set of pictures, where each person is treated as a clustering class. The idea is novel in that it could determine the appropriate number of clusters regardless of the distance measure.

Our approach hinged on the assumption that the objects described in the distance matrices were, for the most part, distributed uniformly in some high-dimensional Euclidean space. Using some basic geometric operations, we could then determine the intrinsic dimensionality of a data set and, in turn, use different bounding parameters to decide whether certain objects were simply too close, probabilistically speaking, to be part of the general uniform distribution of the rest of the space. Such objects must therefore belong to the same clustering class.
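To make the "too close, probabilistically speaking" idea concrete, here is a minimal sketch of one way such a test could look. It is not the model we actually used. It assumes that the intrinsic dimension `dim` and the point density `density` (points per unit volume) are already known, and it approximates the uniform background as a Poisson process, so the chance that a point has a neighbor within radius r is 1 - exp(-density * V_dim * r^dim), where V_dim is the volume of the unit ball. The function names and the significance level `alpha` are hypothetical choices made for illustration.

```python
import math


def unit_ball_volume(dim):
    """Volume of the unit ball in `dim`-dimensional Euclidean space."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)


def too_close_threshold(density, dim, alpha):
    """Radius r at which a point in a uniform background has probability
    `alpha` of having at least one neighbor within r.

    Assumption: Poisson approximation, edge effects ignored.
    """
    expected_at_unit_radius = density * unit_ball_volume(dim)
    return (-math.log(1.0 - alpha) / expected_at_unit_radius) ** (1.0 / dim)


def too_close_pairs(dist, density, dim, alpha=0.01):
    """Return index pairs (i, j), i < j, whose distance falls below the
    threshold, i.e. pairs that are unlikely under the uniform assumption.

    `dist` is a symmetric distance matrix given as a list of lists.
    """
    r = too_close_threshold(density, dim, alpha)
    n = len(dist)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if dist[i][j] < r]
```

Pairs flagged this way would be candidates for sharing a clustering class. Note how strongly the threshold depends on `dim`, which is why the dimensionality estimate matters so much.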

I approached this problem, under the guidance of my advisor, by first coming up with a theoretical model and then running experiments on mock data to check the validity of our assumptions. The initial results were encouraging, but we ran into some unexpected issues with objects distributed in higher-dimensional spaces.

The first non-trivial task of this project, and ultimately the current stumbling block, is figuring out how to determine the intrinsic dimensionality of a data set. At first I experimented with models based on hyper-dimensional spheres, or rather on a set of concentric hyper-dimensional spheres that could be viewed as cores, and came up with some intuitive, rudimentary relationships. The problem with this approach was that our hyper-dimensional spheres would often spread beyond the actual limits of the data set and would therefore skew the results. My advisor and I quickly realized that this effect would become more pronounced as we moved into higher-dimensional spaces.
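The kind of relationship concentric spheres suggest can be sketched as follows. For uniformly distributed points, the number of neighbors inside a sphere of radius r grows like r^d, so comparing counts at two radii gives d ≈ log(N(r2) / N(r1)) / log(r2 / r1). This is a standard neighbor-count scaling estimate in the spirit of the correlation dimension, shown here as an illustration rather than as the exact model from the project. The function names and the mock-data helper are hypothetical.

```python
import math
import random


def count_pairs_within(dist, radius):
    """Count pairs (i, j), i < j, whose distance is at most `radius`."""
    n = len(dist)
    return sum(1 for i in range(n) for j in range(i + 1, n) if dist[i][j] <= radius)


def estimate_dimension(dist, r_small, r_large):
    """Estimate intrinsic dimension from how neighbor counts grow between
    two concentric radii: counts scale like r**d under uniformity."""
    c_small = count_pairs_within(dist, r_small)
    c_large = count_pairs_within(dist, r_large)
    if c_small == 0 or c_large == 0:
        raise ValueError("radii too small: no pairs fall inside the spheres")
    return math.log(c_large / c_small) / math.log(r_large / r_small)


def uniform_distance_matrix(n, dim, seed=0):
    """Mock data: distance matrix of n points drawn uniformly from the
    unit cube in `dim` dimensions."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    return [[math.dist(p, q) for q in pts] for p in pts]


if __name__ == "__main__":
    dist = uniform_distance_matrix(500, 2, seed=1)
    print(estimate_dimension(dist, 0.05, 0.1))
```

On mock data from the unit square, one would expect the estimate to fall a little short of 2 on average, because spheres centered near the edge extend past the data and collect fewer neighbors than the r^d law predicts. In higher dimensions a larger share of the points sits near the boundary, so the shortfall grows, which matches the skew described above.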

So we thought of different ways to solve for intrinsic dimensionality, the most promising being based on linear projections and embeddings. But after a week or so, my advisor decided to focus on another, arguably more interesting, project. I still feel that a solution to this problem could be found, and I would like to revisit it some day when I have a better background in the field.

