An offshore oil rig equipped with tens of thousands of sensors can capture data on practically every facet of oil production and extraction, as well as rig maintenance. Yet less than one percent of that data is ultimately used, according to research from McKinsey.
Worse yet, this is not a fluke. For many organizations in the industrial realm, it is still difficult to use large-scale data for knowledge discovery: while data acquisition technology has advanced rapidly, techniques for organizing, classifying, and analyzing very large, heterogeneous datasets have evolved only modestly.
A technique known as data clustering can help, however. The method is simple in principle: Data objects are clustered into groups so that the objects in one class are similar yet distinct from objects in other categories.
Typically, well-known clustering algorithms sort data objects of a single type into distinct groups. But in the real world, data mining problems often involve pair-wise heterogeneous data: two related data types whose natural groupings depend on each other, so clustering each type in isolation misses the structure that links them.
Here’s an example: In a customer relationship management (CRM) application, you want to co-cluster customers and the items they purchase. The resulting co-clusters let you study which items interest a particular demographic, so you can customize product promotion campaigns for those customers.
Movie recommendation engines are another example of collaborative information filtering. These engines co-cluster accumulated movie ratings from viewers. When a new viewer submits a score for a film she liked, the engine assigns her rating to a cluster of audience ratings and recommends other movies from that cluster.
Co-clustering is also used in biomedical applications to classify patient symptoms and medical diagnoses. Computer-aided diagnosis translates a patient’s symptoms and supporting data into probabilities.
We say the two pair-wise data types go “hand-in-hand”: clustering one data type induces a clustering of the other, and vice versa. Hence, applying conventional clustering algorithms separately to each data type cannot produce meaningful co-clustering results.
Here’s the technical framework for how this works: Typically, the data is stored in a contingency or co-occurrence matrix C where rows and columns of the matrix represent the data types to be co-clustered. An entry Cij of the matrix signifies the relation between the data type represented by row i and column j. Co-clustering is the problem of deriving sub-matrices from the larger data matrix by simultaneously clustering rows and columns of the data matrix. Names such as bi-clustering, bi-dimensional clustering, and block clustering, among others, are often used in the literature to refer to the same problem formulation.
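To make the matrix framework concrete, here is a minimal sketch using a hypothetical customer-by-item purchase matrix (the data and cluster assignments are illustrative, not drawn from any real application). A co-cluster is simply the sub-matrix selected by a subset of rows together with a subset of columns:

```python
import numpy as np

# Toy co-occurrence matrix C: rows are customers, columns are items.
# (Hypothetical data: a 1 means the customer purchased the item.)
C = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])

# A co-cluster is a sub-matrix: a row subset paired with a column subset.
row_cluster = [0, 1]   # first two customers
col_cluster = [0, 1]   # first two items
sub = C[np.ix_(row_cluster, col_cluster)]
print(sub)
# The dense block of ones indicates these customers and items belong together.
```

The goal of a co-clustering algorithm is to find row and column subsets like these automatically, so that the selected sub-matrices are as dense and homogeneous as possible.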
One technique for achieving co-clustering is to approach the problem from a graph theoretic point of view. That is, we model the relationship between the two data types in the co-clustering problem using a weighted bipartite graph model. The two data types represent the two kinds of vertices in the bipartite graph. Data co-clustering is achieved by partitioning the bipartite graph.
In such a bipartite graph model, one set of vertices (say m square vertices) represents the first data type and the other set (r circular vertices) represents the second. Partitioning this bipartite graph yields a co-clustering of the two data types.
We would welcome any conversation on application development that provides stronger industrial insights. By combining subject-matter expertise, data collection methods, and next-generation data science tools, we can move rapidly into Industry 4.0.