p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px ‘Helvetica Neue’; color: #000000; -webkit-text-stroke: #000000; min-height: 12.0px} p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; text-align: center; font: 28.0px ‘Times New Roman’; color: #000000; -webkit-text-stroke: #000000} p.p3 {margin: 0.0px 0.0px 0.0px 0.0px; text-align: center; font: 26.0px ‘Times New Roman’; color: #000000; -webkit-text-stroke: #000000; min-height: 30.0px} p.p4 {margin: 0.0px 0.0px 0.0px 0.0px; text-align: center; font: 27.0px ‘Times New Roman’; color: #000000; -webkit-text-stroke: #000000} p.p5 {margin: 0.0px 0.0px 0.0px 0.0px; text-align: center; font: 26.0px ‘Times New Roman’; color: #000000; -webkit-text-stroke: #000000} p.p6 {margin: 0.0px 0.0px 0.0px 0.0px; text-align: center; font: 11.0px ‘Helvetica Neue’; color: #000000; -webkit-text-stroke: #000000; min-height: 12.0px} p.p7 {margin: 0.0px 0.0px 0.0px 0.0px; font: 16.0px ‘Helvetica Neue’; color: #000000; -webkit-text-stroke: #000000} p.p9 {margin: 0.0px 0.0px 0.0px 0.0px; font: 13.0px ‘Times New Roman’; color: #000000; -webkit-text-stroke: #000000} p.p10 {margin: 0.0px 0.0px 0.0px 0.0px; font: 14.0px ‘Times New Roman’; color: #000000; -webkit-text-stroke: #000000; min-height: 16.0px} p.p11 {margin: 0.0px 0.0px 0.0px 0.0px; font: 13.0px ‘Times New Roman’; color: #020202; -webkit-text-stroke: #020202} p.p12 {margin: 0.0px 0.0px 0.0px 0.0px; font: 10.0px ‘Times New Roman’; color: #020202; -webkit-text-stroke: #020202; min-height: 11.0px} p.p13 {margin: 0.0px 0.0px 0.0px 0.0px; font: 15.0px ‘Times New Roman’; color: #020202; -webkit-text-stroke: #020202} p.p14 {margin: 0.0px 0.0px 5.0px 36.0px; text-indent: -36.0px; line-height: 24.0px; font: 13.0px ‘Times New Roman’; color: #020202; -webkit-text-stroke: #020202} p.p15 {margin: 0.0px 0.0px 5.0px 36.0px; text-indent: -36.0px; line-height: 16.0px; font: 13.0px ‘Times New Roman’; color: #020202; -webkit-text-stroke: #020202} p.p16 {margin: 0.0px 0.0px 5.0px 36.0px; text-indent: -36.0px; line-height: 1.0px; font: 13.0px ‘Times New Roman’; color: #020202; -webkit-text-stroke: #020202} p.p18 {margin: 0.0px 0.0px 5.0px 36.0px; text-indent: -36.0px; line-height: 20.0px; font: 13.0px ‘Times New Roman’; color: #020202; -webkit-text-stroke: #020202} p.p19 {margin: 0.0px 0.0px 5.0px 36.0px; text-indent: -36.0px; line-height: 20.0px; font: 13.0px ‘Times New Roman’; color: #000000; -webkit-text-stroke: #000000} p.p20 {margin: 0.0px 0.0px 5.0px 36.0px; text-align: center; text-indent: -36.0px; line-height: 20.0px; font: 10.0px ‘Times New Roman’; color: #000000; -webkit-text-stroke: #000000} p.p21 {margin: 0.0px 0.0px 5.0px 36.0px; text-indent: -36.0px; line-height: 20.0px; font: 13.0px ‘Times New Roman’; color: #000000; -webkit-text-stroke: #000000; min-height: 16.0px} p.p22 {margin: 0.0px 0.0px 0.0px 0.0px; font: 28.0px ‘Times New Roman’; color: #0000ff; -webkit-text-stroke: #0000ff; min-height: 32.0px} p.p23 {margin: 0.0px 0.0px 0.0px 0.0px; text-align: center; font: 10.0px ‘Times New Roman’; color: #000008; -webkit-text-stroke: #000008; min-height: 11.0px} p.p24 {margin: 0.0px 0.0px 0.0px 0.0px; text-align: center; font: 10.0px ‘Times New Roman’; color: #000000; -webkit-text-stroke: #000000} p.p25 {margin: 0.0px 0.0px 0.0px 0.0px; text-align: center; font: 10.0px ‘Times New Roman’; color: #000000; -webkit-text-stroke: #000000; min-height: 11.0px} p.p26 {margin: 0.0px 0.0px 0.0px 0.0px; text-align: center; font: 14.0px ‘Times New Roman’; color: #000000; -webkit-text-stroke: #000000; min-height: 16.0px} p.p27 {margin: 0.0px 0.0px 0.0px 0.0px; font: 13.0px ‘Times New Roman’; color: #000000; -webkit-text-stroke: #000000; min-height: 16.0px} p.p28 {margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 13.0px Times; color: #000000; -webkit-text-stroke: #000000; min-height: 16.0px} p.p29 {margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 13.0px Times; color: #000000; -webkit-text-stroke: #000000} p.p30 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Times; color: #000000; -webkit-text-stroke: #000000; min-height: 14.0px} p.p31 {margin: 0.0px 0.0px 0.0px 0.0px; line-height: 26.0px; font: 13.0px Times; color: #000000; -webkit-text-stroke: #000000} p.p32 {margin: 0.0px 0.0px 12.0px 0.0px; line-height: 15.0px; font: 12.0px Times; color: #000000; -webkit-text-stroke: #000000} p.p33 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Times; color: #000000; -webkit-text-stroke: #000000} p.p34 {margin: 0.0px 0.0px 1.4px 36.0px; text-indent: -36.0px; line-height: 16.0px; font: 14.0px Helvetica; color: #222222; -webkit-text-stroke: #222222; background-color: #ffffff; min-height: 17.0px} p.p35 {margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 13.0px ‘Times New Roman’; color: #000000; -webkit-text-stroke: #000000} p.p36 {margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 13.0px ‘Times New Roman’; color: #000000; -webkit-text-stroke: #000000; min-height: 16.0px} p.p37 {margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Times; color: #000000; -webkit-text-stroke: #000000; min-height: 14.0px} p.p38 {margin: 0.0px 0.0px 12.0px 0.0px; line-height: 20.0px; font: 12.0px Times; color: #000000; -webkit-text-stroke: #000000} p.p39 {margin: 0.0px 0.0px 12.0px 0.0px; line-height: 21.0px; font: 13.0px Times; color: #000000; -webkit-text-stroke: #000000} p.p40 {margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Times; color: #000000; -webkit-text-stroke: #000000} p.p41 {margin: 0.0px 0.0px 12.0px 0.0px; line-height: 17.0px; font: 13.0px Helvetica; color: #000000; -webkit-text-stroke: #000000; min-height: 16.0px} li.li8 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px ‘Helvetica Neue’; color: #000000; -webkit-text-stroke: #000000} li.li9 {margin: 0.0px 0.0px 0.0px 0.0px; font: 13.0px ‘Times New Roman’; color: #000000; -webkit-text-stroke: #000000} li.li17 {margin: 0.0px 0.0px 5.0px 0.0px; line-height: 1.0px; font: 13.0px ‘Times New Roman’; color: #020202; -webkit-text-stroke: #020202} span.s1 {font-kerning: none} span.s2 {font: 15.0px ‘Times New Roman’; font-kerning: none} span.s3 {font: 14.0px ‘Times New Roman’; font-kerning: none} span.s4 {font: 15.0px ‘Times New Roman’; font-kerning: none; color: #000000; -webkit-text-stroke: 0px #000000} span.s5 {font-kerning: none; color: #000000; -webkit-text-stroke: 0px #000000} span.s6 {font: 10.0px ‘Times New Roman’; font-kerning: none} span.s7 {font: 13.0px ‘Times New Roman’; font-kerning: none} span.s8 {text-decoration: underline ; font-kerning: none} span.s9 {color: #000000} span.s10 {text-decoration: underline ; font-kerning: none; color: #020202; -webkit-text-stroke: 0px #020202} span.s11 {font-kerning: none; color: #020202; -webkit-text-stroke: 0px #020202} span.s12 {font: 14.0px ‘Times New Roman’; text-decoration: underline ; font-kerning: none} span.s13 {font: 11.0px ‘Helvetica Neue’; font-kerning: none} span.s14 {-webkit-text-stroke: 0px #222222} span.s15 {font: 11.0px ‘Times New Roman’} span.s16 {font: 11.0px ‘Times New Roman’; font-kerning: none} span.s17 {font: 12.0px Times; font-kerning: none} span.s18 {font: 13.0px Times; font-kerning: none} span.Apple-tab-span {white-space:pre} ol.ol1 {list-style-type: decimal} ol.ol2 {list-style-type: lower-alpha} ol.ol3 {list-style-type: upper-alpha} ul.ul1 {list-style-type: disc}

 

CP 5520  Advanced Databases and Applications

Research Report 
 on 
Clustering on Data Mining  

By 

M.R.Srinath (JC469124)

 

Table of Contents 

Abstract 
Data Mining 
Clustering 
Methods of Clustering on Data Mining

Partitioning Method
Hierarchal Method
Density Based Method
Grid – Based Method 
Model – Based Method
Constraint – Based Method

Methods of Clustering on Data Mining

Abstract:  The world is deluged with various kinds of data-scientific data, environmental data, financial data and mathematical data. Manually analysing, classifying, and summarising the data is impossible because of the incredible increase in data in this age of net work and information sharing. This research investigates the fundamentals of clustering on data mining and various techniques by which clustering of the known and unknown data is done, also includes the difference of clustering in data mining to the clustering in other fields. 1  

Data Mining:  Data mining is a simple computational process of identifying the unknown patterns with some accuracy which leads to effective actions and ultimately understandable patterns in large datasets. In simple terms data mining means getting important information from very huge amount of data.Many other terms are being used to interpret data mining, such as knowledge mining from databases, knowledge extraction, data analysis, and data archaeology. 1  

Clustering:  Clustering is a process of splitting the large data set into small classes which have similar characteristics. Clustering has its advantages when the data set is defined and a general pattern needs to be determined from the data. You can create a specific number of groups, depending on your business needs. One defining benefit of clustering over classification is that every attribute in the data set will be used to analyse the data.
It is a unsupervised classification that means it has no predefined classes .  
Methods of Clustering on Data Mining:
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method: It means dividing the data into ‘K’ partitions, when given 
database of ‘n’ objects, the partitioning method constructs ‘k’ groups of data and in addition
to this it should follow some requirements : 
a. Each group should at least have one object. 
   b. Each object must be represented by only one group.
Hierarchical Method  :  They are two types of Hierarchical Methods of clustering. 

Agglomerative Approach 
Divisive Approach   

Agglomerative Approach: It is termed as a bottom – up approach,  It keeps on merging the   objects or groups that are close to one another. Iteratively clusters are merged together until they hold a termination condition. 
Divisive Approach: It is termed as a  Top- down approach, Initially all items in one cluster             but later Large clusters are successively divided. 
Fig 1: Different Approches 

Density-Based Method:  This method is based on the notion of density. The basic idea is to          continue growing the given cluster as long as the density in          the neighbourhood exceeds some threshold, i.e., for each data          point within a given cluster. The aim of these methods is to          identify the clusters and their distribution parameters. re
Grid-Based Method: These methods partition the space into a finite number of cells that           form a grid structure on which all of the operations for       clustering are performed. The main advantage of the approach is its       fast processing time.  

Fig 2: Density and Grid google images

Model-Based Method:  In this method, a model is hypothesised for each cluster to find           the best fit of data for a given model that means it locates the           clusters by clustering the density function.It also serve a way of           automatically determining number of clusters based on standard           statistics , taking outlier or noise into account and therefore           termed as a robust clustering methods.

Constraint-Based Method: The clustering is performed by the incorporation of user or               application-oriented constraints. Constraints provide us with    an interactive way of communication with the clustering    process.Constraints can be specified by the user or the    application requirement.

Clustering Large Data Sets 
  
 This is termed because we are having very large data sets generally in the practical world which would contain a minimum of 50 instances of data should be clustered. They are various techniques or approaches by the data can be abstracted but there is no such approach which can deal or handle this large data sets (For Example a data set with millions of instances). Approaches based on genetic algorithms, tabu search and simulated annealing are optimisation techniques and are restricted to reasonably small data sets.
For this kind of scenarios K means algorithm and its ANN equivalent have been used to cluster large data sets. 

K means Algorithm (Clustering) : K-means clustering is a type of unsupervised learning, which is used when you have un labeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on  feature similarity. 

However, the K-means algorithm is sensitive to initial seed selection and even in the best case, it can produce only hyper-spherical clusters. 

Algorithm 

Basic K-Means Algorithm:1. Select data points as the initial centroids. 
2. (Re)Assign all points to their closest centroids.3. Recompute the centroid of each newly assembled cluster. 
4. Repeat steps 2 and 3 until the centroids do not change. 
 
Advantages of K means Algorithm

It is easy to implement and works with any of the standard norms. 
It allows straightforward parallelisation.
It is insensitive with respect to data ordering 

DrawBack of K means Algorithm

The results strongly depend on the initial guess of centroids 
A local optimum (computed for a cluster) does not need to be a global optimum (overall clustering of a data set). 
It is not obvious what a good number is in each case.
The process is sensitive with respect to outliers
Resulting clusters can be unbalanced. 

Unlike the K means algorithm we have other algorithms too which can handle large data they are: 

K – medoids Algorithms (Extension of K means Algorithm)
Probabilistic Algorithms
Density-Based Algorithms

Density – Based Algorithms (DBSCAN) 
It is a clustering algorithm, given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbours ), marking as outliers points that lie alone in low-density regions (whose nearest neighbours are too far away). DBSCAN (Density based spatial clustering with Noise) is one of the most common clustering algorithms and also most cited in scientific literature. 
 

Advantages 

DBSCAN has a notion of noise, and is robust to outliners.
DBSCAN does not require one to specify the number of clusters in the data a priori, as opposed to k means.
DBSCAN is designed for use with databases that can accelerate region queries.  

Disadvantages

DBSCAN is not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster, depending on the order the data are processed.
DBSCAN cannot cluster data sets well with large differences in densities, since the minPts-? combination cannot then be chosen appropriately for all clusters.
The quality of DBSCAN depends on the distance measure used in the function regionQuery. The most common distance metric used is Euclidean distance. 

K- Medoids Algorithm
 It is a clustering algorithm which is a extension of K- means algorithm and related to medoid shift algorithm. K-medoids is also a partitioning technique of clustering that clusters the data set of n objects into k clusters with k known a priori.It could be more robust to noise and outliers as compared to k means because it minimises a sum of general pairwise dissimilarities instead of a sum of squared Euclidean distances.
K medoid clustering is Partitioning around medoids algorithm which is done some in some steps

Initialise: randomly select k of the n data points as the medoids
Associate each data point to the closest medoid.
For each medoid m and each data point o associated to m swap m and o and compute the total cost of the configuration
Select the medoid o with the lowest cost of the configuration.
Repeat alternating steps 2 and 3 until there is no change in the assignments 

Probabilistic Algorithms

In analysis of the algorithm, probabilistic analysis of algorithms is an approach to estimate the computational complexity of the algorithm or a computational problem. It starts from an assumption about a probabilistic distribution of the set of all possible inputs. This assumption is then used to design an efficient algorithm or to derive the complexity of a known algorithm.

Clustering High Dimensional Data
It is a cluster analysis of data can be any thing starting from hundred to few thousands of dimensions of data.This type of high-dimensional spaces of data are often found in areas such as medicine, where DNA microarray technology can produce a large number of measurements at once, and the clustering of text documents and can be also found in text mining. 

They are some problems while you do clustering in high dimensional of data.

Curse of Dimensionality : Suppose we have multiple dimensions of data , it will become hard to understand , we cannot visualise and, due to the exponential growth of the number of possible values with each dimension, complete enumeration of all subspaces becomes intractable with increasing dimensionality.
We have a large number of attributes in our data , they is a chance of our attributes getting correlated which results the clusters in arbitrarily oriented subspaces. So in order to overcome this we need to follow Subspace Clustering Approach. 
Cluster analysis is intended to group objects that have a similar, based on observations of their attribute’s values. but if we have large number of attributes some of the attributes will usually not be meaningful for a given cluster this means there is a chance of different clusters can be found in different subspaces.
The concept of distance becomes less precise as the number of dimensions grows, since the distance between any two points in a given dataset converges.

Evolutionary Algorithms
Among the set of search and optimisation techniques, the development of Evolutionary Algorithms (EA) has been very important in the last few years.
Most of the present implementations of Evolutionary algorithms come from any of these three basic types. 

Genetic Algorithms (GA), 
Evolutionary Programming (EP) 
Evolutionary Strategies (ES).

Genetic Algorithm

A “population” of a set of “k-means” systems is given. 
The population is improved through mutation and crossover. 
Unlike in the standard -means, clusters can have different size and elongation.

 
Clustering In Text Mining  
Text Mining:  Text Mining or knowledge discovery from text refers to the process of extracting high quality of information from text. It widely covers a large set of related topics and algorithms for analysing text, spanning various communities, including information retrieval, natural language processing, data mining, machine learning many application domains web and biomedical sciences.
Clustering: Text clustering can be in initialised in different methods ,where clusters can be documents, paragraphs, sentences or terms. Clustering is one of the main techniques used for organising documents to enhance retrieval and support browsing . It exploits clustering to construct a context-based retrieval systems. There are many clustering algorithms that can be used in the context of text data. Text document can be represented as a binary vector, i.e. considering the presence or absence of word in the document. 
Text clustering algorithms can be done into many different types such as agglomerative clustering algorithms, partitioning algorithms and etc. Just to check how much they have varied trade offs in terms of effectiveness and efficiency.
Hierarchical Clustering algorithm in Text Mining
As hierarchical clustering termed as one of the distance based algorithms which uses the same function an it involves different approaches of clustering the text data. They approaches are Top-down approach , bottom – up approach.
 In the top-down approach we begin with one cluster which includes all the documents and later we split it into further sub clusters . whereas in bottom -up approach each document is initially considered as an individual cluster. Then successively the most similar clusters are merged together until all documents are embraced in one cluster. There are three different merging methods for these approaches: 
 1) Single Linkage Clustering: In this technique, the similarity between two groups of documents is the highest similarity between any pair of documents from these groups. 
2) Group-Average Linkage Clustering: In group-average clustering, the similarity between two cluster is the average similarity between pairs of documents in these groups. 
3) Complete Linkage Clustering: In this method, the similarity between two clusters is the worst case similarity between any pair of documents in these groups.

K- Means Clustering in Text Mining 
K- means algorithm is one of the most widely used approach in data mining, as it take the input in data file and divides them into ‘k’ clusters.
In case of text mining , we have n documents of text data k means divides it into k clusters of text data. 

K means algorithm in text mining 

Raw_input: Document set D, number k of cluster 

Output : Set of k clusters initialisation  
  Select randomly k data points as starting centroids

Loop  
when not converged 
Assign documents to the centroids based on the closest similarity. 
Calculate the the cluster’s centroid.
end loop 
return k clusters ;

The main disadvantage of k-means clustering ,it is indeed very sensitive to the initial choice of the number of k.

Probabilistic Clustering and Topic Models

Probabilistic clustering has many algorithms but the Topic model is one of the most popular algorithms.y. Topic modelling is to create a generative model for the large set  of text documents. In topic models, documents are mixture of topics, where a topic is a probability distribution over words. 

They are Two Methods of topic model they are Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation. 

Probabilistic latent semantic analysis model does not provide any probabilistic model at the document level which makes it difficult to generalise it to model new unseen documents.  

The latent Dirichlet allocation model is an unsupervised technique for extracting thematic information of a collection of documents.The basic idea is that documents are represented as a random mixture of latent topics, where each topic is a probability distribution over words.