A Concept-Based Similarity Measure For Enhancing Text Clustering
Text mining techniques are mostly based on statistical analysis of a word or phrase. Statistical analysis of term frequency captures the importance of a term within a document only. However, two terms can have the same frequency in the same document while one contributes more to the meaning of the text than the other. Hence, the terms that capture the semantics of the text should be given more importance. Here, a new concept-based mining model is introduced. It analyzes terms at the sentence, document, and corpus levels. Using this method, terms that are unimportant to the sentence semantics can be efficiently distinguished from terms that hold the concepts representing the sentence meaning. Hence, the proposed method analyzes the terms that contribute to the sentence semantics at the sentence, document, and corpus levels rather than through the traditional analysis of the document only. The model consists of sentence-based concept
analysis, which calculates the conceptual term frequency (ctf); document-based concept analysis, which finds the term frequency (tf); corpus-based concept analysis, which determines the document frequency (df); and a concept-based similarity measure. The drawback of existing systems is that they can cluster only the documents that are given to the system, that is, structured text. They cannot be used to cluster web documents, that is, unstructured text documents. The proposed system is designed to overcome this disadvantage. The ctf, tf, and df measures are calculated over a corpus by the proposed algorithm, called the Concept-Based Analysis Algorithm. The algorithm is capable of matching each concept in a new document d with all the previously processed documents in O(m) time, where m is the number of concepts in d. The concept-based similarity measure exploits the information extracted by the concept-based analysis algorithm to better judge the similarity between documents. In this way, web documents are clustered efficiently, and the quality of the clusters achieved by this model significantly surpasses that of the traditional single-term-based approaches.
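
The bookkeeping behind the ctf, tf, and df measures, and the matching of a new document d against earlier documents, can be illustrated with a minimal Python sketch. It is not the Concept-Based Analysis Algorithm itself, which the abstract does not spell out. The sketch assumes that concepts have already been extracted and that each sentence arrives as a list of concept strings. It also assumes, purely for illustration, that the ctf of a concept in a document is its average count over the sentences that contain it. The names ConceptIndex, add_document, and df are hypothetical. A hash-based index means that each of the m distinct concepts in d needs one lookup, constant time on average, which is one way to read the O(m) claim; copying the matched document ids adds time proportional to the number of matches.

```python
from collections import Counter, defaultdict


class ConceptIndex:
    """Hypothetical corpus index: concept -> ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # concept -> {doc_id, ...}
        self.stats = {}                   # doc_id -> {concept: {"tf", "ctf"}}

    def add_document(self, doc_id, sentences):
        """Index a document given as a list of sentences (lists of concepts).

        Returns {concept: earlier doc ids} for every concept of this
        document that was already seen in previously processed documents.
        """
        tf = Counter()              # document-level count of each concept
        sentences_with = Counter()  # number of sentences containing each concept
        for sentence in sentences:
            counts = Counter(sentence)       # per-sentence concept counts
            tf.update(counts)
            sentences_with.update(counts.keys())

        matches = {}
        doc_stats = {}
        for concept, count in tf.items():    # one pass over the distinct concepts
            earlier = self.postings.get(concept)  # single hash lookup
            if earlier:
                matches[concept] = set(earlier)
            self.postings[concept].add(doc_id)
            doc_stats[concept] = {
                "tf": count,
                # Assumed definition: mean per-sentence count over the
                # sentences in which the concept occurs.
                "ctf": count / sentences_with[concept],
            }
        self.stats[doc_id] = doc_stats
        return matches

    def df(self, concept):
        """Document frequency: number of indexed documents with the concept."""
        return len(self.postings.get(concept, ()))


index = ConceptIndex()
index.add_document("d1", [["cluster", "document"],
                          ["cluster", "cluster", "concept"]])
matches = index.add_document("d2", [["concept", "web"], ["document"]])
print(matches)              # concepts of d2 already seen in earlier documents
print(index.stats["d1"])    # tf and ctf for each concept of d1
print(index.df("concept"))  # number of documents containing "concept"
```

A similarity measure would then combine these per-concept statistics for the concepts that two documents share; the weighting used by the proposed measure is not given in this abstract, so it is left out of the sketch.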