Advances in Data Mining. Applications and Theoretical Aspects by Petra Perner

By Petra Perner

This book constitutes the refereed proceedings of the 14th Industrial Conference on Advances in Data Mining, ICDM 2014, held in St. Petersburg, Russia, in July 2014. The 16 revised full papers presented were carefully reviewed and selected from numerous submissions. The topics range from theoretical aspects of data mining to applications of data mining, such as in multimedia data, in marketing, in medicine and agriculture, and in process control and society.

Additional info for Advances in Data Mining. Applications and Theoretical Aspects: 14th Industrial Conference, ICDM 2014, St. Petersburg, Russia, July 16-20, 2014. Proceedings

Sample text

Detecting templates correctly and precisely thus becomes a vital part of many applications. Methods for template detection have been studied extensively. However, they are insufficient for detecting multiple templates in a Web site. In this paper, we propose a novel segment-based template detection method to identify templates. Our method works in three steps. First, for each Web site we construct an SSOM (Site-oriented Segment Object Model) tree from sampled pages in a Web collection by aligning the pages' SOM (Segment Object Model) trees.

Gibson et al. [7] conducted an extensive survey on the use of templates on the Web, which revealed the rapid development of templates. They also developed new randomized algorithms (a DOM-based algorithm and a Text-based algorithm) for template extraction. In the DOM-based algorithm, a hash is computed for each node from the node's content and its start and end offsets. Nodes are then considered templates if the occurrence counts of their hashes are within a specified threshold. In the Text-based algorithm, the page is pre-processed to remove all HTML tags, comments, and text within
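
The DOM-based idea described above (hash each node by its content and offsets, then count how often each hash occurs) can be illustrated with a minimal Python sketch. This is not the implementation of Gibson et al. [7]; several simplifications are assumptions made here for illustration: text between tags, found with a regular expression, stands in for DOM nodes; each hash is counted once per page; and "within a specified threshold" is read as a count lying in a closed range `[min_count, max_count]`. The names `node_hashes`, `template_hashes`, and `template_text` are hypothetical.

```python
import hashlib
import re
from collections import Counter

# Assumption: text between '>' and '<' is a crude stand-in for a DOM node.
TEXT_NODE = re.compile(r">([^<>]+)<")


def node_hashes(page):
    """Return (hash, content) pairs, one per non-empty text node in the page."""
    pairs = []
    for match in TEXT_NODE.finditer(page):
        content = match.group(1).strip()
        if not content:
            continue  # skip whitespace-only nodes
        # The hash covers the node content plus its start and end offsets.
        key = f"{content}|{match.start(1)}|{match.end(1)}"
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        pairs.append((digest, content))
    return pairs


def template_hashes(pages, min_count, max_count):
    """Return hashes whose page count lies in [min_count, max_count] (assumed reading)."""
    counts = Counter()
    for page in pages:
        # Count each hash at most once per page.
        counts.update({digest for digest, _ in node_hashes(page)})
    return {h for h, c in counts.items() if min_count <= c <= max_count}


def template_text(page, templates):
    """Return the text of the nodes in one page that were flagged as template."""
    return [content for digest, content in node_hashes(page) if digest in templates]


pages = [
    "<html><div>Site Menu</div><p>Article one</p></html>",
    "<html><div>Site Menu</div><p>Story two!!</p></html>",
    "<html><div>Site Menu</div><p>Third piece</p></html>",
]
templates = template_hashes(pages, min_count=2, max_count=3)
print(template_text(pages[0], templates))  # ['Site Menu']
```

Because the offsets are part of the hash, a shared block is only recognized when it sits at the same position in every page; in this sketch a navigation bar that shifts by even one character would not be counted as the same node.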