Clustering template based web documents
MacDonald, Craig (ed.). Advances in information retrieval: 30th European Conference on IR Research, ECIR 2008, Glasgow, UK, March 30 - April 3, 2008; proceedings. Berlin et al.: Springer 2008, pp. 40-51
Year of publication: 2008
ISBN/ISSN: 978-3-540-78645-0 ; 3-540-78645-7
Publication type: Book chapter (conference paper)
Language: English
Abstract
More and more documents on the World Wide Web are based on templates. On a technical level this causes those documents to have a quite similar source code and DOM tree structure. Grouping together documents which are based on the same template is an important task for applications that analyse the template structure and need clean training data. This paper develops and compares several distance measures for clustering web documents according to their underlying templates. Combining those distance measures with different approaches for clustering, we show which combination of methods leads to the desired result.
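The abstract does not name the distance measures or clustering methods that the paper evaluates. Purely as an illustration of the general idea, the following minimal sketch assumes one possible structural measure (Jaccard distance over pairs of adjacent opening tags) and one possible grouping strategy (single linkage with a fixed threshold). The names `tag_bigrams`, `structure_distance`, and `cluster`, as well as the threshold value of 0.5, are hypothetical choices for this example and are not taken from the paper.

```python
from html.parser import HTMLParser
from itertools import combinations


class TagCollector(HTMLParser):
    """Records the sequence of opening tags and ignores text content."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_bigrams(html):
    """Return the set of adjacent opening-tag pairs in document order."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return set(zip(collector.tags, collector.tags[1:]))


def structure_distance(html_a, html_b):
    """Jaccard distance between tag-bigram sets (assumed measure).

    0.0 means identical bigram sets, 1.0 means no bigram in common.
    """
    a, b = tag_bigrams(html_a), tag_bigrams(html_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def cluster(docs, threshold=0.5):
    """Single-linkage grouping: any pair closer than `threshold` is merged.

    Returns a sorted list of clusters, each a list of document indices.
    """
    parent = list(range(len(docs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if structure_distance(docs[i], docs[j]) < threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    page_a = "<html><body><div><h1>A</h1><p>x</p></div></body></html>"
    page_b = "<html><body><div><h1>B</h1><p>y</p><p>z</p></div></body></html>"
    page_c = "<html><body><table><tr><td>1</td></tr></table></body></html>"
    print(cluster([page_a, page_b, page_c]))
```

In this toy input, the first two pages share a layout and differ only in text and in one repeated paragraph, so they end up in the same cluster, while the table-based page is kept apart. Real template detection would likely need a measure that is more robust to varying content length.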
Classification
DFG subject area:
Computer science
DDC subject group:
Computer science