Bridging the gap : from multi document template detection to single document content extraction

Proceedings of the IASTED International Conference on Internet & Multimedia Systems & Applications with Special Sessions on Visual Communications : March 17 - 19, 2008, Innsbruck, Austria ; EuroIMSA ; (Innsbruck) : 2008.03.17-19. Anaheim, Calif. et al.: Acta Press 2008, pp. 66 - 71

Year of publication: 2008

ISBN/ISSN: 978-0-88986-727-7 ; 978-0-88986-7284

Publication type: Book chapter (conference paper)

Language: English

Verified by: Library

Abstract


Template Detection algorithms use collections of web documents to determine the structure of a common underlying template. Content Extraction algorithms instead operate on a single document and use heuristics to determine the main content. In this paper we propose a way to combine the reliability and theoretic underpinning of the first world with the single document based approach of the latter. Starting from a single initial document we use the set of hyperlinked web pages to build the required training set for Template Detection automatically. By clustering the documents in this set according to their underlying templates we clean the training set from documents based on different templates. We confirm the applicability of the approach by using an entropy based Template Detection algorithm to build a Content Extractor.

Classification


DFG subject area:
Computer science

DDC subject group:
Computer science

Associated persons


Thomas Gottron