Evaluating content extraction on HTML documents

Grout, Vic (ed.). Proceedings of the Second International Conference on Internet Technologies and Applications (ITA 07): 4-7 September 2007, University of Wales, NEWI, Wrexham, UK. Wrexham et al.: NEWI 2007, pp. 123-132

Year of publication: 2007

ISBN/ISSN: 978-0-946881-54-3

Publication type: Book chapter (conference paper)

Language: English

Verified: Library

Abstract


A variety of applications use methods to determine and extract the main textual contents of an HTML document. The performance of the methods employed in this task is rarely evaluated. This paper fills this gap by introducing a platform-independent and extensible framework for measuring, evaluating and comparing the performance of methods for Content Extraction. We further give an overview of extraction algorithms found in domain-specific applications and present an adaptation of a related algorithm to perform Content Extraction. We compare the algorithms using the developed framework and show that our adapted algorithm performs best on most HTML documents.

Classification


DFG subject area:
Computer science

DDC subject group:
Computer science

Associated persons


Thomas Gottron