Evaluating content extraction on HTML documents
Grout, Vic (ed.). Proceedings of the Second International Conference on Internet Technologies and Applications (ITA 07): 4-7 September 2007, University of Wales, NEWI, Wrexham, UK. Wrexham et al.: NEWI 2007, pp. 123-132
Year of publication: 2007
ISBN/ISSN: 978-0-946881-54-3
Publication type: Book chapter (conference paper)
Language: English
Verified | Library
Abstract
A variety of applications use methods to determine and extract the main textual content of an HTML document. The performance of the methods employed in this task is rarely evaluated. This paper fills this gap by introducing a platform-independent and extensible framework for measuring, evaluating and comparing the performance of methods for Content Extraction. We further give an overview of extraction algorithms found in domain-specific applications and present an adaptation of a related algorithm to perform Content Extraction. We compare the algorithms using the developed framework and show that our adapted algorithm performs best on most HTML documents.
Classification
DFG subject area:
Computer science
DDC subject group:
Computer science