SciPort RLP

Abstract

A variety of applications use methods to determine and extract the main textual content of an HTML document. The performance of the methods employed in this task is rarely evaluated. This paper fills this gap by introducing a platform-independent and extensible framework for measuring, evaluating and comparing the performance of methods for Content Extraction. We further give an overview of extraction algorithms found in domain-specific applications and present an adaptation of a related algorithm to perform Content Extraction. We compare the algorithms using the developed framework and show that our adapted algorithm performs best on most HTML documents.

Authors

Gottron, Thomas (author)

Classification

DFG subject area:
4.43 - Computer Science

DDC subject group:
Computer Science

Linked persons

Thomas Gottron
Research database administrator
(FB 4: Computer Science)

Evaluating content extraction on HTML documents
