Combining content extraction heuristics : the CombinE system
Kotsis, G. (Ed.). The 10th International Conference on Information Integration and Web-based Applications & Services (iiWAS 2008): November 24-26, 2008, Linz, Austria. New York, NY: ACM, 2008, pp. 591-595
Year of publication: 2008
ISBN/ISSN: 978-1-605-58349-5
Publication type: Book chapter (conference paper)
Language: English
Abstract
The main text content of an HTML document on the WWW is typically surrounded by additional contents, such as navigation menus, advertisements, link lists or design elements. Content Extraction (CE) is the task to identify and extract the main content. Ongoing research has spawned several CE heuristics of different quality. However, so far only the Crunch framework combines several heuristics to improve its overall CE performance. Since Crunch, though, many new algorithms have been formulated. The CombinE system is designed to test, evaluate and optimise combinations of CE heuristics. Its aim is to develop CE systems which yield better and more reliable extracts of the main content of a web document.
Classification
DFG subject area:
Computer science
DDC subject group:
Computer science