
Content Code Blurring : a new approach to content extraction

Tjoa, A. Min (Ed.). Proceedings / DEXA 2008, 19th international conference on database and expert systems applications : 1 - 5 September 2008, Turin, Italy ; [workshop papers]. Piscataway, NJ: IEEE 2008 pp. 29 - 33

Year of publication: 2008

ISBN/ISSN: 978-1-424-43256-1

Publication type: Book chapter (conference paper)

Language: English

Verified: Library

Abstract


Most HTML documents on the World Wide Web contain far more than the article or text which forms their main content. Navigation menus, functional and design elements or commercial banners are typical examples of additional contents. Content Extraction is the process of identifying the main content and/or removing the additional contents. We introduce content code blurring, a novel Content Extraction algorithm. As the main text content is typically a long, homogeneously formatted region in a web document, the aim is to identify exactly these regions in an iterative process. Comparing its performance with existing Content Extraction solutions we show that for most documents content code blurring delivers the best results.
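
The abstract names the idea but does not spell out the steps. The following is a minimal sketch of one way such an approach could work, not the paper's implementation: each character of the document is marked as text (1.0) or tag code (0.0), the resulting vector is smoothed repeatedly with a moving average, and text characters whose smoothed value stays above a threshold are kept. The function names and the radius, iterations and threshold parameters are assumptions chosen for illustration, and the regular expression is a simplification that ignores scripts, comments and malformed markup.

```python
import re


def content_code_vector(html):
    """Return (chars, codes); codes[i] is 1.0 for text characters, 0.0 for tag characters."""
    chars, codes = [], []
    # re.split with a capturing group alternates: text, tag, text, tag, ...
    for index, token in enumerate(re.split(r"(<[^>]*>)", html)):
        value = 0.0 if index % 2 == 1 else 1.0
        for ch in token:
            chars.append(ch)
            codes.append(value)
    return chars, codes


def blur(codes, radius=10):
    """One smoothing pass: replace each value by the mean of its neighbourhood."""
    n = len(codes)
    blurred = []
    for i in range(n):
        window = codes[max(0, i - radius):min(n, i + radius + 1)]
        blurred.append(sum(window) / len(window))
    return blurred


def extract_main_content(html, radius=10, iterations=5, threshold=0.5):
    """Keep text characters that lie in long, text-dominated regions (illustrative sketch)."""
    chars, codes = content_code_vector(html)
    blurred = codes
    for _ in range(iterations):  # iterative blurring; the parameter values are assumed
        blurred = blur(blurred, radius)
    kept = [
        ch
        for ch, code, value in zip(chars, codes, blurred)
        if code == 1.0 and value >= threshold
    ]
    return "".join(kept)
```

Calling extract_main_content on a page with a short link list followed by a long paragraph should drop the link labels, because they are isolated text characters surrounded by tag code, and keep the bulk of the paragraph. Characters at the very edge of a text region may be lost, which is a limitation of this simplified thresholding.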

Classification


DFG subject area:
Computer Science

DDC subject group:
Computer Science

Related persons


Thomas Gottron