Development of Browser Extension for HTML Web Page Content Extraction

Karabulut, Murat; Mayda, Islam

dc.authorid	MAYDA, ISLAM/0000-0001-5584-0259
dc.contributor.author	Karabulut, Murat
dc.contributor.author	Mayda, Islam
dc.date.accessioned	2025-03-26T17:34:57Z
dc.date.available	2025-03-26T17:34:57Z
dc.date.issued	2020
dc.department	İstanbul Esenyurt Üniversitesi
dc.description	2nd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA) -- JUN 26-27, 2020 -- TURKEY
dc.description.abstract	As the amount of content on websites increases, automatic content extraction from Web pages becomes more important. Although many studies in the literature address this problem, no method fully solves it, owing to the flexible structure of HTML. Methods that achieve partial success also lose performance over time as the structure of the Web changes and develops. In this study, a browser extension was developed to automatically download the text content of Web pages. The extension uses a parser built on the Document Object Model (DOM) structure to strip all tags and code from the page's text content, producing output with a 100% recall rate. The extension is language-independent; it was tested on different types of popular Web sites in Turkey and was shown to work successfully.
dc.description.sponsorship	IEEE Turkey Sect
dc.identifier.doi	10.1109/hora49412.2020.9152891
dc.identifier.endpage	22
dc.identifier.isbn	978-1-7281-9352-6
dc.identifier.scopus	2-s2.0-85089685417
dc.identifier.scopusquality	N/A
dc.identifier.startpage	17
dc.identifier.uri	https://doi.org/10.1109/hora49412.2020.9152891
dc.identifier.uri	https://hdl.handle.net/20.500.14704/960
dc.identifier.wos	WOS:000644404300002
dc.identifier.wosquality	N/A
dc.indekslendigikaynak	Web of Science
dc.indekslendigikaynak	Scopus
dc.language.iso	tr
dc.publisher	IEEE
dc.relation.ispartof	2nd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (Hora 2020)
dc.relation.publicationcategory	Conference Item - International - Institutional Faculty Member
dc.rights	info:eu-repo/semantics/closedAccess
dc.snmz	KA_WOS_20250326
dc.subject	web content extraction; web data extraction; web scraping
dc.title	Development of Browser Extension for HTML Web Page Content Extraction
dc.type	Conference Object
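
The abstract describes stripping all tags and code from a page so that only its text content remains. The record does not include the extension's source, so the following is only a minimal sketch of that general idea, not the authors' implementation. It assumes Python's standard-library `html.parser` (an event-based parser) as a stand-in for the browser DOM traversal the paper uses; the names `TextExtractor` and `extract_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping the contents of <script> and <style>.

    Hypothetical helper: an event-based stand-in for a DOM walk,
    not the parser described in the paper.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []     # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that lies outside script/style elements.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the text content of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because the sketch keeps every text fragment outside `script` and `style` elements, it favors recall over precision: navigation menus and footers are retained along with the main content. It operates on text nodes only, so it does not depend on the language of the page.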

Collections

WoS Indexed Publications Collection
Faculty of Engineering and Architecture Collection
Scopus Indexed Publications Collection
