Development of Browser Extension for HTML Web Page Content Extraction
dc.authorid | MAYDA, ISLAM/0000-0001-5584-0259 | |
dc.contributor.author | Karabulut, Murat | |
dc.contributor.author | Mayda, Islam | |
dc.date.accessioned | 2025-03-26T17:34:57Z | |
dc.date.available | 2025-03-26T17:34:57Z | |
dc.date.issued | 2020 | |
dc.department | İstanbul Esenyurt Üniversitesi | |
dc.description | 2nd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA) -- JUN 26-27, 2020 -- TURKEY | |
dc.description.abstract | As the amount of content on websites grows, automatic content extraction from Web pages becomes more important. Although the literature contains many studies on this subject, no method fully solves the problem, owing to the flexible structure of HTML. Methods that succeed to some degree also lose performance over time as the structure of the Web changes and develops. In this study, a browser extension was developed to automatically download the text content of Web pages. The extension uses a parser built on the Document Object Model (DOM) structure to strip all tags and code from the page's text content, producing output with a 100% recall rate. The extension operates independently of language; it was tested on different types of popular Web sites in Turkey and shown to work successfully. | |
dc.description.sponsorship | IEEE Turkey Sect | |
dc.identifier.doi | 10.1109/hora49412.2020.9152891 | |
dc.identifier.endpage | 22 | |
dc.identifier.isbn | 978-1-7281-9352-6 | |
dc.identifier.scopus | 2-s2.0-85089685417 | |
dc.identifier.scopusquality | N/A | |
dc.identifier.startpage | 17 | |
dc.identifier.uri | https://doi.org/10.1109/hora49412.2020.9152891 | |
dc.identifier.uri | https://hdl.handle.net/20.500.14704/960 | |
dc.identifier.wos | WOS:000644404300002 | |
dc.identifier.wosquality | N/A | |
dc.indekslendigikaynak | Web of Science | |
dc.indekslendigikaynak | Scopus | |
dc.language.iso | tr | |
dc.publisher | IEEE | |
dc.relation.ispartof | 2nd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA 2020) | |
dc.relation.publicationcategory | Conference Item - International - Institutional Faculty Member | |
dc.rights | info:eu-repo/semantics/closedAccess | |
dc.snmz | KA_WOS_20250326 | |
dc.subject | web content extraction; web data extraction; web scraping | |
dc.title | Development of Browser Extension for HTML Web Page Content Extraction | |
dc.type | Conference Object |