Going Back in Time to Find What Existed on the Web and How much has been Preserved: How much of Palestinian Web has been Archived?

Thaer Sammar, Hadi Khalilia


The web is an important resource for publishing and sharing content, and its main characteristic is volatility: content is added, updated, and deleted all the time. Many national and international institutions have therefore started crawling and archiving web content. National institutions mainly archive the part of the web related to their country's heritage; for example, the National Library of the Netherlands focuses on archiving websites that are of value to Dutch heritage. However, some countries have not yet taken action to archive their web, which results in lost content and gaps in knowledge. In this research, we shed light on the Palestinian web; specifically, we ask how much of it has been archived. First, we create a list of Palestinian hosts that existed on the web. To do so, we queried the Google index, exploiting its time range filter to retrieve hosts over time. We collected 98 hosts on average per 5-year interval from 1990 to 2019. We also obtained 188 Palestinian hosts from the DMOZ directory. Second, we investigate the coverage of the collected hosts in the Internet Archive and Common Crawl. The coverage of the Google hosts in the Internet Archive ranges from 0% to 89%, from the oldest to the newest time interval, and the coverage of the DMOZ hosts was 96%. The coverage of the Google hosts in Common Crawl ranges from 57.1% to 74.3%, while the coverage of the DMOZ hosts in Common Crawl was 25% on average across all crawls. We found that even when a host is covered in the Internet Archive and Common Crawl, its lifespan and number of archived versions are low.
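The coverage figures reported above can be read as the share of collected hosts that also appear in an archive. The following is a minimal sketch of that calculation, not the authors' implementation: the function names `normalize_host` and `coverage`, the normalization rules (lowercasing, dropping a leading `www.`), and the example host names are all assumptions made for illustration. How the set of archived hosts is obtained from the Internet Archive or Common Crawl is left outside the sketch.

```python
from urllib.parse import urlsplit


def normalize_host(url_or_host: str) -> str:
    """Reduce a URL or bare host name to a lowercase host without 'www.'.

    The normalization rules here are an assumption, not taken from the paper.
    """
    text = url_or_host.strip().lower()
    # urlsplit only fills .hostname when a scheme or '//' prefix is present.
    if "//" not in text:
        text = "//" + text
    host = urlsplit(text).hostname or ""
    return host[4:] if host.startswith("www.") else host


def coverage(seed_hosts, archived_hosts) -> float:
    """Return the percentage of distinct seed hosts found in the archived set."""
    seeds = {normalize_host(h) for h in seed_hosts} - {""}
    archived = {normalize_host(h) for h in archived_hosts}
    if not seeds:
        return 0.0
    return 100.0 * len(seeds & archived) / len(seeds)


if __name__ == "__main__":
    # Hypothetical host names, used only to exercise the functions.
    seeds = ["http://www.example.ps/", "news.example.ps"]
    archived = ["example.ps"]
    print(f"coverage: {coverage(seeds, archived):.1f}%")
```

Comparing normalized host names rather than raw URLs keeps the measure at host granularity, so a host counts as covered if any of its pages was archived.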



Global Proceedings Repository © 2021