<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Observed by Burcu Dogan &#187; crawlers</title>
	<atom:link href="http://blog.burcudogan.com/tag/crawlers/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.burcudogan.com</link>
	<description>burcu dogan&#039;s monthly routine. caution: risk of overdose.</description>
	<lastBuildDate>Wed, 08 Sep 2010 20:37:35 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>An Introduction to Fundamental Web Crawling Strategies</title>
		<link>http://blog.burcudogan.com/100/</link>
		<comments>http://blog.burcudogan.com/100/#comments</comments>
		<pubDate>Thu, 12 Mar 2009 18:24:21 +0000</pubDate>
		<dc:creator>Burcu Dogan</dc:creator>
				<category><![CDATA[Regular]]></category>
		<category><![CDATA[crawlers]]></category>
		<category><![CDATA[distributed computing]]></category>
		<category><![CDATA[incremental crawling]]></category>
		<category><![CDATA[search engine]]></category>

		<guid isPermaLink="false">http://blog.burcudogan.com/?p=100</guid>
		<description><![CDATA[Generally, any search engine architecture consists of four core elements: a crawler, an indexer, a retrieval engine and a user interface that interacts with end users. In this post, I’ll give an introduction to crawlers, crawling strategies and the main challenges search engines face with the growth of the Web. What is a Web [...]]]></description>
			<content:encoded><![CDATA[<p>Generally, any search engine architecture consists of four core elements: a crawler, an indexer, a retrieval engine and a user interface that interacts with end users. In this post, I’ll give an introduction to crawlers, crawling strategies and the main challenges search engines face with the growth of the Web.</p>
<h2>What is a Web Crawler?</h2>
<p><a href="http://blog.burcudogan.com/wp-content/uploads/2009/03/crawling1.gif"><img title="crawling" class="right" src="http://blog.burcudogan.com/wp-content/uploads/2009/03/crawling-thumb1.gif" border="0" alt="crawling" width="240" height="160" /></a>A web crawler is an automated page collector that builds a local cache of copies of the pages it finds. A crawler starts with an initial set of known URLs, called <em>seed URLs</em>. It then extracts the links inside the known documents and is responsible for downloading these newly found pages in some order. There are two major types of crawling:</p>
<ol>
<li><strong>Periodic or Snapshot Crawling:</strong> The crawler keeps finding new pages until the collection reaches a desired size. It then periodically runs the same process and replaces the existing collection with the new one. There are typically very large intervals <em>between two crawls</em>. This method doesn’t use any knowledge gained from previous crawls.</li>
<li><strong>Incremental Crawling:</strong> These crawlers keep searching for new links even after the collection reaches the desired size. Old pages are revisited on a schedule to keep the collection up to date. It’s very hard to crawl a large source (such as the Web) this way. Documents have to be refreshed one by one, and new links have to be explored and added to the local collection. According to the statistics, the Web grows exponentially, which means there are even more newly added pages than updates to existing documents. We unfortunately have limited processing power, so it becomes more critical to decide <strong>which page to (re)visit next from the queue</strong>.</li>
</ol>
<p>The figure above shows that snapshot strategies are more useful for large-scope search engines. On the other hand, if the documents in a small-scale corpus are known to change rapidly, an incremental crawler is more effective.</p>
<h2>Scheduling the Download Queue</h2>
<p>Although I stated that snapshot crawling is the more likely choice for a large number of documents, the large gaps between updates make it an absolutely <span style="text-decoration: underline;">impractical solution</span> for Web search engines when competitors such as Google pick up even <em>the least significant change within hours</em>. Incremental crawling looks (and actually is) very costly if we cannot predict when a page is updated or removed. Re-downloading the whole Web at short intervals is also impossible. But what if we could guess how often a page changes? What if we could prioritize each URL and schedule our download queue?<span id="more-100"></span></p>
<p>The crawl scheduling methods used in commercial engines are kept secret, like the Coca-Cola formula. I’ll share only a few fundamental techniques here to give beginners an impression.</p>
<p><strong><a href="http://blog.burcudogan.com/wp-content/uploads/2009/03/crawling2.gif"><img class="right" title="crawling2" src="http://blog.burcudogan.com/wp-content/uploads/2009/03/crawling2-thumb.gif" border="0" alt="crawling2" width="240" height="157" align="right" /></a> Breadth-first search:</strong> The easiest but least efficient crawling algorithm. There is no particular sorting of the download queue: documents are downloaded in the order their URLs are found, and all of the links on one level are crawled before the crawler starts to discover the next level. In the figure on the right, the links on the seed-level document are downloaded before the links on the level0 documents, and so on.</p>
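<p>As a minimal sketch of this ordering, the download queue can be a plain FIFO queue. The link graph below is a hypothetical in-memory dictionary that stands in for actually fetching and parsing pages, and the names <code>bfs_crawl_order</code> and <code>TOY_GRAPH</code> are made up for the illustration.</p>

```python
from collections import deque


def bfs_crawl_order(link_graph, seeds):
    """Return URLs in the order a breadth-first crawler would download them."""
    queue = deque(seeds)   # FIFO download queue, never re-sorted
    seen = set(seeds)      # URLs already queued, so nothing is fetched twice
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)  # a real crawler would download the page here
        for link in link_graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical toy link graph: page -> links found on that page
TOY_GRAPH = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c", "d"],
    "c": [],
    "d": ["a"],
}

print(bfs_crawl_order(TOY_GRAPH, ["seed"]))  # ['seed', 'a', 'b', 'c', 'd']
```

<p>Both pages linked from the seed (<code>a</code> and <code>b</code>) are downloaded before anything found one level deeper.</p>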
<p><strong>Best-first search:</strong> This algorithm is currently the most popular search algorithm used in focused crawlers. In best-first search, URLs are not visited in the order they are discovered; instead, heuristics (usually results from Web analysis algorithms) are used to rank the URLs in the crawling queue, and those considered more likely to point to relevant pages are visited first. The least important pages have very little chance of being visited and are continually pushed to the back of the queue. Back-link count and partial PageRank are widely applied in this field.</p>
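<p>A rough sketch of a best-first queue, assuming the heuristic is simply the number of back-links seen so far (one of the options mentioned above). The link graph, the <code>limit</code> parameter and all names are hypothetical; a real focused crawler would plug in its own scoring.</p>

```python
import heapq
from collections import Counter


def best_first_order(link_graph, seeds, limit):
    """Visit up to `limit` URLs, always taking the one with most known back-links."""
    backlinks = Counter()  # back-links discovered so far, per URL
    visited = set()
    heap = []
    tie = 0                # insertion counter: equal scores pop in discovery order
    for url in seeds:
        heap.append((0, tie, url))
        tie += 1
    heapq.heapify(heap)
    order = []
    while heap and len(order) != limit:
        _, _, url = heapq.heappop(heap)
        if url in visited:
            continue       # stale entry pushed earlier with a lower score
        visited.add(url)
        order.append(url)  # a real crawler would download the page here
        for link in link_graph.get(url, []):
            if link in visited:
                continue
            backlinks[link] += 1
            # heapq is a min-heap, so the score is negated
            heapq.heappush(heap, (-backlinks[link], tie, link))
            tie += 1
    return order


# Hypothetical toy link graph: "d" is linked from three pages
TOY_LINKS = {
    "seed": ["a", "b", "c"],
    "a": ["d"],
    "b": ["d", "e"],
    "c": ["d"],
    "d": [],
    "e": [],
}

print(best_first_order(TOY_LINKS, ["seed"], 6))
# ['seed', 'a', 'b', 'd', 'c', 'e']
```

<p>Page <code>d</code> jumps ahead of <code>c</code> as soon as its second back-link is found, whereas a breadth-first crawl would have taken <code>c</code> first.</p>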
<p><strong>Tunnelling</strong>: This method is a good way to find the most relevant URLs that occur in a page. Let’s assume we are exploring the links in document D, and one of the links in D refers to document C. Some crawlers rank C to check how relevant C is to D. For example, if URLs pointing to both C and D occur on a third page, C and D are likely to be related. Or, if both D and C link to the same existing page other than themselves, they may be related. Documents are scored by their relevance to the root document and queued according to their rankings.</p>
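<p>The two signals described above can be counted directly on a known link graph. This is only a sketch under assumptions: the graph is a hypothetical dictionary, the function name is made up, and how the two counts are combined into one score is left open because the text does not specify it. These two measures are commonly called co-citation and bibliographic coupling.</p>

```python
def relatedness_signals(link_graph, c, d):
    """Return (pages linking to both c and d, pages both c and d link to)."""
    # Signal 1: third pages that contain links to both documents
    co_cited = sum(
        1
        for page, links in link_graph.items()
        if page not in (c, d) and c in links and d in links
    )
    # Signal 2: third pages that both documents link to
    shared = set(link_graph.get(c, [])).intersection(link_graph.get(d, []))
    shared.difference_update({c, d})
    return co_cited, len(shared)


# Hypothetical toy link graph: page -> links found on that page
TOY_PAGES = {
    "x": ["c", "d"],
    "y": ["c", "d"],
    "z": ["c"],
    "c": ["t"],
    "d": ["t", "u"],
    "t": [],
    "u": [],
}

print(relatedness_signals(TOY_PAGES, "c", "d"))  # (2, 1)
```

<p>Here two pages (<code>x</code> and <code>y</code>) link to both documents, and both documents link to <code>t</code>.</p>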
<h2>Constraints and Principles in Queuing</h2>
<p>A crawler should first and foremost be polite to the server it is trying to reach. You will most likely crash or slow down the source if you make thousands of requests a minute. Queues should be ordered so that they do not put extra load on the source server. In distributed hierarchies, queues should be re-sorted once distribution is complete to avoid <em>unexpected concurrent transactions</em>.</p>
<p>Another problem occurs in the <strong>synchronization</strong> of distributed crawlers. If two separate crawling processes are running and crawler1 and crawler2 find the same link on different pages, they should communicate to <em>avoid repeated downloads and post-processing</em>.</p>
<p>A large amount of data lies on the Deep Web, a term used for the invisible Web: documents that have no direct links pointing to them from the visible Web. A download queue should be able to give acceptable priority to this separate set even if link-based statistical data is not available. As an example from practice, Google started to use <a href="https://www.google.com/webmasters/tools">Sitemaps</a> a few years ago to explore hidden content and to let authors assign relative priorities that help Google crawlers with <em>predicting the change rates</em>.</p>
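<p>One simple way to keep a queue polite is to space out requests to the same host. The sketch below only plans start times and does not perform any requests; the two-second delay, the example URLs and the function name are arbitrary assumptions for the illustration.</p>

```python
from urllib.parse import urlsplit


def schedule_politely(urls, min_delay=2.0):
    """Assign each URL a start time (in seconds) so that requests to the
    same host are at least `min_delay` apart. Different hosts may overlap."""
    next_free = {}  # host -> earliest time the host may be contacted again
    plan = []
    for url in urls:
        host = urlsplit(url).netloc
        start = next_free.get(host, 0.0)
        plan.append((start, url))
        next_free[host] = start + min_delay
    # Stable sort: URLs with the same start time keep their queue order
    plan.sort(key=lambda item: item[0])
    return plan


queue = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
    "http://a.example/3",
]
for start, url in schedule_politely(queue):
    print(start, url)
# 0.0 http://a.example/1
# 0.0 http://b.example/1
# 2.0 http://a.example/2
# 4.0 http://a.example/3
```

<p>The request to the second host is pulled forward while the first host is resting, so the crawler stays busy without hitting any single server in bursts.</p>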
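<p>One possible way to organize that communication (an assumption here, not something the techniques above prescribe) is to give every host exactly one owning crawler, chosen by a stable hash, and to forward each discovered link to its owner. The class and function names below are hypothetical, and the crawlers live in one process only to keep the sketch short.</p>

```python
import hashlib
from urllib.parse import urlsplit


def owner_of(url, num_crawlers):
    """Map a URL's host to a crawler index with a hash that is stable across
    processes (Python's built-in hash() of strings is randomized per process)."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


class Crawler:
    def __init__(self):
        self.seen = set()  # URLs this crawler has already accepted
        self.queue = []    # URLs waiting to be downloaded

    def accept(self, url):
        """Queue a URL this crawler owns; return False for a duplicate."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True


def report_link(crawlers, url):
    """Called by whichever crawler found the link: forward it to its owner."""
    owner = crawlers[owner_of(url, len(crawlers))]
    return owner.accept(url)


crawlers = [Crawler(), Crawler()]
first = report_link(crawlers, "http://example.com/page")   # found by crawler1
second = report_link(crawlers, "http://example.com/page")  # found by crawler2
print(first, second)  # True False
```

<p>Because the same URL always lands on the same owner, the second report is recognized as a duplicate and the page is downloaded only once.</p>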
]]></content:encoded>
			<wfw:commentRss>http://blog.burcudogan.com/100/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
