Generally, any search engine architecture is consisted of four core elements: a crawler, an indexer, a retrieval engine and a user interface interacts with end users. In this post, I’ll make an introduction to crawlers, crawling strategies and the the main challenges search engines face with the growth of the Web.
What is a Web Crawler?
A web crawler is an automatic web page collector to create a cache/base of local copies of pages found. Initially a crawler starts with a beginning set of known URLs. These known links are also known as seed URLs. Then, crawler extracts the links inside the known documents and responsible to download these newly found pages in some order. In crawling field, there are two major crawler types:
- Periodic or Snapshot Crawling: Crawler continues to find new pages until the collection hits a desirable size. Then, periodically it runs the same process and replaces new collection with the existing. There are typically very large intervals between two crawlings. This method doesn’t use any existing knowledge comes from the previous crawls.
- Incremental Crawling: These crawlers keep searching for new links although collection becomes as large as it is desired to be. Old pages are repeatedly visited in a schedule to update the collection. It’s very hard to crawl a large source (such as Web) this way. Documents are needed to be refreshed one by one, also new links should be explored to be handled by the local collection. Web growth function is an exponential one according to the statistics. It means there are even more newly added pages than the updates of the existing documents. We unfortunately have a limited processing power, so it becomes more critical to decide which page to (re)visit next from the queue?
Figure above shows it is more useful to choose snapshot strategies for large scope search engines. On the other side, if rapid change of documents is guaranteed for a small scale corpus, it’s more effective to use an incremental crawler.
Scheduling the Download Queue
Although I stated it is more likely to use a snapshot crawling for large amount of documents, large gaps between updates makes it absolutely impractical solution for Web search engines while competitors such as Google explores even the most least significant change in hours. Incremental crawling looks (and actually is) very costly if we cannot predict when a page is updated/removed. Re-downloading the whole Web in short periods is also impossible. But, what if we can guess how often a page changes? What if we can prioritize each URL and schedule our download queue? Read the rest of this entry »
On a relational database, everything is in control; you can add constrains to ensure nobody will be able to enter a duplicated row. Or in deletion, you can program DBMS to handle the useless orphan rows. But the best, a relational DBMS is going to pre-process your SQL query before executing to avoid silly performance mistakes you can make. Think of the environment now: constraints over constraints, query execution strategies, high-level of dependence and complex indexing methods. This package works great unless you want to
BigTable looks like a very large B+ tree. It has 3 levels of hierarchy. All of tables are sorted and those tables are separated into pieces called tablets. First two levels are made of metadata tables to locate you to the right tablet. Root tablet is not distributed, but with helps of prefetching and extreme caching, it is actually not the bottleneck of the system. Final level tablets points to physical files (managed by Google File System). GFS provides 3 copies for each file on the system, so no matter if a machine is going down, they still have 2 other copies somewhere else. In the 2nd figure, a row of a tablet is illustrated. com.cnn.www is the key in this case and value has three different columns: contents, anchor:cnnsi.com and anchor:my.look.ca. Notice the timestamps, these fields may contain more than one version of entry. In this case, as Google crawler finds updated content on