<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Observed by Burcu Dogan &#187; distributed computing</title>
	<atom:link href="http://blog.burcudogan.com/tag/distributed-computing/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.burcudogan.com</link>
	<description>burcu dogan&#039;s monthly routine. caution: risk of overdose.</description>
	<lastBuildDate>Wed, 08 Sep 2010 20:37:35 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>An Introduction to Fundamental Web Crawling Strategies</title>
		<link>http://blog.burcudogan.com/100/</link>
		<comments>http://blog.burcudogan.com/100/#comments</comments>
		<pubDate>Thu, 12 Mar 2009 18:24:21 +0000</pubDate>
		<dc:creator>Burcu Dogan</dc:creator>
				<category><![CDATA[Regular]]></category>
		<category><![CDATA[crawlers]]></category>
		<category><![CDATA[distributed computing]]></category>
		<category><![CDATA[incremental crawling]]></category>
		<category><![CDATA[search engine]]></category>

		<guid isPermaLink="false">http://blog.burcudogan.com/?p=100</guid>
		<description><![CDATA[Generally, any search engine architecture consists of four core elements: a crawler, an indexer, a retrieval engine and a user interface that interacts with end users. In this post, I’ll give an introduction to crawlers, crawling strategies and the main challenges search engines face with the growth of the Web. What is a Web [...]]]></description>
			<content:encoded><![CDATA[<p>Generally, any search engine architecture consists of four core elements: a crawler, an indexer, a retrieval engine and a user interface that interacts with end users. In this post, I’ll give an introduction to crawlers, crawling strategies and the main challenges search engines face with the growth of the Web.</p>
<h2>What is a Web Crawler?</h2>
<p><a href="http://blog.burcudogan.com/wp-content/uploads/2009/03/crawling1.gif"><img title="crawling" class="right" src="http://blog.burcudogan.com/wp-content/uploads/2009/03/crawling-thumb1.gif" border="0" alt="crawling" width="240" height="160" /></a>A web crawler is an automatic web page collector that builds a local cache of copies of the pages it finds. A crawler starts with an initial set of known URLs, called <em>seed URLs</em>. It then extracts the links inside the known documents and is responsible for downloading these newly found pages in some order. In the crawling field, there are two major crawler types:</p>
<ol>
<li><strong>Periodic or Snapshot Crawling:</strong> The crawler keeps finding new pages until the collection reaches a desired size. It then periodically reruns the same process and replaces the existing collection with the new one. There are typically very large intervals <em>between two crawls</em>. This method doesn’t use any knowledge gained from previous crawls.</li>
<li><strong>Incremental Crawling:</strong> These crawlers keep searching for new links even after the collection has reached its desired size. Old pages are revisited on a schedule to keep the collection up to date. It’s very hard to crawl a large source (such as the Web) this way: documents need to be refreshed one by one, and new links must also be explored and added to the local collection. According to the statistics, the Web grows exponentially, which means there are even more newly added pages than updates to existing documents. Unfortunately, processing power is limited, so the critical decision becomes <strong>which page to (re)visit next from the queue</strong>.</li>
</ol>
<p>The figure above shows that snapshot strategies are more useful for large-scope search engines. On the other hand, if documents are known to change rapidly in a small-scale corpus, an incremental crawler is more effective.</p>
<h2>Scheduling the Download Queue</h2>
<p>Although I stated that snapshot crawling is the more likely choice for a large number of documents, the large gaps between updates make it an absolutely <span style="text-decoration: underline;">impractical solution</span> for Web search engines when competitors such as Google pick up <em>even the least significant change within hours</em>. Incremental crawling looks (and actually is) very costly if we cannot predict when a page is updated or removed. Re-downloading the whole Web at short intervals is also impossible. But what if we can guess how often a page changes? What if we can prioritize each URL and schedule our download queue?<span id="more-100"></span></p>
<p>The crawl scheduling methods used in commercial engines are kept secret, like the formula of Coca-Cola. I’ll share only a few fundamental techniques here to give beginners an impression.</p>
<p><strong><a href="http://blog.burcudogan.com/wp-content/uploads/2009/03/crawling2.gif"><img class="right" title="crawling2" src="http://blog.burcudogan.com/wp-content/uploads/2009/03/crawling2-thumb.gif" border="0" alt="crawling2" width="240" height="157" align="right" /></a> Breadth-first search:</strong> The easiest but least efficient crawling algorithm. The download queue is not sorted in any particular way: documents are downloaded in the order their URLs are found. As a consequence, all of the links in a level are crawled before the crawler starts to discover the next level. In the figure on the right, links on the seed-level document are downloaded before the links on the level0 documents, and so on.</p>
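<p>To make the ordering concrete, here is a minimal sketch of a first-in, first-out download queue. It assumes a hypothetical in-memory link graph (LINKS) instead of real HTTP requests, and the names bfs_crawl and max_pages are only illustrative.</p>

```python
from collections import deque

# Hypothetical link graph standing in for real page downloads:
# each URL maps to the list of links found on that page.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c", "d"],
    "b": ["d", "e"],
    "c": [],
    "d": ["seed"],
    "e": [],
}

def bfs_crawl(seed, links, max_pages=100):
    """Return URLs in the order a breadth-first crawler would download them."""
    queue = deque([seed])   # FIFO download queue
    seen = {seed}           # URLs already queued, so nothing is fetched twice
    order = []
    while queue and len(order) != max_pages:
        url = queue.popleft()
        order.append(url)                 # "download" the page
        for link in links.get(url, []):   # extract its outgoing links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(bfs_crawl("seed", LINKS))
```

<p>Because the queue is never re-sorted, every page of one level leaves the queue before any page of the next level does.</p>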
<p><strong>Best-first search:</strong> This is currently the most popular search algorithm used in focused crawlers. In best-first search, URLs are not visited in the order they are discovered; instead, heuristics (usually results from Web analysis algorithms) are used to rank the URLs in the crawling queue, and those considered more likely to point to relevant pages are visited first. The least important pages have very little chance of being visited and are continuously pushed to the back of the queue. Back-link count and partial PageRank are widely used as ranking heuristics in this field.</p>
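<p>A rough sketch of the same crawl loop with a priority queue in place of the FIFO queue is shown below. The SCORE table is a made-up stand-in for whatever heuristic a real crawler would compute (back-link count, partial PageRank and so on), and the link graph is again hypothetical.</p>

```python
import heapq

def best_first_crawl(seed, links, score, max_pages=100):
    """Visit URLs in descending heuristic score instead of discovery order."""
    # heapq is a min-heap, so scores are negated; the counter breaks ties
    # in favour of the URL that was discovered first.
    counter = 0
    frontier = [(-score.get(seed, 0.0), counter, seed)]
    seen = {seed}
    order = []
    while frontier and len(order) != max_pages:
        _, _, url = heapq.heappop(frontier)
        order.append(url)
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                counter += 1
                heapq.heappush(frontier, (-score.get(link, 0.0), counter, link))
    return order

# Hypothetical link graph and heuristic scores, for illustration only.
LINKS = {"seed": ["a", "b", "c"], "a": ["d"], "b": [], "c": ["e"], "d": [], "e": []}
SCORE = {"seed": 1.0, "a": 0.2, "b": 0.9, "c": 0.5, "d": 0.8, "e": 0.1}

print(best_first_crawl("seed", LINKS, SCORE))
```

<p>With a small max_pages, the low-scoring URLs at the back of the queue are simply never reached, which is the behaviour described above.</p>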
<p><strong>Tunnelling</strong>: This method is a good way to find the most relevant URLs that occur in a page. Let’s assume we are exploring the links in document D, and one of the links in D refers to document C. Some crawlers rank C to check how relevant C is to D. For example, if links to both C and D occur on a third page, C and D are likely to be related. Or, if both D and C link to an existing page other than D or C, they may be related. Documents are scored by their relevance to the root document and queued according to their rankings.</p>
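<p>The two signals described above can be counted directly from a link graph. The sketch below is only one possible scoring, assuming a hypothetical graph and simply adding both counts together; the first signal is usually called co-citation and the second bibliographic coupling.</p>

```python
def relatedness(c, d, links):
    """Count pages that link to both c and d, plus pages both c and d link to."""
    others = [page for page in links if page not in (c, d)]
    # Signal 1: a third page links to both documents.
    co_cited = sum(1 for page in others if c in links[page] and d in links[page])
    # Signal 2: both documents link to the same third page.
    shared = set(links.get(c, [])).intersection(links.get(d, []))
    shared_out = len(shared - {c, d})
    return co_cited + shared_out

def rank_links(d, links):
    """Order the links found in d by their relatedness to d, best first."""
    return sorted(links.get(d, []), key=lambda c: -relatedness(c, d, links))

# Hypothetical link graph, for illustration only.
LINKS = {
    "D": ["C", "X"],
    "C": ["X", "Y"],
    "P": ["C", "D"],
    "Q": ["C"],
    "X": [],
    "Y": [],
}

print(relatedness("C", "D", LINKS))
print(rank_links("D", LINKS))
```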
<h2>Constraints and Principles in Queuing</h2>
<p>Above all, a crawler should be polite to the server it is trying to reach. You will most likely crash or slow down the source if you make thousands of requests a minute. Queues should be ordered in a way that does not put extra load on the source server. In distributed hierarchies, queues should be re-sorted once distribution is completed to avoid <em>unexpected concurrent transactions</em>.</p>
<p>Another problem occurs in the <strong>synchronization</strong> of distributed crawlers. If two separate crawling processes are running, and crawler1 and crawler2 find the same link on different pages, they should communicate to <em>avoid repeated downloads and post-processing</em>.</p>
<p>A large amount of data lies on the Deep Web, a term used for the invisible Web: documents that no page on the visible Web links to directly. A download queue should be able to give acceptable priority to this separate set even if link-based statistical data is not available. As an example from practice, Google started to use <a href="https://www.google.com/webmasters/tools">Sitemaps</a> a few years ago to explore hidden content and to let authors assign relative priorities that help Google’s crawlers with <em>predicting the change rates</em>.</p>
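<p>One simple way to order a queue politely is to space out requests to the same host. The sketch below only computes a schedule and never sleeps or sends a request; the one-second delay and the example URLs are assumptions, not recommended values.</p>

```python
from urllib.parse import urlparse

def schedule(urls, delay=1.0):
    """Give each URL a start time so that requests to the same host are
    at least `delay` seconds apart; different hosts do not block each other."""
    next_free = {}   # host -> earliest time the next request may start
    plan = []
    for url in urls:
        host = urlparse(url).netloc
        start = next_free.get(host, 0.0)
        plan.append((start, url))
        next_free[host] = start + delay
    return sorted(plan)   # ordered by start time, then by URL

urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
    "http://a.example/3",
]
for start, url in schedule(urls):
    print(start, url)
```

<p>Interleaving hosts this way keeps the crawler busy while each individual server only sees one request per delay window.</p>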
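<p>One common way to keep that communication cheap, and not necessarily what any particular engine does, is to give every URL exactly one owner by hashing it. In the sketch below, owner and handle_discovered are hypothetical names, and forwarding a URL to its owner is left as a comment.</p>

```python
import hashlib

def owner(url, num_crawlers):
    """Map a URL to the index of the one crawler responsible for it."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers

def handle_discovered(url, my_id, num_crawlers, seen):
    """Return True if this crawler should download url now."""
    if owner(url, num_crawlers) != my_id:
        return False   # a real system would forward url to its owner here
    if url in seen:
        return False   # already downloaded or queued by this crawler
    seen.add(url)
    return True

# crawler 0 and crawler 1 find the same link on different pages
url = "http://example.com/page"
seen_by_0, seen_by_1 = set(), set()
print(handle_discovered(url, 0, 2, seen_by_0))
print(handle_discovered(url, 1, 2, seen_by_1))
```

<p>Exactly one of the two calls returns True, so the page is downloaded and post-processed only once.</p>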
]]></content:encoded>
			<wfw:commentRss>http://blog.burcudogan.com/100/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Functional Programming for Beginners</title>
		<link>http://blog.burcudogan.com/46/</link>
		<comments>http://blog.burcudogan.com/46/#comments</comments>
		<pubDate>Sun, 08 Mar 2009 17:29:53 +0000</pubDate>
		<dc:creator>Burcu Dogan</dc:creator>
				<category><![CDATA[Regular]]></category>
		<category><![CDATA[distributed computing]]></category>
		<category><![CDATA[functional programming]]></category>
		<category><![CDATA[lambda calculus]]></category>

		<guid isPermaLink="false">http://blog.burcudogan.com/?p=46</guid>
		<description><![CDATA[Recently, I’ve been getting many questions about functional programming. Instead of answering everybody one by one, I decided to write a blog post about it. In this article, I’ll try to introduce you to the FP concept. If you are interested, I advise you to get some hands-on experience. There are many widely used functional languages [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I’ve been getting many questions about functional programming. Instead of answering everybody one by one, I decided to write a blog post about it. In this article, I’ll try to introduce you to the FP concept. If you are interested, I advise you to get some hands-on experience. There are many widely used functional languages available today: LISP, Haskell, Erlang and F# (new but promising), to name a few.</p>
<h2>Firstly, a Brief History…</h2>
<p>A long time ago, in the 1930s, when the world was stuck in another economic recession, the lives of four extraordinary mathematicians crossed in Princeton, NJ. These men were not interested in the physical world they were living in; they were trying to create their own universe to find answers about the limits of <em>computation</em>, a word few had heard yet. The area they were interested in was called formal systems, and their main question was which problems would be solvable if processing power and memory were infinite. One of them was a true materialist, a slave to questioning and curiosity, a British guy who decided to move to the New World after graduating from King’s College, Cambridge. The second was a super brain whose Ph.D. dissertation was accepted when he was just 23 years old, nicknamed “Mr. Why”, and a close friend of Albert Einstein. The other two were Princeton’s own: a young professor who had earned his Ph.D. there and his doctoral student. Respectively, the names of these men were Alan Turing, Kurt Gödel, Alonzo Church and Stephen Kleene. In 1936, Turing extended Gödel’s study on the limits of proof and computation by replacing Gödel&#8217;s universal arithmetic-based formal language with formal devices called Turing machines. At the same time, Church and Kleene were designing a universal model of computation that was equivalent to Turing machines in power. Their formal system was called <em>lambda calculus</em>. Let’s say it in a clearer and less scientific-BS way: they invented a language, lambda calculus, that could serve as the smallest universal programming language in the world.</p>
<h2>Lambda Calculus</h2>
<p>Lambda calculus is, in a sense, the common programming language of the world. The main aim of its inventors was to prove that any computable function can be expressed and evaluated using this formalism. In the universe of lambda calculus, the key elements are &lt;name&gt;, &lt;expression&gt; and &lt;application&gt;, where</p>
<p><a href="http://blog.burcudogan.com/wp-content/uploads/2009/03/0411.gif"><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="04-1" src="http://blog.burcudogan.com/wp-content/uploads/2009/03/041-thumb1.gif" border="0" alt="04-1" width="437" height="42" /></a></p>
<p>A &lt;name&gt; in lambda calculus cannot be associated with different values, so it is not called a “variable.” Imagine that your favourite imperative programming language didn&#8217;t let you change the values of variables by default. Yes, it sounds like a headache at first, but the whole concept stands on these rules. Now, let’s move on to a <strong>more practical example</strong>: a function that multiplies its input by 2.</p>
<p><a href="http://blog.burcudogan.com/wp-content/uploads/2009/03/0421.gif"><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="04-2" src="http://blog.burcudogan.com/wp-content/uploads/2009/03/042-thumb1.gif" border="0" alt="04-2" width="158" height="25" /></a></p>
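<p>For readers who prefer code to notation, the same idea can be written with Python’s lambda keyword. This is only an analogy (Python is not lambda calculus), and the name double is chosen here for illustration.</p>

```python
# A function that multiplies its input by 2, written as an anonymous function.
double = lambda x: 2 * x

# Application: the function is applied to an argument.
print(double(21))

# The function does not need a name to be applied.
print((lambda x: 2 * x)(4))
```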
<p>For great examples, I suggest reading “<em>A Tutorial Introduction to the Lambda Calculus</em>” by Rojas.<span id="more-46"></span></p>
<h2>Functional Programming Language Primitives</h2>
<p>Unsurprisingly, functional programming languages are artificial languages in which the main program is built from a function (or nested functions), a concept very similar to lambda calculus. In this section, I’m going to highlight the most characteristic features (not restrictions) of functional languages.</p>
<p>1. <strong>No sequences of discrete commands</strong>. Traditional programming languages are based around the idea of a variable as a changeable association between a name and values. These languages are said to be imperative because they consist of sequences of commands. On the other hand, functional languages are based on structured function calls. A functional program is an expression consisting of a function call which calls other functions in turn (nested function calls).</p>
<p><a href="http://blog.burcudogan.com/wp-content/uploads/2009/03/0431.gif"><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="04-3" src="http://blog.burcudogan.com/wp-content/uploads/2009/03/043-thumb.gif" border="0" alt="04-3" width="417" height="73" /></a></p>
<p>2.<strong> Names may only be associated with a single value</strong>. In imperative languages, the value of a name (variable) can be modified. In functional languages, names are only introduced as the formal parameters of functions and given values by function calls with actual parameters. Once a formal parameter is associated with an actual parameter value, there is no way for it to be associated with a new value.</p>
<p>3. <strong>No guaranteed execution order</strong>. Imperative languages execute line by line (unless there are multi-threaded pieces). In contrast, functional languages don&#8217;t guarantee anything about execution order. You have to specify the execution order yourself.</p>
<p><a href="http://blog.burcudogan.com/wp-content/uploads/2009/03/044.gif"><img style="border-bottom: 0px; border-left: 0px; display: inline; border-top: 0px; border-right: 0px" title="04-4" src="http://blog.burcudogan.com/wp-content/uploads/2009/03/044-thumb.gif" border="0" alt="04-4" width="476" height="73" /></a></p>
<p>4. <strong>No repetitions of names</strong>. In functional languages, because the same names cannot be reused with different values, nested function calls are used to create new versions of the names for new values. Similarly, because command repetition cannot be used to change the values associated with names, recursive function calls are used to repeatedly create new versions of names associated with new values.</p>
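<p>The contrast between re-binding a name in a loop and creating new bindings through recursion can be sketched in a few lines. Python is used here only for illustration; it does not enforce any of the rules above, and the function names are made up for this example.</p>

```python
def total_imperative(numbers):
    """Imperative style: the name acc is re-bound on every iteration."""
    acc = 0
    for n in numbers:
        acc = acc + n
    return acc

def total_functional(numbers):
    """Functional style: every recursive call binds numbers once, to a new value."""
    if not numbers:
        return 0
    return numbers[0] + total_functional(numbers[1:])

def square(x):
    return x * x

# The whole program is a single expression made of nested function calls.
print(square(total_functional((1, 2, 3))))
```

<p>Both versions compute the same sum; the recursive one never changes the value associated with a name after it has been bound.</p>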
<h2>Advantages of Thinking Differently</h2>
<p>So, why do people need to think differently when we already have all the other rapid development languages off the shelf, without having to worry about the marginal concepts FP brings? People who are interested in distributed computing already know the answer. Have you ever heard of <a href="http://en.wikipedia.org/wiki/Mutual_exclusion">mutual exclusion</a>, race conditions or other problems in distributed/parallel programming? Shared memory is a problem, and no matter how hard computer scientists work on methods to lock shared resources during critical-section execution, the communication needed to synchronize these events often makes distributing the process not worth it. Consequently, some engineers came up with ideas like <a href="http://labs.google.com/papers/mapreduce.html">MapReduce</a>, where there are no dependencies between distributed tasks. MapReduce applies most of the functional programming concepts I presented above.</p>
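<p>A toy, single-machine word count shows the shape of the idea. This is a sketch of the programming model only, not Google’s implementation: map_phase, shuffle and reduce_phase are illustrative names, and a real framework would run the map and reduce calls on different machines.</p>

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    """Emit a (word, 1) pair for every word; the result depends only on the input."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group values by key, as a framework would do between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_phase(values):
    """Combine all the values collected for one key."""
    return reduce(lambda a, b: a + b, values, 0)

def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return {key: reduce_phase(vals) for key, vals in shuffle(mapped).items()}

print(word_count(["the cat", "The dog the"]))
```

<p>Since no map_phase call reads or writes anything shared with another call, the calls can run in any order or in parallel without locks.</p>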
<p>Secondly, functional programming is more than programming; it’s a way of thinking. You don’t need design patterns for functional programming because it’s a design pattern in itself. With Java, you have billions of features to run the world, but when it comes to designing a software system, they tend to short-circuit the brain. On the other hand, functional designs are deterministic.</p>
<p>There are many more advantages, such as in unit testing, deploying and updating software. Please keep in mind that functional programming is a tough subject to teach in a blog post. Every section in this post could be extended by a few thousand words. I recommend that you search, try, read, code and live it to get the complete feeling.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.burcudogan.com/46/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>BigTable Concept: Why do the World&#8217;s Smartest People Ignore Relational DBs?</title>
		<link>http://blog.burcudogan.com/9/</link>
		<comments>http://blog.burcudogan.com/9/#comments</comments>
		<pubDate>Tue, 03 Mar 2009 11:05:32 +0000</pubDate>
		<dc:creator>Burcu Dogan</dc:creator>
				<category><![CDATA[Regular]]></category>
		<category><![CDATA[azure]]></category>
		<category><![CDATA[bigtable]]></category>
		<category><![CDATA[distributed computing]]></category>
		<category><![CDATA[google]]></category>

		<guid isPermaLink="false">http://blog.burcudogan.com/?p=9</guid>
		<description><![CDATA[In the era of the Internet, the key problem is scalability. As the cloud&#8217;s popularity climbs, we are hearing more about its constraints. So far, I have only had time to play with Google&#8217;s App Engine and Microsoft&#8217;s Azure Services Platform. Cloud developers are mainly shocked by the new non-relational databases that cloud services use as [...]]]></description>
			<content:encoded><![CDATA[<p>In the era of the Internet, the key problem is <em>scalability</em>. As the cloud&#8217;s popularity climbs, we are hearing more about its constraints. So far, I have only had time to play with Google&#8217;s <a href="http://code.google.com/appengine/">App Engine</a> and Microsoft&#8217;s <a href="http://www.microsoft.com/azure/default.mspx">Azure Services Platform</a>. Cloud developers are mainly shocked by the new <strong>non-relational databases</strong> that cloud services offer as the only option. Google calls it BigTable, and Microsoft finds a new place in its own terminology dictionary for <a href="http://channel9.msdn.com/posts/smarx/Windows-Azure-Blob-Storage/">BLOB</a>. Many start to wonder what the hype about <a href="http://en.wikipedia.org/wiki/Relational_database">relational databases</a> over the past 30 years was for. First of all, let&#8217;s be clear that this is not a replacement, but a more efficient way to store data that eliminates the not-that-fundamental, over-engineered functionality layers of current relational database management systems. Yes, that is good news for people who make a living designing super large and highly <a href="http://en.wikipedia.org/wiki/Database_normalization">normalized databases</a> to ensure data integrity.</p>
<p><img class="left alignnone size-full wp-image-17" title="Bigtable Hierarchy" src="http://blog.burcudogan.com/wp-content/uploads/2009/03/02-bigtablehierarchy.gif" alt="Bigtable Hierarchy" width="260" height="174" />In a relational database, everything is under control; you can add constraints to ensure nobody is able to enter a duplicate row. Or, on deletion, you can program the DBMS to handle the useless orphan rows. Best of all, a relational DBMS pre-processes your SQL query before executing it to avoid silly performance mistakes you might make. Now think of the environment: constraints upon constraints, query execution strategies, a high level of dependence and complex indexing methods. This package works great unless you want to <span style="text-decoration: underline;">distribute the tables</span> across different machines. Can you imagine joining two tables that are distributed over 100,000 nodes? In Google&#8217;s case, this is an everyday problem (or better, call it an <em>every millisecond issue</em>). Luckily, Google&#8217;s data has characteristics that help; according to <a href="http://research.google.com/people/jeff/index.html">Jeffrey Dean</a>, they are able to manage the constraints that a DBMS would otherwise enforce <strong>at the application level</strong>. Consequently, Google keeps data in a very basic form, as &lt;key,value,timestamp&gt; tuples.</p>
<p><img class="left alignnone size-full wp-image-19" title="Bigtable Tablets" src="http://blog.burcudogan.com/wp-content/uploads/2009/03/02-tablets.gif" alt="Bigtable Tablets" width="428" height="89" />BigTable looks like a very large B+ tree with three levels of hierarchy. All tables are sorted and split into pieces called tablets. The first two levels are made of <em>metadata tables</em> that direct you to the right tablet. The root tablet is not distributed, but with the help of prefetching and aggressive caching, it is not actually the bottleneck of the system. Final-level tablets point to physical files (managed by the Google File System). GFS keeps 3 copies of each file on the system, so if a machine goes down, there are still 2 other copies somewhere else. The second figure illustrates a row of a tablet. com.cnn.www is the key in this case, and the value has three different columns: contents, anchor:cnnsi.com and anchor:my.look.ca. Notice the timestamps: these fields may contain more than one version of an entry. In this case, as the Google crawler finds updated content on <a href="http://www.cnn.com/">www.cnn.com</a>, a new layer is added. This lets BigTable provide a three-dimensional representation of data.</p>
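<p>The three dimensions (row key, column, timestamp) can be mimicked with a small in-memory map. The TinyTable class below is a toy written for this post and is not BigTable&#8217;s real API; the cell values and timestamps are placeholders.</p>

```python
class TinyTable:
    """Toy (row, column, timestamp) to value map; not BigTable's real API."""

    def __init__(self):
        self._cells = {}   # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        """Add a new version of a cell without overwriting older versions."""
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before timestamp (the latest if None)."""
        versions = self._cells.get((row, column), {})
        usable = [t for t in versions if timestamp is None or timestamp >= t]
        return versions[max(usable)] if usable else None

table = TinyTable()
# Each crawl of www.cnn.com adds a new layer under the same row and column.
table.put("com.cnn.www", "contents", 3, "page as crawled at t3")
table.put("com.cnn.www", "contents", 5, "page as crawled at t5")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")

print(table.get("com.cnn.www", "contents"))      # newest version
print(table.get("com.cnn.www", "contents", 4))   # version visible at t4
```

<p>Keeping every version under its own timestamp is what turns the flat key and value pairs into the layered view described above.</p>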
<p>At the end of the day, BigTable is not rocket science. It is compact, easy to adopt and very straightforward. Many friends know I came up with a very similar concept while designing <em>Rootapi</em> two years ago, at a time when I hadn&#8217;t heard of BigTable. Additionally, I was saving values as JSON (an equality operation was enough for querying) in blocks that were multiples of the sector size of my physical hard drives. IO operations were super fast, JSON-based web services were super fast and it was highly distributable, although I couldn&#8217;t find a suitable environment to explore the extreme cases in depth.</p>
<p>As we move to the cloud, this is the way we are going to look at data storage. If you need more technical details, I highly recommend taking a look at the following references:</p>
<ol>
<li><a href="http://video.google.com/videoplay?docid=7278544055668715642">BigTable: A Distributed Structured Storage System</a></li>
<li><a href="http://labs.google.com/papers/bigtable.html">Bigtable: A Distributed Storage System for Structured Data</a> &#8211; The original BigTable paper, which appeared at OSDI&#8217;06.</li>
<li><a href="http://www.youtube.com/watch?v=5Eib_H_zCEY">Google File System</a> &#8211; An introduction to GFS by Aaron Kimball.</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://blog.burcudogan.com/9/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
