Observed by Burcu Dogan
Why do Code Reviews Matter?
Sep 3rd
We build software together. Team sizes vary a lot, but a team is usually larger than two or three people. Team members leave, new members join, and at the end of the day codebases are shared among large numbers of different people from development, testing and deployment.
Code review is the process you go through every time you modify the code itself. Even if you change a single line, a peer reviews it and confirms it can be submitted before you commit it to the repository. Reviews can help decrease the number of bugs, vulnerable pieces of code, violations of coding standards, and so on. In an environment with no review practice, the actual code is not seen by other developers until another team member opens the file(s) for editing. Without reviews, even if the build passes QA tests,
- Readability of the code,
- Compatibility with coding standards,
- Organization of the codebase,
- Documentation inside the code (/* comments */),
will stay as surprises. These may lead to a dirty and badly organized codebase after a few years of continuous development. Asking a peer before you integrate new code into your product is a better approach, since you will most probably have more time to fix, reshape and enhance your code if you act earlier. Usually only one other team member will have time to review your code, but reviewing should ideally be done by two people. The first should be a master, someone who knows the existing system well and can see the big picture. He can anticipate the impact of your changes across the codebase. The other should ideally be someone who is not very familiar with the code, so you can test how easy it would be to get into your code if you had to assign a bug to a new member, how well it fits established software development patterns, and how simple and obvious your solution is.
In short, whatever you are working on is shared among people. Asking for a review is always better than not asking and keeping your actual contribution a secret until it needs to be changed.
Let’s modify our representation of addresses in adr microformat
Aug 2nd
Microformats define a representation spec for addresses, called adr. This year, I made two distinct proposals to modify the current draft, but was turned down each time. In this post, I’m going to describe the current problems and how tiny enhancements can open new horizons in the retrieval of location-based information.
The problem
The current spec does not serve as a carrier for latitude and longitude. The current properties only include post-office-box, extended-address, street-address, locality, region, postal-code and country-name, which are text fields that together form an address. This schema was defined in vCard and migrated to the hCard microformat in 2004. Then the need for address-based extraction led them to copy this format and call it adr. vCard’s final design spec was written well before we had online maps. Nowadays, we have addresses all over the Web. Automatically linking these text addresses to locations on maps, or providing a preview on hover, would be a basic first attempt to improve our data representations. But unfortunately, maps speak in coordinates rather than in text addresses. In practise, there is a process that takes an address, transforms it into a (latitude, longitude) pair and pans the map to that location. This process is called geocoding, and it is far from perfect at today’s scale. Instead of depending on a geocoder to transform addresses into coordinates, I suggest that microformats allow built-in (lat, long) arrays in adr.
Extending adr with a set of latitudes and longitudes
What I’m going for is to extend adr with an optional list of (lat, long) values. In cases where coordinates are given, we can move to the location directly instead of asking a geocoder to land us there. But why use a list of coordinates and not a single point? Because in the spatial domain, different features are represented by different geometric shapes. Examples are below.
- If you are talking about a city centre, it’s most likely to be a Point.
- The Mississippi River is a long, long Line.
- And a university campus is obviously a Polygon.
In the image above, the British Museum is represented by 12 latitudes and longitudes, as the inner area which these points enclose. Another representation could be made with the centre point of the museum, (51.529038, -0.128403). Formally speaking, the museum is located at “British Museum, Great Russell Street, London, WC1B 3DG, UK”, and this translates to the coordinate I gave. What about using them together to form:
<div class="adr">
  <div class="street-address">Great Russell Street</div>
  <span class="locality">London</span>,
  <span class="postal-code">WC1B 3DG</span>,
  <div class="country-name">UK</div>
  <div class="geo"> <!-- optional coordinates attribute from geo -->
    <span class="latitude">51.529038</span>,
    <span class="longitude">-0.128403</span>
  </div>
</div>
In the example above, I’ve used geo to optionally include a single (lat, long) point that maps the address to a physical location. More useful structures could be defined within the standard to allow multiple point entries that form polygons, such as the 12-point representation of the British Museum in the image above. Or, more simply, multiple geo entries inside an adr may work.
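To make the idea concrete, here is a minimal sketch of how a consumer might read such markup. It is not part of any microformats spec: `AdrGeoParser` and `geometry_type` are hypothetical names, the parser simply pairs each latitude with the longitude that follows it without checking that they sit inside an adr, and the rule that a list whose first and last points coincide is a Polygon is an assumption made only for this illustration.

```python
from html.parser import HTMLParser


class AdrGeoParser(HTMLParser):
    """Collects (latitude, longitude) pairs from latitude/longitude spans."""

    def __init__(self):
        super().__init__()
        self._field = None   # "latitude" or "longitude" while inside such a span
        self._lat = None     # latitude waiting for its longitude
        self.points = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in ("latitude", "longitude"):
            if name in classes:
                self._field = name

    def handle_data(self, data):
        if self._field is None or not data.strip():
            return
        value = float(data.strip())
        if self._field == "latitude":
            self._lat = value
        elif self._lat is not None:
            self.points.append((self._lat, value))
            self._lat = None
        self._field = None


def geometry_type(points):
    """Guesses the shape from the point list (assumed rule, see lead-in)."""
    if not points:
        raise ValueError("no coordinates given")
    if len(points) == 1:
        return "Point"
    if len(points) >= 4 and points[0] == points[-1]:
        return "Polygon"   # assumed convention: a closed ring of points
    return "Line"


SAMPLE = """
<div class="adr">
  <div class="street-address">Great Russell Street</div>
  <div class="geo">
    <span class="latitude">51.529038</span>,
    <span class="longitude">-0.128403</span>
  </div>
</div>
"""

parser = AdrGeoParser()
parser.feed(SAMPLE)
print(parser.points, geometry_type(parser.points))
```

With several geo entries inside one adr, the same parser would return a longer list, and the shape could be derived from it without asking a geocoder.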
TODO: Write about the impact this usage can bring.
Testing for Accuracy and Precision
Jun 22nd
Software testing has no boundaries at all. The discipline is so unique that it’s not very common to see systematic approaches, due to the variety of material and the changing tradeoffs. A few weeks ago, I came across a decent software testing article by a Microsoft engineer, published on Live Spaces. Unfortunately, it was followed by 2 spam comments. It was very ironic to see such an assertive article ruined by two regular Russian spammers.
I love machine learning and classification. My whole life is spent between two parameters: accuracy and precision. These are the common statistical values used to determine how successful your system is. If you have a search engine, accuracy may tell you what percentage of the retrieved documents are really relevant, and precision tells you how likely your results are to cover all the relevant documents available. (In standard information-retrieval terminology, these two measures are called precision and recall, respectively.)
Surprisingly, a few days ago I was asked to break a machine learning system during a job interview, and to come up with some possible cases. According to my own philosophy, accuracy and precision are part of the system requirements. They are related to the quality of the overall product. But how are you going to collect the information to come up with these numbers? Imagine you are working on a search engine. Is it manageable to find n people and ask them manually whether they like the results or not? Will your sample (n people) reflect your user base? How costly will it be, and how objective? Is it really scalable? Is it possible for a human to read all of the documents on the Web and decide which are really related to his search phrase? These are a few introductory-level problems with the analysis of accuracy and precision.
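The two measures described above, the share of retrieved documents that are relevant and the share of all relevant documents that were retrieved, can be computed as in the sketch below. It uses the usual information-retrieval names, precision and recall; the function name and the document IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Returns (precision, recall) for one query.

    precision: fraction of retrieved documents that are relevant
    recall:    fraction of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 5 documents actually relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d3", "d5", "d6", "d7"]
print(precision_recall(retrieved_docs, relevant_docs))  # (0.5, 0.4)
```

Note that the hard part is not this arithmetic but obtaining the `relevant` set in the first place, which requires someone to judge the documents.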
Post-Processing and the Importance of Feedback
It may not be critical for you to release a product with a target accuracy and precision. The consumer market mostly suits this model best. But this alone should not be read as “quality tracking is inessential”. I am just advising you to track quality after the release (similar to the ship-then-test method). Detect which results are exit links, provide instant feedback tools that let users rearrange their results, and so on. Use the acquired feedback to improve the existing system. Testing may not be finished at release; you may need to discuss and analyze whether your product is performing well, report to your development team, and push for scalable, user-oriented improvements.
Addresses Not Found in High Traffic
Jun 8th
My sister found herself a new downloading hobby, and I was not planning to be the hobby killer until everything became inaccessible for both of us. She has been downloading heavily recently; I’m not sure what, but the load is high. Pages were loading more slowly on my side, as expected, and I’m not saying I have a lot of bandwidth, but the overall bottleneck was not just slower uploads or downloads.
UDP 53, what’s wrong there?
I started to recognize a pattern. My downloads were even slower because name resolution was failing miserably every time I tried. I was not even able to resolve domain names to IP addresses, so I had to check for myself what might be causing the problem. As a quick note, if your local DNS cache (managed by the operating system) doesn’t have a record for the domain name you’re trying to visit, you ask one of the nearby DNS servers to return the associated IP. If your nearby server doesn’t have that record, it asks the root servers, and so on. Most of my readers know the story well. This communication happens on UDP port 53. UDP is a connectionless way to transmit data. Unlike with TCP, you don’t have to spend time on a three-way handshake to set up a proper connection that both sides are aware of. But if your packets get lost, nobody is responsible. It’s like playing a game, with many tradeoffs, as in every engineering issue.
I gently asked my sister to stop for a while, and started receiving UDP answers that had not timed out. The resolving problem was fixed. But I still had to be convinced that UDP was the best choice available. I understood that the essential parameter was latency: we have to be as fast as possible. I wanted to step back and understand why it is designed this way, and my problem showed up together with its solution in milliseconds.
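The lookup described above can be sketched with nothing but a UDP socket. This is a minimal illustration, not how an operating system's resolver is implemented: `build_query` and `query_udp` are hypothetical helpers, the server address is whatever you pass in, and the reply is returned unparsed. The point to notice is that a lost packet simply shows up as a timeout, and retrying is the client's job.

```python
import random
import socket
import struct


def build_query(name, query_id):
    """Builds a DNS query for the A record of `name` (RFC 1035 wire format)."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.strip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A record, QCLASS 1 = Internet.
    return header + qname + struct.pack("!HH", 1, 1)


def query_udp(name, server, timeout=2.0, retries=3):
    """Sends the query to UDP port 53; returns the raw reply or None."""
    packet = build_query(name, random.randrange(0x10000))
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for _attempt in range(retries):
            sock.sendto(packet, (server, 53))
            try:
                reply, _addr = sock.recvfrom(512)
            except socket.timeout:
                continue  # request or reply was lost; UDP will not resend it
            if reply[:2] == packet[:2]:  # the ID must match our query
                return reply
    return None
```

On a saturated link, `query_udp` would tend to return None after exhausting its retries, which matches the resolution failures described above.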
Why does DNS use UDP?
Reliability versus speed. Recall the rule: if you don’t have the address, ask a nearby name server. Is it implicitly saying “Don’t go too far”? Probably it is. If you’re not on a very reliable connection and your traffic load is very high, there will be a lot of congestion, long delays and large jitter. My DNS requests most probably weren’t even making it to the name server. And since my ISP’s name servers are not reliable, I was using OpenDNS. Translation: I was far, far, far away from the source.
I fixed the issue. Even with the crazy downloading on again, my domains are resolving rapidly at the moment. I’m extremely happy. If you’re using OpenDNS at an office or on a LAN with more than 20 clients, do yourself a favor and set up a local name server today.
Data Manipulation on Client Machines
May 22nd
It’s my third week and I’m discussing the scalability of the real-time web. We’re only talking about text input, realtime search, trend extraction and so on. I have a growing love for this instant replier; it makes me feel more connected to real people (a sort of egoism). It works because we have text: as it turns out, none of the realtime providers do more than index the “text” for searching. Text is one medium of communication and there are many more: images, audio, video, etc. I guess you are not really interested in multimedia, because text is cool. You can skim it, select it and easily process it. But the world is not man-made, and I can’t even imagine a moment where maps are served as text, for example. Typing is the human’s built-in analog-to-digital converter. Love it or not, we are ungracefully forced by this nature to talk in multimedia when text alone is not efficient enough.
Realtime Data Processing
Realtime environments have to play with data to make connections and to provide smart search that does not depend only on full-text comparison. Imagine that they have to tokenize the text input to post-process what’s going on in each message. That takes roughly O(n) CPU time in the length of the input. Twitter has about 1 million users; let’s assume that on average every user posts a new tweet once every 3 days and that the average entry is 100 characters long (doesn’t sound realistic, but let’s be optimistic). That comes to roughly 350k posts a day. Tokenizing a 100-character input 350k times means 35 million characters processed per day. I tried tokenizing a 100-character string 350k times and it took only 41 seconds, since I was using the same string over and over again. With the help of caching, the CPU minimizes memory I/O, which made a huge difference, so my 41 seconds were far from accurate. But besides tokenizing, there are other operations you have to run, and once you have fetched the data from memory, you’re done. Therefore, I believe tokenizing the input on the server side is not really an extra load.
But what would you do if you had to post-process terabytes of imagery? I’m not sure if you are aware of Microsoft’s Virtual Earth 3D, but it is more or less like Google Earth running in your browser.
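The back-of-the-envelope estimate above can be replayed in a few lines. This is a rough sketch: `tokenize` is a hypothetical whitespace tokenizer, not whatever the realtime providers actually run, and the timing depends entirely on the machine, so no expected duration is given.

```python
import timeit

USERS = 1_000_000        # user count quoted above
DAYS_BETWEEN_POSTS = 3   # one post per user every 3 days
CHARS_PER_POST = 100     # assumed average length

exact_posts_per_day = USERS // DAYS_BETWEEN_POSTS   # 333,333
POSTS_PER_DAY = 350_000                              # the rounded figure used above
chars_per_day = POSTS_PER_DAY * CHARS_PER_POST       # 35,000,000


def tokenize(text):
    """Hypothetical tokenizer: splits on runs of whitespace."""
    return text.split()


def time_tokenizing(sample, repetitions):
    """Returns the seconds spent tokenizing the same string repeatedly."""
    return timeit.timeit(lambda: tokenize(sample), number=repetitions)


if __name__ == "__main__":
    sample = ("realtime search " * 7)[:CHARS_PER_POST]
    print(f"{chars_per_day:,} characters per day")
    seconds = time_tokenizing(sample, POSTS_PER_DAY)
    print(f"{seconds:.2f} s for {POSTS_PER_DAY:,} runs")
```

Because the same string is tokenized every time, this measurement benefits from caching just like the original experiment, so it should be read as a lower bound rather than a realistic load figure.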
A very long time ago, I was giving a demo to my friends and showing Mt. Rainier in WA in 3D mode. Virtual Earth 3D fetches higher-quality imagery for the foreground. Since there are no colour balance adjustments in VE imagery, many people thought the level differences in the scene were some sort of corruption and did not look good. I decided to tell the engineers that we could solve this problem with relative histogram equalization (fancy name but easy method). I didn’t sound perfectly realistic, because our imagery was tens of terabytes and it was very risky to process it all for such a tiny improvement.
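For reference, plain global histogram equalization, the textbook technique the name above presumably builds on, fits in a few lines. This is only a sketch of the standard method on a flat list of 8-bit grey values. It is not the relative variant mentioned in the post, whose details are not given here, and `equalize` is a hypothetical name.

```python
def equalize(pixels, levels=256):
    """Spreads grey values over the full range using the cumulative histogram."""
    # Count how often each grey level occurs.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution: number of pixels at or below each level.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:
        return list(pixels)  # a single grey level: nothing to stretch

    scale = (levels - 1) / (total - cdf_min)
    return [round((cdf[p] - cdf_min) * scale) for p in pixels]


# A dark, low-contrast strip of pixels gets stretched to the full 0..255 range.
print(equalize([10, 10, 20, 30, 30, 40]))  # [0, 0, 64, 191, 191, 255]
```

Each pixel is touched a constant number of times, so the cost grows linearly with the amount of imagery, which is why running it over tens of terabytes is a matter of I/O rather than of algorithmic difficulty.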