<feed xmlns="http://www.w3.org/2005/Atom">
  <!-- Avast, Ye Scurvy Source-Readers! Prepare to be bored! -->
  <!-- No, really, it's boring. -->
  <id>tag:org.bueno.carlos</id>
  <updated>2012-03-14T07:03:40.000-08:00</updated>
  <title type="text">Carlos Bueno</title>
  <link rel="alternate" type="text/html" href="http://carlos.bueno.org/"/>
  <link rel="self" type="application/atom+xml" href="http://carlos.bueno.org/atom.xml"/>
  <author>
    <name>Carlos Bueno</name>
    <email>carlos@bueno.org</email>
  </author>
  
       <entry>
         <id>tag:org.bueno.carlos.9412f867fd6c9f18caa4caea2ccbc7878b30f23c</id>
         <published>2012-03-01T12:00:00.000-08:00</published>
         <updated>2012-03-01T12:00:00.000-08:00</updated>
         <title type="text">Rocking Kickstarter</title>
         <content type="html"><![CDATA[
           

<p>This is just a list of things I learned from <a href="http://www.kickstarter.com/projects/512752850/a-paper-internet" rel="external nofollow">one failed</a> and <a href="http://www.kickstarter.com/projects/512752850/lauren-ipsum-computer-science-for-kids">one successful</a> Kickstarter project. Everyone is different but I hope this will be useful to you. I will assume that you are producing an object with simple manufacturing: a book, a game, a toy, etc.

          <span class="sidenote">
             <a href="http://www.laurenipsum.org">
             <img src="../../images/ipsum.jpg" border="0" title="Lauren Ipsum" /><br/>
             A story about computer science <br/>and other improbable things.
             </a><br/>
              <iframe src="http://www.facebook.com/plugins/like.php?app_id=251397201557373&amp;href=http%3A%2F%2Fwww.facebook.com%2Flaurenipsum&amp;send=false&amp;layout=button_count&amp;width=90&amp;show_faces=true&amp;action=like&amp;colorscheme=light&amp;font&amp;height=21" scrolling="no" frameborder="0" style="border:none; overflow:hidden; width:90px; height:21px;" allowTransparency="true"></iframe>
          </span>
</p>

<blockquote>
<i>“I once made the mistake of generalizing from a single data point. I'll never do that again!”</i><br/>
&ndash; Achilles the Logician
</blockquote>

<h3>Do something very specific with the money.</h3>

<p>We raised money to pay for translations of our book. We didn't need money for the English edition. Between my wife and me we had the technical, graphic, and writing skills to produce a book. While it was sometimes hard to work nights &amp; weekends, it was doable.</p>

<p>But we believe that internationalization is as important for books as it is for software. A good translation requires serious amounts of specialized work, so that's what we asked backers to help pay for.</p>

<h3>Keep the rewards simple to understand.</h3>

<p>Fine-grained price discrimination is a big-league tool. You are not big-league. I suspect that too many different reward tiers cause potential backers to be overwhelmed by choice. With Lauren Ipsum we offered digital copies, signed paper copies and prints of the illustrations. $5 extra for international postage. That was it. Fulfillment was still an undertaking, but more manageable because of the simplicity.</p>

<h3>Charge a lot.</h3>

<p>Make sure there is a decent profit on every reward tier. Remember taxes, transaction fees, and shipping. My rule of thumb was to charge double the intended retail price for each item. Selling a hardcover for $60 might feel weird at first, but it's perfectly fine. You'll see less of that than you think. :)</p>

<p>The mix of backers for Ipsum was interesting: about 40% bought digital versions, 35% paperback, and the rest hardcovers and prints. The average revenue per backer was $32, and the average fulfillment cost was about $12. Even without the pure-margin digital tier and high-margin print tier we would have made a good sum. This is key.</p>

<h3>Avoid timesinks.</h3>

<p>Sure, it's fun and whimsical to offer hand-painted t-shirts or home-baked cookies as a reward, but how many of those can you do? <b>The point of Kickstarter is to trade money for time in the other direction</b>. Backers give you money which you give to other people (e.g., your landlord), in exchange for time to produce your actual project. Offering labor-intensive rewards that do not move your project forward is a net loss. A good bad example would be “$5 gets you a sticker of your choice”. At first this sounds reasonable. You want to offer something for a small token payment. But even leaving aside the shipping costs, having to match the backer's choice with the sticker with the envelope will eat up all your time.</p>

<h3>Exclusive access is the most precious reward.</h3>

<p>Consider offering backers exclusive access to whatever you're producing for a month. This is a unique selling point that scales, and costs you absolutely nothing. It's also a good way to get that crucial pre-launch feedback. Think of it as the dress rehearsal.</p>

<h3>Ask backers to help you bridge the gap.</h3>

<p>The gap between attracting backers and hawking your wares in the general marketplace is huge. You lose almost all of the social credit you built up inside the Kickstarter nest. No one knows you on Amazon, and no one cares how many people liked you on some other site. This is bad, but manageable.

          <span class="sidenote">
             <a href="http://amazon.com/Lauren-Ipsum-Carlos-Bueno/dp/1461178185"><img src="../../images/ipsum-reviews.png" border="0" title="Lauren Ipsum" /></a>
          </span>
</p>

<p>Once you've mailed out the last package, schedule a period of rest before the real work of launching begins. This is where leaving a few weeks of “beta” time between the end of the Kickstarter and your official launch helps. This gives backers time to receive the work, give you feedback, brag about it online, and so on.</p>

<p>Then you open the curtains and put your work on general sale. Send an update to your backers, asking them to rate and review on various sites. They want to help you. They just got their reward and they will welcome the chance to help your project some more.</p>

<h3>Kickstarter gives you awesome analytics.</h3>

<p><img src="../../images/ipsum-analytics.png" width="500" height="200" title="Because of this table, I forgave them for using pie charts." /></p>


<p><a href="http://chocolateandvodka.com/2011/12/06/kindle-sales-stats-a-paucity-of-information/">Believe me</a>, <a href="http://www.forbes.com/sites/suwcharmananderson/2012/01/11/amazon-should-give-self-publishers-more-data/" rel="external nofollow">you'll miss it later</a>. The most interesting fact was the hit-based nature of backing. There were some sales from people browsing the site itself, but over 90% came from external clicks directly to the project page. If I was able to get an interesting article out in front of a few thousand people, there would be a burst of purchases that decayed in 2-3 days. Ipsum never reached a self-sustaining critical mass to shoot it to $100K or $1M, but we ended up with a respectable $10K.</p>

<p>This data informed the marketing plan post-Kickstarter. None of the marketing plays we tried worked except press hits, and the timing of those hits is as important as their magnitude.</p>




         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2012/03/rocking-kickstarterl.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.e6ea995670676abc8e642648842d5a3866b58f5f</id>
         <published>2012-02-01T12:00:00.000-08:00</published>
         <updated>2012-02-01T12:00:00.000-08:00</updated>
         <title type="text">Undercutting the Undercutters</title>
         <content type="html"><![CDATA[
           
<p>I wrote earlier about <a href="http://carlos.bueno.org/2012/02/bots-seized-control.html">pricebots gaming Amazon</a> and forcing down the retail price of <a href="http://www.amazon.com/gp/product/1461178185">Lauren Ipsum</a> to $10.76. Now I've come across something even shadier. A fake store that looks better than my own blog, "selling" it for even less than that.
          <span class="sidenote">
            <img src="../../images/geeft.png" /><br/>
            I wish I had product shots like that.
          </span>
</p>

<p>A site called Geefts ("Gifts for web geeks!") has jumped into the game, pretending to have paperback copies of <i>Lauren Ipsum</i> for only <b>$10</b>. Remember, this is a fairly new print-on-demand book, so where would they get stock?</p>

<p>Geefts is also <a href="http://topsy.com/geefts.net/lauren-ipsum-computer-science-for-kids-by-carlos-bueno/" rel="external nofollow">busy spamming Twitter</a> with identical posts from attractive female accounts linking to their "product page". The site is decently designed, even going to the trouble of photoshopping the cover of my book onto some other book.</p>

<p>It's all a lie. When you click their "Geeft it for $10" button it just dumps you to Amazon's product page for the book with a completely different price. But having passed along, of course, their affiliate tag so they get a commission for whatever you happen to buy there.</p>

<p>I understand marketing your own book, and competing with other people marketing <i>their</i> books. But I'm wasting my time battling bots and spammers that pretend to have copies of <i>my own book</i>. They are playing a numbers game, not caring what happens as long as they get their affiliate fee or shipping &amp; handling charge.</p>

<p>All of this is frankly confusing. On one hand it's nice that people are paying attention, and are at least hearing about the idea that everyone can and should learn how to program computers. On the other hand, potential readers have to pass through this thicket of deception.</p>


         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2012/02/pricebot2.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.5a494fa5c39164e2adc874cb1481a9f7a0821065</id>
         <published>2012-02-01T12:00:00.000-08:00</published>
         <updated>2012-02-01T12:00:00.000-08:00</updated>
         <title type="text">How Bots Seized Control of My Pricing Strategy</title>
         <content type="html"><![CDATA[
           

<p>Before I talk about my own troubles, let me tell you about another book, “<i>Computer Game Bot Turing Test</i>”. It's one of over 100,000 “books” “written” by a Markov chain running over random Wikipedia articles, bundled up and sold online for a ridiculous price. The publisher, Betascript, is notorious for this kind of thing.
          <span class="sidenote">
             <a href="http://www.laurenipsum.org">
             <img src="../../images/ipsum.jpg" border="0" title="Lauren Ipsum" /><br/>
             A story about computer science <br/>and other improbable things.
             </a><br/>
              <iframe src="http://www.facebook.com/plugins/like.php?app_id=251397201557373&amp;href=http%3A%2F%2Fwww.facebook.com%2Flaurenipsum&amp;send=false&amp;layout=button_count&amp;width=90&amp;show_faces=true&amp;action=like&amp;colorscheme=light&amp;font&amp;height=21" scrolling="no" frameborder="0" style="border:none; overflow:hidden; width:90px; height:21px;" allowTransparency="true"></iframe>
          </span>
</p>



<p>It gets better. There are whole species of other bots that infest the Amazon Marketplace, pretending to have used copies of books, fighting epic price wars no one ever sees. So with “<i>Turing Test</i>” we have a delightful futuristic absurdity: a computer program, pretending to be human, hawking a book about computers pretending to be human, while other computer programs pretend to have used copies of it. A book that was never actually written, much less printed and read.</p>

<p><img src="../../images/betascript.png" width="486" height="192" title="No link for you!" /></p>

<p>The internet has everything.</p>

<p>This would just be an interesting anecdote, except that bot activity also seems to affect books that, you know, actually exist. Last year I published my <a href="http://www.amazon.com/gp/product/1461178185">children's book about computer science, <i>Lauren Ipsum</i></a>. I set a price of $14.95 for the paperback edition and sales have been pretty good. Then last week I noticed a marketplace bot offering to sell it for $55.63. “Silly bots”, I thought to myself, “must be a bug”. After all, it's print-on-demand, so where would you get a new copy to sell?</p>

<p>Then it occurred to me that all they have to do is buy a copy from Amazon, if anyone is ever foolish enough to buy from them, and reap a profit. Lazy evaluation, made flesh. Clever bots!</p>

<p>Then another bot piled on, and then one based in the UK. They started competing with each other on price. Pretty soon they were offering my book <i>below the retail price</i>, and trying to make up the difference on “shipping and handling”. I was getting a bit worried.</p>

<p>The punchline is that Amazon itself is a bot that does price-matching. Soon after the marketplace bots' race to the bottom, it decided to put my book on sale! 28% off. I can't wait to find out what that does to my margin. (Update: nothing, it turns out. Amazon is eating the entire discount. This is a pleasant surprise.)</p>

<p class="figure">
<a href="http://www.amazon.com/gp/product/1461178185"><img src="../../images/ipsum-cheap.png" title="On sale now! :)" height="183" width="428" /></a>
</p>

<p>My reaction to this algorithmic whipsawing has settled down to a kind of helpless bemusement. I mean, <i>the plot of my book</i> is about how understanding computers is the first step to taking control of your life in the 21st century. Now I don't know what to believe.</p>

<p>It's possible that the optimal price of <i>Lauren Ipsum</i> is, in fact, ten dollars and seventy-six cents and I should just relax and trust the tattooed hipster who wrote Amazon's pricing algorithm. After all, I no longer have a choice. The price is now determined by the complex interaction of several independent computer programs, most of which don't actually have a copy to sell.</p>

<p>But I can't help but think about that old gambler's proverb: “If you can't spot the sucker, it's you.”</p>

<hr/>

<p><i>Update, 29 Feb</i>: The Goodwill Industries of Greater Nebraska has jumped into the ring, offering a used copy of Ipsum for $12.18. I believe this one. It mentions that their copy is signed by both me and the illustrator. We sold about 250 signed copies during our Kickstarter campaign, and 2 or 3 went to Nebraska.</p>

<p>Translations: <a href="http://room404.net/?p=50028">Hebrew</a>, <a href="http://www.aoky.net/articles/carlos_bueno/bots-seized-control.htm">Japanese</a></p>

<hr/>

<p>Like what you read? <i>Lauren Ipsum</i> is <a href="http://www.laurenipsum.org/buy">a children's story about computer science</a>. Buy a copy, and you'll be helping us translate it into Spanish, Portuguese, and other languages.</p>


         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2012/02/bots-seized-control.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.08e3c37b569c9f60c5444d46862a1385b083a6b8</id>
         <published>2011-11-01T12:00:00.000-08:00</published>
         <updated>2011-11-01T12:00:00.000-08:00</updated>
         <title type="text">Lauren Ipsum: A Tinker's Trade</title>
         <content type="html"><![CDATA[
           
<h3>Chapter 6</h3>
<p>When they were safely inside the town walls, the little lizard popped his head out of Laurie’s pocket.


          <span class="sidenote">
             <a href="http://www.laurenipsum.org">
             <img src="../../images/ipsum.jpg" border="0" title="Lauren Ipsum" /><br/>
             A story about computer science <br/>and other improbable things.
             </a><br/>
              <iframe src="http://www.facebook.com/plugins/like.php?app_id=251397201557373&amp;href=http%3A%2F%2Fwww.facebook.com%2Flaurenipsum&amp;send=false&amp;layout=button_count&amp;width=90&amp;show_faces=true&amp;action=like&amp;colorscheme=light&amp;font&amp;height=21" scrolling="no" frameborder="0" style="border:none; overflow:hidden; width:90px; height:21px;" allowTransparency="true"></iframe>
          </span>

</p><p>“See what I mean? Let’s hope they don’t figure out what you did to get in here,” Xor said. “So, why are we here?”</p><p>“We’re looking for information. Maybe we can find a map or something.”</p><p>“Oh,” said Xor. “I was hoping you were going to say food. Why don’t we try this place?”</p><p>In front of them was a storefront with a very fancy sign painted on the window:</p><blockquote><p><img src="../../images/tinker/img7.jpg" alt="image" /> </p></blockquote><p>“Al-go-rith-ms. That sounds like a kind of fruit.”</p><p>“Are you <em>always</em> hungry, Xor?”</p><p>“Time flies like an arrow, and fruit flies like a banana. Let’s see if there’s a fruit fly problem I can help them solve.”</p><p>A bell jingled as she opened the door. “Hello, hello!” the shopkeeper said. “And welcome to my shop. I’m Tinker, and you are looking for a finely crafted algorithm, am I right?”</p><p>Laurie looked at the items listed on the chalkboard, but they didn’t make any sense.</p><blockquote><p><img src="../../images/tinker/img8.jpg" alt="image" /> </p></blockquote><p>“I’m not sure. What <em>is</em> an algorithm? Can you eat it?”</p><p>“What? No, it’s just a fancy way of saying ‘how to do something’. But ‘Algorithm’ looks more impressive on the sign,” said Tinker.</p><p>Xor turned orange with disappointment.</p><p>“How to do something.” repeated Laurie. “In that case, I want to find a sensible way to visit every town.”</p><p>“That sounds like an interesting problem. What have you been doing so far?”</p><p>Laurie told him about her adventure in the Red-Black Forest and her visit with Eponymous Bach.</p><p>“A Hamiltonian path, eh?” said Tinker. “That’s a tough one. I hate to say it, because he sounds like a nice person, but <a href="/2011/10/wandering.html">the Wandering Salesman</a> might take a long, long time to finish his tour of all the towns.”</p><p>“Oh, no! 
But why?”</p><p>“If you always go to the nearest town you haven’t visited yet, you might miss a town that’s just a little farther away. Then you go to another town that’s closer to you but still farther from the one you missed, and so on. You can end up criss-crossing the whole country to get to the last few towns.”</p><p>“That sounds exhausting,” said Laurie. The Wandering Salesman wasn’t so sensible after all! “So how do I find the shortest path?”</p><p>“I’ll have a look at what I have in stock. But it might be expensive.”</p><p>“I don’t have much money with me,” Laurie said. She took a few quarters from her bag and showed them to Tinker.</p><p>He looked at them with surprise. “‘Quarter Dollar.’ Is this money where you come from?”</p><p>“Of course it’s money! That’s seventy-five cents.” she said.</p><p>“Cents? We use <a href="/2011/10/fair-coin.html">Fair Coins</a> here.”</p><p>“What’s a Fair Coin?”</p><p>“Well, they are a bit bigger than these ‘Quarter Dollars’ of yours, but they are not nearly as pretty! You can tell genuine Fair Coins because they always flip heads or tails, fifty-fifty.”</p><p>“But you can flip quarters fifty-fifty too!”</p><p>“That may be true, but I can’t just take <em>your</em> word for it, can I? Here, all Fair Coins must be certified Fair.”</p><p>Laurie was crestfallen.</p><p>“Don’t look so sad! I do want to help you,” said Tinker. “Maybe we can do a trade. It so happens I’m in the market for a particular algorithm.”</p><p>“But I don’t have any algorithms, either.” said Laurie.</p><p>“That’s not a problem,” said Tinker. “You can compose new ones any time you want, with a little bit of thinking.”</p><p>“I can? How?”</p><p>“Well, everyone develops their own style. You can put little ideas together to make big ideas. Or you put two ideas side-by-side and compare them. Or you start with big ideas and take them apart.”</p><p>“You mean like Eponymous does?”</p><p>“Yes, just like her. 
She’s a great Composer.”</p><p>Laurie had never thought that <em>she</em> could do things like that herself. But Tinker seemed to think it was normal.</p><p>“So what do I do?”</p><p>“The algorithm I’m looking for is how to draw a circle,” Tinker said. “It’s a tough one, so you’ll have to use your imagination. I’ve asked all the adults and even Ponens and Tollens already, but all they do is mutter about X squared plus Y squared and never get anywhere.”</p><p>“Take a look at this.” He handed her a windup toy animal. It had a Shell, and was Round and Green. “This turtle can do three things: it can move forward or backward, it can turn, and it can draw a little dot on the paper.”</p><p>“Hey, that’s pretty neat!”</p><p>“Yes, but the thing is, it doesn’t know how to do anything else. That’s where the algorithm comes in.” Tinker took out a piece of paper and wrote what looked like a little poem:</p><blockquote><p>Go forward one inch,<br />make a mark,<br />repeat five times.</p></blockquote><p>Then he wound up the turtle and placed it on the poem. It went <em>zzzrbt bzzaap whuzzzsh</em>, and so on. Then it drew a line of dots, just like the poem said:</p><blockquote><p><img src="../../images/tinker/img9.jpg" alt="image" /> </p></blockquote><p>“You see?” Tinker said. “If you put little ideas together, you can make bigger ones. And you can compose <em>those</em> ideas into even bigger and bigger ones.”</p><p>“How do you do that?” asked Laurie.</p><p>“By giving them a name. You can use the name like a handle. Here, let’s call the first idea LINE. Then you can put four lines together to make a square:”</p>

<blockquote>LINE:<br />Go forward one inch,<br />make a mark,<br />repeat five times.<br /><br />SQUARE:<br />Make a LINE,<br />make a right turn,<br />repeat four times. <br /><br />Make a SQUARE.</blockquote>

<p>The little turtle <em>zzzrbt</em>ed and <em>whuzzzsh</em>ed and <em>bzzaap</em>ed, etc, then it drew this:</p><blockquote><p><img src="../../images/tinker/img10.jpg" alt="image" /> </p></blockquote><p>Laurie was amazed. It was like magic, but every step made sense.</p><p>“So, knowing what the turtle can do, can you teach it how to draw a circle?” Tinker said.</p><p>“I don’t know,” Laurie said, “but I want to try!”</p><p>“That’s good enough for me. Here, you can work at my desk. There is plenty of paper and compasses and things like that.”</p><p>Laurie sat down at Tinker’s desk. She doodled with the compass and played with the turtle for a while, trying to remember what she knew about circles.</p><p><em>A circle is round. No, not just round, perfectly round. You put the pin in the center, and the pencil spins round. To make a bigger one you open the compass; to make a smaller one you close the compass. If you change the width of the compass when it’s spinning, it doesn’t make a circle...</em></p><p>Suddenly an idea, or maybe a memory, popped into her head: <em>A circle is all of the points that are exactly the same distance from the center.</em> Hmm... what if you</p><blockquote><p>Go forward one inch,<br />make a mark,<br />go back one inch,<br />turn right a tiny bit,<br />then repeat!</p></blockquote><p>She wound up the little turtle again and placed it on her poem. It buzzed and burbled for a moment, then drew this:</p><blockquote><p><img src="../../images/tinker/img11.jpg" alt="image" /> </p></blockquote><p>“It’s working!” she said. “Hey, it’s not stopping.” The turtle was drawing over dots it had already drawn.</p><p>“I think it’s because you told it to repeat, but not how many times,” said Tinker.</p><p>“Well, it should stop when the circle is done,” Laurie said.</p><p>“It doesn’t really understand circles,” Tinker said. “It’s just a toy turtle, remember? 
You have to teach it.”</p><p>Laurie thought a little more, then rewrote her poem:</p><blockquote><p>CIRCLE:<br />Go forward one inch,<br />make a mark,<br />go back one inch,<br />turn right <em>one degree</em>,<br />repeat <em>three hundred sixty times</em>.</p></blockquote><p>Then she realized that she could make circles any size she wanted. It was just like opening the compass wider:</p><blockquote><p>TWO-CIRCLE:<br />Go forward <em>two inches</em>,<br />make a mark,<br />go back <em>two inches</em>,<br />turn right one degree,<br />repeat three hundred sixty times.</p></blockquote><p>“This is interesting. You’re working really hard!” Tinker scratched his head. “But as it is, it’s no good.”</p><p>“Why?”</p><p>“People want to make lots of different circles,” he said. “I’ll have to keep a lot of algorithms of different sizes, just in case someone wants three-and-nine-thirteenths inches or four-and-three quarters.”</p><p>“Well, what if you tell the turtle how big to make the circle?” she said. “Maybe like this:”</p><blockquote><p>ANY-CIRCLE (<em>how-big?</em>):<br />Go forward <em>how-big</em> inches,<br />make a mark,<br />go back <em>how-big</em> inches,<br />turn right one degree,<br />repeat three hundred sixty times.</p></blockquote><p>“And <em>then</em>,” she said, “instead of ONE-CIRCLE or TWO-CIRCLE you can say ANY-CIRCLE(one), or two, or even one-and-eleventy-sevenths!”</p><p>“Good idea, Laurie. That’s a lot simpler.” said Tinker. “I was worried you were going to fill my shop with circles!”</p><p>“You know, the turtle is drawing really slowly. Not like when it was doing the square,” she said.</p><p>It was true. The turtle would crawl all the way to the edge of the circle, then make a mark, then crawl all the way back to the center, three hundred sixty times. With small circles it wasn’t too bad, but big circles took a lot longer.</p><p>“Hmm.” Tinker said. “It spends a <em>lot</em> more time running back and forth than it does making marks. 
Do you think you can reduce the running time?”</p><p><em>It makes sense, but it isn’t sensible.</em> She thought &amp; doodled and doodled &amp; thought, but Laurie couldn’t figure out how to make it more sensible. The turtle <em>has</em> to go back to the center, right? How else could it know where the edge of the circle was?</p><p>Laurie let her eyes wander around the room. Xor was staring at a moth that was flying in lazy loops around a light bulb. His skin was slowly fading from red to yellow and back to red. The moth went round and round. It was hypnotic. Round and round and round and...</p><p><em>Oh.</em></p><p>She grabbed for a fresh piece of paper before the idea got away. <em>Don’t let a new thing out of your sight without a name.</em> <br /></p><blockquote><p>MOTH-CIRCLE (<em>how-big?</em>):<br />Go forward <em>how-big</em> inches,<br />make a mark,<br />turn right one degree,<br />repeat three hundred sixty times.</p></blockquote><p>The turtle went <em>bzzaap</em> and <em>zzzrbt</em> and <em>whuzzzsh</em> and then it started to draw. It moved one inch, made a dot, then turned a tiny bit, then moved one inch, then made another dot...</p><p>“Whoops. It’s making a <em>huge</em> circle! Let me try a small number.” She didn’t have a small number handy, so she borrowed one she had heard from Tortoise: one thirty-second of an inch.</p><blockquote><p><img src="../../images/tinker/img12.jpg" alt="image" /> </p></blockquote><p>“That’s better.” Laurie said.</p><p>“Let me see,” Tinker said. “Wow, look at the little guy run!”</p><p>“That was fun,” said Laurie. “I didn’t know you could just make up new ways to do things.”</p><p>“Of course you can. Often you aren’t the first to think of something, but if it works, who cares? Now, for my end of the trade.”</p><p>“Did you find the shortest path?” Laurie asked.</p><p>“Not exactly. The bad news is that what you are trying to do is impossible.”</p><p>“It’s impossible?”</p><p>“Well, highly improbable. 
There are many different ways to visit all the towns. It seems like you could write an algorithm for the turtle to try each one and find the shortest, right?”</p><p>“Sure, why not?” said Laurie.</p><p>“There are twenty-one towns in Userland. How many paths do you think there are?” Tinker said.</p><p>“I don’t know,” said Laurie. “A hundred?”</p><p>“Way more.”</p><p>“Um, a million?” Laurie said.</p><p>“More like a million million times that!” said Tinker.</p><p>“But how can that be?”</p><p>“Let’s say there are only three towns: A, B, and C,” Tinker said. “You are already standing in A, so you have to worry only about B and C. How many ways can you go?”</p><p>“Well,” she said, “I could go from B to C, or go to C then B. That’s two.”</p><p>“That’s right! But BC is the same as CB, just backward. Every path has a mirror image, so with three towns there is really only one possible path that visits them all. What if there were four towns, A, B, C, and D?”</p><p>Laurie counted on her fingers. “I could go BCD, or BDC, or CBD, or CDB, or DCB, or... DBC. Six! No, three.”</p><p>“That’s three times as many. Add another town, twelve times as many,” Tinker said. “Add a sixth town and there are <em>sixty</em> different paths through all of them. With seven towns there are <em>three hundred sixty paths</em>. 
As you add more towns, the number of paths gets very big!”</p><p>3 Towns: 2 <span class="math"> ÷ </span> 2 = 1<br />4 Towns: 2 <span class="math"> × </span> 3 <span class="math"> ÷ </span> 2 = 3<br />5 Towns: 2 <span class="math"> × </span> 3 <span class="math"> × </span> 4 <span class="math"> ÷ </span> 2 = 12<br />6 Towns: 2 <span class="math"> × </span> 3 <span class="math"> × </span> 4 <span class="math"> × </span> 5 <span class="math"> ÷ </span> 2 = 60<br />7 Towns: 2 <span class="math"> × </span> 3 <span class="math"> × </span> 4 <span class="math"> × </span> 5 <span class="math"> × </span> 6 <span class="math"> ÷ </span> 2 = 360<br />8 Towns: 2 <span class="math"> × </span> 3 <span class="math"> × </span> 4 <span class="math"> × </span> 5 <span class="math"> × </span> 6 <span class="math"> × </span> 7 <span class="math"> ÷ </span> 2 = 2,520<br />9 Towns: 2 <span class="math"> × </span> 3 <span class="math"> × </span> 4 <span class="math"> × </span> 5 <span class="math"> × </span> 6 <span class="math"> × </span> 7 <span class="math"> × </span> 8 <span class="math"> ÷ </span> 2 = 20,160</p><p>“For twenty-one towns you have to multiply one times two times three times four, all the way up to twenty. It makes a HUGENORMOUS number!”</p><p>2 <span class="math"> × </span> 3 <span class="math"> × </span> 4 <span class="math"> × </span> 5 <span class="math"> × </span> 6 <span class="math"> × </span> 7 <span class="math"> × </span> 8 <span class="math"> × </span> 9 <span class="math"> × </span> 10 <span class="math"> × </span> 11 <span class="math"> × </span> 12 <span class="math"> × </span> 13 <span class="math"> × </span> 14 <span class="math"> × </span> 15 <span class="math"> × </span> 16 <span class="math"> × </span> 17 <span class="math"> × </span> 18 <span class="math"> × </span> 19 <span class="math"> × </span> 20 <span class="math"> ÷ </span> 2 <strong> = 1,216,451,004,088,320,000</strong></p><p>“!” said Laurie.</p><p>“Indeed!” Tinker said. 
“All of that ‘one times two times three’ stuff takes too long to write. So you can use the exclamation point as a shorthand.”</p><p>20! <span class="math"> ÷ </span> 2 = 1,216,451,004,088,320,000</p><p>“But that’s...” Laurie said, counting the commas, “over one million million <em>million</em> paths!”</p><p>“One of those umpty-million paths is the shortest,” Tinker said. “I don’t know of any way to find it quickly.”</p><p>“I’ll be old before we check them all! Isn’t there a better way?”</p><p>“Ah, that’s the good news!” Tinker said. “I only deal in Exact answers. But there is a brilliant Composer who lives in Permute, named Hugh Rustic. He deals in Good Enough answers. I send him all of my hardest cases. I’ll write an IOU that you can take to him.”</p>
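
<p>Tinker's table can be checked with a few lines of code. The sketch below is not from the book: it assumes Python, and the names <code>count_paths</code> and <code>count_paths_brute_force</code> are made up for illustration. It follows Tinker's way of counting, where you are already standing in the first town and a path and its mirror image count as one.</p>

```python
from itertools import permutations
from math import factorial


def count_paths(towns):
    """Tinker's shortcut: (towns - 1)! divided by 2.

    You start in the first town, so only the other towns can be reordered,
    and every ordering has a mirror image that counts as the same path.
    """
    if 3 > towns:
        raise ValueError("this sketch assumes three or more towns")
    return factorial(towns - 1) // 2


def count_paths_brute_force(towns):
    """Check the shortcut by listing every ordering and merging mirror images."""
    others = range(1, towns)  # town 0 is where you are already standing
    seen = set()
    for order in permutations(others):
        # Keep one canonical direction so BCD and DCB land on the same key.
        seen.add(min(order, order[::-1]))
    return len(seen)


if __name__ == "__main__":
    for towns in range(3, 10):
        print(towns, "towns:", count_paths(towns))
    print("21 towns:", count_paths(21))
```

<p>For twenty-one towns the shortcut gives the same 1,216,451,004,088,320,000 as 20! ÷ 2. The brute-force version agrees with the table for small numbers of towns, then slows down very quickly, which is rather the point.</p>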

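<p>Laurie's MOTH-CIRCLE from earlier in the chapter can also be tried without a windup turtle. This is a rough sketch in Python, not code from the book; the function names and the idea of measuring the widest gap between marks are assumptions made for illustration. It suggests why a one-inch step made a <em>huge</em> circle: three hundred sixty one-inch steps add up to a loop about 360 inches around.</p>

```python
import math


def moth_circle(step, turns=360):
    """Trace MOTH-CIRCLE: go forward, make a mark, turn right a little, repeat."""
    x, y, heading = 0.0, 0.0, 0.0
    marks = []
    turn = 2 * math.pi / turns  # one degree when turns is 360
    for _ in range(turns):
        x += step * math.cos(heading)
        y += step * math.sin(heading)
        marks.append((x, y))
        heading -= turn  # turning right means turning clockwise
    return marks


def widest_gap(marks):
    """Largest distance between any two marks, i.e. how wide the circle is."""
    return max(math.dist(a, b) for a in marks for b in marks)


if __name__ == "__main__":
    for step in (1.0, 1.0 / 32):
        width = widest_gap(moth_circle(step))
        print("step", step, "inches gives a circle about", round(width, 1), "inches wide")
```

<p>With a one-inch step the circle comes out roughly 115 inches wide; with Tortoise's one thirty-second of an inch it is about 3.6 inches wide, small enough for a piece of paper.</p>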
<p>Like what you read? Lauren Ipsum is <a href="http://www.laurenipsum.org/buy">a children's story about computer science</a>. Buy a copy and help us translate it into Spanish, Portuguese, and other languages.</p>


         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2011/11/lauren-ipsum-tinker.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.706250ab19595601e1122bb092adecf9c749dac5</id>
         <published>2011-10-01T12:00:00.000-08:00</published>
         <updated>2011-10-01T12:00:00.000-08:00</updated>
         <title type="text">Lauren Ipsum and the Wandering Salesman</title>
         <content type="html"><![CDATA[
           

<p>
One of the first characters <a href="http://www.laurenipsum.org">Lauren Ipsum</a> meets is the <a href="http://www.laurenipsum.org/mostly-lost">Wandering Salesman</a>. His life is governed by two rules: he must a) visit every town before going home, but b) never visit the same place twice. This is a <a href="http://en.wikipedia.org/wiki/Travelling_salesman_problem">deceptively simple</a> problem. Finding a route which passes through all the towns is easy. Finding a short one is hard.

          <span class="sidenote">
             <a href="http://www.laurenipsum.org">
             <img src="../../images/ipsum.jpg" border="0" title="Lauren Ipsum" /><br/>
             A story about computer science <br/>and other improbable things.
             </a><br/>
              <iframe src="http://www.facebook.com/plugins/like.php?app_id=251397201557373&amp;href=http%3A%2F%2Fwww.facebook.com%2Flaurenipsum&amp;send=false&amp;layout=button_count&amp;width=90&amp;show_faces=true&amp;action=like&amp;colorscheme=light&amp;font&amp;height=21" scrolling="no" frameborder="0" style="border:none; overflow:hidden; width:90px; height:21px;" allowTransparency="true"></iframe>
          </span>
</p>

<p>The Wandering Salesman is a happy person who doesn't like to think too much. At every point he just goes to the nearest place he's never been to. This isn't a terrible way to do it. It's better than random but it's not terribly good. Below is an interactive game that shows how it works. Push the "Solve it!" button to see him run around. You can also move the cities around and push the button again to see how his path changes.
</p>

<p>You might notice that his path often crosses itself. This is a sign of a less-than-optimal solution. There are many (many) other algorithms to find more efficient paths. A reasonable optimization might be to look ahead, say, three steps, consider the lengths of all of the combinations, and choose the shortest.</p>
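<p>The Salesman's rule of thumb is usually called the nearest-neighbor heuristic. Here is a minimal sketch of it, not the code behind the game on this page. The function names and the <code>{x, y}</code> town format are assumptions made for illustration, and the first town in the list is treated as home.</p>

```javascript
// Straight-line distance between two towns given as {x, y} points.
function distance(a, b) {
  return Math.hypot(a.x - b.x, a.y - b.y);
}

// Greedy tour: from wherever he is, go to the nearest unvisited town.
// Assumes `towns` is non-empty and towns[0] is home.
function nearestNeighborTour(towns) {
  const unvisited = towns.slice(1);
  const tour = [towns[0]];
  while (unvisited.length > 0) {
    const here = tour[tour.length - 1];
    let best = 0;
    for (let i = 1; i < unvisited.length; i++) {
      if (distance(here, unvisited[i]) < distance(here, unvisited[best])) {
        best = i;
      }
    }
    // Move the chosen town from the unvisited list onto the tour.
    tour.push(unvisited.splice(best, 1)[0]);
  }
  return tour;
}

// Total length of the tour, including the trip back home at the end.
function tourLength(tour) {
  let total = 0;
  for (let i = 0; i < tour.length; i++) {
    total += distance(tour[i], tour[(i + 1) % tour.length]);
  }
  return total;
}
```

<p>Each step looks only at the current position, which is why the finished path can double back across itself.</p>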

<style media="screen" class="justice-denied">
.holder {
    height: 400px;
    width: 600px;
    border:solid 1px #bbb;
}
</style>
<script src="../../js/raphael-min.js" type="text/javascript"></script>
<script src="../../js/wandering.js"></script>
<input type="button" id="solve" value="Solve it!"/>
<input type="button" id="resetb" value="Reset"/>
<p id="wander" class="holder justice-denied">
  <span class="sidenote">
    <img src="../../images/wander.png" border="0" title="The Wandering Salesman" />
  </span>
</p>


<p>My favorite is the one used by a later character named Hugh Rustic, which depends on an <a href="http://en.wikipedia.org/wiki/Ant_colony_optimization_algorithms">army of ants</a> that walk randomly and communicate with each other using pheromones. I'm still learning these new-fangled graphics libraries but I'll write a post about that one soon.
</p>



         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2011/10/wandering.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.fc19d1f82bc37ef37207ff5d538e79576b37edc3</id>
         <published>2011-10-01T12:00:00.000-08:00</published>
         <updated>2011-10-01T12:00:00.000-08:00</updated>
         <title type="text">Lauren Ipsum and the Timing Attack</title>
         <content type="html"><![CDATA[
           

<p>Lauren Ipsum is trying to get past Jane Hecate, a little old lady who holds the Book of Passwords. Jane is older and can't see well, so she has to spell out Lauren's guess letter by letter to check whether it's the right password. How can Lauren use this to her advantage?

          <span class="sidenote">
             <a href="http://www.laurenipsum.org">
             <img src="../../images/ipsum.jpg" border="0" title="Lauren Ipsum" /><br/>
             A story about computer science <br/>and other improbable things.
             </a><br/>
              <iframe src="http://www.facebook.com/plugins/like.php?app_id=251397201557373&amp;href=http%3A%2F%2Fwww.facebook.com%2Flaurenipsum&amp;send=false&amp;layout=button_count&amp;width=90&amp;show_faces=true&amp;action=like&amp;colorscheme=light&amp;font&amp;height=21" scrolling="no" frameborder="0" style="border:none; overflow:hidden; width:90px; height:21px;" allowTransparency="true"></iframe>
          </span>

</p>

<p>Below is a little game that lets you try to beat Jane yourself. The secret password is picked at random from a list of 10,000 dictionary words, and the box will show you valid words to make it a bit easier. No peeking at the source code. :)</p>

<p>(I'll give you a hint. Try <span id="hint" style="font-weight:bold; text-transform:uppercase;">&#x2610;</span>.)</p>

<div class="justice-denied">
  <div style="font-size: 150%; color:#222; background:#aaa; width:382px; padding: 0.4em" id="jane">Can you guess the password?</div>

  <input style="font-size: 150%; width:400px;" name="guess" id="guess" type="text" size="40" />

  <div style="font-size: 150%; color:#222; background:#aaa; width:382px; padding: 0.4em" id="counter">&nbsp;</div>
</div>


<script src="http://www.laurenipsum.org/js/words.js"></script>
<script>
function myonload() {
$('#guess').ready(function() {
  // pick the secret word; `var` keeps it out of the global scope
  var password = words[Math.floor(Math.random()*words.length)];
  $('#hint').text(password.charAt(0));

  var guess = $('#guess');
  var jane = $('#jane');
  var currentPos = 0;
  var sofar = '';
  var charFn = null;
  var choice = null;
  var tries = 0;

  function spacer(s) {
    return s.split(new RegExp('')).join(' ');
  }

  function stop() {
    window.clearInterval(charFn);
    guess.val(sofar);
    $('#counter').text((tries+=1) + (tries == 1 ? " Try" : " Tries"));
  }

  function tryChar() {
    var c = choice.charAt(currentPos);
    var remaining = ' <span style="color:#666;">' +
      spacer(choice.substring(currentPos+1)) + '</span>';

    if (c != password.charAt(currentPos)) {
      jane.html(spacer(sofar) + ' <span style="background:#a00;">' +
        c + "</span>" + remaining);
      stop();
      guess.val(sofar+c);

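      // remove nonmatching branches from the list. If the guess simply ran
      // out of letters (c is ''), drop only that word, not the whole
      // branch, which would otherwise throw away the password too.
      var dead = sofar + c;
      words = words.filter(function(a){
        return c ? a.indexOf(dead) != 0 : a != dead;
      });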
      guess.flushCache().autocomplete(words);

    } else {
      sofar += c;
      jane.html(spacer(sofar) + remaining);
      currentPos += 1;

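      // only succeed when the whole guess has been checked; otherwise a
      // longer string that starts with the shorter password would
      // erroneously succeed, eg "erroneously" for "erroneous"
      if (currentPos == password.length && choice.length == password.length) {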
        jane.text(spacer(sofar) + "!"); //yay!
        stop();

        if (tries < 5) {
           $('#counter').text('Only ' + $('#counter').text() +
             "? You lucky barstool.");
        } else {
           $('#counter').text('You got it in ' + $('#counter').text() +
             "!");
        }
      }
    }
  }

  function onGuess(e, d, input) {
     choice = input;
     currentPos = 0;
     jane.text('').css({'color':'#fff'});
     sofar = '';
     jane.html('<span style="color:#666;">'+spacer(choice)+'</span>');
     charFn = window.setInterval(tryChar, 500);
  }

  guess
    .focus()
    .autocomplete(words)
    .result(onGuess);

});
};
</script>

<p>Spoilers below...</p>
<p>What you are doing is a <a href="http://en.wikipedia.org/wiki/Timing_attack">timing attack</a>. If checking a completely wrong password takes less time than checking a partly right one, you can use that information to guess the password, one letter at a time.


         <span class="sidenote">
             <img src="../../images/unfeeble.png" />
          </span>

</p>

<p>The attacker's strategy is pretty much the same as the one used in the <a href="http://en.wikipedia.org/wiki/Hangman_(game)">children's game hangman</a>. It's stuff like this that makes building secure systems very hard. Make just one mistake, and compromising your system becomes literal child's play.</p>

<p>To fix this particular information leak, Jane should take exactly the same amount of time to return her yes/no answer, regardless of how close the guess is to the password. For example, she can simply keep checking letters even if they are incorrect. Or she could use a stopwatch and always wait to give her answer until 30 seconds have passed. Checking passwords is one of the few cases where an algorithm must be made <i>slower</i> to work correctly.

</p>
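<p>Here is a minimal sketch of the difference, with function names made up for illustration. The first function counts how many letters Jane looks at before she stops, which is the information that leaks. The second looks at every letter no matter what. JavaScript engines make no promises about exact timing, so treat this as a picture of the idea rather than production-grade security code.</p>

```javascript
// Jane's current method: stop at the first wrong letter.
// Returns how many letters she looked at, which is what leaks.
function lettersCheckedEarlyExit(guess, password) {
  let checked = 0;
  for (let i = 0; i < guess.length; i++) {
    checked += 1;
    if (guess.charAt(i) !== password.charAt(i)) break;
  }
  return checked;
}

// The fix: look at every letter of the password and only remember
// whether anything differed, so the work done does not depend on how
// close the guess is.
function sameEveryTime(guess, password) {
  let mismatch = guess.length ^ password.length;
  for (let i = 0; i < password.length; i++) {
    // charCodeAt past the end of the guess gives NaN, which ^ treats as 0.
    mismatch |= guess.charCodeAt(i) ^ password.charCodeAt(i);
  }
  return mismatch === 0;
}
```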

<p>Like what you read? Lauren Ipsum is <a href="http://www.laurenipsum.org">a children's story about computer science</a>. Buy a copy and help us translate it into Spanish, Portuguese, and other languages.</p>


         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2011/10/timing.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.c4d6213c9c0321801bdfe029e07e3576f8e09ce3</id>
         <published>2011-10-01T12:00:00.000-08:00</published>
         <updated>2011-10-01T12:00:00.000-08:00</updated>
         <title type="text">Lauren Ipsum and the Unfair Coin</title>
         <content type="html"><![CDATA[
           


<p>Fair Coins are the currency of Userland. A Fair Coin flips fifty-fifty with mathematical exactness. In the real world most coins are pretty fair, but you never know whether there might be a small bias to one side or another. This presents <a href="http://www.laurenipsum.org">Lauren Ipsum</a>, a girl who's lost in Userland, with a problem. No one wants her "Quarter Dollars". They might be biased. How can she negotiate an exchange rate?


          <span class="sidenote">
             <a href="http://www.laurenipsum.org">
             <img src="../../images/ipsum.jpg" border="0" title="Lauren Ipsum" /><br/>
             A story about computer science <br/>and other improbable things.
             </a><br/>
              <iframe src="http://www.facebook.com/plugins/like.php?app_id=251397201557373&amp;href=http%3A%2F%2Fwww.facebook.com%2Flaurenipsum&amp;send=false&amp;layout=button_count&amp;width=90&amp;show_faces=true&amp;action=like&amp;colorscheme=light&amp;font&amp;height=21" scrolling="no" frameborder="0" style="border:none; overflow:hidden; width:90px; height:21px;" allowTransparency="true"></iframe>
          </span>
</p>

<p>I <s>sometimes</s> no longer ask this in job interviews, to see how candidates deal with probability. The answer relies on a simple but powerful idea: as long as the bias is constant, you can use it against itself. The same idea is used in bimetallic springs in mechanical clocks. The two metals expand at different rates, and those rates offset each other to keep the overall tension constant in the face of changing temperature.</p>

<p>Here are the four possible outcomes of two flips of a coin. You can <b>click and drag the bias</b> to see how the probabilities change. No matter the bias, the odds of getting Heads then Tails are always <b>exactly</b> the same as the odds of getting Tails then Heads.</p>


    <!-- Tangle -->
    <script type="text/javascript" src="../../js/Tangle/Tangle.js"></script>

    <!-- TangleKit (optional) -->
    <link rel="stylesheet" href="../../js/Tangle/TangleKit/TangleKit.css" type="text/css">
    <script type="text/javascript" src="../../js/Tangle/TangleKit/mootools.js"></script>
    <script type="text/javascript" src="../../js/Tangle/TangleKit/sprintf.js"></script>
    <script type="text/javascript" src="../../js/Tangle/TangleKit/BVTouchable.js"></script>
    <script type="text/javascript" src="../../js/Tangle/TangleKit/TangleKit.js"></script>
    <script type="text/javascript">

        function setUpTangle () {

            var element = document.getElementById("coins");

            Tangle.formats.percent = function (value) {     // formats 0.42 as "42%"
                return "" + Math.round(value * 100) + "%";
            };

            var tangle = new Tangle(element, {
                initialize: function () {
                    this.bias = 70;
                },
                update: function () {
                    var h = this.h_odds = this.bias/100;
                    var t = this.t_odds = (100-this.bias)/100;
                    this.hh_odds = h*h;
                    this.ht_odds = h*t;
                    this.th_odds = t*h;
                    this.tt_odds = t*t;

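                    // Each round costs two flips and succeeds (HT or TH)
                    // with probability 2*h*t, so the expected number of
                    // flips per Fair flip is 2 / (2*h*t) = 1 / (h*t).
                    this.efficiency = 1 / (h*t);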
                }
            });
        }

        window.onload=setUpTangle;

    </script>
    <style type="text/css">
      #coins * {
        font-size: 16pt;
      }
      .num {
        font-weight: bold;
      }
      .nums {
        font-family: helvetica;
      }
    </style>


    <blockquote id="coins">
        Suppose Heads comes up <span data-var="bias" class="TKAdjustableNumber" data-min="1" data-max="99">%</span> of the time.<br/>
        <table>
          <tr>
            <td>Heads</td>
            <td>&times;</td>
            <td>Heads</td>
            <td>=</td>
            <td><span class="num" data-var="hh_odds" data-format="percent"></span></td>
          </tr>
          <tr>
            <td>Heads</td>
            <td>&times;</td>
            <td>Tails</td>
            <td>=</td>
            <td><span style="color:#900" class="num" data-var="ht_odds" data-format="percent"></span></td>
          </tr>
          <tr>
            <td>Tails</td>
            <td>&times;</td>
            <td>Heads</td>
            <td>=</td>
            <td><span style="color:#900" class="num" data-var="th_odds" data-format="percent"></span></td>
          </tr>
          <tr>
            <td>Tails</td>
            <td>&times;</td>
            <td>Tails</td>
            <td>=</td>
            <td><span class="num" data-var="tt_odds" data-format="percent"></span></td>
          </tr>
        </table>

       You'll need an average of <span class="num" data-var="efficiency" data-format="%.2f"></span> flips to get a Fair flip.
    </blockquote>

<p>The answer is to flip twice. If you get HH or TT, throw the result away and start over. If you get HT or TH, return the first flip as your result. This algorithm almost magically transmutes an unfair coin into a perfect, though inefficient, Fair Coin. This algorithm was <a href="http://en.wikipedia.org/wiki/Von_Neumann_extractor#Von_Neumann_extractor">first described</a> by John von Neumann in 1951, and it's very likely happening inside your computer right now.</p>
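<p>A minimal sketch of the trick follows. The names <code>fairFlip</code> and <code>biasedCoin</code> are invented for illustration, and the coin is assumed to have a constant bias that is neither 0% nor 100%, otherwise the loop never ends.</p>

```javascript
// `flip` is any function that returns 'H' or 'T' with a constant bias.
// Flip twice; if the two results differ, keep the first one.
// If they match (HH or TT), throw them away and start over.
function fairFlip(flip) {
  for (;;) {
    const first = flip();
    const second = flip();
    if (first !== second) return first;
  }
}

// A biased coin to try it with. `headsOdds` is the chance of Heads,
// and `random` can be swapped out to make the coin predictable.
function biasedCoin(headsOdds, random = Math.random) {
  return () => (random() < headsOdds ? 'H' : 'T');
}

// Usage: a coin that lands Heads 70% of the time, made Fair.
const unfair = biasedCoin(0.7);
console.log(fairFlip(unfair));
```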

<p>Like what you read? Lauren Ipsum is <a href="http://www.laurenipsum.org">a children's story about computer science</a>. Buy a copy and help us translate it into Spanish, Portuguese, and other languages.</p>


         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2011/10/fair-coin.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.364ee5a42bfc5f7ba9ba944a47aac36b4448b58e</id>
         <published>2011-09-01T12:00:00.000-08:00</published>
         <updated>2011-09-01T12:00:00.000-08:00</updated>
         <title type="text">Computer Science for Kids</title>
         <content type="html"><![CDATA[
           

<p>There's a common belief that you need a special kind of mind to understand computer science. I think it persists because it reassures two kinds of people: 1) programmers and 2) non-programmers.
          <span class="sidenote">
             <a href="http://www.laurenipsum.org">
             <img src="../../images/ipsum.jpg" border="0" title="Lauren Ipsum" /><br/>
             A story about computer science <br/>and other improbable things.
             </a><br/>
              <iframe src="http://www.facebook.com/plugins/like.php?app_id=251397201557373&amp;href=http%3A%2F%2Fwww.facebook.com%2Flaurenipsum&amp;send=false&amp;layout=button_count&amp;width=90&amp;show_faces=true&amp;action=like&amp;colorscheme=light&amp;font&amp;height=21" scrolling="no" frameborder="0" style="border:none; overflow:hidden; width:90px; height:21px;" allowTransparency="true"></iframe>
          </span>


</p>

<p>Ask any programmer and they will probably tell a story like this: "My {father, boyfriend, wife} is a very smart {lawyer, writer, geologist} and I tried to teach them programming, and it was a disaster. Most people just don't get it."</p>

<p>If you happen to get programming, you can glory in the feeling that you have a gift for understanding complexity. Otherwise, you can tell yourself that you just don't have the gift. It's no one's fault. It's how things are. You'll never be a musician or a ballerina either.</p>

<p>I don't buy that. It's like saying that anyone can learn to read but most people just don't get writing. There are too many useful ideas stuffed into the grab-bag we call "computer science" to keep locked away. More to the point: if so many smart, motivated people can't seem to wrap their heads around computer science, you have to wonder whether there is a problem with computer science.</p>

<p>Programming is hard. Logic is hard. And that's fine. But if it remains hard, if we leave this world just as complex as we found it, have we made real progress?</p>

<p>To the Romans, multiplying MCMIIX and LXI was a job for experts. Exponents were literally unthinkable. Today we expect better of six-year-olds. It's not that we're so much smarter. We discovered a better way to think about numbers. Is there a better way to think about programming?</p>

<p>To find out, a couple of years ago I started writing a children's story about programming. The goal was to find the bits I understood well enough to teach to a nine-year-old. After a few months I realized that I didn't want to teach programming as such. Learning how to program is a terribly frustrating experience.</p>

<p>When I was nine my school had a room full of donated Apple IIs. I remember a few things about it: the seemingly pointless exercise of "formatting" our disks, a boring lecture about search and replace, a silly song about the home row keys. Most of all I remember playing with <a href="http://en.wikipedia.org/wiki/Logo_(programming_language)">Logo</a>. Once the incomprehensible rituals of setting up the computer were satisfied, we got to play with a little turtle that could do anything --- provided you could figure out how to teach it.</p>

<p>I now know why all those rituals were necessary. What galls me is that, twenty-plus years later, pointless rituals haven't gone away. Most of what I've stuffed my brain with since then are the random things that must be done before real work begins.</p>

<p>The rest... the rest is the good stuff. They are both facts and habits of mind, and the best way to learn them is through play and discovery. The idea that most of what we call programming is just paperwork is not at all original to me. Alan Kay has been <a href="http://www.youtube.com/watch?v=Ud8WRAdihPg">pounding this drum</a> for longer than I've been alive. Seymour Papert invented Logo for precisely this reason.</p>

<p>Writing <a href="http://www.laurenipsum.org">Lauren Ipsum</a> has become a kind of Logo game: learning through learning how to teach. I've had to go back to the basics, to try to understand the why and how of things. I've learned that all integers are just zero and one in disguise. I finally learned that zero is an even number, and how to get perfectly fair flips out of any biased coin. Lewis Carroll's work is more subtle and profound than I ever imagined: "<a href="http://www.ditext.com/carroll/tortoise.html">What the Tortoise Said To Achilles</a>" is actually a proof of infinite regress buried in the foundations of logic itself.</p>

<p>In the coming months I will <a href="http://www.facebook.com/laurenipsum">post more about the book</a> and the ideas behind it as we get ready to publish. I hope you like it.</p>


<p class="figure">
  <img src="../../images/elegants.png" />
</p>



         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2011/09/ipsum.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.966ff32487e1500dae0172279129337595ea5cce</id>
         <published>2011-07-01T12:00:00.000-08:00</published>
         <updated>2011-07-01T12:00:00.000-08:00</updated>
         <title type="text">Doppler: Internet Radar</title>
         <content type="html"><![CDATA[
           

<p>The basic strategy for all performance and optimization work is the delicious measurement sandwich: measure, change something, then measure again. Detailed network measurements are especially hard to do because we only control one side of the transaction, our own servers. So we design network experiments that are lightweight, continuous, and gather as many samples as possible, even at the expense of detail and accuracy. A billion data points can cover a lot of methodological sins.

  <span class="sidenote">
    (<i>Originally on <a href="http://www.facebook.com/note.php?id=10150212498738920">Facebook's Engineering Blog</a>.</i>)
    <br/><br/><img src="../../images/radar.jpg" />
    <br/><a href="http://www.flickr.com/photos/airfuel/2191010369/">flic.kr/airfuel</a>
  </span>

</p>

<p>Doppler is an internal project for mapping & measuring the network between users and Facebook. The data it collects has many uses, and it’s a good example of how we gather performance statistics at scale. [0]</p>

<p>Most Facebook users live thousands of miles away from our servers, so it’s useful to have a good idea about how their network path to us is behaving. If some DNS provider is not doing well in Europe, we want to know that. If some segment of users have low bandwidth or high latency, optimizing for network bytes rather than server CPU is a better use of our time.</p>

<h3>Pinging the Vasty Deep</h3>

<p>The Doppler probe is a 100-line JavaScript snippet we append to a small random sample of Facebook page hits. The script initiates HTTP requests to some tiny images on our servers and times how long it takes to load them. We perform five to fifty million of these experiments every day. [1]</p>

<p>The first thing we started tracking was end-to-end DNS latency. To do this we set up a <a href="http://en.wikipedia.org/wiki/Wildcard_DNS_record">wildcard domain</a> that answers to any hostname ending in dns.facebook.com. So you can hit, say, <a href="http://xyzzy.dns.facebook.com/favicon.ico">http://xyzzy.dns.facebook.com/favicon.ico</a> and you’ll get an answer back.</p>

<p>Each time the Doppler JavaScript executes, it generates a random hostname prefix, constructs a unique URL, and times how long it takes to load it. Because the hostname and URL are unique, no caches exist anywhere and this experiment (A) encompasses a full end-to-end DNS lookup, then a TCP handshake, and then a minimal HTTP transaction. We then run experiment B, loading a second tiny image from the same hostname [2]. Since the DNS lookup already happened and has been cached by the user’s browser and operating system, experiment B is only a TCP connect and HTTP request.</p>

<p>Subtract B from A, and you get a measurement of the uncached, worst-case, end-to-end DNS latency of the given user. Early on we realized that measurement B by itself is roughly two full round trips over the network [3]. So we divide B by two and use that to track packet RTT between different points on the planet and our data centers. These measurements are not as accurate as you’d get from custom software running on both ends of the transaction, but they work well enough.</p>
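<p>The arithmetic above can be sketched in a few lines. This is not the Doppler source. The function names, the result field names, and the way the random prefix is built are all assumptions made for illustration, and <code>timeImage</code> only works in a browser.</p>

```javascript
// Turn the two timings into the two numbers described above.
// timeA: ms to load an image from a never-seen hostname (DNS + TCP + HTTP).
// timeB: ms to load a second image from the same hostname (TCP + HTTP).
function summarize(timeA, timeB) {
  return {
    dnsMs: timeA - timeB, // uncached end-to-end DNS latency
    rttMs: timeB / 2      // B is roughly two round trips
  };
}

// Build a unique hostname under a wildcard domain (hypothetical scheme).
function randomHostname(suffix, random = Math.random) {
  const prefix = 'p' + Math.floor(random() * 1e12).toString(36);
  return prefix + '.' + suffix;
}

// Browser-only: time how long an image takes to load, in milliseconds.
// Failed loads are simply ignored in this sketch.
function timeImage(url, done) {
  const start = Date.now();
  const img = new Image();
  img.onload = () => done(Date.now() - start);
  img.src = url;
}
```

<p>In a page you would call <code>timeImage</code> twice in sequence against the same random hostname, then pass the two results to <code>summarize</code>.</p>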

<p>The Doppler JavaScript sends these measurements back to our servers via an AJAX request. On the server side we record other bits of metadata like the server name, user IP address, data center, timestamp, etc. Over the last year we've collected billions of network measurements along a half-dozen interesting dimensions. [4]</p>

<p>By <a href="http://en.wikipedia.org/wiki/Geolocation">geolocating</a> the user's IP address and rolling up the records by country and data center, we get a measure of the network RTT and DNS latency from every country to every data center. We use those rollups to <a href="http://www.facebook.com/note.php?note_id=408327833919">draw pretty graphs</a>. When something big happens to the Internet (e.g., cable breaks), it manifests as jumps or dips in the graphs.</p>

<p>Another use of the country-level data is a "pessimizing proxy" called Netlab. From Palo Alto the site loads pretty fast, so our engineers don't feel what real users feel. With Netlab, we can simulate the Facebook experience of a user in other countries, using real-world numbers for packet latency and bandwidth. [5]</p>

<p>Country data is good for a high-level view. But political borders don't really match up with the real borders of the Internet. Countries can have many ISPs, and ISPs can span many countries. So we also annotate Doppler logs with the <a href="http://en.wikipedia.org/wiki/Autonomous_system_(Internet)">autonomous system number</a> associated with the user's IP address. An autonomous system is kind of like an independent city-state on the Internet. There are about 35,000 of them, and they represent large networks of computers that are owned or controlled by the same entity. <a href="http://bgp.he.net/AS32934">Facebook</a> is an autonomous system, as are Google, large ISPs, carriers, and Harvard University. A packet may pass through many city-states on the way from here to there. If there is a network problem that throws country-level graphs way off, we use this dimension to narrow it down to a specific ISP or network.</p>

<p>We also use Doppler for evaluating <a href="http://code.google.com/speed/protocols/tcpm-IW10.html">TCP optimizations</a> and new network hardware. We take a cluster of servers, apply the configuration or hardware, and watch the graphs to see if anything changes. For this we added two more experiments that load images of specific file sizes to measure our users' effective bandwidth.</p>

<p>Below is a graph of image download times from two of our <a href="http://en.wikipedia.org/wiki/Virtual_IP_address">VIPs</a>. The blue line is the control, with no configuration changes. The yellow line is the experiment, which shows a big improvement:</p>

<p class="figure">
  <img src="../../images/doppler-experiment.png" />
</p>

<p>This chart is also an example of why continuous measurements are a good idea. The configuration change was applied a couple of weeks before I heard about it. We could go back and see what effect it had.</p>

<h3>Future Work</h3>

<p>Recently we added support for mobile users. The measurement probe is written in JavaScript so it works fine on smartphones and a subset of feature phones. Mobile performance in general is in dire need of solid network data, and this is the simplest way to collect it at scale. For the same reason we'll soon add experiments for SSL and IPv6.</p>

<p>At the moment we are looking at using Doppler data for DNS-based global server load balancing. The basic idea behind global load balancing is that you have one domain name, e.g., www.facebook.com, but many data centers scattered around the world. The DNS load balancer is responsible for returning the address of the data center it thinks is physically closest to the user. The problem is that your DNS server doesn't talk directly to the user. It talks to intermediary resolvers that query on the user's behalf. So current systems return the data center physically closest to the resolver. This takes on faith that physical distance is a good substitute for Internet distance, and that the user is reasonably close to the resolver. When either of those assumptions is wrong, we're giving out the wrong answers.</p>

<p>Every large website has this problem. A while ago Google <a href="http://googlecode.blogspot.com/2010/01/proposal-to-extend-dns-protocol.html">proposed an extension</a> to the DNS protocol to include part of the user's IP address in the DNS lookup packet. Having a general idea of the ultimate user’s address would allow DNS load balancers to give out more intelligent answers. But convincing the whole world to adopt your protocol is hard. Fortunately, there is a cheap and cheerful way to do almost the same thing.</p>

<p>Remember that unique hostname, xyzzy? That can also be thought of as a unique ID for a Doppler measurement. Doppler actually results in two server-side logs. The first is the one we've been talking about, which contains the measurement ID, user IP, data center, and various measurements. The other comes from our DNS server. Normally, people don't log DNS traffic but it's easy enough to do. This second log contains the experiment’s ID paired with the resolver's IP address.</p>

<p>If you were to join those logs together on the ID, then roll up by data center and resolver IP, you get a direct measurement of Internet latency between your users and your data centers, but indexed by resolver. And that's what we're building right now: an alternate map of the world based not on geographic distance, but on pure network latency.</p>
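<p>As a rough sketch of that join and rollup, assuming each log has already been parsed into an array of objects: the field names (<code>id</code>, <code>dataCenter</code>, <code>rttMs</code>, <code>resolverIp</code>) and the choice of the median as the summary are assumptions for illustration, not a description of the real pipeline.</p>

```javascript
// Middle value of a list of numbers (average of the two middle values
// when the list has an even length).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Join the measurement log with the DNS log on the measurement ID,
// then roll up by data center and resolver IP.
function latencyByResolver(probeLog, dnsLog) {
  const resolverById = new Map(dnsLog.map(r => [r.id, r.resolverIp]));
  const groups = new Map();
  for (const row of probeLog) {
    const resolverIp = resolverById.get(row.id);
    if (resolverIp === undefined) continue; // no matching DNS record
    const key = row.dataCenter + ' ' + resolverIp;
    if (!groups.has(key)) {
      groups.set(key, { dataCenter: row.dataCenter, resolverIp, samples: [] });
    }
    groups.get(key).samples.push(row.rttMs);
  }
  return [...groups.values()].map(g => ({
    dataCenter: g.dataCenter,
    resolverIp: g.resolverIp,
    count: g.samples.length,
    medianRttMs: median(g.samples)
  }));
}
```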

<p>There are probably many more interesting things to do with this kind of data. If you run a website with a decent amount of traffic you can do stuff like this yourself. The important thing when designing your experiment is to keep it continuous, lightweight, and as broad as possible.</p>

<p><i>Many thanks to Eric Sung, Daniel Peek, David Recordon, Bill Fumerola, and Adam Lazur for their help with this article. Special thanks to Paddy Ganti and Hrishikesh Bhattacharya for their work on Doppler.</i></p>


<h3>Notes</h3>

<p>[0] People have been mapping & measuring the internet since virtually the beginning. One of the <a href="http://en.wikipedia.org/wiki/Morris_worm">earliest Internet mappers</a> was created by Robert Morris in 1988. It exploited software bugs in other systems to self-replicate and report back what it saw. His program itself had a bug, which inadvertently crashed the Internet and introduced the term "computer virus" to popular culture.</p>

<p>A modern (non-destructive) longitudinal study of the Internet is run by <a href="http://www.caida.org/">CAIDA</a>, the Cooperative Association for Internet Data Analysis. They have been collecting data since 1997.</p>

<p>[1] Check out Yahoo's open source <a href="http://yahoo.github.com/boomerang/doc/">Boomerang measurement library</a>. It includes plugins to measure DNS latency, network latency, bandwidth, and other cool stuff.</p>

<p>[2] It doesn't actually hit /favicon.ico; it hits an endpoint that returns all of the proper <a href="http://www.mnot.net/cache_docs/">no-cache voodoo</a> and closes the connection. Also, we use multiple domains to test SSL, DNSSEC, IPv6, etc.</p>

<p>[3] There are six packets involved in a minimal TCP handshake and HTTP request. As far as the client is concerned packet 6 is not part of the timing, and packets 3 & 4 are sent out at the same instant (sometimes as the same packet). So, if you squint hard enough, the time between the sending of packet 1 and the receipt of packet 5 (i.e., the time between setting the Image object's src attribute and its onload event) is equivalent to two network round-trips.</p>

<ol>
<li>client SYN</li>
<li>server SYN + ACK</li>
<li>client ACK</li>
<li>client HTTP request</li>
<li>server HTTP response + FIN</li>
<li>client FIN + ACK</li>
</ol>

<p>[4] Please note that we average thousands of samples together. Doppler is about gross Internet statistics, not per-user data.</p>

<p>[5] This is done by introducing packet delays and loss at the TCP level. You can do this yourself with <a href="http://www.charlesproxy.com">Charles Proxy</a> or <a href="http://info.iet.unipi.it/~luigi/dummynet/">dummynet</a>. I highly recommend testing your software under high latency and narrow bandwidth.</p>




         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2011/07/doppler.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.660a6eaa4a89f45344cac7b05551536d5bbdb07c</id>
         <published>2011-01-01T12:00:00.000-08:00</published>
         <updated>2011-01-01T12:00:00.000-08:00</updated>
         <title type="text">What the Tortoise Said to Laurie</title>
         <content type="html"><![CDATA[
           

<p>Laurie took a left turn at the sign marked &#8220;Recursion Junction&#8221;. After
cresting a little hill she ended up at... <a href="http://www.google.com/search?q=recursion" rel="nofollow">Recursion</a> Junction!

  <span class="sidenote">
    (Excerpted from <i>Lauren Ipsum</i>, a book about computer science, programming, and other strange fantasies.)
    <br/><br/><img src="../../images/mile-zero-2.png" title="Route 1, Mile 0"/>
  </span>
</p>

<p>&#8220;Is this the same place?&#8221; Laurie asked herself. &#8220;It certainly looks the
same.&#8221;</p>

<p>She tried a right turn, but after a short while she was back where she started.
When she tried a second time, and a third, and a twenty-seventh time,
she always came back to Recursion Junction.</p>

<p>&#8220;Every time I take a turn, it seems as though I am going somewhere
else, but I always end up in the same place. What’s going on?&#8221;</p>

<p>She went round </p>

<p>...and round</p>

<p>...and round</p>

<p>...and round so many times that Laurie lost count. Just as she was
about to give up, the next turn round put her on a different road.</p>

<p>The road was neat and straight, and seemed to stretch on forever.
Ahead of her, a man in a Greek helmet was sitting on a large green
round animal with a shell. They were moving slowly and steadily away.</p>

<p>&#8220;Hey! Wait!&#8221; Laurie shouted, running up to the pair.</p>

<p>&#8220;Ah, at last someone has caught up to us!&#8221; the animal said. &#8220;I thought
it was impossible.&#8221;</p>


<p>&#8220;Don’t start <a href="http://iep.utm.edu/zeno-par/" rel="nofollow">THAT</a> old argument again!&#8221; said the man.</p>

<p>&#8220;Hello, I am <a href="http://ditext.com/carroll/tortoise.html" rel="nofollow">Tortoise</a>, a humble tortoise,&#8221; said the animal. &#8220;And this is my esteemed companion, Achilles the Logician.&#8221;</p>

<p>&#8220;<i>At</i> your service, miss!&#8221; said Achilles, bowing grandly from his perch
atop the Tortoise.</p>

<p>&#8220;Um, hello. My name is Lauren Ipsum.&#8221; Laurie
attempted a curtsey. </p>

<p>&#8220;&#8216;Lauren Ipsum&#8217;. That’s quite a <a href="http://en.wikipedia.org/wiki/Lorem_ipsum" rel="nofollow">GENERIC</a> name, isn’t it?&#8221; observed Achilles. </p>

<p>&#8220;Never mind that. How did you get here?&#8221; asked Tortoise.</p>

<p>&#8220;I don’t really know,&#8221; Laurie said. &#8220;I was following the path to
Symbol but I got turned around at Recursion Junction.&#8221;</p>

<p>&#8220;That often happens. You spent a fair amount of time chasing your
tail, I imagine.&#8221; </p>

<p>&#8220;But I don’t have a tail!&#8221;</p>

<p>&#8220;Got away from you, did it? Well, it should turn up again,&#8221; said
Tortoise. &#8220;Or perhaps it was optimized <a href="http://en.wikipedia.org/wiki/Tail_call" rel="nofollow">away</a>. But seeing as MOST of you
is present, perhaps you can help us resolve a discussion.&#8221;</p>

<p>&#8220;Well, I can try.&#8221;</p>

<p>&#8220;Splendid. The question I was posing to my dear friend Achilles is
this: How long is an infinite piece of string?&#8221;</p>

<p>&#8220;An infinite string? Infinite means it’s really really really really
really REALLY long. Really,&#8221; said Laurie.</p>

<p>&#8220;Ah! You agree with ME,&#8221; Achilles said, &#8220;and so the burden of proof
must be borne by the other side.&#8221;</p>

<p>&#8220;The burden of Achilles on my BACK is more than enough!&#8221; said Tortoise.</p>

<p>&#8220;My colleague the Tortoise is wise in many matters,&#8221; Achilles
explained. &#8220;But he is clearly wrong this time. He maintains that an
infinite string can be less than TWO INCHES long!&#8221;</p>

<p>&#8220;But how can an infinite string be two inches long?&#8221; Laurie asked.</p>


<p>&#8220;His claim sounds preposterous and indiscrete,&#8221; said Achilles. &#8220;We are
in continuous disagreement.&#8221;</p>

<p>&#8220;I never disagree,&#8221; said Tortoise. &#8220;I only discuss, especially with a
formidable intellect such as yours, Achilles.&#8221; Achilles preened at the
Tortoise’s praise. &#8220;Allow me to suggest a way to settle the matter.&#8221;</p>

<p>&#8220;Please, suggest away,&#8221; said Achilles.</p>

<p>&#8220;Let us build —hypothetically, of course— an infinite piece of string,
and then measure it. Laurie can be our impartial judge.&#8221;</p>

<p>&#8220;I accept. Experiment always beats Theory,&#8221; Achilles said. &#8220;And an
impartial judge sounds wonderful, especially if she already agrees
with me!&#8221;</p>

<p>&#8220;Excellent,&#8221; said Tortoise. &#8220;Let us begin. If you had an infinite
number of pieces of string, and laid them end-to-end, would that be
infinitely long? Hypothetically?&#8221;</p>

<p>&#8220;Yes, it must be,&#8221; said Laurie.</p>

<p>&#8220;No matter how long or short each individual piece is?&#8221; asked Tortoise.</p>

<p>&#8220;Surely,&#8221; said Achilles. &#8220;Infinity is infinity.&#8221;</p>

<p>&#8220;I wonder. Suppose we start with a piece of string ONE inch long,&#8221;
Tortoise said. &#8220;Then add a second piece of string that is ONE-HALF
inch long. How long are they together?&#8221;</p>

<p>&#8220;One-and-a-half inches,&#8221; Laurie said. </p>

<p>&#8220;And that is shorter than two
inches?&#8221; Tortoise asked. </p>

<p>&#8220;One-half inch shorter. Unmistakably,&#8221;
Achilles answered. </p>

<p>&#8220;Laurie?&#8221; </p>

<p>&#8220;That sounds right.&#8221; </p>

<p>&#8220;We all agree thus
far,&#8221; said Tortoise. &#8220;Perhaps we’ll converge on the same conclusion.&#8221;</p>

<p>
&#8220;I doubt that!&#8221; said Achilles. Laurie wasn’t sure where Tortoise was
going, but she doubted too.</p>

<p>&#8220;Achilles, would you please keep count of our hypothetical string? I
want to add a third piece ONE-QUARTER of an inch long,&#8221; said
Tortoise. &#8220;Is our string now one-and-three-quarters inches long?&#8221;</p>

<p>Achilles retrieved a much-used notebook from under his helmet and
scribbled some figures. &#8220;It seems so,&#8221; he said.</p>

<p>&#8220;With one-quarter inch to spare?&#8221; asked Tortoise.</p>

<p>Scribble. &#8220;Yes, only one-quarter inch! You are a finger’s-width away
from defeat!&#8221;</p>

<p>&#8220;Add an EIGHTH-inch piece,&#8221; Tortoise continued. &#8220;Do I still have some
space left over?&#8221;</p>

<p>&#8220;Yes, but I’ll have beaten you soon!&#8221; Achilles crowed. &#8220;Your string is
only an eighth-inch away from the limit, and you’ve only done FOUR
pieces!&#8221;</p>

<p>&#8220;You may well prove right, Achilles, but honor demands that we
continue until the bitter end,&#8221; said Tortoise.</p>

<p>&#8220;It won’t be long,&#8221; Achilles said graciously. &#8220;What is your next move?&#8221;</p>

<p>&#8220;I would like to add another piece of string, but this time
one-sixteenth inches long.&#8221;</p>

<p>&#8220;Done!&#8221; Scribble. &#8220;You have only one-sixteenth inch left, old friend!&#8221;</p>

<p>&#8220;My word,&#8221; said Tortoise. &#8220;I should be careful with my allotted space.
For the next one, I would like to add a piece of string
one-thirty-second inches long.&#8221;</p>

<p>&#8220;As you wish, poor Tortoise, one-thirty-second inch added. Only
one-thirty-second inch remaining, and an infinity of strings to go!
There will be PLENTY of rope left over to hang yourself with!&#8221; said
Achilles.</p>

<p>&#8220;Put on a sixty-fourth inch piece, a one-hundred-and-twenty-eighth
inch piece, then a two-hundred-and-fifty-sixth inch piece, and a
five-hundred-and-twelfth inch piece of string,&#8221; said Tortoise.</p>

<p>&#8220;Hold on, those are very big —no, SMALL— numbers!&#8221; Achilles figured
and scribbled for a minute. &#8220;Ah! There is only a
five-hundred-and-twelfth inch remaining! It’s too bad we’re not
splitting HAIRS, or you could have gotten a little farther! Do you
give up now?&#8221;</p>

<p>&#8220;Oh, wait, I see!&#8221; exclaimed Laurie. &#8220;Achilles, Tortoise is right.&#8221;</p>

<p>&#8220;What? Don’t change your mind NOW, when I am so close to victory!&#8221;
Achilles cried.</p>

<p>&#8220;No, I’m sure Tortoise is right,&#8221; said Laurie. &#8220;Don’t you see? Every
piece he adds is half as long as the one before. That leaves just
enough room left over. Then he repeats again and again with shorter
and shorter strings. Even if he adds an infinite number of pieces, it
will NEVER reach two inches.&#8221;</p>

<p>&#8220;Well, hardly ever,&#8221; said Tortoise.</p>

<p>Achilles grimaced. &#8220;It seems you’ve done it again, Tortoise. But just
to make sure, I will check the arithmetic MYSELF.&#8221; He proceeded to
scribble in his notebook: one-thousand-twenty-fourth inches, plus
two-thousand-forty-eighth inches, plus four-thousand-ninety-sixth
inches, plus...</p>

<p class="figure">
<img src="../../images/infinite-string.jpg" />
</p>

<p>&#8220;THAT should keep him busy. Thank you for your assistance, Laurie.&#8221;</p>

<p>
&#8220;You’re welcome, Mr Tortoise,&#8221; said Laurie. &#8220;I didn’t know something
so big could also be so small.&#8221; </p>

<p>&#8220;Or that something so small can also
be so big,&#8221; Tortoise said. </p>

<p>&#8220;Mr Tortoise, do you know how long this
road is?&#8221; asked Laurie. &#8220;It feels like it goes on forever. I’m trying
to get to the next town.&#8221;</p>

<p>&#8220;It’s quite long,&#8221; said Tortoise. &#8220;In fact, it is infinite.&#8221;</p>

<p>&#8220;Oh, no! How do I get to the end?&#8221;</p>

<p>&#8220;That can be done in two simple steps.&#8221;</p>

<p>&#8220;Really? What steps?&#8221;</p>

<p>&#8220;A step with your right foot, then a step with your left foot. But the
attitude is important. It’s integral. Close your eyes and picture the
road as only two steps long, like the string.&#8221;</p>

<p>Laurie closed her eyes, took a deep breath, and stepped forward with
her right foot. Then she stepped again with her left foot. When she
opened her eyes, Achilles and the Tortoise were gone. In front of her
was a signpost:</p>

<p class="figure">
<img src="../../images/welcome-to-symbol.png" title="Welcome to Symbol! Population: N"/>
</p>

<hr/>

<p>
<iframe src="https://spreadsheets.google.com/embeddedform?formkey=dDhJSEc4VXVObmN3eEhhem5WUmFUUWc6MQ" width="700" height="250" scrolling="no" frameborder="0" marginheight="0" marginwidth="0">Loading...</iframe>
</p>



         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2011/01/tortoise.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.f6eac1876a3334816575b898bf86f4dff6116e57</id>
         <published>2010-11-01T12:00:00.000-08:00</published>
         <updated>2010-11-01T12:00:00.000-08:00</updated>
         <title type="text">The Full Stack, Part I</title>
         <content type="html"><![CDATA[
           

<p>One of my most vivid memories from school was the day our chemistry teacher let us in on the Big Secret: every chemical reaction is a joining or separating of links between atoms. Which links form or break is <a href="http://en.wikipedia.org/wiki/Valence_electrons#Valence_electrons_in_chemical_reactions">completely governed</a> by the energy involved and the number of electrons each atom has. The principle stuck with me long after I'd forgotten the details. There existed a simple reason for all of the strange rules of chemistry, and that reason lived at a lower level of reality. Maybe other things in the world were like that too.

<span class="sidenote">
  (<i>Originally appeared on <a href="http://www.facebook.com/note.php?note_id=461505383919">Facebook's Engineering blog</a> and <a href="http://calendar.perfplanet.com/2010/">perfplanet.com</a>.</i>)
</span>
</p>

<p class="figure">
  <a href="http://xkcd.com/435">
    <img src="../../images/xkcd-435.png" width="540" height="150" border="0"
      title="As always, xkcd makes the point better than I ever could." />
  </a>
</p>

<p>A "<a href="http://forge38.com/blog/2008/06/full-stack-web-developers/">full-stack programmer</a>" is a generalist, someone who can create a non-trivial application by themselves. People who develop broad skills also tend to develop a good mental model of how different layers of a system behave. This turns out to be especially valuable for performance &amp; optimization work. No one can know everything about everything, but you should be able to visualize what happens up and down the stack as an application does its thing. An application is shaped by the requirements of its data, and performance is shaped by how quickly hardware can throw data around.</p>

<p>Consider this harmless-looking SQL statement:</p>

<pre class="code">INSERT INTO some_table VALUES (1, 2, 3);</pre>

<p>What is the maximum number of these you can do per second? Well, it first depends on the storage engine you use. If the table is in MyISAM format a write may result in one more <a href="http://www.pcguide.com/ref/hdd/perf/perf/spec/posSeek-c.html">disk-seek</a> or rotation than InnoDB. InnoDB stores the row data right after the index. With MyISAM indexes and data are stored in different files. A hard drive can only do one thing at a time, so this detail can make a big difference in performance. It also depends on what mechanisms are in place (write-through caches, journals, etc) to keep random seeks to a minimum, and whether you are using table-level or row-level locking. Digging deeper into how these storage engines work, you can <a href="http://marksverbiage.blogspot.com/2007/10/mysql-delaykeywrite-is-good.html">find ways</a> to <a href="http://news.ycombinator.com/item?id=16430">trade safety</a> for speed.</p>
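
<p>To put a rough ceiling on that question, here is a minimal sketch. It assumes, purely for illustration, that every insert pays for a fixed number of random seeks on a magnetic disk at about 10^7 nanoseconds each, and that nothing (caches, journals, batching) absorbs them. The function name and the seek counts are hypothetical, not measurements of MyISAM or InnoDB.</p>

```python
SEEK_NS = 10**7  # assumed cost of one random disk seek, roughly 10 ms


def max_inserts_per_second(seeks_per_insert, seek_ns=SEEK_NS):
    """Upper bound on inserts/second if each insert pays for random seeks."""
    if seeks_per_insert == 0:
        # Everything was absorbed by a cache or journal; the disk is
        # no longer the limiting factor in this toy model.
        return float("inf")
    return 10**9 / (seek_ns * seeks_per_insert)


for seeks in (1, 2):
    print(seeks, "seek(s) per insert:", max_inserts_per_second(seeks), "per second")
```

<p>Under these assumptions one extra seek per write halves the ceiling, which is why the storage-engine details matter so much.</p>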

<h3>The shape of the data</h3>

<p>One way to visualize a system is how its data is shaped and how it flows. Here are some useful factors to think about:</p>

<ul>
<li><b>Working data size:</b> This is the amount of data a system has to deal with during normal operation. Often it is identical to the total data size minus things like old logs, backups, inactive accounts, etc. In time-based applications such as email or a news feed the working set can be much smaller than the total set. People rarely access messages more than a few weeks old.</li>

<li><b>Average request size:</b> How much data does one user transaction have to send over the network? How much data does the system have to touch in order to serve that request? A site with 1 million small pictures will behave differently from a site with 1,000 huge files, even if they have the same data size and number of users. Downloading a photo and running a web search involve similar-sized answers, but the amounts of data touched are very different.</li>

<li><b>Request rate:</b> How many transactions are expected per user per minute? How many concurrent users are there at peak (your busiest period)? In a search engine you may have 5 to 10 queries per user session. An online ebook reader might see constant but low volumes of traffic. A game may require multiple transactions per second per user.</li>

<li><b>Mutation rate:</b> This is a measure of how often data is added, deleted, and edited. A webmail system has a high add rate, a lower deletion rate, and an almost-zero edit rate. An auction system has ridiculously high rates for all three.</li>

<li><b>Consistency:</b> How quickly does a mutation have to spread through the system? For a keyword advertising bid, a few minutes might be acceptable. Trading systems have to reconcile in milliseconds. A comments system is generally expected to show new comments within a second or two.</li>

<li><b>Locality:</b> This has to do with the probability that a user will read item B if they read item A. Or to put it another way, what portion of the working set does one user session need access to? On one extreme you have search engines. A user might want to query bits from anywhere in the data set. In an email application, the user is guaranteed to only access their inbox. Knowing that a user session is restricted to a well-defined subset of the data allows you to shard it: users from India can be directed to servers in India.</li>

<li><b>Computation:</b> what kinds of math do you need to run on the data before it goes out? Can it be precomputed and cached? Are you doing intersections of large arrays? The classic <a href="http://www.paulgraham.com/carl.html">flight search problem</a> requires lots of computation over lots of data. A blog does not.</li>

<li><b>Latency:</b> How quickly are transactions supposed to return success or failure? Users seem to be ok with a flight search or a credit card transaction taking their time. A web search has to return within a few hundred milliseconds. A widget or API that outside systems depend on should return in 100 milliseconds or less. It is more important to keep application latency within a narrow band. It is <i>worse</i> to answer 90% of queries in 0.1 seconds and the rest in 2 seconds than to answer all requests in 0.2 seconds.</li>

<li><b>Contention:</b> What are the <i>fundamental</i> bottlenecks? A pizza shop's fundamental bottleneck is the size of its oven. An application that <a href="http://www.random.org/randomness/">serves random numbers</a> will be limited by how many random-number generators it can employ. An application with strict consistency requirements and a high mutation rate might be limited by lock contention. Needless to say, the more parallelizability and the less contention, the better.</li>
</ul>
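
<p>The latency point above is easy to check with arithmetic. The sketch below uses made-up sample sets that match the two cases in the text (90% at 0.1 seconds and 10% at 2 seconds, versus everything at 0.2 seconds); the <code>summarize</code> helper and its simple percentile rule are assumptions for illustration.</p>

```python
import statistics


def summarize(latencies):
    """Return the mean, an approximate 95th percentile, and the worst case."""
    ordered = sorted(latencies)
    p95_index = int(0.95 * (len(ordered) - 1))
    return {
        "mean": statistics.mean(ordered),
        "p95": ordered[p95_index],
        "worst": ordered[-1],
    }


bimodal = [0.1] * 90 + [2.0] * 10  # mostly fast, occasionally very slow
steady = [0.2] * 100               # never fast, never slow

print("bimodal:", summarize(bimodal))
print("steady: ", summarize(steady))
```

<p>The bimodal service has the worse mean (about 0.29 seconds against 0.2) and a tail ten times slower, even though nine out of ten of its answers are faster.</p>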

<p>This model can be applied to a system as a whole or to a particular feature like a search page or home page. It's rare that all of the factors stand out for a particular application; usually it's 2 or 3. A good example is <a href="http://en.wikipedia.org/wiki/ReCAPTCHA">ReCAPTCHA</a>. It generates a random pair of images, presents them to the user, and verifies whether the user spelled the words in the images correctly. The working set of data is small enough to fit in RAM, there is minimal computation, a low mutation rate, low per-user request rate, great locality, but very strict latency requirements. I'm told that ReCAPTCHA's request latency (minus network latency) is less than a millisecond.</p>

<h3>A horribly oversimplified model of computation</h3>

<p>How an application is implemented depends on how real computers handle data. A computer really does only two things: <a href="http://aturingmachine.com/">read data and write data</a>. Now that CPU cycles are so fast and cheap, performance is a function of how fast it can read or write, and how much data it must move around to accomplish a given task. For historical reasons we draw a line at operations over data on the CPU or in memory and call that "CPU time". Operations that deal with storage or network are lumped under "I/O wait". This is terrible because it doesn't distinguish between a CPU that's doing a lot of work, and a CPU that's waiting for data to be fetched into its cache.[0] A modern server works with five kinds of input/output, each one slower but with more capacity than the one before:
<span class="sidenote">
  <a href="http://aturingmachine.com"><img src="../../images/turing-machine.jpg" /></a>
</span>
</p>

<ul>
<li><b>Registers &amp; CPU cache (1 nanosecond)</b>: These are small, expensive and very fast memory slots. Memory controllers <a href="http://www.akkadia.org/drepper/cpumemory.pdf">try mightily</a> to keep this space populated with the data the CPU needs. A cache miss means a 100X speed penalty. Even with a 95% hit rate, CPU cache misses <a href="http://www.azulsystems.com/events/javaone_2009/session/2009_J1_HardwareCrashCourse.pdf">waste half the time</a>.</li>

<li><b>Main memory (10^2 nanoseconds)</b>: If your computer was an office, RAM would be the desk scattered with manuals and scraps of paper. The kernel is there, reserving Papal land-grant-sized chunks of memory for its own mysterious purposes. So are the programs that are either running or waiting to run, network packets you are receiving, data the kernel thinks it's going to need, and (if you want your program to run fast) your working set. RAM is hundreds of times slower than a register but still orders of magnitude faster than anything else. That's why server people go to such lengths to <a href="http://www.flickr.com/photos/lastfm/2266368081/">jam more and more</a> RAM in.</li>

<li><b>Solid-state drive (10^5 nanoseconds)</b>: SSDs can greatly improve the performance of systems with working sets too large to fit into main memory. Being "only" one thousand times slower than RAM, solid-state devices can be used as <a href="http://www.facebook.com/note.php?note_id=388112370932">ersatz memory</a>. It will take a few more years for SSDs to replace magnetic disks. And then we'll have to rewrite software tuned for the RAM / magnetic gap and not for the <a href="http://forums.mysql.com/read.php?123,194317,194317">new reality</a>.</li>

<li><b>Magnetic disk (10^7 nanoseconds)</b>: Magnetic storage can handle large, contiguous streams of data very well. <a href="http://www.lexemetech.com/2008/03/disks-have-become-tapes.html">Random disk access is what kills performance</a>. The latency gap between RAM and magnetic disks is so great that it's hard to overstate its importance. It's like the difference between having a dollar in your wallet and having your mom send you a dollar in the mail. The other important fact is that access time varies wildly. You can get at any part of RAM or SSD in about the same time, but a hard disk has a physical metal arm that swings around to reach the right part of the magnetic platter.</li>

<li><b>Network (10^6 to 10^9 nanoseconds)</b>: Other computers. Unless you control that computer too, and it's less than a hundred feet away, network calls should be a last resort.</li>
</ul>
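
<p>One way to get a feel for these gaps is to stretch them to human scale. The sketch below pretends that one nanosecond lasts one second and reprints the approximate figures from the list above; the scaling trick and the helper name are illustrative assumptions, and the network entries are just the two ends of the quoted range.</p>

```python
LATENCY_NS = [
    ("registers / CPU cache", 1),
    ("main memory", 10**2),
    ("solid-state drive", 10**5),
    ("magnetic disk", 10**7),
    ("network (best case)", 10**6),
    ("network (worst case)", 10**9),
]

UNITS = [("days", 86400), ("hours", 3600), ("minutes", 60), ("seconds", 1)]


def human_scale(ns):
    """Render a latency as if one nanosecond lasted one second."""
    for unit, size in UNITS:
        if ns >= size:
            return f"{ns / size:.1f} {unit}"
    return f"{ns:.1f} seconds"


for name, ns in LATENCY_NS:
    print(f"{name:24s} {human_scale(ns)}")
```

<p>On this scale a register access takes a second, a trip to RAM takes a couple of minutes, and a single disk seek takes months.</p>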

<h3>Trust, but verify</h3>
<p>The software stack your application runs on is well aware of the memory/disk speed gap, and does its best to juggle things around such that the most-used data stays in RAM. Unfortunately, different layers of the stack can <a href="http://varnish-cache.org/wiki/ArchitectNotes">disagree</a> about how best to do that, and often fight each other pointlessly. My advice is to trust the kernel and keep things simple. If you must trust something else, <a href="http://books.google.com/books?id=BL0NNoFPuAQC&lpg=PA273&ots=CNYJystz3P&pg=PA273#v=onepage">trust the database</a> and tell the kernel to get out of the way.</p>

<h3>Thumbs and envelopes</h3>
<p>I'm using approximate powers-of-ten here to make the mental arithmetic easier. The actual numbers are less neat. When dealing with very large or very small numbers it's important to get the number of zeros right quickly, and only then sweat the details. Precise, unwieldy numbers usually don't help in the early stages of analysis. [1]
<span class="sidenote">
  <a href="../../images/emergency.jpg"><img src="../../images/emergency-sm.jpg" /></a>
</span>
</p>

<p>Suppose you have ten million (10^7) users, each with 10MB (10^7 bytes) of data, and your network uplink can handle 100 megabits (10^7 bytes) per second. How long will it take to copy that data to another location over the internet? Hmm, that would be 10^7 seconds, or about 4 months: not great, but close to reasonable. You could use compression and multiple uplinks to bring the transfer time down to, say, a week. If the approximate answer had been not 4 but 400 months, you'd quickly drop the copy-over-the-internet idea and look for <a href="http://en.wikipedia.org/wiki/Sneakernet#Usage_examples">another answer</a>.</p>
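
<p>The same envelope arithmetic, written out as a small sketch. The numbers are the rounded powers of ten from the paragraph above; the function name is made up for the example.</p>

```python
SECONDS_PER_DAY = 86400


def transfer_days(total_bytes, bytes_per_second):
    """Days needed to push total_bytes through a link of the given speed."""
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


users = 10**7            # ten million users
bytes_per_user = 10**7   # about 10MB each
uplink = 10**7           # 100 megabits/s, rounded to bytes per second

days = transfer_days(users * bytes_per_user, uplink)
print(f"{days:.0f} days, or roughly {days / 30:.1f} months")
```

<p>Changing any exponent by one moves the answer by a factor of ten, which is exactly the kind of error this style of estimate is meant to catch early.</p>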

<h3>movies.example.com</h3>

<p>So can we use this model to identify the performance gotchas of an application? Let's say we want to build a movies-on-demand service like Netflix or Hulu. Videos are professionally produced and between 20 and 200 minutes long. You want to support a library of 100,000 (10^5) films and 10^5 concurrent users. For simplicity's sake we'll consider only the actual watching of movies and disregard browsing the website, video encoding, user comments &amp; ratings, logs analysis, etc.</p>

<ul>
<li><b>Working data size:</b> The average video is 40 minutes long, and the bitrate is 300kbps. 40 * 60 * 300,000 / 8 is about 10^8 bytes. Times 10^5 videos means that your total working set is 10^13 bytes, or 10TB.</li>

<li><b>Average request size:</b> A video stream session will transfer somewhere between 10^7 and 10^9 bytes. In Part One we won't be discussing networking issues, but if we were this would be cause for alarm.</li>

<li><b>Request rate:</b> Fairly low, though the concurrent requests will be high. Users should have short bursts of browsing and long periods of streaming.</li>

<li><b>Mutation rate:</b> Nearly nil.</li>

<li><b>Consistency:</b> Unimportant except for user data. It would be nice to keep track of where they were in a movie and zip back to that, but that can be handled lazily (e.g., in a client-side cookie).</li>

<li><b>Locality:</b> Any user can view any movie. You will have the opposite problem of many users accessing the same movie.</li>

<li><b>Computation:</b> If you do it right, computation should be minimal. DRM or on-the-fly encoding might eat up cycles.</li>

<li><b>Latency:</b> This is an interesting one. The worst case is channel surfing. In real-world movie services you may have noticed that switching streams or skipping around within one video takes a second or two in the average case. That's at the edge of user acceptability.</li>

<li><b>Contention:</b> How many CPU threads do you need to serve 100,000 video streams? How much data can one server push out? Why do real-world services seem to have this large skipping delay? When multiple highly successful implementations seem to have the same limitation, that's a strong sign of a <a href="http://highscalability.com/youtube-architecture">fundamental bottleneck</a>.</li>
</ul>
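
<p>The working-set estimate in the list above can be reproduced with a short sketch. It uses the figures given there (40 minutes, 300kbps, 10^5 videos); the function name is an assumption for illustration.</p>

```python
def video_bytes(minutes, bitrate_bps):
    """Size in bytes of a video of the given length and bitrate."""
    return minutes * 60 * bitrate_bps / 8


avg_video = video_bytes(40, 300_000)  # about 10^8 bytes
library = avg_video * 10**5           # about 10^13 bytes

print(f"average video: {avg_video:.1e} bytes")
print(f"whole library: {library / 10**12:.0f} TB")
```

<p>The exact figure is 9TB, which rounds to the 10TB used in the rest of the discussion.</p>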

<p>It's possible to build a single server that holds 10TB of data, but what about throughput? A hundred thousand streams at 300kbps (10^5 * 3 * 10^5) is 30 gigabits per second (3 * 10^10). Let's say that one server can push out 500Mbps in the happy case. You'll need at least 60 servers to support 30Gbps. That implies about 2,000 concurrent streams per server, which sounds almost reasonable. These guesses may be off by a factor of 2 or 4 but we're in the ballpark.</p>
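
<p>Here is the throughput estimate as a sketch, using the same assumed figures (10^5 streams, 300kbps each, 500 megabits per second per server). The helper name is hypothetical.</p>

```python
import math


def servers_needed(streams, stream_bps, server_bps):
    """Minimum number of servers to carry the aggregate stream bandwidth."""
    return math.ceil(streams * stream_bps / server_bps)


streams = 10**5
stream_bps = 300_000         # 300kbps per stream
server_bps = 500 * 10**6     # 500 megabits per second per server

servers = servers_needed(streams, stream_bps, server_bps)
print(servers, "servers,", round(streams / servers), "streams per server")
```

<p>The per-server figure comes out a little under the rounded 2,000 quoted above, well within the stated factor-of-two slop.</p>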

<p>You could store a copy of the entire 10TB library on each server, but that's kind of expensive. You probably want either:</p>

<ul>
  <li>A set of origin servers and a set of streaming servers. The origins are loaded with disks. The streamers are loaded with RAM. When a request comes in for a video, the streamer first checks to see if it has a local cache. If not, it contacts the origins and reads it from there.</li>
  <li>A system where each video is copied to only a few servers and requests are routed to them. This might have problems with unbalanced traffic.</li>
</ul>

<p>An unknown factor is the distribution of popularity of your video data. If everyone watches the same 2GB video, you could just load the whole file into the RAM of each video server. On the other extreme, if 100,000 users each view 100,000 different videos, you'd need a <i>lot</i> of independent spindles or SSDs to keep up with the concurrent reads. In practice, your traffic will probably follow some kind of <a href="http://en.wikipedia.org/wiki/Zipf's_law">power-law distribution</a> in which the most popular video has X users, the second-most has 0.5X users, the third-most 0.33X users, and so on. On one hand that's good; the bulk of your throughput will be served hot from RAM. On the other hand that's bad, because the rest of the requests will be served from cold storage. A useful hack might be to keep the first minute of every video in RAM at all times.</p>
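
<p>To see how much a RAM cache might buy you, the sketch below assumes the pure 1/rank popularity law described above (X, X/2, X/3, and so on) across a library of 10^5 videos. Real traffic will not follow it exactly, and the function name is made up for the example.</p>

```python
def zipf_share(top_k, library_size):
    """Fraction of views that go to the top_k videos under a 1/rank law."""
    total = sum(1.0 / rank for rank in range(1, library_size + 1))
    head = sum(1.0 / rank for rank in range(1, top_k + 1))
    return head / total


for top_k in (100, 1_000, 10_000):
    share = zipf_share(top_k, 10**5)
    print(f"top {top_k:6d} videos: {share:.0%} of views")
```

<p>Under this assumption the most popular 1% of the library draws well over half the views, but the long tail never becomes negligible, which is why the cold-storage path still matters.</p>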

<p>Whatever architecture you use, it looks as though the performance of movies.example.com will depend almost completely on the number of concurrent reads your storage devices can handle. If I were building this today I would give both SSDs and <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.312&rep=rep1&type=pdf">non-standard</a> <a href="http://www.ece.eng.wayne.edu/~sjiang/Tsinghua-2010/linux-readahead.pdf">data</a> <a href="http://users.nccs.gov/~zzhang3/pubs/pfc.pdf">prefetching</a> strategies a serious look.</p>


<h3>It's been fun</h3>
<p>This subject is way too large for a short writeup to do it justice. But absurd simplifications can be useful as long as you have an understanding of the big picture: an application's requirements are shaped by the data, and implementations are shaped by the hardware's ability to move data. Underneath every simple abstraction is a world of details and cleverness. The purpose of the big fuzzy picture is to point you where to start digging.</p>

<h3>Notes</h3>
<p>[0] Fortunately there is a newish tool for Linux called "<a href="http://lxr.linux.no/linux+v2.6.36/tools/perf/">perf counters</a>".</p>

<p>[1] Jeff Dean of Google deserves a lot of credit for popularizing the "<a href="http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf">numbers you should know</a>" approach to performance and systems work. As my colleague Keith Adams put it, "The ability to quickly discard bad solutions, without actually building them, is a lot of what good systems programming is. Some of that is instinct, some experience, but a lot of it is algebra."</p>



         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2010/11/full-stack.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.841267b99168b0ce42683aba4ac6e83f0a5b00c6</id>
         <published>2010-10-01T12:00:00.000-08:00</published>
         <updated>2010-10-01T12:00:00.000-08:00</updated>
         <title type="text">Predicting the Future</title>
         <content type="html"><![CDATA[
           

<p>Lately I've been using a definition of the future that seems to kick up interesting ideas. It's not original, but it is useful:</p>

<blockquote><p><i>The future is a disagreement with the past about what is and is not important.</i></p></blockquote>

<p>The differences between two ages are informed by politics, technology, demographics, etc. But they are easiest to understand in terms of what each age thinks is worth caring about.</p>

<p>Knowing who your parents are is less important today than a couple hundred years ago. It doesn't cripple you socially as it once did. Knowing what material your water pipes are made of became suddenly important when we noticed the crippling effects of lead. Today we pay close attention to lab tests and labels and indicator lights that summarize what's really going on, because a lot of important stuff is too small or complex to comprehend directly. If I had to pick a major disagreement between the present world and the past, it would be the importance of invisible amounts of mass and energy, be they trace chemicals or transistors. Moreover we tend to care about emergent information content, the patterns in the material, rather than the actual material.

  <span class="sidenote">
    <a href="http://en.wikipedia.org/wiki/Mercury_poisoning"><img title="They had all the clues to discover mercury poisoning, too." src="../../images/mercury-poisoning.jpg" /></a>  </span>
</p>

<p>To a typical Victorian that wouldn't be <a href="http://paulgraham.com/say.html">heresy</a> so much as fantastic nonsense. Your great-grandparents' world was populated by people, animals, and human-scale artifacts. Man was more literally the measure of all things. Important things were assumed to be big and obvious, or at least visible to the senses.

  <span class="sidenote">Germ theory was one of the first ideas to break that assumption in a serious way. It didn't help that its companion idea, vaccination, was even <a href="http://books.google.com/books?id=Q_4vAAAAYAAJ&lpg=PA421&ots=uW3bzaI2zu&dq=denouncing%20Pasteur%20as%20a%20quack&pg=PA421#v=onepage&q=quack&f=false">weirder</a>.</span>
</p>

<p style="text-align:center"><a title="As always, there is an XKCD for the occasion." href="http://xkcd.com/722"><img border="0" src="../../images/xkcd-722.png" /></a></p>

<p>The point is that if you want to do truly "futuristic" work, you can't just extrapolate from what is believed right now. You have to suppose that at least one aspect of our worldview is wrong: something we hold dear is not actually important, or there is something else that should be, or both. If you imagine people feeling and acting exactly the same way as they do now, that's not the future. That's just later on.
  <span class="sidenote">
    Also remember that the future is not an ever-upwards spiral. Who would have predicted in 1930 that mass slavery would return to Europe?
  </span>
</p>

<p>The flip side is that these discontinuities make predicting the future hard. The clues to how we may think are camouflaged among thickets of established fact. You have to isolate assumptions underneath our view of the world, alter them, then look around to see what changed. The deeper and more unspoken the assumption, the greater the potential for change. On the surface it sounds like a pretty stupid way to spend your time: disassembling ideas that <i>aren't broken</i> in order to discover ways to break them. On the other hand, that's precisely what tinkers do with machines: break them in order to understand and improve on them. That is, I think, what is meant by the motto "<a href="http://www.ecotopia.com/webpress/futures.htm">the best way to predict the future is to invent it</a>". The initial spadework is the same whether you are predicting or inventing.</p>


<p>You needn't court-martial everything you think you know, but you do need ways to identify suspicious areas, and tools to dig deeper. So what might they be? There is one that I was mighty proud of inventing, until I described it to a Literature major.</p>

<h3>Popular culture as a lens</h3>

<p>One way to find these unspoken assumptions is to examine the treatment of a futuristic subject in popular culture, pick out the common themes, and ask whether they make sense. Popular culture is pretty good for this kind of practice, though what you get out of it may not be earth-shattering. A fiction writer's stock-in-trade is ideas that <i>feel</i> right but are not necessarily backed up by evidence.</p>

<p>For example, science fiction overflows with stories about artificial beings. What do they have in common? Well, a common trope is that artificial intelligences are sufficiently human <i>de facto</i>, especially if the story's conflict is about their status <i>de jure</i>. It's rare to find a story about AI which concludes that they are utterly and forever alien. Wintermute employed synthetic emissaries, and spoke casually about its motivations and desires. Agent Smith is explicitly Neo's mirror image. HAL9000's voice was as warm and soothing as a late-night radio host's. 
  <span class="sidenote">The golem of Jewish folklore is an interesting case. They are humanoid but explicitly not intelligent. Their moral position is somewhere between djinns and power tools. Terry Pratchett explored the idea of intelligent golems in "Feet of Clay", and it turned out exactly as you expect it would: after many misunderstandings they assert their rights and join the mainstream of society. It's hard to shake the feeling that AI stories are mostly allegories for racial integration.</span>
</p>

<p>We seem to have a hard time coming to grips with intelligence that does not have a face or a voice. When we say "intelligent" in casual speech, we mostly mean "<a href="http://en.wikipedia.org/wiki/Talk:Intelligence">a bloke I can have a conversation with</a>". It may turn out that artificial intelligences are not able to evoke empathy from or experience empathy for natural born humans. I'm not sure I like the idea of <a href="http://wiki.lesswrong.com/wiki/Friendly_AI">what happens</a> when we compete with such beings for resources.</p>

<p>The phrase "artificial intelligence" itself may harbor simplistic assumptions, like calling an X-ray machine a "magic lantern". It's not a bad analogy, but it misses important things like how X-rays are generated, other species of exotic radiation, and how too much of them will kill you. Imagine <a href="http://www.nytimes.com/2010/09/01/opinion/01gibson.html">a passive fabric of knowledge</a> that, when and only when directed by a human, accomplishes superhuman feats. The conflict would not be over the humanity of the entity in question, but over <a href="http://www.edge.org/3rd_culture/dyson05/dyson05_index.html">which humans control it</a>.</p>

<p>So, an unoriginal though interesting prediction: the idea of artificial intelligence having a distinct personality with recognizable motivations and desires is attractive, but there is little evidence that it must happen.</p>

<h3>Subtextual subversion</h3>

<p>Fans of hard science like to <a href="http://www.info.ucl.ac.be/~pvr/decon.html">deride Derrida</a> for being pseudo-intellectual, but this literary method is more or less what he was talking about. My wife was very pleased to point that out when I tried to pass it off as my own invention. Deconstructionism got lost in the weeds because its practitioners don't use reality to verify their theories, but the basic method seems sound. Can you use it on other bodies of literature, not just fiction?</p>


<p>Over the last ten years, the field of data-mining shifted its focus to gathering enough of the right data instead of just ever more-clever algorithms. I remember reading a lot of papers on automatic text classification in the late 1990s. They drew from a <a href="http://archive.ics.uci.edu/ml/datasets.html">small pool</a> of datasets, such as a collection of news articles. Innovation happened in the algorithms. This was a reasonable idea. Data was hard to come by, and using the same datasets seemed like a good way to compare the performance of different algorithms. The underlying assumption was carried over from other fields of computer science: given a representative sample of data, the way forward is to come up with more sophisticated algorithms.</p>

<p>Researchers at Google were the first I know of to demonstrate value in the opposite: dumb algorithms executed over gigantic datasets. They came to this opinion because they had so much damned data that it was hard enough to count it, much less run O(n&#179;) algorithms over it. So they tried dumb algorithms first, and they worked surprisingly well. Older, naive algorithms turned out to be perfectly valid; they just needed orders of magnitude more input than had been previously tried.

<span class="sidenote">I would not be surprised to learn that several people thought of this idea early on. I don't know enough about the field to say.</span>
</p>

<p>It's possible you would have hit on the same idea, if you'd analyzed the literature for unspoken assumptions. Or, like Google, you could have played with big problems and new technology while under the gun to produce something useful.</p>

<h3>Adopt early and often</h3>

<p>Rubbing up against the new is another way to glimpse the future. Just as a child is immersed in a culture and then later derives the rules which shape it, early adopters immerse themselves in new ideas and new technology in order to puzzle out the shape of the future.</p>

<p>Some people are natural early adopters. A friend of mine is busy building an electronic library and giving away his physical books. In eight months he burned through two Kindles and an iPad, and I added a lot of his books to my shelves. The funny thing is that both of us genuinely feel we're getting a good deal: he is divesting his burden of dead trees and space, and I am saving perfectly good books from futuristic folly.</p>

<p>Ours is a classic future/past disagreement. He thinks it's better to move to this new new thing and see how it works. Eventually books will be published on-demand and kept up-to-date just as websites are. Paper will be an option, and not the most popular one. I think that <a href="http://carlos.bueno.org/2010/09/paper-internet.html">paper</a> is actually a pretty good medium for archival storage. Individuals should act in concert to preserve as much as we can, as more and more of our culture becomes digital-only. I don't know which of us will be right, or both, or neither.</p>

<p>My friend's way to predict the future is to surround himself with new technologies. The new often embodies upcoming disagreements with the present. You still have to do the work of isolating the assumptions it breaks, and deciding whether they are correct. For whatever reason I don't have the temperament for this. My method is to surround myself with early adopters, and watch what they do.</p>

<h3>Bring it home</h3>

<p>Picture yourself as you were ten years ago. List five things that are different about you now. I'll bet money that most of them are differences in your attitude towards the world. Now picture yourself ten years from today. You probably imagine external qualities: you will be more successful, or relaxed, or in Alaska. But most likely the biggest differences will be internal, what Future You thinks is truly important. It's hard to predict exactly what would change. If you knew that, you'd already be on your way to becoming the Future You.</p>



         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2010/10/predicting-the-future.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.62e4c2cee5c8e261519ef136adf20bb878e02727</id>
         <published>2010-09-01T12:00:00.000-08:00</published>
         <updated>2010-09-01T12:00:00.000-08:00</updated>
         <title type="text">A Paper Internet</title>
         <content type="html"><![CDATA[
           

<p><b>Update:</b> Check out <a href="http://paper-internet.appspot.com/">paper-internet.appspot.com</a></p>

<p>If you wanted to preserve important bits of our civilization for future centuries, you could do worse than a bundle of paper sealed in plastic. It's remarkably cheap and effective; you can make one over a weekend. In this article we'll build a 1/2 scale model of a time capsule that contains the complete Linux 0.1 source code, plus sundry articles and internet ephemera.


<span class="sidenote">

<iframe frameborder="0" height="380px" src="http://www.kickstarter.com/projects/512752850/a-paper-internet/widget/card.html" width="220px"></iframe>
</span>
</p>

<div class="cent">
<img src="../../images/node0/initial.jpg" />
</div>

<p>A time capsule must perform three basic functions:</p>

<ol>
 <li>Encode information with sufficient density & durability.</li>
 <li>Protect the information from physical damage, moisture, heat & cold, etc.</li>
 <li>Be findable.</li>
</ol>

<p>So while the Rosetta Stone performed (2) fairly well, it was pretty lucky to be found at all. Also its data density is terrible: about 1 bit per cubic centimeter. A book in a library fulfills (1), but requires the library around it to provide (2) and (3).</p>

<p>The internet, <a href="http://carlos.bueno.org/2008/08/save-web.html">contrary to popular belief</a>, is not very good at preserving information on a long time scale. It ultimately depends on digital media that break down rapidly. Early Unix source code, one of the most important sequences of bits ever written, had to be <a href="http://code.google.com/p/unix-jun72/">reconstructed from printouts</a>.</p>

<p>The makeup of our capsule is simple: cellulose, carbon, polymers, and distributed information. You print a bundle of paper, place it inside a box, stick a label on it, then drown it in translucent epoxy resin. Alongside whatever it is you are preserving, <i>you include the locations of other capsules</i>.</p>

<h3>Density & Durability</h3>

<p>The humble piece of paper has come a long way in the last few decades. Acid-free paper [<a href="#note-0" name="ref-0">0</a>] is the norm. It has archival properties comparable to cotton rag or parchment, and can easily survive for 300 to 500 years. Black & white laser toner is carbon powder and resin, fused to the surface at a few hundred degrees Fahrenheit. Carbon, being an atomic element, never fades. All in all it's a cheap & cheerful way to preserve data for a very long time.</p>

<div class="cent">
<img src="../../images/node0/paper.jpg" title="256 pages printed 4up and cut into quarter-sheets. Note the ruler."/>
</div>

<h3>Protection from the elements</h3>

<p>You need an airtight seal that is itself fairly rugged. Epoxy resin is the hard plastic you often see protecting the surface of bars and restaurant tables. It's the closest thing you can get to man-made amber. Our scale-model capsule will be encased in a shell of resin about 1 centimeter thick. We're not shooting for 2,000 years exposed in the desert, just 50 to 500 years in the ground.</p>

<h3>Be findable</h3>

<p>The biggest design problem of traditional time capsules is that people forget where the damned things are buried. There seem to be two contradictory thoughts going on at once: that the best way to preserve information is inside a buried box, but that the best way to preserve information <i>about</i> the box is somewhere else.</p>

<p>Inside each time capsule will be a list of other known capsules. That, I hope, will make the difference between a node in a network and a forgotten box of junk. Dozens or hundreds of people could build full-scale capsules like this and share location data with each other. This prototype and its twin are the only two of their kind so far, so they only link to each other. The larger the network, the greater the chances of recovery.</p>



<h3>Materials</h3>
<ol>
  <li>70 sheets of high-quality, acid-free printer paper</li>
  <li>Laser printer</li>
  <li>250 pages of data you want to preserve</li>
  <li>Scissors or paper cutter</li>
  <li>Masking tape</li>
  <li>1/4" thick balsa wood planks</li>
  <li>Illustration board (thick paperboard)</li>
  <li>Razor, ruler</li>
  <li>Wood glue or white glue</li>
  <li>500ml clear epoxy resin</li>
  <li>Disposable cups and stir sticks</li>
  <li>Gloves, goggles, and mask</li>
</ol>

<h3>The Bundle</h3>

<p>Gather whatever data you want to preserve. It can be books, songs, computer programs, your Facebook page, diary, recipes, anything. I would focus on things that are likely to disappear. The future will probably most appreciate a description of boring, everyday life in Right Now, A.D.</p>

<p>Laser-print your data "4up" and single-sided. You should experiment with your printer's capabilities, but I've found that 10pt Helvetica printed 4up is the smallest mine can go and still be legible. Don't print double-sided because the toner might stick to itself if it ever gets too hot.</p>
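<p>The sheet count is simple arithmetic, sketched here in Python. The function name is only for illustration. At four pages per sheet, the 250 pages in the materials list need 63 sheets, which leaves a few spares out of 70.</p>

```python
import math

def sheets_needed(pages, pages_per_sheet=4):
    """Sheets of paper required when printing single-sided, N-up."""
    return math.ceil(pages / pages_per_sheet)

print(sheets_needed(250))  # 63
```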

<p>Cut the sheets into quarter-pages, collate them, then tightly wrap the bundle in a couple of layers of paper, like a Christmas present.</p>

<h3>The Box</h3>

<p>The box is mainly for appearance's sake, and to protect the paper from light. You could probably sink your bundle (wrapped in a few layers of paper and plastic) directly into the resin and it would work fine.</p>

<div class="cent">
<img src="../../images/node0/box.jpg" title="Building the box"/>
</div>

<p>I made mine from illustration board. Cut two 12 x 15cm pieces for the top and bottom. Then cut the side walls 3cm high and <i>slightly</i> shorter than the 12 or 15 cm, to account for the thickness of the neighboring wall.</p>

<p>Glue it up! It doesn't have to be pretty or perfect as long as it fits as tightly as possible around the paper bundle. Let it sit for an hour to dry.</p>

<p>Place the bundle inside a ziplock bag, squeeze all the air out, and seal it. Put that inside the box and glue the lid shut. Paint it if you want, then glue a label to the front so people know what it is.</p>

<div class="cent">
<img src="../../images/node0/painted.jpg" title="Box all painted up."/>
</div>


<p>You build the mould the same way as the box, just 2cm larger in each dimension. I built mine out of balsa wood. If you have an aluminium or plastic tray of the right dimensions you can use that instead.</p>

<div class="cent">
<img src="../../images/node0/mould.jpg" title="The mould is the same thing as the box, just a bit larger."/>
</div>

<h3>The Epoxy</h3>

<p>Needless to say, do all epoxy work in a well-ventilated area and <i>follow the safety instructions</i>. Epoxy resin sounds tricky but it's pretty easy to handle with practice. There are many types of resin of varying properties. You want "encapsulating epoxy resin" or "clear casting resin", which is often used to seal electronic components and art projects. The strongest resins take 48 hours to harden completely, but last much longer than fast-cure resins.</p>

<div class="cent">
<img src="../../images/node0/resin.jpg" title="Man, this stuff STINKS. Get the low-odor, 1:1 kind. Trust me."/>
</div>


<p>Mix & pour about 3/4 cup (140ml) of resin in the bottom of the mould and let it cure for about an hour. This forms the back of the shell.</p>

<p>Center the box inside the mould. Mix & pour the rest of the resin on top of the box, and let it flow into the sides. Loosely cover and let cure for 24 to 48 hours. Your inner box will probably not be water-tight, so expect some bubbles to stream out of it as the epoxy seeps in. (To avoid this you could use an airtight tin, though there is a chance it will float!)</p>

<div class="cent">
<img src="../../images/node0/pour.jpg" />
</div>


<p>Place in ground, let stand 300 years.</p>

<h3>Thinking Bigger</h3>

<p>This is a scale model to demonstrate the process. Real capsules will contain at least 500 full-sized sheets of paper. The magic of the square-cube law makes it more cost-effective as you scale up. Casting large volumes of epoxy is a bit tricky, so start small and ask your friendly neighborhood supplier for advice.</p>

<p><b>Three Reams</b> (1,500 sheets): This is probably the most manageable size for a weekend project. You could use ready-made "archival" boxes for the inner box and one of those plastic file-folder boxes for the outer mould. Artist-grade resin starts to get pretty expensive at these volumes, but you can use amber encapsulating resin instead, about $80 for two gallons. <a href="http://www.jgreer.com/electronic-potting.htm">AeroMarine</a> sells in bulk and will send you free samples.</p>

<p><b>Carton</b> (5,000 sheets): If you want something with more volume and durability, you can use a concrete flower pot whose inner dimensions are about 2 cm larger than the outer dimensions of the paper carton. Tightly seal the paper with several layers of thick plastic, and pour the resin as before.</p>

<p><b>Oil Drum</b> (28,000 sheets): If you have good concrete molding skills and access to bituminous resins (instead of epoxy resin) for the water seal, you could build a time capsule around a 55-gallon oil drum with very good capacity. It's probably wise to invest in higher-quality printing at this scale, so you can fit more than 4 pages of data per sheet.</p>

<p><b>Monument</b> (3,500,000 sheets): A typical twenty-foot cargo container can hold over 700 cartons of paper. Constructing this monument requires a proper concrete foundation, a steel-reinforced concrete shell, and serious seals against moisture. The curation and printing alone are jobs of unusual size, but doable.</p>
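<p>For a rough sense of scale, here is a tally of how many pages each size holds. The sheet counts come from the list above; the names in the code are only illustrative, and four pages per sheet is a floor, since better printing would fit more.</p>

```python
# Sheet counts are taken from the sizes described in the text.
CAPSULES = {
    "Three Reams": 1_500,
    "Carton": 5_000,
    "Oil Drum": 28_000,
    "Monument": 3_500_000,
}

def pages_of_data(sheets, pages_per_sheet=4):
    """Pages stored on single-sided sheets printed N-up."""
    return sheets * pages_per_sheet

for name, sheets in CAPSULES.items():
    print(f"{name}: {sheets:,} sheets, {pages_of_data(sheets):,} pages")
```

<p>The Monument figure is just 700 cartons at 5,000 sheets each.</p>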

<p>A cargo container sells for about USD$3,000. A contractor friend of mine estimated the concrete construction work would cost about USD$15,000. The printing, curation of the data, and mosaic work would be the most expensive items. But I believe the total cost would be under USD$100,000. That's less than many corporate sculptures, and a lot more useful.</p>

<p>I think it would be beautiful to put giant concrete archives in public parks around the world. A mosaic on the top surface would describe what it is and what it's for. Sink them about 2 meters down so they stick out a bit. They would form large benches for people to sit and play on, trace out the mosaic with their fingers, and perhaps be reminded of time.</p>

<p>If you want to build a time capsule yourself, <a href="mailto:carlos@bueno.org?subject=paper%20internet">send me an email</a>! Let's get this thing started.</p>


<h3>Notes</h3>

<p>[<a href="#ref-0" name="note-0">0</a>] "Acid-free" is a bit misleading. All wood-pulp paper contains acid that will yellow and destroy the fibers over time. "Acid-free" paper is given an extra wash, then impregnated with alkalies (baking soda, more or less) to improve whiteness and neutralize remaining acids. The percentage varies from 2% by weight up to 4 or 5% for "archival" quality paper.</p>


         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2010/09/paper-internet.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.32ffc52e29ad2e75621b9fe1606bf9fc933e4772</id>
         <published>2010-07-01T12:00:00.000-08:00</published>
         <updated>2010-07-01T12:00:00.000-08:00</updated>
         <title type="text">Internet Cartography</title>
         <content type="html"><![CDATA[
           

<p>Every generation likes to think it reinvents the world from scratch. But some things are shaped by history and geography as much as anything. Mountains, rivers, archipelagos, and long terrestrial crossings play a big role in deciding where, how, and how well different parts of the Earth get connected.</p>

<p>This is a map of the global telegraph network from <a href="http://en.wikipedia.org/wiki/File:1901_Eastern_Telegraph_cables.png">110 years ago</a> side-by-side with the <a href="http://www.guardian.co.uk/business/2008/feb/01/internationalpersonalfinancebusiness.internet">internet of today</a>:</p>

<p><img src="../../images/teh-interweb-600.png" /></p>

<p>One way to see the internet is as a physical manifestation of trade volume between cities, on a 40-year moving average. That is about how long it takes for economic ties to develop, demand to rise, and high-volume communications routes to be financed and built. Once built, these links tend to stick around.</p>

<p>Governments and empires have come and gone, bandwidth has increased a billion-fold, but the network has the same general shape it had back when Mark Twain was sending witty telegrams. The only big change since then is greater ties between the US and Asia.</p>

<p>Just from looking at where the cables go you can guess how long it would take to send a message. A telegram from San Francisco to Hong Kong in 1901 must have taken many hops through British Empire cables to Europe, through the Middle East, and so on. London to New York was fast and direct. The vestiges of the Spanish and Portuguese Empires show up in the many links between South America, the Caribbean archipelago, and the Iberian peninsula.</p>

<p>A cool thing is that you can measure these relative latencies yourself, using the present-day internet. If you run a website with a decent amount of worldwide traffic, you can use that traffic to map out how the internet responds with regard to you, and see how that matches the gross structure of the 'net.</p>

<p>I wrote about a cheap and cheerful way to generate this data <a href="http://developer.yahoo.net/blog/archives/2009/11/an_engineers_gu.html">last year</a>, and the code has since been open-sourced as part of Yahoo's <a href="http://www.github.com/yahoo/boomerang">Boomerang</a> measurement framework. The basic idea is to have your users perform two tiny network requests: one to a throwaway hostname generated in the moment, like <b>8j48sas.dns.example.net/A.gif</b>, then another to a different single-pixel image on the same host, <b>8j48sas.dns.example.net/B.gif</b>. The first request will require a DNS lookup, TCP handshake, and HTTP transaction. The second only needs to do the TCP and HTTP steps. Now you have fuzzy measurements of how long it took to do a full HTTP round-trip (B) and to do a full end-to-end DNS lookup (A - B).
<span class="sidenote">
  <a href="http://xkcd.com/192"><img class="noborder" src="../../images/xkcd-192-last.png" /></a>
</span>
</p>
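<p>The arithmetic behind the two measurements can be sketched in a few lines of Python. This is only an illustration: the function name and the sample timings are made up, clamping at zero is my own choice for handling noisy samples, and the real measurement happens in the browser, not in this code.</p>

```python
def split_latency(a_ms, b_ms):
    """Split two image-fetch timings into DNS and HTTP parts.

    a_ms: time to fetch A.gif (DNS lookup + TCP handshake + HTTP)
    b_ms: time to fetch B.gif from the same host (TCP handshake + HTTP only)
    """
    http_ms = b_ms
    # The measurements are fuzzy, so B can occasionally exceed A; clamp at zero.
    dns_ms = max(a_ms - b_ms, 0)
    return dns_ms, http_ms

# Hypothetical timings in milliseconds.
dns, http = split_latency(a_ms=310, b_ms=180)
print(dns, http)  # 130 180
```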

<p>Real-world data on DNS performance is generally considered hard to come by. The domain name system is designed with caching and intermediaries at all levels, so you as a site owner only see part of the story during normal operation. You can buy precise data from commercial services like Gomez or Keynote. Or you can generate it yourself if you happen to have, say, a <a href="http://www-iepm.slac.stanford.edu/pinger/">distributed network of computers</a>, or a browser plugin <a href="http://www.youtube.com/watch?v=X6R4iAfc1qA#t=6m35s">installed on millions of clients</a>. Otherwise, this Javascript method is less accurate but works well enough.</p>

<p>Here is a chart of median (50th percentile) DNS latencies experienced by a random sample of Facebook users, broken down by country. As you can see, there are several lines crowding together at the bottom. That is the US and parts of Europe like the UK and Belgium. Facebook's DNS servers tend to be physically close to users in those countries. Spain and France are a bit higher up, and the rest of the graph is a mix of Asian and South American countries. [1]</p>

<p><img src="../../images/dns-latency-p50.png" /></p>

<p>The median value only tells part of the story. Here is the worldwide DNS latency data as a density plot, to show the distribution. Notice that a substantial number of users took more than 500 milliseconds just to look up a hostname. This is the uncached worst-case, of course, but it's something to keep in mind.</p>

<p><img src="../../images/dns-latency-density.png" /></p>
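<p>To see why the median alone can mislead, here is a small sketch with made-up samples (not the Facebook data). The function name and the sample values are hypothetical; the 500 millisecond threshold is the one mentioned above.</p>

```python
import statistics

def summarize(samples_ms, threshold_ms=500):
    """Return the median latency and the share of samples above the threshold."""
    median = statistics.median(samples_ms)
    slow_share = sum(1 for s in samples_ms if s > threshold_ms) / len(samples_ms)
    return median, slow_share

# Hypothetical DNS lookup times in milliseconds.
samples = [40, 55, 60, 80, 95, 120, 300, 650, 900, 1400]
print(summarize(samples))  # (107.5, 0.3)
```

<p>The median looks healthy at about 108 ms, yet three lookups in ten took longer than half a second.</p>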

<h3>HTTP Latencies</h3>

<p>Here is the chart for measurement B, the TCP + HTTP latency. This better reflects the real "geography" of the internet, because the HTTP requests travel all of the way back to our web tiers in the United States. There is much less volatility in these measurements day-to-day; it's controlled more by basic network conditions and speed-of-light and less by the health of various <a href="http://en.wikipedia.org/wiki/Domain_Name_System#Address_resolution_mechanism">DNS recursors</a> around the world.</p>

<p><img src="../../images/http-latency-p50.png" /></p>

<h3>How low can you go?</h3>

<p>So how fast are these links between countries, compared to what is possible? Below is a chart of the same median HTTP latency data, averaged over a week. The short light-grey bars represent the theoretical minimum. If you could carve a direct line between any two spots along the surface of the planet, this grey bar would be the internet round trip time between the US and the given country. [2]</p>

<p><img src="../../images/http-latency-country.png" /></p>

<p>We can learn a lot of things from this chart. The most obvious is that HTTP latency between Asia and the US is worse than US-Europe. The Pacific Ocean is wider than the Atlantic, of course, but raw distance is not the only factor. Economics and local geography play their part. </p>

<p>Look at the ratios between the black bars (real) and the grey bars (theoretical). Both the fastest European and Asian countries have real-world latencies at or below 2X the theoretical minimum, which is pretty impressive. Few technologies get within spitting distance of the physical limits of the universe.</p>

<p>These low-multiple countries tend to have fortunate geography, or a strong history of economic relations with the United States, or both. Other countries with less-strong trade ties, such as Spain, or lots of little islands like the Philippines, have multiples nearer to 2.5X and above. While Australia is a bit farther than Thailand, it's 15% closer as far as the internet is concerned. More investment has been put in by the cable operators to make that route fast and wide. In fact, Australia (population 22M) has a comparable amount of bandwidth to the US as all of South America (population 385M).</p>

<h3>Cry for Argentina</h3>
<p>The multiples of South American countries start at 3.5X and go up from there. North-South routes are hurt by an unlucky trifecta of mountains, long land crossings, and archipelagos. There is only one cable that serves the Pacific side from Los Angeles to Panama. It's hard to justify building lots of capacity on the Pacific side, because the <a href="http://en.wikipedia.org/wiki/Andes">Andes mountains</a> cut off that part of the continent from the rest. Most traffic follows a long and painful path across the entire length of the US to the Atlantic, then takes a right turn and down another 800 miles of the Florida peninsula. It exits Miami and immediately hits a congested maze of cables, hopping in and out of the water as it navigates the islands of the Caribbean. Someday South America will get better connected, but natural barriers drive the costs way up.</p>

<p>There are other interesting cases such as Belgium, which has the lowest latency and lowest multiple (1.6X) of any European country. The reason is that Belgium is well-placed as an internet nexus, being a) close to Britain but away from the Channel and b) geographically convenient for branching off into the rest of Europe.</p>

<h3>Try this at home</h3>

<p>These measurements are very skewed towards the United States. It would be awesome to see measurements from other spots and different traffic patterns from around the world. The code to collect this data (and a lot more) is <a href="http://www.github.com/yahoo/boomerang">open-source</a> and simple to implement. So try the experiment for yourself and let me know what you find.</p>

<h3>Notes</h3>

<p>
[1] This chart generally agrees with data gathered by Yahoo and <a href="http://www.youtube.com/watch?v=X6R4iAfc1qA#t=6m35s">Microsoft</a>. The data is very US-centric; the picture would be quite different if you were to run the experiment from a site based on another continent. Facebook's servers are largely in the US, so naturally we care most about how to get bits from here to there and less about, say, between India and Saudi Arabia.</p>

<p>[2] The theoretical minimum latency is calculated using the average speed of light through optical fiber, over a hypothetical cable laid in a great circle line between the town of Independence, Kansas and the centroid of the given country. This time is multiplied by 4 to approximate the two round-trips necessary to complete a TCP handshake and HTTP transaction. You can read all about <a href="http://en.wikipedia.org/wiki/Great_circle">Great Circle</a> routes and the speed of light through fiber in Wikipedia, or just use <a href="http://www.wolframalpha.com/input/?i=distance+between+Independence,+Kansas,+USA+and+Bangkok,+Thailand">Wolfram Alpha</a> to do it for you.</p>
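<p>A minimal sketch of that calculation in Python, under stated assumptions: the speed of light in fiber is taken as roughly 200,000 km/s (about two-thirds of its speed in a vacuum), the Earth is treated as a sphere of radius 6,371 km, the coordinates are approximate, and Bangkok stands in for the centroid of Thailand. The function names are only for illustration.</p>

```python
import math

EARTH_RADIUS_KM = 6371.0          # mean Earth radius (spherical approximation)
FIBER_SPEED_KM_PER_S = 200_000.0  # assumption: roughly two-thirds of c

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Guard against rounding pushing h slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))

def min_latency_ms(distance_km, one_way_trips=4):
    """Four one-way trips approximate a TCP handshake plus an HTTP transaction."""
    return one_way_trips * distance_km / FIBER_SPEED_KM_PER_S * 1000

# Approximate coordinates: Independence, Kansas and Bangkok, Thailand.
distance = great_circle_km(37.22, -95.71, 13.75, 100.50)
print(f"{distance:,.0f} km, {min_latency_ms(distance):.0f} ms minimum")
```

<p>Dividing a measured median latency by this figure gives the multiple used in the charts above.</p>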

         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2010/07/internet-cartography.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.8d2fc5e2ce210fe6a7c3f706c89b8bb71c998f00</id>
         <published>2010-07-01T12:00:00.000-08:00</published>
         <updated>2010-07-01T12:00:00.000-08:00</updated>
         <title type="text">Corrupting the Youth</title>
         <content type="html"><![CDATA[
           
<p>The Pythagorean Theorem was once a profound and dimly-understood secret. Today we teach it to children. I'd say that's a pretty good test of whether we fully understand something. What parts of computer science can we really teach to kids today? Remember that computer science has nothing to do with computers as such, any more than geometry is about surveying land.</p>

<p>Alan Kay has a <a href="http://www.youtube.com/watch?v=Ud8WRAdihPg">classic talk</a> about different modes of thinking and an experiment he did with young people of different ages. Imagine a robot that can do four things: move forward some distance, move backward, rotate some number of degrees, and make a mark on the floor. The goal is to write a program that tells the robot to draw a circle.</p>

<p>Before I tell you what happened, do take a minute and try it yourself. It's a pretty problem.</p>

<p>Precocious teenagers are stuffed full of facts about circles and curves, &pi;r<sup>2</sup> and calculus and all of that. More to the point, they have been trained in school to think abstractly and to leave aside visual or visceral thinking. Given this robot exercise they would spend 45 minutes on the blackboard writing equations and looking up things in books and scratching their heads before giving up.</p>

<p>Then Kay found some 10-year-olds and gave them the same problem. A 10-year-old doesn't know much about equations. She's a visual thinker. She doesn't know much about circles either, but she does know that a circle is defined as all points that are the same distance from the center. After about ten minutes of thinking and doodling she writes the program this way:</p>

<ol>
  <li>Start at center.</li>
  <li>Move forward N inches.</li>
  <li>Make a mark.</li>
  <li>Move backward N inches.</li>
  <li>Rotate 1 degree.</li>
  <li>GOTO 1</li>
</ol>

<p>That is a significant result. Ten-year-olds beat the pants off teenagers by using an officially discouraged mode of thinking. Not satisfied with that, crazy old Kay invited in a bunch of five-year-olds. You don't let little kids play with chalk if you want to get any work done, so for them he turned it into a game:</p>

<p>"Ok kids, you are robots. Say <i>Beep beep beep!</i> Good, now cover your eyes, and I want you to walk in a circle."</p>

<p>The kids would cover their eyes and cock their heads for a few seconds, thinking about how to walk in a circle while blind. Then they did it. They walked forward just a little, then turned a little, then walked forward again, and so on until their inner ear told them they'd done a full circuit. Not only did these children solve the problem immediately using visceral thinking, their solution was hundreds of times more efficient. Somehow they were intuiting differential equations.</p>
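<p>For the curious, here is a rough sketch of both programs as a simulation. It assumes a robot that starts at the origin facing along the X axis; the function names and the one-degree step are mine, not Kay's. Each function returns the marks it made and the total distance the robot traveled, so you can compare the two approaches yourself.</p>

```python
import math


def spoke_program(radius, step_deg=1.0):
    """The ten-year-old's program: out from the center, mark, back, rotate."""
    marks, traveled, heading = [], 0.0, 0.0
    for _ in range(round(360 / step_deg)):
        rad = math.radians(heading)
        marks.append((radius * math.cos(rad), radius * math.sin(rad)))
        traveled += 2 * radius      # forward N, then backward N
        heading += step_deg
    return marks, traveled


def walking_program(step_len, turn_deg=1.0):
    """The five-year-old's program: a small step, a small turn, repeat."""
    x, y, heading, traveled = 0.0, 0.0, 0.0, 0.0
    marks = []
    for _ in range(round(360 / turn_deg)):
        rad = math.radians(heading)
        x += step_len * math.cos(rad)
        y += step_len * math.sin(rad)
        marks.append((x, y))
        traveled += step_len
        heading += turn_deg
    return marks, traveled


if __name__ == "__main__":
    radius = 10.0
    # Pick a step length so the walked circle has the same radius.
    step = 2 * radius * math.sin(math.radians(0.5))
    _, spoke_distance = spoke_program(radius)
    _, walk_distance = walking_program(step)
    print(spoke_distance, walk_distance)
```

<p>The walker never needs to know where the center is, or even what the radius is. The size of the circle falls out of the step length and the turn angle.</p>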

<p>What can I teach a child about computer science? It may be better first to ask what a child can teach <i>me</i>. I don't mean that in the Romantic sense of noble savagery or the wisdom of innocence. Computer science is the study of process: how to do things, how to figure out things, and especially how to figure out how to do things. We have to investigate everything that seems to work, wherever we find it. Visually and viscerally-centered thinking seem like rich veins to explore.</p>

<p>One of Alan Kay's favorite sayings is that having the right point of view is worth 80 IQ points. No matter how clever you are, if you're trying to figure out cube roots on an abacus you are not going to get very far. The right point of view can make the difference between an elegant solution and a pile of steampunk.</p>

<p>Fortunately it cuts both ways. If something you are working on seems to be inelegant, too full of random details and not enough clear principles, that's a sign you might be approaching it with the wrong point of view. Being able to recognize that state and pull yourself out of it is an invaluable skill. Both the hidebound and the feckless are crippled because they can't choose where their minds go next. An interesting, perhaps defining, characteristic of highly creative people is their ability to shift their point of view at will.</p>

<p>Often, especially when learning some large new system, programmers get stuck on something that looks like a bug and ask someone for help. Halfway through explaining their problem they realize where they had gone wrong. <i>Never mind, never mind, it's not a bug, I'm an idiot.</i> At a company I used to work for we had a stuffed bear called the bugbear. Before you could bug anyone else about your bug, you had to explain it to the bugbear. Just thinking to yourself doesn't always work; there is something about vocalizing your mini-mental crisis to another being that engages the brain.</p>

<p>So, back to the question: what can we teach a child about computer science? It isn't an abstract set of laws you learn because it's good for you. Computer science is the study of how to do things --anything-- faster and better. It's more than that, of course. But the practical side is ultimately the source and purpose of all this commotion. (Right?)</p>

<ul>

  <li>Two people painting a house is task parallelism. Since they are doing the same job they each need a bucket and brush. Otherwise they'll waste too much time walking back and forth. They also need to make sure that they paint all of the house and don't paint parts that have already been painted. An easy way to coordinate is to start at opposite ends of the house. Even if one person paints faster than the other, they will eventually meet and be done.</li>

  <li>Washing dishes is a little different. There is only one sink so two people can't wash at once. But one person can wash and another can dry. They form a short "pipeline" that dishes flow through on their way to be cleaned. Let's say it takes 30 seconds to wash a dish and 30 seconds to dry. 50 dishes would take 50 minutes for one person. For two people it would take 25 minutes and 30 seconds. 50 / 2 is 25, not 25.5. The extra half-minute comes from how pipelines work: the dryer can't start until the first dish has been washed.</li>

  <li>As pipelines get longer and more complex, managing delays becomes very important. Everyone has been to or seen videos of factories with their long conveyor belts, robots and people. If one section stops the whole line stops. Even trickier is managing stages that have different speeds. If you can make the candy faster than I can put it into boxes, then candy will build up behind me.</li>

  <li>To wash your clothes right you should separate the colors. A really fun way is to sit in the center of the room and toss darks, lights, and colors into different corners. A lot of high-tech companies gauge engineering candidates by their ability to solve the so-called <a href="http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Sort/Flag/">Dutch National Flag problem</a>. A lot of job candidates fail to figure out the trick, and perhaps fail the interview. I don't think that's a problem with the candidates or the basic idea of whiteboard interviews. I think it's a problem that we think DNF is a hard problem, when it is literally child's play.</li>

  <li>If you've ever seen an old war movie you've heard the characters reading numbers over the radio: <i>One niner five alpha bravo ---repeat--- one niner five alpha bravo!</i> Numbers have to be clear so they say "niner" because "nine" might be mistaken for "five". They use words for letters so no one confuses B with D, P or E. They also repeat important information as an extra safety. It takes longer to communicate this way but getting it right is more important than efficiency. Computers communicate this way too. In fact the theory behind the tradeoff between efficiency and reliability, called Shannon's information theory, is one of the ideas that made modern communications possible. One of the neat things about coding theory is that you can control exactly how efficient or reliable you want something to be.</li>

</ul>
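<p>Two of the examples above are small enough to sketch in code. The first function reproduces the dishwashing arithmetic, assuming each stage handles one dish at a time and dishes can wait between stages. The second sorts laundry in a single pass using the usual three-way partition for the Dutch National Flag problem. The function names and the "dark", "light", "color" labels are my own choices for illustration.</p>

```python
def pipeline_seconds(items, stage_seconds):
    """Total time for `items` to flow through a pipeline of stages.

    The first item pays for every stage; after that, one item
    comes out every time the slowest stage finishes.
    """
    if items == 0:
        return 0
    return sum(stage_seconds) + (items - 1) * max(stage_seconds)


def sort_laundry(pile, order=("dark", "light", "color")):
    """Three-way partition: one pass, swapping items into place."""
    first, last = order[0], order[2]
    items = list(pile)
    low, mid, high = 0, 0, len(items) - 1
    while mid <= high:
        if items[mid] == first:
            # Toss it toward the front corner.
            items[low], items[mid] = items[mid], items[low]
            low += 1
            mid += 1
        elif items[mid] == last:
            # Toss it toward the back corner; re-examine what came back.
            items[mid], items[high] = items[high], items[mid]
            high -= 1
        else:
            mid += 1
    return items


if __name__ == "__main__":
    one_person = 50 * (30 + 30)
    two_people = pipeline_seconds(50, [30, 30])
    print(one_person / 60, "minutes alone,", two_people / 60, "minutes as a pipeline")
    print(sort_laundry(["color", "dark", "light", "dark", "color", "light"]))
```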

<h3>A book of How</h3>

<p>There are a thousand of these little explanations one could come up with. Doubtless you've invented some of your own when learning a new thing or teaching it to someone else. Children do it all of the time, and we do it all of the time with children. But as far as I know they are not collected or connected with more general concepts. It's almost as though people are embarrassed to use such devices as a regular thing.</p>

<p>If you want to know What something is you look it up in the dictionary or encyclopedia. If you want to know Where something is you look at an atlas. When is covered by history books. Why and Who are kind of scattered about, but a lot of the answers are in history books and encyclopedias. All of these tools: encyclopedias, atlases, histories, were <i>invented</i>. They were invented and compiled one halting step at a time in response to need.</p>

<p>Computer science is, possibly, an attempt to write a book of How. We don't really know what it will look like, kids, but you're invited to help us find out.</p>


         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2010/07/corrupting-the-youth.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.eb1f2071bad2f0d7f63c7fb479856d9b06e72991</id>
         <published>2010-05-01T12:00:00.000-08:00</published>
         <updated>2010-05-01T12:00:00.000-08:00</updated>
         <title type="text">What You See Is What You Touch</title>
         <content type="html"><![CDATA[
           

<p>Touchscreens are a genuine Big Deal, but it's hard to appreciate how big. As we'll see below, touchscreens break two core assumptions underneath how we've designed graphical user interfaces to date. I think we've only seen the start of it.</p>

<p>The mental model we use to design programs assumes a "pointer" that has a definite position but occupies no space. The graphic casually called a pointer is just a marker of where this abstract pointer is at any given time. It's a pointer to a pointer, as it were. The pointer moves along a continuous path on the X,Y plane in response to the user manipulating some device in real space. In other words, your computer is an <a href="http://en.wikipedia.org/wiki/Etch_A_Sketch">Etch-A-Sketch</a>.
<span class="sidenote">
  <img src="../../images/pointer.png" class="bordered"/>
</span>
</p>

<p><a href="http://en.wikipedia.org/wiki/Fitts's_law">Fitts's Law</a> (and more generally, <a href="http://en.wikipedia.org/wiki/Steering_law">Steering Law</a>) has strongly influenced UI design. It states, more or less, that bigger targets are easier to hit. The cool thing is that it tells you exactly how much easier and what error rate you can expect for a given size and distance. It's one reason why high-value click targets like menu bars are placed at the edges of the screen: it gives them effectively infinite size because the edge stops the pointer for you. It also explains why deep hierarchical menus suck: the longer the path you have to steer the pointer through, the more likely it is you'll make a mistake.

<span class="sidenote">
  <img src="../../images/submenus.png" class="bordered" />
</span>
</p>
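<p>To make "exactly how much easier" concrete, here is a small sketch using the Shannon formulation of Fitts's Law, MT = a + b log<sub>2</sub>(D/W + 1). The constants a and b have to be measured for a particular device and group of users; the values below are placeholders I made up, so only the relative comparison means anything.</p>

```python
import math


def index_of_difficulty(distance, width):
    """Shannon formulation of Fitts's index of difficulty, in bits."""
    return math.log2(distance / width + 1)


def movement_time_ms(distance, width, a_ms=100.0, b_ms_per_bit=150.0):
    """Predicted time to hit a target.

    a_ms and b_ms_per_bit are hypothetical constants; real values
    come from fitting measurements for a given device.
    """
    return a_ms + b_ms_per_bit * index_of_difficulty(distance, width)


if __name__ == "__main__":
    # Same distance, small target versus large target (units are pixels).
    print(movement_time_ms(600, 20))
    print(movement_time_ms(600, 200))
```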


<p>Computers went through a twenty-year period when displays became larger and denser while pointing technology became more accurate, but didn't change the basic model. We could present more information at once; toolbars grew and sprouted palettes and ribbons and submenus. The presentation of content on screen became more and more faithful to the final output, under the rubric "what you see is what you get". Programmers came to assume that all of these trends would continue, and produced a host of guidelines founded on the little dot that follows an unbroken path:</p>

<ul>
<li>Don't make the user navigate paths that are too long, narrow, or complicated.</li>
<li>Don't place common actions too far away from each other.</li>
<li>"Content" goes in the center with menus, buttons, and toolbars arrayed around it.</li>
<li>If you have more space, fill it with tools. If they can't see it they won't use it.</li>
<li>If it's often used, put it near the edge of the screen.</li>
<li>Keep the locations of things stable.</li>
<li>Corners are ultra-prime locations.</li>
<li>...etc, etc, and so forth</li>
</ul>

<h3>WYSIWYT</h3>
<p>Then cell phones mutated into real pocket computers. So far so good; we can dig into history and new research on how to deal with small screens. It's going to be a pain trying to fit all the information people now expect into a smaller area, but it's possible. But these small screens are also touchable. The union of what you see and what you touch breaks the pointer model and (partially) Steering Law: the "pointer" is no longer a point and its travel no longer continuous. After fifty years of Etch-A-Sketch we get to play with fingerpaint. Not all of the implications are obvious yet, but here are six or seven:
<span class="sidenote">
  <a href="http://www.ingledow.co.uk/2010/01/28/ipad-overview" rel="external nofollow"><img src="../../images/ipad-page-turn.jpg" class="bordered" /></a>
</span>
</p>

<p><b>Steering Law doesn't apply</b> as strictly when the user can tap one point and then another without having to steer through a predetermined path in between. This is demonstrated by the new generation of iPad apps that use pop-out menus in ways that wouldn't work as well on a mouse-based system. The cost of a menu has gone down while the cost of screen real estate has gone up, producing a different solution.</p>

<p><b>Make touch areas more obvious, not less.</b> The user has neither real tactile feedback nor a "hover" state to indicate that something will happen when he taps a particular place. Jakob Nielsen <a href="http://www.useit.com/alertbox/ipad.html">observes</a> that iPad developers are currently partying like it's 1993, throwing all sorts of weird conventions into their apps while imitating print. Resist the temptation. A hallmark of new technology seems to be how it initially imitates or rejects its predecessors, then simulates them, and finally absorbs them. At the moment I'd say we're in between stage 1 and stage 2 with touchscreen interfaces. When you first get your hands on an e-reader with near-print resolution and "pages" you can flip with your fingers, it certainly feels like an apotheosis. <i>Hey</i>, you think, <i>this is an acceptable simulation of a book.</i> The illusion frays when you encounter a newspaper app that clings to some of the more annoying conventions of paper, does bizarre things when you tap a photo, and offers no search. [0]</p>

<p><b>The importance of large click targets goes way up</b> on touch interfaces because we're using our big fat fingers instead of a geometric point. This brings up a fact we can no longer politely ignore: some fingers are fatter than others. Industrial designers know all about anthropometric variation. Anyone programming touch interfaces will have to, too -- or at least some, um, rules of thumb. A controlling factor is the average adult male finger width, about 2cm. Female hands, not to mention the hands of children, average smaller. In practice, touch buttons seem to be usable for most people down to 8 or 9 mm, but not much smaller. [1]
<span class="sidenote">
  <a href="http://www.flickr.com/photos/ari/2192628231/" rel="external nofollow" title="Person typing on a pocket computer"><img src="../../images/iphone-thumbs.jpg"/></a>
</span>
</p>
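<p>Because the rule of thumb is in millimeters and layouts are in pixels, you need the screen density to apply it. A minimal sketch of the conversion follows; the 9 mm default comes from the range above, and the 160 pixels-per-inch figure in the example is just an assumed density, not a measurement of any particular device.</p>

```python
MM_PER_INCH = 25.4


def mm_to_pixels(size_mm, pixels_per_inch):
    """Convert a physical size to pixels at a given screen density."""
    return size_mm / MM_PER_INCH * pixels_per_inch


def is_touchable(width_px, height_px, pixels_per_inch, min_mm=9.0):
    """True if the smaller side of a button is at least `min_mm` wide."""
    needed_px = mm_to_pixels(min_mm, pixels_per_inch)
    return min(width_px, height_px) >= needed_px


if __name__ == "__main__":
    density = 160  # assumed pixels per inch, for illustration
    print(mm_to_pixels(9.0, density))
    print(is_touchable(44, 44, density))
```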

<p>People with smaller hands are generally happier with the software keyboards on mobiles, leading some to speculate on a <a href="http://www.awgh.org/?p=154">conspiracy</a> of pointy-fingered elves. A friend of mine with very large hands has to hold his phone in his left hand and poke delicately at the keys with thumb and forefinger. [2] It's possible that one interface will not be able to accommodate everyone, and why not? We have different-sized mice and chairs and playing cards. It would be interesting to see how application designers experiment with configurable button sizes as they do with font sizes. Some software keyboards make common letters like E and T larger than the rest.</p>

<p>A related problem is how <b>the finger and hand occlude other parts of the screen</b> during interactions. Like those jerks in the front row of a movie theater, your hands get in the way at just the wrong time. Enlarging the thing being pressed is a good workaround, but what about the rest of the screen? There are quite a few first-person perspective games that overlay joystick controls on the main view. This kills peripheral vision: when turning left your hand covers most of the right side of the screen. Game developers know this, and that's why you often find several control schemes to choose from while they test out ideas. I suspect that those kinds of controls will gradually migrate to the bottom third of the screen.
<span class="sidenote">
  <img src="../../images/iphone-key.png" class="bordered" />
</span>
</p>

<p><b>More subtle changes are happening with eye focus.</b> With a full-sized computer both eyes are focused on a single point about half a meter away. Pocket computers tend to be held closer and it's uncomfortable to close-focus for long periods. Also, the occlusion problem is partially solved by defocussing your eyes, so parts of the screen blocked for one eye are visible to the other. It's almost automatic. If you know a fast phone typer you can test this out by watching their eyes as they type, then blocking the screen with <i>your</i> finger. Their eyes will turn slightly inward as they change focus.</p>

<p>That feels somehow wrong to me. Great touchtyping is why people love hardware keyboards. A while ago I prototyped a Morse code "<a href="http://ddotdash.com">keyboard</a>" for my mobile phone to see how it compared to software QWERTY keyboards. It sounds funny, and it is partly a joke, but it's also the minimum thing that could possibly work. With practice you can get fairly good.</p>

<p style="text-align:center;">
<a href="http://ddotdash.com"><img src="../../images/morse.jpg" /></a>
</p>

<p>One thing the Morse experiment taught me was <b>that Fitts's Law didn't go away completely</b>. This is not exactly a new revelation, but there is always some hot zone that is easiest to hit, unique to a given device size and orientation. On a pocket computer in portrait mode the hot zone is at the bottom. In landscape mode it's the area on either side, about two-thirds of the way up where the dit [.] and dah [-] buttons are. On a tablet the hot zone seems to be in the bottom right (or left) quadrant. Even so, the importance of the edges is less for touch than it is for mouse-based systems, because <b>a virtual edge cannot stop your real finger</b>. </p>

<p>A new interaction model may also need to take into account <b>handedness and fatigue</b>. Within minutes of using the first version of my "keyboard" I found an annoying bias towards dits in the Morse alphabet. On a telegraph the dit is three times faster than the dah so naturally Morse code uses them more. The surprising part was how uncoordinated my left hand was and how quickly it got tired. I ended up putting three buttons on each side to balance out the work and to reduce the number of taps per character.</p>
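<p>You can see the bias by counting. This sketch tallies dits and dahs across the 26 letters of International Morse code, and for any piece of text. The plain count ignores how often each letter is actually used, and in English the most common letters tend to have short, dit-heavy codes, so treat the alphabet totals as a lower bound on the imbalance. The function names are mine; this is not the code behind my keyboard.</p>

```python
MORSE = {
    "A": ".-",   "B": "-...", "C": "-.-.", "D": "-..",  "E": ".",
    "F": "..-.", "G": "--.",  "H": "....", "I": "..",   "J": ".---",
    "K": "-.-",  "L": ".-..", "M": "--",   "N": "-.",   "O": "---",
    "P": ".--.", "Q": "--.-", "R": ".-.",  "S": "...",  "T": "-",
    "U": "..-",  "V": "...-", "W": ".--",  "X": "-..-", "Y": "-.--",
    "Z": "--..",
}


def alphabet_totals(table=MORSE):
    """Count dits and dahs over the whole alphabet, each letter once."""
    dits = sum(code.count(".") for code in table.values())
    dahs = sum(code.count("-") for code in table.values())
    return dits, dahs


def taps(text, table=MORSE):
    """Count the dits and dahs needed to type `text` (letters only)."""
    codes = [table[ch] for ch in text.upper() if ch in table]
    dits = sum(code.count(".") for code in codes)
    dahs = sum(code.count("-") for code in codes)
    return dits, dahs


if __name__ == "__main__":
    print(alphabet_totals())
    print(taps("the quick brown fox"))
```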

<p><b>And then you have multitouch</b>, which knocks "mouse gestures" into a cocked hat, provided we can figure out how to use it effectively. Pinch/expand and rotate are very useful for controlling the "Z axis" perpendicular to the surface. There are also apps to simulate sound boards, pianos, and of course keyboards. Interestingly, multitouch doesn't break any design paradigms I can think of. It <i>replaces</i> a lot of them like resize and rotate "handles". Swipe can be (and is) overused but it's a natural replacement for pagination and scrolling. Using a two-finger tap as a replacement for "right click" to bring up context menus seems to be another home run. There are probably a lot of natural places for it as a modifier signal. For example, a drawing program might allow you to paint with two fingers and control brush size by the distance between them. It's not an unmixed blessing. Multitouch makes it harder to ignore accidental input from the palm and edges of the hand, which means the user can't treat a touch tablet as casually as paper just yet.</p>

<p>Touch interfaces remove one of the last physical barriers between users and digital data. Instead of manipulating data with a cartoon of a hand we control with silly instruments, we can poke it with our real fingers. This is deeply satisfying in a monkey kind of way. It gives programmers strange new problems and responsibilities. We will have to become amateur industrial designers just as we became amateur typographers, linguists, and psychologists. It puts us much closer, physically and emotionally, to the person on the other side of the glass. As users touch our programs, our programs are touching back.

<span class="sidenote">
  <img src="../../images/girl-mirror.jpg" class="bordered" title="Courtesy gemsling on Flickr" />
</span>
</p>

<h3>Notes</h3>
<p>[0] Stage 3 is when the new technology stops trying too hard to simulate older technology, and instead directly addresses (or renders moot) the underlying need, using its unique advantages. Stage 3 is usually gradual. My computer still calls its background layer "the Desktop", years after real desks as such (with drawers, pictures, clocks, inboxes, etc) distilled down to the humble table that holds my computer off the floor. Everything else has been sucked inside and transformed.</p>

<p>[1] This kind of stuff is fascinating, if you're the kind of person who's fascinated by this kind of stuff: "Thai females tended to have wider and thicker fingers but narrower knuckles than the females from Hong Kong, Britain, and India."<br/>
 -- "Hand Anthropometry of Thai Female Industrial Workers", by N Saengchaiya and Y Bunterngchit, <i>Journal of KMITNB</i>, Vol 14, No 1, Jan 2004.</p>

<p>[2] Another friend of mine, who develops touchscreen apps for a living, tells me that finger size doesn't matter all that much. But I notice <i>his</i> fingers are rather pointy and elf-like...
</p>


         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2010/05/wysiwyt.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.1bf12b75aa49ce747fe0e5d916a8f9624803c18c</id>
         <published>2010-04-01T12:00:00.000-08:00</published>
         <updated>2010-04-01T12:00:00.000-08:00</updated>
         <title type="text">Sweet Justice: justified text for the web</title>
         <content type="html"><![CDATA[
           
<div class="dt">23 April 2010</div>

      <p>
        <a href="http://github.com/aristus/sweet-justice">Sweet Justice</a> is a JavaScript library you can drop onto any web page to create beautiful justified text. Even supercalifragilisticexpialidocious.
      </p>

      <p>Sweet Justice lovingly inserts the obscure yet wonderful soft hyphen into the text of any element marked with the <b>sweet-justice</b> class, and turns on <a href="http://www.w3.org/TR/css3-text/#justification">CSS text justification</a>. It requires either jQuery or YUI3 to function.
      </p>
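      <p>To give a feel for what inserting soft hyphens means, here is a deliberately naive sketch. It is not the algorithm Sweet Justice uses: it simply offers a break every few characters in long words, whereas good hyphenation depends on the language. The function name, the minimum word length, and the chunk size are all assumptions for illustration.</p>

```python
import re

SOFT_HYPHEN = "\u00ad"  # invisible unless the browser breaks the line there


def soften(text, min_word_len=8, chunk=4):
    """Insert soft hyphens into long words at fixed intervals."""

    def break_word(match):
        word = match.group(0)
        pieces = [word[i:i + chunk] for i in range(0, len(word), chunk)]
        # Avoid stranding one or two letters after the last break.
        if len(pieces) > 1 and len(pieces[-1]) < 3:
            pieces[-2:] = [pieces[-2] + pieces[-1]]
        return SOFT_HYPHEN.join(pieces)

    pattern = r"[A-Za-z]{%d,}" % min_word_len
    return re.sub(pattern, break_word, text)


if __name__ == "__main__":
    softened = soften("Even supercalifragilisticexpialidocious.")
    print(softened.replace(SOFT_HYPHEN, "|"))
```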

     <p>
       Enjoy!<br/>
       &nbsp;&nbsp; - <a href="http://carlos.bueno.org">Carlos</a>
     </p>

     <hr/>

      <p><a href="http://en.wikipedia.org/wiki/Justification_(typesetting)">From Wikipedia</a>: Justification has been the preferred setting of type in many western languages through the history of movable type. This is due to the classic Western manuscript book page being built of a column or two columns, which is considered to look "best" if it is even-margined on the left and right. The classical Western column did not rigorously justify, but came as close as feasible when the skill of the penman and the character of the manuscript permitted. Historically, both scribal and typesetting traditions took advantage of abbreviations (sigla), ligatures, and swash to help maintain the rhythm and colour of a justified line.</p>

      <p>The use of movable type solidified this preference from a technological point of view. It was much easier to handle and make emendations to large amounts of type that had words or syllables at the ends of lines than it was to respace the ends of lines.</p>

      <p>Its use has only waned somewhat since the middle of the 20th century through the advocacy of the typographer Jan Tschichold's book Asymmetric Typography and the freer typographic treatment of the Bauhaus, Dada, and Russian constructivist movements.</p>

      <p>Not all "flush left" settings in traditional typography were identical. In flush left text, words are separated on a line by the minimum word spacing built into the font.</p>

      <p>Continuous casting typesetting systems such as the Linotype were able to reduce the jaggedness of the right-hand sides of adjacent lines of flush left composition by inserting self-adjusting space bands between words to evenly distribute white space, taking excessive space that would have occurred at the end of the line and redistributing it between words.</p>

      <p>This feature, known as "ragged right" or "in and out ragged", was available in traditional dedicated typesetting systems but is absent from most if not all desktop publishing systems. Graphic designers and typesetters using desktop systems adjust word and letter spacing, or "tracking", on a manual line-by-line basis to achieve the same effect.</p>



         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2010/04/sweet-justice.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.83f7416e3877a88f4d9181ec436b03c47b6122c9</id>
         <published>2010-04-01T12:00:00.000-08:00</published>
         <updated>2010-04-01T12:00:00.000-08:00</updated>
         <title type="text">A Dismal Guide to Concurrency</title>
         <content type="html"><![CDATA[
           
<div class="dt">12 April 2010</div>

<p>
Two people can paint a house faster than one can. Honeybees work independently but <a href="http://en.wikipedia.org/wiki/Bee_learning_and_communication">pass messages</a> to each other about conditions in the field. Many forms of concurrency, so obvious and natural in the real world, are actually pretty alien to the way we write programs today. It's much easier to write a program assuming that there is one processor, one memory space, sequential execution and a God's-eye view of the internal state. Language is a tool of thought as much as a means of expression, and the mindset embedded in the languages we use can get in the way.

  <span class="sidenote">
    (Originally on <a href="http://www.facebook.com/notes/facebook-engineering/a-dismal-guide-to-concurrency/379717628919">Facebook's Engineering blog</a>.)
    <br/><br/>

<a href="http://www.flickr.com/photos/shellysblogger/4107689065/" rel="external nofollow"><img src="../../images/housepainters.jpg" /></a>
    <br/><br/>

    <q>The slovenliness of our language makes it easier for us to have foolish thoughts. The point is that the process is reversible.</q>
    <br/>-- George Orwell
    <br/><i><a href="http://www.mtholyoke.edu/acad/intrel/orwell46.htm">Politics and the English Language</a></i>
    <br/><br/>

    <q>That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted.</q>
    <br/>-- George Boole
    <br/><i><a href="http://www.gutenberg.org/etext/15114">The Laws of Thought</a></i>
  </span>
</p>

<p>We're going through an inversion of scale in computing which is making parallelism and concurrency much more important. Single computers are no longer fast enough to handle the amounts of data we want to process. Even within one computer the relative speeds of processors, memory, storage, and network have diverged so much that they often spend more time waiting for data than doing things with it. The processor (and by extension, any program we write) is no longer a Wizard of Oz kind of character, sole arbiter of truth, at the center of everything. It's only one of many tiny bugs crawling over mountains of data.</p>



<h3>Many hands make light work</h3>
<p>
A few years ago Tim Bray decided to find out where things stood. He put a computer on the internet which contained over 200 million lines of text in one very large file. Then he challenged programmers to write a program to do some simple things with this file, such as finding the ten most common lines which matched certain patterns. To give you a feel for the simplicity of the task, Bray's example program employed one sequential thread of execution and had 78 lines of code, something you could hack up over lunch.
</p>
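<p>To show how small the sequential version of such a task can be, here is a sketch of the "ten most common matching lines" part. This is my own illustration, not Bray's program; the function name is made up, and the file name and pattern in the usage comment are hypothetical.</p>

```python
import re
from collections import Counter


def top_matches(lines, pattern, n=10):
    """Return the n most common lines that match a regular expression."""
    regex = re.compile(pattern)
    counts = Counter(
        line.rstrip("\n") for line in lines if regex.search(line)
    )
    return counts.most_common(n)


# Usage with a file, read one line at a time (hypothetical name and pattern):
#
#     with open("access.log") as log:
#         print(top_matches(log, r"GET /articles/"))
```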

<p>The computer was unusual for the time: it had 32 independent hardware threads which could execute simultaneously. The twist of the <a href="http://wikis.sun.com/display/WideFinder/The+Benchmark" rel="external nofollow">WideFinder challenge</a> was that your program had to use <i>all</i> of those threads at once to speed up the task, while adding as little code as possible. The purpose was to demonstrate how good or bad everyday programming is at splitting large jobs into parallel tracks.</p>

<p>

<i>How hard could it be?</i> I thought. <a href="http://groups.google.com/group/wide-finder/browse_thread/thread/3b7ceb374939bd44" rel="external nofollow">Very hard</a>, as it happened. I got up to 4 parallel processes before my program collapsed under its own weight. The crux of the problem was that the file was stored on a hard drive. If you've never peeked inside a hard drive, it's like a record player with a metal disc and a magnetic head instead of a needle. Just like a record it works best when you "play" it in sequence, and not so well if you keep moving the needle around. And of course it can only play one thing at a time. So I couldn't just split the file into 32 chunks and have each thread read a chunk simultaneously. One thread had to read from the file and then dole out parts of it to the others. It was like trying to get 31 housepainters to share the same bucket.

  <span class="sidenote">
    <a href="http://www.flickr.com/photos/kubina/326628429" rel="external nofollow"><img src="../../images/hard-drive.jpg"/></a>
    <br/><br/>
    <a href="http://en.wikipedia.org/wiki/Parallel_computing">Parallelism</a> is the act of taking a large job, splitting it up into smaller ones, and doing them at once. People often use "parallel" and "concurrent" interchangeably, but there is a subtle difference. Concurrency is necessary for parallelism but not the other way around. If I alternate between cooking eggs and pancakes I'm doing both concurrently. If I'm cooking eggs while you are cooking pancakes, we are cooking concurrently and in parallel. Technically if I'm cooking eggs and you are mowing the lawn we are also working in parallel, but since no coordination is needed in that case there's nothing to talk about.
  </span>

</p>
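<p>The "one reader, many workers" shape can be sketched as below. It is not my WideFinder entry, just an illustration with made-up names. Note the assumptions: Python threads will not give much real speedup for CPU-bound matching, so a process pool would be the more likely choice in practice, and this version queues every chunk up front, which is fine for a demonstration but not for a file with 200 million lines.</p>

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def count_chunk(lines, pattern):
    """Worker: count matching lines in one chunk."""
    regex = re.compile(pattern)
    return Counter(
        line.rstrip("\n") for line in lines if regex.search(line)
    )


def read_chunks(lines, chunk_size):
    """The single reader: hand out the input in sequence, a chunk at a time."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def parallel_top_matches(lines, pattern, workers=4, chunk_size=10000, n=10):
    """Fan chunks out to workers, then merge their partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(count_chunk, chunk, pattern)
            for chunk in read_chunks(lines, chunk_size)
        ]
        for future in futures:
            total.update(future.result())
    return total.most_common(n)
```

<p>Even in this toy version most of the code is about moving data around rather than counting lines.</p>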

<p>When I looked at other people's entries for hints I was struck by how almost all of them, good and bad, looked complicated and <a href="http://carlos.bueno.org/2010/03/steampunk-and-you.html">steampunky</a>. Part of that was my unfamiliarity with the techniques, but another part was the lack of good support for parallelism which forced people to roll their own abstractions. (Ask four programmers to create a new abstraction and you'll get five and a half answers.) The <a href="http://eigenclass.org/hiki/widefinder2-conclusions">pithiest entry</a> was 130 lines of OCaml, a language that makes it somewhat easier to hack together "parallel I/O". Most of the others were <a href="http://wikis.sun.com/display/WideFinder/Results">several hundred lines</a> long. Many people like me were not able to complete the challenge at all. If it's this difficult to parallelize a trivial string-counting program, what makes us think we're doing it right in complex ones?</p>

<p>
Ideally, concurrency shouldn't leak into the logic of programs we're trying to write. Some really smart people would figure out the right way to do it. They would write papers with lots of equations in them and fly around to conferences for a few years until some other smart people figured out what the hell they were saying. Those people would go develop libraries in our favorite programming languages. Then we could just put <code>import concurrent;</code> at the top of our programs and be on our way. Concurrency would be another thing we no longer worry about unless we want to, like memory management. Unfortunately there is evidence that it won't be this clean and simple. A lot of things we take for granted may have to change.
  <span class="sidenote">
    The switch to memory management wasn't all that easy either, come to think of it.
  </span>
</p>

<p>There are at least two concurrency problems to solve: how to get many components inside one computer to cooperate without stepping all over each other, and how to get many computers to cooperate without drowning in coordination overhead. These may be special cases of a more general problem and one solution will work for all. Or perhaps we'll have one kind of programming for the large and another for the small, just as the mechanics of life are different inside and outside of the cell.</p>

<p>
At the far end of the spectrum are large distributed databases, such as those used by search engines, online retailers, and social networks. These things are enormous networks of computers that work together to handle thousands of writes and hundreds of thousands of reads every second. Adding more machines to the system raises the odds that one of them will fail at any moment. There is also the chance that a link between groups of machines will fail, cutting the brain in half until it is repaired. There is a tricky balance between being able to read from such a system <i>consistently</i> and <i>quickly</i> and writing to it <i>reliably</i>. The situation is summed up by the <a href="http://www.julianbrowne.com/article/viewer/brewers-cap-theorem">CAP Theorem</a>, which states that large systems have three desirable but conflicting properties: Consistency, Availability, and Partition-tolerance. You can only optimize for two at the expense of the third.
<span class="sidenote"><img src="../../images/triangle.png" /></span>
</p>

<p>
<span class="fright noborder"><img src="../../images/consistent-available.png"></span>
A <b>Consistent/Available</b> system means that reading and writing always
works the way you expect, but requires a majority or quorum of nodes to be
running in order to function. Think of a parliament that must have more than
half of its members present in order to hold a vote. If too many can't make it,
say because a flood washes out the bridge, a quorum can't be formed and
business can't proceed. But when enough members are in communication the
decision-making process is fast and unambiguous.
  <span class="sidenote">
    These categories are not rigidly exclusive. The status report problem is usually handled by having hierarchies of supervisors and employees aka "reports". The gossip consistency problem can be helped by tagging data with timestamps or version numbers so you can reconcile conflicting values. The interesting thing about Amazon's Dynamo is not eventual consistency but <i>configurable</i> consistency.
  </span>

</p>
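<p>The parliament rule is easy to state in code. The sketch below is a toy under stated assumptions: nodes are plain dictionaries with hypothetical <code>"up"</code> and <code>"data"</code> fields, and <code>quorum_write</code> is an invented name, not the API of any real database. It shows only the majority check: with enough members reachable the write goes through, and without them business stops.</p>

```python
def has_quorum(reachable, cluster_size):
    """True when strictly more than half of the nodes can be reached."""
    return reachable > cluster_size // 2


def quorum_write(nodes, key, value):
    """Apply a write only if a majority of nodes is up; otherwise refuse.

    Each node is a dict like {"up": True, "data": {}} (a made-up shape
    used for illustration only).
    """
    up = [n for n in nodes if n["up"]]
    if not has_quorum(len(up), len(nodes)):
        raise RuntimeError("no quorum: refusing write")
    for n in up:
        n["data"][key] = value
    return len(up)
```

<p>Refusing the write when the bridge is washed out keeps the answers unambiguous, and it is also what makes the system stop.</p>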
<p>
<span class="fright noborder"><img src="../../images/consistent-partitionable.png"></span>
<b>Consistent/Partitionable</b> means that the system can recover from failures, but requires so much extra coordination that it collapses under heavy use. Imagine having to send and receive a status report for every decision made at your company. You'll always be current, and when you come back from vacation you will never miss a thing, but making actual progress would be very slow.</p>

<p>
<span class="fright noborder"><img src="../../images/available-partitionable.png"></span>
<b>Available/Partitionable</b> means that you can always read and write values, but the values you read might be out of date. A classic example is gossip: at any point you might not know the latest on what Judy said to Bill but eventually word gets around. When you have new gossip to share you only have to tell one or two people and trust that in time it will reach everyone who cares. Spreading gossip among computers is a bit more reliable because they are endlessly patient and (usually) don't garble messages.</p>
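<p>Gossip among computers can be sketched as follows. This is a simplified model, assuming each node stores <code>(version, value)</code> pairs and pushes everything it knows to one randomly chosen peer per round; the function names are invented for the example and it is not the protocol of any particular system. Version numbers are used to reconcile conflicting values, so a stale rumor never overwrites a fresher one.</p>

```python
import random


def merge(local, remote):
    """For each key, keep whichever (version, value) pair has the higher version."""
    for key, (version, value) in remote.items():
        if key not in local or version > local[key][0]:
            local[key] = (version, value)


def gossip_round(nodes, rng):
    """Every node tells one randomly chosen peer everything it knows."""
    for i, node in enumerate(nodes):
        peer = rng.choice([j for j in range(len(nodes)) if j != i])
        merge(nodes[peer], node)


def rounds_until_consistent(nodes, rng, max_rounds=1000):
    """Gossip until all nodes agree; return the round count, or None on timeout."""
    for r in range(1, max_rounds + 1):
        gossip_round(nodes, rng)
        if all(n == nodes[0] for n in nodes):
            return r
    return None
```

<p>Between rounds, any node you ask may give you an old answer. No node ever has to wait for another before accepting a read or a write.</p>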

<p>After lots of groping around with billions of dollars of revenue at stake, people who build these large systems are <a href="http://queue.acm.org/detail.cfm?id=1466448">coming to the conclusion</a> that it's most important to always be able to write to a system quickly and read from it even in the face of temporary failures. Stale data is a consequence of looser coupling and greater autonomy needed to make that possible. It's uncomfortable to accept the idea that as the computing power of an Available/Partitionable system scales up, the <a href="http://en.wikipedia.org/wiki/Fog_of_war">fog of war</a> descends on consistency, but in practice it's not the end of the world.</p>

<p>This was neither a whimsical nor an easy choice. Imagine Ebenezer Scrooge is making so much money that Bob Cratchit can't keep up. Scrooge needs more than one employee to receive and count it. To find out the grand total of his money at any point, he has to ask each of them for a subtotal. By the time Scrooge gets all the answers and adds them up, his employees have counted more money, and his total is already out of date. So he tells them to stop counting while he gathers subtotals. But this wastes valuable working time. And what if Scrooge adds another counting-house down the street? He'll have to pay a street boy, little Sammy Locke, to a) run to the other house and tell them to stop counting, b) gather their subtotals, c) deliver them to Scrooge, then d) run back to the other house to tell them to resume counting. What's worse, his customers can't pay <i>him</i> while this is happening. As his operation gets bigger Scrooge is faced with a growing tradeoff between stale information and halting everything to wait on Locke. If there's anything Scrooge likes less than old numbers, it's paying people to do nothing.

<span class="sidenote"><a href="http://en.wikipedia.org/wiki/Relativity_of_simultaneity"><img src="../../images/lightcone.png"></a></span>

</p>

<p>
Scrooge's dilemma is forced upon him by <a href="http://en.wikipedia.org/wiki/Relativity_of_simultaneity">basic physics</a>. You can't avoid it by using electrons instead of street urchins. It's impossible for an event happening in one place (eg data changing inside one computer or process) to affect any other place (eg other computers or processes) until the information has had time to travel between them. Where those delays are small relative to performance requirements, Scrooge can get away with <a href="http://en.wikipedia.org/wiki/Two-phase_locking">various forms of locking</a> and enjoy the <i>illusion</i> of a shared, consistent memory space. But as programs spread out over more and more independent workers, the complexity needed to maintain that illusion begins to overwhelm everything else.

  <span class="sidenote">
    This is not about speed-of-light effects or anything like that. I'm only talking about <a href="http://en.wikipedia.org/wiki/Absolute_time_and_space#Historical_controversy">reference frames</a> in the sense of "old news", such as when you find out your cousin had gotten married last year. Her wedding and your unawareness are both "true" relative to your reference frames until you receive news to the contrary.
  </span>

</p>

<h3>Scratch that, reverse it</h3>

<p>Shared memory can be pushed fairly far, however. Instead of explicit locks, <a href="http://clojure.org/refs">Clojure</a> and many newer languages use an interesting technique called <a href="http://en.wikipedia.org/wiki/Software_transactional_memory">software transactional memory</a>. STM simulates a sort of post-hoc, fine-grained, implicit locking. Under this scheme semi-independent workers, called threads, read and write to a shared memory space as though they were alone. The system keeps a log of what they have read and written. When a thread is finished the system verifies that no data it <i>read</i> was changed by any other. If so the changes are committed. If there is a conflict the transaction is aborted, changes are rolled back and the thread's job is retried. While threads operate on non-overlapping parts of memory, or even non-overlapping parts of the same data structures, they can do whatever they want without the overhead of locking. In essence, transactional memory allows threads to ask for forgiveness instead of permission.</p>
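<p>Here is a minimal sketch of the ask-forgiveness idea, under several assumptions. It is not how Clojure or any production STM is implemented. The names <code>Ref</code>, <code>Transaction</code>, and <code>atomically</code> are invented for the example. Writes are buffered privately and applied at commit, so a rollback is just throwing the buffer away (a simplification of the log-and-undo scheme described above). A single lock guards the brief validate-and-commit step, which a real system would work hard to avoid.</p>

```python
import threading


class Ref:
    """A shared cell with a version number bumped on every committed write."""

    def __init__(self, value):
        self.value = value
        self.version = 0


class Conflict(Exception):
    """Raised at commit when something this transaction read has changed."""


# Serializes only the short validate-and-commit step, not the job itself.
_commit_lock = threading.Lock()


class Transaction:
    def __init__(self):
        self.reads = {}    # Ref -> version seen at first read
        self.writes = {}   # Ref -> pending value, invisible until commit

    def read(self, ref):
        if ref in self.writes:             # see our own pending write
            return self.writes[ref]
        with _commit_lock:                 # read value and version together
            value, version = ref.value, ref.version
        self.reads.setdefault(ref, version)
        return value

    def write(self, ref, value):
        self.writes[ref] = value

    def commit(self):
        with _commit_lock:
            # Verify that nothing we read was changed by anyone else.
            for ref, seen in self.reads.items():
                if ref.version != seen:
                    raise Conflict()
            for ref, value in self.writes.items():
                ref.value = value
                ref.version += 1


def atomically(job):
    """Run job(tx) and retry from scratch until it commits without conflict."""
    while True:
        tx = Transaction()
        result = job(tx)
        try:
            tx.commit()
            return result
        except Conflict:
            continue   # discard the buffered writes and try again
```

<p>A job such as <code>lambda tx: tx.write(counter, tx.read(counter) + 1)</code> can be handed to <code>atomically</code> from many threads at once. Because the job may run several times before it commits, it must not do anything that cannot be repeated.</p>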

<p>
As you might have guessed from those jolly hints about conflict and rollback, STM has its own special problems, like how to perform those abort/retry cycles efficiently on thousands of threads. It's fun to imagine pathological conflict scenarios in which long chains of transactions unravel like a cheap sweater. Also, STM can only handle actions that are undoable. You can't retry most kinds of I/O for the same reason you can't rewind a live concert. This is handled by queueing up any non-reversible actions, performing them outside of the transaction, caching the result in a buffer, and replaying as necessary. Read that sentence again.
  <span class="sidenote">
    There is a recent paper about an <a href="http://people.csail.mit.edu/mareko/transact09-lev.pdf">interesting variation</a> on this theme called HyTM, which appears to do a copy-on-write instead of performing writes to shared memory.
  </span>
</p>

<h3>Hold this thread as I walk away</h3>

<p>
Undeniably awesome and clever as STM threads are, I'm not convinced that
shared memory makes sense outside of the "cell membrane" of a single
computer. Throughput and latency <i>always</i> have the last laugh. A concurrent
system is fundamentally limited by how often processes have to coordinate
and the time it takes them to do so. As of this writing computer memory can
be accessed in about 100 nanoseconds. Local network latency is measured in
microseconds to milliseconds. Schemes that work well at local memory speeds
don't fly over a channel one thousand times slower. Throughput is a problem
too: memory can have one hundred times the throughput of network, and is
shared among at most a few dozen threads. A large distributed database can
have tens of thousands of independent threads contending for the same
bandwidth.

  <span class="sidenote">
    <a href="http://www.flickr.com/photos/daquellamanera/201509257/"><img src="../../images/messenger-pigeon.png"></a>
  </span>

</p>
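<p>To put rough numbers on that limit, here is a back-of-the-envelope sketch. It assumes the figures quoted above (about 100 nanoseconds for memory, a channel one thousand times slower for the network) and the simplest possible model, in which each coordination step must wait for the previous one to finish. The function name is made up for the example.</p>

```python
MEMORY_LATENCY_S = 100e-9                      # "about 100 nanoseconds"
NETWORK_LATENCY_S = MEMORY_LATENCY_S * 1000    # "one thousand times slower"


def max_sequential_steps_per_second(latency_s):
    """Upper bound on coordination steps per second if each waits for the last."""
    return 1.0 / latency_s


memory_rate = max_sequential_steps_per_second(MEMORY_LATENCY_S)
network_rate = max_sequential_steps_per_second(NETWORK_LATENCY_S)
print(f"memory:  about {memory_rate:,.0f} steps per second")
print(f"network: about {network_rate:,.0f} steps per second")
```

<p>Under this model a lock-step scheme that manages on the order of ten million handoffs per second in memory gets on the order of ten thousand over the network, before contention for bandwidth is even considered.</p>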

<p>If we can't carry the shared-memory model outside of the computer, is there some other model we can bring inside? Are threads, ie semi-independent workers that play inside a shared memory space, absolutely necessary? In his "<a href="http://caml.inria.fr/pub/ml-archives/caml-list/2002/11/64c14acb90cb14bedb2cacb73338fb15.en.html">standard lecture</a>" on threads Xavier Leroy details three reasons people use them:</p>

<ul>
  <li>Shared-memory parallelism using locks or transactions. This is explicitly disowned in both <a href="http://erlang.org">Erlang</a> and Leroy's OCaml in favor of message-passing. His argument is that it's too complex, especially in garbage-collected languages, and doesn't scale. (This was in 2002 when multiprocessors were rare and strange beasts.)</li>

  <li>Overlapping I/O and computation, ie while thread A is waiting for data to be sent or received, threads B-Z can continue their work. Overlapping (aka non-blocking I/O) is needed to solve problems like WideFinder efficiently. This is often thwarted by low-level facilities inside the operating system that were written without regard to parallelism. Leroy thinks this should be fixed at the OS level instead of making every program solve it again and again.</li>

  <li><a href="http://members.verizon.net/olsongt/stackless/why_stackless.html#coroutines">Coroutines</a>, which allow different functions to call each other repeatedly without generating an infinitely long stack of references back to the first call. This looks suspiciously like <a href="http://en.wikipedia.org/wiki/Message_passing">message-passing</a>.</li>
</ul>

<p>
Message-passing, which <a href="http://lists.squeakfoundation.org/pipermail/squeak-dev/1998-October/017019.html">first appeared</a> in Smalltalk, is the core abstraction of Joe Armstrong's programming language Erlang. Erlang programs do things that make programmers take notice, like run some of the busiest telephone switches for years without fail. It approaches concurrency with <a href="http://armstrongonsoftware.blogspot.com/2006/08/concurrency-is-easy.html">three iron rules</a>: no shared memory even between processes on the same computer, a standard format for messages passed between processes, and a guarantee that messages are read in the order in which they were received. The first rule is meant to avoid the heartaches described above and embraces local knowledge over global state. The second and third keep programmers from endlessly reinventing schemes for passing messages between processes. Every Erlang process has sovereign control over its own memory space and can only affect others by sending well-formed messages. It's an elegant model and happens to be a cleaned-up version of the way the <a href="http://www.rfc-editor.org/rfc/rfc1958.txt">internet itself</a> is constructed. Message-passing is already one of the axioms of concurrent <i>distributed</i> computation, and may well be universal.
  <span class="sidenote">
    <a href="http://www.flickr.com/photos/lorelei-ranveig/2294885420/"><img src="../../images/neuron.jpg" border="0" /></a>
    <br/><br/>
    <a href="http://www.google.com/search?q=erlang+%22nine+nines%22">A lot of writeups</a> repeat a "nine nines", ie 99.9999999% reliability claim for Erlang-based Ericsson telephone switches owned by British Telecom. This works out to <i>31 milliseconds</i> of downtime per <i>year</i>, which hovers near the edge of measurability, not to say plausibility. I was present at a talk Armstrong gave in early 2010 during which he was asked about this. There was a little foot shuffling as he qualified it: it was actually 6 or so seconds of downtime in one device during a code update. Since BT had X devices over Y years, they calculated it as 31ms of average downtime per device per year. Or something like that. Either way it's an impressive feat.
  </span>


</p>
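<p>The three rules can be imitated, loosely, in other languages. The sketch below is an assumption-laden imitation in Python, not Erlang's actual machinery: <code>Process</code> and <code>counter</code> are invented names, messages are plain tuples standing in for a standard format, and a FIFO queue stands in for the mailbox. Python cannot enforce the no-shared-memory rule; here it is only a convention that nothing but the process's own thread touches its state.</p>

```python
import queue
import threading


class Process:
    """An actor-like worker: private state, a mailbox, and nothing shared."""

    def __init__(self, handler, state):
        self._mailbox = queue.Queue()   # FIFO: messages handled in arrival order
        self._handler = handler
        self._state = state             # touched only by this process's thread
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        """The only way to affect this process from outside."""
        self._mailbox.put(message)

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message == ("stop",):
                break
            self._state = self._handler(self._state, message)

    def join(self):
        self._thread.join()


def counter(state, message):
    """Handler for a counting process; returns the new state."""
    kind = message[0]
    if kind == "add":
        return state + message[1]
    if kind == "get":
        message[1].put(state)           # message[1] is the caller's reply mailbox
        return state
    return state                        # unknown messages are ignored


if __name__ == "__main__":
    p = Process(counter, 0)
    for n in (1, 2, 3):
        p.send(("add", n))
    reply = queue.Queue()
    p.send(("get", reply))
    print(reply.get())
    p.send(("stop",))
    p.join()
```

<p>Even the request for the current total is a message, and the answer comes back as another message. No one reaches into the process to look.</p>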

<p>There are probably <a href="http://conal.net/blog/posts/can-functional-programming-be-liberated-from-the-von-neumann-paradigm/">more axioms to discover</a>. Languages become more powerful as abstractions are made explicit and standardized. Message-passing says nothing about optimizing for locality, ie making sure that processes talk with other processes that are located nearby instead of at random. It might be cool to have a <a href="http://www.arctic.umn.edu/papers/comm-char.pdf">standard way</a> to measure the locality of a function call. Languages become even more powerful when abstractions are made first-class entities. For example, languages that can pass functions as arguments to other functions can generate new types of higher-order functions without the programmer having to code them by hand. A big part of distributed computing is designing good protocols. I know of no language that allows <i>protocols</i> as first-class entities that can be passed around and manipulated like functions and objects are. I'm not even sure what that would look like but it might be interesting to try out.</p>

<p>There is a lot of sound and fury around parallelism and concurrency. I don't know what the answer will be. I personally suspect that a relaxed, shared-memory model will work well enough within the confines of one computer, in the way that Newtonian physics works well enough at certain scales. A more austere model will be needed for a small network of computers, and so on as you grow. Or perhaps there's something out there that will make all this lockwork moot.</p>



         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2010/04/dismal-guide-to-concurrency.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.5c8aacba9c59dc2159384f104d3380ec0aa2b01b</id>
         <published>2010-03-01T12:00:00.000-08:00</published>
         <updated>2010-03-01T12:00:00.000-08:00</updated>
         <title type="text">Steampunk, Progress, and You</title>
         <content type="html"><![CDATA[
           

<p><a href="http://en.wikipedia.org/wiki/Phoenix_(mythology)"><img src="../../images/phoenix.jpg" class="fright" border="0" title="Drawing of a phoenix from the Aberdeen Bestiary."/></a>
Breakthroughs are technological advances which change how people think. It's almost a given in popular culture that breakthroughs can't be predicted or characterized. To merely conceive one is to change the world it is born into. That's a very <i>mythological</i> thing to believe about the advancement of human knowledge. Are breakthroughs really immune to analysis?</p>

<p>Suppose that breakthroughs, like the phoenix, do keep their own schedules. We can still examine their nesting sites for clues. Some breakthroughs happen when the demand on a technology overwhelms regular innovation's ability to deliver. This happens in a big way during wartime and on a smaller scale every day. When I discover some new way to look at a program I'm writing, it's only after I've tried to stuff up the cracks with hasty code and deranged notes to myself. Maybe that's a formula: look for vital technologies that seem overly complex and burdened with new words.</p>


<p><a href="http://en.wikipedia.org/wiki/File:Stepper_detail.jpg"><img src="../../images/strowger-switch.jpg" class="fleft" border="0" title="A Strowger switch, used in early switched telephony." /></a>
"Overly complex" and "burdened with words" are hallmarks of the faux-historical <a href="http://en.wikipedia.org/wiki/Steampunk#Art_and_design">Steampunk</a> style and the predigital technology it caricatures. I think we are especially fascinated by Steampunk machines because the mechanisms of control are obvious and human-scaled. This lever goes like so, the cable pulls the thingabob thusly, causing the flywheel to spin, etc. Like all caricatures, Steampunk aesthetic exaggerates the features that make its subject distinctive. Behind the brass fittings and oddly-named components is a portrait of the mind which thought them necessary: ingenious mechanical design combined with astonishing ignorance about the principles of process control those designs embody. There is real tension there, a naive insistence that if something is important it must also be big enough to see.</p>

<h3>Steampunk is ever with us</h3>

<p><a href="http://www.flickr.com/photos/druclimb/314571349/" rel="external nofollow"><img src="../../images/steampunk-vcr.jpg" class="fright" title="Detail of some gears and linkages from an old VCR." /></a>
As a wee lad in the last century I had a job repairing video cassette recording devices. Dozens of parts had to be properly greased, timed, or roughened up to get the right amount of tension and smooth unwinding. They had names like the Reverse/Forward Idler, Flywheel Belt, Pinch Roller, and Flying Erase Head. There was a long coil of wire whose only purpose was to delay the luma signal just enough to sync it with the chroma. Really old VCRs are as Steampunk as they come.  [<a href="#note-0" name="ref-0">0</a>]</p>

<p>Over time VCRs became smaller and vastly less complex while performing the same tasks as their ancestors. Very clever people spent decades improving the clockwork of video processing. Now of course we know the real breakthrough lay in getting rid of the clockwork.</p>

<p>It's probably a mistake to feel smug about that. The future isn't measured by years or technology, but by how it disagrees with the past about what is and what is not important. For example, I suspect the future will mock our assumption that computer intelligence will in any way resemble human intelligence, for the same reason we mock brass-handled Victorian contraptions. The idea is so central yet unfounded that it's bound to be wrong somehow. Whatever it is we're grasping toward, it may be "artificial intelligence" only in the sense that a car is an "iron horse".</p>

<p>
<a href="http://www.flickr.com/photos/trarbach/3525165735/" rel="external nofollow"><img src="../../images/mile-0.jpg" class="fright" title="Begin Highway 1 and mile marker 0 sign in Key West, Florida, USA."/></a>
In the <a href="http://www.youtube.com/watch?v=2Op3QLzMgSY#t=1m16s">first</a> of the famous <a href="http://groups.csail.mit.edu/mac/classes/6.001/abelson-sussman-lectures/">SICP lectures</a> Hal Abelson explains that we are as ignorant about the true nature of computing as early Egyptians were about geometry. Those guys did impressive stuff with string and rocks but they didn't really understand why it worked. It took Euclid to distill the axioms and methods we now file under geometry. The misfit name geo-metry, ie earth-measurement, is a fossil of its confused history. Perhaps the jumbled bag of tricks we call "computer science" &mdash;not a science, and not about computers&mdash; is similarly waiting for something to hatch.</p>



<br/><br/>

<h3>Notes</h3>
<p>[<a href="#ref-0" name="note-0">0</a>] My father disagrees with this interpretation. He points out that in the 1970s consumer video players were a huge gamble. It was decided that they had to be super reliable, and so were overbuilt. The evolution of VCRs post 1980 is a story of materials science (eg Teflon-coated plastic parts instead of greased metal and rubber belts) and amazing advances in integrated circuits. In short, VCRs became less reliable at roughly the same rate they became cheap enough that it didn't matter.</p>

<p>He has a point. The designers of early VCRs were not confused or stupid. Nevertheless they produced the most complex mechanical devices ever seen in the home, the maintenance of which provided employment for thousands.</p>

<p>Also see the <a href="http://www.cedmagic.com/history/ampex-commercial-vtr-1956.html">Ampex VCR from 1956</a>, which was the size of a washer and dryer, and the failed <a href="http://en.wikipedia.org/wiki/Capacitance_Electronic_Disc#How_CEDs_work">SelectaVision video disc</a>, which played video encoded onto 12-inch plastic records. </p>


         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2010/03/steampunk-and-you.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.6496bb18034e6090d92ada9874e4bf18b7c78821</id>
         <published>2010-03-01T12:00:00.000-08:00</published>
         <updated>2010-03-01T12:00:00.000-08:00</updated>
         <title type="text">A jQuery / YUI3 Rosetta Stone</title>
         <content type="html"><![CDATA[
           
  <div class="dt">19 March 2010</div>

  <p><b>Update:</b> Check out its new home at <a href="http://www.jsrosettastone.com/">jsrosettastone.com</a></p>


  <p>
  In late 2009 I was porting a largish prototype from <a href="http://jquery.com">jQuery</a> to <a href="http://developer.yahoo.com/yui">YUI3</a> and kept wishing for a cheatsheet to help me translate the idioms of one library to the other. A few months and lots of expert help later, here is a <a href="/jq-yui.html">Rosetta Stone</a> for jQuery and YUI3. You can read and comment on the <a href="/jq-yui.html">HTML version</a>, <a href="/jq-yui.pdf">download the PDF</a>, or fork the <a href="http://github.com/aristus/jquery-yui3-rosetta-stone">project on GitHub</a>. It is by no means complete or authoritative but I hope it's useful.

  <span class="sidenote">
    <img src="../../images/rosetta-stone.png"/>
   </span>
</p>

  <p><b>This is a reference, not an evaluation or comparison of the libraries.</b> If you have an opinion about which one is "better", that's great, but I don't want to hear it.</p>

<p>jQuery and YUI3 only overlap in certain areas like cross-browser DOM manipulation, CSS, Ajax, etc. In general I think jQuery's convenience methods, core selector support, and method chaining are its greatest strengths. YUI3 also has selectors and chaining, but favors configurability, breadth, and support for complex applications rather than terseness.</p>

  <p>Enjoy!</p>

  <p><a href="/jq-yui.html">HTML version</a>, <a href="/jq-yui.pdf">PDF Version</a></p>


         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2010/03/jquery-yui-rosetta-stone.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.1bb1f6b87586a0b23167c42b77459bcec07e940a</id>
         <published>2010-02-01T12:00:00.000-08:00</published>
         <updated>2010-02-01T12:00:00.000-08:00</updated>
         <title type="text">Measuring Javascript Parse and Load</title>
         <content type="html"><![CDATA[
           
            <div class="dt">10 February 2010</div>
            <div style="clear:both;"></div>

<style>
  table.chart {
    cell-padding: 0;
    cell-spacing: 0;
    border-bottom: solid 1px #999;
    border-right: solid 1px #999;
    border-collapse: collapse;
  }

  table.chart td {
    border-top: solid 1px #999;
    border-left: solid 1px #999;
    margin:0;
    padding:0;
    font-size: 10pt;
    vertical-align: middle;
    text-align: center;
    width: 60px;
  }

  table.chart th {
    padding: 4px;
    dddbackground: #dfdfdf;
    border-right: solid 1px #999;
    font-weight: bold;
    font-size: 10pt;
  }

  table.chart td.v {
    white-space: nowrap;
    text-align: left;
    border: none;
    font-size: 10pt;
    ddbackground: #dfdfdf;
    border-bottom: solid 1px #999;
    padding: 4px;
  }

  div.caption {
    font-size: 10pt;
    font-style: italic;
    margin: 10px;
    text-align:center;
  }

  div.figure {
    text-align: center;
    margin: 20px 0;
    padding: 10px;
  }


  td.asterix {
    color: #666;
  }
  th.asterix {
    color: #666;
  }


  div.td-container {
    position:relative;
  }


  div.barbar {
    width: 80px;
    height:30px;
    position:relative;
  }
  div.bar {
    background-color:#a0d3fe;
    position: absolute;
    top: 0px;
    left: 0px;
    bottom: 0px;
  }

  div.num {
     position: absolute;
     text-align:center;
     top: 8px;
     width:100%;
     text-shadow: 0.1em 0.1em #eee;
  }

  td.asterix div.bar {
    background-color: #ddd;
  }

  div.aside {
    font-size: 10pt;
    font-weight: normal;
  }

  .hidden {
    display:none;
  }

  div.img {
    display:inline-block;
    float:right;
    margin-right: 20px;
  }
  div.img * {
    font-size: 8pt;
  }
</style>

<p>Any savvy web developer can tell you how many kilobytes their code consumes. They bundle, <a href="http://developer.yahoo.com/performance/rules.html#minify">minify</a>, <a href="http://developer.yahoo.com/performance/rules.html#gzip">compress</a> and tune the data sent out to within an inch of its life. Wire weight is easy to measure and has a direct impact on your application's launch time. But how many milliseconds does it take the user's computer to <i>parse and load</i> your code once it's arrived? What differences are there between CPUs, operating systems, browsers and plugins? What speed leaks are we overlooking?</p>
<ul>
<li><a href="#the-test">The Test</a></li>
<li><a href="#libs">The Libraries</a></li>
<li><a href="#results">Results</a></li>
<li><a href="#minifi">Minification</a></li>
<li><a href="#conclusion">So What?</a></li>
<li><a href="#try">Try this at home</a></li>

<li>Appendix
  <ul>
    <li><a href="#chrome">The Curious Case of Chrome</a></li>
    <li><a href="#debugging">Debugging the Benchmark</a></li>
    <li><a href="#gc">Adventures in Garbage Collecting</a></li>
    <li><a href="#opera">A Tragic Opera (updated)</a></li>
  </ul>
</ul>


<a name="the-test"><h3>The Test</h3></a>

<p>As you look at the data below, keep in mind four things:</p>
<ol>
<li>Your code is not the only code running on the user's computer</li>
<li>Parse-n-load time comes down to available CPU cycles and RAM</li>
<li>The fastest CPUs aren't getting much faster</li>
<li>The average consumer CPU is getting <i>slower</i></li>
</ol>

<p>Even if you look at just sexy new hardware it's hard to ignore low-power clients: there are about 50 million netbooks and 43 million iPhones out there, alongside 10-15 million Android devices. Almost all of them are in the 600-1,600 MHz range and have less than 512MB of RAM. A juicy, growing slice of the market wants to use your software on the equivalent of a desktop from 1998. This includes rich Westerners as well as people in the fastest-growing international markets.</p>

<p>The test harness loads a given block of Javascript from a local file over and over and measures the setup time. My test subjects were the Yahoo User Interface (<a href="http://developer.yahoo.com/yui/3">YUI</a>) libraries, <a href="http://script.aculo.us">Scriptaculous</a>, and <a href="http://jqueryui.com">jQuery UI</a>. I've also included the main Javascript application code from <a href="http://github.com">GitHub</a>.</p>

<p>The core test is as simple as can be: record a start time, load the script, record the end time, and repeat over 1,000 iterations. The tests were run on recently-booted machines with no other programs running. You can <a href="http://github.com/aristus/parse-n-load">check out the project on GitHub</a> and play along at home.</p>
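<p>The shape of that loop is sketched below. To be clear about the assumptions: the real harness is Javascript running in a browser and lives in the GitHub project linked above. This is only a Python analogue of the same start/load/stop/repeat pattern, with <code>compile</code> and <code>exec</code> standing in for the browser's parse and load. The function name and the sample source string are invented for the example.</p>

```python
import time


def average_load_ms(source, iterations=1000):
    """Average time, in milliseconds, to compile and execute `source` from scratch."""
    total = 0.0
    for _ in range(iterations):
        start = time.perf_counter()                         # record a start time
        code = compile(source, "benchmark-input", "exec")   # parse
        exec(code, {})                                      # load into a fresh namespace
        total += time.perf_counter() - start                # record the end time
    return total * 1000.0 / iterations


if __name__ == "__main__":
    sample = "def f(x):\n    return x * 2\n" * 200
    print(f"{average_load_ms(sample, iterations=100):.3f} ms per load")
```

<p>Using a fresh namespace on each pass keeps one iteration from warming things up for the next, which is the same reason the real tests were run on recently-booted machines.</p>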

<a name="libs"><h3>The Libraries</h3></a>

<div class="img">
  <img src="/images/you-like-apples.jpg"/>
  <div xmlns:cc="http://creativecommons.org/ns#" about="http://www.flickr.com/photos/charlietakesphotos/78511025/"><a rel="cc:attributionURL" href="http://www.flickr.com/photos/charlietakesphotos/">Flickr: charlietakesphotos</a></div>
</div>

<ul>
  <li><b>YUI 2.8.0r4</b>, 390KB <div class="aside">partial (dom, event, datasource, datatable, layout, tabview, treeview, menu)</div></li>
  <li><b>YUI 3.0 build 1549</b>, 311KB <div class="aside">"kitchen sink"</div></li>
  <li><b>Scriptaculous 1.8.3</b>, 159KB  <div class="aside">"kitchen sink"</div></li>
  <li><b>jQuery UI 1.7.2</b>, 359KB  <div class="aside">"kitchen sink", minus translations</div></li>
  <li><b>GitHub.com 09 Feb 2010</b>, 211KB <div class="aside">main application code, including jQuery 1.4</div></li>
</ul>
<p>The point of this benchmark is to compare browsers and CPUs. Comparing the parse-n-load of different <i>libraries</i> puts you on shaky ground. For example, the YUI2 libraries are as of this writing more comprehensive than, say, YUI3 or Scriptaculous. On the other hand it's rare for an application to load every module as we're doing here. Also, each library has a different approach to initialization. YUI2 does a lot of work up-front while YUI3 does things more lazily. From there you get into a complex question about what benefits each library buys you. <i>Do not base the choice of library on this benchmark.</i></p>


<a name="results"><h3>MacBook Pro 2.26GHz, OSX 10.5.8</h3><br/></a>

<table class="chart justice-denied" cellspacing="0">
  <tr>
  <th style="background:none"></th>
  <th title="Google Chrome 5.0.307.7">Chrome</th>
  <th title="Apple Safari 4.0.3">Safari 4</th>
  <th title="Mozilla Firefox 3.0.14">Firefox 3</th>
  <th title="Mozilla Firefox 3.5.3">Firefox 3.5</th>
  <th title="Mozilla Firefox 3.6.0">Firefox 3.6</th>
  <th title="Opera 10.10 b6795">Opera 10</th>
  <th title="Apple Safari 3.0.4">Safari 3</th>
  </tr>

  <tr>
    <td class="v"><b>YUI3</b></td>
<td> <div class="barbar"><div class="bar" style="width:6px">&nbsp;</div><div class="num">31 <!-- (7) --></div></div></td>
<td> <div class="barbar"><div class="bar" style="width:9px">&nbsp;</div><div class="num">47 <!-- (51) --></div></div></td>
<td> <div class="barbar"><div class="bar" style="width:15px">&nbsp;</div><div class="num">78 <!-- (89) --></div></div></td>
<td> <div class="barbar"><div class="bar" style="width:18px">&nbsp;</div><div class="num">94</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:16px">&nbsp;</div><div class="num">83</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:12px">&nbsp;</div><div class="num">69</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:19px">&nbsp;</div><div class="num">98</div></div></td>
  </tr>
  <tr>
    <td class="v"><b>YUI2</b></td>
<td> <div class="barbar"><div class="bar" style="width:2px">&nbsp;</div><div class="num">11 <!-- (11) --></div></div></td>
<td> <div class="barbar"><div class="bar" style="width:22px">&nbsp;</div><div class="num">110 <!-- (112) --></div></div></td>
<td> <div class="barbar"><div class="bar" style="width:15px">&nbsp;</div><div class="num">77 <!-- (101, 101, 101) --></div></div></td>
<td> <div class="barbar"><div class="bar" style="width:21px">&nbsp;</div><div class="num">105</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:20px">&nbsp;</div><div class="num">102</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:14px">&nbsp;</div><div class="num">74</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:24px">&nbsp;</div><div class="num">120</div></div></td>
  </tr>

  <tr>
    <td class="v"><b>Scriptaculous</b></td>
<td> <div class="barbar"><div class="bar" style="width:1px">&nbsp;</div><div class="num">7 <!-- (13) --></div></div></td>
<td> <div class="barbar"><div class="bar" style="width:10px">&nbsp;</div><div class="num">38 <!-- (50*) --></div></div></td>
<td> <div class="barbar"><div class="bar" style="width:12px">&nbsp;</div><div class="num">63 <!-- (64) --></div></div></td>
<td> <div class="barbar"><div class="bar" style="width:18px">&nbsp;</div><div class="num">91</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:14px">&nbsp;</div><div class="num">71</div></div></td>
    <td><div class="barbar"><div class="bar" style="width:14px">&nbsp;</div><div class="num">73</div></div></td>
    <td><div class="barbar"><div class="bar" style="width:12px">&nbsp;</div><div class="num">61</div></div></td>
  </tr>

  <tr>
    <td class="v"><b>jQuery UI</b> </td>
<td> <div class="barbar"><div class="bar" style="width:0px">&nbsp;</div><div class="num">3 <!-- (7) --></div></div></td>
<td> <div class="barbar"><div class="bar" style="width:8px">&nbsp;</div><div class="num">40</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:16px">&nbsp;</div><div class="num">84 <!-- (85) --></div></div></td>
<td> <div class="barbar"><div class="bar" style="width:19px">&nbsp;</div><div class="num">95</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:16px">&nbsp;</div><div class="num">84</div></div></td>
    <td><div class="barbar"><div class="bar" style="width:14px">&nbsp;</div><div class="num">73</div></div></td>
    <td><div class="barbar"><div class="bar" style="width:23px">&nbsp;</div><div class="num">115</div></div></td>
  </tr>


  <tr>
    <td class="v"><b>GitHub</b></td>
<td> <div class="barbar"><div class="bar" style="width:0px">&nbsp;</div><div class="num">4 <!-- (10) --></div></div></td>
<td> <div class="barbar"><div class="bar" style="width:12px">&nbsp;</div><div class="num">63</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:13px">&nbsp;</div><div class="num">67</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:16px">&nbsp;</div><div class="num">80</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:13px">&nbsp;</div><div class="num">66</div></div></td>
    <td><div class="barbar"><div class="bar" style="width:12px">&nbsp;</div><div class="num">54</div></div></td>
    <td><div class="barbar"><div class="bar" style="width:14px">&nbsp;</div><div class="num">70</div></div></td>
  </tr>
</table>
<div class="caption">
  <p>Table 0: Parse-and-load times in milliseconds for various Javascript libraries and browsers, 95th percentile mean. MacBook Pro 2.26 GHz Intel Core 2 Duo with 4GB of 1GHz DDR3 RAM.
  </p>
</div>



<h3>Presario R3000 1.6GHz, WinXP SP3</h3><br/>
<table class="chart justice-denied" cellspacing="0">
  <tr>
  <th style="background:none"></th>
  <th title="Google Chrome 4.0.249.89">Chrome</th>
  <th title="Mozilla Firefox 3.0.11">Firefox 3</th>
  <th title="Mozilla Firefox 3.5.2">Firefox 3.5</th>
  <th title="Mozilla Firefox 3.6.0">Firefox 3.6</th>
  <th title="Microsoft Internet Explorer 6.00.2900">IE 6</th>
  <th title="Microsoft Internet Explorer 7.00.5730">IE 7</th>
  <th title="Microsoft Internet Explorer 8.00.6001">IE 8</th>
  </tr>

  <tr class="hidden">
    <td class="v"><b>YUI3</b> (raw)</td>
<td> <div class="barbar"><div class="bar" style="width:7px">&nbsp;</div><div class="num">38</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:32px">&nbsp;</div><div class="num">163</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:37px">&nbsp;</div><div class="num">189</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:35px">&nbsp;</div><div class="num">178</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:23px">&nbsp;</div><div class="num">118</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:21px">&nbsp;</div><div class="num">106</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:20px">&nbsp;</div><div class="num">101</div></div></td>
  </tr>
  <tr>
    <td class="v"><b>YUI3</b></td>
<td> <div class="barbar"><div class="bar" style="width:3px">&nbsp;</div><div class="num">19</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:23px">&nbsp;</div><div class="num">119</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:28px">&nbsp;</div><div class="num">143</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:26px">&nbsp;</div><div class="num">134</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:13px">&nbsp;</div><div class="num">65</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:12px">&nbsp;</div><div class="num">61</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:10px">&nbsp;</div><div class="num">51</div></div></td>
  </tr>
  <tr class="hidden">
    <td class="v"><b>YUI2</b> (raw)</td>
<td> <div class="barbar"><div class="bar" style="width:7px">&nbsp;</div><div class="num">37</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:37px">&nbsp;</div><div class="num">187</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:44px">&nbsp;</div><div class="num">220</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:31px">&nbsp;</div><div class="num">158</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:41px">&nbsp;</div><div class="num">209</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:26px">&nbsp;</div><div class="num">132</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:24px">&nbsp;</div><div class="num">122</div></div></td>
  </tr>
  <tr>
    <td class="v"><b>YUI2</b></td>
<td> <div class="barbar"><div class="bar" style="width:3px">&nbsp;</div><div class="num">18</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:28px">&nbsp;</div><div class="num">140</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:33px">&nbsp;</div><div class="num">165</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:29px">&nbsp;</div><div class="num">148</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:17px">&nbsp;</div><div class="num">89</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:16px">&nbsp;</div><div class="num">83</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:13px">&nbsp;</div><div class="num">69</div></div></td>
  </tr>

  <tr>
    <td class="v"><b>Scriptaculous</b></td>
<td> <div class="barbar"><div class="bar" style="width:2px">&nbsp;</div><div class="num">14</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:17px">&nbsp;</div><div class="num">89</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:27px">&nbsp;</div><div class="num">135</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:23px">&nbsp;</div><div class="num">116</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:16px">&nbsp;</div><div class="num">82</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:16px">&nbsp;</div><div class="num">84</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:8px">&nbsp;</div><div class="num">44</div></div></td>
  </tr>

  <tr>
    <td class="v"><b>jQuery UI</b> </td>
<td> <div class="barbar"><div class="bar" style="width:2px">&nbsp;</div><div class="num">11</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:23px">&nbsp;</div><div class="num">119</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:30px">&nbsp;</div><div class="num">151</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:27px">&nbsp;</div><div class="num">136</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:12px">&nbsp;</div><div class="num">62</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:12px">&nbsp;</div><div class="num">62</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:10px">&nbsp;</div><div class="num">53</div></div></td>
  </tr>


  <tr>
    <td class="v"><b>GitHub</b></td>
<td> <div class="barbar"><div class="bar" style="width:1px">&nbsp;</div><div class="num">9</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:17px">&nbsp;</div><div class="num">89</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:22px">&nbsp;</div><div class="num">110</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:21px">&nbsp;</div><div class="num">107</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:11px">&nbsp;</div><div class="num">56</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:10px">&nbsp;</div><div class="num">54</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:8px">&nbsp;</div><div class="num">41</div></div></td>
  </tr>
</table>
<div class="caption">
  <p>Table 1: Parse-and-load, 95th percentile mean. Compaq Presario R3000 at 1.6GHz, Windows XP SP3 and 512MB RAM.</p>
</div>


<h3>MacBook Air 1.6GHz, OSX 10.5.8</h3><br/>

<table class="chart justice-denied" cellspacing="0">
  <tr>
  <th style="background:none"></th>
  <th title="Google Chrome 5.0.307.7">Chrome</th>
  <th title="Apple Safari 4.0.2">Safari 4</th>
  <th title="Mozilla Firefox 3.0.12">Firefox 3</th>
  <th title="Mozilla Firefox 3.5.2">Firefox 3.5</th>
  <th title="Mozilla Firefox 3.6.0">Firefox 3.6</th>
  </tr>

  <tr class="hidden">
    <td class="v"><b>YUI3</b> (raw)</td>
<td> <div class="barbar"><div class="bar" style="width:3px">&nbsp;</div><div class="num">19</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:18px">&nbsp;</div><div class="num">92</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:43px">&nbsp;</div><div class="num">219</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:46px">&nbsp;</div><div class="num">234</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:43px">&nbsp;</div><div class="num">216</div></div></td>
  </tr>
  <tr>
    <td class="v"><b>YUI3</b></td>
<td> <div class="barbar"><div class="bar" style="width:1px">&nbsp;</div><div class="num">8</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:14px">&nbsp;</div><div class="num">72</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:35px">&nbsp;</div><div class="num">178</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:34px">&nbsp;</div><div class="num">170</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:31px">&nbsp;</div><div class="num">157</div></div></td>
  </tr>
  <tr class="hidden">
    <td class="v"><b>YUI2</b> (raw)</td>
<td> <div class="barbar"><div class="bar" style="width:4px">&nbsp;</div><div class="num">24</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:50px">&nbsp;</div><div class="num">251</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:60px">&nbsp;</div><div class="num">301</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:53px">&nbsp;</div><div class="num">268</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:49px">&nbsp;</div><div class="num">249</div></div></td>
  </tr>
  <tr>
    <td class="v"><b>YUI2</b></td>
<td> <div class="barbar"><div class="bar" style="width:3px">&nbsp;</div><div class="num">15</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:36px">&nbsp;</div><div class="num">182</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:35px">&nbsp;</div><div class="num">176</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:38px">&nbsp;</div><div class="num">190</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:35px">&nbsp;</div><div class="num">177</div></div></td>
  </tr>

  <tr>
    <td class="v"><b>Scriptaculous</b></td>
<td> <div class="barbar"><div class="bar" style="width:2px">&nbsp;</div><div class="num">10</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:11px">&nbsp;</div><div class="num">56</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:19px">&nbsp;</div><div class="num">97</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:32px">&nbsp;</div><div class="num">163</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:28px">&nbsp;</div><div class="num">141</div></div></td>
  </tr>

  <tr>
    <td class="v"><b>jQuery UI</b> </td>
<td> <div class="barbar"><div class="bar" style="width:0px">&nbsp;</div><div class="num">4</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:12px">&nbsp;</div><div class="num">61</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:30px">&nbsp;</div><div class="num">150</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:33px">&nbsp;</div><div class="num">167</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:30px">&nbsp;</div><div class="num">151</div></div></td>
  </tr>


  <tr>
    <td class="v"><b>GitHub</b></td>
<td> <div class="barbar"><div class="bar" style="width:1px">&nbsp;</div><div class="num">5</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:15px">&nbsp;</div><div class="num">77</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:26px">&nbsp;</div><div class="num">131</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:26px">&nbsp;</div><div class="num">130</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:23px">&nbsp;</div><div class="num">117</div></div></td>
  </tr>
</table>
<div class="caption">
  <p>Table 2: Parse-and-load, 95th percentile mean. MacBook Air 1.6 GHz Intel Core 2 Duo with 2GB of 1GHz DDR2 RAM.</p>
</div>

<p>NB: the MacBook Air had overheating problems, so some of the numbers in this table may be skewed upwards. On the other hand, that's precisely the kind of crap your users have to deal with.</p>

<p>So it seems that there is a measurable cost to parse-n-load, and parsing speed does not seem to be correlated with the speed of the interpreter or DOM. Chrome has some anomalies in the data so for now I am withholding judgement. See "<a href="#chrome">Curious Case of Chrome</a>" in the appendix.</p>

<p>There is a noticeable spread between different browsers on the same hardware and OS. Firefox 3.5 is a few points slower than 3.0, but 3.6 improved on that. Internet Explorer is surprisingly fast at parse-n-load across all tested versions. I didn't include standard deviations because aside from some pathological cases they were small. If you run the benchmark for yourself you will get mean average, stddev, and a time series graph for your enjoyment.</p>


<a name="minifi"><h3>Minification FTW</h3></a>

<p>Here is a comparison of the YUI libraries in "raw" form (with comments, whitespace, etc) and the same code minified using <a href="http://developer.yahoo.com/yui/compressor">YUI Compressor</a>. As expected, minification helps parse-n-load in addition to network transmission time. This is probably due to the absence of comments and extra whitespace.</p>

<table class="chart justice-denied" cellspacing="0">
  <tr>
  <th style="background:none"></th>
  <th title="Google Chrome 4.0.249.89">Chrome</th>
  <th title="Mozilla Firefox 3.0.11">Firefox 3</th>
  <th title="Mozilla Firefox 3.5.2">Firefox 3.5</th>
  <th title="Mozilla Firefox 3.6.0">Firefox 3.6</th>
  <th title="Microsoft Internet Explorer 6.00.2900">IE 6</th>
  <th title="Microsoft Internet Explorer 7.00.5730">IE 7</th>
  <th title="Microsoft Internet Explorer 8.00.6001">IE 8</th>
  </tr>
  <tr>
    <td class="v">YUI3 (<b>raw</b>)</td>
<td> <div class="barbar"><div class="bar" style="width:7px">&nbsp;</div><div class="num">38</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:32px">&nbsp;</div><div class="num">163</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:37px">&nbsp;</div><div class="num">189</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:35px">&nbsp;</div><div class="num">178</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:23px">&nbsp;</div><div class="num">118</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:21px">&nbsp;</div><div class="num">106</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:20px">&nbsp;</div><div class="num">101</div></div></td>
  </tr>
  <tr>
    <td class="v">YUI3 (<b>minified</b>)</td>
<td> <div class="barbar"><div class="bar" style="width:3px">&nbsp;</div><div class="num">19</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:23px">&nbsp;</div><div class="num">119</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:28px">&nbsp;</div><div class="num">143</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:26px">&nbsp;</div><div class="num">134</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:13px">&nbsp;</div><div class="num">65</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:12px">&nbsp;</div><div class="num">61</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:10px">&nbsp;</div><div class="num">51</div></div></td>
  </tr>

  <tr class="clearall">
    <td style="border-left:none;" colspan="8">&nbsp;</td>
  </tr>

  <tr>
    <td class="v">YUI2 (<b>raw</b>)</td>
<td> <div class="barbar"><div class="bar" style="width:7px">&nbsp;</div><div class="num">37</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:37px">&nbsp;</div><div class="num">187</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:44px">&nbsp;</div><div class="num">220</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:31px">&nbsp;</div><div class="num">158</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:41px">&nbsp;</div><div class="num">209</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:26px">&nbsp;</div><div class="num">132</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:24px">&nbsp;</div><div class="num">122</div></div></td>
  </tr>
  <tr>
    <td class="v">YUI2 (<b>minified</b>)</td>
<td> <div class="barbar"><div class="bar" style="width:3px">&nbsp;</div><div class="num">18</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:28px">&nbsp;</div><div class="num">140</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:33px">&nbsp;</div><div class="num">165</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:29px">&nbsp;</div><div class="num">148</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:17px">&nbsp;</div><div class="num">89</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:16px">&nbsp;</div><div class="num">83</div></div></td>
<td> <div class="barbar"><div class="bar" style="width:13px">&nbsp;</div><div class="num">69</div></div></td>
  </tr>
</table>
<div class="caption">
  <p>Table 3: Code minified with YUI Compressor tends to parse faster.</p>
</div>
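<p>To put a number on "faster", here is a small sketch that turns the YUI3 rows of Table 3 into percentages. The figures are copied from the table; the function and variable names are mine and only for illustration.</p>

```javascript
// Parse-and-load times in ms from Table 3 (Presario R3000), YUI3 rows.
const browsers = ["Chrome", "Firefox 3", "Firefox 3.5", "Firefox 3.6", "IE 6", "IE 7", "IE 8"];
const raw      = [38, 163, 189, 178, 118, 106, 101];
const minified = [19, 119, 143, 134, 65, 61, 51];

// Percentage of parse-and-load time saved by minification, rounded.
function percentSaved(before, after) {
  return Math.round(((before - after) / before) * 100);
}

const savings = browsers.map((name, i) => name + ": " + percentSaved(raw[i], minified[i]) + "%");
console.log(savings.join(", "));
```

<p>On this machine the savings range from roughly a quarter of the time in Firefox to about half in Chrome and IE 8.</p>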


<a name="conclusion"><h3>So What?</h3></a>
<p>If you have a large amount of Javascript in your application it's natural to bundle it all up into one file to save on network transit time and increase cacheability. But if certain parts of your application only use certain parts of the bundle, you might cause the user to unnecessarily parse unused code on every page load. The ideal would be to decouple three things that are now tightly wound together: network transit, parse-n-load, and cacheability. There is a lot of work going on in this space but so far there is no silver bullet.</p>

<ul>
<li>Alexander Limi from Mozilla <a href="http://limi.net/articles/resource-packages/">has a proposal</a> to use zip files for bundling.</li>

<li>The folks at 280North have <a href="http://cappuccino.org/discuss/2009/11/11/just-one-file-with-cappuccino-0-8/">found a neat way</a> to do multi-file bundling in their Cappuccino framework, using existing technology.</li>

<li>Google is <a href="http://dev.chromium.org/spdy/spdy-whitepaper">proposing an extension</a> to HTTPS that allows multiple concurrent streams over a single TCP connection.</li>

<li><a href="http://blog.sproutcore.com/post/272853740/cut-your-javascript-load-time-90-with-deferred">SproutCore</a> and the <a href="http://googlecode.blogspot.com/2009/09/gmail-for-mobile-html5-series-reducing.html">Google Mobile Team</a> recently demonstrated ways to load Javascript code as dumb strings that are evaluated at a time of the programmer's choosing.</li>

</ul>
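<p>The last idea in the list, shipping code as strings and evaluating it on demand, can be sketched in a few lines. This is not SproutCore's or Google's actual code. The <code>register</code> and <code>use</code> names and the registry are hypothetical, and I am assuming each module's source is a function body that returns its exports.</p>

```javascript
// Hypothetical registry: module name -> source string plus cached exports.
const modules = new Map();

// Registering only stores text. The engine scans it as a string literal
// but does not parse it as code yet.
function register(name, source) {
  modules.set(name, { source, exports: undefined, loaded: false });
}

// The first call parses and runs the source; later calls reuse the result.
function use(name) {
  const mod = modules.get(name);
  if (!mod) throw new Error("unknown module: " + name);
  if (!mod.loaded) {
    mod.exports = new Function(mod.source)();
    mod.loaded = true;
    mod.source = null; // the string is no longer needed
  }
  return mod.exports;
}

register("greeter", "return { hello: function (who) { return 'hello ' + who; } };");
console.log(use("greeter").hello("world"));
```

<p>The bundle still travels and caches as one file, but the parse cost of each module is paid only by the pages that call <code>use</code> on it.</p>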

<a name="try"><h3>Try this at home</h3></a>

<p>The <a href="http://github.com/aristus/parse-n-load">Parse-N-Load benchmark</a> is open source and free for use. It's early and I'm sure there are bugs. If you have other kinds of hardware (Netbooks! Windows! Linux!), please try it out and let me know what you find.</p>

<h2><i>Appendix</i></h2>

<a name="chrome"><h3>The Curious Case of Chrome</h3></a>
<p>While Google Chrome appears to be an order of magnitude faster at parse-n-load, the truth may be a little more complex. Running this benchmark in Chrome sometimes produces sharp cliffs in the time series graph, especially on slower CPUs. That might be the V8 engine's <a href="http://en.wikipedia.org/wiki/Inline_caching">inline caching</a> kicking in. I also suspect it could be caching the machine code it compiles on the first pass. Or this could be something silly like the CPU coming out of low-power mode. If anyone who knows more about what's going on can speak up, please do.</p>
<div class="figure">
  <img src="/images/chrome-cliff.png" />
  <div class="caption">Figure 0: Chrome's V8 Javascript engine has two speeds: fast and <b>very</b> fast.</div>
</div>


<a name="debugging"><h3>Debugging the Benchmark</h3></a>
<p>The first problem that came up was different blocking behavior between browsers. In Safari 4 (but not 3), if you create a script tag that points to an external file, that action will block, i.e., wait until that file is completely parsed and loaded. This makes timing it very easy. In Firefox, however, this action is asynchronous: the statement that creates the script element returns immediately and the file is loaded in a separate thread. This means you have to set up a callback in the separate thread to both measure elapsed time and kick off the next iteration of the test. You have to be <a href="http://paulbarry.com/articles/2009/08/30/tail-call-optimization">careful not to blow the stack</a> with too many nested function calls. Google Chrome is also asynchronous and has an altogether different stack behavior. If all that wasn't enough, browsers have very different memory allocation behavior, of which more (oh, much more) below.</p>
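<p>Here is a minimal sketch of the callback pattern, not the benchmark's actual code. The <code>load</code> argument stands in for "append a script tag and call back when it has loaded", and <code>runTrials</code> and <code>schedule</code> are names I made up. The point is that each trial is started from a deferred call rather than from inside the previous callback, so the stack does not grow with the number of trials.</p>

```javascript
// Runs `trials` loads one after another and collects each elapsed time in ms.
// `load(done)` must call done() once the script has been parsed and loaded.
// `schedule` defers the next trial so the callbacks do not nest on the stack.
function runTrials(load, trials, onFinish, schedule = (fn) => setTimeout(fn, 0)) {
  const times = [];
  function next() {
    if (times.length === trials) {
      onFinish(times);
      return;
    }
    const start = Date.now();
    load(() => {
      times.push(Date.now() - start);
      schedule(next); // start the next trial from a fresh stack
    });
  }
  next();
}

// Usage with a fake loader that finishes immediately.
runTrials((done) => done(), 3, (times) => console.log(times.length + " trials timed"));
```

<p>If the loader happens to block, as in Safari 4, the same code still works: the callback simply fires before <code>load</code> returns.</p>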


<a name="gc"><h3>Adventures in Garbage Collecting</h3></a>

<p>Every browser seems to have a different system for allocating memory while parsing Javascript code. When I graphed the results from Safari 4, this is what I saw initially:</p>

<div class="figure">
  <img src="/images/spiky-garbage.png" />
  <div class="caption">Figure 1: the effects of GC halts</div>
</div>


<p>Interesting. All of the source files in the benchmark are local so that's not I/O wait. The regularity of huge spikes suggests that the browser is pausing every so often to free up memory via garbage collection. When you remove those spikes another interesting pattern shows up. Here is the same graph with the top 5% of datapoints removed:</p>

<div class="figure">
  <img src="/images/sawtooth.png" />
  <div class="caption">Figure 2: 95th percentile graph of Javascript load times in Safari 4</div>
</div>
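<p>For reference, this is roughly how a "95th percentile mean" can be computed. I am assuming it means the mean of the samples left after the slowest 5% are discarded; <code>trimmedMean</code> is a hypothetical helper and the timings below are made up.</p>

```javascript
// Hypothetical helper: mean of the fastest `keep` fraction of the samples.
function trimmedMean(samples, keep = 0.95) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const count = Math.max(1, Math.round(sorted.length * keep));
  const kept = sorted.slice(0, count);
  return kept.reduce((sum, t) => sum + t, 0) / kept.length;
}

// Made-up timings in ms: 19 ordinary runs plus one GC-style spike.
const timings = Array(19).fill(50).concat([900]);
console.log(trimmedMean(timings)); // the 900 ms spike is dropped before averaging
```

<p>A plain mean of the same data would be pulled well above 50 by the single spike, which is why the tables report the trimmed figure.</p>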

<p>It appears that the parse-n-load time of a given piece of Javascript in Safari 4 will increase linearly with the amount of garbage. The load time can grow as much as 3X longer than normal before GC kicks in. I'm not certain whether this is an artifact of the benchmark or if it actually happens a lot during real-world use. When I added code to explicitly delete the previous script tag before creating a new one, the sawtooth elongated but did not go away. Other browsers exhibit similar halt-the-world GC behavior but only Safari 4 and Opera 10.5alpha have this sawtooth. Firefox's graph stays fairly horizontal but has many more small spikes:</p>


<div class="figure">
  <img src="/images/firefox-35.png" />
  <div class="caption">Figure 3: 95th percentile graph of Javascript load times in Firefox 3.5.3</div>
</div>



<a name="opera"><h3>A Tragic Opera (updated)</h3></a>

<p>Initially, results for Opera 10.0 had asterisks because I couldn't get it to run the complete test. Opera got steadily slower up to 250 iterations, after which it started serious thrashing and had to be killed. Geoffrey Sneddon from Opera Software kindly suggested an alternate way to run this benchmark, i.e., destroying and re-creating the iframe for each trial instead of overwriting the same document object. This new method (dubbed "v2") works well and the Opera column has been updated. New test runs against the other browsers did not move their 95th percentile numbers more than a few percent, though the standard deviations and number of GC pauses did decrease.</p>

<p>Geoffrey also mentioned that while the condition I triggered is rare in the wild, they will work to fix it in a future release.</p>

<div class="figure">
  <img src="/images/oh-opera.png" />
  <div class="caption">Figure 4: Opera 10 choking on the parse-n-load benchmark v1</div>
</div>

<div class="figure">
  <img src="/images/opera-10-v2.png" />
  <div class="caption">Figure 5: Opera 10 doing nicely with benchmark v2</div>
</div>


<p>Safari 3 hit the wall even earlier, triggering an Out-of-Memory error after just a couple hundred iterations. The v2 of the benchmark works much better.</p>

<div class="figure">
  <img src="/images/safari-3.png" />
  <div class="caption">Figure 6: with v1 Safari 3 would throw an OOM at this point.</div>
</div>

<div class="figure">
  <img src="/images/safari-3-v2.png" />
  <div class="caption">Figure 7: Safari 3 likes benchmark v2 much better.</div>
</div>

         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2010/02/measuring-javascript-parse-and-load.html" />
       </entry>
     
       <entry>
         <id>tag:org.bueno.carlos.0d6643bc1ea2676cd7e009291b767dfea04739ef</id>
         <published>2010-02-01T12:00:00.000-08:00</published>
         <updated>2010-02-01T12:00:00.000-08:00</updated>
         <title type="text">Accent-Folding for Autocomplete</title>
         <content type="html"><![CDATA[
           
  <div class="dt">23 February 2010</div>

<p>
Another generation of technology has passed and Unicode support is almost everywhere. The next step is to write software that is not just &#8220;internationalized&#8221; but truly multilingual. In this article we will skip through a bit of history and theory, then illustrate a neat hack called accent-folding. Accent-folding has its limitations but it can help make some important yet overlooked user interactions work better.
<span class="sidenote">
<img src="../../images/accent-folding-illustration.jpg" /><br/><span class="caption">Illustration by <a href="http://www.bearskinrug.co.uk/">Kevin Cornell</a>
<br/>
(Originally appeared on <a href="http://www.alistapart.com/articles/accent-folding-for-auto-complete/">A List Apart</a>.)

</span></span>

</p>

<p>A common assumption about internationalization is that every user fits into a single locale like &#8220;English, United States&#8221; or &#8220;French, France.&#8221; It&#8217;s a hangover from the PC days when just getting the computer to display the right squiggly bits was a big deal. One byte equaled one character, no exceptions, and you could only load one language&#8217;s alphabet at a time. This was fine because it was better than nothing, and because users spent most of their time with documents they or their coworkers produced themselves.</p>

<p>Today users deal with data from everywhere, in multiple languages and locales, all the time. The locale I <em>prefer</em> is only loosely correlated with the locales I <em>expect applications to process</em>.</p>

<p>Consider this address book:</p>

<ul>
  <li>
    Fulanito López
  </li>
  <li>
    Erik Lørgensen
  </li>
  <li>
    Lorena Smith
  </li>
  <li>
    James Lö
  </li>
</ul>



<p>
If I compose a new message and type &#8220;lo&#8221; in the To: field, what should happen? In many applications only Lorena will show up. These applications &#8220;support Unicode,&#8221; in the sense that they don&#8217;t corrupt or barf on it, but that&#8217;s all.
<span class="sidenote">
  <img src="../../images/entourage.png" alt="Screenshot of Microsoft Entourage's address book autosuggest, which does not fold accented characters." /><br/>
  <span class="caption">Fig 1. Hey Entourage, where are my contacts?</span>
</span>
</p>


<p>
  This problem is not just in address books. Think about inboxes, social bookmarks, comment feeds, users who speak multiple languages, users in internet cafés in foreign countries, even URLs. Look at the journalist Ryszard Kapuściński and how different websites handle his name:
</p>

<ul>
    <li>
      Wikipedia: <a href="http://en.wikipedia.org/wiki/Ryszard_Kapu%C5%9Bci%C5%84ski">Ryszard Kapuściński</a> (canonical URL)
    </li>
    <li>
      Wikipedia: <a href="http://en.wikipedia.org/wiki/Ryszard_Kapuscinski">Ryszard Kapuscinski</a> (hand-coded alternate)
    </li>
    <li>
      Wikipedia: <a href="http://en.wikipedia.org/wiki/Ryszard_Kapusci%C5%84ski">Ryszard Kapusciński</a> (<strong>not found</strong>)
    </li>
    <li>
      Wikipedia: <a href="http://en.wikipedia.org/wiki/%C8%92%C3%BFszar%E1%B8%8B-K%C3%A5pu%C5%9Bci%C5%84s%E1%B8%B3i">Rÿszarḋ Kåpuścińsḳi</a> (<strong>not found</strong>)
    </li>
    <li>
      Spock: <a href="http://spock.com/%C8%92%C3%BFszar%E1%B8%8B-K%C3%A5pu%C5%9Bci%C5%84s%E1%B8%B3i">Rÿszarḋ Kåpuścińsḳi</a> (accent-folded, redirects to canonical URL)
    </li>
  </ul>


<p>There is no excuse for your software to play dumb when the user types &#8220;<strong>cafe</strong>&#8221; instead of &#8220;<strong>café</strong>.&#8221;</p>

<h3>Áçčềñṭ-Ḟøłɖǐṅg</h3>

<p>
In specific applications of search that favor recall over precision, such as our address book example, <strong>á</strong>, <strong>a</strong>, <strong>å</strong>, and <strong>â</strong> can be treated as equivalent. For these purposes, accents (a.k.a. <a href="http://en.wikipedia.org/wiki/Diacritical">diacritical</a> marks) are pronunciation hints that don&#8217;t affect the textual meaning. Entering them can be cumbersome, especially on mobile devices.

<span class="sidenote">
  <img src="../../images/accent-folding.png" alt="Example of a YUI Autosuggest widget with accent-folding added." /><br/>
  <span class="caption">Fig 2. accent-folding in an autosuggest widget</span>
</span>
</p>

<p>An accent-folding function essentially maps Unicode characters to ASCII equivalents. Anywhere you apply case-folding, you should consider accent-folding, and for exactly the same reasons. With accent-folding, it doesn&#8217;t matter whether users search for <strong>cafe</strong>, <strong>café</strong> or even <strong>çåFé</strong>; the results will be the same.</p>

<p><strong>Be aware that there are a million caveats to accent rules. You will almost certainly get it wrong for somebody, somewhere</strong>. Nearly every alphabet has a few extra-special marks that do affect meaning, and, of course, non-Western alphabets have completely different rules.</p>

<p>A minor gotcha is the <a href="http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Block=Halfwidth_And_Fullwidth_Forms:]">Unicode "fullwidth" Roman alphabet</a>. These are fixed-width versions of plain ASCII characters designed to line up with Chinese/Japanese/Korean characters (e.g., "１９７９年８月１５日"). They reside in <code>0xff01</code> to <code>0xff5e</code> and should be treated as equivalent to their ASCII counterparts.</p>
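<p>A minimal sketch of that folding, assuming the only goal is to map the fullwidth forms of printable ASCII back to their plain counterparts. The name <code>fold_fullwidth</code> is made up for illustration. It relies on each fullwidth form sitting exactly <code>0xfee0</code> above its ASCII equivalent:</p>

```javascript
// Hypothetical helper: fold fullwidth ASCII variants (U+FF01 to U+FF5E)
// down to plain ASCII (U+0021 to U+007E). Everything else passes through.
function fold_fullwidth(s) {
  if (!s) { return ''; }
  return s.replace(/[\uff01-\uff5e]/g, function (ch) {
    return String.fromCharCode(ch.charCodeAt(0) - 0xfee0);
  });
}

console.log(fold_fullwidth('１９７９年８月１５日'));  // 1979年8月15日
```

<p>Run it before accent-folding so that both steps see plain ASCII letters and digits.</p>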


<h3>Hey man, I&#8217;m only here for the copy/paste</h3>

<p>I&#8217;ve posted <a href="http://github.com/aristus/accent-folding">more complete examples</a> on GitHub, but for illustration, here&#8217;s a basic accent-folder in Javascript:</p>

<pre class="code">
<span style="color: #228b22;">var</span> <span style="color: #996633;">accent_map</span> = {
  <span style="color: #996633;">'&#225;'</span>:<span style="color: #996633;">'a'</span>, <span style="color: #996633;">'&#233;'</span>:<span style="color: #996633;">'e'</span>, <span style="color: #996633;">'&#237;'</span>:<span style="color: #996633;">'i'</span>,<span style="color: #996633;">'&#243;'</span>:<span style="color: #996633;">'o'</span>,<span style="color: #996633;">'&#250;'</span>:<span style="color: #996633;">'u'</span>
};

<span style="color: #228b22;">function</span> <span style="color: #7f007f;">accent_fold</span> (<span style="color: #228b22;">s</span>) {
  <span style="color: #0000ff;">if</span> (!s) { <span style="color: #0000ff;">return</span> <span style="color: #996633;">''</span>; }
  <span style="color: #228b22;">var</span> <span style="color: #996633;">ret</span> = <span style="color: #996633;">''</span>;
  <span style="color: #0000ff;">for</span> (<span style="color: #228b22;">var</span> <span style="color: #996633;">i</span> = 0; i &lt; s.length; i++) {
    ret += accent_map[s.charAt(i)] || s.charAt(i);
  }
  <span style="color: #0000ff;">return</span> ret;
};
</pre>
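<p>If you can assume an engine with <code>String.prototype.normalize</code> (ES2015 and later), much of that table can be derived rather than typed. This is a different technique from the lookup table above, not a drop-in replacement: it decomposes each character (NFD) and drops the combining marks in <code>U+0300</code> to <code>U+036F</code>. Letters such as <strong>ø</strong> and <strong>ł</strong> have no decomposition, so a small table is still needed. The names <code>extra_map</code> and <code>accent_fold_nfd</code> are made up for this sketch:</p>

```javascript
// Letters with no Unicode decomposition still need explicit entries.
// This short list is illustrative, not complete.
var extra_map = { 'ø':'o', 'Ø':'O', 'ł':'l', 'Ł':'L', 'đ':'d', 'Đ':'D' };

function accent_fold_nfd(s) {
  if (!s) { return ''; }
  // Split accented letters into base letter + combining marks,
  // then delete the marks (U+0300 to U+036F).
  var stripped = s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
  var ret = '';
  for (var i = 0; i !== stripped.length; i++) {
    var ch = stripped.charAt(i);
    ret += extra_map[ch] || ch;
  }
  return ret;
}

console.log(accent_fold_nfd('Ryszard Kapuściński'));
console.log(accent_fold_nfd('Erik Lørgensen'));
```

<p>Like the table version, this preserves case; lowercase separately if you want case-folding too.</p>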

<h3>Regular Expressions</h3>

<p>Regular expressions are very tricky to make accent-aware. Notice that in Fig. 2 only the unaccented entries are in bold type. The problem is that the Unicode character layout does not lend itself to patterns that cut across languages. The proper regex for &#8220;<strong>lo</strong>&#8221; would be something insane like:</p>

<pre><code>[LlĹĺĽľĻļḶḷḸḹḼḽḺḻŁłŁłĿŀȽƚɫ][OoÓóÒòŎŏÔôỐốỒồỖöȪȫŐőÕõṌȭȮȯǾǿ...ǬǭŌ]</code></pre>

<p>Never, never do this. As of this writing, few regular expression engines support shortcuts for Unicode character classes. <a href="http://www.pcre.org/" rel="nofollow">PCRE</a> and Java seem to be in the vanguard. You probably shouldn&#8217;t push it. Instead, try highlighting an accent-folded version of the string, and then use those character positions to highlight the original, like so:</p>

<pre class="code">
<span style="color: #006600;">// </span><span style="color: #006600;">accent_folded_hilite("Fulanilo L&#243;pez", 'lo')
</span><span style="color: #006600;">//   </span><span style="color: #006600;">--&gt; "Fulani&lt;b&gt;lo&lt;/b&gt; &lt;b&gt;L&#243;&lt;/b&gt;pez"
</span><span style="color: #006600;">//</span><span style="color: #006600;">
</span><span style="color: #228b22;">function</span> <span style="color: #7f007f;">accent_folded_hilite</span>(<span style="color: #228b22;">str</span>, <span style="color: #228b22;">q</span>) {
  <span style="color: #228b22;">var</span> <span style="color: #996633;">str_folded</span> = accent_fold(str).toLowerCase().replace(/[&lt;&gt;]+/g, <span style="color: #996633;">''</span>);
  <span style="color: #228b22;">var</span> <span style="color: #996633;">q_folded</span> = accent_fold(q).toLowerCase().replace(/[&lt;&gt;]+/g, <span style="color: #996633;">''</span>);

  <span style="color: #006600;">// </span><span style="color: #006600;">Create an intermediate string with hilite hints
</span>  <span style="color: #006600;">// </span><span style="color: #006600;">Example: fulani&lt;lo&gt; &lt;lo&gt;pez
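</span>  <span style="color: #006600;">// </span><span style="color: #006600;">Escape regex metacharacters so "." or "(" in the query match literally
</span>  <span style="color: #228b22;">var</span> <span style="color: #996633;">q_escaped</span> = q_folded.replace(/[.*+?^${}()|[\]\\]/g, <span style="color: #996633;">'\\$&amp;'</span>);
  <span style="color: #228b22;">var</span> <span style="color: #996633;">re</span> = <span style="color: #0000ff;">new</span> <span style="color: #228b22;">RegExp</span>(q_escaped, <span style="color: #996633;">'g'</span>);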
  <span style="color: #228b22;">var</span> <span style="color: #996633;">hilite_hints</span> = str_folded.replace(re, <span style="color: #996633;">'&lt;'</span>+q_folded+<span style="color: #996633;">'&gt;'</span>);

  <span style="color: #006600;">// </span><span style="color: #006600;">Index pointer for the original string
</span>  <span style="color: #228b22;">var</span> <span style="color: #996633;">spos</span> = 0;
  <span style="color: #006600;">// </span><span style="color: #006600;">Accumulator for our final string
</span>  <span style="color: #228b22;">var</span> <span style="color: #996633;">highlighted</span> = <span style="color: #996633;">''</span>;

  <span style="color: #006600;">// </span><span style="color: #006600;">Walk down the original string and the hilite hint
</span>  <span style="color: #006600;">// </span><span style="color: #006600;">string in parallel. When you encounter a &lt; or &gt; hint,
</span>  <span style="color: #006600;">// </span><span style="color: #006600;">append the opening / closing tag in our final string.
</span>  <span style="color: #006600;">// </span><span style="color: #006600;">If the current char is not a hint, append the equiv.
</span>  <span style="color: #006600;">// </span><span style="color: #006600;">char from the original string to our final string and
</span>  <span style="color: #006600;">// </span><span style="color: #006600;">advance the original string's pointer.
</span>  <span style="color: #0000ff;">for</span> (<span style="color: #228b22;">var</span> <span style="color: #996633;">i</span> = 0; i&lt; hilite_hints.length; i++) {
    <span style="color: #228b22;">var</span> <span style="color: #996633;">c</span> = str.charAt(spos);
    <span style="color: #228b22;">var</span> <span style="color: #996633;">h</span> = hilite_hints.charAt(i);
    <span style="color: #0000ff;">if</span> (h === <span style="color: #996633;">'&lt;'</span>) {
      highlighted += <span style="color: #996633;">'&lt;b&gt;'</span>;
    } <span style="color: #0000ff;">else</span> <span style="color: #0000ff;">if</span> (h === <span style="color: #996633;">'&gt;'</span>) {
      highlighted += <span style="color: #996633;">'&lt;/b&gt;'</span>;
    } <span style="color: #0000ff;">else</span> {
      spos += 1;
      highlighted += c;
    }
  }
  <span style="color: #0000ff;">return</span> highlighted;
}
</pre>

<p>The previous example is probably too simplistic for production code. You can&#8217;t highlight multiple terms, for example. Some special characters might expand to two characters, such as &#8220;<strong>æ</strong>&#8221; --&gt; &#8220;<strong>ae</strong>&#8221; which will screw up <code>spos</code>. It also strips out angle-brackets (&lt;&gt;) in the original string. But it&#8217;s good enough for a first pass.</p>
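<p>One way around the <code>spos</code> drift is to record, while folding, which original index produced each folded character. The sketch below does that and returns match ranges instead of HTML. All names (<code>fold_table</code>, <code>fold_with_origins</code>, <code>folded_match_ranges</code>) are made up for illustration, and the table is deliberately tiny. It searches with <code>indexOf</code> rather than a regular expression, so characters in the query need no escaping:</p>

```javascript
// Hypothetical fold table. 'æ' expands to two characters on purpose,
// to exercise the position bookkeeping.
var fold_table = { 'á':'a', 'é':'e', 'í':'i', 'ó':'o', 'ú':'u', 'æ':'ae' };

// Fold str (lowercasing as we go) and remember, for every folded
// character, which index of the original string produced it.
function fold_with_origins(str) {
  var folded = '';
  var origins = [];
  for (var i = 0; i !== str.length; i++) {
    var ch = str.charAt(i).toLowerCase();
    var f = fold_table[ch] || ch;
    for (var k = 0; k !== f.length; k++) {
      folded += f.charAt(k);
      origins.push(i);
    }
  }
  return { folded: folded, origins: origins };
}

// Return [start, end) index pairs into the ORIGINAL string for every
// accent-insensitive, case-insensitive occurrence of q.
function folded_match_ranges(str, q) {
  var s = fold_with_origins(str || '');
  var needle = fold_with_origins(q || '').folded;
  var ranges = [];
  if (!needle) { return ranges; }
  var from = 0;
  var last_end = 0;
  for (;;) {
    var hit = s.folded.indexOf(needle, from);
    if (hit === -1) { break; }
    // Clamp so two hits inside one expanded character never overlap.
    var start = Math.max(s.origins[hit], last_end);
    var end = s.origins[hit + needle.length - 1] + 1;
    if (end > start) {
      ranges.push([start, end]);
      last_end = end;
    }
    from = hit + needle.length;
  }
  return ranges;
}

console.log(folded_match_ranges('Fulanilo López', 'lo'));
console.log(folded_match_ranges('Cæsar', 'aes'));
```

<p>Wrapping each range in <code>&lt;b&gt;</code> tags is then a matter of slicing the original string at those indices and HTML-escaping the pieces. A match that begins or ends inside an expanded character highlights the whole original character.</p>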

<h3>Accent-folding in YUI Autocomplete</h3>

<p><a href="http://developer.yahoo.com/yui/autocomplete" id="zg:g" title="YUI's Autocomplete library">YUI's Autocomplete library</a> has many hooks and options to play with. Today we&#8217;ll look at two overrideable methods: <code>filterResults()</code> and <code>formatResult()</code>. The <code>filterResults</code> method allows you to write your own matching function. The <code>formatResult</code> method allows you to change the HTML of an entry in the list of suggested matches. You can also download a <a href="http://github.com/aristus/accent-folding">complete, working example</a> with all of the source code.</p>

<pre class="code">
<span style="color: #006600;">&lt;!-- </span><span style="color: #006600;">this is important to tell javascript to treat
the strings as UTF-8 </span><span style="color: #006600;">--&gt;</span>
&lt;<span style="color: #7f007f;">meta</span> <span style="color: #996633;">http-equiv</span>=<span style="color: #996633;">"content-type"</span> <span style="color: #996633;">content</span>=<span style="color: #996633;">"text/html;charset=utf-8"</span> /&gt;

<span style="color: #006600;">&lt;!-- </span><span style="color: #006600;">YUI stylesheets </span><span style="color: #006600;">--&gt;</span>
&lt;<span style="color: #7f007f;">link</span> <span style="color: #996633;">rel</span>=<span style="color: #996633;">"stylesheet"</span> <span style="color: #996633;">type</span>=<span style="color: #996633;">"text/css"</span>
 <span style="color: #996633;">href</span>=<span style="color: #996633;">"http://yui.yahooapis.com/2.7.0/build/fonts/fonts-min.css"</span> /&gt;
&lt;<span style="color: #7f007f;">link</span> <span style="color: #996633;">rel</span>=<span style="color: #996633;">"stylesheet"</span> <span style="color: #996633;">type</span>=<span style="color: #996633;">"text/css"</span>
 <span style="color: #996633;">href</span>=<span style="color: #996633;">"http://yui.yahooapis.com/2.7.0/build/autocomplete/assets/skins/sam/autocomplete.css"</span> /&gt;

<span style="color: #006600;">&lt;!-- </span><span style="color: #006600;">YUI libraries: events, datasource and autocomplete </span><span style="color: #006600;">--&gt;</span>
&lt;<span style="color: #7f007f;">script</span> <span style="color: #996633;">type</span>=<span style="color: #996633;">"text/javascript"</span>
 <span style="color: #996633;">src</span>=<span style="color: #996633;">"http://yui.yahooapis.com/2.7.0/build/yahoo-dom-event/yahoo-dom-event.js"</span>&gt;&lt;/<span style="color: #7f007f;">script</span>&gt;
&lt;<span style="color: #7f007f;">script</span> <span style="color: #996633;">type</span>=<span style="color: #996633;">"text/javascript"</span>
 <span style="color: #996633;">src</span>=<span style="color: #996633;">"http://yui.yahooapis.com/2.7.0/build/datasource/datasource-min.js"</span>&gt;&lt;/<span style="color: #7f007f;">script</span>&gt;
&lt;<span style="color: #7f007f;">script</span> <span style="color: #996633;">type</span>=<span style="color: #996633;">"text/javascript"</span>
 <span style="color: #996633;">src</span>=<span style="color: #996633;">"http://yui.yahooapis.com/2.7.0/build/autocomplete/autocomplete-min.js"</span>&gt;&lt;/<span style="color: #7f007f;">script</span>&gt;

<span style="color: #006600;">&lt;!-- </span><span style="color: #006600;">contains accent_fold() and accent_folded_hilite() </span><span style="color: #006600;">--&gt;</span>
&lt;<span style="color: #7f007f;">script</span> <span style="color: #996633;">type</span>=<span style="color: #996633;">"text/javascript"</span> <span style="color: #996633;">src</span>=<span style="color: #996633;">"accent-fold.js"</span>&gt;&lt;/<span style="color: #7f007f;">script</span>&gt;

<span style="color: #006600;">&lt;!-- </span><span style="color: #006600;">Give &lt;body&gt; the YUI "skin" </span><span style="color: #006600;">--&gt;</span>
&lt;<span style="color: #7f007f;">body</span> <span style="color: #996633;">class</span>=<span style="color: #996633;">"yui-skin-sam"</span>&gt;
  &lt;<span style="color: #7f007f;">b</span>&gt;<span style="font-weight: bold;">To:</span>&lt;/<span style="color: #7f007f;">b</span>&gt;
  &lt;<span style="color: #7f007f;">div</span> <span style="color: #996633;">style</span>=<span style="color: #996633;">"width:25em"</span>&gt;

    <span style="color: #006600;">&lt;!-- </span><span style="color: #006600;">Our to: field </span><span style="color: #006600;">--&gt;</span>
    &lt;<span style="color: #7f007f;">input</span> <span style="color: #996633;">id</span>=<span style="color: #996633;">"to"</span> <span style="color: #996633;">type</span>=<span style="color: #996633;">"text"</span> /&gt;

    <span style="color: #006600;">&lt;!-- </span><span style="color: #006600;">An empty &lt;div&gt; to contain the autocomplete </span><span style="color: #006600;">--&gt;</span>
    &lt;<span style="color: #7f007f;">div</span> <span style="color: #996633;">id</span>=<span style="color: #996633;">"ac"</span>&gt;&lt;/<span style="color: #7f007f;">div</span>&gt;

  &lt;/<span style="color: #7f007f;">div</span>&gt;
&lt;/<span style="color: #7f007f;">body</span>&gt;

&lt;<span style="color: #7f007f;">script</span>&gt;

<span style="color: #006600;">// </span><span style="color: #006600;">Our static address book as a list of hash tables
</span><span style="color: #228b22;">var</span> <span style="color: #996633;">addressBook</span> = [
  {<span style="color: #000099;">name</span>:<span style="color: #996633;">'Fulanito L&#243;pez'</span>, email:<span style="color: #996633;">'fulanito@example.com'</span>},
  {<span style="color: #000099;">name</span>:<span style="color: #996633;">'Erik L&#248;rgensen'</span>, email:<span style="color: #996633;">'erik@example.com'</span>},
  {<span style="color: #000099;">name</span>:<span style="color: #996633;">'Lorena Smith'</span>,   email:<span style="color: #996633;">'lorena@example.com'</span>},
  {<span style="color: #000099;">name</span>:<span style="color: #996633;">'James L&#246;'</span>,       email:<span style="color: #996633;">'james@example.com'</span>}
];

<span style="color: #006600;">/*</span><span style="color: #006600;">
Iterate our address book and add a new field to each
row called "search." This contains an accent-folded
version of the "name" field.
*/</span>
<span style="color: #0000ff;">for</span> (<span style="color: #228b22;">var</span> <span style="color: #996633;">i</span> = 0; i&lt; addressBook.length; i++) {
  addressBook[i][<span style="color: #996633;">'search'</span>] = accent_fold(addressBook[i][<span style="color: #996633;">'name'</span>]);
}

<span style="color: #006600;">// </span><span style="color: #006600;">Create a YUI datasource object from our raw address book
</span><span style="color: #228b22;">var</span> <span style="color: #996633;">datasource</span> = <span style="color: #0000ff;">new</span> <span style="color: #000099;">YAHOO</span>.<span style="color: #000099;">util</span>.<span style="color: #228b22;">LocalDataSource</span>(addressBook);

<span style="color: #006600;">/*</span><span style="color: #006600;">
A datasource is tabular, but our array of hash tables has no
concept of column order. So explicitly tell the datasource
what order to put the columns in.
*/</span>
datasource.responseSchema = {<span style="color: #000099;">fields</span> : [<span style="color: #996633;">"email"</span>, <span style="color: #996633;">"name"</span>, <span style="color: #996633;">"search"</span>]};

<span style="color: #006600;">/*</span><span style="color: #006600;">
Instantiate the autocomplete widget with a reference to the
input field, the empty div, and the datasource object.
*/</span>
<span style="color: #228b22;">var</span> <span style="color: #996633;">autocomp</span> = <span style="color: #0000ff;">new</span> <span style="color: #000099;">YAHOO</span>.<span style="color: #000099;">widget</span>.<span style="color: #228b22;">AutoComplete</span>(<span style="color: #996633;">"to"</span>, <span style="color: #996633;">"ac"</span>, datasource);

<span style="color: #006600;">// </span><span style="color: #006600;">Allow multiple entries by specifying space
</span><span style="color: #006600;">// </span><span style="color: #006600;">and comma as delimiters
</span>autocomp.delimChar = [<span style="color: #996633;">","</span>,<span style="color: #996633;">" "</span>];

<span style="color: #006600;">/*</span><span style="color: #006600;">
Add a new filterResults() method to the autocomplete object:
Iterate over the datasource and search for q inside the
"search" field. This method is called each time the user
types a new character into the input field.
*/</span>
autocomp.filterResults = function(q, entries, resultObj, cb) {
    <span style="color: #228b22;">var</span> <span style="color: #996633;">matches</span> = [];
    <span style="color: #228b22;">var</span> <span style="color: #996633;">re</span> = <span style="color: #0000ff;">new</span> <span style="color: #228b22;">RegExp</span>(<span style="color: #996633;">'\\b'</span>+accent_fold(q), <span style="color: #996633;">'i'</span>);
    <span style="color: #0000ff;">for</span> (<span style="color: #228b22;">var</span> <span style="color: #996633;">i</span> = 0; i &lt; entries.length; i++) {
        <span style="color: #0000ff;">if</span> (re.test(entries[i][<span style="color: #996633;">'search'</span>])) {
            matches.push(entries[i]);
        }
    }
    resultObj.results = matches;
    <span style="color: #0000ff;">return</span> resultObj;
};

<span style="color: #006600;">/*</span><span style="color: #006600;">
Add a new formatResult() method. It is called on each result
returned from filterResults(). It outputs a pretty HTML
representation of the match. In this method we run the
accent-folded highlight function over the name and email.
*/</span>
autocomp.formatResult = function (entry, q, match) {
    <span style="color: #228b22;">var</span> <span style="color: #996633;">name</span> = accent_folded_hilite(entry[1], q);
    <span style="color: #228b22;">var</span> <span style="color: #996633;">email</span> = accent_folded_hilite(entry[0], q);
    <span style="color: #0000ff;">return</span> name + <span style="color: #996633;">' &lt;'</span> + email + <span style="color: #996633;">'&gt;'</span>;
};

<span style="color: #006600;">//</span><span style="color: #006600;">fin</span>
&lt;<span style="color: #7f007f;">/script</span>&gt;
</pre>
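<p>Stripped of YUI, the matching step is small. The sketch below is not YUI code: <code>search_contacts</code>, <code>escape_regexp</code>, <code>demo_fold</code> and <code>demo_map</code> are made-up names, and the fold table is a stand-in for whatever <code>accent_fold()</code> you use. It also escapes regex metacharacters in the query, which the <code>filterResults()</code> above does not, so typing &#8220;(&#8221; won&#8217;t throw:</p>

```javascript
// Stand-in fold table, assumed for this sketch only.
var demo_map = { 'á':'a', 'é':'e', 'í':'i', 'ó':'o', 'ú':'u', 'ö':'o', 'ø':'o' };

function demo_fold(s) {
  var ret = '';
  for (var i = 0; i !== s.length; i++) {
    ret += demo_map[s.charAt(i)] || s.charAt(i);
  }
  return ret;
}

// Backslash-escape characters that are special inside a RegExp.
function escape_regexp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, function (m) { return '\\' + m; });
}

// Return the contacts whose folded, lowercased name has a word that
// starts with the folded, lowercased query. \b is ASCII-only in
// JavaScript, which is fine here because names are folded toward ASCII.
function search_contacts(contacts, q) {
  var folded_q = demo_fold((q || '').toLowerCase());
  if (!folded_q) { return []; }
  var re = new RegExp('\\b' + escape_regexp(folded_q));
  return contacts.filter(function (c) {
    return re.test(demo_fold(c.name.toLowerCase()));
  });
}

var contacts = [
  { name: 'Fulanito López' },
  { name: 'Erik Lørgensen' },
  { name: 'Lorena Smith' },
  { name: 'James Lö' }
];

console.log(search_contacts(contacts, 'lo').map(function (c) { return c.name; }));
```

<p>With this table, &#8220;lo&#8221; matches all four contacts. In a real application you would fold each name once, as the YUI example does with its <code>search</code> field, rather than on every keystroke.</p>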

<h3>About those million caveats...</h3>

<p>This accent-folding trick works primarily for Western European text, but it won&#8217;t work for all of it. It exploits specific quirks of the language family and the limited problem domains of our examples, where it&#8217;s better to get more results than no results. In German, <strong>Ü</strong> should probably map to <strong>Ue</strong> instead of just <strong>U</strong>. A French person searching the web for <strong>thé</strong> (tea) would be upset if flooded with irrelevant English text.</p>
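<p>One way to accommodate rules like the German one is to layer a per-language table over the default. A sketch, with made-up names (<code>default_map</code>, <code>locale_overrides</code>, <code>accent_fold_for</code>) and deliberately tiny tables; deciding which language applies to a given string is a separate, harder problem:</p>

```javascript
// Default folding: drop the accent.
var default_map = { 'ä':'a', 'ö':'o', 'ü':'u', 'Ä':'A', 'Ö':'O', 'Ü':'U', 'é':'e' };

// Per-language exceptions, keyed by a language code the caller supplies.
var locale_overrides = {
  de: { 'ä':'ae', 'ö':'oe', 'ü':'ue', 'Ä':'Ae', 'Ö':'Oe', 'Ü':'Ue', 'ß':'ss' }
};

function accent_fold_for(s, lang) {
  if (!s) { return ''; }
  var overrides = locale_overrides[lang] || {};
  var ret = '';
  for (var i = 0; i !== s.length; i++) {
    var ch = s.charAt(i);
    // A language-specific rule wins over the default; unknown
    // characters pass through unchanged.
    ret += overrides[ch] || default_map[ch] || ch;
  }
  return ret;
}

console.log(accent_fold_for('Müller', 'de'));
console.log(accent_fold_for('Müller'));
```

<p>Whatever table you pick, fold the query and the indexed text with the same one, or the two sides will stop matching.</p>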

<p>You can only push a simple character map so far. It would be very tricky to reconcile the Roman &#8220;<strong>Marc Chagall</strong>&#8221; with the Cyrillic &#8220;<strong>Марк Шагал</strong>&#8221; or Hebrew &#8220;<strong>מאַרק שאַגאַל</strong>.&#8221; There are very interesting similarities in the characters but a magical context-free two-way mapping is probably not possible.</p>

<p>On top of all that there is another problem: One language can have more than one writing system. <a href="http://en.wikipedia.org/wiki/Transliteration">Transliteration</a> is writing a language in a different alphabet. It&#8217;s not quite the same as <a href="http://en.wikipedia.org/wiki/Transcription_(linguistics)">transcription</a>, which maps <em>sounds</em> as in &#8220;<strong>hola, que paso</strong>&#8221; --&gt; &#8220;<strong>oh-la, keh pah-so</strong>.&#8221; Transliterations try to map the <em>written symbols</em> to another alphabet, ideally in a way that&#8217;s reversible.</p>

<p>These four sentences all say &#8220;Children like to watch television&#8221; in Japanese:</p>

<ul>
<li><b>One Way</b>: 子供はテレビを見るのが好きです。</li>
<li><b>Another Way</b>: こども は てれび を みる の が すき です 。</li>
<li><b>Romaji</b>: kodomo wa terebi o miru no ga suki desu.</li>
<li><b>Cyrillic</b>: кодомо ва тэрэби о миру но га суки дэсу.</li>
</ul>

<p>For centuries people have been inventing ways to write different languages with whatever keyboards or typesetting they had available. So even if the user reads only one language, they might do so in multiple transliteration schemes. Some schemes are logical and academic, but often they are messy organic things that depend on regional accent and historical cruft. The computer era kicked off a new explosion of systems as people learned to chat and send emails in plain ASCII.</p>

<p>There is a lot of prior work on this problem and two ready paths you can choose: the right way and the maybe-good-enough way. Neither has the simplicity of our naïve hash table, but both will be less disappointing for your users in general applications.</p>

<p>The first is <a href="http://icu-project.org" title="International Components for Unicode">International Components for Unicode</a> (ICU), a project that originated in the early nineties at Taligent. It aims to be a complete, language-aware transliteration, Unicode, formatting, everything library. It&#8217;s big, it&#8217;s C++/Java, and it requires contextual knowledge of the inputs and outputs to work. </p>

<p>The second is <a href="http://search.cpan.org/perldoc/Text::Unidecode" title="Unidecode">Unidecode</a>, a context-free transliteration library available for Perl and <a href="http://pypi.python.org/pypi/Unidecode" title="Python">Python</a>. It tries to transliterate all Unicode characters to a basic Latin alphabet. It makes little attempt to be reversible, language-specific, or even generally correct; it&#8217;s a quick-and-dirty hack that nonetheless is pretty comprehensive.</p>

<p>Accent-folding in the right places saves your users time and makes your software smarter. The strict segregation of languages and locales is partially an artifact of technology limitations which no longer hold. It&#8217;s up to you how far you want to take things, but even a little bit of effort goes a long way. Good luck! </p>

<br/>
<p><a href="http://www.italianalistapart.com/articoli/11-numero-due-9-marzo-2010/44-accent-folding-per-completamento-automatico">Italian Translation</a></p>


         ]]></content>
         <link rel="alternate" type="text/html" href="http://carlos.bueno.org/2010/02/accent-folding-for-autocomplete.html" />
       </entry>
     </feed>
