## Making estimates about crazy things

I was re-reading a great little book on my Kindle while grinding out my miles and found a really intriguing and fun idea. The book is a collection of blog posts The Atlantic magazine did, and it led me to a new source of amusement (unfortunately I don’t have the book with me, so I can’t quote the specific article except from my limited and flawed memory).

At any rate, that article led me to What If?, an amusing set of articles by Randall Munroe, the author of the very funny xkcd comic strip. The Atlantic article also mentioned Fermi estimates, which added to the thought thread I was having. Now that concept, making reasonable guesses at hard-to-measure values, is an old and treasured idea I was introduced to long ago in high school, and later, with more amplification, at MIT. That is, without being able to gather enough data and/or do experiments, can we still make a reasonably accurate guess about something? As the Wikipedia article explains, yes, we can probably get within an order of magnitude of the correct answer with a sensible process. I still remember a question we were given: how many cars could fit on Mass Ave between the river and Harvard Square? Now no one actually knows the answer, but it is possible to make a relatively good guess, or even better, to create error bounds (not all cars are the same length, so compute for the smallest and largest car). Today, with digital maps and Google Earth, the actual area of Mass Ave could be estimated rather accurately, so it seems reasonable we could make this estimate easily within a factor of 2.
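The Mass Ave question makes a nice concrete example of the error-bounds idea. Here's a minimal sketch; every number in it (street length, lane count, car sizes) is my own rough assumption, not measured data:

```python
# Fermi estimate: how many cars fit on Mass Ave between the Charles
# River and Harvard Square? Every number below is an assumption.
street_length_m = 2700      # rough guess: about 1.7 miles of roadway
lanes = 4                   # rough guess: two lanes each way

car_small_m = 3.5           # a compact car, bumper to bumper
car_large_m = 5.5           # a large sedan or SUV

# Error bounds: all-small-cars gives the high count, all-large the low.
high = lanes * street_length_m / car_small_m
low = lanes * street_length_m / car_large_m

print(f"somewhere between {low:.0f} and {high:.0f} cars")
```

The point isn't the exact numbers; it's that the two bounds land within a factor of about 1.6 of each other, which is exactly the "within an order of magnitude" property a Fermi estimate promises.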

So other than immensely enjoying some of the What If? estimates, I began to think about some of my own. Now Munroe is a lot funnier than I am, not to mention a much better cartoonist, but nonetheless I might be able, occasionally, to do a few of my own.

So I started with a simple one, triggered by one of Munroe’s estimates: how many feet of blog have I written? Now since I’m at Starbucks and my preliminary data is at home, I can’t write that up right now, so I thought about a different question, also initially triggered by one of Munroe’s pages.

If all the rain that fell somewhere, say the lower 48 states of the U.S., fell in a single drop, how big would it be? To try to solve this, lots of things have to be considered, like the fact that a raindrop isn’t actually a sphere, but mostly, to get an answer, we’d have to know what volume of rain fell.
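Ignoring the raindrop-shape problem and treating the drop as a sphere, the arithmetic is short. The two inputs below are assumptions I'm pulling out of the air for illustration (the lower-48 area is roughly right; the 2 mm average daily rainfall is a pure guess):

```python
import math

# Toy version of the single-drop question. Both inputs are assumptions:
# lower-48 area of roughly 8.1 million km^2, and an assumed average
# rainfall of 2 mm over one day.
area_m2 = 8.1e6 * 1e6       # square kilometers -> square meters
rain_m = 0.002              # assumed average daily rainfall, 2 mm

volume_m3 = area_m2 * rain_m

# Sphere volume is V = (4/3) * pi * r^3, so r = (3V / 4pi)^(1/3).
radius_m = (3 * volume_m3 / (4 * math.pi)) ** (1 / 3)

print(f"volume {volume_m3:.2e} m^3 -> drop radius about {radius_m:.0f} m")
```

Under those made-up inputs the drop comes out a bit over a kilometer in radius, which is the kind of absurd-but-defensible number these questions produce.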

So I was thinking.

One thing Munroe does so well is the background research to find useful information for making the estimates. That takes time and, of course, is interesting all by itself because it makes one search for interesting nuggets of information, a task that is (mostly) feasible today with the Net. I don’t have time, at the moment, to do all that research, so I’ll just write up the ideas I’ve had.

Unfortunately I didn’t bookmark (or copy) the xkcd cartoon, so I have to handwave a bit. It dealt with something I face all the time. I’ve often said I’d rather invent a lawnmower than mow grass, i.e. create a machine (mostly programs) to do some tedious task. After a lifetime of developing software for a living I still enjoy wasting time writing programs, and so the cartoon (which I can’t find again) hits home. If you’re doing a tedious task as a programmer, you could probably write a program to do the task. But writing the program takes time, possibly more time than doing the task itself (though it might be more fun to write the program). The cartoon showed the fallacy of this reasoning in a humorous way: you start out to write a simple program to “automate” doing something, and the programming project expands to fill all available time, so almost certainly writing the program takes more time than doing the task. Boy, do I know that – but again, the program is more fun.
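The tradeoff the cartoon pokes fun at is easy to put in numbers. A back-of-envelope break-even sketch, with every figure invented for illustration:

```python
# Automation break-even, in the spirit of the cartoon.
# All numbers below are made up for illustration.
task_minutes = 5            # time to do the chore by hand, once
times_per_week = 3
horizon_weeks = 52          # how long you'll keep doing the chore

# Total hand-labor the program could save, in minutes.
time_saved = task_minutes * times_per_week * horizon_weeks

# The cartoon's point: the "simple" program grows to fill the time.
estimated_dev_minutes = 2 * 60    # the optimistic plan
actual_dev_minutes = 20 * 60      # what it usually turns into

print(time_saved, estimated_dev_minutes, actual_dev_minutes)
```

With these made-up numbers the plan looks like a clear win (2 hours of programming to save 13), but the "actual" figure lands well past the break-even point, which is the fallacy in a nutshell.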

So my hypothetical question could be addressed with a program that I definitely don’t have the time to write, but at least I can imagine it.

Fortunately the Net is not just for humans. A web server can’t tell (at least if you fake some headers) whether a request for a web page is being made by a person (via a browser) or by a program. So one can write a program to fetch web pages (some of the classes in .NET make this relatively easy) and then do something I once just did and later discovered had a name – page “scraping”, i.e. parsing the HTML in a task-specific way to extract bits of information from the page. As is usual, I did this and a couple of years later learned that this was exactly why web services had been invented (why send all that HTML when a shorter and easier-to-parse XML answer can be supplied for a specific type of query?). But most websites have to make their money via ads, so they’re not that interested in providing web services to make programmatic access easier.
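To make "scraping" concrete, here's a minimal sketch using Python's standard-library HTML parser. The page below is invented; a real weather site would need its own task-specific rules, which is exactly what makes scraping tedious:

```python
from html.parser import HTMLParser

# An invented page fragment standing in for a real weather site.
PAGE = """
<html><body>
  <div class="station">Omaha</div>
  <span class="rainfall">0.42</span>
</body></html>
"""

class RainfallScraper(HTMLParser):
    """Pull the text out of any tag whose class is 'rainfall'."""
    def __init__(self):
        super().__init__()
        self.grab = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        # Only grab text that follows a tag marked class="rainfall".
        self.grab = dict(attrs).get("class") == "rainfall"

    def handle_data(self, data):
        if self.grab and data.strip():
            self.values.append(float(data))
            self.grab = False

scraper = RainfallScraper()
scraper.feed(PAGE)
print(scraper.values)   # the rainfall figures found in the page
```

The fragility is the point: the parser is welded to one site's markup, and the day the site redesigns, the scraper breaks – which is why a proper web service would be so much nicer.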

Anyway, to get back to the question, it’s now easy to imagine how to solve the issue of how much rain falls on any given day – mine the weather sites on the web.

So imagine we write a program that has a list of known weather stations from which we can get the rainfall at each station for a particular day. Not every inch of the U.S. is covered by reported data, so we’d have to settle for some finite number of locations. That data would give us a mesh of triangles where the value at each vertex is known (say Omaha, KC, and St. Louis), and from it we can interpolate at whatever resolution we want (say, per square mile enclosed in each triangle). After doing this process (fetch the data, interpolate per square mile, sum up) we’d have an estimate of the total rainfall.
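The interpolation step inside one triangle is standard linear (barycentric) interpolation. A sketch, with made-up station positions and rainfall values standing in for scraped data:

```python
# Linear interpolation inside one triangle of the station mesh.
# Station coordinates and rainfall values (inches) are invented.
def barycentric_weights(p, a, b, c):
    """Barycentric weights of point p w.r.t. triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    w1 = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    w2 = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return w1, w2, 1.0 - w1 - w2

def interpolate(p, triangle, values):
    """Rainfall at p, linearly blended from the three vertex readings."""
    w = barycentric_weights(p, *triangle)
    return sum(wi * vi for wi, vi in zip(w, values))

# Rough stand-in (x, y) positions for Omaha, KC, and St. Louis,
# with made-up rainfall readings at each.
triangle = [(0.0, 1.0), (0.2, 0.0), (1.0, 0.1)]
values = [0.30, 0.55, 0.40]

centroid = (sum(x for x, _ in triangle) / 3, sum(y for _, y in triangle) / 3)
print(interpolate(centroid, triangle, values))
```

Summing `interpolate` over a fine grid of points inside every triangle of the mesh (times the area each point represents) gives the total-rainfall estimate described above.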

Now obviously the accuracy of the answer depends on both the resolution of the actual data and the accuracy of the rainfall report at each data point. The more points, the more accurate the answer.

But websites frown on massive numbers of requests from a bot – this is known as a DoS (denial-of-service) attack. So it’s probably not reasonable to attempt to fetch the reported rainfall for every weather station in the U.S. without somebody blocking my IP. So some reasonable number of locations (say 100) is the limit, and that keeps my algorithm from getting a very accurate answer. But how accurate?

Now, temporarily, I might risk the ire of weather.com (or some other site; I haven’t researched the best one to use) and, while at Starbucks (and thus not on my home IP), I could, once, get tens of thousands of locations and thus get the best answer I could, then do some simulations of how the answer from (100, 200, 400, 800, …) stations compares to the answer from the maximum number of stations I can mine.
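Without scraping anything, the shape of that experiment can be simulated. Here's a stand-in version: invent a smooth "true" rainfall field over a unit square, treat a fine-grid average as the tens-of-thousands-of-stations answer, then watch the error as the station count doubles. Everything here is synthetic:

```python
import math
import random

# A made-up smooth rainfall field over the unit square, standing in
# for the real (unknown) rainfall pattern.
def rainfall(x, y):
    return 1.0 + 0.5 * math.sin(3 * x) * math.cos(2 * y)

# "Best answer": average the field over a fine grid, playing the role
# of the maximum-stations run.
fine = 400
true_mean = sum(
    rainfall((i + 0.5) / fine, (j + 0.5) / fine)
    for i in range(fine) for j in range(fine)
) / fine**2

random.seed(1)
for n in (100, 200, 400, 800, 1600, 3200, 6400):
    # n randomly placed "stations", each reporting the field exactly.
    sample = [rainfall(random.random(), random.random()) for _ in range(n)]
    estimate = sum(sample) / n
    err = abs(estimate - true_mean) / true_mean
    print(f"{n:5d} stations: relative error {err:.4f}")
```

The trend, not the specific numbers, is what the real experiment would be after: random sampling error shrinks roughly like 1/√n, so each doubling of stations buys a predictable amount of accuracy.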

But, as per my digression about writing programs to do tedious tasks, I suspect that if I actually tried this I’d run into all sorts of complications, not to mention creeping elegance (like what to do about snow and hail, or measuring temps and humidity too), and thus this program would probably take weeks, if not months, to do. And I have dozens of such projects to fill all my available time, so any focused effort just subtracts from all the fun projects I could do. (My wife is amazed I even think about stuff like this, since she thinks “hobbies” should be something tangible, like gardening, but I’ve always found this kind of project more fun – to each his own.)

So I can’t actually do the experiment and report the results, but I will bet that even with a feasible number of data points (below the threshold where a site blocks me for DoS) I could get an answer, certainly within a factor of 10, and probably much better than that (an unproven assertion).

But no matter how much data I used, from weather stations and interpolation, the answer wouldn’t be exactly right. So can we do what Munroe does and not limit ourselves to the feasible as a way of addressing a question (like his first What If? post about a relativistic baseball)?

So I thought the simplest thing would be to cover the entire U.S. with a roof, with drains at certain places, and actually measure the runoff. Of course, that still isn’t right. Some of the water would evaporate before being collected and measured, but once again the old basic idea of calculus enters, where Δt → 0 (or Δspacing of drains → 0), and so, in principle, this approach could measure correctly.

Or can it?

Again there is evaporation to deal with, but, perhaps more interestingly (as my thought process evolved), surface tension. Some of the rain would just stick to the collecting roof and never reach the drains.

So there is a better approach. Simply create a grid of “square” parabolic collectors (just let gravity deflect a thin sheet – or is that a catenary?), mount each corner of a square on a strain-gauge scale, and measure the weight. We would know (with some inaccuracy) the weight of the collectors themselves, so the measured weight would be the weight of the rain, and if we measured often enough (Δt → 0) evaporation could be handled. Now, other than the trillions this would cost, not to mention the environmental impact statements you’d have to do (Munroe could make this a lot funnier), we could, in principle, get an answer, and then compare that answer to the one our rainfall-mining and interpolation program would get. Eventually all this could fit in a nice graph (done daily over years to get enough data for statistical analysis) of accuracy (measured as a percentage of the “real” value) as a function of the number of measuring points.
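One small conversion hides in the strain-gauge scheme: the scales report weight, but the original question asks for volume. Density of water closes the gap. A sketch with invented collector numbers:

```python
# Converting a strain-gauge reading back to rain volume and depth.
# Collector size and measured weight are invented for illustration.
WATER_DENSITY = 1000.0          # kg per cubic meter

collector_area_m2 = 100.0       # one made-up 10 m x 10 m collector
measured_rain_kg = 250.0        # weight change after a storm

volume_m3 = measured_rain_kg / WATER_DENSITY
depth_mm = volume_m3 / collector_area_m2 * 1000   # meters -> millimeters

print(f"{volume_m3:.3f} m^3 of rain, i.e. {depth_mm:.1f} mm of depth")
```

Summing that volume over every collector in the grid gives the "real" total that the scraping-and-interpolation estimate would be graded against.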

Nice!