## Applied Nate Silver – 6

I wonder if I can use this idea, somehow, to improve my prediction?

I’m looking at some statistics books, or in this case meta-statistics. Bad Data Handbook by Q. Ethan McCallum (Print ISBN-13: 978-1-4493-2188-8) has some interesting practical tips for dealing with bad data. I certainly have some bad data, so I could use the help, and thus far Nate hasn’t gotten back to me to answer my questions.

Ethan starts a section with:

> Paradoxically, wrong turns can lead to the most desirable destinations. They can take us to otherwise hidden places—as long as we don’t cling too strongly to the original plan and instead pay attention to the actual scenery.

He then works through an example in which his bad data didn’t match the expected model (a Poisson distribution), and the investigation this prompted uncovered the need for additional classification of the data, after which it did fit. (This was not, btw, squeezing the data until it screams; it was genuine recognition of a particular error in the raw data.) So he concludes with (my emphasis):

> This is extremely important: to gain insight that goes beyond the merely descriptive, we need to formulate a prescriptive statement. In other words, we need to make a statement about what we expect the data to do (not merely what we already know it does—that would be descriptive). To make such a statement, we typically have to combine specific observations made about the data with other information or with abstract reasoning in order to arrive at a hypothetical theory—which, in turn, we can put to the test. If the theory makes predictions that turn out to be true, we have reason to believe that our reasoning about the system was correct. And that means that we now know more about the system than the data itself is telling us directly.

I’m not immediately sure how to apply this, but I’ll keep it in mind. Here’s one possible application, though.

While my longer-term data (all the daily weights, superimposed on weekly weights over many weeks) makes some sense, today’s weigh-in makes no sense and the pattern for the week makes little sense. Now I can believe the data and just say these are random fluctuations (which means the noise is way bigger than the signal), or perhaps I can use Ethan’s advice and see if there is some other explanation.

During this week I’ve run a calorie deficit every day, that is, more calories burned than consumed. But I’ve already found this data to be fairly noisy. I have no reliable technique for measuring calories burned (neither the BodyFit device nor the readings from my machines are very accurate), and naturally there is some bias in self-reporting food consumption; plus, once I deviate from a published nutrition statement (as in cooking up my own recipe) I no longer have accurate consumption data. BUT, I do know that every day this week my consumption has been less than my burn, so that should predict a steady decline. In fact, as of last night the best prediction I could make would indicate a 1.9 lb loss already this week.
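The prediction above is just a cumulative-deficit calculation. Here’s a minimal sketch of it, using the common rule of thumb of roughly 3500 kcal per pound of fat (an approximation, not a physiological constant); the daily deficit numbers are invented for illustration, not my actual logs:

```python
# Convert a week of daily calorie deficits into an expected weight loss.
# The 3500 kcal-per-lb figure is the usual rule of thumb (hedged: real
# physiology is messier); the deficits below are made-up examples.
KCAL_PER_LB = 3500.0

def predicted_loss_lbs(daily_deficits_kcal):
    """Expected cumulative weight loss given per-day calorie deficits."""
    return sum(daily_deficits_kcal) / KCAL_PER_LB

# Hypothetical week of deficits (calories burned minus consumed):
week = [900, 1100, 950, 1000, 1200, 1500]
print(round(predicted_loss_lbs(week), 2))  # → 1.9
```

The point of making the calculation explicit is that it’s a prescriptive statement in Ethan’s sense: it predicts a steady decline that the scale data can then confirm or contradict.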

Now, interestingly, this graph somewhat confirms that prediction, but look at the r^2. (The two trendlines are based on either all the raw data (blue dots), r^2 = 0.2501, or the daily averages (red dots), r^2 = 0.4716.) Neither is a very good fit. Just eyeballing the data makes it hard to even spot any kind of trend.

But here’s the point, and the tie-in to Bad Data Handbook. Look at the huge spread today (the rightmost bunch of blue points). The standard deviation is 0.913 vs. the more typical 0.525 I usually see, and you can easily see this on the graph.
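That spread check is just the per-day sample standard deviation. A minimal sketch, with two invented days of readings to illustrate the contrast (a typical day vs. today’s wide scatter; these are not my actual numbers):

```python
# Per-day spread: the sample standard deviation of that day's readings.
# Both days below are hypothetical, chosen to show the contrast.
from statistics import stdev

typical_day = [200.1, 200.6, 200.3, 200.9, 200.4]   # tight cluster
today       = [202.5, 201.0, 200.4, 202.1, 200.2]   # wide scatter

print(round(stdev(typical_day), 3))  # small
print(round(stdev(today), 3))        # roughly 3x larger
```

A day whose stdev is nearly double the usual value is exactly the kind of "data failing to live up to expectations" that the book says deserves investigation rather than averaging away.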

Due to what I ate yesterday, and the feeling that the previous day’s weight was anomalous, I expected a higher number today, but more like 200.5 than the 201.0 I got. The first value, 202.5, was immediately disconcerting, but it is also a clue.

I use my digital Weight Watchers scale on a linoleum floor that has a grid pattern (an actual deviation from flat). I’ve been using this grid to place the scale in the same location for each weight sample. BUT, I don’t exactly position it relative to the grid, and that’s what caught my attention. The first time, I positioned it significantly further to the right, so the right side of the scale aligned with a grid mark rather than the more typical case where the left side aligns with one (the grid is slightly wider than the scale itself). Previously I had the “feeling” (not measured) that I get a higher weight when the scale is further to the right, so when I got the high number today I immediately had that suspicion. That had a consequence: in subsequent samples I was careful to align the scale to the left, and most (although not all) of the samples were lower.

But this wasn’t a very accurate experiment, and there was only a tiny bit of data, so it’s difficult to draw any conclusions. Given the ideas from Bad Data Handbook, though, I now wonder if I should run this experiment very carefully (esp. since one day I had a very tight cluster of samples, stdev = 0.063).
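If I do run the placement experiment carefully, one simple analysis would be to compare left-aligned vs. right-aligned readings with a Welch-style t statistic (one standard way to compare two noisy samples; a real test would also look up a p-value for the statistic). All the readings below are hypothetical:

```python
# Compare two groups of scale readings (left vs. right placement)
# with Welch's t statistic. A large |t| suggests the placement
# difference is real rather than noise. Readings are invented.
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

left  = [200.4, 200.6, 200.3, 200.5, 200.7]   # hypothetical readings
right = [201.1, 201.4, 200.9, 201.3, 201.2]   # hypothetical readings

print(round(welch_t(right, left), 2))  # well above 2: placement likely matters
```

With the tight within-position clusters I sometimes see (stdev around 0.06), even a small left/right offset would produce a huge t value, so the experiment shouldn’t need many samples.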

There is also another factor. I had previously decided that the scale has “memory”. That is, immediately after you’ve gotten a weight and the display dims, if you take another weight you get exactly the same value (I’ve never thoroughly tested this). So I have the superstition that I must: a) wait at least one minute (or so) between weighings, and b) move the scale to its storage location and back again. Obviously this is an easily testable experiment.

Another issue is that the scale is relatively small compared to my feet, so almost every time I climb on I end up with my feet in a different position, marked by either one or both toes over the top edge or heels over the back edge. I have the “feeling” that this matters to the weight. Obviously this is a testable experiment too.

And finally, I don’t always have the same posture on the scale. Combined with foot location, that means my center of mass aligns over a different place on the scale each time. The scale is probably some sort of strain gauge, but the base is transparent glass and I don’t see a measuring sensor there, so presumably the sensor is somehow measuring the deflection of the glass. And my alignment over the scale bed, or the alignment of the scale on the floor (which is slightly soft), could possibly have an effect.

I’ve often felt that most of the samples “cluster” around the likely value while one or two others are “outliers”. Since I don’t get a large number of samples, and I (mostly) use the mean as the final result, my value can still be off, I’d guess by about ±0.4 lbs. If I were to discard outliers, it’s possible the result would converge a bit more.
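The outlier-discarding idea can be sketched simply: drop any sample more than some multiple of the standard deviation away from the median, then average what’s left. The threshold and the sample values below are my own illustrative guesses, not a standard recipe:

```python
# Trimmed mean: drop points more than k standard deviations from the
# median (the median resists the outlier better than the mean does),
# then average the rest. k=1.5 and the samples are illustrative choices.
from statistics import mean, median, stdev

def trimmed_mean(samples, k=1.5):
    """Mean after dropping points more than k*stdev from the median."""
    if len(samples) < 3:
        return mean(samples)
    m, s = median(samples), stdev(samples)
    kept = [x for x in samples if abs(x - m) <= k * s]
    return mean(kept) if kept else mean(samples)

samples = [200.4, 200.5, 200.6, 200.5, 202.5]   # one suspicious outlier
print(round(mean(samples), 2))          # → 200.9 (dragged up by the outlier)
print(round(trimmed_mean(samples), 2))  # → 200.5 (outlier discarded)
```

With only a handful of samples per day, one 2 lb outlier shifts the plain mean by 0.4 lb, which matches my guess at the error above.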

Now all this is fairly irrelevant if I just let the law of large numbers kick in, that is, accumulate a large amount of data over many days. But I’m not measuring something constant. I am on a fitness kick, so I’m trying to decrease my value. And I don’t succeed in doing that with a “perfect” calorie deficit per day, or with precisely controlled intake (like of sodium, which might affect “water weight”), etc. So the law of large numbers won’t entirely work (it might be more useful in “maintenance”).

But most of all there’s the problem of delayed gratification. Every waking minute I’m aware I’m restraining my eating. Every time I go exercise I’m aware I’m doing that (sometimes when I don’t want to, or it’s unpleasant). So I’m feeling the “pain” of this all the time; I’m almost constantly aware of it. That’s probably 15 hours a day where this is the thought most in my mind. I need a reward! Without the sense that all this “sacrifice” is accomplishing something, it’s hard to maintain my resolve.

When I quit smoking and had the obvious withdrawal pangs, I also had immediate gratification: no smoking. It was binary. I was doing exactly what I set out to accomplish (it’s much easier to quit than to moderate, for this reason). While I knew the cravings would eventually diminish (as my neuro-receptors reorganized themselves), and that would take time, I nonetheless got the immediate gratification of knowing I’d “succeeded” (every minute without smoking was more success).

But weight loss is harder. I can’t just stop eating; I have to achieve moderation. Every time I do eat something I want to eat more. So I need “success” to provide the willpower to resist “more”. And that’s why all the noise in this data bugs me so much, and why I keep looking to increase my skill at data collection and analysis, so I can get the reward I need.

And I’ll really need this during maintenance. While I’m in loss mode a brief failure doesn’t much matter; it just means it will take a little longer to reach my goals (assuming I don’t plateau for weeks). But in maintenance I am really convinced that I must immediately compensate for any short-term uptick. If I let it go, I suspect I won’t ever make it up, and then there will be another uptick, and another, and so forth, and months later I’ve regained all the weight. Eternal vigilance is the price I’ll pay for genes that encourage me to overeat and to prefer high-energy-density nutrients (which are not evil, as so much junk nutrition writing claims; they’re merely high-calorie in a small volume and often tasty, so easier to abuse, a bit like how it’s easier to get drunk on pure ethanol (Everclear, as I once learned the hard way) than on beer).

So all this obsession with statistics has a point.

p.s. In the “lessons learned” that closed the chapter of Bad Data Handbook I was reading, I see this:

> But most importantly, I think many people are unaware of the importance of assumptions, in particular when it comes to the effect they have on subsequent calculations being performed on a dataset. Every statistical or computational method makes certain assumptions about its inputs—but I don’t think most users are sufficiently aware of this fact (much less of the details regarding the applicable range of validity of each method). Moreover, it takes experience in a wide variety of situations to understand the various ways in which datasets may be “bad”—“bad” in the sense of “failing to live up to expectations.” A curious variant of this problem is the absence of formal education in “empirical methods.” Nobody who has ever taken a hands-on, experimental “senior lab” class (as is a standard requirement in basically all physics, chemistry, biology, or engineering departments) will have quite the same naive confidence in the absolute validity of a dataset as someone whose only experience with data is “in a file” or “from the database.” Statistical sampling and the various forms of bias that occur are another rich source of confusion, and one that not only requires a sharp and open mind to detect, but also lots of experience.

The second emphasized part triggers two immediate thoughts for me: 1) I’m dealing with a real-world issue of significant concern, and I can see all the data issues instead of just tossing the raw data into a formula; and 2) I do have an engineering background and did a lot of experimental work. In fact, this triggered the memory of an anecdote.

I had a “job” while in college doing grunt work for grad students. My longer-term project was to get some electron microscope pictures of tiny samples of titanium that had undergone tests. To do this I was making the equivalent of a “wax impression”: I had to apply a very dilute solution of plastic, wait for it to dry, then try to strip it off and get it onto these tiny copper screens that would go in the electron microscope (they would be metal-shadowed by evaporating germanium onto them). This turned out to be incredibly difficult to do. One “hack” that seemed to work was to breathe heavily on the plastic before trying to strip it off (increasing the moisture). The slight problem was that this meant I was hyperventilating in a room filled with the fumes of the plastic’s solvent, which I believe was amyl nitrite, the active ingredient in glue sniffing. I did notice that I was often pretty loopy while doing this work.

One day we had a panic. A sponsor was coming in to see the results, and we really hadn’t gotten very far with the planned experiments. So, quickly: let’s get some data. We did have numerous titanium samples from experiments, but not yet under the highly controlled conditions we needed; the critical issue was that these were polycrystalline samples, as we hadn’t yet perfected the technique of getting single-crystal samples (another part of my work that was actually more interesting). But titanium has an interesting property: it “twins” under plastic deformation, and using a polarizing filter on a microscope the twinned parts of the crystals could be seen (a nice iridescent green color). So quickly I needed to get data from the samples. This involved a simple approach. There was a reticule in the microscope, a 10×10 grid, and I had one of those old click counters gatekeepers use to count crowds. So: scan the grid, click on the green bits, do this over and over, and eventually use the counts to estimate the fraction of twinned material. That value (an average over many scans) could then be plotted against the temperature at which the sample had been crushed.

Fine, I was chugging along, but my previous work, a dish full of the plastic and its solvent (the glue-sniffing stuff), was right next to me. I began to get drowsy with the tedious work, plus just a little altered. So at one point I mixed up the samples (they were in labeled glass tubes) and couldn’t figure out which was which. This made it a blind test for me: I had no idea which sample was which, or any idea how the data should turn out.

A day later the PhD candidate I assisted was showing this plot to the sponsor and carrying on about how significant a particular blip in the data was. Looking at the graph (with nothing but eyeball intuition), a much more likely explanation of the blip was simply that the data point was mislabeled and really belonged elsewhere (imagine a fairly straight ramp of points; take one from the right and move it to the middle). I had this totally sinking feeling. Oh shit, my adviser is making a big deal over a data point that was just my mistake. Of course the sponsor also thought it was significant.

Now I hadn’t been paying much attention to the metallurgy part of this (after all, it was only my major, why should I actually know anything?). So after the presentation I sheepishly admitted my possible mistake to the grad student. He laughed and explained that there were fundamental reasons for the blip in the plot, and that he had already found my mistake and corrected it.

So what? The point is that having an underlying model, realizing that purely playing with data is often going to be flawed, and having some real-world experience of exactly how hard it is to collect and analyze data all help a lot if you’re going to be a quant.

So all those MIT PhDs who generated all those terrible models for rating MBS (mortgage-backed securities) had no knowledge of the real-world limitations of their data (like the fact that their datasets covered only a few years of history for the exotic mortgages). Or Mitt Romney and his team, so convinced they would win (or couldn’t possibly lose to the socialist Muslim imposter) that they fooled themselves into believing nonsense data.

So just crunching a bunch of data with a bunch of software on powerful computers doesn’t prove much of anything; some “gut feel” can go a long way to help out.