You’re More Than Just a Number: Now, You’re a Vector

Unless you’ve been hiding in a bunker for the last few years (and who’s to blame you if you were?), you know that data science, big data, and machine learning are all the rage.  And you know that the NSA has gotten scary good at surveilling the world via its data-parsing mojo.

These trends have overturned, or at least added a whole new wrinkle to, the concern so prevalent when I was a kid: that individuals in modern societies were becoming faceless numbers in an uncaring machine. Faceless number? These days, a lot of people aspire to that. They leverage the likes of the search engine DuckDuckGo in the hope of reverting to being just a blip of anonymous bits lost amid mighty currents of data.

Image from Wikipedia

Well, unless you’re willing to live off the grid — or get almost obsessively serious about using encryption tools such as PGP  — you’ll just have to dream your grand dreams of obscurity. Even if we somehow rein in the U.S. government, businesses will be doing their own “surveilling” for the foreseeable future.

But look on the bright side. From one perspective, there’s been progress. You’re not just a number these days, you’re a whole vector, or maybe even a matrix — with possible aspirations of becoming a data frame.

No, this isn’t an allusion to the disease vectors that have become such a hot topic during the pandemic. The statheads among you may recognize those classifications as belonging to the statistical programming language R, which vies with Python for the title of best data science language.

In R’s parlance, a vector is “a single entity consisting of a collection of things.” I love the sheer all-encompassing vagueness of that definition. After all, it could apply to me or you, our dogs or cats, or even our smartphones.

But, in R, a vector tends to be a grouping of numbers or other characters that can, if needed, be acted on en masse by the program. It’s a mighty handy tool. With just a couple of keystrokes, you can take one enormous string of numbers and work on them all simultaneously (for example, by multiplying them all by another string of numbers, plugging them all into the same formula or turning them into a table). It’s just easier to breathe life into the data this way. It’s what Mickey Mouse would have brandished if he were a statistician’s rather than a sorcerer’s apprentice in Fantasia.
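To make the "act on them all at once" idea concrete, here is a minimal sketch. It is in Python rather than R and uses plain lists, so the helper name `times` and the sample numbers are my own inventions for illustration. In R itself, multiplying one numeric vector by another with `*` works element by element with no helper needed.

```python
# Invented sample data: four heights in inches and four arbitrary multipliers.
heights_in = [62, 70, 66, 74]
multipliers = [0.5, 1.0, 1.5, 2.0]

def times(xs, ys):
    """Multiply two equal-length lists element by element."""
    if len(xs) != len(ys):
        raise ValueError("vectors must be the same length")
    return [x * y for x, y in zip(xs, ys)]

# Plug every value into the same formula (inches to centimeters).
heights_cm = [h * 2.54 for h in heights_in]

# Multiply one string of numbers by another string of numbers.
products = times(heights_in, multipliers)
```

The point is less the syntax than the habit of mind: you describe the operation once and it applies to the whole collection.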

Now imagine yourself as a vector or, at least, as being represented by a vector. Your age, height, weight, cholesterol numbers and recent blood work all become a vector, one your doctor can peruse and analyze with interest. Meanwhile, your purchasing habits, credit rating, income estimates, level of education and other factors are another vector that retail and financial organizations want to tap into. To those vectors could be added many more until they become one super-sized vector with your name on it.

Now, glom your vectors together with millions of other people’s vectors, and you’ve got one huge, honking, semi-cohesive collection of potentially valuable information. With it, you and others can, like Pinky and the Brain, take over the world! Or at least sell a lot more toothpaste and trucks.
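Here is a rough sketch of that glomming, in Python with made-up people and made-up fields. Each person is a small vector of named values, and stacking them gives a table you can slice by column, which is roughly what R calls a data frame. The `column` helper is hypothetical, not part of any library.

```python
# Each person is a vector of named values (all data here is invented).
people = [
    {"name": "Ann", "age": 34, "income": 52000},
    {"name": "Bob", "age": 51, "income": 61000},
    {"name": "Cy", "age": 27, "income": 38000},
]

def column(rows, field):
    """Pull one field out of every row, giving a column vector."""
    return [row[field] for row in rows]

ages = column(people, "age")
average_income = sum(column(people, "income")) / len(people)
```

Scale those three rows up to a few million and you have the kind of collection that marketers dream about.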

The bottom line is that we have three basic choices in this emerging Age of Vectors:

Ignore It: Most folks will opt for this one, being too busy or bored for the whole “big data” hoopla. Yes, they know folk are collecting tons of data about them, but who cares? As long as it doesn’t mess up their lives in some way (as in identity theft), then this is just a trend they can dismiss, worrying about it on a case-by-case basis when it directly affects their lives.

Fight the Power: If you don’t want to be vectorized —  or if you at least want to limit the degree to which you are — you can try every trick in the book to keep yourself off the radar of the many would-be private and public data-hunters who want to dig through your data-spoor in their quest to track your habits (either as an individual or as part of a larger herd).

Use the Vector, Luke: Some will gladly try to harness the power of the vector, both professionally and personally. They’ll try to squeeze every ounce of utility out of recommendation engines, work assiduously to enhance their social media rankings, and try to leverage every data collection/presentation service out there to boost their credit ratings, get offered better jobs, or win hearts (or other stuff) on dating sites. They will certainly wield vectors at work for the purpose of predictive analytics. They may even turn the vector scalpel inward with the goal of “hacking themselves” into better people, like the Quantified Selfers who want to gain “self knowledge through numbers.”

That’s not to say that we can’t pick and choose some aspects of each of these three basic strategies. For instance, I’m just not cut out for the quantified-self game, being just too data-apathetic (let’s say a 7 on a scale of 10) to quantify my life. But, when it comes to analyzing other stuff, from labor data to survey findings to insects in my backyard, I’m all in, willing and ready to use the Force of the Vector. Now, I just have to figure out where I misplaced my statistical light saber…

Featured image from IkamusumeFan - Plot SVG using text editor.

On Abnormal Distributions, Pseudostatistics and Modern Management Fads

Note: I originally published this nearly 10 years ago in a previous incarnation of The Reticulum - mrv 

Dr. Frankenstein: “Would you mind telling me whose brain I did put in?”
Igor: “And you won’t be angry?”
Dr. F: “I will NOT-be-angry.”
Igor: “Abby…someone.”
Dr. F: “Abby Someone..?”

Igor: [Nods with an enthusiastic positive manner while looking up as if he is recollecting]
Dr. F: “Abby Who?”
Igor: “Abby Normal”
Dr. F: “Abby Normal?”
Igor: “I’m almost sure that was the name.”
Dr. F: “Are you saying that I put an abnormal brain into a seven and a half foot long, 54 inch wide… GORILLA!? IS THAT WHAT YOU’RE TELLING ME!?”
— Dialogue from Young Frankenstein

As the movie Young Frankenstein demonstrates so hilariously, abnormal stuff just happens. We need to get better at dealing with it.

A case in point occurred when we were analyzing a large dataset for a survey project. The findings were rich and interesting, but there was one small hitch: the responses we got for one of the more important questions were rather skewed in one direction. In other words, the data was distributed in a way that looked nothing like the conventional bell curve, or Gaussian distribution, that represents normality in statistics.

This kind of thing can give researchers a touch of heartburn. After all, inferential statistical theory usually boils down to deviations from normal bell-curve distributions. It’s all related to the so-called Central Limit Theorem, which states that if you take enough data points, the averages of your samples (e.g., of snowflake sizes, people’s heights or light bulb lifetimes) will wind up distributed in something pretty close to a bell-curve shape, even when the underlying data is not.
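One way to see the theorem at work is a quick simulation: draw from a lopsided distribution, average the draws in batches, and check that the batch averages bunch up symmetrically even though the raw draws do not. What follows is a sketch under my own assumptions (an exponential distribution, batches of 50, and a crude symmetry measure I’ve named `mean_median_gap`), not a formal test of normality.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Raw data: 100,000 draws from a right-skewed (exponential) distribution.
raw = [random.expovariate(1.0) for _ in range(100_000)]

# Sample means: average the draws in batches of 50.
batch = 50
means = [statistics.fmean(raw[i:i + batch]) for i in range(0, len(raw), batch)]

def mean_median_gap(xs):
    """Gap between mean and median, in units of standard deviation.

    For a symmetric, bell-shaped sample this sits near zero.
    """
    return (statistics.fmean(xs) - statistics.median(xs)) / statistics.stdev(xs)

raw_gap = mean_median_gap(raw)      # noticeably positive: the raw data is skewed
means_gap = mean_median_gap(means)  # far closer to zero: the averages look symmetric
```

The raw draws stay as skewed as ever. It is only their averages that settle into the bell shape, which is worth keeping in mind before assuming any particular pile of data must be normal.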

That normal shape is handy dandy because the mean (aka, average) of all the data is equivalent to the median (aka, midpoint) and the mode (aka, number that appears most often). If your data looks like it has a normal distribution, then standard deviations are a piece of cake and it’s easier to analyze using statistical techniques such as conventional regressions.

Visual representation of the Empirical (68-95-99.7) Rule based on the normal distribution, by Dan Kernler
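For the curious, those three percentages can be recovered in a few lines. This is a minimal sketch using `NormalDist` from Python’s standard `statistics` module. The helper name `share_within` is my own.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

# Shares for 1, 2 and 3 standard deviations, rounded to four places.
shares = {k: round(share_within(k), 4) for k in (1, 2, 3)}
```

The results land at roughly 68, 95 and 99.7 percent, which is where the rule gets its name. None of it holds, of course, if the data isn’t normal to begin with.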

So, we had non-normal, or what I’ll call abnormal, distributions in one data set. It wasn’t really a serious problem, of course. There are lots of ways of coping. Maybe you can’t run a t-test but you can conduct a Mann-Whitney test. Maybe an ANOVA is no good, but you can drum up a Kruskal-Wallis test. A conventional correlation may not work but a Spearman’s correlation just might. You get the idea (for more on this, see “Dealing with Non-normal Data”).
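To give a flavor of why these alternatives cope better, here is a bare-bones sketch of Spearman’s correlation, which is just a conventional (Pearson) correlation run on the ranks of the data rather than the raw values. The data is invented, the function names are mine, and the sketch computes only the coefficient, not a p-value. For real work you would reach for a proper statistics package.

```python
def ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(xs, ys):
    """Conventional correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman's correlation: Pearson's correlation applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Invented, heavily skewed data: y always rises with x, but not in a straight line.
x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 4, 8, 16, 500]

r_pearson = pearson(x, y)    # dragged down by the one enormous value
r_spearman = spearman(x, y)  # sees only the ordering, which is perfectly consistent
```

Because ranks ignore how far apart the values are, one wild outlier can’t dominate the result. That’s the basic trick behind most of the rank-based tests mentioned above.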

Over the years, statisticians have come up with quite a few methods for coping with abnormal data because, well, the world isn’t nearly as normal as we normally assume. That fact should be remembered not only by statisticians but by everyone who has been, consciously or not, sucked into the world of what could be called normal-distribution pseudostatistics.

One example of such pseudostatistics is so-called forced or stacked ranking, which is when companies adopt employee performance evaluation systems that require set percentages of employees to be ranked in specific categories. It’s controversial, in part because it can force managers to give unrealistically low evaluations to members of all-around strong teams.

Aside from the fact it can be a lousy system, it bugs me because it’s inspired by, if not based on, the notion that people, even pretty small and non-arbitrary selections of them, fall into normal distributions of talent and performance. In my book, that’s a dangerous form of pseudostatistics. The world is just too abby-normal, as Igor might say, to bet the professional lives of employees on such a shady notion.

There are plenty of other examples of pseudostatistics biting us in our bimodal rumps. The bell-curve meme messes with our heads all the time. For example, we are conditioned into wondering and worrying if, in any given area of our lives, we are, like the children of Lake Wobegon, above average. Or maybe far above average, at the 95th percentile? The 99th?

And it’s not just ourselves we place somewhere along the bell curves of our imaginations. Men start assigning numbers to women walking by on the streets based on some creepy central limit theorem of beauty. Parents start worrying to which side of some infernal bell curve their kids’ grammar school test scores fall.

I could go on, coming up with hundreds of data points along this warped line of reasoning. And, I fear, so could you. That’s because we are victims as well as beneficiaries of our powerful statistical paradigms. And these paradigms will only grow more powerful in our increasingly digitized, quantified, big-data world that encourages us to view everyone, including ourselves, as abstracted volumes of variables and vectors. So, amid the measurement mania, we should strive to remember that we are all, in the end, an abnormal sample of one. Vive la différence!

PS: Lately, there’s been another stats-related meme focusing on the idea that employee performance follows a Paretian (aka, Pareto or Power-Law) distribution rather than a normal distribution. Therefore, in theory, a sliver of the employee population is able to produce the majority of positive impact in an organization. The Pareto Principle has been around since management guru Joseph Juran coined the phrase, but, from what I can tell, the notion that this can legitimately explain employee skill and performance levels stems largely from a 2012 article in Personnel Psychology called “The Best and the Rest: Revisiting the Norm of Normality of Individual Performance.”
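To see how different the two pictures are, here is a back-of-the-envelope sketch comparing the share of total output produced by the top 20% of performers under each distribution. The numbers are my own assumptions, not figures from the article: a Pareto shape chosen to give the classic 80/20 split, and a normal distribution with a mean of 100 and a standard deviation of 15. The function names are mine too.

```python
import math
from statistics import NormalDist

def pareto_top_share(p, alpha):
    """Share of total output from the top fraction p under a Pareto
    distribution with shape alpha (alpha must exceed 1 for a finite mean)."""
    return p ** (1 - 1 / alpha)

def normal_top_share(p, mean, sd):
    """Share of total output from the top fraction p when output is
    normally distributed with the given mean and standard deviation."""
    standard = NormalDist(0, 1)
    z = standard.inv_cdf(1 - p)  # cutoff for the top fraction, in standard units
    return p + (sd / mean) * standard.pdf(z)

alpha = math.log(5) / math.log(4)  # assumed shape: the one that yields 80/20
pareto_share = pareto_top_share(0.2, alpha)
normal_share = normal_top_share(0.2, mean=100, sd=15)
```

Under these assumptions the top fifth produces about 80% of the output in the Pareto world and only about a quarter in the normal one. Which world a given workplace actually resembles is, of course, the open question.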

In it, the authors looked at factors such as citation reports in academic journals and awards given to entertainers. I’m sure the article is a legitimate attempt to shed light on the elusive subject of performance within professional fields. But the findings strike me as far from conclusive. The authors themselves, for example, allude to the Matthew effect. So, are the patterns to which they allude truly about elite performance, or are they more about network effects and preferential attachment?

Another way of stating this is, “Are perceived elite performers actually much better than others in their fields or are they just better connected and able to leverage a more polished public image?” These things are often tough to tease apart. Perhaps time will tell. In the meantime, I recommend maintaining a modicum of skepticism in the face of sweeping sociological assertions linked to simple statistical equations. Human behavior is tricky stuff and seldom boils down to single lines, however curvy and lovely, of mathematical abstraction.

Featured image: Illustration by Theodor von Holst from the frontispiece of the 1831 edition