Executive Edge
Verisk Innovative
Analytics President
Marty Ellingsworth
on the future, big data
and bigger analytics

• Eighteen things nobody tells you
about solo practice

• Certification: What it means
for employers, practitioners

• Analytics-driven culture:
Why it’s a corporate

• Predictive analytics in the cloud

• Analytics & health management

• Dealing with missing values in data

Special Supplement:
CAP Candidate


Big dreams, small data

Like everyone else involved in the
analytics space, we’ve been yapping end-
lessly about “big data” in this column. You
all know the story – unfathomable amounts
of data coming in from multiple sources at
incredible speed have analysts everywhere
scrambling to make sense of it all. Let’s face
it, big data is the elephant in the room in any
discussion of analytics, and the elephant is
only going to get bigger (think hybrid data, in-
cluding video, images, sound, text, etc. from
countless sources and sensors).

But wait, there’s more; there’s a “small”
angle to the “big data” story. Even in the Big
Data Era, many companies do not have the
data they need to make data-based deci-
sions. A start-up, for example, almost cer-
tainly does not have the historic data that
an established firm has collected. Even
well-established companies probably lack
the data they need when considering intro-
ducing a new product or service or entering
a new market.

With that in mind, Analytics magazine
will launch a new column by Brian Lewis in
the March/April issue that will address the
problem of insufficient data and how to
overcome it. The name of the column: “Big Data
Dreams, Small Data Reality.” Chew on that
concept for a minute.

Lewis, chief data scientist and co-found-
er of Fractal Sciences, provides more de-
tails in an introductory column in this issue.

Of course, big data remains the big
fish in the analytics pond, so we’ll continue
to cover it and all of its ramifications. For
example, in this issue’s Executive Edge
column, Marty Ellingsworth, president of
Verisk Innovative Analytics, discusses the
“promise of big data and bigger analytics”
that “will drive the future” as the corporate
world shifts from a company-centric to a
customer-centric culture.

Meanwhile, INFORMS, publisher of
Analytics magazine and the world’s leading
organization for high-end analytics, will
present its inaugural INFORMS Conference
on Big Data in San Jose, Calif., June
22-24. The conference will focus on the
business of big data and making the journey
from data-rich to decision-smart.

The issue also includes a couple of “ca-
reer-builder” feature articles that should pique
the interest of any analytics professional
looking to get an edge in a competitive en-
vironment. Veteran analyst Doug Samuelson
outlines some of the consulting lessons he’s
learned the hard way, while Polly Mitchell-
Guthrie and Scott Nestler give an update on
INFORMS’ Certified Analytics Professional
program and how it can help employers and
clients of analytics professionals, as well as
analytics professionals themselves.


decide whether a sufficient number of re-
cords remain for the analysis to produce
meaningful results. The following example
illustrates how the true distribution can be-
come distorted when the source of missing
values is not identified properly.


(Or missing values in the “age variable”
in the customer database of a telephone

My Aunt Susanne purchased her
phone in the mid-1960s. Her date of birth
was not collected at that time, as the
philosophy of “know your customers” and
the need for customer data were nowhere
near as vital then as they are today. Things
changed in the 1990s with the deregulation
of the telecommunications market;
suddenly, the analysis of customer
behavior became important. Since then, it
has become mandatory for customers to
provide their date of birth on a new contract
or with a contract change. My aunt,
however, never changed her contract type
or answered any customer questionnaire.

Thus, the field “date of birth” is missing in
the customer database of her phone pro-
vider, and we can assume she is not the
only customer with a missing value.

If an analyst now looks at the distri-
bution of variable “age” in this customer
database, he might get a histogram as
shown in Figure 1. Additionally, he will
see that he has 9.1 percent missing val-
ues. The question is how to treat these
missing values.
• Shall the mean be used as the imputation
value?
• Shall different imputation values be
sampled from the actual distribution?
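
The two options in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ages` list is invented data, and the helper names `impute_mean` and `impute_sampled` are hypothetical, not taken from any tool mentioned in the article. The sketch shows that mean imputation shrinks the spread of the variable, while sampling from the observed values keeps every imputed value inside the observed distribution. Both options implicitly treat the missing values as random.

```python
import random
import statistics

def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def impute_sampled(values, seed=0):
    """Replace each missing value (None) with a random draw from the observed values."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

# Invented example data: customer ages, None marks a missing value.
ages = [23, 31, None, 45, 38, None, 52, 29, None, 41]

observed = [v for v in ages if v is not None]
mean_filled = impute_mean(ages)
sampled_filled = impute_sampled(ages)

# Compare the spread before and after imputation.
print("std of observed values:     ", statistics.pstdev(observed))
print("std after mean imputation:  ", statistics.pstdev(mean_filled))
print("std after sampled imputation:", statistics.pstdev(sampled_filled))
```
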
In our case, we can assume that the
true age values for Aunt Susanne and her
friends are not distributed over the whole
range of values. After a certain year it was
mandatory to provide the date of birth with
new contracts. So the missing values will
mostly occur for a certain age segment
(the older customers) and probably also for
a certain behavior segment (those who did
not change their contract type).
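
One simple way to check whether missing values cluster in a segment, as described above, is to compare the share of missing values across groups. The following is a minimal sketch, assuming invented customer records; the field names `contract_start`, `changed_contract` and `age`, and the helper `missing_rate_by_group`, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def missing_rate_by_group(records, group_key, field):
    """Return the share of records with a missing `field` for each value of `group_key`."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        if rec[field] is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}

# Invented example records: age is missing only for long-standing customers
# who never changed their contract.
customers = [
    {"contract_start": 1965, "changed_contract": False, "age": None},
    {"contract_start": 1972, "changed_contract": False, "age": None},
    {"contract_start": 1980, "changed_contract": True, "age": 61},
    {"contract_start": 1998, "changed_contract": False, "age": 34},
    {"contract_start": 2004, "changed_contract": True, "age": 27},
    {"contract_start": 1968, "changed_contract": True, "age": 70},
]

rates = missing_rate_by_group(customers, "changed_contract", "age")
for group, rate in rates.items():
    print(f"changed_contract={group}: {rate:.0%} missing")
```

A large gap between the groups suggests that the missing values are not random, so an imputation rule based on the overall distribution would likely be misleading.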

In the Figure 2 histogram, the true
distribution of the unknown age values is shown
in red. Treating the missing values as random
would be a wrong assumption, because
there is a systematic pattern behind them.
Business and process knowledge is needed
to assess such a situation correctly.
This know-how is also important for
formulating an adequate imputation rule, as
the imputation values should be from the
Figure 1: Distribution of variable “age”
in a customer database.

