A decade ago, I thought I understood big data. I had worked in information technology for more than a decade and had run a department that handled docs for some of Boston’s more infamous litigation. I remember having to order new drives and storage appliances to handle the gigabytes and gigabytes of documents and emails that our hapless associates had to search and read through. That was a lot of data… or so I thought.
Fast-forward seven years and one career change, and I found myself at Amazon running SQL queries against its data warehouse. The scope of that database honestly blew my mind; I had to invent tricks just to pull down a week of summary data without the query choking or the results overflowing Excel. I thought I’d understood what big data was, but it turns out that I had no clue.
Big data has become a buzzword so prevalent that it’s practically meaningless. At a party last week, I heard someone say, “Every company is a big data company now.” When I asked him to clarify, he said that every company buys and sells big data. While I certainly agree that all companies can use big data or applications based on big data, not all companies base their business models on it. I’ve tripped over this kind of misconception throughout my career and have even shared some of these misconceptions myself. Now that I work at a big data company, I know better.
Here are six of the biggest mistakes I see execs make when they talk about big data:
1. All data is big data.
According to Gartner, big data must be high-volume, high-velocity and/or high-variety data. This means that if your data can fit in an Excel file, you’re not dealing with big data. If you’re only handling a dataset that measures in the gigabytes and your PC can handle it, you’re not dealing with big data. Maybe you’re wrestling with many gigabytes of emails and can’t figure out how to handle them, but that still doesn’t make it big data.
2. Big data solves every problem.
I’ve run into a few execs who believe that big data fixes everything. Many of them grasp at big data analysis to solve problems rather than using common sense. I once sat in a room of executives who were trying to figure out why our week-over-week website visit numbers and sales had dipped precipitously during a week in April, but that same week the year before hadn’t experienced the same decrease. They asked for analysis after analysis until someone said, “Well, we see a decrease at Easter every year, and Easter was in March last year.” Big data and analysis didn’t help us figure that out; common sense and a calendar did.
3. Big data is meaningless.
The flip side of the “big data solves everything” misconception is this one: that big data doesn’t matter. I find this opinion to be more understandable, because the definition of big data indicates that it’s hard to process and understand. If you can’t pull insights out of big data or use it to power your systems, it is, indeed, meaningless. I suspect execs in this camp have learned about big data but have never learned anything from it.
To make big data less meaningless, you need to be able to process and use it, which big data companies make easier. They do this by gathering the data, cleaning it up, organizing it, and outputting it in a way that data scientists or other systems can process. Once a data scientist pulls stories out of the data or your systems use data to execute business operations like supply chains, execs will start seeing value in big data.
4. Big data is easy.
Many things about big data sound easy in the abstract, like gathering the information and pricing for every single product in the world or tracking every single visitor to every single website. Because it’s easy to conceptualize a large dataset, many executives believe that gathering and manipulating that dataset should be just as easy.
Unfortunately, this is a common misconception. Let’s look at getting the information and pricing for every single product in the world (disclaimer: that’s what my company does), for example. For a single product, like one pair of my shoes, we’d need to gather the following data:
- Heel height
- Stores that sell it
- Prices at each of those stores
- Prices at each of those stores over time
- Whether it’s in stock each time we look at the price
Here’s the math: our database says that 11 different retailers carry this shoe, which comes in one color and one width. Let’s assume that we’re gathering the price and in-stock data at each store weekly and the shoe stays on the market for one year. That’s 11 retailers × 52 weeks = 572 records for this shoe. If we want to track pricing and in-stock information for all 16 women’s sizes (4½–12), the number jumps to 9,152. And this is for a single pair of shoes -- gathering data for every pair in my shoe closet would create more data points than I’d ever admit.
Adding complexity, we gather prices more often than once a week during high-demand periods and for volatile sites. Daily price and in-stock checks would mean 4,015 data points a year for a single size of a single shoe. Add in the descriptive product information and the possibility that each size may carry a different price on sites like Amazon, and the data for one pair of shoes expands rapidly. Now imagine multiplying that by billions of products and putting it into your spreadsheet. Big data’s scale challenges traditional systems of gathering and analysis.
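The counting above can be sketched in a few lines, using the numbers straight from the text:

```python
retailers = 11          # stores carrying this one shoe
weeks_per_year = 52
sizes = 16              # women's sizes 4.5 through 12, in half steps

# Weekly price + in-stock checks for one size over one year:
weekly_records = retailers * weeks_per_year        # 572

# Track all 16 sizes instead of just one:
all_sizes_records = weekly_records * sizes         # 9,152

# Switch to daily checks for a single size:
days_per_year = 365
daily_records = retailers * days_per_year          # 4,015

print(weekly_records, all_sizes_records, daily_records)
```

Three multiplications is all it takes to go from "one shoe" to thousands of records a year.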
5. Imperfect big data is useless.
This is the mistake that drives me craziest, because perfection at scale is basically impossible. Let’s say we hold one billion products, with 520 data points each, to the coveted “five-nines” standard (99.999 percent) of accuracy that IT departments strive for. That still leaves 5.2 million incorrect data points in the dataset.
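That back-of-the-envelope error count is easy to verify:

```python
products = 1_000_000_000        # one billion products
points_per_product = 520
accuracy = 0.99999              # the "five nines" standard

total_points = products * points_per_product   # 520 billion data points
bad_points = total_points * (1 - accuracy)     # ~5.2 million still wrong

print(f"{bad_points:,.0f} incorrect data points")
```

Even at a standard most systems never reach, millions of data points are still wrong.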
Big data rarely achieves even that level of perfection, for many reasons. Many big data sources are far from perfect: the websites that my company crawls as one of our big data sources can easily have typos in product names. Big data also requires machine learning and algorithmic processing to structure and organize it, and in the world of product data, those algorithms can easily mis-categorize products based on titles or names. For example, would an algorithm file a Marcy Playground album under playground equipment or under music?
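To make the mis-categorization concrete, here is a toy keyword-based categorizer (entirely hypothetical, not my company’s actual pipeline) that shelves a Marcy Playground album next to the swing sets:

```python
# A naive categorizer: pick the first category whose keyword
# appears in the product title (hypothetical categories/keywords).
CATEGORY_KEYWORDS = {
    "playground equipment": ["playground", "swing", "slide"],
    "music": ["album", "vinyl", "cd"],
}

def categorize(title: str) -> str:
    lowered = title.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in lowered for kw in keywords):
            return category
    return "unknown"

# "playground" matches before "album" is ever checked:
print(categorize("Marcy Playground - Marcy Playground (1997 album)"))
```

Real categorization models are far more sophisticated, but the failure mode is the same: titles carry ambiguous signals, and at billions of products some of them will land on the wrong shelf.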
Imperfection doesn’t mean uselessness, however. A competent data analyst can remove outliers and pull vital insights out of big data even when imperfections abound. Developers can add filters so that fewer mistakes slip into your systems, and can train algorithms on huge datasets so that data quality improves over time. One of the biggest benefits of big data is that its sheer volume compensates for the occasional imperfection, yielding better insights.
6. Only big companies need big data.
Small marketing companies need website traffic and keyword search numbers. Small social shopping companies need links to as many products as possible from the big retailers with affiliate programs. Small on-demand delivery services need reliable location data. This is only a small subset of the endless list of small companies that need big data.
Big companies may produce more of their own big data, but nearly every company in our modern economy uses big data or applications built on it. This means that all companies can get the benefit of access to the insights and information these huge datasets provide without having to build and manage the infrastructure required to create and analyze big data.
There’s no escaping big data in business these days, no matter the size of your company. Hopefully, this clears up any misconceptions you might have had -- after all, I had quite a few before living in the big data world. If executives better understand the complexity, pitfalls, and power of big data, they’ll run better businesses, make better decisions and make fewer stupid comments at parties.