ETL: The Most Important Acronym You’ve Never Heard Of

“Data-Driven Thinking” is written by members of the media community and contains fresh ideas on the digital revolution in media.

Today’s column is written by Mike Driscoll, CEO and founder of Metamarkets.

Data is the fuel and the exhaust of programmatic advertising. It informs every transaction, and every transaction generates more of it. As impression volumes rise into the trillions across all manner of devices, the focus of many ad tech engineering teams isn’t on ethereal machine learning algorithms, but something far less glamorous.

The process is called ETL — the critical, painstaking work of cleansing and consolidating disparate datasets. As the worlds of marketing and enterprise software collide, ETL could be the most important acronym you’ve never heard of.

ETL stands for extract, transform and load — and it’s a truism among data scientists that it takes up about 80% of our time, leaving just 20% for analysis. Having built big data platforms in pharma, banking and now in digital media, I believe this ratio is near universal.

Underinvestment in ETL, and misunderstanding of it, are responsible for a huge amount of organizational pain and inefficiency. They are why data is so often delayed, why so many executives are unhappy with the quality of reporting and why more than 50% of corporate business intelligence initiatives fail.

ETL is hard because data is messy. There is no such thing as clean data, and even the most common attributes have a dizzying array of acceptable formats: “Wed Jan 22 10:37:13 PST,” “2014-01-22T18:37:13.0+0000” and “1390415833” all denote the same time. Add to this a growing variety of data types, such as geocoordinates, buyer names, seller URLs, device IDs, campaign strings, country codes and currencies. Each new source adds a layer of bricks to our collective tower of Babel.
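To make the timestamp problem concrete, here is a minimal Python sketch that normalizes three common shapes of timestamp to UTC. The sample strings are illustrative and chosen to agree on one instant (18:37:13 UTC on January 22, 2014). The `to_utc` helper, the `TZ_OFFSETS` table and the `default_year` parameter are assumptions made for this example, not part of any particular pipeline.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lookup table: timezone abbreviations are ambiguous and are
# not reliably parsed by strptime, so this sketch maps them by hand.
TZ_OFFSETS = {"PST": timedelta(hours=-8), "UTC": timedelta(0)}


def to_utc(raw, default_year=2014):
    """Normalize one of three illustrative timestamp shapes to UTC."""
    raw = raw.strip()

    # Shape 1: Unix epoch seconds, e.g. "1390415833".
    if raw.isdigit():
        return datetime.fromtimestamp(int(raw), tz=timezone.utc)

    # Shape 2: ISO-8601-like string with fractional seconds and an offset.
    try:
        parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f%z")
        return parsed.astimezone(timezone.utc)
    except ValueError:
        pass

    # Shape 3: "Wed Jan 22 10:37:13 PST" has no year, so one must be
    # assumed. English month names are assumed as well.
    _weekday, month, day, clock, abbr = raw.split()
    naive = datetime.strptime(
        f"{default_year} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
    )
    local = naive.replace(tzinfo=timezone(TZ_OFFSETS[abbr]))
    return local.astimezone(timezone.utc)


samples = [
    "Wed Jan 22 10:37:13 PST",
    "2014-01-22T18:37:13.0+0000",
    "1390415833",
]
normalized = [to_utc(s) for s in samples]
```

Even this toy version has to guess a year and hard-code a timezone table, which hints at why production pipelines spend so much effort on a single field.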

It’s no wonder that an agency CIO recently confessed to me that he’d spent tens of millions of dollars a year on the reliable, repeatable transformation of data. Having spent much of my career wrestling ETL’s demons, I offer five ways to keep them at bay:

1. Stay Close To The Source

Journalists know that when it comes to getting the facts, it’s best to go directly to the primary source and it’s best to break news first. The same is true for ETL. The closer you are to the data source, the fewer the transformations and steps, and the lower the likelihood that something will break. The best ETL pipelines resemble tributaries feeding rivers, not bridges connecting islands. Also, the closer you are to the source, the faster you can optimize your approach, which in this space can pay huge dividends.

2. Avoid Processed Data

Just like food, data is best when it’s minimally processed. In order to handle huge quantities of data, one common approach for ETL pipelines is to downsample it indiscriminately. Many programmatic buyers will examine, for example, a 1% feed of bid requests coming off of a particular marketplace.

In an era when bandwidth is cheap and computing resources are vast, sampling data is a throwback to the punchcard era — and worse, it waters down insights. Audience metrics like frequency and reach can become impossible to recover once a data stream has been put through the shredder. Sampling is why audience segments can resemble sausage — no one knows what’s inside.
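A small simulation can illustrate why reach and frequency do not survive sampling. The sketch below builds a hypothetical impression log (the user counts, the 1-to-20 impressions per user and the seed are all invented for illustration), keeps a random 1% of impressions, and compares the metrics before and after.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical log: 10,000 users, each seeing between 1 and 20 impressions.
impressions = [
    user_id
    for user_id in range(10_000)
    for _ in range(random.randint(1, 20))
]


def reach_and_frequency(log):
    """Return (unique users reached, average impressions per reached user)."""
    per_user = Counter(log)
    reach = len(per_user)
    return reach, len(log) / reach


# Keep roughly 1% of impressions, chosen independently at random.
sample = [user_id for user_id in impressions if random.random() < 0.01]

full_reach, full_freq = reach_and_frequency(impressions)
samp_reach, samp_freq = reach_and_frequency(sample)

# Naive correction: scale sampled reach back up by the sampling rate.
naive_reach_estimate = samp_reach * 100

print(f"full:    reach={full_reach}, frequency={full_freq:.2f}")
print(f"sampled: reach={samp_reach}, frequency={samp_freq:.2f}")
print(f"naive scaled-up reach estimate: {naive_reach_estimate}")
```

In the sample, almost every user who appears at all appears only once, so measured frequency collapses toward 1 and multiplying sampled reach by 100 badly overstates the true audience. Impression counts scale back up cleanly; per-user metrics do not.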

3. Embrace (And Enforce) Standards

In the early days of the railroads, as many as a dozen distinct track gauges, with the distance between the inner edges of the rails ranging from 2 to nearly 10 feet, had proliferated across North America, Europe, Africa and Asia. Because trains and carriages built for one gauge could not run on another, continuous transport across regions was difficult, and a standard width was eventually adopted at the suggestion of a British civil engineer named George Stephenson. Today, approximately 60% of the world’s lines use this gauge.

Our programmatic vertical has its own George Stephensons, CTOs and chief scientists like Jim Butler and Neal Richter, whom you can find late at night debating specifications for OpenRTB protocols on developer lists. Just as with the railroads two centuries before, embracing and enforcing standards will catalyze faster growth in programmatic advertising through increased interoperability.
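In a pipeline, enforcing a standard usually means rejecting or flagging nonconforming records at the point of ingestion rather than patching them downstream. The sketch below is one minimal way to do that in Python. The record layout, the field names and the `violations` helper are hypothetical and are not taken from OpenRTB or any other specification; the checks test only the shape of each value, not membership in an official code list.

```python
import re

# Hypothetical house standard: epoch-second timestamps, two-letter uppercase
# country codes and three-letter uppercase currency codes.
RULES = {
    "timestamp": lambda v: isinstance(v, int) and v >= 0,
    "country": lambda v: isinstance(v, str)
    and re.fullmatch(r"[A-Z]{2}", v) is not None,
    "currency": lambda v: isinstance(v, str)
    and re.fullmatch(r"[A-Z]{3}", v) is not None,
}


def violations(record):
    """List every way a record departs from the house standard."""
    problems = []
    for field, is_valid in RULES.items():
        if field not in record:
            problems.append(f"missing {field}")
        elif not is_valid(record[field]):
            problems.append(f"bad {field}: {record[field]!r}")
    return problems


good = {"timestamp": 1390415833, "country": "US", "currency": "USD"}
bad = {"timestamp": "1390415833", "country": "usa", "currency": "USD"}

print(violations(good))  # an empty list: the record conforms
print(violations(bad))   # flags the string timestamp and lowercase country
```

Reporting every violation at once, instead of stopping at the first, makes it easier to tell a data supplier exactly what needs to change.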

4. Put Business Questions First (Don’t Let Data Wag The Dog)

Too many organizations, upon recognizing that they’ve got data challenges, decide to undertake a grand data-unification project. Noble in their intentions and cheered by vendors and engineers alike, these efforts seek to funnel every source of data in the organization into a massive central platform. The implicit assumption is that “once we have all the data, we can answer any question we’d like.” This approach is doomed to fail because there is always more data than one realizes, and the choices around what data to collect and how to structure it can only be made by putting business questions first.

ETL is hard, and building pipelines is laborious, so avoid building bridges to places that no business inquiry will ever visit.

5. Avoid ETL Where You Can

While for some organizational processes there’s no avoiding working with the nuts and bolts of data, for others it may be possible to get out of the data handling business entirely. Take, for example, the handling of email or digital documents: For years, IT departments suffered through the management and occasional migration of these assets. Today, however, cloud offerings, such as those from Google and Box, make this someone else’s problem, freeing up our businesses to specialize in what we do best.

Follow Mike Driscoll (@medriscoll), Metamarkets (@metamarkets) and AdExchanger (@adexchanger) on Twitter.
