ETL: The Most Important Acronym You’ve Never Heard Of

“Data-Driven Thinking” is written by members of the media community and contains fresh ideas on the digital revolution in media.

Today’s column is written by Mike Driscoll, CEO and founder of Metamarkets.

Data is the fuel and the exhaust of programmatic advertising. It informs every transaction, and every transaction generates more of it. As impression volumes rise into the trillions across all manner of devices, the focus of many ad tech engineering teams isn’t on ethereal machine learning algorithms, but on something far less glamorous.

The process is called ETL — the critical, painstaking work of cleansing and consolidating disparate datasets. As the worlds of marketing and enterprise software collide, ETL could be the most important acronym you’ve never heard of.

ETL stands for extract, transform and load — and it’s a truism among data scientists that it takes up about 80% of our time, leaving just 20% for analysis. Having built big data platforms in pharma, banking and now in digital media, I believe this ratio is near universal.
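To make the three steps concrete, here is a minimal sketch in Python. The records, column names and the `bids` table are hypothetical, invented purely for illustration; real pipelines deal with far larger and messier inputs.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: note the stray whitespace, mixed case and a missing country.
RAW = """country,price_usd
 us ,1.50
GB,2.25
,0.75
"""

def extract(text):
    """Extract: read raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize country codes, cast prices, drop incomplete rows."""
    clean = []
    for row in rows:
        country = row["country"].strip().upper()
        if not country:
            continue  # a required field is missing, so skip the row
        clean.append((country, float(row["price_usd"])))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS bids (country TEXT, price_usd REAL)")
    conn.executemany("INSERT INTO bids VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
rows = conn.execute("SELECT country, price_usd FROM bids ORDER BY country").fetchall()
print(rows)
```

Even in this toy version, most of the code lives in the transform step, which is where the cleansing and consolidating happens.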

Underinvestment in ETL, and misunderstanding of it, are responsible for a huge amount of organizational pain and inefficiency. They are why data is so often delayed, why so many executives are unhappy with the quality of reporting and why more than 50% of corporate business intelligence initiatives fail.

ETL is hard because data is messy. There is no such thing as clean data, and even the most common attributes have a dizzying array of acceptable formats: “Sat Jan 22 10:37:13 PST,” “2014-01-22T18:37:13.0+0000” and “1323599850” are all valid ways to record a timestamp. Add to this a growing variety of data, such as geocoordinates, buyer names, seller URLs, device IDs, campaign strings, country codes and currencies. Each new source adds a layer of bricks to our collective tower of Babel.
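As a rough illustration of what taming just one attribute involves, the sketch below converts the three timestamp styles above into a single UTC representation. It rests on several assumptions: “PST” is treated as a fixed UTC-8 offset, the first format carries no year so one has to be supplied by the caller, and the ISO-style string is written with a colon between hours and minutes. The function name `to_utc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "PST" is a fixed UTC-8 offset; production code would use a time zone database.
PST = timezone(timedelta(hours=-8))

def to_utc(raw: str, default_year: int = 2014) -> datetime:
    """Parse one of three illustrative timestamp formats into an aware UTC datetime."""
    raw = raw.strip()
    if raw.isdigit():
        # Unix epoch seconds, e.g. "1323599850"
        return datetime.fromtimestamp(int(raw), tz=timezone.utc)
    if raw.endswith(" PST"):
        # e.g. "Sat Jan 22 10:37:13 PST": no year in the string, so prepend the assumed one
        naive = datetime.strptime(f"{default_year} {raw[:-4]}", "%Y %a %b %d %H:%M:%S")
        return naive.replace(tzinfo=PST).astimezone(timezone.utc)
    # ISO 8601-style with fractional seconds and a numeric offset
    return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f%z").astimezone(timezone.utc)

for text in ("Sat Jan 22 10:37:13 PST", "2014-01-22T18:37:13.0+0000", "1323599850"):
    print(text, "->", to_utc(text).isoformat())
```

Each branch encodes a guess about what the source system meant, and each guess is a place where a pipeline can silently go wrong. That is the cost multiplied across every attribute and every source.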

It’s no wonder that an agency CIO recently confessed to me that he’d spent tens of millions of dollars a year on the reliable, repeatable transformation of data. As someone who has spent much of my career wrestling ETL’s demons, I offer five ways to keep them at bay:

1. Stay Close To The Source

Journalists know that when it comes to getting the facts, it’s best to go directly to the primary source and it’s best to break news first. The same is true for ETL. The closer you are to the data source, the fewer the transformations and steps, and the lower the likelihood that something will break. The best ETL pipelines resemble tributaries feeding rivers, not bridges connecting islands. Also, the closer you are to the source, the faster you can optimize your approach, which in this space can pay huge dividends.

2. Avoid Processed Data

Just like food, data is best when it’s minimally processed. In order to handle huge quantities of data, one common approach for ETL pipelines is to downsample it indiscriminately. Many programmatic buyers will examine, for example, a 1% feed of bid requests coming off of a particular marketplace.

In an era when bandwidth is cheap and computing resources are vast, sampling data is a throwback to the punchcard era — and worse, it waters down insights. Audience metrics like frequency and reach can become impossible to recover once a data stream has been put through the shredder. Sampling is why audience segments can resemble sausage — no one knows what’s inside.
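A small simulation shows why reach and frequency do not survive indiscriminate sampling. Everything here is synthetic and assumed for illustration: 10,000 hypothetical users, each seeing between 1 and 10 impressions, and a 1% feed drawn impression by impression.

```python
import random
from collections import Counter

def reach_and_frequency(user_ids):
    """Return (unique users reached, average impressions per reached user)."""
    counts = Counter(user_ids)
    reach = len(counts)
    frequency = sum(counts.values()) / reach if reach else 0.0
    return reach, frequency

# Synthetic impression log (an assumption): user i sees (i % 10) + 1 impressions.
impressions = [uid for uid in range(10_000) for _ in range(uid % 10 + 1)]

rng = random.Random(42)  # fixed seed so the run is repeatable
sample = [uid for uid in impressions if rng.random() < 0.01]  # indiscriminate 1% feed

true_reach, true_freq = reach_and_frequency(impressions)
sample_reach, sample_freq = reach_and_frequency(sample)
scaled_reach = sample_reach * 100  # naive scale-up of the sampled figure

print(f"full log: reach={true_reach}, frequency={true_freq:.2f}")
print(f"1% feed:  reach={sample_reach}, frequency={sample_freq:.2f}, scaled reach={scaled_reach}")
```

In the sample, almost every user appears only once, so frequency collapses toward 1 and multiplying reach by 100 badly overstates the real audience. Impression counts scale back up cleanly; per-user metrics do not.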

3. Embrace (And Enforce) Standards

In the early days of the railroads, as many as a dozen distinct track gauges, with the distance between the inside of the rails ranging from 2 to nearly 10 feet, had proliferated across North America, Europe, Africa and Asia. Because trains and carriages built for one gauge could not run on another, continuous transport across regions was difficult, and a standard width was eventually adopted at the suggestion of a British civil engineer named George Stephenson. Today, approximately 60% of the world’s lines use this gauge.

Our programmatic vertical has its own George Stephensons, CTOs and chief scientists like Jim Butler and Neal Richter, whom you can find late at night debating specifications for OpenRTB protocols on developer lists. Just as with the railroads two centuries before, embracing and enforcing standards will catalyze faster growth in programmatic advertising through increased interoperability.

4. Put Business Questions First (Don’t Let Data Wag The Dog)

Too many organizations, upon recognizing that they’ve got data challenges, decide to undertake a grand data-unification project. Noble in their intentions, cheered by vendors and engineers alike, these efforts seek to funnel every source of data in the organization into a massive central platform. The implicit assumption is that “once we have all the data, we can answer any question we’d like.” This approach is doomed to fail because there is always more data than one realizes, and the choices around what data to collect and how to structure it can only be made by putting business questions first.

ETL is hard, and building pipelines laborious, so avoid building bridges to places that no business inquiry will ever visit.

5. Avoid ETL Where You Can

While for some organizational processes there’s no avoiding working with the nuts and bolts of data, for others it may be possible to get out of the data handling business entirely. Take, for example, the handling of email or digital documents: For years, IT departments suffered through the management and occasional migration of these assets. Today, however, cloud offerings, such as those from Google and Box, make this someone else’s problem, freeing up our businesses to specialize in what we do best.

Follow Mike Driscoll (@medriscoll), Metamarkets (@metamarkets) and AdExchanger (@adexchanger) on Twitter.
