Deepening The Data Lake: How Second-Party Data Increases AI For Enterprises

“Managing the Data” is a new column about customer and audience data strategy written by longtime AdExchanger contributor Chris O’Hara.

I have been hearing a lot about data lakes lately. Progressive marketers and some large enterprise publishers have been breaking out of traditional data warehouses, mostly used to store structured data, and investing in infrastructure so they can store tons of their first-party data and query it for analytics purposes.

“A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed,” according to Amazon Web Services. “While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.”

A few years ago, data lakes were thought to be limited to Hadoop applications (object storage), but the term is now more broadly applied to an environment in which an enterprise can store both structured and unstructured data and have it organized for fast query processing. In the ad tech and mar tech world, this is almost universally about first-party data. For example, a big airline might want to store transactional data from ecommerce alongside beacon pings to understand how often online ticket buyers in its loyalty program use a certain airport lounge.
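The airline example above can be sketched in a few lines. This is a hedged, minimal illustration of the schema-on-read idea, not any particular vendor's stack: structured transactions sit alongside raw beacon pings stored in their native format, and the log is only interpreted when the question is asked. All record shapes, field names and values here are invented for illustration.

```python
import json
from collections import Counter

# Structured ecommerce transactions, as exported from a warehouse table
transactions = [
    {"loyalty_id": "L1", "route": "JFK-LHR", "fare": 820.00},
    {"loyalty_id": "L2", "route": "SFO-ORD", "fare": 310.00},
]

# Raw beacon pings kept in native format (JSON lines, no schema imposed at load)
beacon_log = """
{"loyalty_id": "L1", "venue": "JFK_lounge", "ts": "2017-03-01T08:12:00"}
{"loyalty_id": "L1", "venue": "JFK_lounge", "ts": "2017-04-11T07:55:00"}
{"loyalty_id": "L2", "venue": "ORD_gate_B4", "ts": "2017-03-02T14:03:00"}
""".strip()

# The raw log is parsed only at query time (schema-on-read)
pings = [json.loads(line) for line in beacon_log.splitlines()]
lounge_visits = Counter(
    p["loyalty_id"] for p in pings if "lounge" in p["venue"]
)

# The join: how often do online ticket buyers use the lounge?
report = {
    t["loyalty_id"]: lounge_visits.get(t["loyalty_id"], 0)
    for t in transactions
}
print(report)  # {'L1': 2, 'L2': 0}
```

The point of the flat architecture is that the beacon log needed no predefined table; the "lounge visitor" interpretation was applied at query time.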

However, as we discussed earlier this year, there are many marketers with surprisingly sparse data, like the food marketer who does not get many website visitors or authenticated customers downloading coupons. Today, those marketers face a situation where they want to use data science to do user scoring and modeling but, because they only have enough of their own data to fill a shallow lake, they have trouble justifying the costs of scaling the approach in a way that moves the sales needle.


Figure 1: Marketers with sparse data often do not have enough raw data to create measurable outcomes in audience targeting through modeling. Source: Chris O’Hara.

In the example above, we can think of the marketer’s first-party data – media exposure data, email marketing data, website analytics data, etc. – being the water that fills a data lake. That data is pumped into a data management platform (pictured here as a hydroelectric dam), pumped like electricity through ad tech pipes (demand-side platforms, supply-side platforms and ad servers) and finally delivered to places where it is activated (in the town, where people live).

As the analogy makes apparent, this infrastructure can exist with even a tiny bit of water, but at the end of the cycle not enough electricity will be generated to produce decent outcomes and sustain a data-driven approach to marketing. Put simply, ever-larger amounts of data, in both quality and quantity, are needed to create the potential for better targeting and analytics.

Most marketers today – even those with lots of data – find themselves overly reliant on third-party data to fill in these gaps. However, even when they have the rights to model it in their own environment, there are loads of restrictions on using it for targeting. It is also highly commoditized and can be of questionable provenance. (Is my Ferrari-browsing son really an “auto intender”?) While third-party data can be highly valuable, pouring it into the lake is akin to adding sediment, clouding the water when you try to peer to the bottom for deep insights.

So, how can marketers fill data lakes with large amounts of high-quality data that can be used for modeling? I am starting to see the emergence of peer-to-peer data-sharing agreements that help marketers fill their lakes, deepen their ability to leverage data science and add layers of artificial intelligence through machine learning to their stacks.


Figure 2: Second-party data is simply someone else’s first-party data. When relevant data is added to a data lake, the result is a more robust environment for deeper data-led insights for both targeting and analytics. Source: Chris O’Hara.

In the above example (Figure 2), second-party data deepens the marketer’s data lake, powering the DMP with richer data that can be used for modeling, activation and analytics. Imagine a huge beer company launching a country music promotion for its flagship brand. As a CPG company with relatively sparse first-party data, its traditional approach would be to seek out music fans of a certain location and demographic through third-party sources and apply those third-party segments to a programmatic campaign.

But what if the beer manufacturer teamed up with a big online ticket seller and arranged a data subscription for “all viewers or buyers of a Garth Brooks ticket in the last 180 days”? Those are exactly the people I would want to target, and they are unavailable anywhere in the third-party data ecosystem.

The data is also of extremely high provenance, and I would also be able to use that data in my own environment, where I could model it against my first-party data, such as site visitors or mobile IDs I gathered when I sponsored free Wi-Fi at the last Country Music Awards. The ability to gather and license those specific data sets and use them for modeling in a data lake is going to create massive outcomes in my addressable campaigns and give me an edge I cannot get using traditional ad network approaches with third-party segments.
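The enrichment step described above can be sketched with simple set logic. This is a hedged toy example, not a production pipeline: the IDs are invented, and in practice the match would run on hashed emails or device IDs inside a policy-managed environment rather than raw identifiers.

```python
# First-party IDs: site visitors and event Wi-Fi sign-ups (sparse)
first_party = {"id_003", "id_007", "id_011"}

# Second-party subscription licensed from the ticket seller, e.g.
# "viewers or buyers of a Garth Brooks ticket in the last 180 days"
second_party = {"id_002", "id_003", "id_007", "id_042", "id_099"}

# The overlap seeds a lookalike/scoring model: known fans we also observed
seed = first_party & second_party

# The union is the deepened lake: a wider addressable pool for activation
addressable = first_party | second_party

print(len(seed), len(addressable))  # 2 6
```

Even at this toy scale, the licensed segment triples the first-party seed's reach; at real scale, that overlap is what makes user scoring and lookalike modeling statistically worthwhile.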

Moreover, the flexibility around data capture enables marketers to use highly disparate data sets, combine and normalize them with metadata – and not have to worry about mapping them to a predefined schema. The associative work happens after the query takes place. That means I don’t need a predefined schema in place for that data to become valuable – a way of saying that the inherent observational bias in traditional approaches (“country music fans love mainstream beer, so I’d better capture that”) never hinders the ability to activate against unforeseen insights.
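The "associative work happens after the query" idea can be made concrete with a late-binding sketch. This is a hedged illustration only: the records, field names and alias map are all hypothetical, standing in for whatever metadata dictionary a real lake would maintain.

```python
# Disparate records stored exactly as collected, with no shared schema
records = [
    {"email_hash": "a1", "genre_pref": "country"},    # email marketing system
    {"uid": "a1", "page": "/tickets/garth-brooks"},   # web analytics
    {"device_id": "a1", "venue": "CMA_wifi"},         # event Wi-Fi capture
]

# Metadata applied at query time, not at load time: which keys can act as an ID
id_aliases = ("email_hash", "uid", "device_id")

def user_id(record):
    """Resolve whichever identifier this record happens to carry."""
    for key in id_aliases:
        if key in record:
            return record[key]
    return None

# The associative work happens here, only once the question is asked
country_fans = {
    user_id(r)
    for r in records
    if r.get("genre_pref") == "country" or "garth" in r.get("page", "")
}
print(country_fans)  # {'a1'}
```

Because no schema was fixed up front, the unforeseen signal (a ticket-page visit) joins the expected one (a stated genre preference) without any remapping of the stored data.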

Large, sophisticated marketers and publishers are just starting to get their lakes built and begin gathering the data assets to deepen them, so we will likely see a great many examples of this approach over the coming months.

It’s a great time to be a data-driven marketer.

Follow Chris O’Hara (@chrisohara) and AdExchanger (@adexchanger) on Twitter.



  1. Chris as usual you are teaching me something with every post. Question for you is: how is 2nd party data brokered or shared between DMPS? What is the relationship between the country music data owner and beer owner? Do they get each other’s permissions and sell it to each other, or does the DMP operate a coop model of sorts?

  2. Questions for Chris O’Hara, given his experience with Krux: How does such a peer relationship get structured and continue beyond a single campaign or project? Is it being done through prior existing relationships (someone at firm A has the way to reach the right person at Peer firm B), arranged through an intermediary, set up as an ongoing agreement with rules and expectations for each party, or something else? How long does it take to get the data sharing rolled out? It would seem that this should be implemented as an operating process in order to work best, with assigned staff responsibilities and timelines, like any enterprise program. It seems impossible to do right if someone comes in with the idea the day or even a few days before a campaign is due to start.

  3. Great comments. This idea is very nascent, and we haven’t had enough scale in the industry to see broad results. Two clients on the same DMP can easily share, as long as the provider provisions the data through a policy-managed framework that offers both transparency and control — and the agreements must specify the exact usage of the data. We are also starting to see co-op models take shape, where many data owners can share more broadly. Henry makes a good point about needing the right operational framework…much of the early testing in this area seems manual at this point. There’s a ton to learn.

  4. Hi Chris, thanks for sharing the great insights. I, too, am keen on the idea of 2P data because it balances the scale issue of 1P data while delivering better precision, and thus better performance, than 3P data. I have seen a couple of other players in the industry launch data co-ops, mostly in the retail and CPG verticals, but none of them seem to have gotten off the ground yet. You mentioned transparency and control. To take it further, I am curious how the data ownership question will be resolved, and how to address concerns about, for example, conquest targeting. Lastly, how do you see the challenges for a DSP building out the 2P data mart itself, compared to a DMP assuming that role? Thanks

  5. It always makes me grin when I see adtech using the term second-party data. It popped up a few years ago but generally hasn’t been used much in the last few years.

    The reason the term deserves a grin is that “third-party data,” starting with BlueKai and eXelate segments in 2010 or before, was always a race to the bottom. “Second-party data” basically means “third-party data that is real and/or doesn’t stink.” In 2010, marketers weren’t sophisticated enough to appreciate (sufficiently) the quality of data. Third-party data brokers wanted to extend their reach, so they were liberal with category definitions, and they would model out observed behavior and extend it to other cookies based on sometimes little more than observed GeoIP and offline ZIP code data.

    The term first/second/third-party makes no sense in its grammatical analogy, but “second” was as unclaimed as the 8 train is on the NYC subway or the Black Line is on the Chicago L Train system. Not bad for re-branding.

    Ultimately, second-party data is a no-baggage term for real user data that you obtain from elsewhere. This originally was most appropriate for ad networks or other adtech intermediaries. As an example of Chris’s explanation in his reply above, Criteo’s and Adobe’s deterministic cross-device data graphs, built from their customers’ data, are concrete instances. (The industry’s awareness of deterministic versus probabilistic matching methods in the device graph brings the second-/third-party distinction to light.)

    Chris’s other point in his response — that DMPs can facilitate these arrangements (though this arrangement may be in its infancy) — is, at a high level, a request for the adtech ecosystem to please prioritize quality in data collection and data distribution.